Noisy scene graph with self-supervised learning on graph neural network for visual question answering task

Grant number: 22/09849-8
Support Opportunities: Scholarships abroad - Research Internship - Master's degree
Effective date (Start): October 01, 2022
Effective date (End): March 31, 2023
Field of knowledge: Physical Sciences and Mathematics - Computer Science
Principal Investigator: Gerberth Adín Ramírez Rivera
Grantee: Bruno César de Oliveira Souza
Supervisor: Michael Christian Kampffmeyer
Host Institution: Instituto de Computação (IC). Universidade Estadual de Campinas (UNICAMP). Campinas, SP, Brazil
Research place: University of Oslo (UiO), Norway
Associated to the scholarship: 20/14452-4 - Visual question answering task with graph convolution networks, BP.MS


Visual question answering (VQA) is an attractive multimodal research field that aims to answer a free-form question about an image. Its appeal lies in combining two fields that are typically approached separately: computer vision (CV) and natural language processing (NLP). To achieve a comprehensive semantic alignment between the two fields, some researchers focus on identifying biases in the data, while others focus on pre-trained visual-language models, such as UNITER, or on neural network modules. However, recent works have broadened the range of research by using scene graphs (SG) for the VQA task. An SG provides a graph representation of the image, containing information about the objects and their possible relations. This representation can be more advantageous than the typical extracted object features, since it carries information about relationships and allows for greater interpretability. Although SG is closely related to VQA, SG-VQA research remains relatively under-explored. Sporadic attempts in SG-VQA mostly propose attention mechanisms designed primarily for fully connected graphs, thereby failing to model and capture the important structural information of the SG. Previous works proposed pre-trained image-question architectures for use with scene graphs and evaluated various scene graph generation techniques for unseen images. However, these works are usually limited to pre-trained visual-language models, such as attention-based models. These models learn through large-scale pre-training on joint image-text datasets and are normally used to extract a cross-modal contextualized embedding for a given image and question. In other words, recent works do not include in their analyses models designed to be applied directly to graphs, such as graph neural networks (GNN).
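As an illustration of the scene graph representation described above, the following minimal sketch stores an SG as (subject, predicate, object) triples and builds an adjacency view from them. The triples, the function names `build_scene_graph` and `relations_of`, and the data layout are hypothetical and chosen only for illustration; the project's actual SG format is not specified here.

```python
from collections import defaultdict

# Hypothetical toy scene graph: (subject, predicate, object) triples.
TRIPLES = [
    ("man", "riding", "horse"),
    ("man", "wearing", "hat"),
    ("horse", "on", "grass"),
]


def build_scene_graph(triples):
    """Return (nodes, adjacency); adjacency maps a subject to (predicate, object) pairs."""
    # Every object that appears as a subject or an object becomes a node.
    nodes = sorted({name for s, _, o in triples for name in (s, o)})
    adjacency = defaultdict(list)
    for subject, predicate, obj in triples:
        adjacency[subject].append((predicate, obj))
    return nodes, dict(adjacency)


def relations_of(adjacency, subject):
    """List the outgoing relations of one object (empty if it has none)."""
    return adjacency.get(subject, [])


nodes, adjacency = build_scene_graph(TRIPLES)
print(nodes)
print(relations_of(adjacency, "man"))
```

A question such as "What is the man riding?" can then be read as a lookup over the relations of the node "man", which is the kind of interpretability the graph form makes possible.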
In this work, we explore the use of SG for solving the VQA task with models that handle graph-based representations through message passing techniques, such as GNN. A GNN is designed to perform inference on data described by graphs. The intuition is that, to better understand the role of SG in improving VQA, it is necessary to examine the behavior of models designed for that kind of representation. The state of the art in SG-VQA is reached when the SG of each image is obtained manually, using the ground truth scene graph (GTSG). This project aims to extend that setting to automatically generated scene graphs, since annotated scene graphs are impractical in the real world. Moreover, we intend to leverage question-guided generation, which may lead the SG generation to present a particular distribution related to the type of question grounded on a given image. Without the GTSG, we propose to apply self-supervised learning (SSL) with contrastive learning as the pretext task. SSL may improve representation learning through supervisory signals derived from unlabeled data, and contrastive learning aims at maximizing the agreement of representations between similar graph instances while minimizing the agreement between dissimilar graph instances. Therefore, SSL for GNN could improve the graph embedding representation by maximizing the mutual information (MI) between augmented views generated from the same SG while minimizing the MI between the graph embeddings of different SGs. So, unlike previous works, we examine the behavior of GNN-based models with noisy scene graphs generated in the context of the VQA task. We extend the approach to scene graphs generated from raw images using a pre-trained scene graph generator (SGG), which is more general and practical. 
In addition, we intend to jointly train the model in an SSL manner so that it learns a better representation of the SG while also learning the VQA task, that is, correctly answering the question. (AU)
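The message passing and contrastive ideas above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the mean-aggregation step, the edge-dropping augmentation, the InfoNCE-style loss, the temperature value, and all names (`mean_message_passing`, `graph_embedding`, `drop_edges`, `info_nce`) are hypothetical choices, not the architecture or loss used in the project.

```python
import math
import random


def mean_message_passing(features, edges):
    """One round: each node averages its own vector with its neighbours' vectors."""
    neighbours = {node: [node] for node in features}
    for u, v in edges:  # edges are treated as undirected in this sketch
        neighbours[u].append(v)
        neighbours[v].append(u)
    dim = len(next(iter(features.values())))
    return {
        node: [sum(features[n][d] for n in nbrs) / len(nbrs) for d in range(dim)]
        for node, nbrs in neighbours.items()
    }


def graph_embedding(features, edges):
    """Mean-pool the node states obtained after one message-passing round."""
    states = mean_message_passing(features, edges)
    dim = len(next(iter(states.values())))
    return [sum(s[d] for s in states.values()) / len(states) for d in range(dim)]


def drop_edges(edges, drop_prob, rng):
    """Assumed augmentation: keep each edge with probability 1 - drop_prob."""
    return [edge for edge in edges if rng.random() >= drop_prob]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def info_nce(anchor, positive, negatives, temperature=0.5):
    """InfoNCE-style loss: low when anchor matches positive and differs from negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


rng = random.Random(0)
# Hypothetical node features for two toy scene graphs.
features = {"man": [1.0, 0.2], "horse": [0.8, 0.1], "grass": [0.1, 0.9]}
edges = [("man", "horse"), ("horse", "grass")]
other_features = {"cup": [0.1, 1.0], "table": [0.2, 0.7]}
other_edges = [("cup", "table")]

# Two augmented views of the same SG form the positive pair; another SG is the negative.
view_a = graph_embedding(features, drop_edges(edges, 0.3, rng))
view_b = graph_embedding(features, drop_edges(edges, 0.3, rng))
negative = graph_embedding(other_features, other_edges)
loss = info_nce(view_a, view_b, [negative])
print(loss)
```

Minimizing such a loss pulls together the embeddings of two views of the same scene graph and pushes apart the embeddings of different scene graphs, which is the agreement objective described above.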
