ICDDA - 2024

Title: Visual Question Answering with Dual Attention and Question Categorization

Speaker: Dr. Prithwijit Guha, IIT Guwahati

ABSTRACT:

Visual Question Answering (VQA) is a challenging multi-modal Artificial Intelligence task involving computer vision, natural language processing and commonsense reasoning. It has potential applications in several human-computer interaction tasks, including assisting visually impaired individuals and AI-based personal assistants. It is also considered an AI-complete task. This work aims to enhance the performance of VQA models by addressing two challenges: cross-domain interaction and reasoning over a large answer space. It proposes a VQA model consisting of a Dual Attention mechanism and a Question Categorizer. The dual attention mechanism allows the VQA model to obtain an improved cross-domain (image and text) semantic representation. The question categorizer identifies the question type in order to reduce the answer space. Combined, the two components improve on, or perform competitively with, state-of-the-art models on two VQA datasets.

BIO:

Dr. Prithwijit Guha is an Associate Professor in the Department of Electronics and Electrical Engineering, IIT Guwahati. He is also an Associated Faculty member of the Centre for Linguistic Science and Technology (CLST) and the Centre for Intelligent Cyber Physical Systems (CICPS) at IIT Guwahati. He received his B.E. in Electrical Engineering from Jadavpur University, followed by an M.Tech. in Signal Processing and a Ph.D. in Electrical Engineering from IIT Kanpur. He was a Team Leader of the Computer Vision group at the TCS Innovation Labs, New Delhi (2010-2012). His research interests are in the broad areas of Computer Vision, Machine Learning and Signal Processing, with emphasis on Broadcast Analytics, Video Analytics and Joint Vision-Language Tasks.