2022-05351 - Post-Doctoral Research Visit F/M Multimodal entity
representation and disambiguation
Contract type : Fixed-term contract
Level of qualifications required : PhD or equivalent
Function : Post-Doctoral Research Visit
About the research centre or Inria department
The Inria Rennes - Bretagne Atlantique Centre is one of Inria's eight centres
and has more than thirty research teams. The Inria Center is a major and
recognized player in the field of digital sciences. It is at the heart of a
rich R&D and innovation ecosystem: highly innovative SMEs, large industrial
groups, competitiveness clusters, research and higher education players,
laboratories of excellence, technological research institutes, etc.
Exploiting multimedia content often relies on the correct identification of
entities in text and images. A major difficulty in understanding multimedia
content lies in its ambiguity with regard to actual user needs, for
instance when identifying an entity from a given textual mention or matching a
visual object to a query expressed through language.
The MEERQAT (https://www.meerqat.fr) project addresses the problem of
analyzing ambiguous visual and textual content by learning and combining their
representations and by taking into account the existing knowledge about
entities. It aims at solving the Multimedia Question Answering (MQA) task,
which requires answering a textual question associated with a visual input
like an image, given a knowledge base (KB) containing millions of unique
entities and associated text.
The candidate will be hired by CEA (Palaiseau, near Paris, France) for an
18-month post-doc. A stay of 6 months at INRIA (Rennes, France) is planned
during this period, provided that the health context allows it. The additional
costs resulting from this stay will be covered by the CEA.
The salary depends on qualifications and experience.
The postdoc will have access to large supercomputers equipped with multiple
GPUs and large storage for experiments, in addition to a professional laptop.
The post-doc specifically addresses the problem of representing multimodal
entities at large scale to disambiguate them. Other partners of the project
work on the visual, textual and KB representation, as well as on question
answering based on the three modalities.
We consider entities such as a person, a place, an object or an organization
(NGO, company...). Entities can be represented by different modalities, in
particular by visual and textual content. However, a given mention of this
entity is often ambiguous. For example, the mention «Paris» refers not only to
the city in France (and a dozen other cities in the world), but also to
the celebrity Paris Hilton and the Greek hero of the Trojan War. An additional
visual content linked to the mention can greatly help to disambiguate,
although the visual content itself carries other ambiguities. We also consider
a third type of information, namely links between entities within a knowledge
base. Solving the Multimedia Question Answering task requires all three of
these modalities.
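As an illustration of how an image can resolve an ambiguous mention (a toy sketch, not the project's prescribed method; all embeddings and the fusion weight below are hypothetical values):

```python
import numpy as np

def normalize(v):
    """L2-normalize along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def disambiguate(mention_vec, image_vec, entity_vecs, alpha=0.5):
    """Rank candidate entities by cosine similarity to a fused
    mention+image query; alpha weighs the textual modality."""
    query = normalize(alpha * normalize(mention_vec)
                      + (1.0 - alpha) * normalize(image_vec))
    scores = normalize(entity_vecs) @ query
    return int(np.argmax(scores)), scores

# Hypothetical 4-d embeddings for three "Paris" candidates:
# 0: Paris (city), 1: Paris Hilton, 2: Paris (Trojan hero)
entities = np.array([[1.0, 0.1, 0.0, 0.0],
                     [0.1, 1.0, 0.0, 0.0],
                     [0.0, 0.1, 1.0, 0.0]])
mention = np.array([0.5, 0.6, 0.5, 0.0])  # textual mention "Paris": ambiguous
image   = np.array([0.9, 0.0, 0.1, 0.0])  # e.g. a photo of the Eiffel Tower

text_only, _ = disambiguate(mention, image, entities, alpha=1.0)  # text alone picks entity 1
fused, _     = disambiguate(mention, image, entities, alpha=0.5)  # adding the image picks entity 0
```

With the text alone the ambiguous mention leans toward the wrong entity; fusing in the visual evidence flips the decision to the city, which is exactly the disambiguation effect described above.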
The postdoctoral associate will work on the representation of entities
described by several modalities, with a particular emphasis on the use of
visual data to help in the search and linking of entities. The goal is not
only to disambiguate one modality by using another [ROS18, KAM21], but also to
jointly disambiguate both by representing them in a common space. Most
state-of-the-art representations of visual and textual content rely on neural
models. There also exist embeddings that reflect the links in a knowledge base
[WAN17]. Many works address cross-modal tasks between two of these modalities,
relying on such representations projected into a common space in order to
minimize a loss corresponding to the task of interest, such as visual question
answering (VQA) [MAL14, ANT15, BEN17, SHA19] or zero-shot learning [LEC19,
SKO21]. Other approaches identify attributes in the visual content through a
pre-trained model, then query a knowledge base to map them to the textual
modality and learn a knowledge-based VQA model [WU16, WAN17b]. Such approaches
have been extended to include structural facts that link the attributes
[WAN18] and common-sense knowledge [MAR21, WU21]. Other works address VQA
involving some knowledge of named entities, although still limited to the sole
type of persons [SHA19b]. These last approaches require quite structured
knowledge, but others allow more general sources of knowledge, including
free-form text found on the Web [MAR19]. For more specific use cases, it is
also possible to create an ad-hoc knowledge base [GAR20].
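To make "minimizing a loss over a common space" concrete, here is a minimal sketch of a triplet loss of the kind used in [LEC19]; the margin and the toy vectors are illustrative assumptions:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss pushing the anchor closer to the positive than to the
    negative by at least `margin` (squared Euclidean distances)."""
    d_pos = float(np.sum((anchor - positive) ** 2))
    d_neg = float(np.sum((anchor - negative) ** 2))
    return max(0.0, d_pos - d_neg + margin)

# Anchor: an image embedding; positive: the matching text embedding;
# negative: the text of a different entity (all toy values).
anchor   = np.array([0.0, 0.0])
positive = np.array([0.0, 0.1])
negative = np.array([1.0, 0.0])

well_separated = triplet_loss(anchor, positive, negative)  # margin satisfied
violating      = triplet_loss(anchor, negative, positive)  # penalized
```

Summed over cross-modal triplets, gradients of such a loss shape the shared space so that matching image/text pairs end up closer than mismatched ones.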
However, to tackle the MQA task of interest in the MEERQAT project, one must
address these issues at large scale, with a high level of ambiguity requiring
fine reasoning on the entities. Depending on the type of entity, the
information to take into account in its representation is not obvious. A
person may be associated with just a couple of mentions and images, but the
situation becomes more complex for other types of entities. For instance, a
company may be associated with its logo, but also with its main products or
even its managers (CEO, CTO...). In the same vein, a location may be
represented by many pictures, and a city by landmark buildings or places.
We aim at determining the appropriate information to include in the
representation of a given entity. Hence, in a common space, an entity can be
represented by several vectors that need to be combined into a unique
representation reflecting the similarity to related entities. In such a
context, a promising approach consists of learning a visual representation
from natural language supervision [RAD21], relying on large datasets and a
simple learning strategy based on contrastive predictive coding [OOR18],
adapted to text and visual modalities [ZHA20]. The learned representation
makes it possible to address multiple cross-modal tasks and provides a
large-scale vocabulary adapted to a general audience in a given language. It
exhibits state-of-the-art performance on several tasks and can even exceed
humans on certain ones. However, it does not include any structural
information from a knowledge base, which is crucial for visual reasoning.
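As a rough sketch of the symmetric contrastive objective behind [RAD21, ZHA20] (the temperature value and the toy batch are assumptions, not the project's specification):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def symmetric_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE: in a batch of aligned (image, text)
    pairs, each image must retrieve its own caption and vice versa."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # pairwise cosine similarities
    diag = np.arange(logits.shape[0])
    p_i2t = softmax(logits, axis=1)[diag, diag]  # image -> text matching prob.
    p_t2i = softmax(logits, axis=0)[diag, diag]  # text -> image matching prob.
    return float(-(np.log(p_i2t) + np.log(p_t2i)).mean() / 2)

# Perfectly aligned toy batch: loss is close to zero.
aligned = np.eye(4)
loss_aligned = symmetric_contrastive_loss(aligned, aligned)
# Shuffling the captions breaks the alignment and raises the loss.
loss_shuffled = symmetric_contrastive_loss(aligned, aligned[[1, 0, 3, 2]])
```

Trained at scale over image/caption pairs, this objective yields the cross-modal common space discussed above; what it does not capture, as noted, is the structural information of a knowledge base.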
ANT15 Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C.
L.; and Parikh, D. VQA: Visual Question Answering. In Proc. ICCV, 2015.
BEN17 Ben-Younes, H.; Cadene, R.; Cord, M.; and Thome, N. MUTAN:
Multimodal tucker fusion for visual question answering. In Proc. ICCV, 2017.
CHE20 Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework
for Contrastive Learning of Visual Representations, In Proc. ICML, 2020.
GAR20 Garcia, N.; Otani, M.; Chu, C.; Nakashima, Y. KnowIT VQA: Answering
Knowledge-Based Questions about Videos. In Proc. AAAI, 2020.
KAM21 Kamath, A.; Singh, M.; LeCun, Y.; Misra, I.; Synnaeve, G.; Carion,
N. MDETR - Modulated Detection for End-to-End Multi-Modal Understanding. arXiv
preprint arXiv:2104.12763, 2021.
LEC19 Le Cacheux, Y.; Le Borgne, H.; Crucianu, M. Modeling Inter and
Intra-Class Relations in the Triplet Loss for Zero-Shot Learning. In Proc.
ICCV, 2019.
MAL14 Malinowski, M.; Fritz, M. A multi-world approach to question
answering about real-world scenes based on uncertain input. In Proc. NIPS,
2014.
MAR19 Marino, K.; Rastegari, M.; Farhadi, A.; Mottaghi, R. OK-VQA: A
visual question answering benchmark requiring external knowledge. In Proc.
CVPR, 2019.
MAR21 Marino, K.; Chen, X.; Parikh, D.; Gupta, A.; Rohrbach, M. KRISP:
Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based
VQA. In Proc. CVPR, 2021.
OOR18 Oord, A. v. d.; Li, Y.; Vinyals, O. Representation learning with
contrastive predictive coding. arXiv:1807.03748, Jul 2018.
RAD21 Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal,
S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; Krueger, G.; Sutskever, I.
Learning Transferable Visual Models From Natural Language Supervision. arXiv
preprint arXiv:2103.00020, Feb 2021.
ROS18 Rosenfeld, A.; Biparva, M.; Tsotsos, J. K. Priming Neural Networks.
In Proc. CVPR, 2018.
SHA19 Shah, M.; Chen, X.; Rohrbach, M.; Parikh, D. Cycle-consistency for
robust visual question answering. In Proc. CVPR, 2019.
SHA19b Shah, S.; Mishra, A.; Yadati, N.; Talukdar, P. P. KVQA:
Knowledge-aware visual question answering. In Proc. AAAI, 2019.
SKO21 Skorokhodov, I.; Elhoseiny, M. Class Normalization for (Continual)?
Generalized Zero-Shot Learning. arXiv:2006.11328, 2021.
WAN17 Wang, Q.; Mao, Z.; Wang, B.; Guo, L. Knowledge graph embedding: A
survey of approaches and applications. IEEE Transactions on Knowledge and Data
Engineering, 29(12):2724–2743, 2017.
WAN17b Wang, P.; Wu, Q.; Shen, C.; Dick, A.; van den Hengel, A. Explicit
knowledge-based reasoning for visual question answering. In Proc. IJCAI, 2017.
WAN18 Wang, P.; Wu, Q.; Shen, C.; Dick, A.; van den Hengel, A. FVQA:
Fact-based visual question answering. IEEE Trans. PAMI, 40(10):2413–2427,
2018.
WU16 Wu, Q.; Wang, P.; Shen, C.; Dick, A.; van den Hengel, A. Ask me
anything: Free-form visual question answering based on knowledge from external
sources. In Proc. CVPR, 2016.
WU21 Wu, J.; Lu, J.; Sabharwal, A.; Mottaghi, R. Multi-Modal Answer
Validation for Knowledge-Based VQA. arXiv preprint arXiv:2103.12248, 2021.
ZHA20 Zhang, Y.; Jiang, H.; Miura, Y.; Manning, C. D.; Langlotz, C. P.
Contrastive learning of medical visual representations from paired images and
text. arXiv preprint arXiv:2010.00747, 2020.
PhD in Computer Vision, Machine Learning, Natural Language Processing or
another related field
Strong publication record, with accepted articles in top-tier conferences and
journals of the domain
Solid programming skills (PyTorch/TensorFlow). Publicly available projects
will be appreciated
Ability to communicate and collaborate
Experience using GPUs on a supercomputer (e.g. with SLURM or a similar
tool) will be appreciated
Partial reimbursement of public transport costs
Possibility of teleworking (90 days per year) and flexible organization
of working hours
Partial payment of insurance costs
Monthly gross salary amounting to 2746 euros
Theme/Domain : Vision, perception and multimedia interpretation
Town/city : Palaiseau
Inria Center : CRI Rennes - Bretagne Atlantique
Starting date : 2022-11-01
Duration of contract : 1 year, 6 months
Deadline to apply : 2022-10-09
Inria Team : LINKMEDIA
Amsaleg Laurent / Laurent.Amsaleg@irisa.fr
The keys to success
Description of the job:
Details at https://www.meerqat.fr/wp-
Inria is the French national research institute dedicated to digital science
and technology. It employs 2,600 people. Its 200 agile project teams,
generally run jointly with academic partners, include more than 3,500
scientists and engineers working to meet the challenges of digital technology,
often at the interface with other disciplines. The Institute also employs
numerous talents in over forty different professions. 900 research support
staff contribute to the preparation and development of scientific and
entrepreneurial projects that have a worldwide impact.
Instruction to apply
Please submit online : your resume, cover letter and letters of recommendation
Defence Security :
This position is likely to be situated in a restricted area (ZRR), as
defined in Decree No. 2011-1425 relating to the protection of national
scientific and technical potential (PPST). Authorisation to enter an area is
granted by the director of the unit, following a favourable Ministerial
decision, as defined in the decree of 3 July 2012 relating to the PPST. An
unfavourable Ministerial decision in respect of a position situated in a ZRR
would result in the cancellation of the appointment.
Recruitment Policy :
As part of its diversity policy, all Inria positions are accessible to people
with disabilities.
Warning : you must enter your e-mail address in order to save your
application to Inria. Applications must be submitted online on the Inria
website. Processing of applications sent through other channels is not
guaranteed.