Post-Doctoral Research Visit F/M Multimodal entity representation and disambiguation



October 9, 2022


2022-05351 - Post-Doctoral Research Visit F/M Multimodal entity representation and disambiguation

Contract type : Fixed-term contract

Level of qualifications required : PhD or equivalent

Function : Post-Doctoral Research Visit

About the research centre or Inria department

The Inria Rennes - Bretagne Atlantique Centre is one of Inria's eight centres and hosts more than thirty research teams. The centre is a major, recognized player in the field of digital sciences. It is at the heart of a rich R&D and innovation ecosystem: highly innovative SMEs, large industrial groups, competitiveness clusters, research and higher-education institutions, laboratories of excellence, a technological research institute, etc.



Exploiting multimedia content often relies on correctly identifying entities in text and images. A major difficulty in understanding multimedia content lies in its ambiguity with regard to actual user needs, for instance when identifying an entity from a given textual mention or matching a visual object to a query expressed in language.

The MEERQAT project addresses the problem of analyzing ambiguous visual and textual content by learning and combining their representations and by taking into account the existing knowledge about entities. It aims at solving the Multimedia Question Answering (MQA) task, which requires answering a textual question associated with a visual input such as an image, given a knowledge base (KB) containing millions of unique entities and associated text.

The candidate will be hired by CEA (Palaiseau, near Paris, France) for an 18-month post-doc. A stay of 6 months at Inria (Rennes, France) is planned during this period, provided that the health context allows it. The additional costs resulting from this stay will be covered by CEA.

The salary depends on qualifications and experience.

The postdoc will have access to large supercomputers equipped with multiple GPUs and large storage for experiments, in addition to a professional laptop.


The post-doc specifically addresses the problem of representing multimodal entities at large scale in order to disambiguate them. Other partners of the project work on visual, textual and KB representations, as well as on question answering based on the three modalities.

Main activities


We consider entities such as a person, a place, an object or an organization (NGO, company, etc.). Entities can be represented in different modalities, in particular by visual and textual content. However, a given mention of an entity is often ambiguous. For example, the mention "Paris" refers not only to the capital of France (and a dozen other cities in the world), but also to the model Paris Hilton and to the Greek hero of the Trojan War. Visual content linked to the mention can greatly help disambiguation, although the visual content itself carries its own ambiguities. We also consider a third type of information, namely links between entities within a knowledge base. Solving the Multimedia Question Answering task requires all three modalities.

The postdoctoral associate will work on the representation of entities described by several modalities, with a particular emphasis on using visual data to help search for and link entities. The goal is not only to disambiguate one modality using another [ROS18, KAM21], but also to jointly disambiguate both by representing them in a common space. Most state-of-the-art representations of visual and textual content rely on neural models. There also exist embeddings that reflect the links in a knowledge base [WAN17]. Many works address cross-modal tasks between two of these modalities, relying on such representations projected into a common space so as to minimize a loss corresponding to the task of interest, such as visual question answering (VQA) [MAL14, ANT15, BEN17, SHA19] or zero-shot learning [LEC19, SKO21]. Other approaches identify attributes in the visual content through a pre-trained model, then query a knowledge base to map them to the textual modality and learn a knowledge-based VQA model [WU16, WAN17b]. Such approaches have been extended to include structural facts that link the attributes [WAN18] and common-sense knowledge [MAR21, WU21]. Other works address VQA involving some knowledge about named entities, although still limited to the sole type of persons [SHA19b]. These last approaches require quite structured knowledge, but others allow more general sources of knowledge, including free-form text found on the Web [MAR19]. For more specific use cases, it is also possible to create an ad-hoc knowledge base [GAR20].
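As a concrete illustration of the common-space idea above, the sketch below ranks candidate entities for an ambiguous mention by cosine similarity between a fused query embedding (text plus image) and per-entity embeddings. All vectors, names and dimensions here are hypothetical toy values; a real system would use learned neural encoders (e.g. in PyTorch) rather than hand-picked vectors.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so that a dot product equals cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def disambiguate(query_vec, entity_vecs, entity_names):
    """Rank candidate entities by cosine similarity to the query embedding."""
    scores = l2_normalize(entity_vecs) @ l2_normalize(query_vec)
    order = np.argsort(-scores)
    return [(entity_names[i], float(scores[i])) for i in order]

# Hand-picked toy embeddings in a 4-d "common space" (illustration only).
entities = ["Paris (city)", "Paris Hilton", "Paris (Trojan hero)"]
entity_vecs = np.eye(3, 4)

# Mention and image embeddings that both lean towards the city sense:
text_vec = np.array([0.9, 0.3, 0.1, 0.0])
image_vec = np.array([0.8, 0.1, 0.2, 0.1])
# Naive multimodal fusion: sum of the normalized per-modality embeddings.
query = l2_normalize(text_vec) + l2_normalize(image_vec)

ranking = disambiguate(query, entity_vecs, entities)
```

Here the visual cue reinforces the textual one, so the city sense of "Paris" comes out on top; with a different image embedding the same mention could resolve to another entity.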

However, to tackle the MQA task of interest in the MEERQAT project, one must address these issues at large scale, with a high level of ambiguity requiring fine-grained reasoning about entities. Depending on the type of entity, the information to take into account in its representation is not obvious. A person may be associated with just a couple of mentions and images, but the situation becomes more complex for other types of entities. For instance, a company may be associated with its logo, but also with its main products or even its managers (CEO, CTO, etc.). In the same vein, a location may be represented by many pictures, and a city by its landmark buildings or places.

We aim at determining the appropriate information to include in the representation of a given entity. Hence, in a common space, an entity can be represented by several vectors, which need to be combined into a unique representation that reflects its similarity to related entities. In this context, a promising approach consists of learning a visual representation from natural-language supervision [RAD21] on large datasets, using a simple learning strategy based on contrastive predictive coding [OOR18] adapted to the textual and visual modalities [ZHA20]. The learned representation makes it possible to address multiple cross-modal tasks and provides a large-scale vocabulary suited to a general audience in a given language. It exhibits state-of-the-art performance on several tasks and can even exceed human performance on some of them. However, it does not include any structural information from a knowledge base, which is crucial for visual reasoning.
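The contrastive strategy mentioned above can be sketched as a symmetric InfoNCE loss over a batch of paired image/text embeddings, in the spirit of [OOR18, RAD21, ZHA20]. The NumPy version below is a minimal illustration with placeholder embeddings; a real implementation would compute these embeddings with trained encoders in PyTorch or TensorFlow and backpropagate through the loss.

```python
import numpy as np

def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: the i-th image and i-th text form the positive pair;
    all other pairings in the batch serve as negatives."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    diag = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return (loss_i2t + loss_t2i) / 2

# Correctly paired embeddings yield a low loss; mismatched pairs a higher one.
rng = np.random.default_rng(42)
txt = rng.normal(size=(8, 16))
img_aligned = txt + 0.01 * rng.normal(size=(8, 16))
img_shuffled = img_aligned[::-1]
```

Minimizing this loss pulls matched image/text pairs together in the common space while pushing apart the mismatched pairs in the batch, which is exactly the property exploited when combining an entity's multiple vectors into one representation.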

ANT15 Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C. L.; and Parikh, D. VQA: Visual Question Answering. In Proc. ICCV, 2015.

BEN17 Ben-Younes, H.; Cadene, R.; Cord, M.; and Thome, N. MUTAN: Multimodal tucker fusion for visual question answering. In Proc. ICCV, 2017.

CHE20 Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations, In Proc. ICML, 2020.

GAR20 Garcia, N.; Otani, M.; Chu, C.; Nakashima, Y. KnowIT VQA: Answering Knowledge-Based Questions about Videos. In Proc. AAAI, 2020.

KAM21 Kamath, A.; Singh, M.; LeCun, Y.; Misra, I.; Synnaeve, G.; Carion, N. MDETR - Modulated Detection for End-to-End Multi-Modal Understanding. arXiv preprint arXiv:2104.12763, 2021.

LEC19 Le Cacheux, Y.; Le Borgne, H.; Crucianu, M. Modeling Inter and Intra-Class Relations in the Triplet Loss for Zero-Shot Learning. In Proc. ICCV, 2019.

MAL14 Malinowski, M.; Fritz, M. A multi-world approach to question answering about real-world scenes based on uncertain input. In Proc. NIPS, 2014.

MAR19 Marino, K.; Rastegari, M.; Farhadi, A.; Mottaghi, R. OK-VQA: A visual question answering benchmark requiring external knowledge. In Proc. CVPR, 2019.

MAR21 Marino, K.; Chen, X.; Parikh, D.; Gupta, A.; Rohrbach, M. KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA. In Proc. CVPR, 2021.

OOR18 Oord, A. v. d.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv:1807.03748, Jul 2018.

RAD21 Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; Krueger, G.; Sutskever, I. Learning Transferable Visual Models From Natural Language Supervision. arXiv preprint arXiv:2103.00020, 2021.

ROS18 Rosenfeld, A.; Biparva, M.; Tsotsos, J. K. Priming Neural Networks. In Proc. CVPR, 2018.

SHA19 Shah, M.; Chen, X.; Rohrbach, M.; Parikh, D. Cycle-consistency for robust visual question answering. In Proc. CVPR, 2019.

SHA19b Shah, S.; Mishra, A.; Yadati, N.; Talukdar, P. P. KVQA: Knowledge-aware visual question answering. In Proc. AAAI, 2019.

SKO21 Skorokhodov, I.; Elhoseiny, M. Class Normalization for (Continual)? Generalized Zero-Shot Learning. arXiv preprint arXiv:2006.11328, 2021.

WAN17 Wang, Q.; Mao, Z.; Wang, B.; Guo, L. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering, 29(12):2724–2743, 2017.

WAN17b Wang, P.; Wu, Q.; Shen, C.; Dick, A.; van den Hengel, A. Explicit knowledge-based reasoning for visual question answering. In Proc. IJCAI, 2017.

WAN18 Wang, P.; Wu, Q.; Shen, C.; Dick, A.; and van den Hengel, A. FVQA: Fact-based visual question answering. IEEE Trans. PAMI 40(10):2413–2427, 2018.

WU16 Wu, Q.; Wang, P.; Shen, C.; Dick, A.; van den Hengel, A. Ask me anything: Free-form visual question answering based on knowledge from external sources. In Proc. CVPR, 2016.

WU21 Wu, J.; Lu, J.; Sabharwal, A.; Mottaghi, R. Multi-Modal Answer Validation for Knowledge-Based VQA. arXiv preprint arXiv:2103.12248, 2021.

ZHA20 Zhang, Y.; Jiang, H.; Miura, Y.; Manning, C. D.; Langlotz, C. P. Contrastive learning of medical visual representations from paired images and text. arXiv preprint arXiv:2010.00747, 2020.


PhD in Computer Vision, Machine Learning, Natural Language Processing or other relevant fields

Strong publication record, with accepted articles in top-tier conferences and journals of the domain

Solid programming skills (PyTorch/TensorFlow). Publicly available projects will be appreciated

Ability to communicate and collaborate

Experience using GPUs on a supercomputer (e.g. with SLURM or a similar tool) will be appreciated

Benefits package
  • Subsidized meals
  • Partial reimbursement of public transport costs
  • Possibility of teleworking (90 days per year) and flexible organization of working hours
  • Partial payment of insurance costs
  • Remuneration

    Monthly gross salary amounting to 2746 euros

    General Information
  • Theme/Domain : Vision, perception and multimedia interpretation
  • Town/city : Palaiseau
  • Inria Center : CRI Rennes - Bretagne Atlantique
  • Starting date : 2022-11-01
  • Duration of contract : 1 year, 6 months
  • Deadline to apply : 2022-10-09
  • Contacts
  • Inria Team : LINKMEDIA
  • Recruiter : Amsaleg Laurent
  • The keys to success

    Description of the job: Details at https://www. content/uploads/2021/06/meerqatpostdocceainria.pdf

    About Inria

    Inria is the French national research institute dedicated to digital science and technology. It employs 2,600 people. Its 200 agile project teams, generally run jointly with academic partners, include more than 3,500 scientists and engineers working to meet the challenges of digital technology, often at the interface with other disciplines. The Institute also employs numerous talents in over forty different professions. 900 research support staff contribute to the preparation and development of scientific and entrepreneurial projects that have a worldwide impact.

    Instruction to apply

    Please submit online: your resume, cover letter and, if applicable, letters of recommendation

    Defence Security : This position is likely to be situated in a restricted area (ZRR), as defined in Decree No. 2011-1425 relating to the protection of national scientific and technical potential (PPST). Authorisation to enter an area is granted by the director of the unit, following a favourable Ministerial decision, as defined in the decree of 3 July 2012 relating to the PPST. An unfavourable Ministerial decision in respect of a position situated in a ZRR would result in the cancellation of the appointment.

    Recruitment Policy : As part of its diversity policy, all Inria positions are accessible to people with disabilities.

    Warning : you must enter your e-mail address in order to save your application to Inria. Applications must be submitted online on the Inria website. Processing of applications sent from other channels is not guaranteed.