Find-2-Find: Multitask Learning for Anaphora Resolution and Object Localization

Cennet Oguz; Pascal Denis; Emmanuel Vincent; Simon Ostermann; Josef van Genabith

Conference paper, Year: 2023

Cennet Oguz

Role: Author

Deutsches Forschungszentrum für Künstliche Intelligenz GmbH = German Research Center for Artificial Intelligence

Pascal Denis

Role: Author
PersonId: 1744
IdHAL: pascal-denis
IdRef: 031934684

Machine Learning in Information Networks

Emmanuel Vincent

Role: Author
PersonId: 1256
IdHAL: emmanuelv
ORCID: 0000-0002-0183-7289
IdRef: 089360176

Speech Modeling for Facilitating Oral-Based Communication

Simon Ostermann

Role: Author
PersonId: 1299801

Deutsches Forschungszentrum für Künstliche Intelligenz GmbH = German Research Center for Artificial Intelligence

Josef van Genabith

Role: Author
PersonId: 1262350

Deutsches Forschungszentrum für Künstliche Intelligenz GmbH = German Research Center for Artificial Intelligence

Abstract

In multimodal understanding tasks, visual and linguistic ambiguities can arise. Visual ambiguity can occur when visual objects require a model to ground a referring expression in a video without strong supervision, while linguistic ambiguity can occur from changes in entities in action flows. As an example from the cooking domain, "oil" mixed with "salt" and "pepper" could later be referred to as a "mixture". Without a clear visual-linguistic alignment, we cannot know which among several objects shown is referred to by the language expression "mixture", and without resolved antecedents, we cannot pinpoint what the mixture is. We define this chicken-and-egg problem as visual-linguistic ambiguity. In this paper, we present Find2Find, a joint anaphora resolution and object localization dataset targeting the problem of visual-linguistic ambiguity, consisting of 500 anaphora-annotated recipes with corresponding videos. We present experimental results of a novel end-to-end joint multitask learning framework for Find2Find that fuses visual and textual information and shows improvements both for anaphora resolution and object localization as compared to a strong single-task baseline.

Domains

Computer Science [cs]

Main file

emnlp23impress.pdf (4.64 MB)

Origin: Files produced by the author(s)

https://hal.science/hal-04259861

Submitted on: Thursday, October 26, 2023, 12:02:11

Last modified on: Friday, May 31, 2024, 18:32:03

Long-term archiving on: Saturday, January 27, 2024, 18:55:28

Dates and versions

hal-04259861, version 1 (26-10-2023)

Identifiers

HAL Id: hal-04259861, version 1

Cite

Cennet Oguz, Pascal Denis, Emmanuel Vincent, Simon Ostermann, Josef van Genabith. Find-2-Find: Multitask Learning for Anaphora Resolution and Object Localization. 2023 Conference on Empirical Methods in Natural Language Processing, Dec 2023, Singapore, Singapore. ⟨hal-04259861⟩

Collections

CNRS INRIA CRISTAL UNIV-LORRAINE INRIA2 CRISTAL-MAGNET LORIA LORIA-NLPKD UNIV-LILLE

124 Views

74 Downloads
