Natural Language Processing

Sapienza NLP

The Sapienza Natural Language Processing Group (Sapienza NLP), led by Prof. Roberto Navigli, includes a large team of Ph.D. students and researchers who are part of the Computer, Control and Management Engineering Department (DIAG).

Thiresia - PRIN 2022

Tiresias, the Theban soothsayer who was blinded for revealing the secrets of the gods, is met in the underworld by Odysseus; unlike the other specters, he keeps his memories of the past and, thanks to this, can help the Greek hero find his way home.

GLADIA

GLADIA is a team of computer scientists, physicists, engineers and mathematicians venturing beyond the boundaries of machine intelligence. Our aim is to understand the mathematics and dynamics of intelligence, and to slash the time required to create new models from months to mere seconds. From language to audio, from latent geometry to model merging, we seek architectures and methods that make AI faster, more accessible, more creative, and more powerful.

XL-AMR: Enabling Cross-Lingual AMR Parsing with Transfer Learning Techniques

Abstract Meaning Representation (AMR) is a popular formalism of natural language that represents the meaning of a sentence as a semantic graph. Because it is agnostic about how meanings are derived from strings, it lends itself well to encoding semantics across languages. However, cross-lingual AMR parsing is a hard task: training data are scarce in languages other than English, and existing English AMR parsers are not directly suited to a cross-lingual setting.
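As an illustration of the formalism (an example from the AMR literature, not from this paper), the sentence "The boy wants to go" is represented as a graph whose nodes are concepts and whose edges carry semantic roles. A minimal sketch in Python, with hypothetical helper names, using a triple-style view of the graph:

```python
# Illustrative AMR for "The boy wants to go", encoded as concept nodes
# plus role-labelled edges (a common triple-style view of the graph).
amr_nodes = {"w": "want-01", "b": "boy", "g": "go-02"}
amr_edges = [
    ("w", ":ARG0", "b"),  # the wanter is the boy
    ("w", ":ARG1", "g"),  # what is wanted is the going
    ("g", ":ARG0", "b"),  # the goer is the same boy (re-entrancy)
]

def roles_of(node, edges):
    """Return the (role, target) pairs leaving a node."""
    return [(r, t) for s, r, t in edges if s == node]
```

Note how the variable `b` is reused as the agent of both `want-01` and `go-02`: this re-entrancy is what makes AMR a graph rather than a tree, and it holds regardless of which language the sentence is written in.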

Just “OneSeC” for producing multilingual Sense-Annotated Data

The well-known problem of knowledge acquisition is one of the biggest issues in Word Sense Disambiguation (WSD), where annotated data are still scarce in English and almost absent in other languages. In this paper we formulate the assumption of One Sense per Wikipedia Category and present OneSeC, a language-independent method for the automatic extraction of hundreds of thousands of sentences in which a target word is tagged with its meaning.
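The assumption can be sketched in a few lines (hypothetical data and sense identifiers, not the paper's actual pipeline): every sentence drawn from a given Wikipedia category tags the target word with the single sense associated with that category.

```python
# Toy sketch of "One Sense per Wikipedia Category": the category-to-sense
# mapping and the sense identifiers below are hypothetical.
category_sense = {
    "Category:Financial_institutions": "bank%finance",
    "Category:Rivers": "bank%geography",
}

def tag_sentences(sentences, category, target, mapping):
    """Tag every sentence containing `target` with the category's one sense."""
    sense = mapping[category]
    return [(s, target, sense) for s in sentences if target in s.lower()]

tagged = tag_sentences(
    ["The bank approved the loan.", "It rained all day."],
    "Category:Financial_institutions", "bank", category_sense)
```

Because the tagging step only needs category membership and a target word, the same procedure applies unchanged to any language with a Wikipedia, which is what makes the method language-independent.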

MuLaN: Multilingual Label propagatioN for Word Sense Disambiguation

The knowledge acquisition bottleneck strongly affects the creation of multilingual sense-annotated data, hence limiting the power of supervised systems when applied to multilingual Word Sense Disambiguation. In this paper, we propose a semi-supervised approach based upon a novel label propagation scheme, which, by jointly leveraging contextualized word embeddings and the multilingual information enclosed in a knowledge base, projects sense labels from a high-resource language, i.e., English, to lower-resourced ones.
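A toy sketch of the label-propagation idea, under the simplifying assumption that projecting a label means copying it from the most similar labeled occurrence in embedding space (all vectors, sense identifiers, and the threshold below are hypothetical, not the paper's actual scheme):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def propagate(labeled, unlabeled, threshold=0.5):
    """Copy the sense label of the most similar labeled occurrence onto
    each unlabeled occurrence; below the threshold, leave it unlabeled."""
    out = []
    for vec in unlabeled:
        best_sense, best_sim = None, threshold
        for lvec, sense in labeled:
            sim = cosine(vec, lvec)
            if sim > best_sim:
                best_sense, best_sim = sense, sim
        out.append(best_sense)
    return out

# English occurrences with known senses; target-language occurrences without.
labeled = [([1.0, 0.0], "bank%finance"), ([0.0, 1.0], "bank%geography")]
labels = propagate(labeled, [[0.9, 0.1], [0.1, 0.8]])
```

The key design point the sketch preserves is that the embedding space is shared across languages, so similarity between an English occurrence and, say, an Italian one is meaningful and the English sense label can cross the language boundary.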

The Knowledge Acquisition Bottleneck Problem in Multilingual Word Sense Disambiguation

Word Sense Disambiguation (WSD) is the task of identifying the meaning of a word in a given context. It lies at the base of Natural Language Processing as it provides semantic information for words. In the last decade, great strides have been made in this field and much effort has been devoted to mitigating the knowledge acquisition bottleneck problem, i.e., the problem of semantically annotating texts at a large scale and in different languages. This issue is ubiquitous in WSD as it hinders the creation of both multilingual knowledge bases and manually-curated training sets.

CluBERT: A Cluster-Based Approach for Learning Sense Distributions in Multiple Languages

Knowing the Most Frequent Sense (MFS) of a word has been shown to help Word Sense Disambiguation (WSD) models significantly. However, the scarcity of sense-annotated data makes it difficult to induce a reliable and high-coverage distribution of the meanings in a language vocabulary. To address this issue, in this paper we present CluBERT, an automatic and multilingual approach for inducing the distributions of word senses from a corpus of raw sentences.
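Reduced to a toy: once clusters of occurrences have been tied to senses, the sense distribution of a lemma is the normalized count of those assignments, and its argmax is the MFS. A minimal sketch with hypothetical sense identifiers (the clustering itself, which in the paper relies on contextualized embeddings, is assumed to have already happened):

```python
from collections import Counter

def sense_distribution(cluster_senses):
    """Normalize per-occurrence sense assignments into a distribution
    and return it together with the Most Frequent Sense."""
    counts = Counter(cluster_senses)
    total = sum(counts.values())
    dist = {s: c / total for s, c in counts.items()}
    mfs = counts.most_common(1)[0][0]
    return dist, mfs

# Hypothetical: each clustered occurrence of "bank" was labeled with a sense.
dist, mfs = sense_distribution(
    ["bank%finance", "bank%finance", "bank%geography", "bank%finance"])
```

Because the input is just raw sentences grouped by meaning, the same counting step yields a distribution (not only the single MFS), which a downstream WSD model can use as a prior.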

CSI: a Coarse Sense Inventory for 85% Word Sense Disambiguation

Word Sense Disambiguation (WSD) is the task of associating a word in context with one of its meanings. While many works in the past have focused on raising the state of the art, none has come close to achieving an F-score in the 80% ballpark when using WordNet as its sense inventory. We contend that one of the main reasons for this failure is the excessively fine granularity of this inventory, resulting in senses that are hard to differentiate, even for an experienced human annotator.
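To illustrate what coarsening the inventory buys (the sense keys and coarse labels below are made up for illustration, not CSI's actual mapping): several fine-grained WordNet-style senses collapse into one coarse label, so a prediction that confuses two close fine senses still counts as correct at the coarse level.

```python
# Hypothetical fine-to-coarse mapping: multiple fine-grained,
# WordNet-style senses collapse into a single coarse label.
fine_to_coarse = {
    "bank#n#1": "FINANCE",    # the financial institution
    "bank#n#2": "FINANCE",    # the building housing it
    "bank#n#3": "GEOGRAPHY",  # sloping land beside water
}

def coarse_match(predicted, gold, mapping):
    """Two fine senses count as a match if they share a coarse label."""
    return mapping[predicted] == mapping[gold]

# Confusing two financial senses is no longer an error at the coarse level.
ok = coarse_match("bank#n#1", "bank#n#2", fine_to_coarse)
```

Scoring against coarse labels removes exactly the distinctions that even human annotators struggle with, which is how a coarse inventory can push WSD accuracy past the ceiling observed with fine-grained WordNet senses.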

© Università degli Studi di Roma "La Sapienza" - Piazzale Aldo Moro 5, 00185 Roma