This project aims to annotate text in different languages with a layer of "universal" semantic role labeling annotation. For this purpose, we use the frame and role labels of the English Proposition Bank to label shallow semantics in sentences in new target languages.

News (2022/04/29): Introduced Universal Proposition Bank 2.0 (UP2.0)

News (2022/01/01): UP1.0 freeze!

News (2019/10/01): Two domain-specific PropBanks released (Contract, Finance)!

News (2017/02/10): Initial version of Italian UP released!

News (2017/01/31): Initial versions of Finnish, Portuguese and Spanish UP released!

Release

This is release v2.0 of the Universal Proposition Banks (UP) with significant enhancements over v1.0, including:

  • (1) Higher-quality propbanks, obtained by using a state-of-the-art monolingual SRL system and improved auto-generation of annotations;
  • (2) Expanded language coverage (from 7 to 23 languages);
  • (3) Span annotation for the decoupling of syntactic analysis; and
  • (4) Gold data for a subset of the languages.

v2.0 is built upon release 2.9 of the Universal Dependencies treebanks. We use the frame and role labels from version 3.0 of the English Proposition Bank.

Introduction

This project aims to annotate text in different languages with a layer of “universal” semantic role labeling annotation. For this purpose, we use the frame and role labels of the English Proposition Bank to label shallow semantics in sentences in new target languages.

For instance, consider the following sentences from different languages:

  • Sentence: Ich hatte Gelegenheit eines seiner Seminare zu besuchen. (I had the opportunity to attend one of his seminars.)
  • In CoNLL-U-Plus format, it looks like this, with English PropBank labels in the last three columns:
ID FORM LEMMA UPOS XPOS FEAT HEAD DEPREL UP:PREDS UP:ARGHEADS UP:ARGSPANS
1 Ich ich PRON PPER _ 2 nsubj _ _ _
2 hatte haben VERB VAFIN _ 0 root have.03 A0:1|A1:3 A0:1-1|A1:3-3
3 Gelegenheit Gelegenheit NOUN NN _ 2 obj _ _ _
4 eines ein DET PIS _ 6 det _ _ _
5 seiner sein DET PPOSAT _ 6 det:poss _ _ _
6 Seminare Seminar NOUN NN _ 8 obj _ _ _
7 zu zu PART PTKZU _ 8 mark _ _ _
8 besuchen besuchen VERB VVINF _ 3 xcomp attend.01 A0:1|A1:4 A0:1-1|A1:4-6
9 . . PUNCT . _ 2 punct _ _ _

The German verbs

  • ‘hatte’ is labeled as evoking the ‘have.03’ frame with two roles: “Ich” (I) is labeled A0 (owner) and “Gelegenheit” (opportunity) is labeled A1 (possession).
  • ‘besuchen’ is labeled as evoking the ‘attend.01’ frame with two roles: “Ich” (I) is labeled A0 (thing attending) and “eines seiner Seminare” (one of his seminars) is labeled A1 (thing attended).
  • Sentence: Elle lutte pour échapper aux tueurs à ses trousses. (She struggles to escape the killers chasing her.)
  • In CoNLL-U-Plus format, it looks like this, with English PropBank labels in the last three columns:
ID FORM LEMMA UPOS XPOS FEAT HEAD DEPREL UP:PREDS UP:ARGHEADS UP:ARGSPANS
1 Elle il PRON _ _ 2 nsubj _ _ _
2 lutte lutter VERB _ _ 0 root struggle.02 A0:1|A1:4 A0:1-1|A1:3-10
3 pour pour ADP _ _ 4 mark _ _ _
4 échapper échapper VERB _ _ 2 advcl escape.01 A0:1|A1:7 A0:1-1|A1:6-7
5-6 aux _ _ _ _ _ _ _ _ _
5 à à ADP _ _ 7 case _ _ _
6 les le DET _ _ 7 det _ _ _
7 tueurs tueur NOUN _ _ 4 obl:arg _ _ _
8 à à ADP _ _ 10 case _ _ _
9 ses son DET _ _ 10 det _ _ _
10 trousses trousses NOUN _ _ 4 obl:mod _ _ _
11 . . PUNCT _ _ 2 punct _ _ _

The French verbs

  • 'lutte' is labeled as evoking the 'struggle.02' frame with two roles: "Elle" (she) is labeled A0 (entity trying) and "pour échapper aux tueurs à ses trousses" (to escape the killers on her trail) is labeled A1 (predicative action).
  • 'échapper' is labeled as evoking the 'escape.01' frame with two roles: "Elle" (she) is labeled A0 (entity escaping) and "les tueurs" (the killers) is labeled A1 (place or thing escaped).
  • Sentence: कुशीनगर की सीमा में प्रवेश करते ही भव्‍य प्रवेशद्वार आपका स्वागत करता है । (The grand entrance welcomes you as you enter the limits of Kushinagar.)
  • In CoNLL-U-Plus format, it looks like this, with English PropBank labels in the last three columns:
ID FORM LEMMA UPOS XPOS FEAT HEAD DEPREL UP:PREDS UP:ARGHEADS UP:ARGSPANS
1 कुशीनगर कुशीनगर PROPN NNP _ 3 nmod _ _ _
2 की का ADP PSP _ 1 case _ _ _
3 सीमा सीमा NOUN NN _ 6 obl _ _ _
4 में में ADP PSP _ 3 case _ _ _
5 प्रवेश प्रवेश NOUN NN _ 6 compound _ _ _
6 करते कर VERB VM _ 12 advcl enter.01 A1:3 A1:1-3
7 ही ही PART RP _ 6 dep _ _ _
8 भव्‍य भव् ADJ JJ _ 9 amod _ _ _
9 प्रवेशद्वार प्रवेशद्वार NOUN NN _ 12 nsubj _ _ _
10 आपका आप PRON PRP _ 11 nmod _ _ _
11 स्वागत स्वागत NOUN NN _ 12 compound _ _ _
12 करता कर VERB VM _ 0 root _ _ _
13 है है AUX VAUX _ 12 aux _ _ _
14 | | PUNCT SYM _ 12 punct _ _ _

The Hindi verbs

  • 'करते' is labeled as evoking the 'enter.01' frame with one role: "कुशीनगर की सीमा" (the Kushinagar border) is labeled A1 (place or thing entered).
  • Sentence: 他花費了許多時間來比較加拿大地質調查局博物館中的恐龍化石。 (He spent a lot of time comparing dinosaur fossils in the Geological Survey of Canada museum.)
  • In CoNLL-U-Plus format, it looks like this, with English PropBank labels in the last three columns:
ID FORM LEMMA UPOS XPOS FEAT HEAD DEPREL UP:PREDS UP:ARGHEADS UP:ARGSPANS
1 他 他 PRON PRP _ 7 nsubj _ _ _
2 花費 花費 VERB VV _ 7 advcl spend.02 A0:1|A1:5|A2:7 A0:1-1|A1:4-5|A2:7-17
3 了 了 AUX AS _ 2 aux _ _ _
4 許多 許多 NUM CD _ 5 nummod _ _ _
5 時間 時間 NOUN NN _ 2 obj _ _ _
6 來 來 ADV RB _ 7 mark _ _ _
7 比較 比較 VERB VV _ 0 root compare.01 A0:1 A0:1-1
8 加拿大 加拿大 PROPN NNP _ 13 nmod _ _ _
9 地質 地質 NOUN NN _ 13 nmod _ _ _
10 調查 調查 VERB VV _ 11 compound _ _ _
11 局 局 PART SFN _ 13 nmod _ _ _
12 博物 博物 NOUN NN _ 13 compound _ _ _
13 館 館 PART SFN _ 17 nmod _ _ _
14 中 中 NOUN NN _ 13 acl _ _ _
15 的 的 PART DEC _ 13 case _ _ _
16 恐龍 恐龍 NOUN NN _ 17 nmod _ _ _
17 化石 化石 NOUN NN _ 7 obj _ _ _
18 。 。 PUNCT . _ 7 punct _ _ _

The Chinese verbs

  • '花費' is labeled as evoking the 'spend.02' frame with roles: "他" (He) is labeled A0 (bider, waiter), "許多時間" (a lot of time) is labeled A1 (unit of time), "比較加拿大地質調查局博物館中的恐龍化石" (Comparing dinosaur fossils in the Geological Survey of Canada Museum) is labeled A2 (activity).
  • '比較' is labeled as evoking the 'compare.01' frame with one role: "他" (He) is labeled A0 (entity making comparison).

Using this data, we can create SRL systems that predict English PropBank labels for many different languages. See a demo screencast of this SRL for English, French and German here.

Format

Data

The universal propbank (UP) for each language consists of three files (training, dev, and test data) with the extension .conllup.

The conllup (CoNLL-U Plus) format adds user-defined columns to the original 10 columns of the CoNLL-U format (from UD). Our data consists of four columns: the original ID column, plus three additional columns UP:PRED, UP:ARGHEADS, and UP:ARGSPANS.

  • ID (column 1) is the token id consistent with corresponding UD sentence.
  • UP:PRED (column 11) contains the predicate sense label for this predicate. The sense determines the roleset-specific meaning of each of its arguments, as defined in the English PropBank.
  • UP:ARGHEADS (column 12) contains the argument heads for the arguments of this predicate. Each argument is in the format label:token_id. Arguments are separated by the pipe character (|).
  • UP:ARGSPANS (column 13) contains the argument spans for the arguments of this predicate. Each argument is in the format label:start_token_id-end_token_id. Arguments are separated by the pipe character (|).
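
As a minimal sketch of how these columns can be read (this is not part of the released tooling), the snippet below parses the argument fields and resolves each span to its surface text, using the rows of the German example from the Introduction. The function names are hypothetical, rows are split on whitespace only for readability (actual .conllup files are assumed to be tab-separated), and the three UP columns are taken to be the last three fields of each row.

```python
from typing import List, Tuple

# Rows copied from the German example in the Introduction.
ROWS = """\
1 Ich ich PRON PPER _ 2 nsubj _ _ _
2 hatte haben VERB VAFIN _ 0 root have.03 A0:1|A1:3 A0:1-1|A1:3-3
3 Gelegenheit Gelegenheit NOUN NN _ 2 obj _ _ _
4 eines ein DET PIS _ 6 det _ _ _
5 seiner sein DET PPOSAT _ 6 det:poss _ _ _
6 Seminare Seminar NOUN NN _ 8 obj _ _ _
7 zu zu PART PTKZU _ 8 mark _ _ _
8 besuchen besuchen VERB VVINF _ 3 xcomp attend.01 A0:1|A1:4 A0:1-1|A1:4-6
9 . . PUNCT . _ 2 punct _ _ _
"""


def parse_heads(field: str) -> List[Tuple[str, int]]:
    """Parse 'A0:1|A1:3' into [('A0', 1), ('A1', 3)]; '_' means no arguments."""
    if field == "_":
        return []
    heads = []
    for item in field.split("|"):
        label, token_id = item.rsplit(":", 1)
        heads.append((label, int(token_id)))
    return heads


def parse_spans(field: str) -> List[Tuple[str, int, int]]:
    """Parse 'A0:1-1|A1:4-6' into [('A0', 1, 1), ('A1', 4, 6)]."""
    if field == "_":
        return []
    spans = []
    for item in field.split("|"):
        label, span = item.rsplit(":", 1)
        start, end = span.split("-")
        spans.append((label, int(start), int(end)))
    return spans


def frames(rows: str) -> List[Tuple[str, str, List[Tuple[str, str]]]]:
    """Return (predicate form, sense, [(role, argument text), ...]) per predicate."""
    forms = {}       # token id -> surface form
    predicates = []  # (form, sense, parsed spans)
    for line in rows.splitlines():
        cols = line.split()
        # Skip blank lines, comments, and multi-word token ranges such as "5-6".
        if not cols or not cols[0].isdigit():
            continue
        forms[int(cols[0])] = cols[1]
        if cols[-3] != "_":
            predicates.append((cols[1], cols[-3], parse_spans(cols[-1])))
    result = []
    for form, sense, spans in predicates:
        args = [(label, " ".join(forms[i] for i in range(start, end + 1)))
                for label, start, end in spans]
        result.append((form, sense, args))
    return result


for form, sense, args in frames(ROWS):
    print(form, sense, args)
```

For the German example this yields have.03 with A0 "Ich" and A1 "Gelegenheit", and attend.01 with A0 "Ich" and A1 "eines seiner Seminare", matching the description in the Introduction. Arguments are kept as lists rather than dictionaries so that a role label occurring more than once for the same predicate is not lost.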

Repository

A UP release contains treebanks of the corresponding UD release. Since UP is automatically generated silver data, we also release hand-annotated English SRL labels for a subset of the languages so that the research community can fairly evaluate multilingual and cross-lingual SRL systems. To differentiate gold from UP data, we use the following conventions:

  • UP_<language>-<corpus>
  • GOLD_<language>-<corpus>
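
A small, hypothetical helper can tell the two kinds of data apart when iterating over a checkout. It assumes the conventions above are used as folder names; the sample names in the usage lines are illustrative only and are not taken from the repository listing.

```python
import re
from typing import NamedTuple, Optional


class Treebank(NamedTuple):
    kind: str       # "UP" (automatically generated silver data) or "GOLD"
    language: str
    corpus: str


# Matches UP_<language>-<corpus> and GOLD_<language>-<corpus>.
_NAME = re.compile(r"^(UP|GOLD)_([^-]+)-(.+)$")


def parse_folder_name(name: str) -> Optional[Treebank]:
    """Return the parsed parts of a folder name, or None if it does not match."""
    match = _NAME.match(name)
    return Treebank(*match.groups()) if match else None


print(parse_folder_name("UP_German-GSD"))
print(parse_folder_name("GOLD_Polish-TrOntonotes"))
print(parse_folder_name("README.md"))
```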

In addition, each language has a folder with verb overview files (produced from the frame files) in HTML format. These files can be viewed in a browser and give an overview of all English frames that each target language verb can evoke.

Script

We provide a Python script to combine such a UP file with its corresponding UD file to produce the desired 13-column .conllup file. The script is available in the tools repository: up2/merge_ud_up.py. It takes three arguments:

  • input_ud - input UD_file or input UD_folder

  • input_up - input UP_file or input UP_folder

  • output - output folder for combined output

Below is a sample execution with UD_folder and UP_folder - all the corresponding files from both folders (including subfolders) will be processed during one execution.

python3 up2/merge_ud_up.py --input_ud=./tests/data/ud/hi/ --input_up=./tests/data/up/hi/ --output=./tests/data/ud-up/hi/

An analogous execution with a specific UD file and UP file is shown below:

python3 up2/merge_ud_up.py --input_ud=./tests/data/ud/hi/hi_hdtb-ud-dev.conllu --input_up=./tests/data/up/hi/hi_hdtb-up-dev.conllu --output=./tests/data/ud-up/hi/hi_hdtb-ud-up-dev.conllu
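
Conceptually, the merge is a join on the token ID within each sentence. The sketch below illustrates that idea only; it is not the code of merge_ud_up.py. It assumes tab-separated rows, a UP row layout of ID followed by the three UP columns, and "_" for the DEPS and MISC columns of the sample UD rows (the example in the Introduction omits them). The function name is hypothetical.

```python
from typing import Dict, List


def merge_sentence(ud_lines: List[str], up_lines: List[str]) -> List[str]:
    """Append the three UP columns to each UD token row, matching on ID."""
    up_by_id: Dict[str, List[str]] = {}
    for line in up_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.split("\t")
        up_by_id[cols[0]] = cols[1:4]

    merged = []
    for line in ud_lines:
        if line.startswith("#") or not line.strip():
            merged.append(line)  # keep comments and blank lines unchanged
            continue
        cols = line.split("\t")
        # Tokens without UP annotation get three placeholder columns.
        extra = up_by_id.get(cols[0], ["_", "_", "_"])
        merged.append("\t".join(cols + extra))
    return merged


# First two tokens of the German example, as 10-column UD rows.
ud = [
    "1\tIch\tich\tPRON\tPPER\t_\t2\tnsubj\t_\t_",
    "2\thatte\thaben\tVERB\tVAFIN\t_\t0\troot\t_\t_",
]
up = [
    "1\t_\t_\t_",
    "2\thave.03\tA0:1|A1:3\tA0:1-1|A1:3-3",
]

for row in merge_sentence(ud, up):
    print(row)
```

Each merged row has 13 tab-separated fields, with the predicate sense, argument heads, and argument spans in the last three.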

Scope

Our current focus is to annotate all target language verbs with appropriate English frames. This means that the scope of frame-evoking elements is currently limited to verbs. We also do not label target language auxiliary verbs. For each universal propbank, about 90% of all verbs are currently labeled. Unlabeled verbs often convey semantics for which we could not find an appropriate English verb, or are part of complex verb constructions that we do not currently handle.

A note on quality

This is an ongoing research project in which we use a combination of data-driven methods and some post-processing to generate these resources. As a result, most labels in the UPs are predicted by models trained on a different domain, which affects quality. A good example is the German verb “angeben”: in our source data it was mostly used in the “brag.01” sense, whereas in the German UD data it is mostly used in the “report.01” sense, which is almost never detected. We provide language-specific observations in the respective README files.

Known Usages

  • (1) Foundation for Expanded Shallow Semantic Parsing (ESSP), a major differentiating advanced NLP primitive in Watson NLP, an embeddable NLP library used widely within IBM products and solutions;
  • (2) Powers multiple IBM products and solutions such as Watson Discovery.

Current and future work

This is an ongoing project which we are improving along these lines:

  • (1) We are working on adding new languages to the current release.
  • (2) We are working to curate the data to improve the quality of SRL annotation.
  • (3) We are looking into extending the scope of frame-evoking-elements to other types of predicates besides verbs.

Citing UP in papers

If you use UP in your work, please cite these papers:

@InProceedings{jindal-EtAl:2022:LREC,
  author    = {Jindal, Ishan  and  Rademaker, Alexandre  and  Ulewicz, Micha{\l} and  Linh, Ha  and  Nguyen, Huyen  and  Tran, Khoi-Nguyen  and  Zhu, Huaiyu  and  Li, Yunyao},
  title     = {Universal Proposition Bank 2.0},
  booktitle      = {Proceedings of the Language Resources and Evaluation Conference},
  month          = {June},
  year           = {2022},
  address        = {Marseille, France},
  publisher      = {European Language Resources Association},
  pages     = {1700--1711},
  url       = {https://aclanthology.org/2022.lrec-1.181}
}
@inproceedings{akbik-etal-2015-generating,
    title = "Generating High Quality Proposition {B}anks for Multilingual Semantic Role Labeling",
    author = "Akbik, Alan  and
      Chiticariu, Laura  and
      Danilevsky, Marina  and
      Li, Yunyao  and
      Vaithyanathan, Shivakumar  and
      Zhu, Huaiyu",
    booktitle = "Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = jul,
    year = "2015",
    address = "Beijing, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P15-1039",
    doi = "10.3115/v1/P15-1039",
    pages = "397--407",
}

Publications

PriMeSRL-Eval: A Practical Quality Metric for Semantic Role Labeling Systems Evaluation. Ishan Jindal, Alexandre Rademaker, Khoi-Nguyen Tran, Huaiyu Zhu, Hiroshi Kanayama, Marina Danilevsky, Yunyao Li. In EACL Findings 2023. EACL:Findings 2023.

Label Definitions Improve Semantic Role Labeling. Li Zhang, Ishan Jindal, Yunyao Li. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. NAACL 2022.

Universal Proposition Bank 2.0. Ishan Jindal, Alexandre Rademaker, Michał Ulewicz, Linh Ha, Huyen Nguyen, Khoi-Nguyen Tran, Huaiyu Zhu and Yunyao Li. In Proceedings of the Language Resources and Evaluation Conference. LREC 2022.

Learning Explainable Linguistic Expressions with Neural Inductive Logic Programming for Sentence Classification. Prithviraj Sen, Marina Danilevsky, Yunyao Li, Siddhartha Brahma, Matthias Boehm, Laura Chiticariu and Rajasekar Krishnamurthy. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) EMNLP 2020.

A Novel Workflow for Accurately and Efficiently Crowdsourcing Predicate Senses and Argument Labels. Youxuan Jiang, Huaiyu Zhu, Jonathan K. Kummerfeld, Yunyao Li and Walter Lasecki. 2020 Conference on Empirical Methods in Natural Language Processing: Findings (EMNLP) EMNLP:Findings 2020.

CLAR: A Cross-Lingual Argument Regularizer for Semantic Role Labeling. Ishan Jindal, Yunyao Li, Siddhartha Brahma and Huaiyu Zhu. 2020 Conference on Empirical Methods in Natural Language Processing: Findings (EMNLP) EMNLP:Findings 2020.

HEIDL: Learning Linguistic Expressions with Deep Learning and Human-in-the-Loop. Prithviraj Sen, Yunyao Li, Eser Kandogan, Yiwei Yang and Walter Lasecki. 2019 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations ACL:System Demonstration 2019.

Crowd-in-the-Loop: A Hybrid Approach for Annotating Semantic Roles. Chenguang Wang, Alan Akbik, Laura Chiticariu, Yunyao Li, Fei Xia and Anbang Xu. 2017 Conference on Empirical Methods in Natural Language Processing EMNLP 2017.

Active Learning for Black-Box Semantic Role Labeling with Neural Factors. Chenguang Wang, Laura Chiticariu and Yunyao Li. 2017 International Joint Conference on Artificial Intelligence IJCAI 2017.

Multilingual Aliasing for Auto-Generating Proposition Banks. Alan Akbik, Xinyu Guan and Yunyao Li. 26th International Conference on Computational Linguistics COLING 2016.

K-SRL: Instance-based Learning for Semantic Role Labeling. Alan Akbik and Yunyao Li. 26th International Conference on Computational Linguistics COLING 2016.

Multilingual Information Extraction with PolyglotIE. Alan Akbik, Laura Chiticariu, Marina Danilevsky, Yonas Kbrom, Yunyao Li and Huaiyu Zhu. 26th International Conference on Computational Linguistics COLING 2016.

Towards Semi-Automatic Generation of Proposition Banks for Low-Resource Languages. Alan Akbik, Vishwajeet Kumar and Yunyao Li. 2016 Conference on Empirical Methods in Natural Language Processing EMNLP 2016.

Polyglot: Multilingual Semantic Role Labeling with Unified Labels. Alan Akbik and Yunyao Li. 54th Annual Meeting of the Association for Computational Linguistics ACL 2016.

Generating High Quality Proposition Banks for Multilingual Semantic Role Labeling. Alan Akbik, Laura Chiticariu, Marina Danilevsky, Yunyao Li, Shivakumar Vaithyanathan and Huaiyu Zhu. 53rd Annual Meeting of the Association for Computational Linguistics ACL 2015.

Preprints

Improved Semantic Role Labeling using Parameterized Neighborhood Memory Adaptation. Ishan Jindal, Ranit Aharonov, Siddhartha Brahma, Huaiyu Zhu and Yunyao Li. arXiv preprint arXiv:2011.14459

People

Contact

For questions, comments, and feedback, click on the Feedback link at the top of this page and create a separate Git issue for each comment.

Core Team

Languages

  • Currently, UP data is available for 23 languages.
ID Language Corpus Notes
cs Czech CAC, CLTT, FicTree, PDT  
de German GSD, HDT  
el Greek GDT  
es Spanish GSD, AnCora  
fi Finnish TDT, FTB  
fr French GSD, Rhapsodie, Sequoia  
hi Hindi HDTB  
hu Hungarian Szeged  
id Indonesian GSD  
it Italian ISDT, ParTUT, PoSTWITA, TWITTIRO, VIT  
ja Japanese GSD  
ko Korean GSD, Kaist  
mr Marathi UFAL  
nl Dutch Alpino, LassySmall  
pl Polish LFG, PDB  
pt Portuguese Bosque, GSD  
ro Romanian Nonstandard, RRT, SiMoNERo  
ru Russian GSD, Taiga, SynTagRus  
ta Tamil TTB  
te Telugu MTG  
uk Ukrainian IU  
vi Vietnamese VTB  
zh Chinese GSD  

Gold Data

  • Gold data is available for 3 languages.
ID Language Corpus Notes
pt Portuguese Bosque Not yet publicly available
pl Polish TrOntonotes Available
vi Vietnamese Tatoeba Available