ELRA releases
The ELRA Catalogue of Language Resources offers a repository of Language Resources (LRs) made available through ELRA.
An increasing number of LRs in the various fields of Human Language Technology (HLT; see image on the left-hand side) are distributed on behalf of ELRA via its operational body ELDA, thanks to contributions from various players in the HLT community.
Through this repository, we aim to help researchers and developers identify and access existing Language Resources, so that they need not invest effort in rebuilding resources that already exist.
Other resources identified, but not available through ELRA, can be viewed in the Universal Catalogue.
If you have suggestions or comments, or need further details about ELRA and its Catalogue of Language Resources, please refer to the Contact Us section.
ELRA-L0100 : French dictionary of definitions (SYNAPSE)
The French dictionary of definitions (SYNAPSE) consists of 216,835 entries (147,378 nouns, 80,552 adjectives, 24,001 verbs, 4,677 adverbs, 1,560 prefixes, 107 prepositions, 614 interjections, 147 pronouns, 42 conjunctions, 27 articles), 309,078 definitions and 7,395 phraseological units (phrases). Grammatical information for each entry consists of grammatical category, gender, number and inflected forms. This dictionary is provided in XML format together with its DTD.
ELRA-W0124 : English-Vietnamese Parallel Corpus
This is a corpus of 500,000 English-Vietnamese sentence pairs. The parallel corpus contains English documents translated into Vietnamese by professional translators. The source texts include books, dictionaries, newspapers and online news. The texts are provided in TEI format.
ELRA-S0394 : Metalogue Multi-Issue Bargaining Dialogue
This corpus consists of approximately 2.5 hours of semantically annotated English dialogue data that includes speech and transcripts. Six unique subjects (undergraduates between 19 and 25 years of age) participated in the collection. The dialogue speech was captured with two headset microphones and saved in 16 kHz, 16-bit mono linear PCM FLAC format. Transcripts were produced semi-automatically, using an automatic speech recognizer followed by manual correction. All text is presented in UTF-8 as either plain text or XML.
ELRA-S0395 : Nautilus Speaker Characterization (NSC) Corpus
This corpus comprises clean microphone recordings of conversational speech from 300 German speakers (126 males and 174 females) aged 18 to 35 years, with no marked dialect/accent. The recordings were made in an acoustically isolated room in 2016/2017. Four scripted and four semi-spontaneous dialogs were elicited from the speakers, simulating telephone call inquiries. Additionally, spontaneous neutral and emotional speech utterances and questions were produced. All labels are provided, together with the speech recordings and the speakers' metadata.
ELRA-W0121 : 2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish
2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish consists of dependency treebanks in four languages used as part of the CoNLL 2007 shared task on multilingual dependency parsing and domain adaptation. The languages covered in this release are Basque, Catalan, Czech and Turkish. The source data in these treebanks consists principally of various texts (e.g., textbooks, news, literature) annotated in dependency format.
(last update: January 2018)