1,625 language resources at your disposal
A growing number of language resources (LRs) in the various fields of Human Language Technology (HLT) (see image on the left-hand side) are distributed on behalf of ELRA by its operational body, ELDA, thanks to contributions from various players in the HLT community.
Through this repository, we aim to provide language resources so that researchers and developers need not spend effort rebuilding resources that already exist, and to help them identify and access those resources.
Latest Resources
Corpus for fine-grained analysis and automatic detection of irony on Twitter
The Corpus for fine-grained analysis and automatic detection of irony on Twitter was carefully annotated by trained annotators (Master’s students in Linguistics) using a detailed annotation scheme for irony categorization, which describes four labels: ‘ironic by means of a polarity contrast’, ‘situational irony’, ‘other verbal irony’ and ‘not ironic’. The ...
AUDIO Human Voice Pronunciations - Greek
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Swedish
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Catalan
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Korean
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Polish
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Thai
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Norwegian
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Italian
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Arabic
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Hebrew
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Spanish
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Turkish
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Russian
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Japanese
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Portuguese (Portugal)
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Dutch
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Portuguese (Brazil)
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Danish
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - English
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Chinese (Simplified)
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Czech
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
GEOLINGUAL Multilingual Geographical Entity Tables
A table of over 200 countries and other major geographical names worldwide – including their adjectives, persons, and main languages – in the following languages: Arabic, Chinese Simplified, Danish, Dutch, English, French, German, Greek, Hebrew, Japanese, Korean, Polish, Portuguese, Russian, Spanish, and Turkish.
English BIO Biographical Names (Multilingual)
This dataset consists of 4,200 dictionary entries regarding prominent persons worldwide. A similarly designed dataset for geographical locations is available as a separate package (ELRA-L0204-02).
English GEO Geographical Names (Multilingual)
This dataset consists of 7,200 dictionary entries regarding major locations worldwide. A similarly designed dataset for prominent persons (biographical names) is available as a separate package (ELRA-L0204-01).
Morphological lexicon - Slovak
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15): • Dutch (nl): 157,000 lemmas, 205,603 word forms • English (en): 69,308 lemmas, 160,441 word forms • French (fr): 79,843 lemmas, 442,085 word forms • German (de): 95,282 lemmas, 456,244 word forms • Hebrew (he): 25,351 lemmas, 862,260 word forms • Italian (it): 28,722 lemmas, 303,025 word forms • Japanese (ja): 265,565 lemmas, 398,508 word forms ...
MULTIGLOSS Multilingual Glossaries - L1-English pair
A series of innovative multilingual word-to-sense glossaries, based on a human-edited word-to-sense bilingual index of each language to English, which is linked automatically to the translation equivalents in 45 target languages. Each word and expression in every language is translated via its corresponding sense in English into 44 of these ...
Morphological lexicon - Russian
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15): • Dutch (nl): 157,000 lemmas, 205,603 word forms • English (en): 69,308 lemmas, 160,441 word forms • French (fr): 79,843 lemmas, 442,085 word forms • German (de): 95,282 lemmas, 456,244 word forms • Hebrew (he): 25,351 lemmas, 862,260 word forms • Italian (it): 28,722 lemmas, 303,025 word forms • Japanese (ja): 265,565 lemmas, 398,508 word forms ...
Morphological lexicon - Portuguese
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15): • Dutch (nl): 157,000 lemmas, 205,603 word forms • English (en): 69,308 lemmas, 160,441 word forms • French (fr): 79,843 lemmas, 442,085 word forms • German (de): 95,282 lemmas, 456,244 word forms • Hebrew (he): 25,351 lemmas, 862,260 word forms • Italian (it): 28,722 lemmas, 303,025 word forms • Japanese (ja): 265,565 lemmas, 398,508 word forms ...
Morphological lexicon - Norwegian Nynorsk
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15): • Dutch (nl): 157,000 lemmas, 205,603 word forms • English (en): 69,308 lemmas, 160,441 word forms • French (fr): 79,843 lemmas, 442,085 word forms • German (de): 95,282 lemmas, 456,244 word forms • Hebrew (he): 25,351 lemmas, 862,260 word forms • Italian (it): 28,722 lemmas, 303,025 word forms • Japanese (ja): 265,565 lemmas, 398,508 word forms ...
Morphological lexicon - Hebrew
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15): • Dutch (nl): 157,000 lemmas, 205,603 word forms • English (en): 69,308 lemmas, 160,441 word forms • French (fr): 79,843 lemmas, 442,085 word forms • German (de): 95,282 lemmas, 456,244 word forms • Hebrew (he): 25,351 lemmas, 862,260 word forms • Italian (it): 28,722 lemmas, 303,025 word forms • Japanese (ja): 265,565 lemmas, 398,508 word forms ...
Parallel Corpora & Domains (bilingual and multilingual)
Parallel corpora for nearly 400 language pairs and numerous multilingual combinations, including 10 million bilingual segments and 90 million tokens in 20 languages: Arabic, Chinese (Simplified), Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Italian, Japanese, Korean, North Sami, Norwegian, Polish, Portuguese (Brazilian and European), Russian, Spanish, Swedish, and Turkish. ...
Morphological lexicon - Norwegian Bokmål
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15): • Dutch (nl): 157,000 lemmas, 205,603 word forms • English (en): 69,308 lemmas, 160,441 word forms • French (fr): 79,843 lemmas, 442,085 word forms • German (de): 95,282 lemmas, 456,244 word forms • Hebrew (he): 25,351 lemmas, 862,260 word forms • Italian (it): 28,722 lemmas, 303,025 word forms • Japanese (ja): 265,565 lemmas, 398,508 word forms ...
MULTIGLOSS Multilingual Glossaries - L1-English pair + 1 language
A series of innovative multilingual word-to-sense glossaries, based on a human-edited word-to-sense bilingual index of each language to English, which is linked automatically to the translation equivalents in 45 target languages. Each word and expression in every language is translated via its corresponding sense in English into 44 of these ...
Morphological lexicon - German
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15): • Dutch (nl): 157,000 lemmas, 205,603 word forms • English (en): 69,308 lemmas, 160,441 word forms • French (fr): 79,843 lemmas, 442,085 word forms • German (de): 95,282 lemmas, 456,244 word forms • Hebrew (he): 25,351 lemmas, 862,260 word forms • Italian (it): 28,722 lemmas, 303,025 word forms • Japanese (ja): 265,565 lemmas, 398,508 word forms ...
Morphological lexicon - Italian
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15): • Dutch (nl): 157,000 lemmas, 205,603 word forms • English (en): 69,308 lemmas, 160,441 word forms • French (fr): 79,843 lemmas, 442,085 word forms • German (de): 95,282 lemmas, 456,244 word forms • Hebrew (he): 25,351 lemmas, 862,260 word forms • Italian (it): 28,722 lemmas, 303,025 word forms • Japanese (ja): 265,565 lemmas, 398,508 word forms ...
Morphological lexicon - Korean
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15): • Dutch (nl): 157,000 lemmas, 205,603 word forms • English (en): 69,308 lemmas, 160,441 word forms • French (fr): 79,843 lemmas, 442,085 word forms • German (de): 95,282 lemmas, 456,244 word forms • Hebrew (he): 25,351 lemmas, 862,260 word forms • Italian (it): 28,722 lemmas, 303,025 word forms • Japanese (ja): 265,565 lemmas, 398,508 word forms ...
Morphological lexicon - Japanese
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15): • Dutch (nl): 157,000 lemmas, 205,603 word forms • English (en): 69,308 lemmas, 160,441 word forms • French (fr): 79,843 lemmas, 442,085 word forms • German (de): 95,282 lemmas, 456,244 word forms • Hebrew (he): 25,351 lemmas, 862,260 word forms • Italian (it): 28,722 lemmas, 303,025 word forms • Japanese (ja): 265,565 lemmas, 398,508 word forms ...
Morphological lexicon - Spanish
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15): • Dutch (nl): 157,000 lemmas, 205,603 word forms • English (en): 69,308 lemmas, 160,441 word forms • French (fr): 79,843 lemmas, 442,085 word forms • German (de): 95,282 lemmas, 456,244 word forms • Hebrew (he): 25,351 lemmas, 862,260 word forms • Italian (it): 28,722 lemmas, 303,025 word forms • Japanese (ja): 265,565 lemmas, 398,508 word forms ...
Morphological lexicon - English
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15): • Dutch (nl): 157,000 lemmas, 205,603 word forms • English (en): 69,308 lemmas, 160,441 word forms • French (fr): 79,843 lemmas, 442,085 word forms • German (de): 95,282 lemmas, 456,244 word forms • Hebrew (he): 25,351 lemmas, 862,260 word forms • Italian (it): 28,722 lemmas, 303,025 word forms • Japanese (ja): 265,565 lemmas, 398,508 word forms ...
Morphological lexicon - Dutch
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15): • Dutch (nl): 157,000 lemmas, 205,603 word forms • English (en): 69,308 lemmas, 160,441 word forms • French (fr): 79,843 lemmas, 442,085 word forms • German (de): 95,282 lemmas, 456,244 word forms • Hebrew (he): 25,351 lemmas, 862,260 word forms • Italian (it): 28,722 lemmas, 303,025 word forms • Japanese (ja): 265,565 lemmas, 398,508 word forms ...
Morphological lexicon - Swedish
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15): • Dutch (nl): 157,000 lemmas, 205,603 word forms • English (en): 69,308 lemmas, 160,441 word forms • French (fr): 79,843 lemmas, 442,085 word forms • German (de): 95,282 lemmas, 456,244 word forms • Hebrew (he): 25,351 lemmas, 862,260 word forms • Italian (it): 28,722 lemmas, 303,025 word forms • Japanese (ja): 265,565 lemmas, 398,508 word forms ...
EWA-DB – Early Warning of Alzheimer speech database
EWA-DB is a speech database that contains data from 3 clinical groups (Alzheimer's disease, Parkinson's disease, and mild cognitive impairment) and a control group of healthy subjects. Speech samples of each clinical group were obtained using the EWA smartphone application, which contains 4 different language tasks: sustained vowel phonation, diadochokinesis, object ...
Morphological lexicon - French
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15): • Dutch (nl): 157,000 lemmas, 205,603 word forms • English (en): 69,308 lemmas, 160,441 word forms • French (fr): 79,843 lemmas, 442,085 word forms • German (de): 95,282 lemmas, 456,244 word forms • Hebrew (he): 25,351 lemmas, 862,260 word forms • Italian (it): 28,722 lemmas, 303,025 word forms • Japanese (ja): 265,565 lemmas, 398,508 word forms ...
GLOBAL Multilingual Lexical Data - Bilingual - Level 3
The GLOBAL Multilingual Lexical Data (references ELRA-M0111-01 to ELRA-M0111-06 in the ELRA Catalogue) consists of a network of lexicographic cores for major world languages, comprising diverse monolingual, bilingual and multilingual combinations, in different sizes, originally built for language learning and translation. They are available in XML, JSON or JSON-LD (RDF) ...
GLOBAL Multilingual Lexical Data - Monolingual - Level 1
The GLOBAL Multilingual Lexical Data (references ELRA-M0111-01 to ELRA-M0111-06 in the ELRA Catalogue) consists of a network of lexicographic cores for major world languages, comprising diverse monolingual, bilingual and multilingual combinations, in different sizes, originally built for language learning and translation. They are available in XML, JSON or JSON-LD (RDF) ...
GLOBAL Multilingual Lexical Data - Monolingual - Level 2
The GLOBAL Multilingual Lexical Data (references ELRA-M0111-01 to ELRA-M0111-06 in the ELRA Catalogue) consists of a network of lexicographic cores for major world languages, comprising diverse monolingual, bilingual and multilingual combinations, in different sizes, originally built for language learning and translation. They are available in XML, JSON or JSON-LD (RDF) ...
GLOBAL Multilingual Lexical Data - Bilingual - Level 2
The GLOBAL Multilingual Lexical Data (references ELRA-M0111-01 to ELRA-M0111-06 in the ELRA Catalogue) consists of a network of lexicographic cores for major world languages, comprising diverse monolingual, bilingual and multilingual combinations, in different sizes, originally built for language learning and translation. They are available in XML, JSON or JSON-LD (RDF) ...
GLOBAL Multilingual Lexical Data - Bilingual - Level 1
The GLOBAL Multilingual Lexical Data (references ELRA-M0111-01 to ELRA-M0111-06 in the ELRA Catalogue) consists of a network of lexicographic cores for major world languages, comprising diverse monolingual, bilingual and multilingual combinations, in different sizes, originally built for language learning and translation. They are available in XML, JSON or JSON-LD (RDF) ...
GLOBAL Multilingual Lexical Data - Monolingual - Level 3
The GLOBAL Multilingual Lexical Data (references ELRA-M0111-01 to ELRA-M0111-06 in the ELRA Catalogue) consists of a network of lexicographic cores for major world languages, comprising diverse monolingual, bilingual and multilingual combinations, in different sizes, originally built for language learning and translation. They are available in XML, JSON or JSON-LD (RDF) ...
Corpus of Spontaneous Japanese (CSJ)
The "Corpus of Spontaneous Japanese" (or CSJ) is a database containing a large collection of Japanese spoken language data and information for use in linguistic research; jointly developed by NINJAL, NICT and the Tokyo Institute of Technology, the CSJ is world-class in both the quantity and quality of the available ...
Bitext Synonym Data - General Language
The Bitext Synonym Data - General Language includes 31,723 entries and more than 100,000 synonyms for the English language. This dataset is a set of synonyms developed to augment the English version of WordNet, a powerful open-source lexical database, released in 2005. All synonyms can be linked to Bitext Lexical Data ...
Bitext Synthetic Data - Event and ticketing (Spanish language)
The Bitext Synthetic Data consist of pre-built training data for intent detection and are provided for 20 verticals for the Spanish language (see ELRA-L0182 to ELRA-L0201). They cover the most common intents for each vertical and include a large number of example utterances for each intent, with optional entity/slot annotations for ...
Bitext Synthetic Data - Legal (Spanish language)
The Bitext Synthetic Data consist of pre-built training data for intent detection and are provided for 20 verticals for the Spanish language (see ELRA-L0182 to ELRA-L0201). They cover the most common intents for each vertical and include a large number of example utterances for each intent, with optional entity/slot annotations for ...