Search and Browse – ELRA Catalogue

Amharic
English

ID: ELRA-W0074

The Amharic-English bilingual corpus contains parallel text from legal and news domains in Amharic script, in transliterated form and in English. The corpus comprises 232,653 words in Amharic and 291,701 in English. This parallel corpus contains documents from two domains, namely legal...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	2000.00 €
Licence: Commercial Use - ELRA VAR	2000.00 €	2000.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	4000.00 €
Licence: Commercial Use - ELRA VAR	4000.00 €	4000.00 €

Annotated tweet corpus in Arabizi, French and English text

Arabic
English
French

ID: ELRA-W0323

ISLRN: 482-848-308-105-6

The annotated tweet corpus in Arabizi, French and English was built by ELDA on behalf of INSA Rouen Normandie (Normandie Université, LITIS team), in the framework of the SAPhIRS project (System for the Analysis of Information Propagation in Social Networks), funded by the DGE (Direction Générale ...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	7000.00 €
Licence: Commercial Use - ELRA VAR	7000.00 €	7000.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	10000.00 €
Licence: Commercial Use - ELRA VAR	10000.00 €	10000.00 €

ARCADE/ROMANSEVAL corpus text

English
French
Italian

ID: ELRA-W0018

ISLRN: 681-769-134-114-2

The ARCADE/ROMANSEVAL corpus was used as a reference corpus in two international competitions: ARCADE, an exercise on multilingual text alignment financed by AUPELF-UREF, and ROMANSEVAL, part of the SENSEVAL exercise sponsored by ACL-SIGLEX and EURALEX, on word sense disambiguation. The corpus ...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	2000.00 €
Licence: Commercial Use - ELRA VAR	2000.00 €	2000.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	5000.00 €
Licence: Commercial Use - ELRA VAR	5000.00 €	5000.00 €

Corpus for fine-grained analysis and automatic detection of irony on Twitter text

English

ID: ELRA-W0337

ISLRN: 478-366-550-085-8

The Corpus for fine-grained analysis and automatic detection of irony on Twitter was carefully annotated by trained annotators (Master’s students in Linguistics) using a detailed annotation scheme for irony categorization, which describes four labels: ‘ironic by means of a polarity contrast’, ‘si...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	100.00 €
Licence: Commercial Use - ELRA VAR	100.00 €	100.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	200.00 €
Licence: Commercial Use - ELRA VAR	200.00 €	200.00 €

CRATER 2 Corpus text

English
French
Spanish; Castilian

ID: ELRA-W0033

ISLRN: 052-466-219-226-4

The CRATER corpus was built upon the foundations of an earlier project, ET10/63, which was funded in the final phase of the Eurotra programme. The Corpus Resources and Terminology Extraction project (MLAP-93 20) extended the bilingual annotated English-French International Telecommunications Unio...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	25.00 €
Licence: Commercial Use - ELRA VAR	25.00 €	25.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	125.00 €
Licence: Commercial Use - ELRA VAR	125.00 €	125.00 €

CRATER corpus text

English
French
Spanish; Castilian

ID: ELRA-W0003

ISLRN: 645-721-607-031-5

The Corpus Resources and Terminology Extraction project (MLAP-93 20) has extended the bilingual annotated English-French International Telecommunications Union corpus to include Spanish, and has also debugged the existing corpus. The offer consists of a multi-lingual aligned corpus of 1,000,000 t...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	20.00 €
Licence: Commercial Use - ELRA VAR	20.00 €	20.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	100.00 €
Licence: Commercial Use - ELRA VAR	100.00 €	100.00 €

English-Chinese-Vietnamese Trilingual Parallel Corpus text

Chinese
English
Vietnamese

ID: ELRA-W0314

ISLRN: 637-630-726-817-9

The English-Chinese-Vietnamese Trilingual Parallel Corpus consists of 20,046 trilingual sets of sentence pairs. The corpus is provided in XML format and is annotated according to TEI-encoding guidelines.

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	150.00 €	500.00 €
Licence: Commercial Use - ELRA VAR	1000.00 €	1000.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	225.00 €	750.00 €
Licence: Commercial Use - ELRA VAR	1500.00 €	1500.00 €

English-Persian parallel corpus text

English
Persian

ID: ELRA-W0118

ISLRN: 074-825-114-781-7

The English-Persian parallel corpus contains more than 200,000 aligned sentences across a variety of text types from the domains of art, law, culture, science, religion, literature, medicine, idioms, politics and others. It is an extension of the English-Persian parallel corpus already distribute...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	1000.00 €	5000.00 €
Licence: Commercial Use - ELRA VAR	5000.00 €	5000.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	1200.00 €	6000.00 €
Licence: Commercial Use - ELRA VAR	6000.00 €	6000.00 €

English-Persian parallel Corpus text

English
Persian

ID: ELRA-W0051

ISLRN: 671-618-321-687-7

Please refer to ELRA-W0118 for the latest version of this corpus. This version consists of about 3,500,000 English and Persian (Farsi) words aligned at sentence level (about 100,000 sentences, distributed over 50,021 entries). The files are in Unicode format. It was originally created wi...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	500.00 €	2500.00 €
Licence: Commercial Use - ELRA VAR	2500.00 €	2500.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	600.00 €	3000.00 €
Licence: Commercial Use - ELRA VAR	3000.00 €	3000.00 €

English-Punjabi Code-Mixed Social Media Content text

English
Panjabi; Punjabi

ID: ELRA-W0319

ISLRN: 695-759-706-170-8

The English-Punjabi Code-Mixed Social Media Content corpus is composed of 893,615 parallel sentences of English-Punjabi distributed over the following domains: - 82,341 parallel sentences of English-Punjabi code-mixed Agriculture Domain Data, - 59,158 parallel sentences of English-P...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	0.00 €
Licence: Commercial Use - ELRA VAR	0.00 €	0.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	0.00 €
Licence: Commercial Use - ELRA VAR	0.00 €	0.00 €

English-Vietnamese Parallel Corpus text

English
Vietnamese

ID: ELRA-W0311

ISLRN: 893-470-491-825-6

The English-Vietnamese Parallel Corpus consists of 1,000,000 sentence pairs, with an average length of 20 words per sentence. The corpus is provided in XML format and is annotated according to TEI-encoding guidelines.

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	900.00 €	1800.00 €
Licence: Commercial Use - ELRA VAR	9000.00 €	9000.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	1500.00 €	3000.00 €
Licence: Commercial Use - ELRA VAR	12000.00 €	12000.00 €

English-Vietnamese Parallel Corpus text

English
Vietnamese

ID: ELRA-W0124

ISLRN: 838-483-738-912-8

This is a corpus of 500,000 English-Vietnamese sentence pairs, built to develop SMT (Statistical Machine Translation) systems. The parallel corpus contains English documents translated by professional translators into Vietnamese. The source texts include books, dictionaries, newspapers, online ne...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	600.00 €	1200.00 €
Licence: Commercial Use - ELRA VAR	6000.00 €	6000.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	1000.00 €	2000.00 €
Licence: Commercial Use - ELRA VAR	8000.00 €	8000.00 €

EUROPARL Corpus Parallel Corpora: Portuguese-English text

English
Portuguese

ID: ELRA-W0090

ISLRN: 435-502-922-727-2

The EUROPARL Corpus (Portuguese-English subpart of the parallel corpora) was extracted from the proceedings of the European Parliament. It contains transcriptions of sessions dating from 1996 to 2011, with a total of approximately 58,324,562 tokens of European Portuguese (L1) and 49,216,896...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	0.00 €
Licence: Commercial Use - ELRA VAR	0.00 €	0.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	0.00 €
Licence: Commercial Use - ELRA VAR	0.00 €	0.00 €

Khresmoi manually annotated reference corpus text

English

ID: ELRA-W0081

ISLRN: 764-036-829-417-7

The Manually Annotated Reference Corpus is a collection of English web documents annotated with key entities (such as disease, drug), built in the framework of the Khresmoi project, funded by the European Commission. It has been constructed by first annotating these entities with an imperfect aut...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	0.00 €
Licence: Commercial Use - ELRA VAR	0.00 €	0.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	0.00 €
Licence: Commercial Use - ELRA VAR	0.00 €	0.00 €

MULTEXT JOC Corpus text

English
French
German
Italian
Spanish; Castilian

ID: ELRA-W0017

ISLRN: 900-482-746-635-0

This CD-ROM contains a part of the corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050). This part contains raw, tagged and aligned data from the Written Questions and Answers of the Official Journal of the European Community. The corpus contains approx. 5 mill...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	2000.00 €
Licence: Commercial Use - ELRA VAR	2000.00 €	2000.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	5000.00 €
Licence: Commercial Use - ELRA VAR	5000.00 €	5000.00 €

Multilingual Corpus text

Chinese
English
Korean

ID: ELRA-W0035

ISLRN: 731-151-596-869-3

Multilingual parallel corpus produced by KAIST KORTERM, containing 60,000 expressions in Korean, Chinese and English.

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	750.00 €	3000.00 €
Licence: Commercial Use - ELRA VAR	3000.00 €	3000.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	1500.00 €	6000.00 €
Licence: Commercial Use - ELRA VAR	6000.00 €	6000.00 €

Parallel Corpora & Domains (bilingual and multilingual) text

Arabic
Chinese
Danish
Dutch; Flemish
English
Finnish
French
German
Hebrew
Italian
Japanese
Korean
Modern Greek (1453-)
Northern Sami
Norwegian
Polish
Portuguese
Russian
Spanish; Castilian
Swedish
Turkish

ID: ELRA-W0336

ISLRN: 471-919-856-164-1

Parallel corpora for nearly 400 language pairs and numerous multilingual combinations, including 10 million bilingual segments and 90 million tokens in 20 languages: Arabic, Chinese (Simplified), Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Italian, Japanese, Korean, North Sami...

MEMBER	academic	commercial
Licence: Commercial Use - ELRA VAR	0.10 €	0.10 €

NON MEMBER	academic	commercial
Licence: Commercial Use - ELRA VAR	0.11 €	0.11 €

Special offers are also available.

The EMILLE Lancaster Corpus text

Bengali
English
Gujarati
Hindi
Panjabi; Punjabi
Sinhala; Sinhalese
Tamil
Urdu

ID: ELRA-W0038

ISLRN: 438-045-014-925-0

The EMILLE Lancaster Corpus consists of three components: monolingual, parallel and annotated corpora. There are monolingual corpora for seven South Asian languages: Bengali, Gujarati, Hindi, Punjabi, Sinhala, Tamil, Urdu. The EMILLE monolingual corpora contain approximately 58,880,000 words (i...

MEMBER	academic	commercial
Licence: Commercial Use - ELRA VAR		7500.00 €

NON MEMBER	academic	commercial
Licence: Commercial Use - ELRA VAR		12000.00 €

TRAD Arabic-English Mailing lists Parallel corpus - Development set text

Arabic
English

ID: ELRA-W0108

ISLRN: 213-044-240-074-6

This is a parallel corpus of 10,000 words in Arabic and a reference translation in English. The source texts are emails collected from Wikiar-I, a mailing list for discussions about the Arabic Wikipedia. The collected emails are dated from 2004 to 2007. The translation has been conducted follow...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	150.00 €	500.00 €
Licence: Commercial Use - ELRA VAR	500.00 €	500.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	300.00 €	1000.00 €
Licence: Commercial Use - ELRA VAR	1000.00 €	1000.00 €


33 Language Resources (Page 1 of 2)