ELRA releases
The ELRA Catalogue of Language Resources offers a repository of Language Resources (LRs) made available through ELRA.
An increasing number of LRs in the various fields of Human Language Technology (see image on the left-hand side) are distributed on behalf of ELRA via its operational body ELDA, thanks to the contribution of various players of the HLT community.
Our aim in providing Language Resources through this repository is to spare researchers and developers the effort of rebuilding resources that already exist, and to help them identify and access those resources.
Other resources identified, but not available through ELRA, can be viewed in the Universal Catalogue.
If you have any suggestions or comments, or need further details about ELRA and its Catalogue of Language Resources, please refer to the Contact Us section.
ELRA-L0098 : Arabic dictionary of inflected words
This dictionary consists of a list of 6 million inflected forms, fully vowelized and tagged with grammatical information: POS and grammatical features, including number, gender, case, definiteness, tense, mood and compatibility with clitic agglutination. The data is formatted in conformity with the data formats of Unitex/GramLab. This dictionary is also available together with recognition of agglutinated clitics and an inflection system in the ELRA Catalogue under reference ELRA-L0099.
ELRA-L0099 : Arabic dictionary of inflected words with recognition of agglutinated clitics and inflection system
This dictionary consists of 6 million inflected forms, fully vowelized, generated in compliance with the grammatical rules of Arabic and tagged with grammatical information: POS and grammatical features, including number, gender, case, definiteness, tense, mood and compatibility with clitic agglutination. It is accompanied by a grammatical resource that recognizes hundreds of millions of valid agglutinated words. So that the full-form dictionary can be updated, a dictionary of 65,000 lemmas and the data required to inflect them and regenerate the full-form dictionary are also provided. The data is formatted in conformity with the data formats of Unitex/GramLab. This dictionary is also available without recognition of agglutinated clitics and without the inflection system in the ELRA Catalogue under reference ELRA-L0098.
ELRA-W0119 : Helsinki Corpus of Swahili
This is a 25-million-word text corpus of Swahili, annotated for part-of-speech, morphology and syntax. The corpus contains prose text from domains such as fiction, news media and government documents, dating from 1953 to 2016.
ELRA-W0120 : NUM 5M Mongolian written corpus
This is a corpus of Mongolian text drawn mostly from online and printed daily newspapers, literature, and laws. Part of this corpus, about 2,800 sentences comprising 100,000 words, has been POS-tagged manually and stored in XML TEI format.
ELRA-S0393 : Persian Speech Corpus
This speech corpus was recorded in a professional studio by one male Persian speaker (Tehrani accent) using a "Blubbery" model microphone. Speech synthesized using this corpus has produced a high-quality, natural voice. The corpus consists of 399 utterances for a total of about 2.5 hours, with orthographic and phonetic
(last update: October 2017)