Spanish-English website parallel corpus (Processed)

See COPYRIGHT file which contains Source owners
664-503-904-200-9

ID: ELRA-W0248

This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu.
This is a parallel corpus of bilingual texts crawled from multilingual websites; it contains 21,007 translation units (TUs).
Period of crawling: 15/11/2016 - 23/01/2017
A strict validation process has been followed, which resulted in discarding:
- TUs from crawled websites that do not comply with the PSI directive,
- TUs with more than 99% of misspelled tokens,
- TUs identified during the manual validation process, and all the TUs from websites whose error rate in the sample extracted for manual validation is strictly above the following thresholds:
  - 50% of TUs with language identification errors,
  - 50% of TUs with alignment errors,
  - 50% of TUs with tokenization errors,
  - 20% of TUs identified as machine translated content,
  - 50% of TUs with translation errors.
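
The per-website thresholds above can be illustrated with a minimal sketch. This is not the validation tooling used for the corpus: the function name, the dictionary keys and the input shape are hypothetical, and the sketch assumes a website is discarded as soon as any single error rate is strictly above its threshold.

```python
# Thresholds in percent, taken from the list above.
# The key names are hypothetical labels chosen for this sketch.
THRESHOLDS_PERCENT = {
    "language_identification": 50,
    "alignment": 50,
    "tokenization": 50,
    "machine_translated": 20,
    "translation": 50,
}


def site_is_discarded(error_counts, sample_size):
    """Return True if any error rate in the manually validated sample
    is strictly above its threshold (assumed reading of the rule).

    error_counts maps an error type to the number of sampled TUs
    showing that error; missing error types count as zero.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    # Integer comparison avoids floating-point rounding at the boundary:
    # count / sample_size > limit / 100  <=>  100 * count > limit * sample_size
    return any(
        100 * error_counts.get(kind, 0) > limit * sample_size
        for kind, limit in THRESHOLDS_PERCENT.items()
    )
```

Under this reading, a website whose sample shows exactly 50% alignment errors is kept, because the comparison is strict, while one with 21% machine translated TUs is discarded.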

Licence: Other - Open Under-PSI
Price: 0.00 € for members and non-members, academic or commercial use
27/02/2020 Downloadable
