Spoken Portuguese Corpus – ELRA Catalogue

Last view: 2024-04-18

Last update: 2020-07-09

Spoken Portuguese Corpus

Name in French: Corpus du portugais parlé

ISLRN: 969-074-010-182-2

ID: ELRA-S0345

The Spoken Portuguese Corpus was collected from sociolinguistically diverse speakers who have Portuguese as their mother tongue or as a second language. Its 86 recordings exemplify the Portuguese spoken in Portugal (30), in Brazil (20), in the African countries with Portuguese as their official language, namely Angola, Cape Verde, Guinea-Bissau, Mozambique and Sao Tome and Principe (5 each), in Macao (5), in Goa (3) and in East Timor (3), for a total of 8h44m of recording.
The corpus was recorded in situations of spontaneous oral communication, on different themes of everyday life, with speakers of different ages and social and professional backgrounds.
The recordings span 1970 to 2001, and approximately 70% of them date from the 1990s. The corpus contains 153,588 tokens.
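
As a quick consistency check on the figures above, the following minimal Python sketch sums the per-variety recording counts and converts the stated duration to decimal hours. The per-recording and per-hour token averages are derived here for illustration only; they are not figures stated in the catalogue.

```python
# Recording counts per variety, as listed in the description.
recordings = {
    "Portugal": 30,
    "Brazil": 20,
    "Angola": 5,
    "Cape Verde": 5,
    "Guinea-Bissau": 5,
    "Mozambique": 5,
    "Sao Tome and Principe": 5,
    "Macao": 5,
    "Goa": 3,
    "East Timor": 3,
}

total_recordings = sum(recordings.values())

# Total duration of 8h44m expressed in decimal hours.
duration_hours = 8 + 44 / 60

# Derived averages (illustrative, not stated in the catalogue).
tokens = 153_588
tokens_per_recording = tokens / total_recordings
tokens_per_hour = tokens / duration_hours

print(f"recordings: {total_recordings}")
print(f"duration:   {duration_hours:.2f} hours")
print(f"tokens per recording: {tokens_per_recording:.0f}")
print(f"tokens per hour:      {tokens_per_hour:.0f}")
```

The counts add up to the stated 86 recordings, and 8h44m corresponds to roughly 8.73 hours.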

The corpus consists of audio files in .wav format, aligned transcriptions in XML Exmaralda format and transcriptions in plain text. The plain text files also carry automatically assigned POS-tag information. The transcriptions are also available in HTML format. All text is encoded in UTF-8.
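
To illustrate how time-aligned transcriptions of this kind might be read, here is a minimal Python sketch using the standard library. It assumes the general EXMARaLDA basic-transcription layout (a `common-timeline` of `tli` points and `tier` elements holding `event` spans). The sample XML is hand-written and hypothetical, not taken from the corpus, and the actual files may differ in detail.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample in the assumed EXMARaLDA
# basic-transcription layout; it is NOT taken from the corpus.
SAMPLE = """<basic-transcription>
  <basic-body>
    <common-timeline>
      <tli id="T0" time="0.0"/>
      <tli id="T1" time="1.5"/>
      <tli id="T2" time="3.2"/>
    </common-timeline>
    <tier id="TIE0" speaker="SPK0" category="v" type="t">
      <event start="T0" end="T1">bom dia</event>
      <event start="T1" end="T2">como está</event>
    </tier>
  </basic-body>
</basic-transcription>"""


def aligned_events(xml_text):
    """Return (speaker, start_seconds, end_seconds, text) for each event."""
    root = ET.fromstring(xml_text)

    # Map timeline point ids to times; a point without a time stays None.
    times = {}
    for tli in root.iter("tli"):
        raw = tli.get("time")
        times[tli.get("id")] = float(raw) if raw is not None else None

    rows = []
    for tier in root.iter("tier"):
        for event in tier.iter("event"):
            rows.append((
                tier.get("speaker"),
                times.get(event.get("start")),
                times.get(event.get("end")),
                (event.text or "").strip(),
            ))
    return rows


for speaker, start, end, text in aligned_events(SAMPLE):
    print(speaker, start, end, text)
```

Resolving each event's `start` and `end` ids against the timeline yields offsets in seconds that can be matched against the corresponding .wav file.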

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	2500.00 €
Licence: Commercial Use - ELRA VAR	2500.00 €	2500.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	3000.00 €
Licence: Commercial Use - ELRA VAR	3000.00 €	3000.00 €

Distribution

Availability start date: 12/09/2012

Contact Person

Valérie Mapelli

audio

Monolingual audio corpus

Languages

Portuguese

Linguality

Linguality type: Monolingual

Size

8.73 Hours

Classification

This corpus consists of informal conversations between acquaintances, friends or relatives, as well as formal events such as radio programmes or conferences.

Audio genre: Other

Audio Formats

Time Coverage

1970 to 2001

Recording

Source channel: Other

Metadata

Created: 05/12/2005

Metadata Language: French, English (fr, en)

Version

Version: 1.0

Last Updated: 09/12/2012
