Item Details

The Corpus of Galician/Spanish Bilingual Speech of the University of Vigo: Codes tagging and automatic annotation

Issue: Vol 4 No. 1 (2003) Estudios de Sociolingüística 4.1 2003

Journal: Sociolinguistic Studies

Subject Areas: Gender Studies, Linguistics

DOI: 10.1558/sols.v4i1.358

Abstract:

First, we present a brief explanation of this research project, the Corpus of Galician/Spanish Bilingual Speech (Corpus de Fala Bilingüe Galego/Castelán, abbreviated as CoFaBil), currently being compiled at the University of Vigo. This ethnographic-conversational corpus has been recorded in a wide range of informal and spontaneous communicative situations and subsequently transcribed in detail following the conventions normally applied in conversation analysis. Second, we explain the manual annotation process of the corpus. The CHAT annotation system, used to tag this corpus, requires specifying the linguistic-communicative code to which each word belongs. We therefore explain the problems to which this word-by-word tagging gives rise. These problems cover phenomena characteristic of both bilingual conversation and languages in contact, with the particularity that the small interlinguistic distance between the varieties of Galician and Spanish calls for adopting certain tagging values (presented in the text) that respond to the complex nature of the different phenomena detected. Third, we present the solutions devised for the automatic annotation of this corpus. The most important result is the computer application Anotador 1.0, which makes it possible to annotate a substantial part of the phenomena appearing in the CoFaBil more quickly, while avoiding the interpretative biases involved in human annotation. Moreover, owing to the versatility of this tool, it may be used as an annotator of bilingual speech corpora for any pair of languages.
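As a rough illustration of what word-by-word code tagging involves, the following minimal sketch assigns each word of an utterance to one of two closely related languages by dictionary lookup, and gives words found in both lexicons a separate value. It is not the method used by Anotador 1.0, and the tag names (`gl`, `es`, `shared`, `unknown`), the function names and the toy word lists are all assumptions made for this example; they are not CHAT codes or the tagging values presented in the article.

```python
# Hypothetical toy lexicons; a real annotator would need far larger word lists.
GALICIAN = {"eu", "falo", "moito", "casa"}
SPANISH = {"yo", "hablo", "mucho", "casa"}


def tag_word(word):
    """Return an illustrative language tag for a single word."""
    w = word.lower()
    in_gl = w in GALICIAN
    in_es = w in SPANISH
    if in_gl and in_es:
        # The form exists in both lexicons, so lookup alone cannot decide.
        return "shared"
    if in_gl:
        return "gl"
    if in_es:
        return "es"
    return "unknown"


def tag_utterance(utterance):
    """Tag every whitespace-separated word of an utterance."""
    return [(word, tag_word(word)) for word in utterance.split()]


if __name__ == "__main__":
    for word, tag in tag_utterance("eu falo mucho casa"):
        print(word, tag)
```

The `shared` value reflects the difficulty the abstract points to: when two varieties are this close, many word forms belong to both, so a lookup of this kind can only flag them for a further decision.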

Authors: Xoán Paulo Rodríguez-Yáñez, Hakan Casares-Berg
