phonology corpus corpus linguistics audio archive digitization restoration metadata alignment transcription phonetics
Institute of Information Theories and Applications FOI ITHEA
The paper presents our considerations related to the creation of a digital corpus of Bulgarian dialects.
The dialectological archive of the Bulgarian language consists of more than 250 audio tapes. All tapes were recorded
between 1955 and 1965 in the course of regular dialectological expeditions throughout the country. The records
typically contain interviews with inhabitants of small villages in Bulgaria. The topics covered are usually related to
such issues as birth, everyday life, marriage, family relationships, death, etc. Only a few tapes contain folk songs
from different regions of the country.
Given the progressive deterioration of the magnetic media and the real prospect of data loss,
the Institute for Bulgarian Language at the Academy of Sciences launched a project in 1997 aimed at the restoration
and digital preservation of the dialectological archive. Within the framework of this project, more than half of
the records were digitized, de-noised and stored on digital recording media. Since then, restoration and digitization
activities have been carried out at the Institute on a regular basis. As a result, a large collection of sound files has been created.
Our further efforts are aimed at the creation of a digital corpus of Bulgarian dialects, which will be made available
for phonological and linguistic research. Besides the sound files, such corpora typically include two basic
elements: a transcription aligned with the sound file, and a set of standardized metadata that defines the corpus.
In this paper we present considerations on how these tasks could be carried out for the corpus of
Bulgarian dialects. Our suggestions are based on a comparative analysis of existing methods and techniques
for building such corpora, and on selecting those that best fit the particular needs. Our experience may be
useful to similar institutions that store folklore archives, history-related spoken records, etc.
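
To make the two basic elements concrete, the following is a minimal sketch in Python of how one digitized recording could be represented together with its metadata and a time-aligned transcription. It is an illustration only: the class names, the metadata fields (village, year, topic) and all sample values are assumptions chosen for the example, not the schema or the data of the actual corpus.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    """One transcribed stretch of speech, aligned to the audio by time offsets."""
    start_s: float  # offset from the start of the sound file, in seconds
    end_s: float    # end offset in seconds (exclusive)
    speaker: str
    text: str


@dataclass
class Recording:
    """A digitized tape with hypothetical metadata fields and its transcription."""
    audio_file: str
    village: str
    year: int
    topic: str
    segments: List[Segment] = field(default_factory=list)

    def segment_at(self, t: float) -> Optional[Segment]:
        """Return the segment whose interval [start_s, end_s) contains t, if any."""
        for seg in self.segments:
            if seg.start_s <= t < seg.end_s:
                return seg
        return None


# Placeholder values only; none of these come from the real archive.
rec = Recording(
    audio_file="tape_001.wav",
    village="(village name)",
    year=1960,
    topic="marriage",
    segments=[
        Segment(0.0, 4.2, "interviewer", "(question)"),
        Segment(4.2, 11.5, "informant", "(answer)"),
    ],
)

hit = rec.segment_at(5.0)
print(hit.speaker if hit else "no segment")
```

Storing the alignment as start and end offsets per segment keeps the transcription independent of the audio encoding, so the same record could point to a restored or re-digitized version of the sound file without changing the transcription.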