Please use this identifier to cite or link to this item: http://hdl.handle.net/10525/2274

Title: Topic Segmentation: How Much Can We Do by Counting Words and Sequences of Words
Authors: Dias, Gael
Alves, Elsa
Nunes, Célia
Issue Date: 2005
Publisher: Institute of Mathematics and Informatics Bulgarian Academy of Sciences
Citation: Pliska Studia Mathematica Bulgarica, Vol. 17, No 1, (2005), pp. 39-70
Abstract: In this paper, we present an innovative topic segmentation system based on a new informative similarity measure that takes word co-occurrence into account, in order to avoid relying on existing linguistic resources such as electronic dictionaries or lexico-semantic databases such as thesauri or ontologies. Topic segmentation is the task of breaking documents into topically coherent multi-paragraph subparts, and it has been used extensively in information retrieval and text summarization. In particular, our architecture proposes a language-independent topic segmentation system that addresses three main problems evidenced by previous research: systems based solely on lexical repetition show reliability problems; systems based on lexical cohesion use existing linguistic resources that are usually available only for dominant languages and consequently do not apply to less favored languages; and some systems need previously harvested training data. For that purpose, we use only statistics on words and sequences of words computed from a set of texts. This approach provides a flexible solution that may narrow the gap between dominant languages and less favored languages, thus allowing equivalent access to information.
URI: http://hdl.handle.net/10525/2274
ISSN: 0204-9805
Appears in Collections: 2005 Volume 17

Files in This Item:

File: Pliska-17-2005-039-070.pdf
Size: 1.19 MB
Format: Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.