Please use this identifier to cite or link to this item: http://hdl.handle.net/10525/2322

Title: A Statistical Approach for Multilingual Document Clustering and Topic Extraction from Clusters
Authors: Silva, Joaquim
Mexia, Joao
Coelho, Carlos A.
Lopes, Gabriel
Keywords: Cluster Analysis
Applied Statistics
Document Clustering
Text Mining
Topics Extraction
Issue Date: 2004
Publisher: Institute of Mathematics and Informatics Bulgarian Academy of Sciences
Citation: Pliska Studia Mathematica Bulgarica, Vol. 16, No. 1 (2004), pp. 207-228
Abstract: This paper describes a statistics-based methodology for unsupervised document clustering and for extracting topics from the resulting clusters. For this purpose, multiword lexical units (MWUs) of any length are automatically extracted from corpora using LiPXtractor, a language-independent, statistics-based tool. The MWUs are taken as base features to describe documents. These features are transformed and a document similarity matrix is constructed. From this matrix, a reduced set of features is selected using an approach based on Principal Component Analysis. Then, using the Model Based Clustering Analysis software, the best number of clusters is obtained. Precision and Recall for document-cluster assignment are above 90%. The most important MWUs are extracted from each cluster and taken as the topics of that cluster. Results on the classification of new documents are mentioned only briefly.
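
The middle steps of the pipeline outlined in the abstract (document features, similarity matrix, PCA-based reduction) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, since the abstract does not give the details: the toy count matrix, the correlation-style similarity, the 90% variance threshold, and the function names `similarity_matrix` and `principal_components` are hypothetical and are not taken from the paper. MWU extraction and the model-based choice of the number of clusters are left out.

```python
import numpy as np

def similarity_matrix(features):
    """Document-by-document similarity from a documents x features matrix.

    Assumption: features are centred per column and documents are compared
    by the cosine of their centred rows. The paper's own transformation
    is not described in the abstract.
    """
    centred = features - features.mean(axis=0, keepdims=True)
    norms = np.linalg.norm(centred, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for empty documents
    unit = centred / norms
    return unit @ unit.T

def principal_components(sim, variance_kept=0.9):
    """Keep the leading eigen-directions of a symmetric similarity matrix.

    Assumption: components are retained until they cover `variance_kept`
    of the total eigenvalue mass (an illustrative threshold).
    """
    eigvals, eigvecs = np.linalg.eigh(sim)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # largest first
    eigvals = np.clip(eigvals[order], 0.0, None) # drop tiny negative rounding noise
    eigvecs = eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(ratio, variance_kept) + 1)
    return eigvecs[:, :k] * np.sqrt(eigvals[:k])

# Toy data (hypothetical): 6 documents described by counts of 4 MWUs.
# Documents 0-2 use the first two MWUs, documents 3-5 the last two.
counts = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [4, 1, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 2, 4],
    [0, 0, 1, 3],
], dtype=float)

sim = similarity_matrix(counts)
reduced = principal_components(sim)

# The sign of the first component already splits the two document groups.
print(np.sign(reduced[:, 0]))
```

The reduced coordinates in `reduced` are what a clustering step would consume. A model-based procedure, for example fitting Gaussian mixtures for several candidate cluster counts and comparing them with an information criterion, is one common way to choose the number of clusters; whether this matches the software used in the paper is not stated in the abstract.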
Description: 2000 Mathematics Subject Classification: 62H30
URI: http://hdl.handle.net/10525/2322
ISSN: 0204-9805
Appears in Collections: 2004 Volume 16

Files in This Item:

File: Pliska-16-2004-207-228.pdf
Size: 1.19 MB
Format: Adobe PDF

 


