Cluster Analysis; Applied Statistics; Document Clustering; Text Mining; Topics Extraction
Issue Date:
2004
Publisher:
Institute of Mathematics and Informatics Bulgarian Academy of Sciences
Citation:
Pliska Studia Mathematica Bulgarica, Vol. 16, No 1, (2004), 207p-228p
Abstract:
This paper describes a statistics-based methodology for unsupervised document clustering and cluster topic extraction.
For this purpose, multiword lexical units (MWUs) of any length are automatically extracted from corpora using the LiPXtractor extractor, a language-independent, statistics-based tool. The MWUs are taken as base features to describe the documents. These features are transformed and a document similarity matrix is constructed. From this matrix, a reduced set of features is selected using an approach based on Principal Component Analysis. The Model Based Clustering Analysis software is then used to determine the best number of clusters. Precision and recall for document-cluster assignment are above 90%. The most important MWUs in each cluster are extracted and taken as that cluster's topics. Results on the classification of new documents are mentioned only briefly.
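
The abstract reports precision and recall for document-cluster assignment without saying how they are computed. The following is a minimal sketch of one common way to do it, not the paper's own procedure. It assumes each cluster is mapped to the reference class held by most of its documents. The function name `cluster_precision_recall`, the document ids, and the class labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cluster_precision_recall(assigned, gold):
    """Return per-cluster precision and recall.

    assigned: dict mapping document id -> cluster label from the clustering step
    gold:     dict mapping document id -> reference class (hand-labelled)

    Assumption: each cluster is identified with its majority reference class.
    """
    # Group document ids by the cluster they were assigned to.
    clusters = {}
    for doc, cluster in assigned.items():
        clusters.setdefault(cluster, []).append(doc)

    # Number of documents in each reference class, needed for recall.
    class_sizes = Counter(gold.values())

    scores = {}
    for cluster, docs in clusters.items():
        counts = Counter(gold[d] for d in docs)
        majority_class, hits = counts.most_common(1)[0]
        scores[cluster] = {
            "class": majority_class,
            # Fraction of the cluster that belongs to its majority class.
            "precision": hits / len(docs),
            # Fraction of the majority class captured by this cluster.
            "recall": hits / class_sizes[majority_class],
        }
    return scores


# Toy data (hypothetical): five documents, two clusters, two reference classes.
assigned = {"d1": 0, "d2": 0, "d3": 0, "d4": 1, "d5": 1}
gold = {"d1": "sports", "d2": "sports", "d3": "law", "d4": "law", "d5": "law"}

for cluster, s in sorted(cluster_precision_recall(assigned, gold).items()):
    print(cluster, s["class"], round(s["precision"], 2), round(s["recall"], 2))
```

On the toy data, cluster 0 contains two "sports" documents and one "law" document, so its precision is 2/3 and its recall is 1. Cluster 1 contains only "law" documents, so its precision is 1, but it captures only two of the three "law" documents, giving a recall of 2/3.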