BulDML at Institute of Mathematics and Informatics: A Statistical Approach for Multilingual Document Clustering and Topic Extraction from Clusters



Please use this identifier to cite or link to this item: http://hdl.handle.net/10525/2322

Title:	A Statistical Approach for Multilingual Document Clustering and Topic Extraction from Clusters
Authors:	Silva, Joaquim; Mexia, Joao; Coelho, Carlos A.; Lopes, Gabriel
Keywords:	Cluster Analysis; Applied Statistics; Document Clustering; Text Mining; Topics Extraction
Issue Date:	2004
Publisher:	Institute of Mathematics and Informatics, Bulgarian Academy of Sciences
Citation:	Pliska Studia Mathematica Bulgarica, Vol. 16, No. 1 (2004), pp. 207-228
Abstract:	This paper describes a statistics-based methodology for unsupervised document clustering and for extracting cluster topics. Multiword lexical units (MWUs) of any length are automatically extracted from corpora using LiPXtractor, a language-independent, statistics-based extractor. The MWUs serve as base features to describe documents. These features are transformed and a document similarity matrix is constructed. From this matrix, a reduced set of features is selected using an approach based on Principal Component Analysis. The Model Based Clustering Analysis software is then used to determine the best number of clusters. Precision and recall for document-cluster assignment are above 90%. The most important MWUs in each cluster are extracted and taken as the topics of that cluster. Results on the classification of new documents are only briefly mentioned.
Description:	2000 Mathematics Subject Classification: 62H30
URI:	http://hdl.handle.net/10525/2322
ISSN:	0204-9805
Appears in Collections:	2004 Volume 16
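
Illustration (not part of the original record): the abstract reports precision and recall for document-cluster assignment. The minimal Python sketch below shows one common way such figures can be computed; it is an assumption, not necessarily the definition used in the paper. Each cluster is matched to its majority reference topic, precision is the share of the cluster belonging to that topic, and recall is the share of that topic's documents captured by the cluster. The function name, labels, and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_precision_recall(true_labels, cluster_ids):
    """Per-cluster precision and recall against reference topic labels.

    Assumption for illustration: each cluster is scored against its
    majority reference topic. The paper may define these differently.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have equal length")

    # Number of documents carrying each reference topic.
    topic_sizes = Counter(true_labels)

    # Group the reference topics of the documents by assigned cluster.
    members = defaultdict(list)
    for label, cid in zip(true_labels, cluster_ids):
        members[cid].append(label)

    scores = {}
    for cid, labels in members.items():
        # Majority topic in this cluster and how many documents carry it.
        majority, hits = Counter(labels).most_common(1)[0]
        scores[cid] = {
            "topic": majority,
            "precision": hits / len(labels),
            "recall": hits / topic_sizes[majority],
        }
    return scores


if __name__ == "__main__":
    # Hypothetical toy data: ten documents, three reference topics.
    true_labels = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "C"]
    cluster_ids = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
    for cid, s in sorted(cluster_precision_recall(true_labels, cluster_ids).items()):
        print(cid, s["topic"], round(s["precision"], 2), round(s["recall"], 2))
```

In the toy data, one topic-A document falls into cluster 1, which lowers the recall of cluster 0 and the precision of cluster 1 while leaving cluster 2 unaffected.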

Files in This Item:

File	Description	Size	Format
Pliska-16-2004-207-228.pdf		1.19 MB	Adobe PDF
