3.3.2. Categorizing
The matrix representation of the document collection can be used to explore the whole collection more efficiently and to find groups of documents that share similar topics. To accomplish this goal we apply an optional clustering algorithm. This step divides unclassified texts, belonging to document collections drawn from specialized journals, into more homogeneous subsets. The categorization phase can be very important for detecting documents that share a common topic without any a priori knowledge of their content. However, if the user has already searched for articles on a very specific topic, there may be no advantage in categorizing them further. Categorization is performed by means of CLUTO's partitional clustering algorithm, the repeated-bisecting method, which produces a globally optimized solution. BioSumm lets the user choose whether clustering is performed and, if so, the desired number of clusters. After executing the clustering block, BioSumm also saves two files in the main program folder: Cluster.gif, which represents the cluster distribution graphically, and RM_TMP.txt, which contains the RapidMiner output.
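The idea behind repeated bisection can be illustrated with a minimal sketch in plain Python. This is not CLUTO's implementation: the function names (`repeated_bisection`, `bisect`), the toy document-term matrix, the deterministic seeding, and the rule of always splitting the largest cluster are assumptions made for illustration only. CLUTO optimizes its own criterion functions and offers several cluster-selection strategies.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))

def centroid(rows):
    """Unit-length centroid of a non-empty list of vectors."""
    return normalize([sum(col) for col in zip(*rows)])

def bisect(unit, members, max_iter=20):
    """Split `members` (row indices) in two with a 2-means loop on cosine similarity."""
    # Assumed seeding: the first document and the document least similar to it.
    c0 = unit[members[0]]
    c1 = min((unit[i] for i in members), key=lambda v: cosine(v, c0))
    left, right = list(members), []
    for _ in range(max_iter):
        new_left = [i for i in members if cosine(unit[i], c0) >= cosine(unit[i], c1)]
        new_right = [i for i in members if cosine(unit[i], c0) < cosine(unit[i], c1)]
        if not new_left or not new_right:
            break  # degenerate split: keep the last valid assignment
        converged = new_left == left
        left, right = new_left, new_right
        if converged:
            break
        c0 = centroid([unit[i] for i in left])
        c1 = centroid([unit[i] for i in right])
    return left, right

def repeated_bisection(vectors, k):
    """Start from one cluster and bisect until k clusters exist (or no split is possible)."""
    unit = [normalize(v) for v in vectors]
    clusters = [list(range(len(unit)))]
    while len(clusters) < k:
        # Simplifying assumption: always split the largest remaining cluster.
        clusters.sort(key=len, reverse=True)
        target = clusters.pop(0)
        left, right = bisect(unit, target)
        if not left or not right:
            clusters.append(target)  # cannot be split further
            break
        clusters.extend([left, right])
    return clusters

# Hypothetical document-term matrix: 6 documents, 6 terms, 3 obvious topics.
docs = [
    [3, 2, 0, 0, 0, 0],
    [2, 3, 0, 0, 0, 0],
    [0, 0, 3, 2, 0, 0],
    [0, 0, 2, 3, 0, 0],
    [0, 0, 0, 0, 3, 2],
    [0, 0, 0, 0, 2, 3],
]
clusters = repeated_bisection(docs, 3)
print(sorted(sorted(c) for c in clusters))
```

Each returned cluster is a list of row indices into the input matrix, so the result can be mapped back to the original documents. The requested number of clusters plays the same role as the user-selected cluster count described above. If the data cannot be split further, the sketch returns fewer clusters than requested.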