3.3.1. Preprocessing
The preprocessing step extracts the relevant parts of the original documents and produces a matrix representation of the sources.
The preprocessing step parses the input collection and is designed to be flexible. Given a document collection D, we build a matrix representation W in which each row corresponds to a document and each column to a feature (stemmed word) of the documents. Each element $w_{i,j}$ of W is the TF-IDF (Term Frequency - Inverse Document Frequency) value of term j in document i, computed as $w_{i,j} = tf_{i,j} \cdot idf_j$, where $tf_{i,j}$ is the term frequency of word j in document i and $idf_j$ is the inverse document frequency of term j (a logarithmic function that depends on the cardinality of the document collection and decreases as the number of documents containing term j grows). Matrix W is generated by means of the text plugin of Rapid Miner. Since in most cases the generated matrix still has a high dimensionality, a further filtering step is applied that eliminates 'useless features', i.e. very frequent words that tend to be non-discriminative for subsequent analysis.
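The construction of W and the subsequent filtering can be illustrated with a minimal, self-contained sketch. It is not the Rapid Miner pipeline used here, only a plain-Python approximation under stated assumptions: the inverse document frequency is taken as log(|D| / df_j), which is one common form; stemming is omitted; and the function names and the `max_df_ratio` threshold for discarding very frequent words are hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on non-letters. Stemming is omitted in this sketch.
    return re.findall(r"[a-z]+", text.lower())


def build_tfidf(docs, max_df_ratio=0.9):
    """Return (vocab, W) with W[i][j] = tf[i][j] * idf[j].

    max_df_ratio is a hypothetical threshold: terms that occur in a larger
    fraction of the documents are treated as non-discriminative and dropped.
    """
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)

    # Document frequency: number of documents in which each term appears.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    # Filtering step: keep only terms that are not too frequent.
    vocab = sorted(t for t, n in df.items() if n / n_docs <= max_df_ratio)

    # Assumed idf form: log(|D| / df_j).
    idf = {t: math.log(n_docs / df[t]) for t in vocab}

    # One row per document, one column per retained term.
    W = [[c[t] * idf[t] for t in vocab] for c in counts]
    return vocab, W


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
vocab, W = build_tfidf(docs)
```

In this toy collection the word "the" occurs in every document, so it exceeds the threshold and is removed from the feature set, while a term such as "cat", which appears in two of the three documents, is kept with a low but non-zero weight.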