In the following solution, we walk through the pipeline that achieved the baseline result you saw on the leaderboard. Careful readers will therefore find several things that can be improved, and others left aside for further discussion.


We need to download a few resources to complete the lab assignment.

In [1]:
import numpy as np
import nltk
nltk.download('wordnet')
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/giuseppe/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!

Newsgroup clustering

Let's start by loading the dataset.

In [2]:
import os

in_dir = "T-newsgroups"
files = sorted(os.listdir(in_dir), key=int) # equivalent to key=lambda x: int(x)
files[:5]
['0', '1', '2', '3', '4']

Be careful: os.listdir returns filenames in arbitrary order. To handle that, we simply sort them using their integer cast as the ordering criterion.
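To see why the integer cast matters, here is a minimal, self-contained illustration of what a plain lexicographic sort does to numeric filenames:

```python
# Numeric filenames sorted as plain strings end up in lexicographic order,
# while key=int restores the natural numeric order.
names = ["0", "1", "10", "2", "100"]
print(sorted(names))           # ['0', '1', '10', '100', '2']
print(sorted(names, key=int))  # ['0', '1', '2', '10', '100']
```

Without `key=int`, document 10 would come before document 2, silently misaligning filenames and matrix rows.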

Data preparation step

Here we can make use of the TfidfVectorizer class from scikit-learn. Starting from the code snippet provided in the laboratory text, we define a custom LemmaTokenizer class that can be used in place of TfidfVectorizer's standard tokenizer. In our tokenizer, we create tokens and lemmatize them through nltk's word_tokenize and WordNetLemmatizer respectively.

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords as sw
import re
import string

class LemmaTokenizer(object):
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()
    def __call__(self, document):
        lemmas = []
        re_digit = re.compile("[0-9]") # regular expression to match digits
        for t in word_tokenize(document):
            t = t.strip()
            lemma = self.lemmatizer.lemmatize(t)
            # keep tokens that are not punctuation, have a reasonable length
            # and do not start with a digit
            if lemma not in string.punctuation and \
               len(lemma) > 3 and \
               len(lemma) < 16 and \
               not re_digit.match(lemma):
                lemmas.append(lemma)
        return lemmas

Let's now apply our custom tokenizer to generate the TFIDF representation. We can let the TfidfVectorizer read the documents for us, using the input parameter of its constructor. We then only need to pass fit_transform the correct paths to our files.

In [4]:
stopwords = sw.words('english') + ["'d", "'ll", "'re", "'s", "'ve", 'could', 'doe', 'ha', 'might', 'must', "n't", 'need', 'sha', 'wa', 'wo', 'would']
tokenizer = LemmaTokenizer()
vectorizer = TfidfVectorizer(input='filename', tokenizer=tokenizer, stop_words=stopwords)

filepath = [os.path.join(in_dir, f) for f in files] # here we need the correct path
X_tfidf = vectorizer.fit_transform(filepath)
<4000x40684 sparse matrix of type '<class 'numpy.float64'>'
	with 357446 stored elements in Compressed Sparse Row format>

Note that we extended the stopword list with further non-significant words.

Principal Component Analysis

As you can see, we are dealing with about 40k distinct features. These are all the distinct words found in our corpus (i.e. the collection of our news articles). We can transform and reduce our input feature space through Principal Component Analysis (PCA) (see the Appendix of the laboratory text).

Let's focus on PCA applied by means of the Singular Value Decomposition (SVD). Each component extracted by the decomposition expresses a given amount of the variance of our data. In the best case, most of the variance is expressed by a small number of new features. As suggested, the TruncatedSVD class lets us decide how many components to retain in the new space. Additionally, it exposes the variance those components encode through attributes such as explained_variance_ratio_.

In [5]:
from sklearn.decomposition import TruncatedSVD

Let's try with a low number of features, e.g. 30, and use them to cluster our points. Please remember that this algebraic method transforms the initial space. Hence, you will not end up with the 30 words that best separate your data. Instead, you will have new components (i.e. projections of your points) in a new 30-dimensional space.

The fit_transform method performs the decomposition and maps the input data to the new vector space.

In [6]:
svd = TruncatedSVD(n_components=30, random_state=42)
X_svd = svd.fit_transform(X_tfidf)

print(f"Total variance explained: {np.sum(svd.explained_variance_ratio_):.2f}")
Total variance explained: 0.07

A total of 7% is quite low for real applications. We will investigate this in Section 2.4.
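One common way to pick a better number of components is to inspect the cumulative explained variance as n_components grows. Here is a minimal sketch on random data standing in for X_tfidf (the shapes and the 50% threshold are illustrative assumptions, not values from this lab):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(42)
X = rng.rand(200, 500)  # hypothetical stand-in for the TFIDF matrix

svd = TruncatedSVD(n_components=100, random_state=42)
svd.fit(X)
# cumulative variance explained by the first 1, 2, ..., 100 components
cumulative = np.cumsum(svd.explained_variance_ratio_)

# smallest number of components reaching, e.g., 50% of the variance
n_needed = int(np.searchsorted(cumulative, 0.50) + 1)
print(n_needed)
```

On the real X_tfidf one would fit once with a generous n_components and then keep the smallest count that reaches an acceptable threshold.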

Clustering and wordclouds

Now that we have an initial vector representation, we can use it to cluster our documents. We will focus on the K-means algorithm, with K starting from 2 and increasing. Next, we can choose the K value that best helps us spot topics in the final clusters. In doing so, we are acting as a domain expert who analyzes and evaluates the final subdivision. Remember that, in our context, one clear topic per cluster is the desired outcome.

One simple approach to describing the clusters exploits wordcloud images. These "clouds" display the most important words in a collection of texts according to a given metric: the more significant a word is, the bigger it appears in the picture. Fortunately, we already have a "significance" metric: the TFIDF weight of each word. Let's exploit its very meaning and choose it as the "importance" metric for the wordcloud.

In [7]:
from sklearn.cluster import KMeans
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline

Let's define a function that clusters our data with K-means and generates one wordcloud per cluster. As you can see in the following code, we will use only the top_count words with the highest TFIDF value.

In [8]:
def generate_wordclouds(X, X_tfidf, k, word_positions):
    """Cluster X with K-means with the specified k, and generate one wordcloud per cluster.
    :input X: numpy.array or numpy.matrix to cluster
    :input X_tfidf: sparse matrix with TFIDF values
    :input k: the k to be used in K-means
    :input word_positions: dictionary with pairs as word index (in vocabulary) -> word
    :return cluster_ids: set with the cluster ids
    """
    model = KMeans(n_clusters=k, random_state=42, n_jobs=-1)
    y_pred = model.fit_predict(X)
    cluster_ids = set(y_pred)
    top_count = 100

    for cluster_id in cluster_ids:
        # compute the total tfidf for each term in the cluster
        tfidf = X_tfidf[y_pred == cluster_id]
        tfidf_sum = np.sum(tfidf, axis=0) # numpy.matrix of shape (1, X.shape[1])
        tfidf_sum = np.asarray(tfidf_sum).reshape(-1) # numpy.array of shape (X.shape[1],)
        top_indices = tfidf_sum.argsort()[-top_count:]
        term_weights = {word_positions[idx]: tfidf_sum[idx] for idx in top_indices}
        wc = WordCloud(width=1200, height=800, background_color="white")
        wordcloud = wc.generate_from_frequencies(term_weights)

        fig, ax = plt.subplots(figsize=(10, 6), dpi=100)
        ax.imshow(wordcloud, interpolation='bilinear')
        fig.suptitle(f"Cluster {cluster_id}")
    return cluster_ids

K = 2

In [9]:
word_positions = {v: k for k, v in vectorizer.vocabulary_.items()}
cluster_ids = generate_wordclouds(X_svd, X_tfidf, 2, word_positions)

It can easily be seen that Cluster 0 is characterized by words related to baseball. We can assume that there is at least one group of news articles talking about baseball/sport. Cluster 1, on the contrary, looks far less promising: many of the heaviest words carry little meaning (e.g. article, writes, thing, like, think). Results should improve as we increase K.
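Besides visual inspection of the wordclouds, we could complement the domain-expert evaluation with a quantitative criterion such as the silhouette score, keeping the K that maximizes it. A sketch on synthetic blobs standing in for X_svd (make_blobs and the K range here are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, well-separated data standing in for X_svd
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

The silhouette score lies in [-1, 1], with higher values indicating tighter, better-separated clusters; it is a complement to, not a replacement for, topic inspection.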

K = 3

In [10]:
cluster_ids = generate_wordclouds(X_svd, X_tfidf, 3, word_positions)