Tutorial On How To Implement Document Clustering In Python With K-means

by Neri Van Otten | Jan 16, 2023 | Data Science, Machine Learning, Natural Language Processing

Introduction to document clustering and its importance

Document clustering, also known as text clustering, groups similar documents together based on their content. This unsupervised machine learning method is used to analyse and organise large collections of text data.

Document clustering is important because it can uncover patterns and structure in unorganised text data. This is useful for tasks such as text mining, natural language processing, and information retrieval.

Documents can be clustered by how similar they are.

Some examples of how document clustering can be used include:

  • Grouping news articles by topic to improve search results and recommendation systems.
  • Organising customer feedback by theme to identify critical issues and improve customer service.
  • Clustering research papers by topic to assist with literature reviews and research organisation.

Document clustering can help make sense of large and complex text data, making it more manageable and actionable.

Overview of standard document clustering techniques in Python

There are several standard techniques used for document clustering, including:

  1. K-means: This well-known clustering algorithm divides a dataset into k clusters, where k is a parameter that the user specifies. K-means tends to produce roughly spherical clusters, each represented by the mean (centroid) of its points.
  2. Hierarchical Clustering: With this approach, the nested grouping of the data is represented by a tree-like structure called a dendrogram. Merging or dividing clusters can be used to build a hierarchy of clusters. Hierarchical clustering is divided into two categories:
    • Agglomerative: This bottom-up method starts by treating each data point as its own cluster, then iteratively merges the closest clusters until a stopping criterion is satisfied.
    • Divisive: This top-down method starts with all data points in one big cluster and recursively splits it until a stopping criterion is met.
  3. Expectation-Maximization (EM): This technique assumes that the data points come from a mixture of probability distributions and iteratively estimates the parameters of that mixture. Each document can then be assigned to the component most likely to have generated it.
  4. Latent Dirichlet Allocation (LDA): Using this generative probabilistic model, text data is clustered according to the topics that produce the data. LDA is a well-liked topic modelling method that has also been applied to document clustering.
  5. DBSCAN: Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a density-based algorithm that groups together data points that are closely packed (points with many nearby neighbours) and marks points that lie alone in low-density regions as outliers (noise points).

Each of these algorithms has benefits and drawbacks of its own, and the technique to be used will depend on the specifics of the dataset and the desired result.

Selection of a document clustering technique for your dataset

The best method for a particular dataset will depend on several factors, such as the dataset’s size and complexity, the desired result, and the available computing resources.

  • K-means is a good choice if the number of clusters is known or can be estimated. It is relatively quick and simple to implement and scales well from small to large datasets.
  • Hierarchical clustering is a good choice if the dataset is small to medium-sized and the number of clusters is unknown. The number of clusters need not be predetermined, but common agglomerative implementations compare all pairs of points, so time and memory costs grow quickly on large datasets.
  • Expectation-Maximization (EM) or Latent Dirichlet Allocation (LDA) may be appropriate if the dataset has many features and the objective is to find latent structure in the data. With these methods, you can find patterns in the data that are not obvious at first glance.
  • DBSCAN is a good choice if the dataset contains dense regions of closely packed points separated by large regions of low point density.
  • If you have little to no prior knowledge about the nature of the data and want to explore the dataset and discover patterns, it’s a good idea to start with a combination of techniques, such as hierarchical clustering and k-means, and then move on to more specialised techniques, like LDA or EM.

It’s important to keep in mind that to choose the right method, you need to think about both the problem and the data. You should try out different methods, evaluate how well they work using the available evaluation metrics, and then choose the method that best solves the problem.
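As a rough illustration of trying out different methods on the same data, the sketch below fits three scikit-learn algorithms to one TF-IDF matrix and reports a silhouette score for each. The toy corpus, the parameter values (such as eps=0.9) and the variable names are assumptions made for this example, not recommendations.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy corpus: three sentences about sport, three about cooking
docs = [
    "the team won the football match",
    "the football team lost the match",
    "the coach praised the football team",
    "this pasta recipe needs fresh tomato sauce",
    "fresh tomato sauce makes a good pasta recipe",
    "add basil to the tomato sauce recipe",
]

# AgglomerativeClustering needs a dense array, so convert once for all models
X_dense = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()

# Illustrative parameter choices; real data needs its own tuning
candidates = {
    "k-means": KMeans(n_clusters=2, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=2),
    "dbscan": DBSCAN(eps=0.9, min_samples=2, metric="cosine"),
}

results = {}
for name, model in candidates.items():
    labels = model.fit_predict(X_dense)
    n_labels = len(set(labels))
    # The silhouette score is only defined for 2 to n_samples - 1 distinct labels
    score = silhouette_score(X_dense, labels) if 2 <= n_labels < len(docs) else None
    results[name] = (labels, score)
    print(name, labels, score)
```

DBSCAN labels noise points as -1, so its label array may contain a value that the other two never produce.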

Preparation of the dataset for clustering

Before using any document clustering method, the text must be cleaned, tokenized, vectorized, and other steps taken to prepare the dataset.

  1. Text cleaning: This step eliminates extraneous or unnecessary text, including punctuation, stop words, special characters, and numbers. By eliminating data noise, this step enhances the clustering algorithm’s accuracy.
  2. Tokenization: In this step, the text is broken up into tokens (single words or phrases) so that they can be counted and compared.
  3. Stemming/Lemmatization: Words are reduced to a basic form, either by stripping inflectional endings with rules (stemming) or by looking up the dictionary form with a lexicon-based method (lemmatization). In this step, similar words are grouped to help reduce the dataset’s dimensionality.
  4. Vectorization: In this step, the text is transformed into a numerical representation that the clustering algorithm can use as input. Bag-of-words, term frequency-inverse document frequency (TF-IDF), and word embeddings are a few techniques for vectorizing text. Each of these methods has its benefits and drawbacks, and the best one will depend on the specific characteristics of the dataset and the desired result.
  5. Dimensionality reduction: By removing correlated features or projecting the data onto a lower-dimensional space, this step reduces the number of features. When the dataset has a lot of features and the clustering algorithm needs to be sped up, this step is especially helpful.

The dataset is ready to be used as input for the clustering algorithm once it has been prepared. Keep in mind that the quality of the pre-processing steps determines the quality of the clustering results, so it is worth taking the time to prepare the dataset properly.
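The preparation steps can be sketched with the standard library and scikit-learn alone. In this minimal example, clean_text is a hypothetical helper, the four sentences are made up, and stemming/lemmatization is left out because it would need an extra library such as NLTK or spaCy.

```python
import re

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def clean_text(text):
    """Lowercase the text and replace anything that is not a letter or space."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)      # drop punctuation, digits, symbols
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace


# Hypothetical raw documents
docs = [
    "I LOVE playing football on the weekends!!!",
    "Hiking & camping in the mountains, 3 times a year.",
    "I like to read books and watch movies.",
    "Reading 20 books a year is my goal.",
]

# 1. Text cleaning
cleaned = [clean_text(d) for d in docs]

# 2 and 4. Tokenization, stop-word removal and TF-IDF vectorization
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(cleaned)

# 5. Dimensionality reduction (TruncatedSVD accepts sparse input directly)
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)

print(cleaned[0])
print(X.shape, X_reduced.shape)
```

Two components are used here only to keep the output small; real datasets usually keep far more.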

Implementation of the chosen document clustering technique using Python

Here is a list of Python libraries that can be used for document clustering.

  • scikit-learn: This is a popular machine learning library that provides a wide range of clustering algorithms, including k-means, hierarchical clustering, and EM. It also provides tools for pre-processing and evaluating the clustering results.
  • NLTK: This is a natural language processing library that provides tools for text cleaning, tokenization, stemming, and lemmatization.
  • Gensim: This is a library for topic modelling and document indexing that provides tools for vectorization, dimensionality reduction, and LDA.
  • spaCy: This industrial-strength natural language processing library provides tools for tokenization, lemmatization, and other linguistic annotations.

Clustering text documents using k-means

Here is an example of how to use the k-means algorithm for document clustering in Python using the scikit-learn library and a sample dataset:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample dataset
dataset = ["I love playing football on the weekends",
           "I enjoy hiking and camping in the mountains",
           "I like to read books and watch movies",
           "I prefer playing video games over sports",
           "I love listening to music and going to concerts"]

# Vectorize the dataset
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(dataset)

# Define the number of clusters
k = 2

# Create a k-means model and fit it to the data
# (n_init and random_state are set explicitly so that runs are repeatable)
km = KMeans(n_clusters=k, n_init=10, random_state=42)
km.fit(X)

# Predict the clusters for each document
y_pred = km.predict(X)

# Print the cluster assignments
print(y_pred)
# Example output: [1 1 0 1 0]
# (cluster numbers are arbitrary, and the grouping can differ with other seeds or library versions)

In this example, we first use the TfidfVectorizer to vectorize the dataset. This converts the text into a numerical representation that can be used as input for the k-means algorithm.

Then we specify the number of clusters to be used (in this case, 2) and build a KMeans model. The model is then fitted to the data with the fit method, and each document is assigned to a cluster using the predict method.

In this case, the documents are put into two clusters. In the output shown, the football, hiking and video game sentences share one cluster, while the sentences about reading and movies and about music and concerts share the other. With only five short sentences, the grouping is driven by a handful of shared words, so it should not be read as a clean split between outdoor and indoor activities.

This is a simple example, but the same process can be applied to larger datasets and more complex text data. The number of clusters can also be adjusted to fit the desired outcome.
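One common way to adjust the number of clusters is to fit k-means for several values of k and compare the average silhouette score. The sketch below assumes a made-up nine-sentence corpus and an arbitrary range of k from 2 to 5; the highest score is a hint, not a guarantee of the right k.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical corpus with three loose themes (sport, cooking, music)
docs = [
    "the team won the football match",
    "the football team lost the match",
    "the coach praised the football team",
    "this pasta recipe needs fresh tomato sauce",
    "fresh tomato sauce makes a good pasta recipe",
    "add basil to the tomato sauce recipe",
    "the band played a loud rock concert",
    "the rock band released a new album",
    "fans loved the album and the concert",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()

# Fit one model per candidate k and record its average silhouette score
scores = {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = km.fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(k, round(score, 3))
print("k with the highest silhouette score:", best_k)
```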

Evaluation of the clustering results

A crucial step in ensuring the algorithm operates as intended is assessing the results. The quality of clustering results can be assessed using several metrics, such as:

  1. Silhouette Score: This metric gauges how similar each point is to its own cluster compared with other clusters. It ranges from -1 to 1: a score near 1 means that the point is well matched to its own cluster, and a score near -1 means that it is better matched to another cluster. A high average silhouette score indicates clearly defined clusters.
  2. Adjusted Rand Index (ARI): This metric assesses how closely the predicted labels match the actual labels, corrected for chance. A score of 1 represents a perfect match, scores near 0 are what random labelling would give, and negative scores indicate agreement worse than chance.
  3. Normalized Mutual Information (NMI): This metric measures how much information the predicted labels share with the actual labels, scaled from 0 (no shared information) to 1 (perfect agreement).
  4. Fowlkes-Mallows index (FMI): This metric is the geometric mean of the pairwise precision and recall between the actual labels and the predicted ones.
  5. V-measure: This metric is the harmonic mean of homogeneity (each cluster contains only members of one class) and completeness (all members of a class fall in the same cluster).
  6. Davies-Bouldin index (DBI): This metric measures the average similarity between each cluster and its most similar cluster. Lower values indicate better separation.
  7. Calinski-Harabasz index (CHI): This metric measures the ratio of the dispersion between clusters to the dispersion within clusters. Higher values indicate denser, better-separated clusters.

It’s important to remember that no single metric can provide a comprehensive evaluation of clustering results, so using several metrics is advised to obtain a more thorough evaluation. Different data types and clustering algorithms may also be better served by different metrics.
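All seven metrics are available in sklearn.metrics. The sketch below computes them for a k-means result on a made-up corpus; true_labels is a hypothetical set of hand-assigned labels, which the ARI, NMI, FMI and V-measure need, while the silhouette, Davies-Bouldin and Calinski-Harabasz scores only need the features and the predicted labels.

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus and hand-assigned ground-truth labels (0 = sport, 1 = cooking)
docs = [
    "the team won the football match",
    "the football team lost the match",
    "the coach praised the football team",
    "this pasta recipe needs fresh tomato sauce",
    "fresh tomato sauce makes a good pasta recipe",
    "add basil to the tomato sauce recipe",
]
true_labels = [0, 0, 0, 1, 1, 1]

X = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Internal metrics: use only the features and the predicted labels
internal = {
    "silhouette": metrics.silhouette_score(X, pred),
    "davies_bouldin": metrics.davies_bouldin_score(X, pred),
    "calinski_harabasz": metrics.calinski_harabasz_score(X, pred),
}

# External metrics: compare predicted labels with known labels
external = {
    "ari": metrics.adjusted_rand_score(true_labels, pred),
    "nmi": metrics.normalized_mutual_info_score(true_labels, pred),
    "fmi": metrics.fowlkes_mallows_score(true_labels, pred),
    "v_measure": metrics.v_measure_score(true_labels, pred),
}

for name, value in {**internal, **external}.items():
    print(f"{name}: {value:.3f}")
```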

Visualization of the document clustering using Python libraries

Viewing the clusters can aid in understanding the dataset’s structure and data grouping. To visualise the document clustering, a variety of Python libraries are available, including:

  • Matplotlib: This is a plotting library that provides a wide range of plotting options, including scatter plots, line plots, and bar plots.
  • Seaborn: This is a library that is built on top of Matplotlib and provides advanced visualisation options such as heatmaps, pair plots, and violin plots.
  • Plotly: This is an open-source library that provides interactive and web-based visualisation options.
  • Bokeh: This is another interactive visualisation library that is well suited for large datasets and streaming data.
  • Yellowbrick: This visualisation library is built on top of Matplotlib and specifically designed for machine learning tasks.

Here is an example of how to visualise the clusters using Matplotlib and Seaborn:

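import matplotlib.pyplot as plt
import seaborn as sns

# Convert the sparse TF-IDF matrix to a dense array
X_dense = X.toarray()

# Create a scatter plot of the data colored by the predicted clusters
sns.scatterplot(x=X_dense[:, 0], y=X_dense[:, 4], hue=y_pred, palette='rainbow')
plt.show()

Here, we create a scatter plot of the first and fifth dimensions of the data and colour the points according to the predicted cluster assignments. Each dimension of a TF-IDF matrix is the weight of a single term, so two columns show only a small slice of the data.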

The plot shows how the data are spread out within each cluster and makes it easier to see patterns in the data.

Similarly, other visualisation libraries can be used to plot different plots like heatmaps, pair plots, etc.

The example above should be considered a starting point, and you may need to adjust the code depending on the specific requirements of your project.
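One possible adjustment is to project the whole TF-IDF matrix onto two components before plotting, so that every term contributes to the picture. The sketch below uses TruncatedSVD for the projection and plain Matplotlib for the plot; the corpus, the Agg backend and the output file name clusters.png are assumptions for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
docs = [
    "the team won the football match",
    "the football team lost the match",
    "the coach praised the football team",
    "this pasta recipe needs fresh tomato sauce",
    "fresh tomato sauce makes a good pasta recipe",
    "add basil to the tomato sauce recipe",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Project the sparse TF-IDF matrix onto two components for plotting
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

fig, ax = plt.subplots()
ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="rainbow")
for i, (x, y) in enumerate(coords):
    ax.annotate(str(i), (x, y))  # label each point with its document index
ax.set_xlabel("SVD component 1")
ax.set_ylabel("SVD component 2")
ax.set_title("Documents coloured by k-means cluster")
fig.savefig("clusters.png")  # hypothetical output file name
```

In an interactive session, the Agg line can be dropped and plt.show() used instead of saving to a file.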


Document clustering is a strong technique that can effectively analyse and organise large amounts of text data. Document clustering is a multi-step process that includes pre-processing the dataset, choosing an appropriate clustering method, using the method, analysing the outcomes, and visualising the clusters.

Several Python libraries can be used to put document clustering strategies into action, and there are several ways to measure the quality of the clustering results. In addition, visualisation libraries can be used to comprehend the dataset’s structure and grouping.

Potential next steps

Potential next steps for further analysis include the following:

  1. Fine-tuning the pre-processing steps: Depending on the dataset, the text cleaning, tokenization, and vectorization steps may need to be changed to improve the quality of the clustering results.
  2. Using different clustering methods: If the results of one method aren’t good enough, it might be helpful to try other methods and compare how well they work.
  3. Applying supervised learning: After clustering the data, it’s possible to use the clusters as features to train a supervised model on the same dataset for classification or regression tasks.
  4. Using domain-specific knowledge: If domain-specific knowledge can be used to guide the clustering process, it can be used to improve the quality of the results.
  5. Interpreting the clusters: It’s also important to interpret the clusters and understand the characteristics of the grouped documents. This can help figure out what patterns are going on in the data and guide further analysis.

Overall, document clustering in Python is an effective method for making sense of large and complex text data, making it more manageable and valuable. It’s important to remember that the quality of the pre-processing steps determines the quality of the clustering results, so it’s essential to take the time to prepare the dataset properly.

About the Author

Neri Van Otten

Neri Van Otten is the founder of Spot Intelligence, a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation. Dedicated to making your projects succeed.
