How To Get Started With Topic Modelling In Python: ML And Deep Learning Algorithms

by | Dec 15, 2022 | Data Science, Machine Learning, Natural Language Processing

What is topic modelling?

Topic modelling is a technique used in natural language processing (NLP) to automatically identify and group similar words or phrases in a text. This lets us figure out the central ideas or themes in a group of documents. The main benefit is that this is possible even when there are a lot of different documents.

Topic modelling is one of our top 10 natural language processing techniques and is rather similar to keyword extraction, so definitely check out these articles to ensure you are using the right tools for the right problem.

Topic modelling can be helpful in various applications. Some common examples are automatically organizing a large corpus of documents, understanding customer feedback, or identifying common themes in social media posts.

Topic modelling can automate the classification of a large volume of documents.

What is topic modelling used for?

Topic modelling can be used in various situations where it is helpful to identify the main topics discussed in a text. Here are some potential use cases for topic modelling:

  • Analyzing customer feedback to identify common themes and concerns
  • Summarizing a large corpus of text by identifying the main topics discussed
  • Organizing a collection of documents into categories based on their content
  • Identifying trends and changes in the topics discussed in a collection of documents over time
  • Improving the accuracy of information retrieval systems by using topic modelling to improve the representation of documents in the system’s index.

These are just a few examples of the many potential use cases for topic modelling. It can be a powerful tool for making sense of extensive text collections and extracting valuable insights from them.

Is topic modelling supervised or unsupervised learning?

Topic modelling is a type of unsupervised machine learning that is used to discover the abstract topics that occur in a collection of documents. In topic modelling, a computer program analyses a set of documents and identifies the underlying themes or topics in the text. The program does this without being explicitly told what the topics are. It works without any supervision or guidance from a human. Instead, it relies on statistical techniques to identify patterns in the text that indicate the presence of specific topics.

Topic modelling can uncover hidden structures in extensive collections of documents. It is often used in text mining and natural language processing applications. It is a valuable tool for exploring and understanding large amounts of unstructured text data. Additionally, it can identify trends and patterns that may not immediately appear to a human reader.

Machine learning algorithms for topic modelling

1. Latent Dirichlet Allocation (LDA)

One of the most popular topic-modelling algorithms is Latent Dirichlet Allocation (LDA). This algorithm uses a probabilistic approach to identify the underlying topics in a collection of documents. LDA assumes that each document is a mixture of topics and that each topic is a probability distribution over words. Given only the observed words, the algorithm infers the topic mixture of every document and the words most strongly associated with each topic.

One of the benefits of LDA is that it can handle large amounts of text data. This makes it well-suited for applications such as analyzing customer feedback or social media posts. Additionally, LDA can identify topics that may not be explicitly mentioned in the text. This can help uncover hidden patterns or trends.
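The mixture assumption can be made concrete with a small simulation. The sketch below is not an LDA implementation; it only generates one toy document in the way LDA assumes documents arise. The vocabulary, the two topics, their word probabilities, and the `generate_document` helper are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary and two hand-written topics (each row sums to 1)
vocab = ["goal", "match", "team", "vote", "party", "election"]
topic_word = np.array([
    [0.40, 0.30, 0.30, 0.00, 0.00, 0.00],  # a "sport" topic
    [0.00, 0.00, 0.00, 0.35, 0.30, 0.35],  # a "politics" topic
])

def generate_document(n_words, alpha=(0.5, 0.5)):
    # Each document gets its own mixture of topics, drawn from a Dirichlet prior
    doc_topics = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        # Pick a topic for this word position, then a word from that topic
        topic = rng.choice(len(doc_topics), p=doc_topics)
        words.append(str(rng.choice(vocab, p=topic_word[topic])))
    return doc_topics, words

doc_topics, words = generate_document(8)
print(doc_topics.round(2), words)
```

Fitting LDA is the reverse problem: only the words are observed, and the algorithm estimates the topic mixtures and the topic-word distributions that most plausibly produced them.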

2. Non-Negative Matrix Factorization (NMF)

Another popular topic modelling algorithm is non-negative matrix factorization (NMF). NMF uses a linear algebra approach to identify the underlying topics in a collection of documents. Unlike LDA, NMF is not a probabilistic model: it gives each document a non-negative weight for every topic, and these weights do not have to sum to one. Because the weights are often sparse, it is common to label each document with its single highest-weighted topic, which can be helpful for specific applications.

NMF works by approximating a large word-document matrix (word counts or TF-IDF weights) as the product of two smaller non-negative matrices: one that describes each topic in terms of words and one that describes each document in terms of topics. Because every entry is non-negative, the topics can be read directly as lists of weighted words, which makes them easy to interpret.

For example, let’s say you have a corpus of 100,000 news articles and want to find the topics that are most commonly discussed in these articles. You could use NMF to decompose the word-document matrix into a topic-word matrix and a document-topic matrix. The highest-weighted words in each topic would then illustrate the most common themes discussed in the news articles, and you could use the document-topic weights to categorize and organize the articles.

3. Latent Semantic Analysis (LSA)

Latent Semantic Analysis (LSA) is a dimensionality reduction technique based on singular value decomposition (SVD). Its purpose is to extract the underlying structure of a corpus of documents by representing the documents and words in a low-dimensional space.

In LSA, the first step is to construct a term-document matrix, which represents the frequency of each word in each document. This matrix is then decomposed using SVD, which produces a set of orthogonal latent vectors that capture the relationships between the terms and documents in the corpus. These latent vectors can then be used to identify the underlying topics in the corpus.

One advantage of LSA is that it is computationally efficient, which makes it well-suited for large datasets. Additionally, LSA can partly handle synonyms, because words that appear in similar contexts end up close together in the latent space. It is less robust to polysemy (words with multiple meanings), since each word is represented by a single vector. LSA has also been criticized for producing less interpretable topics than other algorithms, partly because its latent vectors can contain negative weights.

Deep learning for topic modelling

While deep learning is commonly used for various natural language processing tasks, it is not typically used for topic modelling. Instead, deep learning is often used to improve the performance of other techniques, such as Latent Dirichlet Allocation (LDA) or Non-negative Matrix Factorization (NMF), by providing them with better word embeddings or by incorporating additional context information.

For example, one way deep learning can be used in topic modelling is to train a word embedding model on a large text corpus. This model can then be used to initialize the word vectors in an LDA or NMF model, which can improve the performance of the topic modelling algorithm. Another way deep learning can be used is to incorporate additional context information, such as the overall structure of the documents in the corpus or the relationships between words, into the topic modelling algorithm. This can help the algorithm better capture the underlying structure of the corpus and produce more accurate and interpretable topics.

Overall, while deep learning is not typically used as a standalone technique for topic modelling, it can help improve other algorithms’ performance and provide additional context information to help the algorithm better capture the underlying structure in the data.

How to do topic modelling in Python

1. LDA Scikit-Learn

Here is a simple example of how Latent Dirichlet Allocation (LDA) can be implemented in Python using the Scikit-Learn library:

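from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# example documents (placeholders; replace with your own list of strings)
documents = [
    "the team won the football match",
    "the party won the election",
]

# convert the documents to a matrix of word counts, which LDA expects
data = CountVectorizer(stop_words="english").fit_transform(documents)

# define the number of topics
n_topics = 5

# create a Latent Dirichlet Allocation model
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)

# fit the model to the data
lda.fit(data)

# transform the data using the fitted model
transformed = lda.transform(data)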

This code uses the LatentDirichletAllocation class from the scikit-learn library to implement LDA. The n_components parameter is then used to specify the number of topics to be learned by the model. The fit method is used to fit the model to the input data, and the transform method is used to generate the topic distribution for each document.

Remember that this is just a simple example, and many different ways to implement LDA in Python exist. The right implementation depends on the specifics of your problem.


2. LDA Gensim

In Gensim, LDA can be implemented using the LdaModel class in the gensim.models.ldamodel module. Here is an example of how you might use this class to train an LDA model on a corpus of text documents:

from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel

# Create a dictionary representing the corpus
dictionary = Dictionary(corpus)

# Create a bag-of-words representation of the corpus
corpus_bow = [dictionary.doc2bow(doc) for doc in corpus]

# Train the LDA model on the corpus
lda_model = LdaModel(corpus_bow, num_topics=10, id2word=dictionary)

Here, corpus is a list of documents, where each document is a list of words. The LdaModel class takes the bag-of-words representation of the corpus as input, along with the number of topics to be learned and the dictionary mapping words to unique ids. This will train the LDA model on the corpus and allow you to use the model to infer the topics of new documents or to retrieve the most likely topics for a given document.

3. BERT topic modelling

BERT is a state-of-the-art natural language processing (NLP) model developed by Google that can be used for various tasks, including topic modelling. However, it is not a specific topic modelling algorithm, so no “BERT topic modelling code” exists.

To use BERT, combine it with another algorithm. BERT produces dense vectors with both positive and negative values, so its output cannot be passed directly to Latent Dirichlet Allocation (LDA), which expects non-negative word counts. A common alternative is to use the pre-trained BERT model to extract a feature vector for each document and then cluster these vectors, treating each cluster as a topic.

Here is an example of how you might use BERT for topic modelling in Python. It uses k-means clustering in place of LDA for the reason given above; the MAX_LEN value and the number of clusters are arbitrary choices:

# Import the necessary libraries
import torch
from sklearn.cluster import KMeans
from transformers import BertModel, BertTokenizer

MAX_LEN = 128  # assumed maximum number of tokens per document

# Load the pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')
bert_model.eval()

# Define a function to extract one feature vector per document using BERT
def bert_features(data):
    # Tokenize the texts, padding and truncating them to a common length
    inputs = tokenizer(data, padding=True, truncation=True,
                       max_length=MAX_LEN, return_tensors='pt')

    # Use BERT to extract features from the input text
    with torch.no_grad():
        outputs = bert_model(input_ids=inputs['input_ids'],
                             attention_mask=inputs['attention_mask'])
        token_vectors = outputs[0]

    # Average the token vectors of each document, ignoring padding tokens
    mask = inputs['attention_mask'].unsqueeze(-1).float()
    features = (token_vectors * mask).sum(dim=1) / mask.sum(dim=1)

    return features.numpy()

# Load your text data (a list of strings, at least as many as n_clusters)
data = ...

# Extract features using BERT
features = bert_features(data)

# Group the documents into topics by clustering their BERT features
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
labels = kmeans.fit_predict(features)

# Print the topic (cluster) assigned to each document
for text, label in zip(data, labels):
    print(label, text[:80])

This code uses the transformers library to load the pre-trained BERT tokenizer and model and then defines a function bert_features() to extract one feature vector per document using BERT. The sklearn library is then used to cluster the extracted features with k-means, and each cluster is treated as a topic present in the text.

Topic modelling at Spot Intelligence

At Spot Intelligence, we often use topic modelling in the exploratory stages of analysis. It allows us to quickly deep dive into the documents at hand and visually see what the documents are about without reading or browsing through them.

Once we have identified topics we are interested in, we can use the results from the topic modelling to classify the documents and label them accordingly. This allows information to be found faster and further split into specific topics for analysis. This way, we can often segment the data into more manageable chunks that can then be summarised or aggregated together to get a more holistic view of the data set.

Combining topic modelling with a timeline is often a valuable analysis, as topics change over time. This is especially useful when analysing social media data and doing trend analysis.
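A timeline view can be sketched with pandas once every document has a date and a dominant topic. The data frame below is entirely hypothetical and stands in for the output of a topic model; the column names and topic labels are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical output of a topic model: one dominant topic per dated document
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2022-01-05", "2022-01-20", "2022-02-03", "2022-02-15", "2022-02-28",
    ]),
    "topic": ["sport", "politics", "politics", "politics", "sport"],
})

# Count documents per topic per month
topic_counts = (
    df.groupby([df["date"].dt.to_period("M"), "topic"])
      .size()
      .unstack(fill_value=0)
)

# Convert counts to each topic's share of that month's documents
topic_share = topic_counts.div(topic_counts.sum(axis=1), axis=0)
print(topic_share)
```

Plotting `topic_share` as a line or stacked area chart is one simple way to see which topics are growing or fading.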

What are your favourite use cases of topic modelling? Let us know in the comments.

About the Author

Neri Van Otten

Neri Van Otten is the founder of Spot Intelligence, a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation. Dedicated to making your projects succeed.
