What is topic modelling?
Topic modelling is a technique used in natural language processing (NLP) to automatically identify and group similar words or phrases in a text. This lets us figure out the central ideas or themes in a group of documents. The main benefit is that this is possible even when there are a lot of different documents.
Topic modelling is one of our top 10 natural language processing techniques and is rather similar to keyword extraction, so definitely check out these articles to ensure you are using the right tools for the right problem.
Topic modelling can be helpful in various applications. Some common examples are automatically organizing a large corpus of documents, understanding customer feedback, or identifying common themes in social media posts.
Topic modelling can automate the classification of a large volume of documents.
What is topic modelling used for?
Topic modelling can be used in various situations where it is helpful to identify the main topics discussed in a text. Here are some potential use cases for topic modelling:
- Analyzing customer feedback to identify common themes and concerns
- Summarizing a large corpus of text by identifying the main topics discussed
- Organizing a collection of documents into categories based on their content
- Identifying trends and changes in the topics discussed in a collection of documents over time
- Improving the accuracy of information retrieval systems by giving the system’s index a better representation of each document
These are just a few examples of the many potential use cases for topic modelling. It can be a powerful tool for making sense of extensive text collections and extracting valuable insights from them.
Is topic modelling supervised or unsupervised learning?
Topic modelling is a type of unsupervised machine learning that is used to discover the abstract topics that occur in a collection of documents. In topic modelling, a computer program analyses a set of documents and identifies the underlying themes or topics in the text. The program does this without being explicitly told what the topics are. It works without any supervision or guidance from a human. Instead, it relies on statistical techniques to identify patterns in the text that indicate the presence of specific topics.
Topic modelling can uncover hidden structures in extensive collections of documents. It is often used in text mining and natural language processing applications. It is a valuable tool for exploring and understanding large amounts of unstructured text data. Additionally, it can identify trends and patterns that may not be immediately apparent to a human reader.
Machine learning algorithms for topic modelling
Latent Dirichlet Allocation (LDA)
One of the most popular topic-modelling algorithms is Latent Dirichlet Allocation (LDA). This algorithm uses a probabilistic approach to identify the underlying topics in a collection of documents. LDA assumes that each document is a mixture of topics and that each topic is a probability distribution over words. Given only the words observed in the documents, the algorithm works backwards from this assumption to infer each document’s topics and the terms associated with each topic.
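The mixture assumption can be made concrete by sampling a document from it. The sketch below is only an illustration of LDA's generative story, not of how a model is fitted: the vocabulary, the two topic-word distributions and the alpha value are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary and topics, invented for this illustration.
vocab = ["goal", "match", "team", "vote", "party", "election"]
n_topics = 2

# Each row is one topic: a probability distribution over the vocabulary.
topic_word = np.array([
    [0.40, 0.30, 0.30, 0.00, 0.00, 0.00],  # a "sport" topic
    [0.00, 0.00, 0.00, 0.35, 0.30, 0.35],  # a "politics" topic
])


def generate_document(n_words, alpha=0.5):
    """Sample one document following the LDA generative assumptions."""
    # 1. Draw the document's topic mixture from a Dirichlet prior.
    doc_topics = rng.dirichlet([alpha] * n_topics)
    words = []
    for _ in range(n_words):
        # 2. Choose a topic for this word according to the mixture.
        topic = rng.choice(n_topics, p=doc_topics)
        # 3. Choose a word according to that topic's word distribution.
        words.append(str(rng.choice(vocab, p=topic_word[topic])))
    return doc_topics, words


doc_topics, words = generate_document(n_words=8)
print(doc_topics.round(2), words)
```

Fitting LDA runs this story in reverse: it starts from the observed words and estimates the topic mixtures and topic-word distributions that most plausibly produced them.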
One of the benefits of LDA is that it can handle large amounts of text data. This makes it well-suited for applications such as analyzing customer feedback or social media posts. Additionally, LDA can identify topics that may not be explicitly mentioned in the text. This can help uncover hidden patterns or trends.
Non-Negative Matrix Factorization (NMF)
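Another popular topic modelling algorithm is non-negative matrix factorization (NMF). NMF uses a linear algebra approach to identify the underlying topics in a collection of documents. Unlike LDA, NMF is not a probabilistic model: it describes each document with non-negative topic weights that are not constrained to form a probability distribution. A document can still load on several topics, although in practice the weights are often sparse, which can be helpful for specific applications.
NMF works by decomposing a large document-term matrix (how often, or how strongly weighted, each word is in each document) into two smaller non-negative matrices: one that gives the weight of each topic in each document and one that gives the weight of each word in each topic. Because all the weights are non-negative, topics combine additively, which allows the algorithm to discover the underlying topics in a corpus of documents and extract them in an easily interpretable way.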
For example, let’s say you have a corpus of 100,000 news articles and want to find the topics that are most commonly discussed in these articles. You could then use NMF to decompose the document-term matrix into two matrices: one giving the topic weights of each article and the other giving the word weights of each topic. The highest-weighted words in each topic would then illustrate the most common themes discussed in the news articles, and you could use the topic weights of each article to categorize and organize the articles.
Latent Semantic Analysis (LSA)
Latent Semantic Analysis (LSA) is a dimensionality reduction technique based on singular value decomposition (SVD). Its purpose is to extract the underlying structure of a corpus of documents by representing the documents and words in a low-dimensional space.
In LSA, the first step is to construct a term-document matrix, which represents the frequency of each word in each document. This matrix is then decomposed using SVD, which produces a set of orthogonal latent vectors that capture the relationships between the terms and documents in the corpus. These latent vectors can then identify the underlying topics in the corpus.
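These two steps can be sketched with scikit-learn's TruncatedSVD. The toy documents and the choice of two latent dimensions are assumptions for the example. Note that scikit-learn builds a document-term matrix, which is the transpose of the term-document matrix described above.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus invented for this example.
docs = [
    "the team won the football match",
    "the striker scored a late goal in the match",
    "the party won the election vote",
    "voters backed the party in the election",
]

# Step 1: build the (TF-IDF weighted) document-term matrix.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Step 2: reduce it to a low-dimensional latent space with truncated SVD.
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(X)  # documents in the latent space
term_vectors = svd.components_      # latent dimensions expressed over terms

print(doc_vectors.shape, term_vectors.shape)
```

Unlike NMF, the entries of term_vectors can be negative, which is one reason LSA dimensions are harder to read as topics.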
One advantage of LSA is that it is computationally efficient, which makes it well-suited for large datasets. Additionally, by placing words that occur in similar contexts close together in the latent space, LSA can partly account for synonyms. It handles polysemy (words with multiple meanings) less well, because each word is represented by a single vector regardless of the sense in which it is used. LSA has also been criticized for producing less interpretable topics than other algorithms, partly because its latent vectors can contain negative weights.
Deep learning for topic modelling
While deep learning is commonly used for a wide range of natural language processing tasks, it is not typically used for topic modelling. Instead, deep learning is often used to improve the performance of other techniques, such as Latent Dirichlet Allocation (LDA) or Non-negative Matrix Factorization (NMF), by providing them with better word embeddings or by incorporating additional context information.
For example, one way deep learning can be used in topic modelling is to train a word embedding model on a large corpus of text. Standard LDA and NMF operate on word counts and have no word vectors to initialize, so in practice the embeddings are used by extended variants of these models that are designed to accept them, which can improve the performance of the topic modelling algorithm. Another way deep learning can be used is to incorporate additional context information, such as the overall structure of the documents in the corpus or the relationships between words, into the topic modelling algorithm. This can help the algorithm better capture the underlying structure of the corpus and produce more accurate and interpretable topics.
Overall, while deep learning is not typically used as a standalone technique for topic modelling, it can help improve other algorithms’ performance and provide additional context information that can help the algorithm better capture the underlying structure in the data.
How to do topic modelling in Python
Here is a simple example of how Latent Dirichlet Allocation (LDA) can be implemented in Python using the Scikit-Learn library:
    from sklearn.decomposition import LatentDirichletAllocation

    # `data` is assumed to be a document-term count matrix
    # (for example, the output of scikit-learn's CountVectorizer)

    # define the number of topics
    n_topics = 5

    # create a Latent Dirichlet Allocation model
    lda = LatentDirichletAllocation(n_components=n_topics)

    # fit the model to the data
    lda.fit(data)

    # transform the data using the fitted model
    transformed = lda.transform(data)
This code uses the LatentDirichletAllocation class from the scikit-learn library to implement LDA. The n_components parameter specifies the number of topics to be learned by the model. The fit method fits the model to the input data, and the transform method generates the topic distribution for each document.
Keep in mind that this is just a simple example, and there are many different ways to implement LDA in Python. The right implementation depends on the specifics of the problem at hand.
In Gensim, LDA can be implemented using the LdaModel class in the gensim.models.ldamodel module. Here is an example of how you might use this class to train an LDA model on a corpus of text documents:
    from gensim.corpora import Dictionary
    from gensim.models.ldamodel import LdaModel

    # `corpus` is assumed to be a list of tokenized documents
    # (each document is a list of words)

    # Create a dictionary representing the corpus
    dictionary = Dictionary(corpus)

    # Create a bag-of-words representation of the corpus
    corpus_bow = [dictionary.doc2bow(doc) for doc in corpus]

    # Train the LDA model on the corpus
    lda_model = LdaModel(corpus_bow, num_topics=10, id2word=dictionary)
Here, corpus is a list of documents, where each document is a list of words. The LdaModel class takes the bag-of-words representation of the corpus as input, along with the number of topics to be learned and the dictionary mapping words to unique ids. This will train the LDA model on the corpus and allow you to use the model to infer the topics of new documents or to retrieve the most likely topics for a given document.
BERT topic modelling
BERT is a state-of-the-art natural language processing (NLP) model developed by Google that can be used for a wide range of tasks, including topic modelling. However, it is not a specific topic modelling algorithm, so there is no “BERT topic modelling code” as such.
To use BERT for topic modelling, you must combine it with an algorithm that groups documents based on the features BERT produces. You can use the pre-trained BERT model to extract a dense feature vector (an embedding) for each document. Note that these vectors cannot be passed directly to scikit-learn's LatentDirichletAllocation, which expects non-negative word counts and raises an error on the negative values that BERT embeddings contain. A common alternative, used below in place of LDA, is to cluster the embeddings, for example with k-means, and treat each cluster as a topic.
Here is an example of how you might use BERT for topic modelling in Python:
    # Import the necessary libraries
    import torch
    import transformers
    from sklearn.cluster import KMeans

    MAX_LEN = 128  # maximum number of tokens per document (illustrative value)

    # Load the pre-trained BERT tokenizer and model
    tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')
    bert_model = transformers.BertModel.from_pretrained('bert-base-uncased')
    bert_model.eval()

    # Define a function to extract features from your text data using BERT
    def bert_features(data):
        # Tokenize the texts, padding and truncating so they form one batch
        inputs = tokenizer(list(data), padding=True, truncation=True,
                           max_length=MAX_LEN, return_tensors='pt')
        # Use BERT to extract features from the input text
        with torch.no_grad():
            outputs = bert_model(**inputs)
        # Average the token vectors of each document, ignoring padding
        mask = inputs['attention_mask'].unsqueeze(-1).float()
        features = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
        return features.numpy()

    # Load your text data (a list of strings)
    data = ...

    # Extract features using BERT
    features = bert_features(data)

    # Cluster the embeddings; each cluster is treated as a topic
    # (k-means is swapped in for LDA, which rejects negative feature values)
    kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
    topic_ids = kmeans.fit_predict(features)

    # Print the topic (cluster) assigned to each document
    print(topic_ids)
This code uses the transformers library to load the pre-trained BERT tokenizer and model, and then defines a function bert_features() to extract one feature vector per document by averaging BERT's token outputs. The sklearn library is then used to cluster the extracted features with k-means, so that each cluster stands for one of the topics present in the text.
Topic modelling at Spot Intelligence
At Spot Intelligence, we often use topic modelling in the exploratory stages of analysis. It allows us to quickly deep dive into the documents at hand and visually see what the documents are about without reading or browsing through them.
Once we have identified topics we are interested in, we can use the results from the topic modelling to classify the documents and label them accordingly. This allows information to be found faster and further split into specific topics for analysis. This way, we can often segment the data into more manageable chunks that can then be summarised or aggregated together to get a more holistic view of the data set.
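As a rough illustration of this labelling step, the sketch below assigns each document the name of its highest-weighted topic. The document-topic weights, the topic names and the 0.5 threshold are all hypothetical values, standing in for the output of whichever topic model was fitted.

```python
import numpy as np

# Hypothetical document-topic weights (rows: documents, columns: topics),
# such as the output of a fitted topic model's transform step.
doc_topics = np.array([
    [0.85, 0.10, 0.05],
    [0.40, 0.35, 0.25],
    [0.05, 0.15, 0.80],
])

# Hypothetical human-readable names given to the topics after inspection.
topic_names = ["billing", "delivery", "product quality"]


def label_documents(doc_topics, topic_names, min_weight=0.5):
    """Label each document with its dominant topic, if it is strong enough."""
    labels = []
    for weights in doc_topics:
        best = int(weights.argmax())
        if weights[best] >= min_weight:
            labels.append(topic_names[best])
        else:
            # No topic clearly dominates, so leave the document unlabelled.
            labels.append("unassigned")
    return labels


labels = label_documents(doc_topics, topic_names)
print(labels)  # ['billing', 'unassigned', 'product quality']
```

The threshold keeps documents that mix several topics out of any single category, which makes the resulting segments cleaner to summarise.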
Combining topic modelling with a timeline is always an excellent analysis, as topics change over time. This is especially useful when analysing social media data, and doing trend analysis.
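One simple way to put topics on a timeline is to average the document-topic weights per time period. The sketch below assumes each document comes with a month label. The months and weights are invented for the example.

```python
from collections import defaultdict

import numpy as np

# Hypothetical inputs: one month label and one row of topic weights per document.
months = ["2023-01", "2023-01", "2023-02", "2023-02"]
doc_topics = np.array([
    [0.9, 0.1],
    [0.7, 0.3],
    [0.2, 0.8],
    [0.4, 0.6],
])


def topic_share_by_period(periods, doc_topics):
    """Average the topic weights of all documents within each period."""
    grouped = defaultdict(list)
    for period, weights in zip(periods, doc_topics):
        grouped[period].append(weights)
    return {period: np.mean(rows, axis=0) for period, rows in sorted(grouped.items())}


trend = topic_share_by_period(months, doc_topics)
for period, shares in trend.items():
    print(period, shares.round(2))
```

Plotting these averages per period shows which topics are growing or fading.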
What are your favourite use cases of topic modelling? Let us know in the comments.