Viterbi Algorithm Made Simple [How To & Worked-Out Examples]

by Neri Van Otten | Jun 2, 2025 | Data Science, Natural Language Processing

Introduction

Imagine trying to understand what someone said over a noisy phone call or deciphering a DNA sequence from partial biological data. In both cases, you’re trying to uncover a hidden reality based on incomplete or noisy observations. This is precisely the kind of problem the Viterbi Algorithm was designed to solve.

Initially developed by Andrew Viterbi in 1967 for error correction in digital communication, the algorithm has since become a foundational tool in various fields, including speech recognition, natural language processing, bioinformatics, and wireless communications. At its core, the Viterbi Algorithm is a dynamic programming method for finding the most probable sequence of hidden states in a Hidden Markov Model (HMM), given a sequence of observed events.

What makes the Viterbi Algorithm so powerful is its ability to efficiently handle uncertainty in sequential data, where brute-force methods would quickly become impractical due to their exponential complexity.


In this blog post, we’ll break down the intuition behind the Viterbi Algorithm, explain how it works step by step, walk through a simple example, and explore some of its real-world applications.

Background Concepts of the Viterbi Algorithm

To understand how the Viterbi Algorithm works, it’s essential first to become familiar with the model it operates on: the Hidden Markov Model (HMM).

What Is a Hidden Markov Model (HMM)?

An HMM is a statistical model used to represent systems that are probabilistic and sequential—where we can observe outputs but not the internal states that produced them. The word “hidden” refers to the fact that the true state of the system is not directly visible, but it influences the observable outputs.


Real-Life Analogy

Imagine you’re trying to guess the weather (sunny, rainy, cloudy) based on what your friend wears every day. You can’t directly observe the weather (the hidden state), but you can observe their clothing choices (the visible output). Over time, you may learn the patterns: your friend wears sunglasses when it’s sunny, a raincoat when it’s raining, and a hoodie when it’s cloudy. This is precisely the kind of situation an HMM can model.

Key Components of an HMM

The following elements define an HMM:

  • States (S): A finite set of hidden states. For example: {Sunny, Rainy, Cloudy}. These are not directly observable.

  • Observations (O): A set of observed events or symbols. For example: {sunglasses, raincoat, hoodie}.

  • Transition Probabilities (A): The probability of transitioning from one state to another. For example, the chance it’s sunny today, given it was rainy yesterday.

  • Emission Probabilities (B): The probability of an observation being generated from a state. For example, the chance of wearing a hoodie on a cloudy day.

  • Initial Probabilities (π): The probability distribution over initial states—i.e., where the system starts.

Together, these components define a complete probabilistic model of the system. The challenge the Viterbi Algorithm solves is this:

Given a sequence of observations, what is the most likely sequence of hidden states that produced them?
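To make this concrete, here is one way the weather example's parameters could be written down in Python. The layout is just a sketch, and the probability values are invented purely for illustration:

# Weather HMM sketch: illustrative numbers, not values from the post.
states = ["Sunny", "Rainy", "Cloudy"]
observations = ["sunglasses", "raincoat", "hoodie"]

# Initial probabilities (pi): where the system starts
pi = {"Sunny": 0.5, "Rainy": 0.2, "Cloudy": 0.3}

# Transition probabilities (A): P(today's weather | yesterday's weather)
A = {
    "Sunny":  {"Sunny": 0.6, "Rainy": 0.1, "Cloudy": 0.3},
    "Rainy":  {"Sunny": 0.2, "Rainy": 0.5, "Cloudy": 0.3},
    "Cloudy": {"Sunny": 0.3, "Rainy": 0.3, "Cloudy": 0.4},
}

# Emission probabilities (B): P(clothing | weather)
B = {
    "Sunny":  {"sunglasses": 0.7, "raincoat": 0.1, "hoodie": 0.2},
    "Rainy":  {"sunglasses": 0.1, "raincoat": 0.7, "hoodie": 0.2},
    "Cloudy": {"sunglasses": 0.2, "raincoat": 0.2, "hoodie": 0.6},
}

print(A["Rainy"]["Sunny"])  # chance it's sunny today given it was rainy yesterday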

Before we dive into how the algorithm works, let’s define the problem it solves in more detail.

The Problem Statement

Now that we understand the components of a Hidden Markov Model (HMM), let’s define the specific problem the Viterbi Algorithm is designed to solve.

The Goal

Given:

  • A known Hidden Markov Model (i.e., you know the states, transitions, emissions, and initial probabilities),
  • And a sequence of observed events (also called the observation sequence),

The task is to determine the most probable sequence of hidden states that could have generated the observed sequence.

This is known as the decoding problem in Hidden Markov Models (HMMs).

Why This Problem Is Hard

Let’s say you have:

  • 3 hidden states
  • An observation sequence of 10 steps

The number of possible state sequences is 3^10 = 59,049.

And with 20 steps? That’s about 3.5 billion sequences (3^20 = 3,486,784,401)!

Brute-force enumeration—trying every possible sequence—is computationally impractical as the length of the sequence increases.

We need a smarter approach.

The Viterbi Solution

The Viterbi Algorithm solves this problem efficiently using dynamic programming. It avoids redundant calculations by:

  • Breaking the problem into subproblems,
  • Keeping track of the most likely path to each state at every time step,
  • And remembering how it got there.

This allows it to compute the most probable sequence in time that is linear in the length of the observation sequence (and quadratic in the number of states), rather than exponential.
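To make the savings concrete, here is a quick back-of-the-envelope comparison in Python; N and T are illustrative values matching the example above:

# Rough operation counts for N hidden states and T observations.
N, T = 3, 20

brute_force_paths = N ** T        # every possible state sequence
viterbi_operations = N * N * T    # max over N predecessors, for N states, at each of T steps

print(f"Brute-force paths:  {brute_force_paths:,}")   # 3,486,784,401
print(f"Viterbi operations: {viterbi_operations:,}")  # 180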

Formal Statement

Given:

An HMM is defined by:

  • A set of hidden states S = {s_1, s_2, …, s_N}
  • Transition probabilities A = [a_ij]
  • Emission probabilities B = [b_j(o)]
  • Initial probabilities π = [π_i]
  • An observation sequence O = [o_1, o_2, …, o_T]

Find:

The most probable sequence of hidden states Q* = [q_1, q_2, …, q_T] such that:

Q* = argmax_Q P(Q | O, λ)

Where:

  • Q is any possible sequence of hidden states,
  • λ is the HMM model (parameters A, B, π).
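Read literally, this says: score every candidate sequence Q and keep the best one. A naive sketch of that definition in Python is shown below; it is exponential in T, so it is only usable for tiny problems, and it assumes the dictionary-based parameter layout from the weather sketch above:

from itertools import product

def brute_force_decode(states, observations, pi, A, B):
    """Literal argmax over all |S|^T state sequences -- only feasible for tiny T."""
    best_path, best_prob = None, 0.0
    for Q in product(states, repeat=len(observations)):
        # Joint probability P(Q, O) = pi * emissions * transitions
        prob = pi[Q[0]] * B[Q[0]][observations[0]]
        for t in range(1, len(observations)):
            prob *= A[Q[t - 1]][Q[t]] * B[Q[t]][observations[t]]
        if prob > best_prob:
            best_path, best_prob = Q, prob
    return list(best_path), best_prob

# e.g. brute_force_decode(states, ["hoodie", "raincoat"], pi, A, B)
# with the weather dictionaries sketched earlier.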

Intuition Behind the Viterbi Algorithm

Before diving into code or equations, it helps to build an intuitive understanding of how the Viterbi Algorithm works—and why it’s such a powerful tool for decoding hidden state sequences.

The Core Idea: Smart Guessing with Memory

At a high level, the Viterbi Algorithm answers this question:

“At each point in time, what is the most probable path that led us to each possible state?”

Instead of evaluating all possible sequences, it builds up the best path step-by-step, remembering only the most promising paths and discarding the rest. This is what makes the algorithm efficient.

Think of It Like GPS Navigation

Imagine you’re driving through a city with several routes to your destination. At each intersection, you choose the route that is most likely to get you there fastest, based on current traffic and past experience. You don’t re-evaluate all previous paths at every intersection; you just keep the best route so far and continue forward.

The Viterbi Algorithm works similarly: it keeps track of the most probable way to reach each state at each time step and uses that information to make efficient decisions at the next step.

Two Things the Algorithm Tracks

  1. Probability Table (Viterbi Matrix): For each state at each time step, store the maximum probability of arriving at that state.
  2. Backpointer Table: For each state at each time step, store the previous state that yielded the maximum probability.

These two tables allow the algorithm to:

  • Build up the most likely state path over time,
  • And trace back the optimal path once all observations have been processed.
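In code, the two tables can be as simple as lists of dictionaries. Here is a minimal sketch with placeholder state names and sequence length:

states = ["Sunny", "Rainy"]          # placeholder hidden states
T = 3                                # placeholder number of observations

# viterbi[t][s]: maximum probability of any path that ends in state s at time t
viterbi = [{s: 0.0 for s in states} for _ in range(T)]

# backpointer[t][s]: the previous state on that best path
backpointer = [{s: None for s in states} for _ in range(T)]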

A Simple Example

Let’s say you’re trying to guess the weather (Sunny or Rainy) over 3 days based on what someone wears (e.g., sunglasses, hoodie, raincoat).

  • You know the probabilities of switching from Sunny to Rainy (and vice versa).
  • You also see the likelihood of each piece of clothing under each weather condition.

As you observe what the person wears each day, the Viterbi Algorithm:

  • Starts from the initial probability of each weather condition, combined with how likely Day 1’s clothing is under each condition,
  • Updates the probability of each weather condition for Day 2 based on Day 1,
  • Keeps track of which previous state (Day 1’s weather) made Day 2’s weather most likely,
  • Repeats this for Day 3,
  • Then, it traces back the sequence of weather conditions that most likely led to the complete set of clothing observations.

The Key Insight: Dynamic Programming

The Viterbi Algorithm avoids redundant calculations by reusing subproblem results. At each step, it:

  • Computes only the best path to each state,
  • Builds on that for the next step.

This transforms what could be an exponential-time problem into a manageable and efficient process.

Step-by-Step Explanation of the Viterbi Algorithm

Now that we’ve built up the intuition, let’s break down how the Viterbi Algorithm works step by step. We’ll follow the structure of a dynamic programming approach that builds a path of maximum probability to each hidden state, one step at a time.

We’ll go through four main steps:

  1. Initialisation
  2. Recursion
  3. Termination
  4. Backtracking

Step 1: Initialisation

At time step t = 0, we calculate the probability of starting in each state and emitting the first observation.

For each state s_i:

V_0(i) = π_i · b_i(o_0)

Where:

  • π_i = initial probability of state s_i
  • b_i(o_0) = probability of observing o_0 in state s_i
  • V_0(i) stores the maximum probability of being in state s_i at time 0

We also initialise a back-pointer table to keep track of paths.
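Numerically, with the Healthy/Fever model used in the worked example later in this post (first observation “normal”), the initialisation step looks like this:

# Initialisation with the Healthy/Fever numbers from the worked example below.
pi = {"Healthy": 0.6, "Fever": 0.4}
B_normal = {"Healthy": 0.5, "Fever": 0.1}   # P("normal" | state)

V0 = {state: pi[state] * B_normal[state] for state in pi}
print(V0)  # Healthy: 0.30, Fever: 0.04 (up to floating-point rounding)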

Step 2: Recursion

From time t = 1 to T – 1, we compute the probability of each state at time t given:

  • All possible previous states,
  • The transition probabilities from those states,
  • And the emission probability of the current observation.

For each state s_j at time t:

V_t(j) = max_i [ V_{t-1}(i) · a_ij ] · b_j(o_t)

And we store the index i that gave us the maximum value as a backpointer:

BP_t(j) = argmax_i [ V_{t-1}(i) · a_ij ]

Where:

  • a_ij = probability of transitioning from state s_i to s_j
  • b_j(o_t) = probability of emitting o_t from state s_j
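Continuing with the same Healthy/Fever numbers, one recursion update (time t = 1, state Healthy, observation “cold”) works out as follows:

# One recursion step using the Healthy/Fever numbers from the worked example below.
V0 = {"Healthy": 0.30, "Fever": 0.04}           # from the initialisation step
A_to_healthy = {"Healthy": 0.7, "Fever": 0.4}   # a_{i,Healthy}
b_healthy_cold = 0.4                            # P("cold" | Healthy)

scores = {prev: V0[prev] * A_to_healthy[prev] * b_healthy_cold for prev in V0}
best_prev = max(scores, key=scores.get)
print(best_prev, scores[best_prev])  # Healthy, ~0.084 -> backpointer for Healthy at t = 1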

Step 3: Termination

After processing all observations, we find the final state that has the highest probability at the last time step.

P* = max_i V_{T-1}(i)

q*_{T-1} = argmax_i V_{T-1}(i)

Where:

  • P* = the probability of the most likely state sequence
  • q*_{T-1} = the final state in that optimal sequence

Step 4: Backtracking

Now, we trace back the most probable path using the back pointer table.

For t = T−2 down to 0:

q*_t = BP_{t+1}(q*_{t+1})

This gives us the full state sequence q* = [q*_0, q*_1, …, q*_{T−1}] that most likely produced the observation sequence.

Summary Table

| Step           | What Happens                                                           |
|----------------|------------------------------------------------------------------------|
| Initialisation | Set initial probabilities based on π and the first observation          |
| Recursion      | For each state and time step, compute the max-probability path to it    |
| Termination    | Pick the state with the highest probability at the final time step      |
| Backtracking   | Trace back the most probable path using the stored backpointers         |

Pseudocode and Example of Viterbi Algorithm

Now that you understand the theory and the steps, let’s make it concrete with pseudocode followed by a simple worked-out example. This will help you see how the algorithm runs in practice.

Pseudocode: Viterbi Algorithm

Input:
- States: S = {s_1, s_2, ..., s_N}
- Observations: O = [o_0, o_1, ..., o_{T-1}]
- Initial probabilities: π[i] = P(start in state i)
- Transition probabilities: A[i][j] = P(transition from i to j)
- Emission probabilities: B[j][o] = P(observe o in state j)

Output:
- Most probable state sequence: Q = [q_0, q_1, ..., q_{T-1}]

1. Initialize:
   for each state s_i:
       V[0][i] = π[i] * B[i][O[0]]
       backpointer[0][i] = 0

2. Recursion:
   for t = 1 to T - 1:
       for each state s_j:
           V[t][j] = max_i (V[t-1][i] * A[i][j] * B[j][O[t]])
           backpointer[t][j] = argmax_i (V[t-1][i] * A[i][j])

3. Termination:
   P* = max_i V[T-1][i]
   q[T-1] = argmax_i V[T-1][i]

4. Backtracking:
   for t = T-2 to 0:
       q[t] = backpointer[t+1][q[t+1]]

Return q
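For readers who prefer runnable code, here is a direct Python translation of the pseudocode. The function name and dictionary-based data layout are just one reasonable choice, not the only way to implement it:

# A runnable Python translation of the pseudocode above.
def viterbi(states, observations, pi, A, B):
    """Return (most probable state path, its probability) for a known HMM."""
    # 1. Initialisation
    V = [{s: pi[s] * B[s][observations[0]] for s in states}]
    backpointer = [{s: None for s in states}]

    # 2. Recursion
    for t in range(1, len(observations)):
        V.append({})
        backpointer.append({})
        for j in states:
            # The emission term is the same for every predecessor i,
            # so the argmax only needs V[t-1][i] * A[i][j].
            best_prev = max(states, key=lambda i: V[t - 1][i] * A[i][j])
            V[t][j] = V[t - 1][best_prev] * A[best_prev][j] * B[j][observations[t]]
            backpointer[t][j] = best_prev

    # 3. Termination
    last = len(observations) - 1
    best_state = max(states, key=lambda s: V[last][s])
    best_prob = V[last][best_state]

    # 4. Backtracking
    path = [best_state]
    for t in range(last, 0, -1):
        path.insert(0, backpointer[t][path[0]])

    return path, best_prob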

Worked-Out Example

Let’s decode a simple example with only two states and three observations.

Model Parameters

  • States: Healthy, Fever
  • Observations: normal, cold, dizzy
  • Observation sequence: [‘normal’, ‘cold’, ‘dizzy’]

Initial Probabilities:

P(Healthy) = 0.6
P(Fever)   = 0.4

Transition Probabilities:

| From \ To | Healthy | Fever |
|-----------|---------|-------|
| Healthy   | 0.7     | 0.3   |
| Fever     | 0.4     | 0.6   |

Emission Probabilities:

| State   | normal | cold | dizzy |
|---------|--------|------|-------|
| Healthy | 0.5    | 0.4  | 0.1   |
| Fever   | 0.1    | 0.3  | 0.6   |

Step 1: Initialisation (t = 0)

Observation: normal

V[0][Healthy] = 0.6 * 0.5 = 0.30
V[0][Fever]   = 0.4 * 0.1 = 0.04

Step 2: Recursion (t = 1)

Observation: cold

V[1][Healthy] = max(
    0.30 * 0.7 * 0.4,  # from Healthy
    0.04 * 0.4 * 0.4   # from Fever
) = max(0.084, 0.0064) = 0.084

V[1][Fever] = max(
    0.30 * 0.3 * 0.3,  # from Healthy
    0.04 * 0.6 * 0.3   # from Fever
) = max(0.027, 0.0072) = 0.027

Step 3: Recursion (t = 2)

Observation: dizzy

V[2][Healthy] = max(
    0.084 * 0.7 * 0.1,  # from Healthy
    0.027 * 0.4 * 0.1   # from Fever
) = max(0.00588, 0.00108) = 0.00588

V[2][Fever] = max(
    0.084 * 0.3 * 0.6,  # from Healthy
    0.027 * 0.6 * 0.6   # from Fever
) = max(0.01512, 0.00972) = 0.01512

Step 4: Termination and Backtracking

Final state:

Max at t=2: Fever with 0.01512
Backtrack gives path: Healthy → Healthy → Fever

Final Output

The most probable sequence of hidden states is:

['Healthy', 'Healthy', 'Fever']
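As a sanity check, plugging these parameters into the viterbi() sketch from the pseudocode section (assuming that function is in scope) reproduces the hand calculation:

# Parameters exactly as defined in the worked example above.
states = ["Healthy", "Fever"]
observations = ["normal", "cold", "dizzy"]
pi = {"Healthy": 0.6, "Fever": 0.4}
A = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever":   {"Healthy": 0.4, "Fever": 0.6},
}
B = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever":   {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}

path, prob = viterbi(states, observations, pi, A, B)
print(path, prob)  # ['Healthy', 'Healthy', 'Fever'], ~0.01512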

Applications of the Viterbi Algorithm

The Viterbi Algorithm’s ability to efficiently decode the most probable sequence of hidden states from observed data has made it a cornerstone in many real-world applications. Here are some of the most common and impactful uses:

1. Digital Communications

One of the earliest and most prominent applications is in error correction for digital communication systems.

  • In noisy communication channels, transmitted data can get corrupted.
  • The Viterbi Algorithm helps decode the most likely transmitted bit sequence by modelling the channel as a Hidden Markov Model.
  • This is widely used in systems such as satellite communications, mobile phones, and digital TV.

2. Speech Recognition

Speech is a sequence of sounds that can be modelled as an HMM, where:

  • The hidden states correspond to phonemes or words,
  • The observations are the audio signals or features extracted from speech.

The Viterbi Algorithm helps find the most likely sequence of words given an audio signal, powering many voice assistants and dictation tools.

3. Bioinformatics

In bioinformatics, the Viterbi Algorithm is used extensively for:

  • Gene prediction: Identifying coding regions in DNA sequences.
  • Protein sequence analysis: Modelling protein families and domains.
  • HMMs model biological sequences where the states represent functional regions or structural motifs and observations are nucleotides or amino acids.

4. Natural Language Processing (NLP)

Part-of-speech (POS) tagging is a classic NLP problem where:

  • Each word’s grammatical category (noun, verb, adjective, etc.) is a hidden state.
  • The observed words are the observations.

The Viterbi Algorithm efficiently determines the most probable POS tag sequence for a sentence, thereby enhancing syntactic analysis and understanding.
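In code, POS tagging uses exactly the same machinery as the weather and health examples: hidden states become tags and observations become words. The toy sketch below uses an invented tag set, vocabulary, and probabilities purely for illustration; real systems estimate these parameters from a tagged corpus:

# Toy POS-tagging HMM: hidden states are tags, observations are words.
# All numbers are made up for illustration.
states = ["NOUN", "VERB"]
pi = {"NOUN": 0.6, "VERB": 0.4}
A = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
     "VERB": {"NOUN": 0.8, "VERB": 0.2}}
B = {"NOUN": {"dogs": 0.6, "bark": 0.3, "sleep": 0.1},
     "VERB": {"dogs": 0.05, "bark": 0.55, "sleep": 0.4}}

sentence = ["dogs", "bark"]
# tags, prob = viterbi(states, sentence, pi, A, B)  # using the viterbi() sketch above
# -> ['NOUN', 'VERB']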


5. Robotics and Control Systems

Robots often use HMMs for:

  • Localisation: Inferring a robot’s position based on sensor data.
  • Motion tracking: Estimating sequences of hidden states, such as robot poses.

The Viterbi Algorithm helps decode the most likely path or behaviour sequence, enabling smoother navigation and decision-making.

In essence, whenever a system is modelled by hidden states that influence observable data—mainly when sequential or time-dependent information is involved—the Viterbi Algorithm provides a powerful and efficient way to decode those hidden states.

Comparison with Other Algorithms

The Viterbi Algorithm is a powerful tool for decoding hidden state sequences in Hidden Markov Models, but it’s not the only algorithm designed for working with HMMs or similar models. Let’s compare it with some related algorithms to understand its strengths and when you might prefer alternatives.

Viterbi Algorithm vs. Forward Algorithm

  • Purpose:
    • Viterbi: Finds the single most likely sequence of hidden states given the observations (decoding problem).
    • Forward: Computes the total probability of observing a sequence by summing over all possible hidden state paths (evaluation problem).
  • Output:
    • Viterbi: Returns the most probable path.
    • Forward: Returns a probability score (likelihood) of the observations without a specific path.
  • Use Case:
    • Use Viterbi when you want the best explanation (path) of the observations.
    • Use Forward when you want to measure how likely the observed data is under the model.

Viterbi Algorithm vs. Baum-Welch Algorithm

  • Purpose:
    • Viterbi: Decodes the most probable hidden states given known model parameters.
    • Baum-Welch: An Expectation-Maximization algorithm used to train or estimate HMM parameters (transition and emission probabilities) when they are unknown.
  • Output:
    • Viterbi: Most probable hidden state sequence.
    • Baum-Welch: Refined model parameters to better fit data.
  • Use Case:
    • Use the Baum-Welch algorithm to learn or improve the HMM before applying the Viterbi algorithm for decoding.

Viterbi Algorithm vs. Posterior Decoding

  • Posterior Decoding:
    • Instead of finding the single most probable sequence, it finds the most probable state at each time step independently.
    • It uses marginal probabilities computed by combining forward and backward algorithms.
  • Difference:
    • Viterbi gives a globally optimal path.
    • Posterior decoding gives locally optimal states, but the resulting sequence may not be a valid path.
  • When to Use:
    • Posterior decoding is helpful if you want to minimise expected errors per position.
    • Viterbi is preferred when you need a coherent, globally optimal sequence.

Summary Table

| Algorithm          | Purpose                                 | Output                                      | When to Use                  |
|--------------------|-----------------------------------------|---------------------------------------------|------------------------------|
| Viterbi            | Decode the most probable state sequence | Single best path sequence                   | Decoding known HMM sequences |
| Forward            | Compute likelihood of observations      | Probability of the observation sequence     | Model evaluation and scoring |
| Baum-Welch         | Train/estimate model parameters         | Updated transition & emission probabilities | Training HMM parameters      |
| Posterior Decoding | Find the most probable state per step   | Most probable state at each time step       | Minimising local errors      |

The Viterbi Algorithm strikes a unique balance: it’s efficient, exact for decoding, and widely applicable, making it a go-to method when you want the best explanation of the hidden path in your data.

Optimisation and Variants of the Viterbi Algorithm

The classic Viterbi Algorithm is robust and efficient, but depending on the problem size, domain, or specific constraints, several optimisations and variants have been developed to improve performance, reduce memory usage, or adapt to specialised scenarios. Let’s explore some of these enhancements.

1. Logarithmic Implementation

The Viterbi Algorithm multiplies many probabilities, which can become extremely small, resulting in numerical underflow on computers.

The standard fix is to work with logarithms of probabilities, which turns multiplications into additions:

log V_t(j) = max_i [ log V_{t-1}(i) + log a_ij ] + log b_j(o_t)

Benefit:

It improves numerical stability and prevents floating-point underflow, especially for long sequences.
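Here is a minimal sketch of one recursion step in log space, reusing the Healthy/Fever numbers from the worked example:

import math

def log_viterbi_step(prev_log_scores, log_a_col, log_b_emit):
    """One recursion step in log space: products of probabilities become sums.
    prev_log_scores and log_a_col are dicts keyed by previous state."""
    best_prev = max(prev_log_scores, key=lambda i: prev_log_scores[i] + log_a_col[i])
    return best_prev, prev_log_scores[best_prev] + log_a_col[best_prev] + log_b_emit

# Same numbers as the worked example (t = 1, state Healthy, observation "cold"):
prev = {"Healthy": math.log(0.30), "Fever": math.log(0.04)}
a_col = {"Healthy": math.log(0.7), "Fever": math.log(0.4)}
best_prev, log_score = log_viterbi_step(prev, a_col, math.log(0.4))
print(best_prev, math.exp(log_score))  # Healthy, ~0.084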

2. Beam Search (Pruning)

When the state space or observation length is very large, computing all paths can still be expensive.

The idea is to keep only the top k most probable states (the beam width) at each time step, pruning away the less likely paths.

Tradeoff:

This heuristic sacrifices exactness for speed and memory savings, making it useful in real-time or resource-constrained applications.
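A minimal sketch of the pruning step is shown below; the scores and beam width are placeholders:

def prune_to_beam(scores, beam_width):
    """Keep only the beam_width highest-scoring states.
    scores: dict mapping state -> probability (or log-probability)."""
    kept = sorted(scores, key=scores.get, reverse=True)[:beam_width]
    return {state: scores[state] for state in kept}

scores_t = {"A": 0.12, "B": 0.05, "C": 0.002, "D": 0.0007}  # placeholder scores
print(prune_to_beam(scores_t, beam_width=2))  # {'A': 0.12, 'B': 0.05}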

3. Parallel and Hardware Implementations

Modern hardware, such as GPUs or FPGAs, can parallelise parts of the Viterbi Algorithm, especially the computation of probabilities for each state at a given time step.

This enables significant speedups in applications such as speech recognition and digital communication decoding.

4. Variants for Specific HMM Structures

Segmental Viterbi Algorithm:

Works on models where states correspond to segments of varying lengths, which is helpful in speech and bioinformatics.

Viterbi Training:

An iterative variant that alternates Viterbi decoding with parameter re-estimation (similar to Baum-Welch, but using hard state assignments rather than soft expectations).

5. Memory-Efficient Viterbi

The back pointer table can consume a lot of memory for long sequences.

Some implementations recompute back pointers on the fly during backtracking or use checkpointing to balance memory and runtime.

Summary

Optimising the Viterbi Algorithm depends on the specific application needs:

| Optimisation                | Purpose                     | Benefit                              | Tradeoff                               |
|-----------------------------|-----------------------------|--------------------------------------|----------------------------------------|
| Logarithmic scale           | Prevent numerical underflow | Numerical stability                  | Additional computation for logs        |
| Beam search                 | Speed and memory efficiency | Faster, less memory usage            | Approximate, may miss the optimal path |
| Parallel implementations    | Speed up computations       | Massive acceleration                 | Requires suitable hardware             |
| Segmental Viterbi           | Variable-length segments    | Better modelling for speech/genomics | More complex implementation            |
| Memory-efficient techniques | Reduce memory footprint     | Works with very long sequences       | May increase runtime                   |

Conclusion

The Viterbi Algorithm is a fundamental and elegant dynamic programming technique for decoding the most likely sequence of hidden states in a Hidden Markov Model. Its ability to efficiently explore exponentially many possible paths by focusing on the most probable ones has made it indispensable across a wide range of fields—from digital communications and speech recognition to bioinformatics and natural language processing.

By understanding its intuition, step-by-step process, and practical applications, you can appreciate how the Viterbi Algorithm transforms complex probabilistic inference into a tractable, efficient solution. Whether you’re working on decoding noisy signals, predicting gene sequences, or building more intelligent AI systems, mastering the Viterbi Algorithm is a valuable skill.

Finally, with various optimisations and variants available, the algorithm can be tailored to fit different computational constraints and real-world challenges—making it as relevant today as when it was first introduced.

If you want to dive deeper, experiment with implementations, or explore its many applications, the Viterbi Algorithm offers a rich and rewarding journey into the world of probabilistic models.

About the Author

Neri Van Otten

Neri Van Otten is the founder of Spot Intelligence, a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation. Dedicated to making your projects succeed.
