Z-Score Normalization Made Simple & How To Tutorial In Python

by Neri Van Otten | Feb 14, 2025 | Data Science, Machine Learning

What is Z-Score Normalization?

Z-score normalization, or standardization, is a statistical technique that rescales data so that it has a mean (μ) of 0 and a standard deviation (σ) of 1, the same scale as the standard normal distribution. This makes it easier to compare variables or datasets measured on different scales.


The Z-Score Formula

Z-score normalization is calculated using the formula:

Z = (X − μ) / σ

Where:

  • X = individual data point
  • μ = mean of the dataset
  • σ = standard deviation of the dataset

The result, Z, represents how many standard deviations a data point is from the mean.
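
As a quick sketch of the formula in code (using the mean and standard deviation from the worked example later in this post), the calculation is a one-liner:

# Hypothetical example: how far is an age of 40 from a mean of 30.4
# with a standard deviation of 6.53?
X, mu, sigma = 40, 30.4, 6.53
z = (X - mu) / sigma
print(round(z, 2))  # 1.47 -> the point sits about 1.5 standard deviations above the mean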

Understanding Z-Scores

  • A Z-score of 0 means the data point is exactly at the mean.
  • A positive Z-score indicates the data point is above the mean.
  • A negative Z-score means the data point is below the mean.
  • The further the Z-score is from 0, the more unusual or extreme the data point is compared to the rest of the dataset.

Why Standardization Matters

  • It removes unit dependencies, allowing direct comparison of variables measured on different scales (e.g., height in cm vs. weight in kg).
  • It is essential for many machine learning algorithms (e.g., k-NN, SVM, PCA) that rely on distance-based calculations.
  • It helps identify outliers, as extreme Z-scores often indicate anomalies in the data.

Why Use Z-Score Normalization?

Z-score normalization offers several advantages, especially in contexts where data features are on different scales or have outliers. Here’s why it’s commonly used in data preprocessing:

1. Standardizes Data for Comparisons

One of the main reasons for using Z-score normalization is that it brings all features to a standard scale. Without this, features with larger ranges (e.g., income in thousands vs. age in years) could dominate distance-based algorithms like k-nearest neighbours (k-NN) or support vector machines (SVM). Z-scores standardize the data by transforming it into a distribution with a mean of 0 and a standard deviation of 1, making it easier to compare features directly.

2. Works Well for Algorithms That Rely on Distance Metrics

Many machine learning algorithms, especially those involving distance calculations like k-NN and k-means clustering, require all features to be on a similar scale. Z-score normalization ensures that each feature contributes equally to the algorithm, preventing variables with larger ranges from overpowering others.

For example, in k-NN, the Euclidean distance formula would give more weight to features with larger numerical values. By normalizing the data, Z-score standardization helps each feature contribute proportionally to the distance metric.
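
To make this concrete, here is a minimal sketch (with made-up customer values and the small age/income sample used later in this post) showing how the raw Euclidean distance is dominated by income, and how Z-scoring rebalances it:

import numpy as np

# Two hypothetical customers described by age (years) and income
a = np.array([25, 30_000])
b = np.array([40, 32_000])

# Raw Euclidean distance: the income difference swamps the 15-year age gap
print(np.linalg.norm(a - b))  # ~2000

# Standardize both features using the mean/std of a small reference sample
sample = np.array([[22, 20_000], [25, 30_000], [30, 40_000],
                   [35, 50_000], [40, 60_000]], dtype=float)
mean, std = sample.mean(axis=0), sample.std(axis=0)
a_z, b_z = (a - mean) / std, (b - mean) / std

# After Z-scoring, both features contribute on a comparable scale
print(np.linalg.norm(a_z - b_z))  # ~2.3, now driven mostly by the genuine age difference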


3. Improves Convergence in Gradient Descent Algorithms

Algorithms like linear regression, logistic regression, and neural networks that use gradient descent benefit from Z-score normalization. When features have vastly different scales, the gradient descent process can become inefficient, as the algorithm may take long steps in some directions and tiny steps in others. Normalizing the data with Z-scores ensures that the learning process is smoother and faster, leading to better model performance and faster convergence.

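One way to see why this matters is to look at the conditioning of the problem. The sketch below (with made-up age and income features) compares the condition number of XᵀX before and after standardization; the larger this number, the smaller the usable step size and the slower plain gradient descent converges:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical features on very different scales
age = rng.uniform(18, 80, 200)
income = rng.uniform(10_000, 100_000, 200)
X_raw = np.column_stack([age, income])

# Z-score standardization
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# A huge condition number means gradient descent zig-zags: big steps along one
# feature direction, tiny steps along the other
print("Condition number (raw):         ", np.linalg.cond(X_raw.T @ X_raw))
print("Condition number (standardized):", np.linalg.cond(X_std.T @ X_std))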

4. Handles Outliers More Effectively Than Other Methods

While Z-score normalization doesn’t remove outliers, it makes extreme values easier to identify, since they show up as Z-scores far from 0 (typically above 3 or below -3), which makes them straightforward to detect and manage. In contrast, Min-Max normalization lets a single extreme value squash the rest of the data into a narrow band, which can hide unusual behaviour rather than expose it.


However, extreme outliers can still significantly influence the mean and standard deviation, and therefore the Z-scores themselves. This should be considered when deciding whether to use Z-score normalization.
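
A minimal sketch of this idea, assuming a made-up age sample with one data-entry error, flags anything more than 3 standard deviations from the mean:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample: ~50 typical ages plus one data-entry error of 200
ages = np.append(rng.normal(35, 8, 50).round(), 200.0)

z = (ages - ages.mean()) / ages.std()

# Flag points with |Z| > 3 as potential outliers
print("Potential outliers:", ages[np.abs(z) > 3])  # the 200 stands out clearly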

5. No Constraints on Data Distribution

Unlike Min-Max normalization, which forces data into a fixed range (typically [0, 1]), Z-score normalization does not bound the transformed values. This makes it more versatile for non-uniformly distributed data: while Min-Max normalization works best when values are spread fairly evenly across their range, Z-score normalization copes better with skewed or multi-modal distributions.

6. Improves Performance in Principal Component Analysis (PCA)

PCA, a dimensionality reduction technique, relies on the variance of each feature. If features are on different scales, PCA will favour the variables with the largest numerical variance, even if they are not the most important to the underlying structure of the data. Z-score normalization ensures that all features carry equal weight, improving the quality of the reduced dimensions.
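
A small sketch with two made-up, independent features (low-variance age, high-variance income) illustrates the effect: without scaling, the first principal component is essentially just income; after Z-scoring, the variance is shared:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical independent features with wildly different variances
age = rng.normal(35, 8, 300)
income = rng.normal(40_000, 15_000, 300)
X = np.column_stack([age, income])

# Without scaling, virtually all the "explained variance" comes from income
print(PCA(n_components=2).fit(X).explained_variance_ratio_)

# After Z-scoring, both features contribute to the components
X_std = StandardScaler().fit_transform(X)
print(PCA(n_components=2).fit(X_std).explained_variance_ratio_)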


When Not to Use Z-Score Normalization

Despite its benefits, Z-score normalization is not always the best option. It’s essential to be mindful of the following situations:

  • When dealing with highly non-normal data: Z-scores are most meaningful when the data is roughly normally distributed. If your data is heavily skewed or contains many outliers, techniques like the Robust Scaler or a Quantile Transformation may be more appropriate (see the sketch after this list).
  • When data interpretation is essential: Z-scores obscure the original scale and units, which can make results harder to explain. Consider other normalization techniques if you need to keep the original scale for understanding or reporting purposes.
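
For reference, here is a minimal sketch of those two alternatives on a made-up skewed feature with one extreme value (both scalers are part of scikit-learn's preprocessing module):

import numpy as np
from sklearn.preprocessing import RobustScaler, QuantileTransformer

# Hypothetical skewed feature with one extreme value
x = np.array([[1.0], [2.0], [2.5], [3.0], [3.5], [4.0], [250.0]])

# RobustScaler centres on the median and scales by the IQR,
# so the extreme value barely distorts the bulk of the data
print(RobustScaler().fit_transform(x).ravel())

# QuantileTransformer maps values to a target distribution by rank
print(QuantileTransformer(n_quantiles=7, output_distribution="normal").fit_transform(x).ravel())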

In the next section, we’ll walk through calculating Z-scores.

Step-by-Step Calculation of Z-Score Normalization

To better understand how Z-score normalization works, let’s go through an example dataset and calculate the Z-scores for each data point step by step.

Example Dataset

Let’s say we have a small dataset of ages (in years) of five people:

Person | Age
A      | 22
B      | 25
C      | 30
D      | 35
E      | 40

Step 1: Calculate the Mean (μ)

The mean is the average of all data points in the dataset. We add all the values to find the mean and divide by the number of data points.

μ = (X₁ + X₂ + … + Xₙ) / n

For our dataset:

μ = (22 + 25 + 30 + 35 + 40) / 5 = 152 / 5 = 30.4

So, the mean age is 30.4.

Step 2: Calculate the Standard Deviation (σ)

The standard deviation measures the amount of variation or dispersion of the dataset. The formula for standard deviation is:

σ = √( Σ (Xᵢ − μ)² / n )

We’ll first calculate the squared differences from the mean:

For Person A: (22−30.4)²=(−8.4)²=70.56

For Person B: (25−30.4)²=(−5.4)²=29.16

For Person C: (30−30.4)²=(−0.4)²=0.16

For Person D: (35−30.4)²=(4.6)²=21.16

For Person E: (40−30.4)²=(9.6)²=92.16

Now, sum the squared differences:

70.56+29.16+0.16+21.16+92.16=213.2

Finally, divide by the number of data points (n = 5) and take the square root:

σ = √(213.2 / 5) = √42.64 ≈ 6.53

So, the standard deviation is approximately 6.53.
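
If you want to double-check these numbers, a short NumPy sketch reproduces them (np.std uses the population formula by default, matching the calculation above):

import numpy as np

ages = np.array([22, 25, 30, 35, 40])
print(ages.mean())  # 30.4
print(ages.std())   # ≈ 6.53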

Step 3: Calculate the Z-Score for Each Data Point

Now that we have the mean (30.4) and standard deviation (6.53), we can calculate the Z-score for each data point using the Z-score formula:

Z = (X − μ) / σ

Let’s calculate the Z-scores for each person:

For Person A:

Z = (22 − 30.4) / 6.53 = −8.4 / 6.53 ≈ −1.29

For Person B:

Z = (25 − 30.4) / 6.53 = −5.4 / 6.53 ≈ −0.83

For Person C:

Z = (30 − 30.4) / 6.53 = −0.4 / 6.53 ≈ −0.06

For Person D:

Z = (35 − 30.4) / 6.53 = 4.6 / 6.53 ≈ 0.70

For Person E:

Z = (40 − 30.4) / 6.53 = 9.6 / 6.53 ≈ 1.47

Step 4: Interpret the Results

The Z-scores tell us how each data point compares to the mean of the dataset in terms of standard deviations:

  • Person A has a Z-score of -1.29, meaning their age is 1.29 standard deviations below the mean.
  • Person B has a Z-score of -0.83, indicating their age is 0.83 standard deviations below the mean.
  • Person C has a Z-score of -0.06, meaning their age is very close to the mean (slightly below).
  • Person D has a Z-score of 0.70, indicating their age is 0.70 standard deviations above the mean.
  • Person E has a Z-score of 1.47, meaning their age is 1.47 standard deviations above the mean.

By calculating the Z-scores, we’ve standardized the ages in the dataset, transforming them into values representing how many standard deviations they are away from the mean. This allows for easier comparison of data points and is especially useful when dealing with data of different scales or units.

The following section will explore how Z-score normalization is applied in machine learning and why it’s crucial for specific algorithms.

Z-Score Normalization in Machine Learning

Z-score normalization is crucial in preparing data for machine learning models, especially those that rely on distance calculations or gradient-based optimization. Let’s explore why Z-score normalization is essential in machine learning and how it impacts various algorithms.

1. Feature Scaling for Distance-Based Algorithms

Algorithms like k-Nearest Neighbours (k-NN) and Support Vector Machines (SVM) rely on distance metrics (e.g., Euclidean distance) to make predictions. If the dataset’s features have vastly different scales, the algorithm may assign more importance to features with larger numerical values, drowning out the smaller-valued features and leading to biased predictions.

Z-score normalization ensures that all features are on the same scale, giving each feature equal weight in the distance calculation. This leads to more accurate results. For example, if you have two features—age (ranging from 18 to 80) and income (ranging from 10,000 to 100,000)—without normalization, income will dominate the distance calculation due to its larger values. By applying Z-score normalization, both features will be centred around zero and of equal importance in the model.
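
In practice, the cleanest way to do this with scikit-learn is to put the scaler and the model in a single pipeline, so the scaling is fitted on the training data only. A minimal sketch on the built-in wine dataset (whose features span very different scales) illustrates the idea:

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k-NN on the raw features: large-valued features dominate the distances
knn = KNeighborsClassifier().fit(X_train, y_train)
print("Accuracy without scaling:", knn.score(X_test, y_test))

# Same model with Z-score scaling applied inside a pipeline
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_train, y_train)
print("Accuracy with scaling:   ", pipe.score(X_test, y_test))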

2. Gradient Descent Convergence

Many machine learning models, such as linear regression, logistic regression, and neural networks, use gradient descent for optimization. Gradient descent works by adjusting the model’s parameters based on the gradients of the loss function, and the scale of the features heavily influences the size of the gradient.

Suppose the features have significant disparities in scale. In that case, the gradient descent process might struggle to converge efficiently, as it could take tiny steps in some directions (where features have small values) and large steps in others (where features have large values). This can slow down the training process and make it harder to find the optimal parameters. Z-score normalization helps by scaling features to a standard range, making the gradient descent steps more uniform and allowing for faster convergence.

3. Improved Performance in Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique that seeks to find new axes (principal components) that explain the most variance in the data. If the features in the dataset are on different scales, PCA will give more importance to features with larger variances, even if they aren’t the most relevant.

By applying Z-score normalization, each feature contributes equally to the variance, allowing PCA to find the most meaningful components based on the structure of the data rather than the scale of individual features. This results in more accurate and balanced dimensionality reduction, which can improve downstream models.

4. Impact on Regularization

Many machine learning models, such as ridge and lasso regression, use regularization techniques to prevent overfitting. Regularization methods add a penalty to the model’s loss function based on the magnitude of the coefficients, discouraging overly large coefficients.

Without Z-score normalization, features on larger scales end up with coefficients that the penalty treats very differently from those on smaller scales, regardless of how important the features actually are. This can make the regularization process less effective. By normalizing the features, Z-score standardization ensures that the regularization penalties are applied equally across all features, leading to better model generalization.
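
The sketch below (with made-up age and income features) shows the idea: on the raw data the coefficient sizes mostly reflect each feature's units, whereas after Z-scoring they become comparable "effect per standard deviation" values, so a single alpha penalizes both features on equal terms:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: age (years) and income both influence spending
age = rng.uniform(18, 80, 300)
income = rng.uniform(10_000, 100_000, 300)
X = np.column_stack([age, income])
y = 2.0 * age + 0.002 * income + rng.normal(0, 5, 300)

# Raw features: coefficient magnitudes are driven by units, not importance
print("Coefficients without scaling:", Ridge(alpha=10).fit(X, y).coef_)

# Standardized features: coefficients are directly comparable effect sizes
X_std = StandardScaler().fit_transform(X)
print("Coefficients with scaling:   ", Ridge(alpha=10).fit(X_std, y).coef_)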

5. When Z-Score Normalization is Not Necessary

While Z-score normalization is helpful for many machine learning algorithms, it is not always required. Some algorithms are scale-invariant or perform better with features that are not normalized. For example:

  • Decision trees, random forests, and gradient boosting machines: These tree-based algorithms are not sensitive to the scale of the data, as they split on feature thresholds rather than distances or gradients. In these cases, normalization is unnecessary, and applying it only makes the split thresholds harder to read in the original units.
  • Naive Bayes: This algorithm is also typically unaffected by feature scaling, as it assumes independence between features and computes probabilities based on individual feature distributions.

6. Z-Score Normalization for Outlier Detection

Z-score normalization can also help detect outliers. Data points with a Z-score greater than 3 or less than -3 are typically treated as outliers, as they lie more than 3 standard deviations from the mean. Identifying outliers can be helpful in various machine learning tasks, such as:

  • Data cleaning: Removing or adjusting extreme outliers that could distort model performance.
  • Anomaly detection: Detecting rare events or anomalies that do not fit the general pattern of the data.

However, outliers can still influence the mean and standard deviation of small datasets. Therefore, it is important to evaluate whether Z-score normalization is appropriate for your specific data.

Z-score normalization is a key step in many machine learning workflows, especially for models that rely on distance-based calculations or gradient descent optimization. It helps to ensure that all features are treated equally, improves convergence in optimization algorithms, and can enhance model performance in techniques like PCA. However, it is essential to remember that not all algorithms require normalization, so it’s crucial to understand the nature of your model and data before deciding to apply Z-score normalization.

The following sections will walk through implementing Z-score normalization in Python and then discuss some challenges and considerations when using it on real-world datasets.

Implementing Z-Score Normalization in Python

In this section, we’ll walk through implementing Z-score normalization in Python using two common approaches: NumPy for manual calculation and scikit-learn, a popular machine learning library, for automatic standardization.

1. Z-Score Normalization with NumPy (Manual Calculation)

If you want to understand the process behind Z-score normalization, you can calculate it manually using NumPy. Here’s a simple example of how to do this:

Example Dataset

Let’s say we have a dataset with age and income features.

Person | Age | Income
A      | 22  | 20,000
B      | 25  | 30,000
C      | 30  | 40,000
D      | 35  | 50,000
E      | 40  | 60,000

We’ll standardize this data so that each feature has a mean of 0 and a standard deviation of 1.

Python Code

import numpy as np

# Example dataset
data = np.array([[22, 20000],
                 [25, 30000],
                 [30, 40000],
                 [35, 50000],
                 [40, 60000]])

# Calculate the mean and standard deviation for each column (feature)
mean = np.mean(data, axis=0)
std_dev = np.std(data, axis=0)

# Apply Z-score normalization: Z = (X - mean) / std_dev
normalized_data = (data - mean) / std_dev

# Display the results
print("Original Data:")
print(data)
print("\nMean:", mean)
print("Standard Deviation:", std_dev)
print("\nNormalized Data:")
print(normalized_data)
  • We use np.mean(data, axis=0) to calculate the mean of each column (feature).
  • We use np.std(data, axis=0) to calculate the standard deviation for each feature.
  • We then apply the Z-score formula Z = (X - mean) / std_dev to normalize each data point.

The output will show the original data, the calculated mean and standard deviation, and the normalized data (with a mean of 0 and standard deviation of 1).

2. Z-Score Normalization with Scikit-Learn (Using StandardScaler)

For convenience, scikit-learn provides a built-in method to perform Z-score normalization through the StandardScaler class. This method automates the mean and standard deviation calculation and applies the Z-score formula.

Python Code Using Scikit-Learn:

from sklearn.preprocessing import StandardScaler
import numpy as np

# Example dataset
data = np.array([[22, 20000],
                 [25, 30000],
                 [30, 40000],
                 [35, 50000],
                 [40, 60000]])

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler and transform the data
normalized_data = scaler.fit_transform(data)

# Display the results
print("Original Data:")
print(data)
print("\nNormalized Data:")
print(normalized_data)

# You can also access the mean and standard deviation used for normalization
print("\nMean of the original data:")
print(scaler.mean_)
print("\nStandard deviation of the original data:")
print(scaler.scale_)
  • The StandardScaler() is initialized, and then the fit_transform() method is called to fit the model to the data and apply the normalization in one step.
  • The scaler object’s mean_ and scale_ attributes hold the mean and standard deviation values used during the transformation.

This approach is more efficient, especially when working with larger datasets, and is widely used in machine learning pipelines.

Output

For the example dataset, the output looks like this (values rounded for readability):

Original Data:
[[   22  20000]
 [   25  30000]
 [   30  40000]
 [   35  50000]
 [   40  60000]]

Normalized Data:
[[-1.2864 -1.4142]
 [-0.827  -0.7071]
 [-0.0613  0.    ]
 [ 0.7044  0.7071]
 [ 1.4702  1.4142]]

Mean of the original data:
[   30.4 40000. ]

Standard deviation of the original data:
[    6.5299 14142.1356]

3. Visualizing Z-Score Normalization

It’s helpful to visualize how Z-score normalization affects the distribution of data. Let’s plot the original data and the normalized data.

Python Code for Visualization:

import matplotlib.pyplot as plt

# Original data visualization
plt.subplot(1, 2, 1)
plt.title("Original Data")
plt.scatter(data[:, 0], data[:, 1], c='blue', label='Original Data')
plt.xlabel("Age")
plt.ylabel("Income")

# Normalized data visualization
plt.subplot(1, 2, 2)
plt.title("Normalized Data")
plt.scatter(normalized_data[:, 0], normalized_data[:, 1], c='green', label='Normalized Data')
plt.xlabel("Age (Normalized)")
plt.ylabel("Income (Normalized)")

plt.tight_layout()
plt.show()
  • The first plot shows the original data, where the scales of age and income are vastly different.
  • The second plot shows the normalized data, where both features have been transformed to the same scale, making them comparable.

In this section, we walked through two ways of performing Z-score normalization in Python: using NumPy for manual calculation and scikit-learn for an automated approach. We also covered how to visualize the normalization process and demonstrated how Z-score normalization helps standardize datasets for machine learning models.

Applying Z-score normalization ensures that your data is appropriately scaled, improving model performance and allowing algorithms to treat all features fairly, without any single feature dominating because of its scale.

Challenges and Considerations in Z-Score Normalization

While Z-score normalization is a widely used technique for preparing data, it does come with specific challenges. It is essential to understand when and how to apply it, and where its limitations lie.

1. Impact of Outliers on Mean and Standard Deviation

One of the biggest challenges of Z-score normalization is its sensitivity to outliers. Since Z-scores are calculated based on the mean and standard deviation, extreme values can significantly affect these statistics. This could lead to an inaccurate representation of the “typical” data distribution, resulting in skewed Z-scores.

Example

If you have a dataset of people’s ages, and one of the ages is 200 years, it will disproportionately influence the mean and standard deviation. This would cause most of the Z-scores to be compressed into a narrow range, masking essential differences between most data points.

Solution

  • Handling outliers before normalization: You can remove or adjust outliers before applying Z-score normalization. Methods like IQR (Interquartile Range) or using a Robust Scaler can help mitigate the impact of outliers.
  • Using alternative normalization techniques: For datasets with significant outliers, other techniques like Min-Max normalization or Robust Scaler (which uses the median and interquartile range instead of mean and standard deviation) may be more appropriate.

2. Assumption of Normal Distribution

Z-score normalization assumes that the data is approximately normally distributed. However, in practice, many datasets are skewed or multi-modal, and the normalization may not be effective if the underlying data distribution significantly deviates from normality.

Example

Consider a dataset of customer spending in which most customers make small purchases, but a few make large purchases. Z-score normalization would not be ideal here, as it might distort the effect of the bulk of the customer data by emphasizing the outliers.

Solution

  • Transforming the data: If your data is skewed, consider applying a log transformation or another method to make the distribution more symmetric before applying Z-score normalization, as sketched below.
  • Other normalization techniques: In cases where normality cannot be assumed, methods like Quantile Transformation or Robust Scaler might be more suitable.
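
As a minimal sketch of the first option, assuming a made-up right-skewed spending column, you can log-transform before standardizing:

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical right-skewed spending data: many small purchases, a few very large ones
spend = rng.lognormal(mean=3, sigma=1, size=1000).reshape(-1, 1)

# Log-transform first to pull in the long right tail, then Z-score the result
spend_scaled = StandardScaler().fit_transform(np.log1p(spend))

print(f"{spend_scaled.mean():.3f}", f"{spend_scaled.std():.3f}")  # ≈ 0.000 and 1.000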

3. Data Leakage in Time-Series Data

Z-score normalization can be problematic in time-series data, especially when future data points are used to calculate the mean and standard deviation for the entire dataset. This can lead to data leakage, where future information influences the current model, which may overstate its performance.

Example

When applying Z-score normalization to time-series data (e.g., stock prices, sales trends), calculating the mean and standard deviation over the entire time period might inadvertently include future data points when normalizing past data.

Solution

  • Normalize within a training window: For time-series data, always compute the mean and standard deviation using only past data or the data available up to the current time step to avoid data leakage.
  • Rolling window: You can also use a rolling window for normalization, which calculates the mean and standard deviation over a fixed time window, ensuring that future values are not involved in the normalization process (see the sketch below).
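
A minimal pandas sketch of the rolling-window idea (with a made-up daily sales series): each point is standardized using only the 30 days before it, so no future information leaks into the transformation:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical daily sales series
sales = pd.Series(rng.normal(100, 20, 365).cumsum(),
                  index=pd.date_range("2024-01-01", periods=365))

# Rolling mean/std over the previous 30 days (shifted so the current day is excluded)
window = 30
rolling_mean = sales.rolling(window).mean().shift(1)
rolling_std = sales.rolling(window).std().shift(1)

sales_z = (sales - rolling_mean) / rolling_std
print(sales_z.dropna().head())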

4. Interpretability of Data

Z-score normalization transforms the data into a scale where values have no inherent meaning or unit. This can be a problem when the original units of measurement are essential for interpretation or reporting.

Example

When working with data that should be presented in its original form (such as financial transactions, temperature, or physical measurements), Z-scores may obscure the meaning of the data.

Solution

  • Keep original data for reporting: If interpretability is crucial, you may choose not to normalize certain features, or you may opt for Min-Max normalization, which preserves the original scale within a fixed range (e.g., 0 to 1).
  • Back transformation: After model training, you can reverse the normalization by using the original mean and standard deviation to “back-transform” the results into the original scale (see the sketch below).
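
With scikit-learn's StandardScaler, this back transformation is a one-liner via inverse_transform, since the fitted scaler remembers the original mean and standard deviation:

import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[22, 20_000], [25, 30_000], [30, 40_000],
                 [35, 50_000], [40, 60_000]], dtype=float)

scaler = StandardScaler()
normalized = scaler.fit_transform(data)

# Map Z-scored values back to the original units for reporting
restored = scaler.inverse_transform(normalized)
print(np.allclose(restored, data))  # True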

5. Handling Sparse Data

Z-score normalization is less effective on sparse data, where most values are zero or missing. Sparse datasets are common in fields like natural language processing and recommendation systems, where most entries are empty.


Example

In a customer-product interaction matrix (where many users have not interacted with many products), applying Z-score normalization could lead to misleading results, as the abundance of zero values would affect most features’ mean and standard deviation.

Solution

  • Sparse matrix normalization: In such cases, specialized approaches, such as imputing missing values or scaling only the non-zero entries, may be more effective (see the sketch after this list).
  • Feature engineering: Sometimes, transforming features into a more informative form, such as binary (e.g., interaction or no interaction), can improve performance without needing Z-score normalization.
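
One concrete option, sketched below on a made-up interaction matrix, is scikit-learn's MaxAbsScaler, which scales each column by its maximum absolute value and therefore leaves zeros untouched and keeps the matrix sparse:

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical sparse user-item interaction counts (mostly zeros)
interactions = csr_matrix(np.array([[0, 3, 0, 0],
                                    [1, 0, 0, 2],
                                    [0, 0, 5, 0]], dtype=float))

# Each column is divided by its max absolute value; zeros stay zero
scaled = MaxAbsScaler().fit_transform(interactions)
print(scaled.toarray())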

6. The Need for Recalculation of Parameters on New Data

When using Z-score normalization in a machine learning pipeline, updating the normalization parameters (mean and standard deviation) as new data arrives is crucial. Failure to do so can lead to inconsistent results, especially when working with dynamic datasets.

Example

If you’re using a model that receives new data over time (e.g., an online store collecting more customer data), the mean and standard deviation of the dataset will shift as new entries are added. The model may make incorrect predictions on the latest data if the normalization parameters are not recalculated.

Solution

  • Update parameters periodically: Ensure that you recalculate the mean and standard deviation of the dataset when new data arrives, especially in live systems or production environments.
  • Store normalization parameters: Save the mean and standard deviation used to train the model and apply those parameters to new data during inference to ensure consistency (see the sketch below).
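
A minimal sketch of the second point, assuming a hypothetical file name of scaler.joblib: fit the scaler on the training data, persist it, and reuse exactly the same parameters at inference time:

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only
train = np.array([[22, 20_000], [25, 30_000], [30, 40_000],
                  [35, 50_000], [40, 60_000]], dtype=float)
scaler = StandardScaler().fit(train)

# Persist the fitted mean/std alongside the model
joblib.dump(scaler, "scaler.joblib")

# Later, at inference time: load the stored parameters and apply them to new data
loaded = joblib.load("scaler.joblib")
print(loaded.transform(np.array([[28, 45_000]])))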

7. Choice of Normalization Technique

Finally, Z-score normalization may not always be the best option for every situation. Before deciding on a normalization approach, it’s essential to evaluate the characteristics of your data and the model you’re using.

Solution

Alternative techniques: Consider Min-Max scaling, Robust Scaler, or Log Transformation, depending on the nature of the data.

  • Min-Max scaling works well when you need data within a fixed range (e.g., [0, 1]) and are not concerned about outliers.
  • Robust Scaler is better for datasets with outliers, as it uses the median and interquartile range (IQR), making it less sensitive to extreme values.

Z-score normalization is a powerful tool for data preprocessing, but like all techniques, it has limitations. Before applying this technique, it is crucial to carefully consider the nature of your data, the presence of outliers, the distribution of the features, and the model you’re using. By understanding these challenges and taking steps to address them, you can ensure that your data is properly prepared and your model performs optimally.

Conclusion

Z-score normalization is a powerful and widely used technique for scaling data in machine learning. Transforming features to have a mean of 0 and a standard deviation of 1 ensures that all features contribute equally to model training. This is especially important for algorithms that rely on distance metrics or gradient-based optimization. In many cases, Z-score normalization leads to improved model performance, faster convergence, and regularization that treats all features fairly.

However, like any data preprocessing technique, Z-score normalization has its challenges. Outliers, skewed data, and time-series data can complicate its application, and it may not always be suitable for every dataset or model type. It’s essential to assess the characteristics of your data and the needs of your machine learning model before choosing Z-score normalization. Alternative scaling techniques like Min-Max normalization or Robust Scaler may be more appropriate for datasets with outliers or non-normal distributions.

Ultimately, the choice of normalization method should be driven by the nature of your data and the specific requirements of the machine learning model you are using. With careful consideration and handling of edge cases, Z-score normalization can effectively prepare your data for successful model training and prediction.

About the Author

Neri Van Otten

Neri Van Otten is the founder of Spot Intelligence, a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation. Dedicated to making your projects succeed.
