The F1 score is a metric commonly used to evaluate the performance of binary classification models.
It summarizes a model’s predictive performance by taking into account both precision and recall.
The F1 score is the harmonic mean of precision and recall, and it is calculated as follows:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The F1 score ranges from 0 to 1, with a score of 1 indicating perfect precision and recall and a score of 0 indicating that precision or recall is zero.
A high F1 score means that the model has high precision and high recall, which indicates that it is a good model for the binary classification task.
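As a quick illustration, the harmonic mean can be computed in a few lines of plain Python; the precision and recall values below are invented purely to show the formula at work.
# Invented precision and recall values, only to illustrate the formula
precision = 0.8
recall = 0.6
# F1 is the harmonic mean of precision and recall
f1 = 2 * (precision * recall) / (precision + recall)
print(f1)  # ≈ 0.686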
Precision and recall are two commonly used metrics for evaluating the performance of classification models.
Precision measures the proportion of true positive predictions among all positive predictions made by the model. It is calculated as:
Precision = TP / (TP + FP)
Recall measures the proportion of true positive predictions among all actual positive cases in the dataset. It is calculated as:
Recall = TP / (TP + FN)
Where:
TP (true positives) is the number of positive cases the model correctly predicts as positive
FP (false positives) is the number of negative cases the model incorrectly predicts as positive
FN (false negatives) is the number of positive cases the model incorrectly predicts as negative
Precision measures the accuracy of the model’s positive predictions, while recall measures how well the model identifies positive cases. Both metrics are essential for evaluating the performance of classification models, and they are often used together in the F1 score to provide a more complete picture of the model’s performance.
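As a minimal sketch, the two formulas translate directly into code; the counts below are invented for illustration.
# Invented counts of true positives, false positives, and false negatives
tp = 80
fp = 20
fn = 40
precision = tp / (tp + fp)  # 0.8: how many predicted positives are correct
recall = tp / (tp + fn)     # ≈ 0.667: how many actual positives are found
print(precision, recall)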
Using just one of the metrics can be problematic; let us look at some examples.
Consider a spam email classification problem that aims to identify whether an email is spam. Suppose we have a dataset of 1000 emails, of which 100 are spam. We train a model and use it to predict whether each email is spam. The model correctly identifies all 100 spam emails and predicts 50 non-spam emails as spam.
If we evaluate the model using precision, we get a precision score of 100/150 ≈ 0.67. This number already reflects the 50 false positives, but on its own it says nothing about false negatives: precision alone cannot tell us whether the model is also missing spam emails. A single metric therefore does not provide a comprehensive evaluation of the model’s performance.
In such cases, it is vital to also look at recall or at the F1 score, which takes both the model’s false positives and false negatives into account.
In the example above, the recall score would be 100/100 = 1, indicating that the model correctly identifies all the spam emails. The F1 score, which combines precision and recall, works out to 2 × (0.67 × 1) / (0.67 + 1) ≈ 0.8: lower than the perfect recall because of the high number of false positives, but higher than precision because no spam emails are missed.
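To make this concrete, here is a small sketch that plugs the counts from the spam example (100 true positives, 50 false positives, no false negatives) into the formulas.
# Counts from the spam example above
tp = 100  # spam emails correctly flagged as spam
fp = 50   # non-spam emails incorrectly flagged as spam
fn = 0    # spam emails the model missed
precision = tp / (tp + fp)                             # 100/150 ≈ 0.67
recall = tp / (tp + fn)                                # 100/100 = 1.0
f1 = 2 * (precision * recall) / (precision + recall)   # = 0.8
print(precision, recall, f1)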
Accuracy and F1 score are metrics commonly used to evaluate the performance of classification models. While accuracy measures the proportion of correct predictions among all predictions, the F1 score combines both precision and recall to provide a more comprehensive evaluation of the model’s performance.
Accuracy is a helpful metric when the classes in the dataset are balanced, meaning that the number of positive and negative examples is roughly equal. In such cases, accuracy can reasonably estimate the model’s overall performance.
However, accuracy can be misleading when the classes are imbalanced, meaning that one class has significantly more examples. In such cases, the model may achieve high accuracy by simply predicting the majority class all the time while completely failing to identify the minority class.
The F1 score is a better metric to use when the classes in the dataset are imbalanced or when the cost of false positives and false negatives is not equal. It provides a balanced measure of the model’s performance that takes both false positives and false negatives into account.
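The following sketch, using invented labels, shows how a model that always predicts the majority class can look excellent on accuracy while its F1 score collapses to zero.
from sklearn.metrics import accuracy_score, f1_score
# Invented imbalanced labels: 95 negatives and 5 positives
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority (negative) class
y_pred = [0] * 100
print(accuracy_score(y_true, y_pred))  # 0.95, which looks impressive
# No positives are found, so F1 is 0 (scikit-learn may warn that precision is undefined)
print(f1_score(y_true, y_pred))        # 0.0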
So when should you use accuracy vs F1 score?
Accuracy is a good metric when the classes in the dataset are balanced, while the F1 score is a better choice when the classes are imbalanced or when the cost of false positives and false negatives is not equal.
The F1 score is a metric used to evaluate the overall performance of a binary classification model. As the harmonic mean of precision and recall, it ranges from 0 to 1, with a score of 1 indicating perfect precision and recall and 0 indicating poor performance.
The interpretation of the F1 score depends on the specific problem and context. In general, a higher F1 score indicates better performance of the model. However, what constitutes a “good” or “acceptable” F1 score can vary depending on the specific domain, the application, and the consequences of errors.
For example, in a medical diagnosis task where false negatives are more costly than false positives (i.e., it is more important to avoid missing a disease than to raise a false alarm), a high recall score is more important than precision. On the other hand, in a spam filtering task where false positives are more costly than false negatives (i.e., it is better to let a spam email through than to mistakenly send a legitimate email to the spam folder), a high precision score is more important than recall. In both cases, the F1 score is a useful summary only if weighting precision and recall equally suits the application.
As a result, the F1 score provides a single value that summarizes the overall performance of a binary classification model, but its interpretation depends on the specific problem and context.
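When one of the two errors matters more, as in the examples above, the F-beta score generalizes F1 by weighting recall beta times as much as precision; scikit-learn exposes this as fbeta_score. Below is a brief sketch with invented labels.
from sklearn.metrics import fbeta_score
# Invented labels, purely for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
# beta > 1 weights recall more heavily (e.g. medical diagnosis)
print(fbeta_score(y_true, y_pred, beta=2))    # ≈ 0.53
# beta < 1 weights precision more heavily (e.g. spam filtering)
print(fbeta_score(y_true, y_pred, beta=0.5))  # = 0.625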
A good F1 score depends on the specific problem and context; no one-size-fits-all answer exists. However, as a general rule of thumb, an F1 score of 0.7 or higher is often considered good.
In some applications, a higher F1 score may be required, particularly if precision and recall are both essential and a high cost is associated with false positives and false negatives. In other applications, a lower F1 score may be acceptable if the cost of misclassification is low or if other metrics, such as accuracy or ROC-AUC, are more important.
It is crucial to remember that the F1 score should not be considered in isolation but should be evaluated along with other metrics and factors, such as the dataset characteristics, the problem complexity, and the cost of misclassification.
A low F1 score indicates poor overall performance of a binary classification model. It can be caused by a variety of factors, including:
An imbalanced dataset, where one class heavily outnumbers the other
A model that is not appropriate for the task or whose hyperparameters are not correctly tuned
Features that are inadequate or not informative enough for the task
To improve the F1 score, it is essential to identify the cause of the poor performance and take appropriate steps to address it.
For example, if the dataset is imbalanced, techniques such as oversampling or undersampling can balance the classes. If the model is not appropriate or not correctly tuned, alternative models or hyperparameter tuning can be explored. Finally, if the features are inadequate, feature engineering or selection can be performed to identify more relevant features for the task.
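Besides resampling, a closely related option in scikit-learn is to reweight the minority class during training via the class_weight parameter. The sketch below uses a synthetic imbalanced dataset purely for illustration; it is one possible approach, not the only one.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
# Synthetic imbalanced dataset (about 10% positives), just for illustration
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# class_weight='balanced' upweights the minority class during training
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
print(f1_score(y_test, model.predict(X_test)))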
A high F1 score indicates good overall performance of a binary classification model. It means the model can correctly identify the positive cases while keeping both false positives and false negatives low.
A high score can be achieved through various factors, including:
A well-prepared, reasonably balanced dataset
A model that is appropriate for the task and has well-tuned hyperparameters
Features that are informative and relevant for distinguishing the classes
It is important to note that a high F1 score does not guarantee perfect performance, and it is still possible for the model to make errors. Moreover, what counts as a “high” F1 score depends on the specific problem and context. In some cases, a slightly lower F1 score may still be acceptable if it balances the trade-off between precision and recall appropriately for the specific task.
A confusion matrix is a table used to evaluate the performance of a classifier or a machine learning model. It shows the number of correct and incorrect predictions made by the model for each class.
The matrix rows represent the actual or true class labels, while the columns represent the predicted class labels. Therefore, the diagonal elements of the matrix represent the number of correct predictions, while the off-diagonal elements represent the number of incorrect predictions.
Here’s an example of a confusion matrix for a binary classification problem:
                  Predicted positive   Predicted negative
Actual positive           50                   10
Actual negative           20                   70
In this example, the model correctly predicted 50 positive and 70 negative examples. However, it incorrectly predicted 20 negative examples as positive (false positives) and 10 positive examples as negative (false negatives).
A confusion matrix can also be used to compute performance metrics such as precision, recall, accuracy, and F1 score. These metrics provide more detailed insights into the model’s performance beyond the simple accuracy metric.
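For example, plugging the counts from the table above into the standard formulas (a sketch of the arithmetic only, no library calls needed):
# Counts from the example confusion matrix above
tp, fn = 50, 10  # actual positives: correctly and incorrectly predicted
fp, tn = 20, 70  # actual negatives: incorrectly and correctly predicted
accuracy = (tp + tn) / (tp + tn + fp + fn)             # 120/150 = 0.8
precision = tp / (tp + fp)                             # 50/70 ≈ 0.71
recall = tp / (tp + fn)                                # 50/60 ≈ 0.83
f1 = 2 * (precision * recall) / (precision + recall)   # ≈ 0.77
print(accuracy, precision, recall, f1)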
Here are some benefits of using the F1 score:
It balances precision and recall in a single number, so neither false positives nor false negatives are ignored
It summarizes performance in one value, which makes it straightforward to interpret and to compare models
It is more informative than accuracy when the classes in the dataset are imbalanced
Overall, the F1 score is a helpful metric for evaluating the performance of binary classification models, providing a balanced evaluation of both precision and recall and allowing for straightforward interpretation and comparison of models.
While the F1 score is a popular and widely used metric for evaluating binary classification models, it has some disadvantages that should be considered:
It assumes that precision and recall are equally important, which is not true for every problem
As a single summary number, it is not sensitive to specific patterns in the data, such as which kinds of errors the model makes
It is defined for binary classification, so it has limitations in multiclass problems, where an averaging strategy must be chosen
While the F1 score is a popular and widely used metric for evaluating binary classification models, several other metrics can be used depending on the specific problem and context.
Here are some alternatives to the F1 score:
Accuracy, the proportion of correct predictions, which works well when the classes are balanced
Precision and recall reported separately, when one type of error matters more than the other
The F-beta score, which weights recall more or less heavily than precision
ROC-AUC, which measures how well the model ranks positive cases above negative ones across all thresholds
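Most of these alternatives are also available in sklearn.metrics; here is a brief sketch with invented labels and predicted probabilities.
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
# Invented labels and predicted probabilities, purely for illustration
y_true = [0, 0, 1, 1, 1, 0, 0, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1, 1]
y_prob = [0.1, 0.6, 0.4, 0.8, 0.9, 0.2, 0.7, 0.85]  # predicted probability of the positive class
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(roc_auc_score(y_true, y_prob))  # ROC-AUC uses scores or probabilities, not hard labels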
In Python, you can calculate the F1 score using the f1_score() function from the sklearn.metrics module. Here’s an example of how to use it:
from sklearn.metrics import f1_score
# true labels
y_true = [1, 0, 1, 1, 0, 1]
# predicted labels
y_pred = [1, 0, 0, 1, 1, 1]
# calculate F1 score
f1 = f1_score(y_true, y_pred)
print("F1 score: ", f1)
In this example, y_true is a list of true labels and y_pred is a list of predicted labels. The f1_score() function calculates the F1 score and returns it as a floating-point number; here both precision and recall are 3/4, so the score printed to the console is 0.75.
You can also specify additional arguments to the f1_score() function, such as the average method to use for multiclass classification problems or the beta value for the F-beta score. Check the documentation for more details on the available options.
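For a multiclass problem, for instance, you would typically pass average='macro', 'micro', or 'weighted'. A small sketch with invented labels:
from sklearn.metrics import f1_score
# Invented multiclass labels, purely for illustration
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]
# 'macro' averages the per-class F1 scores, treating every class equally
print(f1_score(y_true, y_pred, average="macro"))
# 'weighted' weights each class's F1 score by its number of true instances
print(f1_score(y_true, y_pred, average="weighted"))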
Next, you should have a look at the confusion matrix using Python and scikit-learn:
from sklearn.metrics import confusion_matrix
import numpy as np
# Generate some example data
y_true = np.array([0, 0, 1, 1, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 1, 1, 0, 1, 1])
# Compute the confusion matrix
cm = confusion_matrix(y_true, y_pred)
# Print the confusion matrix
print(cm)
This will output the following confusion matrix:
[[2 2]
[1 3]]
In this example, the rows correspond to the true class labels and the columns to the predicted class labels. The top-left element (2) is the number of true negatives (examples correctly predicted as negative), the top-right element (2) is the number of false positives (examples incorrectly predicted as positive), the bottom-left element (1) is the number of false negatives (examples incorrectly predicted as negative), and the bottom-right element (3) is the number of true positives (examples correctly predicted as positive).
Interested in a visually appealing, colourful confusion matrix?
Then check out Scikit-learn’s plotting script.
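As a minimal sketch, recent scikit-learn versions provide ConfusionMatrixDisplay, which renders the matrix as a colour-coded plot with matplotlib (the exact API may vary between versions):
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
y_true = [0, 0, 1, 1, 1, 0, 0, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1, 1]
cm = confusion_matrix(y_true, y_pred)
# Render the matrix as a colour-coded plot
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[0, 1]).plot()
plt.show()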
The F1 score is a popular and widely used metric for evaluating the performance of binary classification models. It combines precision and recall into a single score, considering the model’s performance across both classes.
However, it has some limitations, including its assumption of the equal importance of precision and recall, its lack of sensitivity to specific patterns in the data, and its potential limitations in multiclass classification problems.
When using the F1 score, it is vital to consider the specific problem and context and to select the appropriate metric based on the goals and requirements of the task.
A good F1 score depends on the specific situation and context, and there is no one-size-fits-all answer, but a score of 0.7 or higher is often considered to be a good benchmark.
Overall, it is a valuable and widely used metric, but it should be evaluated along with other metrics and factors to comprehensively evaluate the model’s performance.