Introduction
Text data is everywhere—from social media posts and customer reviews to emails and product descriptions. For data scientists and analysts, working with this unstructured form of data presents unique challenges and opportunities. Whether you are cleaning messy data, analysing trends, or extracting valuable insights from textual information, mastering text manipulation is a critical skill.
Fortunately, Pandas, one of Python’s most widely used data manipulation libraries, provides a robust set of tools designed to work with text data. From simple string operations like replacing characters to more complex tasks such as extracting patterns with regular expressions, Pandas makes handling text efficient and scalable.
In this blog post, we’ll walk you through various ways to work with text in Pandas, covering everything from basic string methods to advanced techniques for handling and transforming textual information. By the end, you’ll be equipped with practical knowledge to make sense of your text data, regardless of the dataset you’re working with.
Loading Text Data in Pandas
Before manipulating and analysing text data, load it into a Pandas DataFrame. Pandas offers several convenient methods to import data from various file formats, such as CSV, Excel, and plain text files. Let’s explore the most common ways to load text data into Pandas.
1. Loading Text Data from a CSV File in Pandas
One of the most common formats for text data is the CSV (Comma-Separated Values) file. You can load CSV files into Pandas using pd.read_csv(). This method reads the file and converts it into a DataFrame, representing each column as a separate series.
import pandas as pd
# Load a CSV file into a DataFrame
df = pd.read_csv('data.csv')
# View the first few rows of the DataFrame
print(df.head())
You can also specify additional parameters when loading a CSV file, such as a different delimiter (e.g., tabs or semicolons) or handling missing values:
# Load a CSV file with a different delimiter (e.g., semicolon-separated)
df = pd.read_csv('data.csv', delimiter=';')
# Load a CSV file and specify NA values
df = pd.read_csv('data.csv', na_values=['NA', 'None'])
2. Loading Text Data from an Excel File in Pandas
If your text data is stored in an Excel spreadsheet, Pandas provides the pd.read_excel() method. This method allows you to load data from a specific sheet or from multiple sheets within an Excel file.
# Load text data from an Excel file
df = pd.read_excel('data.xlsx')
# Load data from a specific sheet in the Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
3. Loading Data from Plain Text Files in Pandas
For files that contain unstructured text data (e.g., logs, articles, or paragraphs), you can use pd.read_fwf() (for fixed-width formatted text) or pd.read_table() (for delimited text data). If your text file is a list of lines, you may need to load it manually and structure it into a DataFrame.
# Load fixed-width formatted text file
df = pd.read_fwf('data.txt')
# Load a text file with a custom delimiter
df = pd.read_table('data.txt', delimiter='\t')
For very unstructured text files, you may want to read the file line by line and then manually process it into a DataFrame:
# Read a text file into a DataFrame
with open('textfile.txt') as f:
    lines = [line.rstrip('\n') for line in f]  # strip trailing newlines
# Create a DataFrame from the text lines
df = pd.DataFrame(lines, columns=['text'])
4. Loading Data from the Web in Pandas
Sometimes, you may want to load text data directly from a web URL. Pandas allows you to pass the URL directly into pd.read_csv() or pd.read_table(). This can be useful when working with public datasets available online.
# Load a CSV file from a URL
url = 'https://example.com/data.csv'
df = pd.read_csv(url)
5. Previewing and Exploring the Data in Pandas
After loading your text data into a DataFrame, it’s important to inspect it to make sure everything loaded correctly. Pandas provides several methods to explore the structure and content of your data:
- df.head(): View the first few rows.
- df.info(): Get an overview of the DataFrame, including data types and memory usage.
- df.columns: List all column names.
- df.describe(): Get summary statistics (useful for numerical data but can give insights into the structure of the text columns).
# Preview the first 5 rows of the DataFrame
print(df.head())
# Get information about the DataFrame
df.info()
# Check the column names
print(df.columns)
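Note that df.describe() defaults to numeric columns; to summarise text columns instead, you can restrict it to object dtype:
# Summarise text (object-dtype) columns: count, unique values, top value, and its frequency
print(df.describe(include='object'))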
Text String Methods in Pandas
When working with text data in Pandas, the .str accessor is your go-to tool for performing string operations on entire columns of a DataFrame. It lets you apply string methods to every element of a column efficiently, without writing explicit loops. Let’s explore some of the most common string methods you’ll use to manipulate text in Pandas.
1. Changing Case: Lowercase and Uppercase in Pandas
Standardising the case of your text data is often the first step in cleaning. Pandas offers the str.lower() and str.upper() methods to convert all text to lowercase or uppercase.
# Convert all text in 'column_name' to lowercase
df['column_name'] = df['column_name'].str.lower()
# Convert all text in 'column_name' to uppercase
df['column_name'] = df['column_name'].str.upper()
2. Removing Whitespace in Pandas
Extra whitespace can cause inconsistencies when working with text data. Pandas provides the str.strip() method to remove leading and trailing whitespace. You can also use str.lstrip() and str.rstrip() to remove only leading or trailing spaces.
# Remove leading and trailing whitespace from 'column_name'
df['column_name'] = df['column_name'].str.strip()
# Remove leading whitespace only
df['column_name'] = df['column_name'].str.lstrip()
# Remove trailing whitespace only
df['column_name'] = df['column_name'].str.rstrip()
3. Measuring String Length in Pandas
To find the length of each string in a column, use str.len(). This is particularly useful when filtering or categorising text based on its length.
# Get the length of each string in 'column_name'
df['string_length'] = df['column_name'].str.len()
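You can then filter on that length; for example, keeping only longer entries (the 100-character threshold here is arbitrary):
# Keep only rows whose text is longer than 100 characters
long_text_df = df[df['string_length'] > 100]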
4. Replacing Substrings in Pandas
If you need to replace specific substrings within your text data, use the str.replace() method. You can either replace all occurrences of a substring or use regular expressions for more complex patterns.
# Replace 'old_value' with 'new_value' in 'column_name'
df['column_name'] = df['column_name'].str.replace('old_value', 'new_value')
# Using regular expressions to replace patterns
df['column_name'] = df['column_name'].str.replace(r'\d+', '', regex=True) # Removes digits
5. Splitting and Joining Strings in Pandas
To split text into multiple parts (e.g., breaking a full name into first and last names), use str.split(). To join strings from multiple columns, use str.cat().
# Split strings in 'column_name' by a delimiter (e.g., space) and expand into multiple columns
df[['first_name', 'last_name']] = df['full_name'].str.split(' ', expand=True)
# Join two columns (e.g., first_name and last_name) with a space separator
df['full_name'] = df['first_name'].str.cat(df['last_name'], sep=' ')
6. Checking for Substrings in Pandas
To check whether a string contains a specific substring, you can use str.contains(). This is especially useful for filtering data based on text patterns.
# Filter rows where 'column_name' contains the word 'Python'
df[df['column_name'].str.contains('Python', case=False)]
7. Finding and Extracting Substrings in Pandas
To locate the position of a substring, use str.find(). If you want to extract specific parts of a string based on position, str.slice() or str.extract() with regular expressions can be helpful.
# Find the position of a substring in 'column_name'
df['substring_position'] = df['column_name'].str.find('keyword')
# Extracting a pattern using a regular expression
df['email_domain'] = df['email'].str.extract(r'@(\w+\.\w+)')
Filtering and Extracting Information in Pandas
Once you’ve loaded your text data into a Pandas DataFrame, the next step is filtering specific rows or extracting valuable information from the text. Whether you want to find rows containing particular keywords or extract patterns like dates, phone numbers, or email domains, Pandas offers powerful tools to make this easy.
1. Filtering Rows Based on Text Conditions
A common task when working with text data is filtering rows based on the content of a specific column. The str.contains() method allows you to filter rows where a string contains a particular substring or matches a regular expression.
Example: Filtering Rows Containing a Specific Word
# Filter rows where 'column_name' contains the word 'Python' (case-insensitive)
filtered_df = df[df['column_name'].str.contains('Python', case=False)]
# View the filtered DataFrame
print(filtered_df.head())
This method can also be combined with other DataFrame methods, allowing you to handle complex conditions.
Example: Filtering Based on Multiple Conditions
# Filter rows where 'column_name' contains 'Python' and 'status' equals 'Active'
filtered_df = df[(df['column_name'].str.contains('Python', case=False)) & (df['status'] == 'Active')]
You can also use str.startswith() and str.endswith() to filter rows based on how the string begins or ends.
# Filter rows where 'column_name' starts with 'Data'
df[df['column_name'].str.startswith('Data')]
# Filter rows where 'column_name' ends with 'Science'
df[df['column_name'].str.endswith('Science')]
2. Extracting Substrings
Sometimes, you’ll need to extract specific parts of a string. Pandas offers several methods, including str.slice() and str.extract().
Using str.slice() to Extract by Position
If you know the exact position of the text you want to extract (e.g., the first 5 characters), you can use str.slice():
# Extract the first 5 characters of each string in 'column_name'
df['first_five_chars'] = df['column_name'].str.slice(0, 5)
# Extract characters from position 5 to 10
df['middle_chars'] = df['column_name'].str.slice(5, 10)
Using str.extract() with Regular Expressions
For more complex extraction tasks, such as extracting patterns like email domains or phone numbers, you can use regular expressions with the str.extract() method.
# Extract email domains from 'email' column
df['email_domain'] = df['email'].str.extract(r'@(\w+\.\w+)')
# Extract phone numbers using a regular expression pattern
df['phone_number'] = df['contact_info'].str.extract(r'(\d{3}-\d{3}-\d{4})')
In these examples, the regular expression defines the pattern you want to extract. For instance, the @(\w+\.\w+) pattern captures the part of the email address that follows the @ symbol.
3. Extracting Multiple Matches with str.findall()
If your data contains multiple instances of a pattern within a single string, you can use the str.findall() method to extract all matches. This method returns a list of matches for each string.
Example: Extracting All Hashtags from a Text Column
# Extract all hashtags from a 'text' column
df['hashtags'] = df['text'].str.findall(r'#\w+')
# View the extracted hashtags
print(df[['text', 'hashtags']])
This returns a new column where each cell contains a list of all the hashtags found in the text.
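Since each cell now holds a list, you can flatten the column with explode() to count how often each hashtag appears across the dataset:
# Flatten the lists so each hashtag gets its own row, then count occurrences
hashtag_counts = df['hashtags'].explode().value_counts()
print(hashtag_counts.head())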
4. Filtering Rows with Missing or Non-Text Data
Text data often contains missing or null values, which can interfere with filtering or extraction tasks. Pandas provides ways to handle this, such as filtering out rows with missing values using notna() or isna().
# Filter rows where 'column_name' is not null
df_filtered = df[df['column_name'].notna()]
# Filter rows where 'column_name' is null
df_null = df[df['column_name'].isna()]
5. Combining Filtering and Extraction
Often, filtering and extraction go hand in hand. You might first filter rows based on a condition and then extract specific information from the filtered data.
Example: Filtering Emails and Extracting Domains
# Filter rows where 'email' contains a valid email format
valid_emails = df[df['email'].str.contains(r'^\w+@\w+\.\w+$', na=False)].copy()  # .copy() avoids SettingWithCopyWarning when assigning below
# Extract domains from the filtered valid emails
valid_emails['email_domain'] = valid_emails['email'].str.extract(r'@(\w+\.\w+)')
Handling Missing Text Data In Pandas
Missing data is a common challenge when working with real-world datasets, especially text fields. Whether text entries are partially missing, completely absent, or contain irregular values, it’s essential to handle these issues properly to ensure the integrity of your analysis. Pandas offers several methods to manage missing text data, allowing you to clean your dataset and handle gaps effectively.
1. Identifying Missing Text Data in Pandas
Before dealing with missing values, you must identify which entries in your DataFrame are missing. In Pandas, missing values are typically represented as NaN (Not a Number), and you can use the following methods to detect them.
Using isna() and notna()
The isna() method allows you to check for NaN values, while notna() returns the opposite (non-missing values).
# Check for missing values in 'text_column'
missing_values = df['text_column'].isna()
# View rows where text data is missing
df_missing = df[missing_values]
# Check for non-missing values
df_non_missing = df['text_column'].notna()
You can apply this method to the entire DataFrame to see where missing values are located across all columns.
# Check for missing values in the entire DataFrame
df.isna().sum()
This will give you the count of missing values in each column, which is particularly useful when determining how much of your text data is affected by missing entries.
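If proportions are easier to interpret than counts, the same check can be expressed as a percentage per column:
# Percentage of missing values in each column
print(df.isna().mean().mul(100).round(1))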
2. Filling Missing Text Data in Pandas
Once you’ve identified the missing data, you must decide how to handle it. One common approach is to fill missing values with a placeholder or impute them with meaningful data. Pandas offers the fillna() method for this purpose.
Filling with a Placeholder
If you want to simply fill missing text fields with a placeholder (such as “Unknown” or “N/A”), you can do so with fillna():
# Fill missing text values in 'text_column' with 'Unknown'
df['text_column'] = df['text_column'].fillna('Unknown')
Filling with Mode or Other Values
Sometimes, you may want to fill in missing values with the most frequent (mode) text value or a specific value based on domain knowledge.
# Fill missing values with the most frequent value (mode)
most_frequent = df['text_column'].mode()[0]
df['text_column'] = df['text_column'].fillna(most_frequent)
3. Dropping Rows or Columns with Missing Text in Pandas
If a large portion of a column or row is missing, and you believe that filling the gaps wouldn’t add meaningful information, you can drop the affected rows or columns entirely using dropna().
Dropping Rows with Missing Text
To remove rows where a specific column contains missing text, you can specify the column in subset:
# Drop rows where 'text_column' is missing
df_cleaned = df.dropna(subset=['text_column'])
If multiple columns are affected, you can drop rows that have missing values in any of the specified columns:
# Drop rows where any of the specified columns are missing
df_cleaned = df.dropna(subset=['text_column', 'another_column'])
Dropping Columns with Many Missing Values
If a column has too many missing values, you may decide to drop the entire column:
# Drop columns where all values are missing
df_cleaned = df.dropna(axis=1, how='all')
# Keep only columns with at least 70% non-missing values (thresh must be an integer count)
df_cleaned = df.dropna(axis=1, thresh=int(0.7 * len(df)))
4. Handling Missing Text in Complex Datasets in Pandas
In more complex cases, you might want to handle missing values conditionally. For example, if a specific value in one column implies missing data in another, you can apply custom logic using apply() or mask().
Example: Filling Missing Text Conditionally
If missing text in a column depends on the value of another column, you can use apply() or where() for conditional logic:
# Fill missing text conditionally based on another column
df['text_column'] = df['text_column'].fillna(
    df['related_column'].apply(lambda x: 'Placeholder' if x == 'Condition' else 'Other Placeholder')
)
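The same conditional fill can also be written without apply(), using mask() to target only the rows that meet the condition (same hypothetical column names as above):
# Fill missing values with 'Placeholder' where 'related_column' equals 'Condition'
cond = df['text_column'].isna() & (df['related_column'] == 'Condition')
df['text_column'] = df['text_column'].mask(cond, 'Placeholder')
# Fill any remaining missing values with the fallback
df['text_column'] = df['text_column'].fillna('Other Placeholder')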
5. Interpolating or Forward/Backward Filling in Pandas
For cases where text data follows a pattern or sequence, you can use forward-fill or backward-fill methods to propagate the last valid entry to the next missing one.
Forward-Fill (ffill()) and Backward-Fill (bfill())
# Forward-fill missing values in 'text_column'
df['text_column'] = df['text_column'].ffill()
# Backward-fill missing values in 'text_column'
df['text_column'] = df['text_column'].bfill()
Forward-fill is particularly useful in time-series datasets or when the data logically follows a sequence where previous values can fill the gaps.
Applying Custom Functions to Text Data in Pandas
While Pandas offers a wide array of built-in string methods, you often need to perform more specific or complex operations on your text data that go beyond standard functions. In such cases, you can apply custom functions to manipulate, clean, or transform text data according to your unique requirements. Pandas provides several ways to apply custom functions efficiently, including apply(), map(), and lambda functions.
1. Using apply() to Apply Custom Functions
The apply() method allows you to apply any custom function to each element of a DataFrame column. This is particularly useful when dealing with complex transformations that cannot be achieved using built-in methods.
Example: Applying a Custom Cleaning Function
Suppose you want to standardise and clean text data by performing multiple operations, such as removing punctuation and converting text to lowercase. To achieve this, you can create a custom function.
import string
# Define a custom function to clean text
def clean_text(text):
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Convert to lowercase
    text = text.lower()
    return text
# Apply the custom function to the 'text_column'
df['cleaned_text'] = df['text_column'].apply(clean_text)
# View the result
print(df[['text_column', 'cleaned_text']].head())
The apply() method iterates over each column element and applies the clean_text() function, returning the transformed text in a new column.
2. Using lambda Functions for Simple Operations
You can use lambda functions with apply() for simple one-line operations. This is useful for quick manipulations that don’t require a separate function definition.
Example: Shortening Text to a Fixed Length
If you want to limit the text length to a specific number of characters, you can use a lambda function to slice the string.
# Shorten text to 50 characters using a lambda function
df['short_text'] = df['text_column'].apply(lambda x: x[:50])
# View the shortened text
print(df[['text_column', 'short_text']].head())
This method is ideal for quick operations that are easy to define in-line.
3. Applying Functions to Multiple Columns
You can apply custom functions to multiple columns by passing a function to apply() across axis=1. This is useful when your function depends on more than one column.
Example: Combining Text from Multiple Columns
Let’s say you want to concatenate two columns, such as first and last names while ensuring proper formatting (e.g., title case for each name).
# Define a custom function to concatenate first and last names
def format_full_name(row):
    return f"{row['first_name'].title()} {row['last_name'].title()}"
# Apply the function across rows (axis=1)
df['full_name'] = df.apply(format_full_name, axis=1)
# View the result
print(df[['first_name', 'last_name', 'full_name']].head())
Here, the apply() method applies the format_full_name() function across each row, and axis=1 specifies that the function should be applied row-wise rather than column-wise.
4. Using map() for Element-Wise Operations
The map() function is another useful method for applying custom transformations to a Pandas Series. It’s similar to apply() but works directly on Series objects and is often slightly faster for simple element-wise operations.
Example: Mapping Values to Custom Categories
Suppose you have a text column with categories like “Beginner,” “Intermediate,” and “Advanced,” and you want to map these categories to numerical values.
# Define a mapping dictionary
category_map = {'Beginner': 1, 'Intermediate': 2, 'Advanced': 3}
# Apply the map() function to map categories to numbers
df['category_numeric'] = df['category_column'].map(category_map)
# View the result
print(df[['category_column', 'category_numeric']].head())
This method efficiently transforms text-based categories into numeric values for easier analysis.
5. Using applymap() for Element-Wise Operations on the Entire DataFrame
If you need to apply a function element-wise to an entire DataFrame, you can use applymap(). This is useful when applying the same transformation to every element in a DataFrame, rather than just a single column.
Example: Stripping Extra Whitespace from All String Columns
# Apply a lambda function to strip whitespace from every element in the DataFrame
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
# View the cleaned DataFrame
print(df.head())
This ensures all string values in the DataFrame have their leading and trailing whitespace removed.
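One caveat: applymap() was deprecated in pandas 2.1 in favour of DataFrame.map(), which accepts the same element-wise function, so on recent versions you would write:
# pandas >= 2.1: DataFrame.map() is the element-wise replacement for applymap()
df = df.map(lambda x: x.strip() if isinstance(x, str) else x)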
6. Handling Errors in Custom Functions with try-except
When applying custom functions to text data, you might encounter errors due to unexpected data formats or missing values. You can handle these gracefully by incorporating try-except blocks into your functions.
Example: Handling Errors in a Custom Function
If your text data includes numeric values or null entries that could raise errors, you can handle these cases with a try-except block.
# Define a custom function with error handling
def safe_clean_text(text):
    try:
        # Remove punctuation and convert to lowercase
        text = text.translate(str.maketrans('', '', string.punctuation)).lower()
    except AttributeError:
        # Handles cases where 'text' is not a string (e.g., NaN or a number)
        return text
    return text
# Apply the function with error handling
df['cleaned_text'] = df['text_column'].apply(safe_clean_text)
# View the result
print(df[['text_column', 'cleaned_text']].head())
This approach prevents your function from failing due to unexpected input types.
7. Performance Considerations for Large Datasets
Applying custom functions row by row can be slow when working with large datasets. If performance becomes an issue, consider using vectorised operations (when possible) or libraries like Dask or Swifter to parallelise the apply() operations.
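For instance, the clean_text() function defined earlier can often be replaced by vectorised .str operations; a rough equivalent sketch, using a regex to strip punctuation instead of str.translate:
# Vectorised equivalent of clean_text(): remove punctuation, then lowercase
df['cleaned_text'] = (
    df['text_column']
    .str.replace(r'[^\w\s]', '', regex=True)
    .str.lower()
)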
Example: Using swifter for Faster Apply
The Swifter library automatically optimises your apply() functions for better performance on large datasets.
import swifter
# Apply the custom function using swifter for performance optimization
df['cleaned_text'] = df['text_column'].swifter.apply(clean_text)
This can significantly speed up text transformations when working with large-scale data.
Concatenating and Merging Text Columns in Pandas
When working with text data in Pandas, you’ll often need to combine information from multiple columns into a single column. This is common in tasks like combining first and last names, creating full addresses, or merging related information. Pandas provides several methods to concatenate and merge text columns efficiently.
1. Concatenating Text Columns Using + Operator
One of the simplest ways to concatenate two or more text columns in Pandas is by using the + operator. This method works similarly to string concatenation in Python, allowing you to directly combine the content of different columns.
Example: Combining First and Last Names
# Concatenate 'first_name' and 'last_name' columns
df['full_name'] = df['first_name'] + ' ' + df['last_name']
# View the result
print(df[['first_name', 'last_name', 'full_name']].head())
This method adds a space between the names for readability. You can customise the separator (e.g., commas, slashes, etc.) based on your needs.
Example: Concatenating with a Custom Separator
# Concatenate with a custom separator (e.g., a comma)
df['full_name'] = df['last_name'] + ',' + df['first_name']
However, be cautious when concatenating columns that may contain missing values (NaN), as the + operator will return NaN if any of the columns have missing data. In such cases, you should handle missing values first.
2. Using str.cat() for Concatenation
Pandas provides the str.cat() method, which is more flexible for concatenating columns. It allows you to handle missing values and specify a separator.
Example: Concatenating with str.cat()
# Concatenate first and last names with a space as a separator
df['full_name'] = df['first_name'].str.cat(df['last_name'], sep=' ')
# View the result
print(df[['first_name', 'last_name', 'full_name']].head())
The str.cat() method is particularly useful because it allows you to specify how to handle NaN values with the na_rep parameter.
Example: Handling Missing Values
# Concatenate first and last names, replacing missing values with 'Unknown'
df['full_name'] = df['first_name'].str.cat(df['last_name'], sep=' ', na_rep='Unknown')
# View the result
print(df[['first_name', 'last_name', 'full_name']].head())
This ensures that the concatenation operation does not result in NaN if any fields are missing.
3. Concatenating Multiple Columns
If you need to concatenate more than two columns, str.cat() easily handles this by passing a list of columns.
Example: Concatenating Multiple Address Columns
# Concatenate 'address', 'city', 'state', and 'zip_code' into a full address
df['full_address'] = df['address'].str.cat([df['city'], df['state'], df['zip_code']], sep=', ')
# View the result
print(df[['address', 'city', 'state', 'zip_code', 'full_address']].head())
In this example, we concatenate four columns (address, city, state, and zip code) into a single full address column, using a comma as the separator.
4. Using agg() for Concatenation Across Rows
If you need to concatenate text from multiple rows into a single string, you can use the agg() method. This is particularly useful when summarising text data or when you want to combine text data grouped by a specific column.
Example: Concatenating Text Across Rows
# Group by 'category' and concatenate text from the 'description' column
df_grouped = df.groupby('category')['description'].agg(' '.join)
# View the result
print(df_grouped.head())
This groups the DataFrame by the category column and concatenates all descriptions for each category into a single string.
5. Merging Text Columns with merge()
Pandas provides the merge() function to combine two DataFrames based on one or more common columns. This is especially useful when combining related information from different sources or datasets.
Example: Merging DataFrames with a Common Column
# Merge two DataFrames on a common 'id' column
df_merged = pd.merge(df1, df2, on='id')
# View the merged DataFrame
print(df_merged.head())
This method joins two DataFrames (df1 and df2) based on the common id column. Depending on how you want to combine the data, you can also specify the type of join (left, right, inner, or outer).
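For example, a left join keeps every row of df1 and fills the unmatched rows from df2 with NaN:
# Keep all rows from df1; unmatched rows get NaN in df2's columns
df_merged = pd.merge(df1, df2, on='id', how='left')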
6. Handling Duplicates After Concatenation
After concatenating or merging text columns, you might end up with duplicate values in your data. Pandas provides methods like drop_duplicates() to handle these cases.
Example: Removing Duplicates
# Remove duplicate rows based on the 'full_name' column
df_cleaned = df.drop_duplicates(subset=['full_name'])
# View the cleaned DataFrame
print(df_cleaned.head())
You can also use this method to drop duplicates across multiple columns.
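For instance, to treat rows as duplicates only when both name parts repeat:
# Drop rows only when both 'first_name' and 'last_name' are duplicated together
df_cleaned = df.drop_duplicates(subset=['first_name', 'last_name'])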
Advanced Text Manipulation: Working with Regular Expressions
Regular expressions (regex) are a powerful tool for advanced text manipulation and pattern matching in Pandas. They allow you to search for specific patterns within text, extract information, and perform complex string operations. Pandas integrates with Python’s re module, providing methods like str.contains(), str.extract(), and str.replace() to efficiently handle regular expressions on DataFrames.
Basics of Regular Expressions
Before diving into Pandas-specific applications, it’s essential to understand some basic regex concepts (a short standalone sketch follows this list):
- Literal characters: Exact matches, e.g., “abc” matches the string “abc”.
- Metacharacters: Special symbols like . (any character), ^ (start of string), $ (end of string), * (zero or more repetitions), + (one or more repetitions), [] (character set), etc.
- Escaping special characters: Use \ to escape metacharacters when needed.
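Here is a small sketch of these concepts using Python’s re module directly (the sample strings are made up for illustration):
import re
print(re.findall(r'abc', 'abc abcd'))          # ['abc', 'abc'] - literal characters
print(re.findall(r'^Data', 'Data Science'))    # ['Data']       - ^ anchors the start of the string
print(re.findall(r'\d+', 'Order 42, item 7'))  # ['42', '7']    - \d+ matches one or more digits
print(re.findall(r'col\.', 'col. column'))     # ['col.']       - \. escapes the dot metacharacter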
1. Finding Text with str.contains()
The str.contains() method allows you to search for rows where a specific pattern exists within a text column. This method is often used to filter data based on text patterns.
Example: Searching for Patterns
Suppose you have a dataset with a column containing email addresses, and you want to filter rows where the email is from a specific domain (e.g., Gmail).
# Filter rows where 'email' contains '@gmail.com'
gmail_users = df[df['email'].str.contains(r'@gmail\.com')]
# View the filtered DataFrame
print(gmail_users.head())
In this example, we use @gmail\.com as the pattern: \. escapes the dot, and the r prefix denotes a raw string so Python doesn’t interpret the backslash as an escape character.
Case Sensitivity and Missing Values
You can make the search case-insensitive and handle missing values with optional parameters.
# Case-insensitive search for '@gmail.com', ignoring NaN values
gmail_users = df[df['email'].str.contains(r'@gmail\.com', case=False, na=False)]
This ensures the search ignores case differences and doesn’t fail when encountering NaN values.
2. Extracting Patterns with str.extract()
To extract specific parts of a string based on a pattern, you can use the str.extract() method. This is especially useful when isolating certain information, such as dates, IDs, or email domains.
Example: Extracting Email Domain
# Extract the domain from the 'email' column
df['domain'] = df['email'].str.extract(r'@([a-zA-Z0-9.-]+)')
# View the result
print(df[['email', 'domain']].head())
In this example, the regex pattern @([a-zA-Z0-9.-]+) matches the domain part of the email (after the @ symbol) and extracts it into a new column. The parentheses indicate the portion of the string to extract, and + ensures one or more valid domain characters are matched.
Extracting Multiple Groups
You can extract multiple pieces of information from text by using various capture groups.
# Extract both username and domain from the 'email' column
df[['username', 'domain']] = df['email'].str.extract(r'([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+)')
This pattern captures both the username and the email domain by using two sets of parentheses.
3. Replacing Text with str.replace()
The str.replace() method allows you to replace text in a column based on a regex pattern. This is particularly useful for cleaning or standardising text data.
Example: Replacing Substrings
Suppose you want to standardise phone numbers by removing non-numeric characters.
# Replace non-numeric characters in 'phone_number' with an empty string
df['cleaned_phone'] = df['phone_number'].str.replace(r'\D', '', regex=True)
# View the result
print(df[['phone_number', 'cleaned_phone']].head())
Here, \D is a regex pattern that matches any non-digit character, and it’s replaced with an empty string to leave only the numeric characters.
Example: Formatting Text with Replacements
You can also use str.replace() to format text in a specific way. For example, reformatting dates from “dd-mm-yyyy” to “yyyy-mm-dd”.
# Reformat dates from 'dd-mm-yyyy' to 'yyyy-mm-dd'
df['formatted_date'] = df['date_column'].str.replace(r'(\d{2})-(\d{2})-(\d{4})', r'\3-\2-\1', regex=True)
# View the result
print(df[['date_column', 'formatted_date']].head())
This uses regex groups (\d{2}) and (\d{4}) to capture the day, month, and year, and reorders them to create a standardised format.
4. Splitting Strings with str.split() and Regex
The str.split() method can also be used with regular expressions to split strings based on complex patterns. This is useful when data needs to be split into multiple columns.
Example: Splitting Based on Delimiters
Suppose you have a column with names in the format “First Last” and want to split it into two columns.
# Split the 'name' column into 'first_name' and 'last_name'
df[['first_name', 'last_name']] = df['name'].str.split(' ', expand=True)
# View the result
print(df[['name', 'first_name', 'last_name']].head())
This example splits the name into first and last names using the space as a delimiter. The expand=True parameter ensures the result is returned as separate columns.
Example: Splitting with Complex Delimiters
You can also use a regex pattern for splitting, such as splitting a string on multiple possible delimiters.
# Split based on commas, semicolons, or spaces
df[['part1', 'part2']] = df['complex_string'].str.split(r'[;,\s]+', n=1, expand=True, regex=True)
This splits the string at the first comma, semicolon, or run of whitespace (the regex pattern [;,\s]+); n=1 limits the split to one occurrence so the result expands to exactly two columns.
5. Handling Complex Patterns
Regular expressions allow you to handle complex text patterns, such as validating or extracting particular text formats. Here are a few examples of more advanced usage:
Validating Text Patterns
You can validate whether a column follows a specific format (e.g., checking for valid email addresses).
# Validate email format
valid_emails = df['email'].str.contains(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', regex=True)
# View valid emails
print(df[valid_emails])
This regex ensures that the email addresses follow a common format, with a username, @ symbol, domain, and valid top-level domain (TLD).
Extracting Complex Information
Regular expressions are useful when dealing with complex, semi-structured data, such as extracting dates and product codes or identifying key patterns in text.
# Extract product codes from a complex string
df['product_code'] = df['description'].str.extract(r'(\b[A-Z]{3}-\d{4}\b)')
# View the result
print(df[['description', 'product_code']].head())
This example extracts product codes in the format “ABC-1234” from a description column.
Case Study Example: Analysing a Real-World Text Dataset
In this section, we’ll work through a practical example of using Pandas to analyse a real-world text dataset. We will demonstrate how to load, clean, manipulate, and extract insights from the data using the techniques discussed earlier, including string methods, regular expressions, and custom functions. For this case study, we’ll use a dataset containing customer reviews for a product or service.
1. Loading the Dataset
Let’s assume we have a dataset (customer_reviews.csv) that contains product reviews and other relevant fields such as the customer name, review text, rating, and review date. We will start by loading the data into a Pandas DataFrame; if you don’t have a comparable file, a small synthetic stand-in is sketched after the column list below.
import pandas as pd
# Load the dataset
df = pd.read_csv('customer_reviews.csv')
# View the first few rows of the DataFrame
print(df.head())
The dataset contains the following columns:
- review_id: Unique identifier for each review
- customer_name: The name of the reviewer
- review_text: The text content of the review
- rating: The rating given by the customer (1 to 5)
- review_date: The date when the review was posted
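For readers without the file, here is that synthetic stand-in, with hypothetical values purely for following along:
# Hypothetical stand-in for customer_reviews.csv
df = pd.DataFrame({
    'review_id': [1, 2, 3],
    'customer_name': ['Ann', 'Ben', 'Cara'],
    'review_text': ['Great quality, love it!', 'Poor service, very disappointed.', None],
    'rating': [5, 2, 3],
    'review_date': ['2024-01-05', '2024-01-09', '2024-01-12'],
})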
2. Cleaning and Preprocessing the Text
Next, we will clean and preprocess the review_text column by handling missing values, removing unnecessary characters, and standardising the format.
Handling Missing Values
First, we will check for missing values in the review_text column and replace them with a placeholder value like “No review.”
# Fill missing reviews with 'No review'
df['review_text'] = df['review_text'].fillna('No review')
# Check for missing values
print(df['review_text'].isnull().sum())
Removing Punctuation and Lowercasing Text
To ensure uniformity, we will remove punctuation from the reviews and convert the text to lowercase for easier analysis.
import string
# Define a function to clean the text
def clean_review(text):
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    text = text.lower()  # Convert to lowercase
    return text
# Apply the cleaning function to the 'review_text' column
df['cleaned_review'] = df['review_text'].apply(clean_review)
# View the cleaned data
print(df[['review_text', 'cleaned_review']].head())
3. Extracting Information with Regular Expressions
Let’s extract some useful insights from the review text. We will start by checking if any reviews mention specific keywords related to the product (e.g., “quality,” “price,” “service”) using regular expressions.
Finding Keyword Mentions
We can use str.contains() to find reviews that mention certain keywords like “quality” and “price.”
# Search for reviews mentioning 'quality' or 'price'
df['mentions_quality'] = df['cleaned_review'].str.contains(r'\bquality\b', regex=True)
df['mentions_price'] = df['cleaned_review'].str.contains(r'\bprice\b', regex=True)
# View the result
print(df[['cleaned_review', 'mentions_quality', 'mentions_price']].head())
Here, we use word boundaries (\b) in the regex pattern to ensure we capture whole words only.
4. Sentiment Analysis Using Custom Functions
For deeper analysis, let’s define a simple sentiment analysis function that classifies reviews as “positive,” “negative,” or “neutral” based on the presence of certain keywords. For instance, reviews that mention words like “good,” “excellent,” or “happy” can be marked as positive, and those with words like “bad,” “poor,” or “disappointed” can be labelled as negative.
Custom Sentiment Function
# Define a function for basic sentiment analysis
def classify_sentiment(text):
    positive_words = ['good', 'great', 'excellent', 'happy', 'love']
    negative_words = ['bad', 'poor', 'disappointed', 'unhappy', 'hate']
    # Check for positive and negative words in the text
    if any(word in text for word in positive_words):
        return 'positive'
    elif any(word in text for word in negative_words):
        return 'negative'
    else:
        return 'neutral'
# Apply the sentiment function to the cleaned reviews
df['sentiment'] = df['cleaned_review'].apply(classify_sentiment)
# View the sentiment classification
print(df[['cleaned_review', 'sentiment']].head())
This simple approach gives us a basic sentiment classification based on the presence of positive or negative words.
5. Aggregating and Analysing Review Data
Now that we’ve classified the reviews, we can perform some aggregate analysis to gain insights into customer feedback patterns.
Sentiment Distribution
Let’s calculate the distribution of sentiments to understand the overall customer satisfaction.
# Calculate the sentiment distribution
sentiment_counts = df['sentiment'].value_counts()
# View the sentiment distribution
print(sentiment_counts)
This shows how many reviews fall into each sentiment category (positive, negative, or neutral).
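To express the same distribution as percentages, value_counts() accepts a normalize flag:
# Sentiment distribution as percentages of all reviews
print(df['sentiment'].value_counts(normalize=True).mul(100).round(1))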
Average Rating by Sentiment
Next, we will calculate the average rating for each sentiment group to see if there’s a correlation between sentiment and the rating given.
# Calculate the average rating for each sentiment category
average_rating_by_sentiment = df.groupby('sentiment')['rating'].mean()
# View the result
print(average_rating_by_sentiment)
This shows whether positive reviews tend to have higher ratings and negative reviews tend to have lower ratings.
6. Visualising the Results
Finally, we can create some simple visualisations to display the results of our analysis. For example, we can plot the distribution of sentiments and the average rating for each sentiment category using Matplotlib or Seaborn.
import matplotlib.pyplot as plt
import seaborn as sns
# Plot the sentiment distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='sentiment', data=df, palette='coolwarm')
plt.title('Sentiment Distribution')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()
# Plot the average rating by sentiment
plt.figure(figsize=(8, 6))
sns.barplot(x=average_rating_by_sentiment.index, y=average_rating_by_sentiment.values, palette='coolwarm')
plt.title('Average Rating by Sentiment')
plt.xlabel('Sentiment')
plt.ylabel('Average Rating')
plt.show()
These visualisations give a clearer picture of how customers feel about the product and how their sentiment relates to the ratings they give.
Conclusion
This blog post explored various techniques for working with text data in Pandas, from loading and cleaning text to applying advanced manipulations using regular expressions. By leveraging Pandas’ capabilities, we demonstrated how to analyse textual information effectively and extract meaningful insights.
Throughout the sections, we covered key aspects, including:
- Loading Text Data: Understanding how to import text data from various formats and structure it within a DataFrame.
- String Methods: Utilising built-in string functions to manipulate and analyse text efficiently.
- Filtering and Extracting Information: Applying filters and extraction techniques to hone in on specific data points.
- Handling Missing Text Data: Strategies for dealing with missing values to maintain data integrity.
- Applying Custom Functions: Creating and applying custom functions for specialised text processing tasks.
- Concatenating and Merging Text Columns: Combining information from multiple columns to create more informative datasets.
- Advanced Text Manipulation: Utilising regular expressions for pattern matching, validation, and complex text operations.
- Case Study Example: Analysing a real-world text dataset to demonstrate the practical application of these techniques in deriving insights.
By mastering these techniques, you can enhance your data analysis skills and effectively handle text data in various contexts, from customer reviews to social media sentiment analysis. The ability to manipulate and analyse text data opens up numerous opportunities for gaining insights, driving decisions, and creating value from unstructured data.
As you continue to work with text data, consider exploring additional libraries, such as nltk or spaCy, for more advanced natural language processing capabilities. Combined with Pandas, these tools can empower you to tackle even more complex text analysis challenges.