How To Process Text In Python With Pandas Made Simple

Oct 22, 2024 | Data Science, Natural Language Processing

Introduction

Text data is everywhere—from social media posts and customer reviews to emails and product descriptions. For data scientists and analysts, working with this unstructured form of data presents unique challenges and opportunities. Whether you are cleaning messy data, analysing trends, or extracting valuable insights from textual information, mastering text manipulation is a critical skill.

Fortunately, Pandas, one of Python’s most widely used data manipulation libraries, provides a robust set of tools designed to work with text data. From simple string operations like replacing characters to more complex tasks such as extracting patterns with regular expressions, Pandas makes handling text efficient and scalable.

In this blog post, we’ll walk you through various ways to work with text in Pandas, covering everything from basic string methods to advanced techniques for handling and transforming textual information. By the end, you’ll be equipped with practical knowledge to make sense of your text data, regardless of the dataset you’re working with.

Loading Text Data in Pandas

Before manipulating and analysing text data, load it into a Pandas DataFrame. Pandas offers several convenient methods to import data from various file formats, such as CSV, Excel, and plain text files. Let’s explore the most common ways to load text data into Pandas.

1. Loading Text Data from a CSV File in Pandas

One of the most common formats for text data is the CSV (Comma-Separated Values) file. You can load CSV files into Pandas using pd.read_csv(). This method reads the file and converts it into a DataFrame, representing each column as a separate series.

import pandas as pd 

# Load a CSV file into a DataFrame 
df = pd.read_csv('data.csv') 

# View the first few rows of the DataFrame 
print(df.head())

You can also specify additional parameters when loading a CSV file, such as a different delimiter (e.g., tabs or semicolons) or handling missing values:

# Load a CSV file with a different delimiter (e.g., semicolon-separated) 
df = pd.read_csv('data.csv', delimiter=';') 

# Load a CSV file and specify NA values 
df = pd.read_csv('data.csv', na_values=['NA', 'None'])

2. Loading Text Data from an Excel File in Pandas

If your text data is stored in an Excel spreadsheet, Pandas provides the pd.read_excel() method. This method allows you to load data from a specific sheet or from multiple sheets within an Excel file.

# Load text data from an Excel file 
df = pd.read_excel('data.xlsx') 

# Load data from a specific sheet in the Excel file 
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

3. Loading Data from Plain Text Files in Pandas

For files that contain unstructured text data (e.g., logs, articles, or paragraphs), you can use pd.read_fwf() (for fixed-width formatted text) or pd.read_table() (for delimited text data). If your text file is a list of lines, you may need to load it manually and structure it into a DataFrame.

# Load fixed-width formatted text file 
df = pd.read_fwf('data.txt') 

# Load a text file with a custom delimiter 
df = pd.read_table('data.txt', delimiter='\t')

For very unstructured text files, you may want to read the file line by line and then manually process it into a DataFrame:

# Read a text file into a DataFrame 
with open('textfile.txt') as f: 
    lines = f.readlines() 
    
# Create a DataFrame from the text lines 
df = pd.DataFrame(lines, columns=['text'])

4. Loading Data from the Web in Pandas

Sometimes, you may want to load text data directly from a web URL. Pandas allows you to pass the URL directly into pd.read_csv() or pd.read_table(). This can be useful when working with public datasets available online.

# Load a CSV file from a URL 
url = 'https://example.com/data.csv' 
df = pd.read_csv(url)

5. Previewing and Exploring the Data in Pandas

After loading your text data into a DataFrame, it is important to inspect it to ensure everything was loaded correctly. Pandas provides several methods to explore the structure and content of your data:

  • df.head(): View the first few rows.
  • df.info(): Get an overview of the DataFrame, including data types and memory usage.
  • df.columns: List all column names.
  • df.describe(): Get summary statistics (useful for numerical data, but it can also give insights into the structure of text columns).

# Preview the first 5 rows of the DataFrame 
print(df.head()) 

# Get information about the DataFrame 
df.info() 

# Check the column names 
print(df.columns)

Text String Methods in Pandas

When working with text data in Pandas, the .str accessor is your go-to tool for performing string operations on entire columns of a DataFrame. It allows you to efficiently apply various string methods to each column element without loops. Let’s explore some of the most common string methods you’ll use to manipulate text in Pandas.

1. Changing Case: Lowercase and Uppercase in Pandas

Standardising the case of your text data is often the first step in cleaning. Pandas offers the str.lower() and str.upper() methods to convert all text to lowercase or uppercase.

# Convert all text in 'column_name' to lowercase 
df['column_name'] = df['column_name'].str.lower() 

# Convert all text in 'column_name' to uppercase 
df['column_name'] = df['column_name'].str.upper()
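
Pandas also offers str.title() and str.capitalize() for other common case conventions:

# Convert each word to title case (e.g., 'data science' becomes 'Data Science') 
df['column_name'] = df['column_name'].str.title() 

# Capitalise only the first character of each string 
df['column_name'] = df['column_name'].str.capitalize()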

2. Removing Whitespace in Pandas

Extra whitespace can cause inconsistencies when working with text data. Pandas provides the str.strip() method to remove leading and trailing whitespace. You can also use str.lstrip() and str.rstrip() to remove only leading or trailing spaces.

# Remove leading and trailing whitespace from 'column_name' 
df['column_name'] = df['column_name'].str.strip() 

# Remove leading whitespace only 
df['column_name'] = df['column_name'].str.lstrip() 

# Remove trailing whitespace only 
df['column_name'] = df['column_name'].str.rstrip()

3. Measuring String Length in Pandas

To find the length of each string in a column, use str.len(). This is particularly useful when filtering or categorising text based on its length.

# Get the length of each string in 'column_name' 
df['string_length'] = df['column_name'].str.len()
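
For example, once the lengths are computed, you can use them to filter out very short entries, a common first pass when cleaning free-text fields:

# Keep only rows whose text is longer than 20 characters 
df_long = df[df['column_name'].str.len() > 20]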

4. Replacing Substrings in Pandas

If you need to replace specific substrings within your text data, use the str.replace() method. You can either replace all occurrences of a substring or use regular expressions for more complex patterns.

# Replace 'old_value' with 'new_value' in 'column_name' 
df['column_name'] = df['column_name'].str.replace('old_value', 'new_value') 

# Using regular expressions to replace patterns 
df['column_name'] = df['column_name'].str.replace(r'\d+', '', regex=True) # Removes digits

5. Splitting and Joining Strings in Pandas

To split text into multiple parts (e.g., breaking a full name into first and last names), use str.split(). To join strings from multiple columns, use str.cat().

# Split strings in 'full_name' by a delimiter (e.g., space) and expand into multiple columns 
df[['first_name', 'last_name']] = df['full_name'].str.split(' ', expand=True) 

# Join two columns (e.g., first_name and last_name) with a space separator 
df['full_name'] = df['first_name'].str.cat(df['last_name'], sep=' ')

6. Checking for Substrings in Pandas

To check whether a string contains a specific substring, you can use str.contains(). This is especially useful for filtering data based on text patterns.

# Filter rows where 'column_name' contains the word 'Python' 
df[df['column_name'].str.contains('Python', case=False)]

7. Finding and Extracting Substrings in Pandas

To locate the position of a substring, use str.find(). If you want to extract specific parts of a string based on position, str.slice() or str.extract() with regular expressions can be helpful.

# Find the position of a substring in 'column_name' 
df['substring_position'] = df['column_name'].str.find('keyword') 

# Extracting a pattern using a regular expression 
df['email_domain'] = df['email'].str.extract(r'@(\w+\.\w+)')

Filtering and Extracting Information in Pandas

Once you’ve loaded your text data into a Pandas DataFrame, the next step is filtering specific rows or extracting valuable information from the text. Whether you want to find rows containing particular keywords or extract patterns like dates, phone numbers, or email domains, Pandas offers powerful tools to make this easy.

1. Filtering Rows Based on Text Conditions

A common task when working with text data is filtering rows based on the content of a specific column. The str.contains() method allows you to filter rows where a string contains a particular substring or matches a regular expression.

Example: Filtering Rows Containing a Specific Word

# Filter rows where 'column_name' contains the word 'Python' (case-insensitive) 
filtered_df = df[df['column_name'].str.contains('Python', case=False)] 

# View the filtered DataFrame 
print(filtered_df.head())

This method can also be combined with other DataFrame methods, allowing you to handle complex conditions.

Example: Filtering Based on Multiple Conditions

# Filter rows where 'column_name' contains 'Python' and 'status' equals 'Active' 
filtered_df = df[(df['column_name'].str.contains('Python', case=False)) & (df['status'] == 'Active')]

You can also use str.startswith() and str.endswith() to filter rows based on how the string begins or ends.

# Filter rows where 'column_name' starts with 'Data' 
df[df['column_name'].str.startswith('Data')] 

# Filter rows where 'column_name' ends with 'Science' 
df[df['column_name'].str.endswith('Science')]

2. Extracting Substrings

Sometimes, you’ll need to extract specific parts of a string. Pandas offers several methods, including str.slice() and str.extract().

Using str.slice() to Extract by Position

If you know the exact position of the text you want to extract (e.g., the first 5 characters), you can use str.slice():

# Extract the first 5 characters of each string in 'column_name' 
df['first_five_chars'] = df['column_name'].str.slice(0, 5) 

# Extract characters from position 5 to 10 
df['middle_chars'] = df['column_name'].str.slice(5, 10)

Using str.extract() with Regular Expressions

For more complex extraction tasks, such as extracting patterns like email domains or phone numbers, you can use regular expressions with the str.extract() method.

# Extract email domains from 'email' column 
df['email_domain'] = df['email'].str.extract(r'@(\w+\.\w+)') 

# Extract phone numbers using a regular expression pattern 
df['phone_number'] = df['contact_info'].str.extract(r'(\d{3}-\d{3}-\d{4})')

In these examples, the regular expression defines the pattern to extract. For instance, the @(\w+\.\w+) pattern captures the part of the email address that follows the @ symbol.

3. Extracting Multiple Matches with str.findall()

If your data contains multiple instances of a pattern within a single string, you can use the str.findall() method to extract all matches. This method returns a list of matches for each string.

Example: Extracting All Hashtags from a Text Column

# Extract all hashtags from a 'text' column 
df['hashtags'] = df['text'].str.findall(r'#\w+') 

# View the extracted hashtags 
print(df[['text', 'hashtags']])

This returns a new column where each cell contains a list of all the hashtags found in the text.
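
Since each cell now holds a list, a natural follow-up is to flatten those lists with explode() and count how often each hashtag appears. A minimal sketch, building on the 'hashtags' column created above:

# Flatten the lists so each hashtag occupies its own row, then count occurrences 
hashtag_counts = df['hashtags'].explode().value_counts() 

# View the ten most common hashtags 
print(hashtag_counts.head(10))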

4. Filtering Rows with Missing or Non-Text Data

Text data often contains missing or null values, which can interfere with filtering or extraction tasks. Pandas provides ways to handle this, such as filtering out rows with missing values using notna() or isna().

# Filter rows where 'column_name' is not null 
df_filtered = df[df['column_name'].notna()] 

# Filter rows where 'column_name' is null 
df_null = df[df['column_name'].isna()]

5. Combining Filtering and Extraction

Often, filtering and extraction go hand in hand. You might first filter rows based on a condition and then extract specific information from the filtered data.

Example: Filtering Emails and Extracting Domains

# Filter rows where 'email' contains a valid email format 
# (.copy() avoids a SettingWithCopyWarning when adding a column below) 
valid_emails = df[df['email'].str.contains(r'^\w+@\w+\.\w+$', na=False)].copy() 

# Extract domains from the filtered valid emails 
valid_emails['email_domain'] = valid_emails['email'].str.extract(r'@(\w+\.\w+)')

Handling Missing Text Data In Pandas

Missing data is a common challenge when working with real-world datasets, especially text fields. Whether text entries are partially missing, completely absent, or contain irregular values, it’s essential to handle these issues properly to ensure the integrity of your analysis. Pandas offers several methods to manage missing text data, allowing you to clean your dataset and handle gaps effectively.

1. Identifying Missing Text Data in Pandas

Before dealing with missing values, you must identify which entries in your DataFrame are missing. In Pandas, missing values are typically represented as NaN (Not a Number), and you can use the following methods to detect them.

Using isna() and notna()

The isna() method allows you to check for NaN values, while notna() returns the opposite (non-missing values).

# Check for missing values in 'text_column' 
missing_values = df['text_column'].isna() 

# View rows where text data is missing 
df_missing = df[missing_values] 

# Check for non-missing values 
df_non_missing = df['text_column'].notna()

You can apply this method to the entire DataFrame to see where missing values are located across all columns.

# Check for missing values in the entire DataFrame 
df.isna().sum()

This will give you the count of missing values in each column, which is particularly useful when determining how much of your text data is affected by missing entries.
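
If proportions are more useful than raw counts, taking the mean of the boolean mask gives the fraction of missing values per column:

# Percentage of missing values in each column 
print((df.isna().mean() * 100).round(2))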

2. Filling Missing Text Data in Pandas

Once you’ve identified the missing data, you must decide how to handle it. One common approach is to fill missing values with a placeholder or impute them with meaningful data. Pandas offers the fillna() method for this purpose.

Filling with a Placeholder

If you want to simply fill missing text fields with a placeholder (such as “Unknown” or “N/A”), you can do so with fillna():

# Fill missing text values in 'text_column' with 'Unknown' 
df['text_column'] = df['text_column'].fillna('Unknown')

Filling with Mode or Other Values

Sometimes, you may want to fill in missing values with the most frequent (mode) text value or a specific value based on domain knowledge.

# Fill missing values with the most frequent value (mode) 
most_frequent = df['text_column'].mode()[0] 
df['text_column'] = df['text_column'].fillna(most_frequent)

3. Dropping Rows or Columns with Missing Text in Pandas

If a large portion of a column or row is missing, and you believe that filling the gaps wouldn’t add meaningful information, you can drop the affected rows or columns entirely using dropna().

Dropping Rows with Missing Text

To remove rows where a specific column contains missing text, you can specify the column in subset:

# Drop rows where 'text_column' is missing 
df_cleaned = df.dropna(subset=['text_column'])

If multiple columns are affected, you can drop rows that have missing values in any of the specified columns:

# Drop rows where any of the specified columns are missing 
df_cleaned = df.dropna(subset=['text_column', 'another_column'])

Dropping Columns with Many Missing Values

If a column has too many missing values, you may decide to drop the entire column:

# Drop columns where all values are missing 
df_cleaned = df.dropna(axis=1, how='all') 

# Keep only columns with at least 30% non-missing values 
# (i.e. drop columns that are more than 70% missing) 
df_cleaned = df.dropna(axis=1, thresh=int(0.3 * len(df)))

4. Handling Missing Text in Complex Datasets in Pandas

In more complex cases, you might want to handle missing values conditionally. For example, if a specific value in one column implies missing data in another, you can apply custom logic using apply() or mask().

Example: Filling Missing Text Conditionally

If missing text in a column depends on the value of another column, you can use apply() or where() for conditional logic:

# Fill missing text conditionally based on another column 
df['text_column'] = df['text_column'].fillna(df['related_column'].apply(lambda x: 'Placeholder' if x == 'Condition' else 'Other Placeholder'))

5. Interpolating or Forward/Backward Filling in Pandas

For cases where text data follows a pattern or sequence, you can use forward-fill or backward-fill methods to propagate the last valid entry to the next missing one.

Forward-Fill (ffill()) and Backward-Fill (bfill())

# Forward-fill missing values in 'text_column' 
df['text_column'] = df['text_column'].ffill() 

# Backward-fill missing values in 'text_column' 
df['text_column'] = df['text_column'].bfill()

Forward-fill is particularly useful in time-series datasets or when the data logically follows a sequence where previous values can fill the gaps.
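
For instance, if the data is grouped by an identifier and sorted chronologically, you can forward-fill within each group so values never leak across groups. A short sketch, assuming hypothetical 'user_id' and 'status' columns:

# Forward-fill 'status' within each user's records only 
df['status'] = df.groupby('user_id')['status'].ffill()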

Applying Custom Functions to Text Data in Pandas

While Pandas offers a wide array of built-in string methods, you often need to perform more specific or complex operations on your text data that go beyond standard functions. In such cases, you can apply custom functions to manipulate, clean, or transform text data according to your unique requirements. Pandas provides several ways to apply custom functions efficiently, including apply(), map(), and lambda functions.

1. Using apply() to Apply Custom Functions

The apply() method allows you to apply any custom function to each element of a DataFrame column. This is particularly useful when dealing with complex transformations that cannot be achieved using built-in methods.

Example: Applying a Custom Cleaning Function

Suppose you want to standardise and clean text data by performing multiple operations, such as removing punctuation and converting text to lowercase. To achieve this, you can create a custom function.

import string 

# Define a custom function to clean text 
def clean_text(text): 
    # Remove punctuation 
    text = text.translate(str.maketrans('', '', string.punctuation)) 
    # Convert to lowercase 
    text = text.lower() 
    return text 

# Apply the custom function to the 'text_column' 
df['cleaned_text'] = df['text_column'].apply(clean_text) 
# View the result 
print(df[['text_column', 'cleaned_text']].head())

The apply() method iterates over each column element and applies the clean_text() function, returning the transformed text in a new column.

2. Using lambda Functions for Simple Operations

You can use lambda functions with apply() for simple one-line operations. This is useful for quick manipulations that don’t require a separate function definition.

Example: Shortening Text to a Fixed Length

If you want to limit the text length to a specific number of characters, you can use a lambda function to slice the string.

# Shorten text to 50 characters using a lambda function 
# (non-string values such as NaN pass through unchanged) 
df['short_text'] = df['text_column'].apply(lambda x: x[:50] if isinstance(x, str) else x)
 
# View the shortened text 
print(df[['text_column', 'short_text']].head())

This method is ideal for quick operations that are easy to define in-line.

3. Applying Functions to Multiple Columns

You can apply custom functions to multiple columns by passing a function to apply() across axis=1. This is useful when your function depends on more than one column.

Example: Combining Text from Multiple Columns

Let’s say you want to concatenate two columns, such as first and last names, while ensuring proper formatting (e.g., title case for each name).

# Define a custom function to concatenate first and last names 
def format_full_name(row): 
    return f"{row['first_name'].title()} {row['last_name'].title()}" 
    
# Apply the function across rows (axis=1) 
df['full_name'] = df.apply(format_full_name, axis=1) 

# View the result 
print(df[['first_name', 'last_name', 'full_name']].head())

Here, the apply() method applies the format_full_name() function across each row, and axis=1 specifies that the function should be applied row-wise rather than column-wise.

4. Using map() for Element-Wise Operations

The map() function is another useful method for applying custom transformations to a Pandas Series. It’s similar to apply() but works directly on Series objects and is often slightly faster for simple element-wise operations.

Example: Mapping Values to Custom Categories

Suppose you have a text column with categories like “Beginner,” “Intermediate,” and “Advanced,” and you want to map these categories to numerical values.

# Define a mapping dictionary 
category_map = {'Beginner': 1, 'Intermediate': 2, 'Advanced': 3} 

# Apply the map() function to map categories to numbers 
df['category_numeric'] = df['category_column'].map(category_map) 

# View the result 
print(df[['category_column', 'category_numeric']].head())

This method efficiently transforms text-based categories into numeric values for easier analysis.
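
One caveat: map() returns NaN for any value that is absent from the mapping dictionary, so it is worth handling unexpected categories explicitly:

# Assign 0 to any category missing from the mapping, then cast back to integers 
df['category_numeric'] = df['category_column'].map(category_map).fillna(0).astype(int)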

5. Using applymap() for Element-Wise Operations on the Entire DataFrame

If you need to apply a function element-wise to an entire DataFrame, you can use applymap(). This is useful when applying the same transformation to every element in a DataFrame, rather than just a single column.

Example: Stripping Extra Whitespace from All String Columns

# Apply a lambda function to strip whitespace from every element in the DataFrame 
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x) 

# View the cleaned DataFrame 
print(df.head())

This ensures all string values in the DataFrame have their leading and trailing whitespace removed. Note that in Pandas 2.1 and later, applymap() has been deprecated in favour of the equivalent DataFrame.map().

6. Handling Errors in Custom Functions with try-except

When applying custom functions to text data, you might encounter errors due to unexpected data formats or missing values. You can handle these gracefully by incorporating try-except blocks into your functions.

Example: Handling Errors in a Custom Function

If your text data includes numeric values or null entries that could raise errors, you can handle these cases with a try-except block.

# Define a custom function with error handling 
def safe_clean_text(text): 
    try: 
        # Remove punctuation and convert to lowercase 
        text = text.translate(str.maketrans('', '', string.punctuation)).lower() 
    except AttributeError: 
        # Handle cases where 'text' is not a string (e.g., NaN or numeric values) 
        return text 
    return text 
  
# Apply the function with error handling 
df['cleaned_text'] = df['text_column'].apply(safe_clean_text) 

# View the result 
print(df[['text_column', 'cleaned_text']].head())

This approach prevents your function from failing due to unexpected input types.

7. Performance Considerations for Large Datasets

Applying custom functions row by row can be slow when working with large datasets. If performance becomes an issue, consider using vectorised operations (when possible) or libraries like Dask or Swifter to parallelise the apply() operations.

Example: Using swifter for Faster Apply

The Swifter library automatically optimises your apply() functions for better performance on large datasets.

import swifter 

# Apply the custom function using swifter for performance optimization 
df['cleaned_text'] = df['text_column'].swifter.apply(clean_text)

This can significantly speed up text transformations when working with large-scale data.
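
Before reaching for parallelisation, it is often worth checking whether the transformation can be expressed with vectorised .str methods, which avoid Python-level loops entirely. As a rough equivalent of the clean_text() function above (the regex removes punctuation approximately, not character-for-character like string.punctuation):

# Vectorised alternative: strip punctuation-like characters with a regex, then lowercase 
df['cleaned_text'] = ( 
    df['text_column'] 
    .str.replace(r'[^\w\s]', '', regex=True) 
    .str.lower() 
)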

Concatenating and Merging Text Columns in Pandas

When working with text data in Pandas, you’ll often need to combine information from multiple columns into a single column. This is common in tasks like combining first and last names, creating full addresses, or merging related information. Pandas provides several methods to concatenate and merge text columns efficiently.

1. Concatenating Text Columns Using + Operator

One of the simplest ways to concatenate two or more text columns in Pandas is by using the + operator. This method works similarly to string concatenation in Python, allowing you to directly combine the content of different columns.

Example: Combining First and Last Names

# Concatenate 'first_name' and 'last_name' columns 
df['full_name'] = df['first_name'] + ' ' + df['last_name'] 

# View the result 
print(df[['first_name', 'last_name', 'full_name']].head())

This method adds a space between the names for readability. You can customise the separator (e.g., commas, slashes, etc.) based on your needs.

Example: Concatenating with a Custom Separator

# Concatenate with a custom separator (e.g., a comma) 
df['full_name'] = df['last_name'] + ', ' + df['first_name']

However, be cautious when concatenating columns that may contain missing values (NaN), as the + operator will return NaN if any of the columns have missing data. In such cases, you should handle missing values first.
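
A simple safeguard is to fill the missing values with empty strings before concatenating:

# Replace NaN with empty strings so the + operator never propagates NaN 
df['full_name'] = df['first_name'].fillna('') + ' ' + df['last_name'].fillna('')

Alternatively, str.cat() (covered next) handles missing values directly through its na_rep parameter.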

2. Using str.cat() for Concatenation

Pandas provides the str.cat() method, which is more flexible for concatenating columns. It allows you to handle missing values and specify a separator.

Example: Concatenating with str.cat()

# Concatenate first and last names with a space as a separator 
df['full_name'] = df['first_name'].str.cat(df['last_name'], sep=' ') 

# View the result 
print(df[['first_name', 'last_name', 'full_name']].head())

The str.cat() method is particularly useful because it allows you to specify how to handle NaN values with the na_rep parameter.

Example: Handling Missing Values

# Concatenate first and last names, replacing missing values with 'Unknown' 
df['full_name'] = df['first_name'].str.cat(df['last_name'], sep=' ', na_rep='Unknown') 

# View the result 
print(df[['first_name', 'last_name', 'full_name']].head())

This ensures that the concatenation operation does not result in NaN if any fields are missing.

3. Concatenating Multiple Columns

If you need to concatenate more than two columns, str.cat() handles this easily: just pass a list of columns.

Example: Concatenating Multiple Address Columns

# Concatenate 'address', 'city', 'state', and 'zip_code' into a full address 
df['full_address'] = df['address'].str.cat([df['city'], df['state'], df['zip_code']], sep=', ') 

# View the result 
print(df[['address', 'city', 'state', 'zip_code', 'full_address']].head())

In this example, we concatenate four columns (address, city, state, and zip code) into a single full address column, using a comma as the separator.

4. Using agg() for Concatenation Across Rows

If you need to concatenate text from multiple rows into a single string, you can use the agg() method. This is particularly useful when summarising text data or when you want to combine text data grouped by a specific column.

Example: Concatenating Text Across Rows

# Group by 'category' and concatenate text from the 'description' column 
df_grouped = df.groupby('category')['description'].agg(' '.join) 

# View the result 
print(df_grouped.head())

This groups the DataFrame by the category column and concatenates all descriptions for each category into a single string.

5. Merging Text Columns with merge()

Pandas provides the merge() function to combine two DataFrames based on one or more common columns. This is especially useful when combining related information from different sources or datasets.

Example: Merging DataFrames with a Common Column

# Merge two DataFrames on a common 'id' column 
df_merged = pd.merge(df1, df2, on='id') 

# View the merged DataFrame 
print(df_merged.head())

This method joins two DataFrames (df1 and df2) based on the common id column. Depending on how you want to combine the data, you can also specify the type of join (left, right, inner, or outer).
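
For instance, a left join keeps every row from the first DataFrame and fills unmatched rows from the second with NaN:

# Keep all rows from df1; rows with no matching 'id' in df2 get NaN 
df_merged = pd.merge(df1, df2, on='id', how='left')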

6. Handling Duplicates After Concatenation

After concatenating or merging text columns, you might end up with duplicate values in your data. Pandas provides methods like drop_duplicates() to handle these cases.

Example: Removing Duplicates

# Remove duplicate rows based on the 'full_name' column 
df_cleaned = df.drop_duplicates(subset=['full_name']) 

# View the cleaned DataFrame 
print(df_cleaned.head())

You can also use this method to drop duplicates across multiple columns.
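
For example, to treat a row as a duplicate only when both columns match (assuming an 'email' column alongside 'full_name'):

# Drop rows that duplicate both 'full_name' and 'email' 
df_cleaned = df.drop_duplicates(subset=['full_name', 'email'])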

Advanced Text Manipulation: Working with Regular Expressions

Regular expressions (regex) are a powerful tool for advanced text manipulation and pattern matching in Pandas. They allow you to search for specific patterns within text, extract information, and perform complex string operations. Pandas integrates with Python’s re module, providing methods like str.contains(), str.extract(), and str.replace() to efficiently handle regular expressions on DataFrames.

Basics of Regular Expressions

Before diving into Pandas-specific applications, it’s essential to understand some basic regex concepts, illustrated in the short example after this list:

  • Literal characters: Exact matches, e.g., “abc” matches the string “abc”.
  • Metacharacters: Special symbols like . (any character), ^ (start of string), $ (end of string), * (zero or more repetitions), + (one or more repetitions), [] (character set), etc.
  • Escaping special characters: Use \ to escape metacharacters when needed.
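
A quick illustration of these building blocks using Python’s re module:

import re 

# Literal match: 'abc' appears somewhere in the string 
print(bool(re.search(r'abc', 'xxabcxx')))     # True 

# Metacharacters: one or more digits anchored at the end of the string 
print(bool(re.search(r'\d+$', 'order 42')))   # True 

# Escaping: match a literal dot rather than 'any character' 
print(bool(re.search(r'\.com$', 'spot.com'))) # True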

1. Finding Text with str.contains()

The str.contains() method allows you to search for rows where a specific pattern exists within a text column. This method is often used to filter data based on text patterns.

Example: Searching for Patterns

Suppose you have a dataset with a column containing email addresses, and you want to filter rows where the email is from a specific domain (e.g., Gmail).

# Filter rows where 'email' contains '@gmail.com' 
gmail_users = df[df['email'].str.contains(r'@gmail\.com')] 

# View the filtered DataFrame 
print(gmail_users.head())

In this example, we use @gmail\.com as the pattern, where \. escapes the dot, and the r'...' prefix denotes a raw string so Python does not interpret the backslash as an escape character.

Case Sensitivity and Missing Values

You can make the search case-insensitive and handle missing values with optional parameters.

# Case-insensitive search for '@gmail.com', ignoring NaN values 
gmail_users = df[df['email'].str.contains(r'@gmail\.com', case=False, na=False)]

This ensures the search ignores case differences and doesn’t fail when encountering NaN values.

2. Extracting Patterns with str.extract()

To extract specific parts of a string based on a pattern, you can use the str.extract() method. This is especially useful when isolating certain information, such as dates, IDs, or email domains.

Example: Extracting Email Domain

# Extract the domain from the 'email' column 
df['domain'] = df['email'].str.extract(r'@([a-zA-Z0-9.-]+)') 

# View the result 
print(df[['email', 'domain']].head())

In this example, the regex pattern @([a-zA-Z0-9.-]+) matches the domain part of the email (after the @ symbol) and extracts it into a new column. The parentheses indicate the portion of the string to extract, and + ensures one or more valid domain characters are matched.

Extracting Multiple Groups

You can extract multiple pieces of information from text by using various capture groups.

# Extract both username and domain from the 'email' column 
df[['username', 'domain']] = df['email'].str.extract(r'([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+)')

This pattern captures both the username and the email domain by using two sets of parentheses.

3. Replacing Text with str.replace()

The str.replace() method allows you to replace text in a column based on a regex pattern. This is particularly useful for cleaning or standardising text data.

Example: Replacing Substrings

Suppose you want to standardise phone numbers by removing non-numeric characters.

# Replace non-numeric characters in 'phone_number' with an empty string 
df['cleaned_phone'] = df['phone_number'].str.replace(r'\D', '', regex=True) 

# View the result 
print(df[['phone_number', 'cleaned_phone']].head())

Here, \D is a regex pattern that matches any non-digit character, and it’s replaced with an empty string to leave only the numeric characters.

Example: Formatting Text with Replacements

You can also use str.replace() to format text in a specific way. For example, reformatting dates from “dd-mm-yyyy” to “yyyy-mm-dd”.

# Reformat dates from 'dd-mm-yyyy' to 'yyyy-mm-dd' 
df['formatted_date'] = df['date_column'].str.replace(r'(\d{2})-(\d{2})-(\d{4})', r'\3-\2-\1', regex=True) 

# View the result 
print(df[['date_column', 'formatted_date']].head())

This uses the regex groups (\d{2}), (\d{2}), and (\d{4}) to capture the day, month, and year, and reorders them to create a standardised format.

4. Splitting Strings with str.split() and Regex

The str.split() method can also be used with regular expressions to split strings based on complex patterns. This is useful when data needs to be split into multiple columns.

Example: Splitting Based on Delimiters

Suppose you have a column with names in the format “First Last” and want to split it into two columns.

# Split the 'name' column into 'first_name' and 'last_name' 
df[['first_name', 'last_name']] = df['name'].str.split(' ', expand=True) 

# View the result 
print(df[['name', 'first_name', 'last_name']].head())

This example splits the name into first and last names using the space as a delimiter. The expand=True parameter ensures the result is returned as separate columns.

Example: Splitting with Complex Delimiters

You can also use a regex pattern for splitting, such as splitting a string on multiple possible delimiters.

# Split based on commas, semicolons, or spaces 
df[['part1', 'part2']] = df['complex_string'].str.split(r'[;,\s]+', expand=True)

This splits the string whenever it encounters a comma, semicolon, or space, using the regex pattern [;,\s]+.

5. Handling Complex Patterns

Regular expressions allow you to handle complex text patterns, such as validating or extracting particular text formats. Here are a few examples of more advanced usage:

Validating Text Patterns

You can validate whether a column follows a specific format (e.g., checking for valid email addresses).

# Validate email format (treat missing values as invalid) 
valid_emails = df['email'].str.contains(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', regex=True, na=False) 

# View valid emails 
print(df[valid_emails])

This regex ensures that the email addresses follow a common format, with a username, @ symbol, domain, and valid top-level domain (TLD).
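
The same boolean mask can be inverted to surface the entries that fail validation, which is handy for data-quality checks:

# View rows with invalid or missing email addresses 
print(df[~valid_emails])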

Extracting Complex Information

Regular expressions are useful when dealing with complex, semi-structured data, such as extracting dates and product codes or identifying key patterns in text.

# Extract product codes from a complex string 
df['product_code'] = df['description'].str.extract(r'(\b[A-Z]{3}-\d{4}\b)') 

# View the result 
print(df[['description', 'product_code']].head())

This example extracts product codes in the format “ABC-1234” from a description column.

Case Study Example: Analysing a Real-World Text Dataset

In this section, we’ll work through a practical example of analysing a real-world text dataset with Pandas. We will demonstrate how to load, clean, manipulate, and extract insights from the data using the techniques discussed earlier, including string methods, regular expressions, and custom functions. The case study uses a dataset of customer reviews for a product or service.

1. Loading the Dataset

Let’s assume we have a dataset (customer_reviews.csv) that contains product reviews and other relevant fields such as the customer name, review text, rating, and review date. We will start by loading the data into a Pandas DataFrame.

import pandas as pd 

# Load the dataset 
df = pd.read_csv('customer_reviews.csv') 

# View the first few rows of the DataFrame 
print(df.head())

The dataset contains the following columns:

  • review_id: Unique identifier for each review
  • customer_name: The name of the reviewer
  • review_text: The text content of the review
  • rating: The rating given by the customer (1 to 5)
  • review_date: The date when the review was posted

2. Cleaning and Preprocessing the Text

Next, we will clean and preprocess the review_text column by handling missing values, removing unnecessary characters, and standardising the format.

Handling Missing Values

First, we will check for missing values in the review_text column and replace them with a placeholder value like “No review.”

# Fill missing reviews with 'No review' 
df['review_text'] = df['review_text'].fillna('No review') 

# Check for missing values 
print(df['review_text'].isnull().sum())

Removing Punctuation and Lowercasing Text

To ensure uniformity, we will remove punctuation from the reviews and convert the text to lowercase for easier analysis.

import string 

# Define a function to clean the text 
def clean_review(text): 
  text = text.translate(str.maketrans('', '', string.punctuation)) # Remove punctuation 
  text = text.lower() # Convert to lowercase 
  return text 
  
# Apply the cleaning function to the 'review_text' column 
df['cleaned_review'] = df['review_text'].apply(clean_review) 
# View the cleaned data 
print(df[['review_text', 'cleaned_review']].head())

3. Extracting Information with Regular Expressions

Let’s extract some useful insights from the review text. We will start by checking if any reviews mention specific keywords related to the product (e.g., “quality,” “price,” “service”) using regular expressions.

Finding Keyword Mentions

We can use str.contains() to find reviews that mention certain keywords like “quality” and “price.”

# Search for reviews mentioning 'quality' or 'price' 
df['mentions_quality'] = df['cleaned_review'].str.contains(r'\bquality\b', regex=True) 
df['mentions_price'] = df['cleaned_review'].str.contains(r'\bprice\b', regex=True) 

# View the result 
print(df[['cleaned_review', 'mentions_quality', 'mentions_price']].head())

Here, we use word boundaries (\b) in the regex pattern to ensure we capture whole words only.

4. Sentiment Analysis Using Custom Functions

For deeper analysis, let’s define a simple sentiment analysis function that classifies reviews as “positive,” “negative,” or “neutral” based on the presence of certain keywords. For instance, reviews that mention words like “good,” “excellent,” or “happy” can be marked as positive, and those with words like “bad,” “poor,” or “disappointed” can be labelled as negative.

Custom Sentiment Function

# Define a function for basic sentiment analysis 
def classify_sentiment(text): 
  positive_words = ['good', 'great', 'excellent', 'happy', 'love'] 
  negative_words = ['bad', 'poor', 'disappointed', 'unhappy', 'hate'] 
  
  # Check for positive and negative words in the text 
  if any(word in text for word in positive_words): 
    return 'positive' 
  elif any(word in text for word in negative_words): 
    return 'negative' 
  else: 
    return 'neutral' 

# Apply the sentiment function to the cleaned reviews 
df['sentiment'] = df['cleaned_review'].apply(classify_sentiment) 

# View the sentiment classification 
print(df[['cleaned_review', 'sentiment']].head())

This simple approach gives us a basic sentiment classification based on the presence of positive or negative words.

5. Aggregating and Analysing Review Data

Now that we’ve classified the reviews, we can perform some aggregate analysis to gain insights into customer feedback patterns.

Sentiment Distribution

Let’s calculate the distribution of sentiments to understand the overall customer satisfaction.

# Calculate the sentiment distribution 
sentiment_counts = df['sentiment'].value_counts() 

# View the sentiment distribution 
print(sentiment_counts)

This shows how many reviews fall into each sentiment category (positive, negative, or neutral).

Average Rating by Sentiment

Next, we will calculate the average rating for each sentiment group to see if there’s a correlation between sentiment and the rating given.

# Calculate the average rating for each sentiment category 
average_rating_by_sentiment = df.groupby('sentiment')['rating'].mean() 

# View the result 
print(average_rating_by_sentiment)

This shows whether positive reviews tend to have higher ratings and negative reviews tend to have lower ratings.

6. Visualising the Results

Finally, we can create some simple visualisations to display the results of our analysis. For example, we can plot the distribution of sentiments and the average rating for each sentiment category using Matplotlib or Seaborn.

import matplotlib.pyplot as plt 
import seaborn as sns 

# Plot the sentiment distribution 
plt.figure(figsize=(8, 6)) 
sns.countplot(x='sentiment', data=df, palette='coolwarm') 
plt.title('Sentiment Distribution') 
plt.xlabel('Sentiment') 
plt.ylabel('Count') 
plt.show() 

# Plot the average rating by sentiment 
plt.figure(figsize=(8, 6)) 
sns.barplot(x=average_rating_by_sentiment.index, y=average_rating_by_sentiment.values, palette='coolwarm') 
plt.title('Average Rating by Sentiment') 
plt.xlabel('Sentiment') 
plt.ylabel('Average Rating') 
plt.show()
[Figures: the sentiment distribution and the average rating by sentiment, produced by the plots above]

These visualisations give a clearer picture of how customers feel about the product and how their sentiment relates to the ratings they give.

Conclusion

This blog post explored various techniques for working with text data in Pandas, from loading and cleaning text to applying advanced manipulations using regular expressions. By leveraging Pandas’ capabilities, we demonstrated how to analyse textual information effectively and extract meaningful insights.

Throughout the sections, we covered key aspects, including:

  • Loading Text Data: Understanding how to import text data from various formats and structure it within a DataFrame.
  • String Methods: Utilising built-in string functions to manipulate and analyse text efficiently.
  • Filtering and Extracting Information: Applying filters and extraction techniques to hone in on specific data points.
  • Handling Missing Text Data: Strategies for dealing with missing values to maintain data integrity.
  • Applying Custom Functions: Creating and applying custom functions for specialised text processing tasks.
  • Concatenating and Merging Text Columns: Combining information from multiple columns to create more informative datasets.
  • Advanced Text Manipulation: Utilising regular expressions for pattern matching, validation, and complex text operations.
  • Case Study Example: Analysing a real-world text dataset to demonstrate the practical application of these techniques in deriving insights.

By mastering these techniques, you can enhance your data analysis skills and effectively handle text data in various contexts, from customer reviews to social media sentiment analysis. The ability to manipulate and analyse text data opens up numerous opportunities for gaining insights, driving decisions, and creating value from unstructured data.

As you continue to work with text data, consider exploring additional libraries, such as nltk or spaCy, for more advanced natural language processing capabilities. Combined with Pandas, these tools can empower you to tackle even more complex text analysis challenges.

About the Author

Neri Van Otten

Neri Van Otten is the founder of Spot Intelligence, a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation. Dedicated to making your projects succeed.
