How to clean string columns in pandas

Cleaning string columns in pandas removes whitespace, inconsistent capitalization, repeated spaces, and placeholder labels that make matching or grouping fail in subtle ways. Imported names, region codes, statuses, and identifiers can look readable while still containing characters that create duplicate categories or failed joins.

The Series.str accessor applies string methods column by column while preserving missing values unless a method has its own rule. pandas 3 infers text columns as the str dtype, but an explicit astype(“string”) keeps the cleanup path clear and makes missing text display as <NA> in the terminal output.

Clean into a copy first when the source column is needed for audit or rollback. After trimming and replacement, compare row counts, dtypes, missing-value counts, and category counts before using the cleaned column for joins, reports, or exports.

Steps to clean string columns in pandas:

Save a string cleanup script.

clean_string_column_demo.py

import pandas as pd
 
 
orders = pd.DataFrame(
    {
        "order_id": ["A100", "A101", "A102", "A103", "A104"],
        "customer": [
            " Ada Lovelace ",
            "LIN   CHEN",
            "Maya Patel",
            "  n/a ",
            None,
        ],
        "status": [" Paid ", "PAID", " pending", "", pd.NA],
    }
)
 
print(f"pandas {pd.__version__}")
print()
 
print("source values")
print(orders.to_string(index=False))
print()
 
cleaned = orders.copy()
cleaned["customer"] = (
    cleaned["customer"]
    .astype("string")
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .replace({"n/a": pd.NA, "N/A": pd.NA})
)
cleaned["status"] = (
    cleaned["status"]
    .astype("string")
    .str.strip()
    .str.casefold()
    .replace({"": pd.NA, "n/a": pd.NA})
)
text_columns = ["customer", "status"]
 
print("cleaned values")
print(cleaned.to_string(index=False))
print()
 
print("verification")
print(f"rows before: {len(orders)}")
print(f"rows after: {len(cleaned)}")
print(f"customer dtype: {cleaned['customer'].dtype}")
print(f"status dtype: {cleaned['status'].dtype}")
print()
 
print("missing values")
print(cleaned[text_columns].isna().sum())
print()
 
print("status counts")
print(cleaned["status"].value_counts(dropna=False))

Replace orders with the DataFrame already loaded from CSV, Excel, SQL, Parquet, or another source. Keep the original column until the cleaned values pass review.

Run the cleanup script.

$ python3 clean_string_column_demo.py
pandas 3.0.3

source values
order_id       customer   status
    A100  Ada Lovelace     Paid 
    A101     LIN   CHEN     PAID
    A102     Maya Patel  pending
    A103           n/a          
    A104            NaN      NaN

cleaned values
order_id     customer  status
    A100 Ada Lovelace    paid
    A101     LIN CHEN    paid
    A102   Maya Patel pending
    A103         <NA>    <NA>
    A104         <NA>    <NA>

verification
rows before: 5
rows after: 5
customer dtype: string
status dtype: string

missing values
customer    2
status      2
dtype: int64

status counts
status
paid       2
<NA>       2
pending    1
Name: count, dtype: int64[pyarrow]

Convert the target column to a string dtype before applying string methods.
```
cleaned["status"] = cleaned["status"].astype("string")
```
astype(“string”) converts non-missing numbers, booleans, and dates to text. Use it only on columns that should be treated as text.
Trim leading and trailing whitespace.
```
cleaned["status"] = cleaned["status"].str.strip()
```
str.strip() removes surrounding whitespace, including spaces, tabs, and newline characters, without changing missing values.
Normalize matching labels with case folding.
```
cleaned["status"] = cleaned["status"].str.casefold()
```
casefold() is useful for status, region, category, and code labels that should compare without capitalization differences. Preserve original case for display names when capitalization carries meaning.
Collapse repeated spaces inside a text column.
```
cleaned["customer"] = (
    cleaned["customer"]
    .astype("string")
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
)
```
str.replace() treats patterns as literal text by default in current pandas. Use regex=True when the pattern is a regular expression such as \s+.
Convert blank strings and placeholder labels after trimming.
```
cleaned["status"] = cleaned["status"].replace({"": pd.NA, "n/a": pd.NA})
cleaned["customer"] = cleaned["customer"].replace({"n/a": pd.NA, "N/A": pd.NA})
```
Empty strings are not missing values until they are replaced. Count them deliberately before dropping, filling, joining, or exporting the data.
Related: How to fill missing values in pandas
Related: How to drop missing values in pandas
Verify row count, dtypes, missing values, and cleaned categories.
```
text_columns = ["customer", "status"]
print(f"rows before: {len(orders)}")
print(f"rows after: {len(cleaned)}")
print(cleaned[text_columns].dtypes)
print(cleaned[text_columns].isna().sum())
print(cleaned["status"].value_counts(dropna=False))
```
The row counts should match unless the cleanup task intentionally removes rows. The dtype and count output should match the text and missing-value policy chosen for the cleaned columns.
Remove the temporary script after the cleanup pattern is copied into project code.
```
$ rm clean_string_column_demo.py
```

Author: Mohd Shakir Zakaria
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.