Cleaning string columns in pandas removes whitespace, inconsistent capitalization, repeated spaces, and placeholder labels that make matching or grouping fail in subtle ways. Imported names, region codes, statuses, and identifiers can look readable while still containing characters that create duplicate categories or failed joins.
The Series.str accessor applies string methods column by column while preserving missing values unless a method has its own rule. pandas 3 infers text columns as the str dtype, but an explicit astype(“string”) keeps the cleanup path clear and makes missing text display as <NA> in the terminal output.
Clean into a copy first when the source column is needed for audit or rollback. After trimming and replacement, compare row counts, dtypes, missing-value counts, and category counts before using the cleaned column for joins, reports, or exports.
Related: How to read CSV files with pandas
Related: How to convert data types in pandas
Related: How to find missing values in pandas
Steps to clean string columns in pandas:
- Save a string cleanup script.
- clean_string_column_demo.py
import pandas as pd orders = pd.DataFrame( { "order_id": ["A100", "A101", "A102", "A103", "A104"], "customer": [ " Ada Lovelace ", "LIN CHEN", "Maya Patel", " n/a ", None, ], "status": [" Paid ", "PAID", " pending", "", pd.NA], } ) print(f"pandas {pd.__version__}") print() print("source values") print(orders.to_string(index=False)) print() cleaned = orders.copy() cleaned["customer"] = ( cleaned["customer"] .astype("string") .str.strip() .str.replace(r"\s+", " ", regex=True) .replace({"n/a": pd.NA, "N/A": pd.NA}) ) cleaned["status"] = ( cleaned["status"] .astype("string") .str.strip() .str.casefold() .replace({"": pd.NA, "n/a": pd.NA}) ) text_columns = ["customer", "status"] print("cleaned values") print(cleaned.to_string(index=False)) print() print("verification") print(f"rows before: {len(orders)}") print(f"rows after: {len(cleaned)}") print(f"customer dtype: {cleaned['customer'].dtype}") print(f"status dtype: {cleaned['status'].dtype}") print() print("missing values") print(cleaned[text_columns].isna().sum()) print() print("status counts") print(cleaned["status"].value_counts(dropna=False))
Replace orders with the DataFrame already loaded from CSV, Excel, SQL, Parquet, or another source. Keep the original column until the cleaned values pass review.
- Run the cleanup script.
$ python3 clean_string_column_demo.py pandas 3.0.3 source values order_id customer status A100 Ada Lovelace Paid A101 LIN CHEN PAID A102 Maya Patel pending A103 n/a A104 NaN NaN cleaned values order_id customer status A100 Ada Lovelace paid A101 LIN CHEN paid A102 Maya Patel pending A103 <NA> <NA> A104 <NA> <NA> verification rows before: 5 rows after: 5 customer dtype: string status dtype: string missing values customer 2 status 2 dtype: int64 status counts status paid 2 <NA> 2 pending 1 Name: count, dtype: int64[pyarrow] - Convert the target column to a string dtype before applying string methods.
cleaned["status"] = cleaned["status"].astype("string")
astype(“string”) converts non-missing numbers, booleans, and dates to text. Use it only on columns that should be treated as text.
- Trim leading and trailing whitespace.
cleaned["status"] = cleaned["status"].str.strip()
str.strip() removes surrounding whitespace, including spaces, tabs, and newline characters, without changing missing values.
- Normalize matching labels with case folding.
cleaned["status"] = cleaned["status"].str.casefold()
casefold() is useful for status, region, category, and code labels that should compare without capitalization differences. Preserve original case for display names when capitalization carries meaning.
- Collapse repeated spaces inside a text column.
cleaned["customer"] = ( cleaned["customer"] .astype("string") .str.strip() .str.replace(r"\s+", " ", regex=True) )
str.replace() treats patterns as literal text by default in current pandas. Use regex=True when the pattern is a regular expression such as \s+.
- Convert blank strings and placeholder labels after trimming.
cleaned["status"] = cleaned["status"].replace({"": pd.NA, "n/a": pd.NA}) cleaned["customer"] = cleaned["customer"].replace({"n/a": pd.NA, "N/A": pd.NA})
Empty strings are not missing values until they are replaced. Count them deliberately before dropping, filling, joining, or exporting the data.
Related: How to fill missing values in pandas
Related: How to drop missing values in pandas - Verify row count, dtypes, missing values, and cleaned categories.
text_columns = ["customer", "status"] print(f"rows before: {len(orders)}") print(f"rows after: {len(cleaned)}") print(cleaned[text_columns].dtypes) print(cleaned[text_columns].isna().sum()) print(cleaned["status"].value_counts(dropna=False))
The row counts should match unless the cleanup task intentionally removes rows. The dtype and count output should match the text and missing-value policy chosen for the cleaned columns.
- Remove the temporary script after the cleanup pattern is copied into project code.
$ rm clean_string_column_demo.py
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.