Cleaning string columns in pandas removes whitespace, inconsistent capitalization, repeated spaces, and placeholder labels that make matching or grouping fail in subtle ways. Imported names, region codes, statuses, and identifiers can look readable while still containing characters that create duplicate categories or failed joins.
The Series.str accessor applies string methods column by column while preserving missing values unless a method has its own rule. pandas 3 infers text columns as the str dtype, but an explicit astype(“string”) keeps the cleanup path clear and makes missing text display as <NA> in the terminal output.
Clean into a copy first when the source column is needed for audit or rollback. After trimming and replacement, compare row counts, dtypes, missing-value counts, and category counts before using the cleaned column for joins, reports, or exports.
Related: How to read CSV files with pandas
Related: How to convert data types in pandas
Related: How to find missing values in pandas
import pandas as pd orders = pd.DataFrame( { "order_id": ["A100", "A101", "A102", "A103", "A104"], "customer": [ " Ada Lovelace ", "LIN CHEN", "Maya Patel", " n/a ", None, ], "status": [" Paid ", "PAID", " pending", "", pd.NA], } ) print(f"pandas {pd.__version__}") print() print("source values") print(orders.to_string(index=False)) print() cleaned = orders.copy() cleaned["customer"] = ( cleaned["customer"] .astype("string") .str.strip() .str.replace(r"\s+", " ", regex=True) .replace({"n/a": pd.NA, "N/A": pd.NA}) ) cleaned["status"] = ( cleaned["status"] .astype("string") .str.strip() .str.casefold() .replace({"": pd.NA, "n/a": pd.NA}) ) text_columns = ["customer", "status"] print("cleaned values") print(cleaned.to_string(index=False)) print() print("verification") print(f"rows before: {len(orders)}") print(f"rows after: {len(cleaned)}") print(f"customer dtype: {cleaned['customer'].dtype}") print(f"status dtype: {cleaned['status'].dtype}") print() print("missing values") print(cleaned[text_columns].isna().sum()) print() print("status counts") print(cleaned["status"].value_counts(dropna=False))
Replace orders with the DataFrame already loaded from CSV, Excel, SQL, Parquet, or another source. Keep the original column until the cleaned values pass review.
$ python3 clean_string_column_demo.py
pandas 3.0.3
source values
order_id customer status
A100 Ada Lovelace Paid
A101 LIN CHEN PAID
A102 Maya Patel pending
A103 n/a
A104 NaN NaN
cleaned values
order_id customer status
A100 Ada Lovelace paid
A101 LIN CHEN paid
A102 Maya Patel pending
A103 <NA> <NA>
A104 <NA> <NA>
verification
rows before: 5
rows after: 5
customer dtype: string
status dtype: string
missing values
customer 2
status 2
dtype: int64
status counts
status
paid 2
<NA> 2
pending 1
Name: count, dtype: int64[pyarrow]
cleaned["status"] = cleaned["status"].astype("string")
astype(“string”) converts non-missing numbers, booleans, and dates to text. Use it only on columns that should be treated as text.
cleaned["status"] = cleaned["status"].str.strip()
str.strip() removes surrounding whitespace, including spaces, tabs, and newline characters, without changing missing values.
cleaned["status"] = cleaned["status"].str.casefold()
casefold() is useful for status, region, category, and code labels that should compare without capitalization differences. Preserve original case for display names when capitalization carries meaning.
cleaned["customer"] = ( cleaned["customer"] .astype("string") .str.strip() .str.replace(r"\s+", " ", regex=True) )
str.replace() treats patterns as literal text by default in current pandas. Use regex=True when the pattern is a regular expression such as \s+.
cleaned["status"] = cleaned["status"].replace({"": pd.NA, "n/a": pd.NA}) cleaned["customer"] = cleaned["customer"].replace({"n/a": pd.NA, "N/A": pd.NA})
Empty strings are not missing values until they are replaced. Count them deliberately before dropping, filling, joining, or exporting the data.
Related: How to fill missing values in pandas
Related: How to drop missing values in pandas
text_columns = ["customer", "status"] print(f"rows before: {len(orders)}") print(f"rows after: {len(cleaned)}") print(cleaned[text_columns].dtypes) print(cleaned[text_columns].isna().sum()) print(cleaned["status"].value_counts(dropna=False))
The row counts should match unless the cleanup task intentionally removes rows. The dtype and count output should match the text and missing-value policy chosen for the cleaned columns.
$ rm clean_string_column_demo.py