How to convert columns to categorical data in pandas

Converting a column to categorical data in pandas stores its labels with the category dtype: each distinct label is kept once, and the rows hold compact integer codes that point to it instead of repeating ordinary string values. Use it for fields such as teams, regions, states, ratings, or priorities when a limited set of labels drives grouping, sorting, plotting, or memory use.

A plain astype("category") conversion lets pandas infer the labels that are already present in a column. Use CategoricalDtype when the allowed labels or sort order must be explicit, especially for ordered values such as low, normal, and urgent.

Categories are not a general replacement for every text column. They work best when values repeat; if most rows contain unique strings, the category metadata can use as much or more memory than the original column. Values outside an explicit category list become missing, so check the converted columns before using them downstream.
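
As a rough sketch of that trade-off, the following compares deep memory use before and after conversion for a column with four repeated labels and for a column where every value is unique. The series contents and the compare helper are made up for illustration, and exact byte counts depend on the pandas version and string storage.

```python
import pandas as pd

# Hypothetical data: a few repeated labels vs. all-unique strings.
repeated = pd.Series(["api", "frontend", "ops", "api"] * 1000)
unique = pd.Series([f"ticket-{i}" for i in range(4000)])


def compare(series):
    """Return deep memory use before and after category conversion."""
    return {
        "original": int(series.memory_usage(deep=True)),
        "category": int(series.astype("category").memory_usage(deep=True)),
    }


repeated_bytes = compare(repeated)
unique_bytes = compare(unique)

print("repeated labels:", repeated_bytes)
print("unique labels:  ", unique_bytes)
```

The repeated column should shrink, while the all-unique column gains the integer codes on top of a label list that is as long as the original data.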

Steps to convert pandas columns to category dtype:

  1. Save a short categorical conversion script.
    categorical_dtype.py
    import pandas as pd
    from pandas.api.types import CategoricalDtype
     
    df = pd.DataFrame(
        {
            "ticket": [101, 102, 103, 104, 105, 106],
            "team": ["api", "frontend", "api", "ops", "frontend", "api"],
            "priority": ["normal", "urgent", "low", "normal", "low", "urgent"],
        }
    )
     
    print(f"pandas {pd.__version__}")
    print()
     
    print("source dtypes")
    print(df.dtypes)
    print()
     
    df["team"] = df["team"].astype("category")
     
    print("team dtype")
    print(df["team"].dtype)
    print()
     
    print("team categories")
    print(df["team"].cat.categories)
    print()
     
    label_sample = pd.Series(["api", "frontend", "ops", "api"] * 1000, dtype="str")
    memory = pd.Series(
        {
            "str_bytes": label_sample.memory_usage(deep=True),
            "category_bytes": label_sample.astype("category").memory_usage(deep=True),
        }
    )
     
    print("memory check")
    print(memory)
    print()
     
    priority_dtype = CategoricalDtype(
        categories=["low", "normal", "urgent"],
        ordered=True,
    )
    df["priority"] = df["priority"].astype(priority_dtype)
     
    print("priority dtype")
    print(df["priority"].dtype)
    print()
     
    print("priority categories")
    print(df["priority"].cat.categories)
    print()
     
    print("priority ordered")
    print(df["priority"].cat.ordered)
    print()
     
    print("sorted by priority")
    print(df.sort_values("priority").filter(["ticket", "priority"]).to_string(index=False))
    print()
     
    print("missing after conversion")
    print(df.filter(["team", "priority"]).isna().sum())

    Replace the sample df with the DataFrame already loaded in the working script. Keep the CategoricalDtype categories in the intended sort order.

  2. Run the script and confirm that the source label columns become categorical.
    $ python3 categorical_dtype.py
    pandas 3.0.3
    
    source dtypes
    ticket      int64
    team          str
    priority      str
    dtype: object
    
    team dtype
    category
    
    team categories
    Index(['api', 'frontend', 'ops'], dtype='str')
    
    memory check
    str_bytes         49132
    category_bytes     4171
    dtype: int64
    
    priority dtype
    category
    
    priority categories
    Index(['low', 'normal', 'urgent'], dtype='str')
    
    priority ordered
    True
    
    sorted by priority
     ticket priority
        103      low
        105      low
        101   normal
        104   normal
        102   urgent
        106   urgent
    
    missing after conversion
    team        0
    priority    0
    dtype: int64

    The small memory check uses repeated labels to show the kind of column where category can reduce memory. Measure the real column before treating memory savings as guaranteed.

  3. Convert an unordered label column when the existing labels are the allowed categories.
    df["team"] = df["team"].astype("category")

    pandas infers the category labels from the values present in the column. Inspect df["team"].cat.categories before relying on the label set for validation or export.
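
One way to act on that inspection is to compare the inferred labels with the set the downstream step expects. This is a minimal sketch; the expected set and the sample values are assumptions chosen so that one label ("ops") never appears in the data.

```python
import pandas as pd

# Hypothetical extract in which no row belongs to the "ops" team.
team = pd.Series(["frontend", "api", "frontend"]).astype("category")

# Labels the downstream step is assumed to expect.
expected = {"api", "frontend", "ops"}

inferred = list(team.cat.categories)
missing_labels = sorted(expected - set(inferred))
extra_labels = sorted(set(inferred) - expected)

print("inferred:", inferred)
print("expected but not inferred:", missing_labels)
print("inferred but not expected:", extra_labels)
```

A label that never occurs in the data is simply absent from the inferred categories, which is the main reason to prefer an explicit CategoricalDtype when the label set matters.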

  4. Convert an ordered column with an explicit category list.
    priority_dtype = CategoricalDtype(
        categories=["low", "normal", "urgent"],
        ordered=True,
    )
    df["priority"] = df["priority"].astype(priority_dtype)

    Any non-missing value that is not listed in categories becomes missing after conversion. Add or correct allowed labels before converting production data.
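
A pre-conversion check along these lines can list the offending values first. It is only a sketch: the source series, including the misspelled "urgnt", is invented for the example.

```python
import pandas as pd

allowed = ["low", "normal", "urgent"]

# Hypothetical source column with one typo and one genuinely missing value.
source = pd.Series(["normal", "urgnt", "low", None, "urgent"])

# Non-missing rows whose value is not in the allowed list.
outside = source.notna() & ~source.isin(allowed)
unexpected = sorted(source[outside].unique())

print("rows outside the category list:", int(outside.sum()))
print("unexpected labels:", unexpected)
```

Rows that are already missing are excluded on purpose, so the result lists only labels that the conversion would newly turn into missing values.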

  5. Verify the converted dtype, categories, and order flag.
    print(df["priority"].dtype)
    print(df["priority"].cat.categories)
    print(df["priority"].cat.ordered)

    cat.ordered must be True for ordered comparisons such as < and >= and for min() and max(). sort_values() follows the category order whether or not the flag is set.
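
The effect of the order flag can be sketched with a comparison against a label. The sample values are illustrative; cat.as_unordered() is used here only to produce an unordered copy for contrast.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

priority_dtype = CategoricalDtype(["low", "normal", "urgent"], ordered=True)
priority = pd.Series(["normal", "urgent", "low", "urgent"]).astype(priority_dtype)

# Comparisons use the position of each label in the category list.
at_least_normal = priority >= "normal"
highest = priority.max()
print(at_least_normal.tolist())
print(highest)

# The same comparison on an unordered copy is rejected.
unordered = priority.cat.as_unordered()
try:
    unordered >= "normal"
    unordered_error = None
except TypeError as exc:
    unordered_error = type(exc).__name__
print(unordered_error)
```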

  6. Check that conversion did not create unexpected missing values.
    print(df.filter(["team", "priority"]).isna().sum())

    A nonzero count after conversion usually means the explicit category list missed at least one source value.
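
When the count is nonzero, comparing the column before and after conversion shows which source values were dropped. The sketch below assumes a copy of the original column is still available; the sample values, including the wrongly capitalized "Urgent", are hypothetical.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

priority_dtype = CategoricalDtype(["low", "normal", "urgent"], ordered=True)

# Hypothetical source with a case mismatch, an unknown label, and a real gap.
source = pd.Series(["normal", "Urgent", "low", None, "high"], name="priority")
converted = source.astype(priority_dtype)

# Rows that had a value before conversion but are missing afterward.
lost = source.notna() & converted.isna()
lost_values = source[lost].value_counts().sort_index()

print("values lost in conversion:", int(lost.sum()))
print(lost_values)
```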

  7. Sort or compare the ordered column before using it in downstream output.
    print(df.sort_values("priority").filter(["ticket", "priority"]).to_string(index=False))

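    The sorted output should follow the category order low, normal, urgent rather than source-row order. These three labels also happen to be in alphabetical order, so sort with ascending=False or test with labels such as low, medium, high to confirm that the category order, not the alphabet, drives the sort.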
    Related: How to sort a pandas DataFrame
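
Because low, normal, urgent is also alphabetical, a label set whose intended order differs from the alphabet makes the sort behavior easier to see. The severity column and its labels below are hypothetical and are not part of the script above.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical labels whose intended order is not alphabetical.
severity_dtype = CategoricalDtype(["low", "medium", "high"], ordered=True)

df = pd.DataFrame(
    {
        "ticket": [201, 202, 203, 204],
        "severity": ["high", "low", "medium", "low"],
    }
)

# Sorting plain strings is alphabetical.
alphabetical = df.sort_values("severity", kind="stable")["severity"].tolist()

# After conversion, the sort follows the category list.
df["severity"] = df["severity"].astype(severity_dtype)
by_category = df.sort_values("severity", kind="stable")["severity"].tolist()

print("as strings:   ", alphabetical)
print("as categories:", by_category)
```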