Garbage in, garbage out. No model, however sophisticated, can compensate for a poorly prepared dataset. Data cleaning is often the difference between a model that performs well in production and one that fails silently.
This checklist covers the essential steps, in order, with practical Python snippets for each.
Step 1: Understand Your Data First
Before touching anything, spend time looking at the data. Cleaning without understanding leads to mistakes you won’t catch until much later.
```python
import pandas as pd

df = pd.read_csv('data.csv')

print(df.shape)                       # rows × columns
print(df.dtypes)                      # data type per column
print(df.describe())                  # statistical summary (numerical)
print(df.describe(include='object'))  # summary for categorical columns
print(df.head(10))                    # first 10 rows
```
Ask yourself:
- What does each column represent? What are the expected ranges and categories?
- What is the target variable and how is it distributed?
- Are there any obvious issues visible immediately (e.g., dates stored as strings)?
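These questions can often be answered with a couple of lines. A minimal sketch, using a toy frame in place of the loaded CSV (the `signup_date` and `target` columns are illustrative):

```python
import pandas as pd

# Toy frame standing in for df = pd.read_csv('data.csv')
df = pd.DataFrame({
    'signup_date': ['2023-01-05', '2023-02-10', '2023-03-15'],  # dates stored as strings
    'target': [0, 1, 0],
})

# Target distribution: is the problem imbalanced?
print(df['target'].value_counts(normalize=True))

# Columns of dtype 'object' are candidates for type fixes (dates or numbers as text)
object_cols = df.select_dtypes(include='object').columns.tolist()
print(object_cols)  # ['signup_date']
```

Here the target check reveals a 2:1 class split, and the `object` dtype on `signup_date` flags exactly the "dates stored as strings" issue mentioned above.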
Step 2: Handle Missing Values
Missing values are almost always present and must be addressed explicitly. Never let a model silently drop or zero-fill them without your decision.
```python
# Identify missing values
print(df.isnull().sum())
print(df.isnull().mean().sort_values(ascending=False) * 100)  # % missing per column
```
Strategy by severity:
| Missing % | Recommended Approach |
|---|---|
| < 5% | Impute (median / mode) or drop rows |
| 5–30% | Impute; add a `_missing` indicator column to preserve the signal |
| > 30% | Consider dropping the column; check if missingness itself is informative |
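Whether missingness is informative can be checked directly by comparing the target rate for rows with and without the value. A small sketch on hypothetical `income` and `target` columns:

```python
import pandas as pd

# Hypothetical data: does missing income correlate with the target?
df = pd.DataFrame({
    'income': [50000, None, 62000, None, 58000, None],
    'target': [0, 1, 0, 1, 0, 1],
})

# Target rate for rows where income is missing vs present
rate_by_missingness = df.groupby(df['income'].isnull())['target'].mean()
print(rate_by_missingness)
```

In this toy data the target rate is 1.0 when income is missing and 0.0 otherwise, strong evidence that the missingness itself is worth keeping as a feature.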
```python
from sklearn.impute import SimpleImputer

# Preserve the missingness signal before imputing
df['income_was_missing'] = df['income'].isnull().astype(int)

# Numerical: median imputation (robust to outliers)
imputer = SimpleImputer(strategy='median')
df['age'] = imputer.fit_transform(df[['age']])

# Categorical: most frequent value
df['city'] = df['city'].fillna(df['city'].mode()[0])
```
Step 3: Remove Duplicates
Duplicate rows inflate your dataset and silently bias model training, especially in classification tasks where they can shift class distributions.
```python
print(f"Duplicate rows: {df.duplicated().sum()}")
df = df.drop_duplicates().reset_index(drop=True)
```
Also check for near-duplicates: rows identical except for minor formatting differences (e.g., "New York" vs "new york" vs "New York ").
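`duplicated()` will not catch these, because the raw strings differ. One approach is to normalise formatting first and then dedupe; a sketch with a hypothetical `city` column:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['New York', 'new york', 'New York  '],
    'sales': [10, 10, 10],
})

# An exact check misses these rows: the strings differ by case and whitespace
print(df.duplicated().sum())  # 0

# Normalise formatting first, then dedupe
df['city'] = df['city'].str.strip().str.lower()
df = df.drop_duplicates().reset_index(drop=True)
print(len(df))  # 1
```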
Step 4: Fix Data Types
Columns are frequently imported with the wrong type: numbers stored as strings, dates as objects, categories as unconstrained text.
```python
# Convert to correct types
df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df['segment'] = df['segment'].astype('category')

# Strip whitespace from all string columns
str_cols = df.select_dtypes(include='object').columns
df[str_cols] = df[str_cols].apply(lambda col: col.str.strip())

# Extract useful components from datetime columns
df['signup_year'] = df['signup_date'].dt.year
df['signup_month'] = df['signup_date'].dt.month
```
Step 5: Detect and Handle Outliers
Outliers can distort model parameters, especially in linear models, and lead to poor generalisation.
```python
import numpy as np

# IQR method: robust for non-normal distributions
Q1 = df['revenue'].quantile(0.25)
Q3 = df['revenue'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

n_outliers = ((df['revenue'] < lower) | (df['revenue'] > upper)).sum()
print(f"Outliers detected: {n_outliers}")
```
Handling options:
| Method | When to Use |
|---|---|
| Winsorise (cap) | The outlier is valid data but extreme: retain the row, limit the value |
| Log transform | Right-skewed distribution (revenue, population) |
| Remove | Confirmed data errors, not real observations |
```python
# Winsorise
df['revenue'] = df['revenue'].clip(lower=lower, upper=upper)

# Log transform for skewed data (log1p adds 1 to handle zeros)
df['revenue_log'] = np.log1p(df['revenue'])
```
Important: Do not remove outliers blindly. In fraud detection, outliers are the signal.
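For the third option in the table, removal, keep only rows that are impossible rather than merely extreme. A minimal sketch, assuming a domain where revenue cannot be negative:

```python
import pandas as pd

df = pd.DataFrame({'revenue': [120.0, 95.0, -40.0, 15000.0]})

# Remove only confirmed data errors: a negative revenue is impossible in this
# (assumed) domain, while 15000 may be a legitimate large customer, so keep it
df = df[df['revenue'] >= 0].reset_index(drop=True)
print(len(df))  # 3
```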
Step 6: Standardise Inconsistent Values
String columns often contain multiple representations of the same category, a silent source of errors that corrupts groupings and encodings.
```python
print(df['country'].value_counts().head(20))

df['country'] = (
    df['country']
    .str.lower()
    .str.strip()
    .replace({
        'usa': 'united states',
        'us': 'united states',
        'uk': 'united kingdom',
        'gb': 'united kingdom',
    })
)
```
Step 7: Encode Categorical Variables
Machine learning models require numerical inputs. The encoding strategy you choose matters and depends on the nature of the category.
```python
# Ordinal encoding: for ordered categories
order_map = {'low': 0, 'medium': 1, 'high': 2}
df['priority'] = df['priority'].map(order_map)

# One-hot encoding: for nominal categories (no inherent order)
df = pd.get_dummies(df, columns=['city'], drop_first=True)
```
Watch out for high cardinality. A column with 500 unique cities will create 499 new columns. Group rare categories into an 'Other' bucket, or use target encoding.
```python
# Cap rare categories
threshold = 0.01  # categories with < 1% frequency become 'Other'
freq = df['city'].value_counts(normalize=True)
df['city'] = df['city'].where(df['city'].isin(freq[freq >= threshold].index), other='Other')
```
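The target-encoding alternative mentioned above can be sketched as a mean encoding, on a toy `city`/`target` pair:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['a', 'a', 'b', 'b', 'b', 'c'],
    'target': [1, 0, 1, 1, 1, 0],
})

# Mean target encoding: replace each category with the mean target within it.
# NOTE: compute these means on the training fold only (ideally with smoothing
# or cross-fitting), or the target leaks into the features.
means = df.groupby('city')['target'].mean()
df['city_encoded'] = df['city'].map(means)
print(df[['city', 'city_encoded']])
```

This keeps a 500-city column as a single numeric feature instead of 499 dummies, at the cost of the leakage risk flagged in the comment.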
Step 8: Scale Numerical Features
Many algorithms are sensitive to feature scale: linear models, SVMs, neural networks, and KNN all assume features are on comparable scales. Tree-based models (Random Forest, XGBoost) do not require scaling.
```python
from sklearn.preprocessing import StandardScaler

# StandardScaler: mean=0, std=1; the default choice for most models
scaler = StandardScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])
```
Critical: Fit the scaler only on training data, then use `transform()` on test data. Fitting on the full dataset leaks test information into training.
Step 9: Split Before You Fit Preprocessors (Prevent Leakage)
This is the most commonly violated rule. Perform your train/test split before any transformations that require fitting (imputers, scalers, encoders). Fitting on the full dataset and then splitting leaks future information into training.
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # preserves class distribution in both splits
)

# Fit on training data only, then transform both sets
numerical_cols = X_train.select_dtypes(include='number').columns
scaler = StandardScaler()
scaler.fit(X_train[numerical_cols])
X_train[numerical_cols] = scaler.transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])
```
Using `sklearn.pipeline.Pipeline` enforces this pattern by design and is the recommended approach for production code.
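A minimal sketch of that pattern: chaining an imputer and a scaler means `fit()` runs only on the training data, and the test data is only ever passed through `transform()`.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Every step is fit on the training data when fit_transform() is called,
# and merely applied to the test data by transform()
preprocess = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

X_train = np.array([[1.0], [2.0], [np.nan], [4.0]])
X_test = np.array([[3.0], [np.nan]])

X_train_p = preprocess.fit_transform(X_train)  # fit + transform on train
X_test_p = preprocess.transform(X_test)        # transform only on test
print(X_train_p.shape, X_test_p.shape)
```

Note that the NaN in the test set is filled with the training median, never its own; that is exactly the leak the pipeline structure prevents.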
Step 10: Validate the Cleaned Dataset
Before training any model, run a final sanity check. Skipping this step is how silent bugs make it to production.
```python
import seaborn as sns
import matplotlib.pyplot as plt

# No remaining missing values
assert X_train.isnull().sum().sum() == 0, "Missing values remain in training set"
assert X_test.isnull().sum().sum() == 0, "Missing values remain in test set"

# Correct shapes
print(f"Train: {X_train.shape} | Test: {X_test.shape}")

# Target distribution: check for class imbalance
print(y_train.value_counts(normalize=True).round(3))

# Feature ranges after scaling
print(X_train.describe().round(2))

# Correlation matrix: flag multicollinearity
sns.heatmap(X_train.corr(), cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.show()
```
Quick-Reference Checklist
| # | Step | Done? |
|---|---|---|
| 1 | Load and explore: shapes, dtypes, describe | ☐ |
| 2 | Identify and handle missing values | ☐ |
| 3 | Remove duplicate rows | ☐ |
| 4 | Fix incorrect data types and extract datetime features | ☐ |
| 5 | Detect and handle outliers | ☐ |
| 6 | Standardise inconsistent string / category values | ☐ |
| 7 | Encode categorical variables (ordinal / one-hot) | ☐ |
| 8 | Scale numerical features | ☐ |
| 9 | Split before fitting any preprocessors | ☐ |
| 10 | Validate: no nulls, correct distributions, no leakage | ☐ |
Final Thought
Data cleaning is not a one-time step; it’s an iterative process. You’ll discover new issues as you explore deeper, build features, and evaluate models. The best practitioners treat it as a first-class part of the workflow, not a box to tick.
Work through this checklist before every project, and your models will consistently start from a much stronger foundation.