Garbage in, garbage out. No model, however sophisticated, can compensate for a poorly prepared dataset. Data cleaning is often the difference between a model that performs well in production and one that fails silently.
This checklist covers the essential steps, in order, with practical Python snippets for each.
Step 1: Understand Your Data First
Before touching anything, spend time looking at the data. Cleaning without understanding leads to mistakes you won’t catch until much later.
```python
import pandas as pd

df = pd.read_csv('data.csv')

print(df.shape)                       # rows × columns
print(df.dtypes)                      # data type per column
print(df.describe())                  # statistical summary (numerical)
print(df.describe(include='object'))  # summary for categorical columns
print(df.head(10))                    # first 10 rows
```
Ask yourself:
- What does each column represent? What are the expected ranges and categories?
- What is the target variable and how is it distributed?
- Are there any obvious issues visible immediately (e.g., dates stored as strings)?
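These questions can often be answered with a couple of lines. A minimal sketch, using a toy frame in place of the loaded CSV (the `signup_date` and `target` columns are illustrative):

```python
import pandas as pd

# Toy frame standing in for df = pd.read_csv('data.csv')
df = pd.DataFrame({
    'signup_date': ['2023-01-05', '2023-02-10', '2023-03-15'],  # dates stored as strings
    'target': [0, 1, 0],
})

# Target distribution: is the problem imbalanced?
print(df['target'].value_counts(normalize=True))

# Columns of dtype 'object' are candidates for type fixes (dates or numbers as text)
object_cols = df.select_dtypes(include='object').columns.tolist()
print(object_cols)  # ['signup_date']
```

Here the target check reveals a 2:1 class split, and the `object` dtype on `signup_date` flags exactly the "dates stored as strings" issue mentioned above.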
Step 2: Handle Missing Values
Missing values are almost always present and must be addressed explicitly. Never let a model silently drop or zero-fill them without your decision.
```python
# Identify missing values
print(df.isnull().sum())
print(df.isnull().mean().sort_values(ascending=False) * 100)  # % missing per column
```
Strategy by severity:
| Missing % | Recommended Approach |
|---|---|
| < 5% | Impute (median / mode) or drop rows |
| 5–30% | Impute; add a `_missing` indicator column to preserve the signal |
| > 30% | Consider dropping the column; check if missingness itself is informative |
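Whether missingness is informative can be checked directly by comparing the target rate for rows with and without the value. A small sketch on hypothetical `income` and `target` columns:

```python
import pandas as pd

# Hypothetical data: does missing income correlate with the target?
df = pd.DataFrame({
    'income': [50000, None, 62000, None, 58000, None],
    'target': [0, 1, 0, 1, 0, 1],
})

# Target rate for rows where income is missing vs present
rate_by_missingness = df.groupby(df['income'].isnull())['target'].mean()
print(rate_by_missingness)
```

In this toy data the target rate is 1.0 when income is missing and 0.0 otherwise, strong evidence that the missingness itself is worth keeping as a feature.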
```python
from sklearn.impute import SimpleImputer

# Preserve the missingness signal before imputing
df['income_was_missing'] = df['income'].isnull().astype(int)

# Numerical: median imputation (robust to outliers)
imputer = SimpleImputer(strategy='median')
df['age'] = imputer.fit_transform(df[['age']])

# Categorical: most frequent value
df['city'] = df['city'].fillna(df['city'].mode()[0])
```
Step 3: Remove Duplicates
Duplicate rows inflate your dataset and silently bias model training, especially in classification tasks where they can shift class distributions.
```python
print(f"Duplicate rows: {df.duplicated().sum()}")
df = df.drop_duplicates().reset_index(drop=True)
```
Also check for near-duplicates: rows identical except for minor formatting differences (e.g., "New York" vs "new york" vs "New York ").
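`duplicated()` will not catch these, because the raw strings differ. One approach is to normalise formatting first and then dedupe; a sketch with a hypothetical `city` column:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['New York', 'new york', 'New York  '],
    'sales': [10, 10, 10],
})

# An exact check misses these rows: the strings differ by case and whitespace
print(df.duplicated().sum())  # 0

# Normalise formatting first, then dedupe
df['city'] = df['city'].str.strip().str.lower()
df = df.drop_duplicates().reset_index(drop=True)
print(len(df))  # 1
```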
Step 4: Fix Data Types
Columns are frequently imported with the wrong type: numbers stored as strings, dates as objects, categories as unconstrained text.
```python
# Convert to correct types
df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df['segment'] = df['segment'].astype('category')

# Strip whitespace from all string columns
str_cols = df.select_dtypes(include='object').columns
df[str_cols] = df[str_cols].apply(lambda col: col.str.strip())

# Extract useful components from datetime columns
df['signup_year'] = df['signup_date'].dt.year
df['signup_month'] = df['signup_date'].dt.month
```
Step 5: Detect and Handle Outliers
Outliers can distort model parameters, especially in linear models, and lead to poor generalisation.
```python
import numpy as np

# IQR method: robust for non-normal distributions
Q1 = df['revenue'].quantile(0.25)
Q3 = df['revenue'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

n_outliers = ((df['revenue'] < lower) | (df['revenue'] > upper)).sum()
print(f"Outliers detected: {n_outliers}")
```
Handling options:
| Method | When to Use |
|---|---|
| Winsorise (cap) | The outlier is valid data but extreme: retain the row, limit the value |
| Log transform | Right-skewed distribution (revenue, population) |
| Remove | Confirmed data errors, not real observations |
```python
# Winsorise
df['revenue'] = df['revenue'].clip(lower=lower, upper=upper)

# Log transform for skewed data (log1p adds 1 to handle zeros)
df['revenue_log'] = np.log1p(df['revenue'])
```
Important: Do not remove outliers blindly. In fraud detection, outliers are the signal.
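For the third option in the table, removal, keep only rows that are impossible rather than merely extreme. A minimal sketch, assuming a domain where revenue cannot be negative:

```python
import pandas as pd

df = pd.DataFrame({'revenue': [120.0, 95.0, -40.0, 15000.0]})

# Remove only confirmed data errors: a negative revenue is impossible in this
# (assumed) domain, while 15000 may be a legitimate large customer, so keep it
df = df[df['revenue'] >= 0].reset_index(drop=True)
print(len(df))  # 3
```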
Step 6: Standardise Inconsistent Values
String columns often contain multiple representations of the same category, a silent source of errors that corrupts groupings and encodings.
```python
print(df['country'].value_counts().head(20))

df['country'] = (
    df['country']
    .str.lower()
    .str.strip()
    .replace({
        'usa': 'united states',
        'us': 'united states',
        'uk': 'united kingdom',
        'gb': 'united kingdom',
    })
)
```
Step 7: Encode Categorical Variables
Machine learning models require numerical inputs. The encoding strategy you choose matters and depends on the nature of the category.
```python
# Ordinal encoding: for ordered categories
order_map = {'low': 0, 'medium': 1, 'high': 2}
df['priority'] = df['priority'].map(order_map)

# One-hot encoding: for nominal categories (no inherent order)
df = pd.get_dummies(df, columns=['city'], drop_first=True)
```
Watch out for high cardinality. A column with 500 unique cities will create 499 new columns. Group rare categories into an 'Other' bucket, or use target encoding.
```python
# Cap rare categories
threshold = 0.01  # categories with < 1% frequency become 'Other'
freq = df['city'].value_counts(normalize=True)
df['city'] = df['city'].where(df['city'].isin(freq[freq >= threshold].index), other='Other')
```
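The target-encoding alternative mentioned above can be sketched as a mean encoding, on a toy `city`/`target` pair:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['a', 'a', 'b', 'b', 'b', 'c'],
    'target': [1, 0, 1, 1, 1, 0],
})

# Mean target encoding: replace each category with the mean target within it.
# NOTE: compute these means on the training fold only (ideally with smoothing
# or cross-fitting), or the target leaks into the features.
means = df.groupby('city')['target'].mean()
df['city_encoded'] = df['city'].map(means)
print(df[['city', 'city_encoded']])
```

This keeps a 500-city column as a single numeric feature instead of 499 dummies, at the cost of the leakage risk flagged in the comment.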
Step 8: Scale Numerical Features
Many algorithms are sensitive to feature scale: linear models, SVMs, neural networks, and KNN all assume features are on comparable scales. Tree-based models (Random Forest, XGBoost) do not require scaling.
```python
from sklearn.preprocessing import StandardScaler

# StandardScaler: mean=0, std=1; the default choice for most models
scaler = StandardScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])
```
Critical: Fit the scaler only on training data, then use `transform()` on test data. Fitting on the full dataset leaks test information into training.
Step 9: Split Before You Fit Preprocessors (Prevent Leakage)
This is the most commonly violated rule. Perform your train/test split before any transformations that require fitting (imputers, scalers, encoders). Fitting on the full dataset and then splitting leaks future information into training.
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # preserves class distribution in both splits
)

# Fit on training data only, then transform both sets
numerical_cols = X_train.select_dtypes(include='number').columns
scaler = StandardScaler()
scaler.fit(X_train[numerical_cols])
X_train[numerical_cols] = scaler.transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])
```
Using `sklearn.pipeline.Pipeline` enforces this pattern by design and is the recommended approach for production code.
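A minimal sketch of that pattern: chaining an imputer and a scaler means `fit()` runs only on the training data, and the test data is only ever passed through `transform()`.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Every step is fit on the training data when fit_transform() is called,
# and merely applied to the test data by transform()
preprocess = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

X_train = np.array([[1.0], [2.0], [np.nan], [4.0]])
X_test = np.array([[3.0], [np.nan]])

X_train_p = preprocess.fit_transform(X_train)  # fit + transform on train
X_test_p = preprocess.transform(X_test)        # transform only on test
print(X_train_p.shape, X_test_p.shape)
```

Note that the NaN in the test set is filled with the training median, never its own; that is exactly the leak the pipeline structure prevents.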
Step 10: Validate the Cleaned Dataset
Before training any model, run a final sanity check. Skipping this step is how silent bugs make it to production.
```python
import seaborn as sns
import matplotlib.pyplot as plt

# No remaining missing values
assert X_train.isnull().sum().sum() == 0, "Missing values remain in training set"
assert X_test.isnull().sum().sum() == 0, "Missing values remain in test set"

# Correct shapes
print(f"Train: {X_train.shape} | Test: {X_test.shape}")

# Target distribution: check for class imbalance
print(y_train.value_counts(normalize=True).round(3))

# Feature ranges after scaling
print(X_train.describe().round(2))

# Correlation matrix: flag multicollinearity
sns.heatmap(X_train.corr(), cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.show()
```
Quick-Reference Checklist
| # | Step | Done? |
|---|---|---|
| 1 | Load and explore: shapes, dtypes, describe | ☐ |
| 2 | Identify and handle missing values | ☐ |
| 3 | Remove duplicate rows | ☐ |
| 4 | Fix incorrect data types and extract datetime features | ☐ |
| 5 | Detect and handle outliers | ☐ |
| 6 | Standardise inconsistent string / category values | ☐ |
| 7 | Encode categorical variables (ordinal / one-hot) | ☐ |
| 8 | Scale numerical features | ☐ |
| 9 | Split before fitting any preprocessors | ☐ |
| 10 | Validate: no nulls, correct distributions, no leakage | ☐ |
Final Thought
Data cleaning is not a one-time step; it’s an iterative process. You’ll discover new issues as you explore deeper, build features, and evaluate models. The best practitioners treat it as a first-class part of the workflow, not a box to tick.
Work through this checklist before every project, and your models will consistently start from a much stronger foundation.