
5 Critical Feature Engineering Mistakes That Kill Machine Learning Projects



 

Introduction

 
Feature engineering is the unsung hero of machine learning, and also its most common villain. While teams obsess over whether to use XGBoost or a neural network, the features feeding those models quietly determine whether the project lives or dies. The uncomfortable truth? Most machine learning projects fail not because of bad algorithms, but because of bad features.

The five mistakes covered in this article are responsible for countless failed deployments, wasted months of development time, and the dreaded “it worked in the notebook” syndrome. Each one is preventable. Each one is fixable. Understanding them transforms feature engineering from a guessing game into a systematic discipline that produces models worth deploying.

 

1. Data Leakage and Temporal Integrity: The Silent Model Killer

 

// The Problem

Data leakage is the most devastating mistake in feature engineering. It creates an illusion of success, showing exceptional validation accuracy, while guaranteeing complete failure in production where performance often drops to random chance. Leakage occurs when information from outside the training period, or information that would not be available at prediction time, influences features.

 

// How It Shows Up

→ Future Information Leakage (see the sketch after this list)

  • Using complete transaction history (including future) when predicting customer churn.
  • Including post-diagnosis medical tests to predict the diagnosis itself.
  • Training on historical data but using future statistics for normalization.

→ Pre-Split Contamination

  • Fitting scalers, encoders, or imputers on the entire dataset before the train-test split.
  • Computing aggregations across both training and test sets.
  • Allowing test set statistics to influence training.

→ Target Leakage

  • Computing target encodings without cross-fold validation.
  • Creating features that are perfect proxies for the target.
  • Using the target variable to create ‘predictive’ features.
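
To make the first category concrete, here is a minimal sketch of point-in-time feature construction. It assumes a hypothetical pandas DataFrame of transactions with transaction_date, customer_id, and amount columns; the only rule it enforces is that aggregates are computed from history that existed before the prediction date.

# Sketch: point-in-time feature construction (hypothetical column names)
import pandas as pd

def build_features_as_of(transactions, prediction_date):
    # Keep only history that existed before the prediction date
    history = transactions[transactions['transaction_date'] < prediction_date]

    # Per-customer aggregates computed strictly from past transactions
    features = history.groupby('customer_id')['amount'].agg(
        total_spend='sum', avg_spend='mean', n_transactions='count'
    )
    return features.reset_index()

# Usage: features built for a 2023-06-01 prediction ignore anything after that date
# features = build_features_as_of(transactions, pd.Timestamp('2023-06-01'))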

 

// Real-World Example

A fraud detection model achieved exceptional accuracy in development by including “transaction_reversal” as a feature. The problem was that reversals only happen after fraud is confirmed. In production, this feature did not exist at prediction time, and accuracy dropped to barely better than a coin flip.

 

// The Solution

→ Prevent Temporal Leakage
Always split data first, then engineer features. Never touch the test set during feature creation.

# Preventing test set leakage
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# NOT PREFERRED: Test set leakage
scaler = StandardScaler()
# Fitting on the full dataset bakes test set statistics into the scaler
X_scaled = scaler.fit_transform(X)
X_train_leak, X_test_leak, y_train_leak, y_test_leak = train_test_split(X_scaled, y)

# PREFERRED: No leakage
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
scaler.fit(X_train)  # Fit on training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

 

→ Use Time-Based Validation
For temporal data, random splits are inappropriate. Time-based splits respect the chronological order.

# Time-based validation
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)

for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    
    # Engineer features using only X_train
    # Validate on X_test

 

2. The Dimensionality Trap: Multicollinearity and Redundancy

 

// The Problem

Creating correlated, redundant, or irrelevant features leads to overfitting, where models memorize training data noise instead of learning real patterns. This results in impressive validation scores that completely fall apart in production. The curse of dimensionality means that as features increase relative to samples, models need exponentially more data to maintain performance.

 

// How It Shows Up

→ Multicollinearity and Redundancy (see the correlation check after this list)

  • Including age and birth_year simultaneously.
  • Adding both raw features and their aggregations (sum, mean, max of same data).
  • Creating multiple representations of the same underlying information.

→ High-Cardinality Encoding Disasters

  • One-hot encoding ZIP codes, creating tens of thousands of sparse columns.
  • Encoding user IDs, product SKUs, or other unique identifiers.
  • Creating more columns than training samples.
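
One lightweight way to surface redundant features before they pile up is a pairwise correlation scan over the numeric columns. The sketch below is illustrative only: the find_redundant_pairs helper and the 0.95 threshold are assumptions, not a universal rule, and each flagged pair is simply a candidate for dropping one of its two features.

# Sketch: flag highly correlated numeric feature pairs (illustrative 0.95 threshold)
import numpy as np
import pandas as pd

def find_redundant_pairs(X, threshold=0.95):
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)

    pairs = []
    for col in upper.columns:
        for row in upper.index:
            value = upper.loc[row, col]
            if pd.notna(value) and value > threshold:
                pairs.append((row, col, value))

    # Most strongly correlated pairs first
    return sorted(pairs, key=lambda p: -p[2])

# Usage: review the pairs and keep only one feature from each
# redundant = find_redundant_pairs(X_train)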

 

// Real-World Example

A customer churn model included highly correlated features and high-cardinality encodings, resulting in over 800 total features. With only 5,000 training samples, the model achieved impressive validation accuracy but performed poorly in production. After systematically pruning to 30 validated features, production accuracy improved significantly, training time dropped dramatically, and the model became interpretable enough to drive business decisions.

 

// The Solution

→ Maintain Healthy Dimensionality Ratios
The sample-to-feature ratio is the first line of defense against overfitting. A minimum ratio of 10:1 is recommended, meaning ten training samples for every feature. A ratio of 20:1 or higher is preferable for stable, generalizable models.
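
As a quick sanity check, the sketch below simply compares row and column counts against the 10:1 guideline; the check_dimensionality_ratio helper is a hypothetical convenience, not a library function.

# Sketch: sanity-check the sample-to-feature ratio against the 10:1 guideline
def check_dimensionality_ratio(X, min_ratio=10):
    n_samples, n_features = X.shape
    ratio = n_samples / n_features
    print(f"{n_samples} samples / {n_features} features = {ratio:.1f}:1")
    if ratio < min_ratio:
        print(f"WARNING: below the recommended {min_ratio}:1 ratio; consider pruning features")
    return ratio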

→ Validate Every Feature’s Contribution
Every feature in the final model should earn its place. Testing each feature by temporarily removing it and measuring the impact on cross-validation scores reveals redundant or harmful features.

# Test each feature's actual contribution
from sklearn.model_selection import cross_val_score

# Establish a baseline with all features
baseline_score = cross_val_score(model, X_train, y_train, cv=5).mean()

for feature in X_train.columns:
    X_temp = X_train.drop(columns=[feature])
    score = cross_val_score(model, X_temp, y_train, cv=5).mean()
    
    # If the score doesn't drop significantly (or improves), the feature might be noise
    if score >= baseline_score - 0.01:
        print(f"Consider removing: {feature}")

 

→ Use Learning Curves to Diagnose Problems
Learning curves reveal whether a model is suffering from high dimensionality. A large, persistent gap between training accuracy (high) and validation accuracy (low) signals overfitting.

# Learning curves to diagnose problems
from sklearn.model_selection import learning_curve
import numpy as np

train_sizes, train_scores, val_scores = learning_curve(
    model, X_train, y_train, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10)
)

# Large gap between curves = overfitting (reduce features)
# Both curves low and converged = underfitting

 

3. Target Encoding Traps: When Features Secretly Contain the Answer

 

// The Problem

Target encoding replaces categorical values with statistics derived from the target variable, such as the mean target value for each category. Done correctly, it is powerful. Done incorrectly, it creates features that leak target information directly into training data, producing spectacular validation metrics that collapse entirely in production. The model is not learning patterns; it is memorizing answers.

 

// How It Shows Up

  • Naive Target Encoding: Computing category means using the entire training set, then training on that same data. Applying target statistics without any form of regularization or smoothing.
  • Validation Contamination: Fitting target encoders before the train-validation split. Using global target statistics that include validation or test set rows.
  • Rare Category Disasters: Encoding categories with one or two samples using their exact target values. No smoothing toward global mean for low-frequency categories.

 

// The Solution

→ Use Out-of-Fold Encoding
The fundamental rule is simple: never let a row see target statistics computed from itself. The most robust approach is k-fold encoding, where training data is split into folds and each fold is encoded using statistics computed only from the other folds.

 
→ Apply Smoothing for Rare Categories
Small sample sizes produce unreliable statistics. Smoothing blends the category-specific mean with the global mean, weighted by sample size. A common formula is:

\[
\text{smoothed} = \frac{n \times \text{category\_mean} + m \times \text{global\_mean}}{n + m}
\]

where \( n \) is the category count and \( m \) is a smoothing parameter.

# Safe target encoding with cross-validation
from sklearn.model_selection import KFold
import numpy as np

def safe_target_encode(X, y, column, n_splits=5, min_samples=10):
    X_encoded = X.copy()
    global_mean = y.mean()
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    
    # Initialize the new column
    enc_col = f'{column}_enc'
    X_encoded[enc_col] = np.nan
    
    for train_idx, val_idx in kfold.split(X):
        fold_cats = X[column].iloc[train_idx]
        fold_y = y.iloc[train_idx]
        
        # Calculate per-category stats on the training fold only
        stats = fold_y.groupby(fold_cats).agg(['mean', 'count'])
        
        # Apply smoothing toward the global mean for rare categories
        smoothing = stats['count'] / (stats['count'] + min_samples)
        stats['smoothed'] = smoothing * stats['mean'] + (1 - smoothing) * global_mean
        
        # Map the smoothed statistics onto the held-out fold (positional indexing)
        X_encoded.iloc[val_idx, X_encoded.columns.get_loc(enc_col)] = (
            X[column].iloc[val_idx].map(stats['smoothed']).values
        )
    
    # Fill categories unseen in the training folds with the global mean
    X_encoded[enc_col] = X_encoded[enc_col].fillna(global_mean)
    
    return X_encoded

 

→ Validate Encoding Safety
After encoding, checking the correlation between the encoded feature and the target helps identify potential leakage. Legitimate target encodings typically show correlations between 0.1 and 0.5. Correlations above 0.8 are a red flag.

# Check encoding safety
import numpy as np

def check_encoding_safety(encoded_feature, target):
    correlation = np.corrcoef(encoded_feature, target)[0, 1]
    
    if abs(correlation) > 0.8:
        print(f"DANGER: Correlation {correlation:.3f} suggests target leakage")
    elif abs(correlation) > 0.5:
        print(f"WARNING: Correlation {correlation:.3f} is high")
    else:
        print(f"OK: Correlation {correlation:.3f} appears reasonable")

 

4. Outlier Mismanagement: The Data Points That Destroy Models

 

// The Problem

Outliers are extreme values that deviate significantly from the rest of the data. Mishandling them, whether through blind removal, naive capping, or complete ignorance, corrupts a model’s understanding of reality. The critical mistake is treating outlier handling as a mechanical step rather than a domain-informed decision that requires understanding why the outliers exist.

 

// How It Shows Up

  • Blind Removal: Deleting all points more than 1.5 × IQR beyond the quartiles without investigation. Using z-score thresholds without considering the underlying distribution.
  • Naive Capping: Winsorizing at arbitrary percentiles across all features. Capping values that represent legitimate rare events.
  • Complete Ignorance: Training models on raw data with extreme values distorting learned relationships. Letting data entry errors propagate through the pipeline.

 

// Real-World Example

An insurance pricing model removed all claims above the 99th percentile as “outliers” without investigation. This eliminated legitimate catastrophic claims, precisely the events the model needed to price correctly. The model performed beautifully on average claims but catastrophically underpriced policies for high-risk customers. The “outliers” were not errors; they were the most important data points in the entire dataset.

 

// The Solution

→ Investigate Before Acting
Never remove or transform outliers without understanding their source. Asking the right questions is essential: Are these data entry errors? Are these legitimate rare events? Are these from a different population?

# Investigate outliers before acting
import numpy as np

def investigate_outliers(df, column, threshold=3):
    mean, std = df[column].mean(), df[column].std()
    outliers = df[np.abs((df[column] - mean) / std) > threshold]
    
    print(f"Found {len(outliers)} outliers")
    print(f"Outlier summary: {outliers[column].describe()}")
    
    return outliers

 

→ Create Outlier Indicators Instead of Removing
Preserving outlier information as features instead of removing it maintains valuable signal while mitigating distortion.

# Create outlier features instead of removing
import numpy as np

def create_outlier_features(df, columns, threshold=3):
    df_result = df.copy()
    
    for col in columns:
        mean, std = df[col].mean(), df[col].std()
        z_scores = np.abs((df[col] - mean) / std)
        
        # Flag outliers as a feature
        df_result[f'{col}_is_outlier'] = (z_scores > threshold).astype(int)
        
        # Create capped version while keeping original
        lower, upper = df[col].quantile(0.01), df[col].quantile(0.99)
        df_result[f'{col}_capped'] = df[col].clip(lower, upper)
        
    return df_result

 

→ Use Robust Methods Instead of Removal
Robust scaling uses median and IQR instead of mean and standard deviation. Tree-based models are naturally robust to outliers.

# Robust methods instead of removal
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import HuberRegressor
from sklearn.ensemble import RandomForestRegressor

# Robust scaling: Uses median and IQR instead of mean and std
robust_scaler = RobustScaler()
X_scaled = robust_scaler.fit_transform(X)

# Robust regression: Downweights outliers
huber = HuberRegressor(epsilon=1.35)

# Tree-based models: Naturally robust to outliers
rf = RandomForestRegressor()

 

5. Model-Feature Mismatch and Over-Engineering

 

// The Problem

Different algorithms have fundamentally different capabilities for learning patterns from data. A common and costly mistake is applying the same feature engineering approach regardless of the model being used. This leads to wasted effort, unnecessary complexity, and often worse performance. Additionally, over-engineering creates unnecessarily complex feature transformations that add no predictive value while dramatically increasing maintenance burden.

 

// How It Shows Up

  • Over-Engineering for Tree Models: Creating polynomial features for Random Forest or XGBoost. Manually encoding interactions when trees can learn them automatically.
  • Under-Engineering for Linear Models: Using raw features with Linear/Logistic Regression. Expecting linear models to learn non-linear relationships without explicit interaction terms.
  • Pipeline Proliferation: Chaining dozens of transformers when three would suffice. Building “flexible” systems with hundreds of configuration options that no one understands.

 

// Model Capability Matrix

Model Type      | Non-Linearity? | Interactions? | Needs Scaling? | Handles Missing? | Feature Eng. Effort
Linear/Logistic | No             | No            | Yes            | No               | High
Decision Tree   | Yes            | Yes           | No             | Yes              | Low
XGBoost/LGBM    | Yes            | Yes           | No             | Yes              | Low
Neural Network  | Yes            | Yes           | Yes            | No               | Medium
SVM             | Via kernel     | Via kernel    | Yes            | No               | Medium
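
To illustrate the "High" feature-engineering need for linear models in the matrix above, the sketch below compares a plain logistic regression against the same model fed explicit interaction terms. Whether the extra terms pay off depends entirely on the dataset, so treat this as a pattern rather than a recipe.

# Sketch: explicit interaction terms often help linear models, rarely help trees
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

raw_linear = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000))
])

interaction_linear = Pipeline([
    ('poly', PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000))
])

# Compare with cross-validation; keep the interactions only if the gain is real
# print(cross_val_score(raw_linear, X, y, cv=5).mean())
# print(cross_val_score(interaction_linear, X, y, cv=5).mean())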

 

// The Solution

→ Start with Baselines
Always establish performance with minimal preprocessing before adding complexity. This provides a reference point to measure whether additional engineering is worthwhile.

# Start with baselines
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Start simple, add complexity only when justified
baseline_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# Pass the full pipeline to cross_val_score to prevent leakage
baseline_score = cross_val_score(
    baseline_pipeline, X, y, cv=5
).mean()

print(f"Baseline: {baseline_score:.3f}")

 

→ Measure Complexity Cost
Every addition to the pipeline should be justified by measurable improvement. Tracking both performance gain and computational cost helps make informed decisions.

# Measure complexity cost
import time
from sklearn.model_selection import cross_val_score

def evaluate_pipeline_tradeoff(simple_pipe, complex_pipe, X, y):
    start = time.time()
    simple_score = cross_val_score(simple_pipe, X, y, cv=5).mean()
    simple_time = time.time() - start
    
    start = time.time()
    complex_score = cross_val_score(complex_pipe, X, y, cv=5).mean()
    complex_time = time.time() - start
    
    improvement = complex_score - simple_score
    time_increase = complex_time / simple_time if simple_time > 0 else 0
    
    print(f"Performance gain: {improvement:.3f}")
    print(f"Time increase: {time_increase:.1f}x")
    print(f"Worth it: {improvement > 0.01 and time_increase < 5}")

 

→ Follow the Rule of Three
Before implementing a custom solution, verifying that three standard approaches have failed prevents unnecessary complexity.

# Try standard approaches first (Rule of Three)
from sklearn.preprocessing import OneHotEncoder
from category_encoders import TargetEncoder
from sklearn.model_selection import cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

# Example setup for categorical feature evaluation
def evaluate_encoders(X, y, cat_cols, model):
    strategies = [
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
        ('target', TargetEncoder()),
    ]
    
    for name, encoder in strategies:
        preprocessor = ColumnTransformer(
            transformers=[('enc', encoder, cat_cols)],
            remainder="passthrough"
        )
        pipe = make_pipeline(preprocessor, model)
        score = cross_val_score(pipe, X, y, cv=5).mean()
        print(f"{name}: {score:.3f}")

# Only build custom solution if ALL standard approaches fail

 

Conclusion

 
Feature engineering remains the highest-leverage activity in machine learning, but it is also where most projects fail. The five critical mistakes covered in this article represent the most common and devastating pitfalls that doom machine learning projects.

Data leakage creates an illusion of success that evaporates in production. The dimensionality trap leads to overfitting through redundant and correlated features. Target encoding traps allow features to secretly contain the answer. Outlier mismanagement either destroys valuable signal or allows errors to corrupt the model. Finally, model-feature mismatch and over-engineering waste resources on unnecessary complexity.

Mastering these concepts dramatically increases the chances of building models that actually work in production. The key principles are consistent: understand the data deeply before transforming it, validate every feature’s contribution, respect temporal boundaries, match engineering effort to model capabilities, and prefer simplicity over complexity. Following these guidelines saves weeks of debugging and transforms feature engineering from a source of failure into a competitive advantage.
 
 

Rachel Kuznetsov has a Master’s in Business Analytics and thrives on tackling complex data puzzles and searching for fresh challenges to take on. She’s committed to making intricate data science concepts easier to understand and is exploring the various ways AI makes an impact on our lives. On her continuous quest to learn and grow, she documents her journey so others can learn alongside her. You can find her on LinkedIn.


