v2 Simplified (Single-Model RandomForest)

Kaggle Notebook

View code on Kaggle

Score

LB: 0.77990
~0.01 improvement over v1 baseline (~0.77)

Changes

Feature engineering expansion:
Extract Title from Name (Mr/Mrs/Miss/Master/Rare)
FamilySize = SibSp + Parch + 1
IsAlone = (FamilySize == 1)
FareLog = log1p(Fare)
Missing-value strategy:
Age: filled with group median by Pclass + Sex
Fare: filled with global median
Embarked: filled with mode
Model simplification:
Removed XGB/LGBM/LR/GBDT multi-model comparison
Removed RandomizedSearchCV hyperparameter search
Single-model RandomForest with conservative parameters: max_depth=5, min_samples_split=10, min_samples_leaf=5
Feature pruning:
Reduced from 14 to 8 features
Dropped: CabinLetter, TicketPrefix, AgeBin, FareBin, Pclass_Sex
Kept: Pclass, Sex, Age, FareLog, FamilySize, IsAlone, Title, Embarked
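The feature engineering and missing-value steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `engineer_features`, the title regex, and the Mlle/Ms/Mme synonym mapping are assumptions, and the column names follow the standard Titanic CSV. Whether the medians were computed on train only or on train + test combined is not stated in these notes.

```python
import numpy as np
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with Title, FamilySize, IsAlone, FareLog added
    and Age / Fare / Embarked imputed (hypothetical helper)."""
    out = df.copy()

    # Title: the token between the comma and the first period in Name,
    # e.g. "Braund, Mr. Owen Harris" -> "Mr".
    title = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    # Assumed synonym mapping; the notes only name the final groups.
    title = title.replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})
    common = ["Mr", "Mrs", "Miss", "Master"]
    out["Title"] = title.where(title.isin(common), "Rare")

    # Family features.
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    out["IsAlone"] = (out["FamilySize"] == 1).astype(int)

    # Age: median within each (Pclass, Sex) group.
    # A group with no known ages at all would stay NaN.
    group_median = out.groupby(["Pclass", "Sex"])["Age"].transform("median")
    out["Age"] = out["Age"].fillna(group_median)

    # Fare: global median. Embarked: most frequent value.
    out["Fare"] = out["Fare"].fillna(out["Fare"].median())
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])

    # Log-transform Fare after imputation so FareLog has no gaps.
    out["FareLog"] = np.log1p(out["Fare"])
    return out
```

Sex, Title, and Embarked would still need numeric encoding before being passed to the model; the encoding scheme is not described in these notes.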

Rationale

The v1 advanced version (multi-model ensemble) overfit severely: CV was 0.83 but LB only 0.75, a gap of 8 percentage points. Root causes:
891 rows cannot sustain a 5-model comparison + 40 parameter searches + 4-model Soft Voting
CabinLetter has a 77% missing rate, too noisy
AgeBin/FareBin binning loses information
Pclass_Sex is redundant with Pclass + Sex

Simplification strategy: less is more. On small datasets, fewer features + simpler model + conservative parameters → more stable.
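A single conservative RandomForest with a plain cross-validation check might look like the sketch below. Only `max_depth`, `min_samples_split`, and `min_samples_leaf` come from these notes; `n_estimators`, the seed, the 5-fold stratified split, and the helper names `make_model` and `cv_accuracy` are assumptions. `X` is assumed to be already numerically encoded.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def make_model(seed: int = 42) -> RandomForestClassifier:
    """Build the single conservative RandomForest (hypothetical helper)."""
    return RandomForestClassifier(
        n_estimators=300,      # assumption: not stated in the notes
        max_depth=5,           # from the notes
        min_samples_split=10,  # from the notes
        min_samples_leaf=5,    # from the notes
        random_state=seed,     # assumption: fixed for reproducibility
    )


def cv_accuracy(X, y, n_splits: int = 5, seed: int = 42):
    """Return (mean, std) of stratified k-fold accuracy."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(make_model(seed), X, y, cv=cv, scoring="accuracy")
    return float(np.mean(scores)), float(np.std(scores))
```

With no hyperparameter search in the loop, the cross-validation score is not used to select anything, which is one reason it tends to be less optimistic than a score obtained after tuning on the same folds.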

Result Analysis

CV: ~0.81 (more realistic than the inflated 0.83 of the multi-model version)
LB: 0.77990
CV-LB gap: ~3 percentage points (reasonable, since the public test set is only ~200 rows)
Feature importance ranking: Sex > Title > FareLog > Pclass > Age > FamilySize > IsAlone > Embarked
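A ranking like the one above can be read off a fitted forest. The sketch below assumes the ranking came from scikit-learn's impurity-based `feature_importances_`, which these notes do not confirm; `rank_features` is a hypothetical helper name.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def rank_features(model: RandomForestClassifier, feature_names) -> pd.Series:
    """Return impurity-based importances sorted from most to least important."""
    importances = pd.Series(model.feature_importances_, index=list(feature_names))
    return importances.sort_values(ascending=False)
```

Impurity-based importances can favor continuous or high-cardinality features, so `sklearn.inspection.permutation_importance` on held-out data is a reasonable cross-check when the ordering matters.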

Problem Diagnosis

LB stalled at 0.78 without reaching 0.80, which suggests the model is missing patterns that matter in the test set:
1. Cabin missingness is itself a strong signal separating First from Third Class
2. Title=Master (boys) has much higher survival than Mr, but the two were mixed together
3. The Dr group's survival pattern differs from other Rare titles
4. Fare=0 outliers exist in test (crew / free tickets)
5. RF parameters are too conservative (max_depth=5), so complex interactions in test are not learned

Next Version (v3)

Add HasCabin feature
Keep Dr as a separate class in Title grouping
Slightly relax RF parameters
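The first two v3 items could be prototyped along these lines. This is a sketch under assumptions, not the planned implementation: `add_v3_features` is a hypothetical name, the title regex and synonym mapping are illustrative, and `Cabin` and `Name` are the standard Titanic CSV columns. No relaxed RF values are shown because the notes do not give any.

```python
import pandas as pd


def add_v3_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add HasCabin and a Title grouping that keeps Dr separate (hypothetical)."""
    out = df.copy()

    # HasCabin: 1 when a cabin is recorded, 0 when the value is missing.
    out["HasCabin"] = out["Cabin"].notna().astype(int)

    # Title: token between the comma and the first period in Name.
    title = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    # Assumed synonym mapping, not taken from the notes.
    title = title.replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})

    # Dr stays as its own class; everything else uncommon becomes Rare.
    keep = ["Mr", "Mrs", "Miss", "Master", "Dr"]
    out["Title"] = title.where(title.isin(keep), "Rare")
    return out
```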