v2 Simplified (Single-Model RandomForest)
Kaggle Notebook
Score
- LB: 0.77990
- ~0.01 improvement over v1 baseline (~0.77)
Changes
- Feature engineering expansion:
  - Extract Title from Name (Mr/Mrs/Miss/Master/Rare)
  - FamilySize = SibSp + Parch + 1
  - IsAlone = (FamilySize == 1)
  - FareLog = log1p(Fare)
- Missing-value strategy:
  - Age: filled with group median by Pclass + Sex
  - Fare: filled with global median
  - Embarked: filled with mode
- Model simplification:
  - Removed XGB/LGBM/LR/GBDT multi-model comparison
  - Removed RandomizedSearchCV hyperparameter search
  - Single-model RandomForest with conservative parameters: max_depth=5, min_samples_split=10, min_samples_leaf=5
- Feature pruning:
  - Reduced from 14 to 8 features
  - Dropped: CabinLetter, TicketPrefix, AgeBin, FareBin, Pclass_Sex
  - Kept: Pclass, Sex, Age, FareLog, FamilySize, IsAlone, Title, Embarked
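The feature engineering and missing-value steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `add_features`, the title alias mapping (Mlle/Ms/Mme), and the choice to compute medians and the mode from whichever frame is passed in are all assumptions. The column names are the standard Titanic ones.

```python
import numpy as np
import pandas as pd

# Assumed merges of rare spellings; the notes only name the final groups.
TITLE_ALIASES = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"}
COMMON_TITLES = {"Mr", "Mrs", "Miss", "Master"}


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with imputed columns and the v2 engineered features."""
    out = df.copy()

    # Title: the token between the comma and the first period in Name.
    title = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    title = title.replace(TITLE_ALIASES)
    out["Title"] = title.where(title.isin(COMMON_TITLES), "Rare")

    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    out["IsAlone"] = (out["FamilySize"] == 1).astype(int)

    # Age: median within each (Pclass, Sex) group; the global median is an
    # assumed fallback for groups where every Age is missing.
    group_median = out.groupby(["Pclass", "Sex"])["Age"].transform("median")
    out["Age"] = out["Age"].fillna(group_median).fillna(out["Age"].median())

    # Fare: global median. Embarked: most frequent value.
    out["Fare"] = out["Fare"].fillna(out["Fare"].median())
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode().iloc[0])

    # Log-transform after imputation so FareLog has no missing values.
    out["FareLog"] = np.log1p(out["Fare"])
    return out


# Toy rows for illustration only (not taken from the notebook's results).
demo = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley",
        "Heikkinen, Miss. Laina",
        "Rothes, the Countess. of",
        "Palsson, Master. Gosta Leonard",
    ],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, np.nan, 2.0],
    "SibSp": [1, 1, 0, 0, 3],
    "Parch": [0, 0, 0, 0, 1],
    "Fare": [7.25, 71.28, 7.925, np.nan, 21.075],
    "Embarked": ["S", "C", "S", None, "S"],
})
result = add_features(demo)
```

Whether the statistics should come from the training set alone or from train and test combined is a separate design choice that these notes do not specify.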
Rationale
The v1 advanced version (multi-model ensemble) overfit severely: CV 0.83 but LB only 0.75, a gap of 0.08 (8 percentage points). Root causes:
- 891 rows cannot sustain 5-model comparison + 40 parameter searches + 4-model Soft Voting
- CabinLetter has a 77% missing rate and is too noisy
- AgeBin/FareBin binning loses information
- Pclass_Sex is redundant with Pclass + Sex
Simplification strategy: less is more. On small datasets, fewer features, a simpler model, and conservative parameters give more stable results.
Result Analysis
- CV: ~0.81 (more realistic than the inflated 0.83 of the multi-model version)
- LB: 0.77990
- CV-LB gap: ~0.03, about 3 percentage points (reasonable, public test only ~200 rows)
- Feature importance ranking: Sex > Title > FareLog > Pclass > Age > FamilySize > IsAlone > Embarked
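A CV score and an importance ranking like the ones above could be produced along these lines. This is a sketch under assumptions: only `max_depth`, `min_samples_split`, and `min_samples_leaf` come from these notes, while `n_estimators`, the 5-fold stratified split, the seed, and the helper names `make_model` and `cv_and_importance` are illustrative. It also assumes the categorical columns (Sex, Title, Embarked) have already been encoded numerically. The demo data is synthetic and says nothing about Titanic scores.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def make_model(seed: int = 0) -> RandomForestClassifier:
    # Depth and leaf/split limits are from the notes; the rest is assumed.
    return RandomForestClassifier(
        n_estimators=200,
        max_depth=5,
        min_samples_split=10,
        min_samples_leaf=5,
        random_state=seed,
    )


def cv_and_importance(X: pd.DataFrame, y: pd.Series, seed: int = 0):
    """Return (mean CV accuracy, feature importances sorted descending)."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(make_model(seed), X, y, cv=cv, scoring="accuracy")

    # Refit on all rows to read importances from a single fitted forest.
    model = make_model(seed).fit(X, y)
    importance = pd.Series(model.feature_importances_, index=X.columns)
    return scores.mean(), importance.sort_values(ascending=False)


# Synthetic demo: one informative column and one pure-noise column.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
y_demo = (X_demo["signal"] > 0).astype(int)
mean_acc, ranking = cv_and_importance(X_demo, y_demo)
print(f"mean CV accuracy: {mean_acc:.3f}")
print(ranking)
```

Impurity-based importances from a random forest can favor high-cardinality or continuous features, so a ranking like this is best read as a rough guide.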
Problem Diagnosis
LB stalled at 0.78 without reaching 0.80, which suggests patterns in the test set that the model does not capture:
1. Whether Cabin is missing is itself a strong signal for First vs Third Class
2. Title=Master (boys) has much higher survival than Mr, but the two were mixed together
3. The Dr group's survival pattern differs from that of other Rare titles
4. Fare=0 outliers exist in test (crew / free tickets)
5. RF parameters are too conservative (max_depth=5), so complex interactions in test are not learned
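Several of these hypotheses can be checked on the training set with a group-wise survival table. The sketch below is one possible way to do that; the helper `survival_by` is hypothetical and the rows in `toy` are made up purely to show the mechanics, so the resulting rates are not evidence for or against any of the points above.

```python
import pandas as pd


def survival_by(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Survival rate and row count per value of `key`, highest rate first."""
    return (
        df.groupby(key)["Survived"]
        .agg(rate="mean", n="count")
        .sort_values("rate", ascending=False)
    )


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Master", "Master", "Dr", "Rare"],
    "Cabin": [None, "C85", None, "E46", None, "B20"],
    "Fare": [0.0, 71.28, 21.07, 30.0, 26.0, 50.0],
    "Survived": [0, 0, 1, 1, 0, 1],
})

by_title = survival_by(toy, "Title")
by_cabin = survival_by(toy.assign(HasCabin=toy["Cabin"].notna()), "HasCabin")
zero_fare_rows = int((toy["Fare"] == 0).sum())
```

Small groups such as Dr contain only a handful of passengers, so the `n` column matters as much as the rate when reading the table.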
Next Version (v3)
- Add HasCabin feature
- Keep Dr as a separate class in Title grouping
- Slightly relax RF parameters
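The first two v3 items could look roughly like this. It is a sketch only: `add_v3_features` is a hypothetical name, the alias mapping is assumed, and the relaxed RandomForest parameters are left out because these notes do not give target values.

```python
import pandas as pd

# Dr is kept as its own class instead of being folded into Rare.
V3_TITLES = {"Mr", "Mrs", "Miss", "Master", "Dr"}
TITLE_ALIASES = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"}  # assumed merges


def add_v3_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with HasCabin and a Title column that keeps Dr."""
    out = df.copy()

    # HasCabin: 1 if a Cabin value is recorded, 0 if it is missing.
    out["HasCabin"] = out["Cabin"].notna().astype(int)

    title = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    title = title.replace(TITLE_ALIASES)
    out["Title"] = title.where(title.isin(V3_TITLES), "Rare")
    return out


# Toy rows for illustration only.
demo_v3 = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Minahan, Dr. William Edward",
        "Rothes, the Countess. of",
    ],
    "Cabin": [None, "C78", "B77"],
})
result_v3 = add_v3_features(demo_v3)
```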