v2 Simplified (Single-Model RandomForest)

Kaggle Notebook

Score

  • LB: 0.77990
  • ~0.01 improvement over v1 baseline (~0.77)

Changes

  1. Feature engineering expansion:
       • Extract Title from Name (Mr/Mrs/Miss/Master/Rare)
       • FamilySize = SibSp + Parch + 1
       • IsAlone = (FamilySize == 1)
       • FareLog = log1p(Fare)

  2. Missing-value strategy:
       • Age: filled with group median by Pclass + Sex
       • Fare: filled with global median
       • Embarked: filled with mode

  3. Model simplification:
       • Removed XGB/LGBM/LR/GBDT multi-model comparison
       • Removed RandomizedSearchCV hyperparameter search
       • Single-model RandomForest with conservative parameters: max_depth=5, min_samples_split=10, min_samples_leaf=5

  4. Feature pruning:
       • Reduced from 14 to 8 features
       • Dropped: CabinLetter, TicketPrefix, AgeBin, FareBin, Pclass_Sex
       • Kept: Pclass, Sex, Age, FareLog, FamilySize, IsAlone, Title, Embarked
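The feature engineering and missing-value steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `engineer_features`, the regular expression used to pull the title out of Name, and the rule that every title other than Mr/Mrs/Miss/Master becomes Rare are assumptions. The notes also do not say whether the medians and mode are computed on train only or on train and test combined; here they are computed on whatever frame is passed in.

```python
import numpy as np
import pandas as pd

# Assumption: any title outside this set is grouped into "Rare".
COMMON_TITLES = {"Mr", "Mrs", "Miss", "Master"}


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that applies the v2 feature and imputation steps."""
    out = df.copy()

    # Title: the token between the comma and the first period in Name,
    # e.g. "Braund, Mr. Owen Harris" -> "Mr".
    title = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    out["Title"] = title.where(title.isin(COMMON_TITLES), "Rare")

    # Family features.
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    out["IsAlone"] = (out["FamilySize"] == 1).astype(int)

    # Missing values: Age by (Pclass, Sex) group median, Fare by global
    # median, Embarked by mode.
    group_median = out.groupby(["Pclass", "Sex"])["Age"].transform("median")
    out["Age"] = out["Age"].fillna(group_median)
    out["Fare"] = out["Fare"].fillna(out["Fare"].median())
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])

    # Log-transform Fare after imputation so FareLog has no gaps.
    out["FareLog"] = np.log1p(out["Fare"])
    return out
```

Sex, Title, and Embarked are still strings after this step and would need to be encoded numerically before being passed to a RandomForest.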

Rationale

The v1 advanced version (multi-model ensemble) overfit severely: CV 0.83 but LB only 0.75, a gap of 0.08 (8 percentage points). Root causes:

  • 891 training rows cannot sustain a 5-model comparison + 40 parameter searches + 4-model Soft Voting
  • CabinLetter has a 77% missing rate, which makes it too noisy
  • AgeBin/FareBin binning loses information
  • Pclass_Sex is redundant with Pclass + Sex

Simplification strategy: less is more. On small datasets, fewer features, a simpler model, and conservative parameters give more stable results.
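As a rough sketch of what "simpler model + conservative parameters" could look like with scikit-learn: the three tree parameters come from the Changes list, while `n_estimators`, `random_state`, the 5-fold stratified split, and the helper names `build_model` and `cv_accuracy` are assumptions that the notes do not specify.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def build_model(random_state: int = 42) -> RandomForestClassifier:
    """Single RandomForest with the conservative parameters from the notes."""
    return RandomForestClassifier(
        n_estimators=200,        # assumption: not stated in the notes
        max_depth=5,             # shallow trees limit overfitting
        min_samples_split=10,
        min_samples_leaf=5,
        random_state=random_state,
    )


def cv_accuracy(X, y, n_splits: int = 5) -> float:
    """Mean stratified k-fold accuracy; X must already be numeric."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = cross_val_score(build_model(), X, y, cv=cv, scoring="accuracy")
    return float(scores.mean())
```

With no hyperparameter search and no model selection loop, the CV score is computed once for a fixed configuration, which is one reason it tends to be less optimistic than a score picked as the best of many candidates.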

Result Analysis

  • CV: ~0.81 (more realistic than the inflated 0.83 of the multi-model version)
  • LB: 0.77990
  • CV-LB gap: ~0.03, about 3 percentage points (reasonable, public test only ~200 rows)
  • Feature importance ranking: Sex > Title > FareLog > Pclass > Age > FamilySize > IsAlone > Embarked
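A ranking like the one above can be read off a fitted forest's `feature_importances_` attribute. The sketch below uses synthetic stand-in data rather than the Titanic features, and `rank_features` is a hypothetical helper name. Note that these are impurity-based importances, which can favor continuous or high-cardinality features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def rank_features(model, feature_names) -> pd.Series:
    """Return impurity-based importances sorted from most to least important."""
    importances = pd.Series(model.feature_importances_, index=list(feature_names))
    return importances.sort_values(ascending=False)


# Synthetic stand-in data (NOT the Titanic data): only column "a" drives the label.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y_demo = (X_demo["a"] > 0).astype(int)

demo_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
demo_model.fit(X_demo, y_demo)

ranking = rank_features(demo_model, X_demo.columns)
print(ranking)
```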

Problem Diagnosis

LB stalled at 0.78 without reaching 0.80, indicating unseen patterns in test:

  1. Cabin missing itself is a strong signal for First vs Third Class
  2. Title=Master (boys) has much higher survival than Mr but was mixed together
  3. Dr group's survival pattern differs from other Rare titles
  4. Fare=0 outliers exist in test (crew / free tickets)
  5. RF parameters too conservative (max_depth=5), complex interactions in test not learned

Next Version (v3)

  1. Add HasCabin feature
  2. Keep Dr as a separate class in Title grouping
  3. Slightly relax RF parameters
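The first two v3 items could look roughly like this. It is a sketch under assumptions: the helper name `add_v3_features` and the title regular expression are illustrative, and the notes do not say how titles other than Dr will be regrouped, so everything outside Mr/Mrs/Miss/Master/Dr stays in Rare here. The relaxed RF parameter values are not given in the notes and are left out.

```python
import pandas as pd

# Assumption: Dr joins the common titles; everything else remains "Rare".
KEPT_TITLES = {"Mr", "Mrs", "Miss", "Master", "Dr"}


def add_v3_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper for the planned v3 features."""
    out = df.copy()

    # HasCabin: 1 if a Cabin value is recorded, 0 if it is missing.
    out["HasCabin"] = out["Cabin"].notna().astype(int)

    # Title grouping with Dr kept as its own class.
    title = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    out["Title"] = title.where(title.isin(KEPT_TITLES), "Rare")
    return out
```

HasCabin uses only the presence of a Cabin value, so it avoids the sparsity problem that made CabinLetter too noisy.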