v3 Optimized (HasCabin + Title Split + Relaxed RF)

Kaggle Notebook

Score

  • LB: 0.78947
  • ~0.01 improvement over v2 simplified (0.77990)

Changes

  1. Added HasCabin feature:
    full['HasCabin'] = full['Cabin'].notna().astype(int)

     • Missing Cabin values are not random: passengers with a Cabin record are mostly First Class, a group with significantly higher survival.

  2. Title grouping keeps Dr as a separate class:

     • Removed Dr from Rare; doctors/scholars show a different survival pattern from the other Rare titles (Capt, Rev, Major, etc.).
     • Master (boys) was already kept as its own class; its survival rate is much higher than that of adult males.

  3. RF parameters relaxed:

     • max_depth: 5 → 6
     • min_samples_split: 10 → 5
     • min_samples_leaf: 5 → 2
     • n_estimators: 200 → 300
     • Constraints were relaxed moderately to capture more complex interactions that may matter in test.
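The three changes above can be sketched together as follows. This is a minimal illustration, not the notebook's actual code: the sample rows are made up in the Kaggle Name format, the regex for extracting titles and the set of kept titles (`KEPT_TITLES`) are assumptions, and `random_state` is added only for reproducibility. The four RandomForest parameters are the v3 values listed above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Illustrative stand-in for the combined train+test frame `full` (not real notebook data).
full = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
        "Palsson, Master. Gosta Leonard",
        "Minahan, Dr. William Edward",
        "Crosby, Capt. Edward Gifford",
    ],
    "Cabin": [None, "C85", None, None, "C78", "B22"],
})

# Change 1: binary flag for whether a Cabin value was recorded.
full["HasCabin"] = full["Cabin"].notna().astype(int)

# Change 2: extract the title between the comma and the first period,
# then keep Dr (and Master) as their own classes; everything else becomes Rare.
# Assumption: this exact set of kept titles is hypothetical.
raw_title = full["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
KEPT_TITLES = {"Mr", "Mrs", "Miss", "Master", "Dr"}
full["Title"] = raw_title.where(raw_title.isin(KEPT_TITLES), "Rare")

# Change 3: relaxed RandomForest settings from v3.
# Assumption: random_state is not stated in the notes.
rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=6,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
)

print(full[["Name", "Title", "HasCabin"]])
```

Using `where` with a fallback value keeps the grouping rule in one place, so moving a title in or out of Rare only requires editing `KEPT_TITLES`.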

Rationale

v2 simplified stalled at 0.78 without reaching 0.80. Three optimization directions were tried simultaneously:

  1. Cabin information itself is a strong discriminative signal for First vs Third Class.
  2. The Dr group's survival pattern differs from the other Rare titles, so they should not be mixed.
  3. max_depth=5 was too conservative; some complex interactions in test (e.g., Pclass + Title + Fare) were not learned.

Result Analysis

  • CV: ~0.82 (slight improvement after parameter relaxation)
  • LB: 0.78947 (+0.00957 vs v2)
  • Improvement: Effective but limited, still below the 0.80 threshold
  • Feature importance: Sex > Title > FareLog > HasCabin > Pclass > Age > FamilySize > IsAlone > Embarked
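A ranking like the one above is typically read from the fitted forest's `feature_importances_` attribute. The sketch below shows one way to produce it; the data is random and synthetic (the target is built from the Sex column purely so the example runs), so the printed order is not the notebook's result. Only the feature names and the RF parameters come from these notes.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["Sex", "Title", "FareLog", "HasCabin", "Pclass",
            "Age", "FamilySize", "IsAlone", "Embarked"]

# Synthetic placeholder data: NOT the Titanic data, only here to make the sketch runnable.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((200, len(FEATURES))), columns=FEATURES)
y = (X["Sex"] + 0.3 * rng.random(200) > 0.6).astype(int)

rf = RandomForestClassifier(
    n_estimators=300, max_depth=6,
    min_samples_split=5, min_samples_leaf=2,
    random_state=0,
).fit(X, y)

# Impurity-based importances, normalised to sum to 1, sorted from most to least important.
importance = pd.Series(rf.feature_importances_, index=FEATURES).sort_values(ascending=False)
print(importance)
```

Impurity-based importances tend to favour features with many distinct values, so the ranking is best treated as a rough guide rather than a strict ordering.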

Problem Diagnosis

CV and LB still show a gap (~0.82 vs 0.78947), indicating unseen patterns in test:

  1. Fare=0 outliers: passengers with Fare=0 in test may be crew or free-ticket family members, which conflicts with the FareLog handling.
  2. Master boys group: passengers with Title=Master have very high survival in train, but a different distribution in test.
  3. Single model: only RandomForest was tested; XGB/GBDT/LR performance on the same feature set has not been verified.
  4. Pclass + HasCabin interaction: First Class passengers usually have a Cabin record, so HasCabin as an independent feature may be insufficient.
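For the single-model point, one way to check other model families on the same feature set is to run them through identical cross-validation folds. The sketch below assumes scikit-learn only (XGBoost is left out to avoid an extra dependency) and uses `make_classification` as a stand-in for the real feature matrix; the model settings other than the v3 RF parameters are assumptions, and the printed scores say nothing about Titanic performance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with 9 features, matching the size of the v3 feature set.
X, y = make_classification(n_samples=300, n_features=9, n_informative=5, random_state=0)

models = {
    "RandomForest": RandomForestClassifier(
        n_estimators=300, max_depth=6,
        min_samples_split=5, min_samples_leaf=2, random_state=0,
    ),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Same shuffled, stratified folds for every model so the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in cv_scores.items():
    print(f"{name}: {score:.4f}")
```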

Next Version (v4)

  1. Add Fare=0 flag: mark Fare=0 passengers separately
  2. Add Title=Master weight: test Master as an independent strong signal
  3. Model swap: GradientBoosting / XGBoost may be more stable on partial data than RF
  4. Pclass * HasCabin interaction feature: First Class with Cabin vs Third Class without is inherently a high-order signal
  5. Ticket prefix simplification: group ticket prefixes by letter, may contain deck/area information
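The feature ideas in items 1, 4, and 5 could look roughly like this. It is a sketch under assumptions: the column names `FareZero`, `Pclass_HasCabin`, and `TicketPrefix` and the helper `ticket_prefix_letter` are hypothetical, "group by letter" is interpreted here as the first letter of the ticket's leading token, and the rows are illustrative values in the Kaggle column format.

```python
import pandas as pd

# Illustrative rows only; not taken from the notebook.
full = pd.DataFrame({
    "Fare": [0.0, 7.25, 71.2833, 8.05],
    "Pclass": [1, 3, 1, 3],
    "Cabin": ["B102", None, "C85", None],
    "Ticket": ["112059", "A/5 21171", "PC 17599", "STON/O2. 3101282"],
})

# Item 1: flag zero fares explicitly, since log1p(0) = 0 is otherwise
# indistinguishable from "a very cheap ticket" in FareLog.
full["FareZero"] = (full["Fare"] == 0).astype(int)

# Item 4: explicit Pclass x HasCabin interaction as one categorical key.
full["HasCabin"] = full["Cabin"].notna().astype(int)
full["Pclass_HasCabin"] = full["Pclass"].astype(str) + "_" + full["HasCabin"].astype(str)

# Item 5: reduce the ticket prefix to its first letter; purely numeric tickets get "NONE".
def ticket_prefix_letter(ticket: str) -> str:
    head = ticket.split()[0]
    return head[0].upper() if head[0].isalpha() else "NONE"

full["TicketPrefix"] = full["Ticket"].map(ticket_prefix_letter)

print(full[["FareZero", "Pclass_HasCabin", "TicketPrefix"]])
```

A string key such as `Pclass_HasCabin` still needs encoding (for example one-hot) before it goes into the model; a tree ensemble can also approximate this interaction from the two raw columns, so the explicit feature is worth validating by CV before keeping it.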