v3 Optimized (HasCabin + Title Split + Relaxed RF)
Kaggle Notebook
Score
- LB: 0.78947
- ~0.01 improvement over v2 simplified (0.77990)
Changes
- Added HasCabin feature:
  - A missing Cabin value is not random; having a Cabin record largely corresponds to First Class passengers, who survive at a significantly higher rate
- Title grouping keeps Dr as a separate class:
  - Removed Dr from Rare; doctors/scholars have a different survival pattern from the other Rare titles (Capt/Rev/Major, etc.)
  - Master (boys) was already kept separate; its survival rate is much higher than that of adult males
- RF parameters relaxed:
  - max_depth: 5 → 6
  - min_samples_split: 10 → 5
  - min_samples_leaf: 5 → 2
  - n_estimators: 200 → 300
  - Moderately relaxed constraints to capture more complex interactions in test
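The feature and parameter changes above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: it assumes the standard Titanic columns `Name` and `Cabin`, the helper name `add_v3_features` is hypothetical, and both `random_state=42` and the set of titles kept outside Rare (other than Master and Dr, which the notes confirm) are assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Titles kept as their own class. Master and Dr come from the notes;
# Mr/Mrs/Miss are an assumption about the rest of the grouping.
KEPT_TITLES = {"Mr", "Mrs", "Miss", "Master", "Dr"}


def add_v3_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add HasCabin and a grouped Title column."""
    out = df.copy()
    # 1 if a Cabin value is recorded, 0 if it is missing
    out["HasCabin"] = out["Cabin"].notna().astype(int)
    # Names look like "Braund, Mr. Owen Harris"; take the text between ", " and "."
    title = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    # Everything outside the kept set (Capt, Rev, Major, ...) becomes Rare
    out["Title"] = title.where(title.isin(KEPT_TITLES), "Rare")
    return out


# Relaxed v3 parameters as listed in the notes (random_state is an assumption)
rf_v3 = RandomForestClassifier(
    n_estimators=300,
    max_depth=6,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
)

# Tiny hand-made frame, only to show the helper's behaviour
demo = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Minahan, Dr. William Edward",
            "Byles, Rev. Thomas Roussel Davids",
            "Rice, Master. Eugene",
        ],
        "Cabin": [None, "C78", None, None],
    }
)
demo_v3 = add_v3_features(demo)
```

Here `demo_v3["Title"]` keeps Dr and Master as their own classes while Rev falls into Rare, which mirrors the grouping described above.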
Rationale
v2 simplified stalled at 0.78 without reaching 0.80. Three optimization directions were tried simultaneously:
1. Cabin information itself is a strong discriminative signal for First vs Third Class
2. The Dr group's survival pattern differs from the other Rare titles and should not be mixed with them
3. max_depth=5 was too conservative; some complex interactions in test (e.g., Pclass + Title + Fare) were not learned
Result Analysis
- CV: ~0.82 (slight improvement after parameter relaxation)
- LB: 0.78947 (+0.00957 vs v2)
- Improvement: Effective but limited, still below the 0.80 threshold
- Feature importance: Sex > Title > FareLog > HasCabin > Pclass > Age > FamilySize > IsAlone > Embarked
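One way to obtain a CV score and an importance ranking like the ones above is sketched below. It is an assumption about the workflow rather than the notebook's code: the helper name `cv_and_importance` is hypothetical, 5-fold accuracy is an assumed CV setup, and the data is synthetic (the label simply copies `Sex`), so the printed ranking is illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def cv_and_importance(X: pd.DataFrame, y: pd.Series, model, cv: int = 5):
    """Hypothetical helper: mean CV accuracy plus sorted feature importances."""
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    # Refit on all rows so feature_importances_ reflects the full training set
    model.fit(X, y)
    importance = pd.Series(model.feature_importances_, index=X.columns)
    return scores.mean(), importance.sort_values(ascending=False)


# Synthetic stand-in data, NOT the Titanic set: the label is a copy of Sex
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    {
        "Sex": rng.integers(0, 2, 200),
        "HasCabin": rng.integers(0, 2, 200),
        "Noise": rng.normal(size=200),
    }
)
y_demo = X_demo["Sex"]

model = RandomForestClassifier(
    n_estimators=300,
    max_depth=6,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,  # assumption, for reproducibility
)
mean_acc, ranking = cv_and_importance(X_demo, y_demo, model)
print(ranking)
```

Impurity-based importances from a random forest tend to favour high-cardinality numeric features such as FareLog and Age, so the ordering is best read as a rough guide rather than a strict ranking.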
Problem Diagnosis
CV and LB still have a gap (~0.82 vs 0.78947), indicating unseen patterns in test:
1. Fare=0 outliers: passengers with Fare=0 in test may be crew or free-ticket family members, which conflicts with the FareLog handling
2. Master boys group: boys with Title=Master have very high survival in train, but a different distribution in test
3. Single model: only RandomForest was tested; XGB/GBDT/LR performance on the same feature set has not been verified
4. Pclass + HasCabin interaction: First Class passengers tend to have a Cabin record, but HasCabin as an independent feature may be insufficient
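Points 1 and 2 can be checked by comparing how often each subgroup appears in train and in test. The sketch below assumes both frames already carry `Fare` and `Title` columns; the helper name `subgroup_shares` is hypothetical and the two small frames are made-up values, not Titanic rows.

```python
import pandas as pd


def subgroup_shares(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: share of Fare=0 and Title=Master rows per split."""
    rows = {}
    for name, df in (("train", train), ("test", test)):
        rows[name] = {
            # A missing Fare compares as False, so it is not counted as zero
            "fare_zero_share": (df["Fare"] == 0).mean(),
            "master_share": (df["Title"] == "Master").mean(),
        }
    # One row per split, one column per subgroup
    return pd.DataFrame(rows).T


# Made-up frames, only to show the shape of the comparison
train_demo = pd.DataFrame(
    {"Fare": [0.0, 7.25, 71.28, 8.05], "Title": ["Mr", "Mr", "Mrs", "Master"]}
)
test_demo = pd.DataFrame(
    {"Fare": [0.0, 0.0, 13.0, 26.0], "Title": ["Master", "Master", "Mr", "Miss"]}
)
shares = subgroup_shares(train_demo, test_demo)
print(shares)
```

A large difference between the train and test rows of this table would support the diagnosis; similar shares would suggest the CV/LB gap comes from elsewhere.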
Next Version (v4)
- Add Fare=0 flag: mark Fare=0 passengers separately
- Add Title=Master weight: test Master as an independent strong signal
- Model swap: GradientBoosting / XGBoost may be more stable on partial data than RF
- Pclass * HasCabin interaction feature: First Class with Cabin vs Third Class without is inherently a high-order signal
- Ticket prefix simplification: group ticket prefixes by letter; they may contain deck/area information
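The feature ideas in this list could be prototyped roughly as below. Everything here is a sketch under assumptions: the helper name `add_v4_candidates` and the column names `FareZero`, `Pclass_HasCabin`, and `TicketPrefix` are hypothetical, "group by letter" is read as "first letter of the ticket string", and purely numeric tickets are placed in an assumed `NUM` bucket.

```python
import pandas as pd


def add_v4_candidates(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: candidate v4 features (expects a HasCabin column)."""
    out = df.copy()
    # Flag Fare=0 passengers separately from the log-transformed fare
    out["FareZero"] = (out["Fare"] == 0).astype(int)
    # Literal Pclass * HasCabin product: 0 without a Cabin record, else the class
    out["Pclass_HasCabin"] = out["Pclass"] * out["HasCabin"]
    # First letter of the ticket; tickets that start with a digit get "NUM"
    prefix = out["Ticket"].astype(str).str.extract(r"^\s*([A-Za-z])", expand=False)
    out["TicketPrefix"] = prefix.str.upper().fillna("NUM")
    return out


# Made-up rows in the Titanic column layout, only to show the output
demo_v4 = add_v4_candidates(
    pd.DataFrame(
        {
            "Fare": [7.25, 71.2833, 0.0, 7.925],
            "Pclass": [3, 1, 1, 3],
            "HasCabin": [0, 1, 1, 0],
            "Ticket": ["A/5 21171", "PC 17599", "113803", "STON/O2. 3101282"],
        }
    )
)
print(demo_v4[["FareZero", "Pclass_HasCabin", "TicketPrefix"]])
```

The plain product collapses every passenger without a Cabin record into 0 regardless of class; if that loses too much, a combined categorical label (class and cabin flag joined into one category) is an alternative worth comparing in CV.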