2026-04-25
Session 37: Native setup() phase 3: remove_outliers + feature_selection
Engineering log for session 37.
The last two preprocessing flags on the supervised Experiment constructor go native. After this session, every constructor flag works without legacy.setup() — only caller-supplied setup_kwargs and unsupervised / TS tasks still hit the legacy path.
CHANGED: engine
- CHANGED: `Experiment._native_setup_supervised` now applies:
  - Outlier removal (`remove_outliers=True`) via `sklearn.ensemble.IsolationForest(contamination=0.05, random_state=session_id, n_jobs=self.n_jobs)`. Fit on `X_train_transformed`; rows where `predict() == -1` are dropped from `X_train`, `y_train`, `X_train_transformed`, and `y_train_transformed`, and the union views are recomputed. The test set is untouched. Mirrors legacy's `outliers_threshold=0.05`.
  - Feature selection (`feature_selection=True`) via `sklearn.feature_selection.SelectFromModel(estimator, threshold="median")` with `ExtraTreesClassifier` (clf) or `ExtraTreesRegressor` (reg) at 100 estimators. Selected features are kept on `X_train_transformed` / `X_test_transformed` / `X_transformed`; raw `X_train` / `X_test` / `X` keep all columns so user-facing accessors don't lose information.
  - Selector appended to the preprocess pipeline so predict-time preprocessing reapplies the column drop. Order: `("preprocess", ColumnTransformer)` → `("feature_selection", SelectFromModel)`.
  - Defensive empty-selection guard: if `SelectFromModel` picks zero features (rare but possible on tiny / pathological datasets), keep the first column to avoid an empty matrix downstream.
- CHANGED: `_can_use_native_setup` predicate finalised. Every constructor preprocessing flag now routes native. Only `setup_kwargs` and non-supervised tasks force legacy.
- ADDED: Three new `_fit_state` slots: `feature_selector` (the fitted `SelectFromModel` or `None`), `selected_features` (list of kept column names or `None`), `outliers_dropped` (count, int).
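The two native steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the engine's code: the helper names `drop_outliers` and `select_features`, the seed, and the toy matrix are assumptions, and how the engine wires the empty-selection guard into the fitted selector is not stated in this log.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, IsolationForest
from sklearn.feature_selection import SelectFromModel


def drop_outliers(X_train, y_train, seed=0, n_jobs=None):
    """Hypothetical helper: fit IsolationForest on the (already numeric)
    train matrix and drop the rows it flags with -1."""
    detector = IsolationForest(contamination=0.05, random_state=seed, n_jobs=n_jobs)
    keep = detector.fit_predict(X_train) != -1  # -1 marks an outlier
    return X_train[keep], y_train[keep], int((~keep).sum())


def select_features(X_train, y_train, seed=0):
    """Hypothetical helper: median-threshold SelectFromModel over ExtraTrees,
    with a fallback to the first column if nothing is selected."""
    estimator = ExtraTreesClassifier(n_estimators=100, random_state=seed)
    selector = SelectFromModel(estimator, threshold="median").fit(X_train, y_train)
    mask = selector.get_support()
    if not mask.any():  # defensive guard: never return an empty matrix
        mask = np.zeros(X_train.shape[1], dtype=bool)
        mask[0] = True
    return selector, mask


# Synthetic stand-in for X_train_transformed / y_train (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Outliers first, then feature selection on the smaller train set.
X_clean, y_clean, n_dropped = drop_outliers(X, y)
selector, mask = select_features(X_clean, y_clean)
X_selected = X_clean[:, mask]
```

Only the train rows are filtered here; a test split would be passed through `mask` for column selection but never through the outlier detector, matching the "test set is untouched" behaviour described above.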
ADDED: tests
- ADDED: `packages/engine/tests/test_session37_native_setup_phase3.py`, 8 tests:
  - Drain-lock for `remove_outliers` (poison `legacy.setup` + verify native).
  - Drain-lock for `feature_selection`.
  - Outlier-drop count is in the expected range (30-50 on the juice dataset).
  - Test set untouched.
  - Feature selector trims columns + raw splits keep all 18.
  - Combined flags compose: outliers dropped first, then features selected on the smaller train.
  - End-to-end `create + tune + predict` chain with all 4 phase-1/2/3 flags on.
  - Regression `feature_selection` uses `ExtraTreesRegressor`.
  - `y_train` alignment after row drops.
  - Predicate test for the phase-3 contract.
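A drain-lock test of the kind listed above can be sketched like this. Everything in the block is a hypothetical stand-in (the `legacy` namespace, `can_use_native_setup`, `run_setup`, and the `custom_option` kwarg); it only illustrates the "poison the legacy entry point, then verify the native path runs" pattern and does not reproduce the engine's real API.

```python
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-in for the legacy module.
legacy = SimpleNamespace(setup=lambda **kwargs: "legacy")


def can_use_native_setup(setup_kwargs=None, task="classification"):
    """Hypothetical predicate: only setup_kwargs or a non-supervised task
    forces the legacy path."""
    return not setup_kwargs and task in {"classification", "regression"}


def run_setup(remove_outliers=False, feature_selection=False,
              setup_kwargs=None, task="classification"):
    """Hypothetical dispatcher: preprocessing flags never affect routing."""
    if can_use_native_setup(setup_kwargs, task):
        return "native"
    return legacy.setup(**(setup_kwargs or {}))


def poisoned_legacy():
    """Replace legacy.setup with a mock that raises if it is ever called."""
    return mock.patch.object(
        legacy, "setup", side_effect=AssertionError("legacy.setup was called")
    )


def test_remove_outliers_drains_to_native():
    with poisoned_legacy():
        assert run_setup(remove_outliers=True) == "native"


def test_setup_kwargs_still_hits_legacy():
    with poisoned_legacy() as poisoned:
        try:
            run_setup(setup_kwargs={"custom_option": 1})
        except AssertionError:
            pass  # expected: the poisoned legacy entry point was reached
        assert poisoned.call_count == 1
```

Poisoning the legacy entry point makes a silent fallback fail loudly, which is stronger than asserting on the native result alone.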
CHANGED: existing tests
- CHANGED: `test_session35_native_setup` and `test_session36_native_setup_phase2` had assertions that flags forced legacy. Updated to reflect the "all flags native" reality:
  - `test_complex_preprocessing_falls_back_to_legacy` switched from `normalize=True` to `setup_kwargs` as the canonical "still legacy" trigger.
  - `test_remove_outliers_still_falls_back_to_legacy` renamed to `test_remove_outliers_now_native_post_phase3` with inverted assertion.
  - Predicate tests collapsed; flag-by-flag tests live in their session-specific files.
INTERNAL
- INTERNAL: Why outlier removal happens after preprocessing. `IsolationForest` works on numeric input. Running it on raw `X_train` would fail on the categorical columns. Running it on `X_train_transformed` (post-impute, post-encode) is straightforward and matches legacy semantics, where the outlier detector saw imputed numerics plus ordinal codes for categoricals.
- INTERNAL: Why feature selection runs after outlier removal. Outliers can heavily distort tree-based feature importance. Removing them first means the ExtraTrees estimator picks features based on the cleaned distribution. Same order as legacy.
- INTERNAL: Selector appended to `preprocess_pipeline`, not stored separately. `predict_model` runs `pipeline.predict(X)`, which transforms via every step. Without appending the selector, the saved Pipeline would expect 18 features but the post-transform output would have 9 → `ValueError`. Appending makes the Pipeline self-contained: `joblib.dump` the result, load it anywhere, and `predict()` works.
- INTERNAL: `SelectFromModel(threshold="median")` uses the median feature importance as the cut-off, so it keeps features whose importance is at or above the median (roughly the top half). Legacy default was `feature_selection_method="classic"` with the same median rule under the hood; same default, simpler code path.
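The self-contained pipeline argument can be illustrated with a small sketch. The step names `"preprocess"` and `"feature_selection"` follow the order given in this log; the `"model"` step, the `StandardScaler` preprocessing, the `LogisticRegression` estimator, and the synthetic data are assumptions chosen only to make the example runnable.

```python
import io

import joblib
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 8-column dataset (assumption, stands in for the real features).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
y = (X[:, 0] - X[:, 3] > 0).astype(int)

preprocess = ColumnTransformer([("num", StandardScaler(), list(range(8)))])
selector = SelectFromModel(
    ExtraTreesClassifier(n_estimators=100, random_state=0), threshold="median"
)

# The selector sits between preprocessing and the estimator, so the column
# drop is replayed automatically whenever the pipeline predicts.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("feature_selection", selector),
    ("model", LogisticRegression()),
])
pipeline.fit(X, y)

# Round-trip through joblib (in memory here) to show the saved object
# needs nothing outside itself at predict time.
buffer = io.BytesIO()
joblib.dump(pipeline, buffer)
buffer.seek(0)
restored = joblib.load(buffer)

n_selected = int(restored.named_steps["feature_selection"].get_support().sum())
preds = restored.predict(X[:5])  # raw 8-column input, no manual column drop
```

If the selector were kept outside the pipeline, the final estimator would have been trained on `n_selected` columns while predict-time input still carried all eight, which is the shape mismatch described above.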
Session 37 delta summary
| Metric | Session 36 end | Session 37 end |
|---|---|---|
| Drainable preprocessing flags handled natively | + StandardScaler + PowerTransformer | + IsolationForest + SelectFromModel |
| Engine tests (fast + slow) | 165 | 173 |
| Combined tests | 311 | 319 |