2026-04-25
Session 36: Native setup() phase 2: normalize + transformation
Engineering log for session 36.
Baseline: session 35 shipped phase 1 of the setup() drain (simple supervised). Session 36 extends the native preprocessing chain with two more legacy options: normalize (StandardScaler) and transformation (PowerTransformer). Both can be combined.
CHANGED — engine
CHANGED: `Experiment._native_setup_supervised` numeric branch promoted from a single `SimpleImputer` to a sklearn `Pipeline` with optional steps wired by the experiment's flags:
- `imputer` → `SimpleImputer(strategy="mean")` (always)
- `transformer` → `PowerTransformer(method="yeo-johnson", standardize=False)` (if `transformation=True`)
- `scaler` → `StandardScaler()` (if `normalize=True`)

Order: impute → transform → scale. Yeo-Johnson accepts zero and negative inputs, so the scaler always receives finite values and produces ~zero mean / unit std on the train-fold output.
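The flag-driven wiring described above can be sketched roughly as follows. This is an illustration under assumptions, not the engine's code: `build_numeric_pipeline` is a hypothetical helper name, the data is synthetic, and only the step names and estimator arguments come from the entry above.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler


def build_numeric_pipeline(normalize: bool = False, transformation: bool = False) -> Pipeline:
    """Hypothetical helper: assemble the numeric steps from the two flags."""
    steps = [("imputer", SimpleImputer(strategy="mean"))]  # always present
    if transformation:
        # standardize=False: z-scoring is left to the separate scaler step
        steps.append(
            ("transformer", PowerTransformer(method="yeo-johnson", standardize=False))
        )
    if normalize:
        steps.append(("scaler", StandardScaler()))
    return Pipeline(steps)


# Synthetic skewed data with negatives and one missing value (illustrative only).
rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(200, 2)) - 1.0
X[0, 0] = np.nan

pipe = build_numeric_pipeline(normalize=True, transformation=True)
Xt = pipe.fit_transform(X)

step_names = [name for name, _ in pipe.steps]
col_means = Xt.mean(axis=0)
col_stds = Xt.std(axis=0)
```

With both flags set, `step_names` lists the imputer, transformer, and scaler in that order, and the fitted output columns are centred with unit standard deviation on the data the pipeline was fitted on.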
CHANGED: `Experiment._can_use_native_setup` predicate. `normalize` and `transformation` no longer force legacy. `remove_outliers`, `feature_selection`, `setup_kwargs`, and non-supervised tasks still do.

CHANGED: `numerical_imputer` step renamed to `numerical_pipeline` in the `ColumnTransformer`. The single-imputer case still works (the helper unwraps the pipeline when only one step is present), but the canonical name is now `numerical_pipeline` to mirror `categorical_pipeline`.
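The routing rule for the predicate can be illustrated with a small stand-in. The function name, its signature, and the task names below are assumptions made for the sketch; the real `Experiment._can_use_native_setup` may read these flags from the experiment object instead.

```python
from typing import Any, Mapping, Optional

SUPERVISED_TASKS = {"classification", "regression"}  # assumed task names


def can_use_native_setup(
    task: str,
    *,
    normalize: bool = False,
    transformation: bool = False,
    remove_outliers: bool = False,
    feature_selection: bool = False,
    setup_kwargs: Optional[Mapping[str, Any]] = None,
) -> bool:
    """Hypothetical stand-in for the predicate after phase 2."""
    if task not in SUPERVISED_TASKS:
        return False  # non-supervised tasks stay on the legacy path
    if remove_outliers or feature_selection:
        return False  # still phase-3 options
    if setup_kwargs:
        return False  # extra setup kwargs still force legacy
    # normalize / transformation are accepted but no longer affect routing
    return True


native_with_phase2_flags = can_use_native_setup(
    "regression", normalize=True, transformation=True
)
legacy_with_outliers = can_use_native_setup("classification", remove_outliers=True)
```

In this sketch `normalize` and `transformation` are deliberately ignored by the routing decision, which is the phase-2 contract in one line.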
ADDED — tests
ADDED: `packages/engine/tests/test_session36_native_setup_phase2.py` (10 tests):
- Drain-locks for `normalize=True` and `transformation=True`.
- Numeric output has |mean| ≤ 1e-6 and std ≈ 1 with `normalize=True`.
- `transformation=True` puts a `PowerTransformer(yeo-johnson)` in the numeric pipeline.
- Combined: pipeline step names are `["imputer", "transformer", "scaler"]` exactly.
- Phase-3 flags still route to legacy.
- End-to-end `create + predict` chain with `normalize=True`.
- Regression with `transformation=True` works natively.
- Predicate test for the new phase-2 contract.
CHANGED — existing tests
CHANGED: `test_session35_native_setup` updated. `test_complex_preprocessing_falls_back_to_legacy` now uses `remove_outliers=True` (still phase-3) instead of `normalize=True` (now native). `test_can_use_native_setup_predicate` simplified to cover only the still-legacy cases (other phases have their own files).
INTERNAL
INTERNAL: Why Yeo-Johnson, not Box-Cox. Box-Cox requires strictly positive inputs, while Yeo-Johnson also accepts zeros and negatives. PyCaret 3.x defaulted to Yeo-Johnson for the same reason. Matching that default here means the existing notebooks that pass `transformation=True` keep producing comparable feature distributions.

INTERNAL: `PowerTransformer(standardize=False)` is intentional. The transformer's default `standardize=True` would z-score after the power transform, but we do that explicitly via the `scaler` step when `normalize=True` is set. Decoupling means a user who sets `transformation=True` alone gets the power-transformed values without z-scoring (matches legacy: power transformation alone preserves scale information; only `normalize=True` flips that).
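Both design notes can be checked directly against scikit-learn. The snippet below is a small sketch on synthetic data (the distribution parameters are arbitrary): it shows Box-Cox rejecting an input that contains a negative value, and that `standardize=False` followed by a separate `StandardScaler` reproduces what the built-in `standardize=True` would have done in one step.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(500, 1))
X[0, 0] = -2.0  # guarantee at least one negative value

# Box-Cox only accepts strictly positive data, so fitting raises ValueError here.
try:
    PowerTransformer(method="box-cox").fit(X)
    box_cox_failed = False
except ValueError:
    box_cox_failed = True

# transformation=True alone: power transform without z-scoring.
power_only = PowerTransformer(method="yeo-johnson", standardize=False).fit_transform(X)

# transformation=True + normalize=True: explicit scaler step after the transform.
two_step = StandardScaler().fit_transform(power_only)

# For comparison: the transformer's built-in standardization.
built_in = PowerTransformer(method="yeo-johnson", standardize=True).fit_transform(X)

same_result = np.allclose(two_step, built_in)
```

The decoupled form costs one extra pipeline step but lets each flag map to exactly one estimator, so `transformation=True` on its own leaves the output unscaled.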
Session 36 delta summary
| Metric | Session 35 end | Session 36 end |
|---|---|---|
| Native setup options | impute + ordinal + label-encode | + StandardScaler + PowerTransformer |
| Engine tests (fast + slow) | 155 | 165 |
| Combined tests | 301 | 311 |