2026-04-24
Session 26: God-class drain: `compare_models` (supervised)
Engineering log for session 26.
Baseline: session 25 drained `tune_model`. Session 26 drains the heart of the AutoML loop, `compare_models`, by reusing the already-drained `create_model` in a per-model loop.
CHANGED - engine
- CHANGED, BREAKING: `packages/engine/pycaret/core/supervised.py`, `SupervisedExperiment.compare_models` (supervised path). No longer delegates to `self._legacy.compare_models` for classification + regression. Iterates the engine's `_all_models_internal` registry (filtered by `include` / `exclude` / `turbo`), calls `self.create_model` for each candidate, and assembles the leaderboard from each candidate's `Mean` metrics row. Time-series / clustering / anomaly still delegate via `_compare_models_legacy`.
- CHANGED, BREAKING: Signature slim-down. Kept: `include`, `exclude`, `fold`, `cross_validation`, `sort`, `n_select`, `turbo`, `errors`, `fit_kwargs`, `round`, `verbose`. Dropped 3.x cruft: `budget_time`, `experiment_custom_tags`, `probability_threshold`, `groups`, `caller_params`. All gone for the same reasons as previous session drains: either dead code, MLflow integration that was already killed, or one-line post-hoc overrides on the result.
- CHANGED: All slots are keyword-only. Keyword-style `compare_models(include=, n_select=)` is the only valid call shape. The legacy positional form (`compare_models(["lr", "dt"], None, None, 4)`) is gone.
- CHANGED: Auto-detect ascending vs descending sort. Error metrics (`MAE`, `MSE`, `RMSE`, `MAPE`, `RMSLE`, plus the sklearn `neg_*` family) sort ascending; everything else sorts descending. The legacy code required the caller to know which way each metric sorted.
- CHANGED: `CompareResult.leaderboard` row source. Each row is the `Mean` row of `created.metrics` (the per-fold DataFrame from session 24's drained `create_model`), prepended with a `Model` column for the registry ID. The leaderboard column schema is therefore identical across classification (Accuracy/AUC/Recall/Prec./F1/Kappa/MCC) and across all 4 supervised result types (`CreateResult`, `TuneResult`, `CompareResult`, `PredictResult`).
- CHANGED: `errors="ignore"` drops failing candidates. When a candidate raises, the exception is swallowed and the candidate is dropped from the leaderboard. Nothing is logged yet: the per-candidate loop is currently silent on failure for noise reasons. Future enhancement (tracked as a polish item): log the exception type / message via the event stream so users can see what failed.
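The loop, the keyword-only signature, and the sort auto-detection described above can be sketched roughly as follows. This is a minimal illustration, not the engine code: `compare_models_sketch`, `sorts_ascending`, and `fake_create_model` are hypothetical names, `create_model` and `registry` are passed in as stand-ins for `self.create_model` and `_all_models_internal`, the leaderboard is a plain list of dicts rather than a DataFrame, the `turbo` filter is omitted, and the scores are made up.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Error metrics named in this log; lower is better, so they sort ascending.
ERROR_METRICS = {"MAE", "MSE", "RMSE", "MAPE", "RMSLE"}


def sorts_ascending(metric: str) -> bool:
    """True for error metrics and sklearn-style neg_* names (per this log)."""
    return metric.upper() in ERROR_METRICS or metric.lower().startswith("neg_")


def compare_models_sketch(
    create_model: Callable[[str], Dict[str, float]],  # hypothetical stand-in
    registry: List[str],                              # hypothetical stand-in
    *,  # every option after this marker is keyword-only
    include: Optional[List[str]] = None,
    exclude: Optional[List[str]] = None,
    sort: str = "Accuracy",
    n_select: int = 1,
    errors: str = "ignore",
) -> Tuple[List[Dict[str, object]], List[str]]:
    """Return (leaderboard rows, ids of the top n_select candidates)."""
    excluded = set(exclude or [])
    candidates = [m for m in (include or registry) if m not in excluded]

    rows: List[Dict[str, object]] = []
    for model_id in candidates:
        try:
            mean_row = create_model(model_id)  # the candidate's Mean metrics
        except Exception:
            if errors == "raise":
                raise
            continue  # errors="ignore": drop the candidate, keep going
        rows.append({"Model": model_id, **mean_row})

    # Error metrics sort smallest-first; everything else largest-first.
    rows.sort(key=lambda row: row[sort], reverse=not sorts_ascending(sort))
    return rows, [str(row["Model"]) for row in rows[:n_select]]


def fake_create_model(model_id: str) -> Dict[str, float]:
    """Toy scorer with made-up numbers; raises KeyError for unknown ids."""
    scores = {
        "lr": {"Accuracy": 0.80, "MAE": 0.30},
        "dt": {"Accuracy": 0.90, "MAE": 0.50},
    }
    return scores[model_id]
```

With the toy scorer, `compare_models_sketch(fake_create_model, ["lr", "dt", "bogus"], n_select=2)` ranks `dt` ahead of `lr` and drops `bogus` without raising, while passing a third positional argument fails with `TypeError` because of the bare `*` in the signature.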
ADDED - tests
- ADDED: `packages/engine/tests/test_session26_compare.py`. 10 tests covering: top-N return shape, default-sort defaults, ascending sort for error metrics, `exclude=` removes models, `turbo=True` blocks slow models, drain-lock against `self._legacy.compare_models`, end-to-end `compare → predict` chain, `errors="ignore"` skips a bogus model id, `NotFittedError` on unfit, and that `result.best` is a real `Pipeline`.
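A drain-lock test of the kind listed above could look something like the sketch below. It is an assumption about the shape of such a test, not a copy of the real one: `ExperimentSketch` is a hypothetical stand-in for `SupervisedExperiment`, and its `create_model` returns a made-up score. The idea is to booby-trap the legacy method with `unittest.mock` so that any delegation fails loudly.

```python
from unittest.mock import MagicMock


class ExperimentSketch:
    """Hypothetical stand-in for the experiment class, for illustration only."""

    def __init__(self):
        self._legacy = MagicMock()  # stands in for the legacy god-class

    def create_model(self, model_id):
        return {"Accuracy": 0.5}  # made-up Mean metrics row

    def compare_models(self, *, include):
        # Native path: loops over create_model, never touches self._legacy.
        return [{"Model": m, **self.create_model(m)} for m in include]


def test_compare_models_does_not_touch_legacy():
    exp = ExperimentSketch()
    # Any call into the legacy method would raise immediately.
    exp._legacy.compare_models.side_effect = AssertionError("legacy path was called")

    leaderboard = exp.compare_models(include=["lr", "dt"])

    exp._legacy.compare_models.assert_not_called()
    assert [row["Model"] for row in leaderboard] == ["lr", "dt"]
```

Setting `side_effect` as well as asserting `assert_not_called()` is deliberately redundant: the first catches a regression at the call site, the second documents the intent.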
INTERNAL
- INTERNAL: Reusing already-drained verbs. The native `compare_models` is ~50 LoC of glue around `self.create_model`. No new search / metric registry / fold logic: that all lives in `create_model` already. Each new drain reuses upstream drained verbs, which is why later sessions are progressively shorter despite covering more surface area. The `_cross_validate_supervised` helper from session 24 is now indirectly used by every supervised verb (`create_model` calls it directly; `tune_model` and `compare_models` call `create_model`).
- INTERNAL: Empty-result soft-handling. If every candidate fails (impossible in practice with the default registry, but `errors="ignore"` + a custom `include=` could trigger it), we return an empty `CompareResult(best=None, models=[], leaderboard=DataFrame(), ranked_ids=[])` rather than raising. Caller code that checks `if result.best is not None` gets a clear path; callers expecting at least one result must pass `errors="raise"` instead.
- INTERNAL: Per-candidate error swallowing keeps the leaderboard reproducible. Without `errors="ignore"` as the default, a single new model in the registry that breaks on a particular dataset would sink every notebook in the wild. With it, the registry can grow without breaking historical comparisons.
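On the caller side, the empty-result contract above might be handled along these lines. This is a sketch under assumptions: `CompareResultSketch` only mirrors the four field names quoted in this log, uses a plain list where the real result holds a DataFrame, and `best_or_fallback` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class CompareResultSketch:
    """Hypothetical mirror of the CompareResult fields named in this log."""

    best: Optional[Any] = None
    models: List[Any] = field(default_factory=list)
    leaderboard: List[Dict[str, Any]] = field(default_factory=list)  # DataFrame stand-in
    ranked_ids: List[str] = field(default_factory=list)


def best_or_fallback(result: CompareResultSketch, fallback: Any) -> Any:
    """Use the winning model when there is one, otherwise the caller's fallback."""
    if result.best is not None:
        return result.best
    return fallback


# Every candidate failed under errors="ignore": all fields are empty.
empty = CompareResultSketch()
# At least one candidate survived (the string is a placeholder for a Pipeline).
filled = CompareResultSketch(best="pipeline-lr", models=["pipeline-lr"], ranked_ids=["lr"])
```

Here `best_or_fallback(empty, "baseline")` returns the fallback and `best_or_fallback(filled, "baseline")` returns the stored best model, so a notebook keeps running either way.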
Session 26 delta summary
| Metric | Session 25 end | Session 26 end |
|---|---|---|
| Supervised OOP verbs still on `self._legacy` | 2 | 1 |
| Engine tests (fast + slow) | 70 | 80 |
| Combined tests | 216 | 226 |