2026-04-25
Session 32: Per-Experiment metric registry + add_metric / remove_metric drain
Engineering log for session 32.
Baseline: session 31 drained pull / models / get_metrics. Session 32 promotes the metric registry into per-Experiment state, drains add_metric and remove_metric, and fixes a real bug: custom metrics registered via add_metric now actually show up in CV results (previously they lived in the legacy holder while the native CV path read from the global container helpers, so the metric was registered but never computed).
CHANGED: engine
- CHANGED: `Experiment._get_metric_registry()` is the new single source of truth for the per-Experiment metric registry. Lazily builds from the task helper, caches in `self._fit_state["metric_registry"]` post-fit, returns a fresh build pre-fit (for the fit-sentinel test pattern). Returns `None` for time-series; the caller falls back to legacy.
- CHANGED: 6 metric-registry callsites consolidated. `_compute_predict_metrics`, `_cross_validate_supervised`, and the public `get_metrics()` now all funnel through `_get_metric_registry()`. The 4-way classification/regression/clustering/anomaly task switch lives in one place.
- CHANGED, BREAKING: `Experiment.add_metric` (classification + regression) no longer delegates to `self._legacy.add_metric`. Builds the right `<Task>MetricContainer` for the current task and inserts it into the snapshot. Subsequent `create_model` / `tune_model` / `compare_models` / `predict_model` calls compute the metric on every fold and include it in the leaderboard. Time-series falls through to legacy.
  - Signature: `(id, name, score_func, target="pred", greater_is_better=True, args=None, is_multiclass=True, **kwargs)`. Mirrors legacy.
- CHANGED, BREAKING: `Experiment.remove_metric` drained. Accepts the metric's `id` or its display name (legacy semantics). Raises `ValueError` when no match; was previously silent.
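The registry behaviour described above can be sketched roughly as follows. This is a minimal toy, not the engine's code: `MetricContainer`, `build_default_registry`, `_require_fitted_registry`, and the local `NotFittedError` are hypothetical stand-ins, and `add_metric` takes only a subset of the real signature.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


class NotFittedError(Exception):
    """Local stand-in; the engine's real exception class is not shown in this log."""


@dataclass
class MetricContainer:
    """Hypothetical stand-in for the <Task>MetricContainer classes."""
    id: str
    name: str
    score_func: Callable[..., float]
    greater_is_better: bool = True
    is_custom: bool = False


def _accuracy(y_true, y_pred) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def build_default_registry(task: str) -> Optional[Dict[str, MetricContainer]]:
    """Hypothetical task helper: the task switch lives in one place."""
    if task == "time_series":
        return None  # caller falls back to the legacy path
    if task == "classification":
        return {"acc": MetricContainer("acc", "Accuracy", _accuracy)}
    return {}  # other tasks' built-ins are omitted from this sketch


class Experiment:
    def __init__(self, task: str) -> None:
        self.task = task  # note: no _fit_state until fit() runs

    def fit(self) -> "Experiment":
        # Snapshot the registry into per-Experiment state at fit time.
        self._fit_state = {"metric_registry": build_default_registry(self.task)}
        return self

    def _get_metric_registry(self) -> Optional[Dict[str, MetricContainer]]:
        state = getattr(self, "_fit_state", None)
        if state is not None and "metric_registry" in state:
            return state["metric_registry"]  # cached post-fit
        return build_default_registry(self.task)  # fresh build, not cached

    def _require_fitted_registry(self) -> Dict[str, MetricContainer]:
        if not hasattr(self, "_fit_state"):
            raise NotFittedError("call fit() before editing metrics")
        registry = self._get_metric_registry()
        if registry is None:
            raise NotImplementedError("time-series uses the legacy path")
        return registry

    def add_metric(self, id, name, score_func, greater_is_better=True) -> None:
        registry = self._require_fitted_registry()
        registry[id] = MetricContainer(
            id, name, score_func, greater_is_better, is_custom=True
        )

    def remove_metric(self, name_or_id: str) -> None:
        registry = self._require_fitted_registry()
        for key, metric in list(registry.items()):
            if name_or_id in (key, metric.name):  # id or display name
                del registry[key]
                return
        raise ValueError(f"no metric matches {name_or_id!r}")
```

In this sketch, `Experiment("classification").fit()` followed by `add_metric("errs", "Errors", some_func)` leaves an `is_custom=True` entry in that instance's registry only, and `remove_metric("Errors")` or `remove_metric("errs")` removes it again.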
ADDED: tests
- ADDED: `packages/engine/tests/test_session32_metric_registry.py`, 10 tests:
  - The killer test: `add_metric(...)` → subsequent `create_model.metrics` includes the new column. This was broken before the drain.
  - `add_metric` shows up in `get_metrics()` with `Custom=True`.
  - `remove_metric` drops the metric from CV.
  - `remove_metric` accepts display name.
  - `remove_metric` unknown → `ValueError`.
  - Drain-locks for both verbs.
  - Custom metric persists across `create_model` → `tune_model` → `compare_models`.
  - Regression `add_metric` works.
  - `NotFittedError` pre-fit.
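The shape of the killer test can be illustrated against a toy CV loop. This sketch does not use the engine at all: `cross_validate`, `error_count`, and the fold data are invented for illustration, and a plain dict of callables stands in for the metric registry.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Fold = Tuple[Sequence[int], Sequence[int]]  # (y_true, y_pred) for one fold


def accuracy(y_true, y_pred) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def error_count(y_true, y_pred) -> float:
    return float(sum(t != p for t, p in zip(y_true, y_pred)))


def cross_validate(
    registry: Dict[str, Callable[..., float]], folds: List[Fold]
) -> List[Dict[str, float]]:
    """Return one result row per fold with one column per registered metric."""
    return [
        {name: fn(y_true, y_pred) for name, fn in registry.items()}
        for y_true, y_pred in folds
    ]


def test_custom_metric_appears_in_cv_results() -> None:
    folds = [([1, 0, 1, 1], [1, 0, 0, 1]), ([0, 0, 1, 1], [0, 0, 1, 1])]
    registry = {"Accuracy": accuracy}

    registry["Errors"] = error_count  # stands in for add_metric(...)
    rows = cross_validate(registry, folds)
    assert len(rows) == len(folds)
    assert all("Errors" in row for row in rows)  # computed on every fold

    del registry["Errors"]  # stands in for remove_metric(...)
    assert all("Errors" not in row for row in cross_validate(registry, folds))
```

The point of the pattern is that the assertion targets the CV output rather than the registry itself, which is what catches a metric that is registered but never computed.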
INTERNAL
- INTERNAL: Why a per-Experiment registry (vs. the global container helpers). The legacy global registry is shared across experiments, so adding a metric on one Experiment instance would visibly affect another. Promoting to `_fit_state["metric_registry"]` (a `dict` copy of the global registry, taken at fit time) decouples experiments cleanly: each `Experiment` carries its own metrics, `add_metric` mutates only that experiment's registry, and CV / leaderboard / predict all read from the same source.
- INTERNAL: The fit-sentinel pattern test fix. A small set of fast tests in `test_session23_predict.py` use `exp._fitted = True` (a fake fit-sentinel) to test `predict_model` without spinning up the engine. Session 32's first iteration broke them because `_get_metric_registry()` required `_fit_state` to exist. Fix: when `_fit_state` doesn't exist yet (or doesn't have `"metric_registry"`), `_get_metric_registry()` builds a fresh registry on every call without caching. The fit-sentinel tests are explicitly read-only, so they don't need the cache.
- INTERNAL: Why `add_metric` always sets `is_custom=True`. Both `ClassificationMetricContainer` and `RegressionMetricContainer` have an `is_custom` flag intended exactly for user-added metrics. Setting it on every `add_metric` call lets `get_metrics()` distinguish built-in vs custom in its DataFrame output, which the UI / Control Plane can use to allow editing only the custom ones.
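The isolation argument can be made concrete with a small sketch. Both classes below are hypothetical toys (neither is the engine's `Experiment`), and the registry is reduced to a dict of plain dicts so the shared-versus-copied difference is easy to see.

```python
from typing import Dict, List

Registry = Dict[str, dict]  # metric id -> {"name": ..., "custom": ...}


def make_global_registry() -> Registry:
    """Fresh stand-in for the global container helpers' registry."""
    return {"acc": {"name": "Accuracy", "custom": False}}


class SharedExperiment:
    """Legacy shape: every instance holds a reference to the same dict."""

    def __init__(self, global_registry: Registry) -> None:
        self._registry = global_registry  # shared reference, no copy

    def add_metric(self, id: str, name: str) -> None:
        self._registry[id] = {"name": name, "custom": True}

    def metric_ids(self) -> List[str]:
        return sorted(self._registry)


class IsolatedExperiment:
    """Per-Experiment shape: a dict copy taken at fit time."""

    def __init__(self, global_registry: Registry) -> None:
        self._global = global_registry

    def fit(self) -> "IsolatedExperiment":
        # Shallow copy: adding or removing keys no longer touches the global
        # dict, although the entry objects themselves are still shared.
        self._fit_state = {"metric_registry": dict(self._global)}
        return self

    def add_metric(self, id: str, name: str) -> None:
        self._fit_state["metric_registry"][id] = {"name": name, "custom": True}

    def metric_ids(self) -> List[str]:
        return sorted(self._fit_state["metric_registry"])

    def editable_ids(self) -> List[str]:
        """Ids a UI could allow editing: only the custom ones."""
        registry = self._fit_state["metric_registry"]
        return sorted(k for k, v in registry.items() if v["custom"])
```

With two `SharedExperiment` instances built on the same dict, a metric added through one appears in the other's `metric_ids()`. With two fitted `IsolatedExperiment` instances, it appears only on the instance that added it, and `editable_ids()` returns just that custom entry.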
Session 32 delta summary
| Metric | Session 31 end | Session 32 end |
|---|---|---|
| Drainable secondary verbs still on _legacy | 2 | 0 ✅ |
| Engine tests (fast + slow) | 121 | 131 |
| Combined tests | 267 | 277 |