2026-04-23

Session 6: Cleanup pass 2 + Application-Platform plan authored

Engineering log for session 6.

Baseline: end of session 5 (v4 branch live on GitHub, CI green). Environment: unchanged.

Theme: the user asked for "one more round of clean ups. get rid of any garbage from 3.0. keep the bare minimum. the core logic that we will use and thats it.", then laid out the Part-2 vision: PyCaret as an enterprise-grade open-source application platform (CLI + FastAPI + SQL DB + React UI + Docker). Session 6 executed the cleanup and captured the platform plan.

REMOVED

  • [REMOVED] pycaret/distributions.py (0 callers) deleted.

  • [REMOVED] pycaret/internal/cloudpickle_compat.py (0 callers) deleted.

  • [REMOVED] pycaret/internal/cuml_wrappers.py (143 LOC) deleted. cuml is not a 4.0 dep; GPU fallback via NVIDIA cuml is out of scope for the 4.0 engine.

  • [REMOVED] pycaret/loggers/ shim package deleted. Re-pointed 7 BaseLogger import sites to pycaret.logging.base directly (1 in each of: classification/oop.py, regression/oop.py, time_series/forecasting/oop.py, internal/pycaret_experiment/tabular_experiment.py, internal/pycaret_experiment/unsupervised_experiment.py; 2 others already migrated). The 4.0 BaseLogger lives in pycaret.logging.base; the shim was legacy-compat and had no users after session 3.

  • [REMOVED, BREAKING] 9 killed-verb methods deleted across the god-class and task oop wrappers (no replacement; the public API didn't expose them):

    File                                                 | Methods deleted                                        | ~LOC
    internal/pycaret_experiment/pycaret_experiment.py    | deploy_model (stub)                                    | 9
    internal/pycaret_experiment/tabular_experiment.py    | deploy_model, convert_model, create_api, create_docker | 361
    internal/pycaret_experiment/supervised_experiment.py | check_fairness, create_app, dashboard, check_drift     | 353
    classification/oop.py                                | deploy_model, dashboard                                | 174
    regression/oop.py                                    | deploy_model, dashboard                                | 168
    time_series/forecasting/oop.py                       | deploy_model                                           | 91
    Total                                                | 15 method definitions                                  | ~1,156

    Lazy imports inside those methods (mlflow / comet / wandb / dagshub / fairlearn / evidently / gradio / fastapi / boto3 / m2cgen) disappeared with the bodies.

CHANGED

  • [CHANGED] Model containers (containers/models/{classification,regression,clustering,anomaly}.py): cuml branches now raise NotImplementedError. Removed the import pycaret.internal.cuml_wrappers statements and the pycaret.internal.cuml_wrappers.get_*() call sites inside if gpu_imported: blocks, and replaced the import cuml.X lines inside if experiment.gpu_param == "force": / elif experiment.gpu_param: blocks with a raise. These branches were unreachable with the default gpu_param=False and cuml not installed, so behaviour is unchanged; the code is now honest about it. (10 more cuml imports in containers/models/time_series.py are left as-is; they follow the same dead-branch pattern and will go with the Phase-5 god-class drain.)
  • [INTERNAL] The import "from functools import partial" was removed from supervised_experiment.py (only the deleted check_fairness method used it).
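
The branch pattern described for the model containers can be sketched roughly as below. This is an illustration only: resolve_estimator_name is a hypothetical stand-in for a container's estimator lookup, the returned string is a placeholder for the CPU path, and the real containers read experiment.gpu_param rather than taking an argument.

```python
def resolve_estimator_name(gpu_param=False):
    """Hypothetical stand-in for one model container's estimator lookup."""
    if gpu_param == "force":
        # This branch used to import a cuml estimator; it now fails loudly.
        raise NotImplementedError(
            "cuml GPU estimators are out of scope for the 4.0 engine"
        )
    elif gpu_param:
        # Any other truthy gpu_param hits the second former cuml branch.
        raise NotImplementedError(
            "cuml GPU estimators are out of scope for the 4.0 engine"
        )
    # Default gpu_param=False: the CPU path (placeholder name for illustration).
    return "sklearn.linear_model.LogisticRegression"


print(resolve_estimator_name())
```

With the default gpu_param=False neither raise is reached, which matches the "no behaviour change" claim above.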

ADDED

  • [DOCS, ADDED] docs/revamp/PLATFORM_PLAN.md (~350 lines), a detailed design for the Part-2 application platform:
    • Vision: credible open-source alternative to DataRobot / H2O.ai for teams under ~20 people.
    • Architecture: monorepo with 4 sibling packages — pycaret (library, current) + pycaret-server (FastAPI) + pycaret-ui (React) + pycaret-cli (CLI).
    • Data model: Workspace → Project → Experiment → Run → Pipeline. 11 SQLAlchemy tables.
    • First-run flow: docker compose up → self-service admin setup wizard → no external config.
    • Database: SQLite default, Postgres/MySQL opt-in via DATABASE_URL.
    • Auth: local user store + JWT; OAuth as plugin; admin/member roles.
    • Tech choices: Vite + React 18 + Tailwind + TanStack Query + Zustand + Plotly.js; FastAPI + uvicorn + SQLAlchemy + Alembic; Typer + Rich for CLI.
    • 6 new phases (7-12) added to ROADMAP.
    • Gated on Phase 5 (pycaret==4.0.0alpha0 shipping), so the engine stays focused.
    • Explicit "out of scope": Celery/Redis v1, K8s operator, GraphQL, multi-tenant SaaS, hosted billing, model serving.
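
As a rough sketch of the "SQLite default, Postgres/MySQL opt-in via DATABASE_URL" rule in the plan: the helper names and the default filename below are assumptions made for illustration, not taken from PLATFORM_PLAN.md. The URL shape (dialect+driver://...) follows the SQLAlchemy convention.

```python
import os

# Hypothetical default; the plan only says SQLite is the default backend.
DEFAULT_DATABASE_URL = "sqlite:///pycaret_platform.db"


def resolve_database_url(env=None):
    """Return DATABASE_URL if set and non-empty, else the SQLite default."""
    env = os.environ if env is None else env
    url = env.get("DATABASE_URL", "").strip()
    return url or DEFAULT_DATABASE_URL


def database_backend(url):
    """Extract the dialect from a 'dialect+driver://...' style URL."""
    scheme = url.split("://", 1)[0]
    return scheme.split("+", 1)[0]


# No external config: falls back to SQLite.
print(database_backend(resolve_database_url({})))
# Opt-in: an explicit DATABASE_URL selects Postgres.
print(database_backend(resolve_database_url(
    {"DATABASE_URL": "postgresql+psycopg://user@localhost/pycaret"}
)))
```

Passing the environment as a dict keeps the function easy to test without touching os.environ.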

DOCS

  • [DOCS] ROADMAP.md restructured into Part 1 (Engine, Phases 0-6) and Part 2 (Platform, Phases 7-12). Every checkbox reflects actual state: Phases 0, 1, 3.5 ✅ COMPLETE; Phase 2 / 4 / 6 ✅ MOSTLY / 🟡 PARTIAL; Phase 5 🟡 IN FLIGHT (god-class drain, 10-verb migration order spelled out); Phases 7-12 🔴 NOT STARTED.
  • [DOCS] STATUS.md updated with session-6 delta table + platform-plan summary.
  • [DOCS] docs/revamp/README.md hub index updated to include ARCHITECTURE.md, PLATFORM_PLAN.md, github_issues/. New "Two parts, one programme" section. Reading order reorganized.

TESTS

  • [TESTS] 32/32 still green on Python 3.13 + sklearn 1.7.2 + NumPy 2.3.5 + pandas 2.x, in 1:37 (was 2:07 in session 5; slightly faster with less code to import).

ADDED: 6 resolved platform decisions

Owner answered the six parked questions from PLATFORM_PLAN.md §7. Each answer is now baked into the plan and recorded as an ADR in DECISIONS.md:

  • [DOCS, ADDED] Decision 1: Run notebooks are first-class artifacts. Every Run persists run.ipynb + fitted_pipeline.pkl + leaderboard.json + events.jsonl + preview.html. Immutable, downloadable, shareable via signed URL, previewable in-app. Storage: local disk v1, S3 when deployed.
  • [DOCS, ADDED] Decision 2: Data-source connectors v1 = CSV upload + S3 + Postgres. DataSourceConnector ABC allows adding Snowflake / GSheets / MySQL later without core changes. AWS-first, since AWS is the immediate deploy target.
  • [DOCS, ADDED] Decision 3: Pipelines are workspace-scoped + shareable across projects. Pipeline moves out of Project into Workspace; a pipeline_project_links many-to-many table joins them. Workspace gets a top-level "Pipelines" screen.
  • [DOCS, ADDED] Decision 4: In-house serving system, not MLServer/BentoML. DeploymentRegistry loads pickles into memory; a single catch-all POST /api/v1/deployments/{slug}/predict handles inference. Per-deployment auth: workspace / api-key / public. Per-deployment metrics: count, p50/p95 latency, error rate. Phase 11 renamed "In-house serving + Docker/deploy".
  • [DOCS, ADDED] Decision 5: Dual-license the platform packages. Engine pycaret stays MIT. pycaret-server / pycaret-cli / pycaret-ui become MIT + BSL 1.1 (BSL for multi-tenant hosted SaaS only; converts to MIT after 3 years). CLA added to CONTRIBUTING.md. Mirrors Sentry / Cal.com / Supabase / Plausible posture.
  • [DOCS, ADDED] Decision 6: Metrics stored as summary AND per-fold. Two tables: runs.metrics_summary (leaderboard shape) and fold_metrics (per-fold × per-model × per-metric). Summary drives the leaderboard; per-fold unlocks variance / stability / time-to-train analysis.
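
To make Decision 4 concrete, here is a minimal in-memory sketch of a registry that tracks the listed per-deployment metrics (count, p50/p95 latency, error rate). Only the DeploymentRegistry name comes from the plan; the method names, the callable-based predictor (standing in for an unpickled pipeline), and the percentile method are assumptions, and auth and the HTTP route are left out.

```python
import statistics
import time


class DeploymentRegistry:
    """Hypothetical sketch: slug -> in-memory predictor plus simple counters."""

    def __init__(self):
        self._predictors = {}
        self._latencies = {}  # slug -> list of seconds per call
        self._errors = {}     # slug -> number of failed calls

    def register(self, slug, predictor):
        # A real system would load a pickled pipeline here instead.
        self._predictors[slug] = predictor
        self._latencies[slug] = []
        self._errors[slug] = 0

    def predict(self, slug, payload):
        predictor = self._predictors[slug]  # unknown slug raises KeyError
        start = time.perf_counter()
        try:
            return predictor(payload)
        except Exception:
            self._errors[slug] += 1
            raise
        finally:
            # Latency is recorded for successful and failed calls alike.
            self._latencies[slug].append(time.perf_counter() - start)

    def metrics(self, slug):
        latencies = self._latencies[slug]
        count = len(latencies)
        if count >= 2:
            cuts = statistics.quantiles(latencies, n=100, method="inclusive")
            p50, p95 = cuts[49], cuts[94]
        elif count == 1:
            p50 = p95 = latencies[0]
        else:
            p50 = p95 = None
        error_rate = self._errors[slug] / count if count else 0.0
        return {"count": count, "p50": p50, "p95": p95, "error_rate": error_rate}


registry = DeploymentRegistry()
registry.register("churn", lambda rows: [len(row) for row in rows])
print(registry.predict("churn", [[1, 2], [3]]))
print(registry.metrics("churn")["count"])
```

A catch-all route handler would then only need to look up the slug, check the deployment's auth mode, and call predict.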

Data model in PLATFORM_PLAN.md §3 expanded to 14 SQLAlchemy tables (from 11): added fold_metrics, deployments, api_keys, pipeline_project_links. Phase 11 now covers the serving subsystem in detail. Dep discipline §6 updated with nbconvert (notebook preview), boto3 (S3 extra), psycopg[binary] (postgres extra), python-multipart (CSV upload), joblib (deployment loading).
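
One way to picture the relationship between the per-fold table and the leaderboard summary, using the standard-library sqlite3 module. Only the fold_metrics table name comes from the plan; the column names and the idea of deriving the summary with a GROUP BY are assumptions for illustration, and the sample values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical columns: one row per run x model x metric x fold.
conn.execute(
    """
    CREATE TABLE fold_metrics (
        run_id INTEGER NOT NULL,
        model  TEXT    NOT NULL,
        metric TEXT    NOT NULL,
        fold   INTEGER NOT NULL,
        value  REAL    NOT NULL,
        PRIMARY KEY (run_id, model, metric, fold)
    )
    """
)
conn.executemany(
    "INSERT INTO fold_metrics VALUES (?, ?, ?, ?, ?)",
    [
        (1, "lr", "accuracy", 0, 0.80),
        (1, "lr", "accuracy", 1, 0.90),
        (1, "rf", "accuracy", 0, 0.85),
        (1, "rf", "accuracy", 1, 0.95),
    ],
)


def summarize(connection, run_id):
    """Collapse per-fold rows into leaderboard rows: mean and spread per model."""
    cursor = connection.execute(
        """
        SELECT model, metric, AVG(value) AS mean, MAX(value) - MIN(value) AS spread
        FROM fold_metrics
        WHERE run_id = ?
        GROUP BY model, metric
        ORDER BY mean DESC
        """,
        (run_id,),
    )
    return cursor.fetchall()


for row in summarize(conn, 1):
    print(row)
```

The summary row is what a leaderboard needs; the spread column is the kind of stability signal that only the per-fold rows can provide.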

New §8 "Licensing posture" added to PLATFORM_PLAN.md. Reading order updated.

Session 6 delta summary

Metric                               | Session 5 end   | Session 6 end                 | Δ
Source LOC in pycaret/               | 51,976          | 50,544                        | −1,432
Zero-import leaf files               | 3               | 0                             | −3
Killed-verb methods in source        | 15              | 0                             | −15
cuml-coupled files with runtime risk | 5               | 0 (branches raise)            |
Part-2 platform plan                 | none            | PLATFORM_PLAN.md (~350 lines) | +1 doc
Roadmap phases defined               | 6 engine phases | 12 (6 engine + 6 platform)    | +6
Test pass rate                       | 100% (32/32)    | 100% (32/32)                  |