Session 6: Cleanup pass 2 + Application-Platform plan authored
Engineering log for session 6.
Baseline: end of session 5 (v4 branch live on GitHub, CI green). Environment: unchanged.
Theme: user asked for "one more round of clean ups. get rid of any garbage from 3.0. keep the bare minimum. the core logic that we will use and thats it." Then laid out the Part-2 vision — PyCaret as an enterprise-grade open-source application platform (CLI + FastAPI + SQL DB + React UI + Docker). Session 6 executed the cleanup and captured the platform plan.
REMOVED
- REMOVED: `pycaret/distributions.py` (0 callers) deleted.
- REMOVED: `pycaret/internal/cloudpickle_compat.py` (0 callers) deleted.
- REMOVED: `pycaret/internal/cuml_wrappers.py` (143 LOC) deleted. cuml is not a 4.0 dep; GPU fallback via NVIDIA cuml is out of scope for the 4.0 engine.
- REMOVED: `pycaret/loggers/` shim package deleted. Re-pointed 7 `BaseLogger` import sites to `pycaret.logging.base` directly (1 in each of: `classification/oop.py`, `regression/oop.py`, `time_series/forecasting/oop.py`, `internal/pycaret_experiment/tabular_experiment.py`, `internal/pycaret_experiment/unsupervised_experiment.py`; 2 others already migrated). The 4.0 `BaseLogger` lives in `pycaret.logging.base`; the shim was legacy-compat and had no user after session 3.
- REMOVED, BREAKING: 9 killed-verb methods deleted across god-class + task oop wrappers (no replacement; public API didn't expose them):

  | File | Methods deleted | ~LOC |
  |---|---|---|
  | `internal/pycaret_experiment/pycaret_experiment.py` | `deploy_model` (stub) | 9 |
  | `internal/pycaret_experiment/tabular_experiment.py` | `deploy_model`, `convert_model`, `create_api`, `create_docker` | 361 |
  | `internal/pycaret_experiment/supervised_experiment.py` | `check_fairness`, `create_app`, `dashboard`, `check_drift` | 353 |
  | `classification/oop.py` | `deploy_model`, `dashboard` | 174 |
  | `regression/oop.py` | `deploy_model`, `dashboard` | 168 |
  | `time_series/forecasting/oop.py` | `deploy_model` | 91 |
  | Total | 15 method definitions | ~1,156 |

  Lazy imports inside those methods (mlflow / comet / wandb / dagshub / fairlearn / evidently / gradio / fastapi / boto3 / m2cgen) disappeared with the bodies.
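One way to confirm that no killed-verb definitions remain is a small AST scan. The sketch below is illustrative only: `find_killed_verbs` and the two sample sources are hypothetical, and the name set is taken from the methods listed in the entry above.

```python
import ast

# Method names from the entry above; the helper itself is a hypothetical check.
KILLED_VERBS = {
    "deploy_model", "convert_model", "create_api", "create_docker",
    "check_fairness", "create_app", "dashboard", "check_drift",
}


def find_killed_verbs(source: str) -> list[tuple[str, int]]:
    """Return (name, line) for each function definition named after a killed verb."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        is_def = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_def and node.name in KILLED_VERBS:
            hits.append((node.name, node.lineno))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical stand-ins for a source file before and after the cleanup.
before = '''
class Experiment:
    def create_model(self): ...
    def deploy_model(self): ...
    def dashboard(self): ...
'''
after = '''
class Experiment:
    def create_model(self): ...
'''

print(find_killed_verbs(before))  # two hits expected
print(find_killed_verbs(after))   # no hits expected
```

Running the same function over every `.py` file under `pycaret/` would give a count comparable to the "killed-verb methods in source" metric, assuming that metric counts definitions by name.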
CHANGED
- CHANGED: Model containers (`containers/models/{classification,regression,clustering,anomaly}.py`): cuml branches now raise `NotImplementedError`. Deleted the `import pycaret.internal.cuml_wrappers` imports + the `pycaret.internal.cuml_wrappers.get_*()` call sites inside `if gpu_imported:` blocks, and replaced `import cuml.X` lines inside `if experiment.gpu_param == "force":` / `elif experiment.gpu_param:` blocks with a raise. These branches were unreachable with default `gpu_param=False` + cuml-not-installed, so no behaviour change; the code is now honest about it. (10 more cuml imports in `containers/models/time_series.py` left as-is, same dead-branch pattern; they'll go with the Phase-5 god-class drain.)
- INTERNAL: `from functools import partial` removed from `supervised_experiment.py` (only the deleted `check_fairness` method used it).
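The shape of the changed branches can be sketched as follows. This is a simplified illustration, not the container code itself: `resolve_estimator_backend` is a hypothetical stand-in, and only the `gpu_param` values and the `NotImplementedError` come from the entry above.

```python
def resolve_estimator_backend(gpu_param=False):
    """Hypothetical helper mirroring the branch structure in the model containers."""
    if gpu_param == "force":
        # An `import cuml.X` line used to live here; it is now an explicit raise.
        raise NotImplementedError(
            "gpu_param='force' requires cuml, which is not part of the 4.0 engine"
        )
    elif gpu_param:
        # Same pattern for any other truthy gpu_param value.
        raise NotImplementedError(
            "GPU estimators via cuml are out of scope for the 4.0 engine"
        )
    # Default path (gpu_param=False): plain CPU estimator, unchanged behaviour.
    return "cpu"


print(resolve_estimator_backend())  # default path still works
```

With the default `gpu_param=False` the raise is never reached, which is why the change is behaviour-neutral for existing callers.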
ADDED
- DOCS, ADDED: `docs/revamp/PLATFORM_PLAN.md` (~350 lines), a detailed design for the Part-2 application platform:
  - Vision: credible open-source alternative to DataRobot / H2O.ai for teams under ~20 people.
  - Architecture: monorepo with 4 sibling packages: `pycaret` (library, current) + `pycaret-server` (FastAPI) + `pycaret-ui` (React) + `pycaret-cli` (CLI).
  - Data model: Workspace → Project → Experiment → Run → Pipeline. 11 SQLAlchemy tables.
  - First-run flow: `docker compose up` → self-service admin setup wizard → no external config.
  - Database: SQLite default, Postgres/MySQL opt-in via `DATABASE_URL`.
  - Auth: local user store + JWT; OAuth as plugin; admin/member roles.
  - Tech choices: Vite + React 18 + Tailwind + TanStack Query + Zustand + Plotly.js; FastAPI + uvicorn + SQLAlchemy + Alembic; Typer + Rich for CLI.
  - 6 new phases (7-12) added to ROADMAP.
  - Gated on Phase 5 (`pycaret==4.0.0alpha0` shipping) so the engine stays focused.
  - Explicit "out of scope": Celery/Redis v1, K8s operator, GraphQL, multi-tenant SaaS, hosted billing, model serving.
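The "SQLite default, opt-in via `DATABASE_URL`" rule could be resolved with a few lines of standard-library code. This is a sketch under assumptions: the default file name, `resolve_database_url` and `backend_name` are hypothetical and do not come from the plan; only the `DATABASE_URL` variable name does.

```python
import os

# Hypothetical default; the plan only states that SQLite is the default backend.
DEFAULT_DATABASE_URL = "sqlite:///pycaret.db"


def resolve_database_url(env=None) -> str:
    """Return DATABASE_URL if it is set and non-empty, else the SQLite default."""
    env = os.environ if env is None else env
    return env.get("DATABASE_URL") or DEFAULT_DATABASE_URL


def backend_name(url: str) -> str:
    """Extract the dialect from a SQLAlchemy-style URL, dropping any '+driver' suffix."""
    scheme = url.split("://", 1)[0]
    return scheme.split("+", 1)[0]


# No configuration: falls back to SQLite, so first run needs no external config.
print(backend_name(resolve_database_url({})))

# Opt-in: a Postgres URL supplied through the environment.
custom = {"DATABASE_URL": "postgresql+psycopg://user:secret@db:5432/pycaret"}
print(backend_name(resolve_database_url(custom)))
```

Passing the environment as a plain mapping keeps the function easy to test without touching the real process environment.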
DOCS
- DOCS: `ROADMAP.md` restructured into Part 1 (Engine, Phases 0-6) and Part 2 (Platform, Phases 7-12). Every checkbox reflects actual state: Phases 0, 1, 3.5 ✅ COMPLETE; Phase 2 / 4 / 6 ✅ MOSTLY / 🟡 PARTIAL; Phase 5 🟡 IN FLIGHT (god-class drain, 10-verb migration order spelled out); Phases 7-12 🔴 NOT STARTED.
- DOCS: `STATUS.md` updated with session-6 delta table + platform-plan summary.
- DOCS: `docs/revamp/README.md` hub index updated to include `ARCHITECTURE.md`, `PLATFORM_PLAN.md`, `github_issues/`. New "Two parts, one programme" section. Reading order reorganized.
TESTS
- TESTS: 32/32 still green on Python 3.13 + sklearn 1.7.2 + NumPy 2.3.5 + pandas 2.x, in 1:37 (was 2:07 in session 5; slightly faster with less code to import).
ADDED: 6 resolved platform decisions
Owner answered the six parked questions from PLATFORM_PLAN.md §7. Each answer is now baked into the plan and recorded as an ADR in DECISIONS.md:
- DOCS, ADDED: Decision 1: Run notebooks are first-class artifacts. Every Run persists `run.ipynb` + `fitted_pipeline.pkl` + `leaderboard.json` + `events.jsonl` + `preview.html`. Immutable, downloadable, shareable via signed URL, previewable in-app. Storage: local disk v1, S3 when deployed.
- DOCS, ADDED: Decision 2: Data-source connectors v1 = CSV upload + S3 + Postgres. `DataSourceConnector` ABC allows adding Snowflake / GSheets / MySQL later without core changes. AWS-first since it is the immediate deploy target.
- DOCS, ADDED: Decision 3: Pipelines are workspace-scoped + shareable across projects. `Pipeline` moves out of `Project` into `Workspace`; `pipeline_project_links` many-to-many joins them. Workspace gets a top-level "Pipelines" screen.
- DOCS, ADDED: Decision 4: In-house serving system, not MLServer/BentoML. `DeploymentRegistry` loads pickles into memory; a single catch-all `POST /api/v1/deployments/{slug}/predict` handles inference. Per-deployment auth: `workspace` / `api-key` / `public`. Per-deployment metrics: count, p50/p95 latency, error rate. Phase 11 renamed "In-house serving + Docker/deploy".
- DOCS, ADDED: Decision 5: Dual-license the platform packages. Engine `pycaret` stays MIT. `pycaret-server` / `pycaret-cli` / `pycaret-ui` become MIT + BSL 1.1 (BSL for multi-tenant hosted SaaS only; converts to MIT after 3 years). CLA added to CONTRIBUTING.md. Mirrors Sentry / Cal.com / Supabase / Plausible posture.
- DOCS, ADDED: Decision 6: Metrics stored as summary AND per-fold. Two tables: `runs.metrics_summary` (leaderboard shape) and `fold_metrics` (per-fold × per-model × per-metric). Summary drives leaderboard; per-fold unlocks variance / stability / time-to-train analysis.
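Decision 4 can be illustrated with a minimal in-memory registry. This is a sketch under assumptions, not the planned implementation: the method names, the nearest-rank percentile and `EchoModel` are hypothetical, and models are registered as already-loaded objects rather than unpickled from disk.

```python
import math
import time


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


class DeploymentRegistry:
    """Keeps loaded models in memory and tracks per-deployment metrics."""

    def __init__(self):
        self._models = {}
        self._latencies = {}
        self._errors = {}

    def register(self, slug, model):
        # Assumption: the caller has already loaded the model (e.g. from a pickle).
        self._models[slug] = model
        self._latencies[slug] = []
        self._errors[slug] = 0

    def predict(self, slug, rows):
        model = self._models[slug]  # KeyError signals an unknown deployment
        start = time.perf_counter()
        try:
            return model.predict(rows)
        except Exception:
            self._errors[slug] += 1
            raise
        finally:
            # Record latency for successful and failed calls alike.
            self._latencies[slug].append(time.perf_counter() - start)

    def metrics(self, slug):
        latencies = self._latencies[slug]
        count = len(latencies)
        return {
            "count": count,
            "p50": percentile(latencies, 50) if count else None,
            "p95": percentile(latencies, 95) if count else None,
            "error_rate": self._errors[slug] / count if count else 0.0,
        }


class EchoModel:
    """Hypothetical model: predicts the length of each input row."""

    def predict(self, rows):
        return [len(row) for row in rows]


registry = DeploymentRegistry()
registry.register("churn", EchoModel())
print(registry.predict("churn", [[1, 2], [3]]))
print(registry.metrics("churn")["count"])
```

A single catch-all HTTP route would then only need to look up the slug and call `registry.predict`, which is what keeps one endpoint sufficient for every deployment.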
Data model in PLATFORM_PLAN.md §3 expanded to 14 SQLAlchemy tables (from 11): added fold_metrics, deployments, api_keys, pipeline_project_links. Phase 11 now covers the serving subsystem in detail. Dep discipline §6 updated with nbconvert (notebook preview), boto3 (S3 extra), psycopg[binary] (postgres extra), python-multipart (CSV upload), joblib (deployment loading).
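To make the summary versus per-fold split concrete, the sketch below stores per-fold rows in SQLite and derives a leaderboard-shaped summary from them. The column layout, the `leaderboard` helper and the sample values are assumptions for illustration; only the `fold_metrics` table name comes from the plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical column layout: one row per run x model x metric x fold.
conn.execute(
    """
    CREATE TABLE fold_metrics (
        run_id INTEGER NOT NULL,
        model  TEXT    NOT NULL,
        metric TEXT    NOT NULL,
        fold   INTEGER NOT NULL,
        value  REAL    NOT NULL,
        PRIMARY KEY (run_id, model, metric, fold)
    )
    """
)

# Made-up sample values for two models over three folds.
rows = [
    (1, "lr", "accuracy", 0, 0.80),
    (1, "lr", "accuracy", 1, 0.84),
    (1, "lr", "accuracy", 2, 0.82),
    (1, "rf", "accuracy", 0, 0.90),
    (1, "rf", "accuracy", 1, 0.70),
    (1, "rf", "accuracy", 2, 0.80),
]
conn.executemany("INSERT INTO fold_metrics VALUES (?, ?, ?, ?, ?)", rows)


def leaderboard(connection, run_id, metric):
    """Summarise per-fold rows into (model, mean, spread), best mean first."""
    query = """
        SELECT model,
               AVG(value)              AS mean_value,
               MAX(value) - MIN(value) AS spread
        FROM fold_metrics
        WHERE run_id = ? AND metric = ?
        GROUP BY model
        ORDER BY mean_value DESC
    """
    return connection.execute(query, (run_id, metric)).fetchall()


for model, mean_value, spread in leaderboard(conn, 1, "accuracy"):
    print(f"{model}: mean={mean_value:.3f} spread={spread:.3f}")
```

The mean column is what a leaderboard needs, while the spread is only recoverable because the per-fold rows are kept, which is the stability analysis the decision refers to.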
New §8 "Licensing posture" added to PLATFORM_PLAN.md. Reading order updated.
Session 6 delta summary
| Metric | Session 5 end | Session 6 end | Δ |
|---|---|---|---|
| Source LOC in `pycaret/` | 51,976 | 50,544 | −1,432 |
| Zero-import leaf files | 3 | 0 | −3 |
| Killed-verb methods in source | 15 | 0 | −15 |
| cuml-coupled files with runtime risk | 5 | 0 (branches raise) | −5 |
| Part-2 platform plan | none | PLATFORM_PLAN.md (~350 lines) | +1 doc |
| Roadmap phases defined | 6 engine phases | 12 (6 engine + 6 platform) | +6 |
| Test pass rate | 100% (32/32) | 100% (32/32) | — |