PyCaret 4.0 ships one feature-engineering knob on the constructor:
remove_outliers=True. Other 3.x knobs (polynomial features, group
features, bin numeric features, trigonometric features, …) were
deemed too domain-specific to keep as built-ins — do those upstream
in your preprocessing code if you need them.
remove_outliers=True
Fits an IsolationForest(contamination=0.05, random_state=session_id)
on the imputed, ordinal-encoded training set, drops the 5% most
anomalous rows from X_train / y_train / X_train_transformed /
y_train_transformed, and leaves the test set untouched.

    from pycaret.classification import ClassificationExperiment

    # `data` is a pandas DataFrame that includes the "Purchase" target column.
    exp = ClassificationExperiment(
        target="Purchase",
        remove_outliers=True,
    ).fit(data)

    print(exp._fit_state["outliers_dropped"])  # → e.g. 37
    print(exp.X_train.shape)  # about 5% fewer rows than without the flag

The fitted IsolationForest is not kept around: it's used once at fit time to filter the training rows, then discarded. The persisted preprocessing pipeline doesn't apply outlier filtering at predict time (your test / new data goes through unfiltered, by design).
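
The filtering step can be approximated outside PyCaret with plain scikit-learn. The sketch below illustrates the mechanics under stated assumptions and is not PyCaret's actual implementation: `filter_outliers` is a hypothetical helper, the synthetic matrix stands in for an already imputed and encoded training set, and `seed` plays the role of session_id.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def filter_outliers(X_train, y_train, contamination=0.05, seed=0):
    """Drop the rows that IsolationForest flags as anomalous.

    Hypothetical helper: a sketch of the idea, not PyCaret's code.
    """
    forest = IsolationForest(contamination=contamination, random_state=seed)
    # fit_predict returns +1 for inliers and -1 for outliers.
    keep = forest.fit_predict(X_train) == 1
    # The forest goes out of scope here, so nothing is reused at predict time.
    return X_train[keep], y_train[keep], int((~keep).sum())


rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))      # stand-in for the encoded training matrix
y = rng.integers(0, 2, size=1000)   # stand-in for the target

X_clean, y_clean, dropped = filter_outliers(X, y)
print(dropped, X_clean.shape)
```

With `contamination=0.05`, roughly one row in twenty is removed, and features and target are filtered with the same mask so they stay aligned.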
Why post-encoding, pre-scaling?
IsolationForest works on numeric input, so running it on raw X_train
would fail on categorical columns. Running it on the imputed + ordinal-
encoded train set works regardless of dtype. Running it after
PowerTransformer / StandardScaler would mean those transformers are
fitted on rows that are then dropped, so their parameters would
describe differently-distributed data than the estimator sees. The
order is therefore:

    impute → ordinal-encode → IsolationForest → power → scale

Why not on the test set?
You want to evaluate on data that mirrors production. Production data isn't outlier-filtered, so your holdout shouldn't be either.
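
The stage order (impute, ordinal-encode, IsolationForest, power transform, scale) and the untouched holdout can be sketched with scikit-learn alone. Everything here is an assumption for illustration: the column names, the synthetic data, and the choice of SimpleImputer and OrdinalEncoder are stand-ins, not PyCaret internals.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, PowerTransformer, StandardScaler

rng = np.random.default_rng(0)
n = 400
frame = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income_k": rng.lognormal(3, 0.5, n),
    "plan": rng.choice(["basic", "plus", "pro"], n),
})
frame.loc[::25, "age"] = np.nan  # a few missing values to impute
train, test = frame.iloc[:300], frame.iloc[300:]

# Stage 1: impute and ordinal-encode so every column is numeric.
encode = make_column_transformer(
    (SimpleImputer(strategy="mean"), ["age", "income_k"]),
    (OrdinalEncoder(), ["plan"]),
)
train_enc = encode.fit_transform(train)
test_enc = encode.transform(test)

# Stage 2: filter the training rows only; the holdout is never filtered.
forest = IsolationForest(contamination=0.05, random_state=0)
keep = forest.fit_predict(train_enc) == 1
train_enc = train_enc[keep]

# Stage 3: power transform and scaling, fitted on the filtered training rows.
scale = make_pipeline(PowerTransformer(), StandardScaler())
train_out = scale.fit_transform(train_enc)
test_out = scale.transform(test_enc)

print(train_out.shape, test_out.shape)
```

The holdout passes through the same fitted transformers but keeps every row, so its row count is unchanged while the training set shrinks by about 5%.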
Combining with other flags
remove_outliers composes with everything else:

    exp = ClassificationExperiment(
        target="Purchase",
        normalize=True,
        transformation=True,
        remove_outliers=True,
        feature_selection=True,
    ).fit(data)

Order is fixed: outliers are filtered first (so feature selection sees the cleaner distribution), then power transform, then scaling, then feature selection.
Tuning the contamination
Need a contamination other than the default 0.05? It's not
exposed as a flag yet; open an issue if you need it. The current
0.05 matches the legacy outliers_threshold=0.05 default, which is
"throw out the obvious 5% of weird rows."
If you need finer control, fit your own IsolationForest upstream
and pass the cleaned frame to Experiment.fit().
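
One way to do that upstream filtering is sketched below with scikit-learn and pandas. `drop_outliers` is a hypothetical helper, not part of PyCaret, and the example assumes the numeric feature columns have no missing values; the column names and the 0.01 contamination are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def drop_outliers(frame, target, contamination=0.01, seed=0):
    """Return a copy of `frame` without the rows flagged as outliers.

    Hypothetical helper: only numeric feature columns are scored, and
    they must not contain missing values.
    """
    features = frame.drop(columns=[target]).select_dtypes("number")
    forest = IsolationForest(contamination=contamination, random_state=seed)
    keep = forest.fit_predict(features.to_numpy()) == 1
    return frame.loc[keep].reset_index(drop=True)


rng = np.random.default_rng(1)
data = pd.DataFrame({
    "spend": rng.normal(100, 15, 500),
    "visits": rng.poisson(4, 500),
    "Purchase": rng.integers(0, 2, 500),
})

cleaned = drop_outliers(data, target="Purchase", contamination=0.01)
print(len(data), len(cleaned))
```

Note that filtering the whole frame this way also removes rows that would otherwise land in the holdout. If the test set should stay unfiltered, split first and apply the helper to the training portion only.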