How To Use Cross-validation After Transforming Features

August 21, 2024

I have a dataset with categorical and non-categorical values. I applied OneHotEncoder to the categorical values and StandardScaler to the continuous values. transformerVectoriser = ColumnTr

Solution 1:

desertnaut already teased the answer in his comment. I shall just explicate and complete:

When you want to cross-validate several data processing steps together with an estimator, the best way is to use Pipeline objects. According to the user guide, a Pipeline serves multiple purposes, one of them being safety:

Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.

With your definitions as above, you would wrap your transformations and classifier in a Pipeline as follows:

from sklearn.pipeline import Pipeline


pipeline = Pipeline([
    ('transformer', transformerVectoriser),
    ('classifier', clf)
])

The steps in the pipeline can now be cross-validated together:

from sklearn.model_selection import cross_val_score

cv_score = cross_val_score(pipeline, features, results, cv=5)
print(cv_score)
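Since the definitions in the question are cut off, here is a minimal self-contained sketch of the whole flow. The toy data, the column names (colour, height, weight), and the choice of LogisticRegression as clf are assumptions made purely for illustration; they are not taken from the question.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one categorical and two continuous columns.
rng = np.random.default_rng(0)
n = 100
features = pd.DataFrame({
    "colour": rng.choice(["red", "green", "blue"], size=n),
    "height": rng.normal(170, 10, size=n),
    "weight": rng.normal(70, 15, size=n),
})
# Hypothetical binary target loosely tied to "height".
results = (features["height"] + rng.normal(0, 5, size=n) > 170).astype(int)

# One-hot encode the categorical column, standardise the continuous ones.
transformerVectoriser = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["colour"]),
    ("scale", StandardScaler(), ["height", "weight"]),
])
clf = LogisticRegression()  # assumed classifier, swap in your own

pipeline = Pipeline([
    ("transformer", transformerVectoriser),
    ("classifier", clf),
])

# Pass the raw, untransformed features; the pipeline transforms per fold.
cv_score = cross_val_score(pipeline, features, results, cv=5)
print(cv_score)
print(cv_score.mean())
```

Note that the raw features go into cross_val_score. Transforming the full dataset beforehand and then cross-validating only the classifier is exactly the leak the pipeline is meant to prevent.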

This ensures that, in each iteration, all transformers and the final estimator in the pipeline are fit on the training folds only. On the held-out test fold, only the transform and predict methods are called, so no statistics from the test data reach the fitted model.
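To make the per-fold behaviour concrete, the following sketch spells out roughly what cross_val_score does with a pipeline. It assumes a classifier with an integer cv, in which case scikit-learn uses unshuffled StratifiedKFold splits. The synthetic data from make_classification and the single StandardScaler step are stand-ins for your own features and ColumnTransformer.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, not from the question).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipeline = Pipeline([
    ("transformer", StandardScaler()),
    ("classifier", LogisticRegression()),
])

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # Fresh, unfitted copy so nothing carries over between folds.
    fold_pipeline = clone(pipeline)
    # fit: the scaler learns mean/std from the training fold only,
    # then the classifier is trained on the scaled training fold.
    fold_pipeline.fit(X[train_idx], y[train_idx])
    # score: the test fold is scaled with the training statistics
    # (transform only), then predicted and compared with y.
    manual_scores.append(fold_pipeline.score(X[test_idx], y[test_idx]))

auto_scores = cross_val_score(pipeline, X, y, cv=5)
print(np.array(manual_scores))
print(auto_scores)
```

Both printed arrays should agree, which shows that the pipeline gives you the leak-free loop without having to write it by hand.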

If you want to read up more on the usage of Pipeline, check the documentation.
