How To Use Cross-validation After Transforming Features
Solution 1:
desertnaut already teased the answer in his comment. I shall just explicate and complete:
When you want to cross-validate several data processing steps together with an estimator, the best way is to use Pipeline objects. According to the user guide, a Pipeline serves multiple purposes, one of them being safety:
Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
With the definitions from your question, you would wrap your transformations and classifier in a Pipeline as follows:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('transformer', transformerVectoriser),
    ('classifier', clf)
])
The steps in the pipeline can now be cross-validated together:
from sklearn.model_selection import cross_val_score

cv_score = cross_val_score(pipeline, features, results, cv=5)
print(cv_score)
This ensures that, in each iteration, all transformers and the final estimator in the pipeline are fit on the training fold only, and that only their transform and predict methods are applied to the test fold.
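To make the per-fold behaviour concrete, here is a minimal, self-contained sketch. The toy data, the column names, and the concrete choices for transformerVectoriser (a ColumnTransformer with a TfidfVectorizer and a StandardScaler) and clf (a LogisticRegression) are my assumptions for illustration, since I don't know your exact definitions. It spells out by hand roughly what cross_val_score does with the pipeline in each iteration and checks that both routes give the same scores:

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: one text column and one numeric column.
features = pd.DataFrame({
    "text": ["good fast service", "bad slow service"] * 20,
    "length": [3.0, 5.0] * 20,
})
results = np.array([1, 0] * 20)

# Assumed stand-ins for the question's transformerVectoriser and clf.
transformerVectoriser = ColumnTransformer([
    ("vectoriser", TfidfVectorizer(), "text"),   # column name as a string -> 1D input
    ("scaler", StandardScaler(), ["length"]),    # list of columns -> 2D input
])
clf = LogisticRegression()

pipeline = Pipeline([
    ("transformer", transformerVectoriser),
    ("classifier", clf),
])

# The short way: cross-validate the whole pipeline.
cv_score = cross_val_score(pipeline, features, results, cv=5)

# The same thing by hand. For a classifier, cv=5 uses unshuffled stratified folds.
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(features, results):
    fold_pipeline = clone(pipeline)  # fresh, unfitted copy for every fold
    # fit() fits the transformers AND the classifier on the training fold only
    fold_pipeline.fit(features.iloc[train_idx], results[train_idx])
    # score() only transforms and predicts on the held-out fold
    manual_scores.append(
        fold_pipeline.score(features.iloc[test_idx], results[test_idx])
    )

print(cv_score)
print(np.array(manual_scores))
```

Note that the vocabulary of the vectoriser and the mean and variance of the scaler are re-learned from the training rows in every fold, so nothing from the held-out rows can leak into them.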
If you want to read up more on the usage of Pipeline, check the documentation.