Machine Learning - Data Leakage

Many datasets contain features of different types, say text, floats, and dates, where each type of feature requires separate preprocessing or feature extraction steps. Often it is easiest to preprocess data before applying scikit-learn methods, for example using pandas. Processing your data before passing it to scikit-learn might be problematic for one of the following reasons:

1. Incorporating statistics from test data into the preprocessors makes cross-validation scores unreliable (known as data leakage), for example in the case of scalers or imputing missing values.
2. You may want to include the parameters of the preprocessors in a parameter search.
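
Both reasons can be addressed by keeping the preprocessing inside a scikit-learn Pipeline, so the imputer and scaler are fit only on the training portion of each cross-validation fold. The following is a minimal sketch, not taken from the original post: the synthetic dataset, the step names ("impute", "scale", "clf"), the choice of LogisticRegression, and the parameter grid are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly 5% missing values (illustrative assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rng = np.random.RandomState(0)
X[rng.rand(*X.shape) < 0.05] = np.nan

# Preprocessing lives inside the pipeline, so within each CV fold the
# imputer and scaler are fit on the training split only.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Reason 1: cross-validation scores without test-fold statistics leaking in.
scores = cross_val_score(pipe, X, y, cv=5)

# Reason 2: preprocessor parameters searched together with model parameters,
# addressed with the "<step name>__<parameter>" convention.
param_grid = {
    "impute__strategy": ["mean", "median"],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
```

By contrast, calling `StandardScaler().fit_transform(X)` on the full dataset before cross-validation would let statistics from the held-out folds influence the training data, which is the leakage described above.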

posted @ 2018-11-03 15:38 Lucas_Yu


