Machine Learning Project Checklist

This checklist can guide you through your Machine Learning projects. There are eight main steps:

1. Frame the problem and look at the big picture.

2. Get the data.

3. Explore the data to gain insights.

4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.

5. Explore many different models and shortlist the best ones.

6. Fine-tune your models and combine them into a great solution.

7. Present your solution.

8. Launch, monitor, and maintain your system.

Obviously, you should feel free to adapt this checklist to your needs.

Frame the Problem and Look at the Big Picture

1. Define the objective in business terms.

2. How will your solution be used?

3. What are the current solutions/workarounds (if any)?

4. How should you frame this problem (supervised/unsupervised, online/offline, etc.)?

5. How should performance be measured?

6. Is the performance measure aligned with the business objective?

7. What would be the minimum performance needed to reach the business objective?

8. What are comparable problems? Can you reuse experience or tools?

9. Is human expertise available?

10. How would you solve the problem manually?

11. List the assumptions you (or others) have made so far.

12. Verify assumptions if possible.

Get the Data

Note: automate as much as possible so you can easily get fresh data.

1. List the data you need and how much you need.

2. Find and document where you can get that data.

3. Check how much space it will take.

4. Check legal obligations, and get authorization if necessary.

5. Get access authorizations.

6. Create a workspace (with enough storage space).

7. Get the data.

8. Convert the data to a format you can easily manipulate (without changing the data itself).

9. Ensure sensitive information is deleted or protected (e.g., anonymized).

10. Check the size and type of data (time series, sample, geographical, etc.).

11. Sample a test set, put it aside, and never look at it (no data snooping!).
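One way to keep the test set stable across refreshes of the data is to decide membership from a hash of each instance's identifier. The sketch below assumes every row has a unique, immutable identifier; the function names and the `id` key are hypothetical and chosen only for illustration.

```python
import zlib


def is_in_test_set(identifier, test_ratio):
    """Return True if this identifier falls into the test set.

    The decision depends only on the identifier, so an instance never
    moves between sets when the dataset is refreshed or extended.
    """
    checksum = zlib.crc32(str(identifier).encode("utf-8"))  # 0 .. 2**32 - 1
    return checksum < test_ratio * 2**32


def split_by_id(rows, id_key, test_ratio=0.2):
    """Split a list of dict rows into (train, test) using a stable hash."""
    train, test = [], []
    for row in rows:
        target = test if is_in_test_set(row[id_key], test_ratio) else train
        target.append(row)
    return train, test


if __name__ == "__main__":
    rows = [{"id": i, "value": i * 1.5} for i in range(1000)]  # toy data
    train, test = split_by_id(rows, "id")
    print(len(train), len(test))
```

Because the split is a pure function of the identifier, rows appended later cannot push an old test instance into the training set, which a freshly shuffled random split would not guarantee.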

Explore the Data

Note: try to get insights from a field expert for these steps.

1. Create a copy of the data for exploration (sampling it down to a manageable size if necessary).

2. Create a Jupyter notebook to keep a record of your data exploration.

3. Study each attribute and its characteristics:

• Name

• Type (categorical, int/float, bounded/unbounded, text, structured, etc.)

• % of missing values

• Noisiness and type of noise (stochastic, outliers, rounding errors, etc.)

• Usefulness for the task

• Type of distribution (Gaussian, uniform, logarithmic, etc.)

4. For supervised learning tasks, identify the target attribute(s).

5. Visualize the data.

6. Study the correlations between attributes.

7. Study how you would solve the problem manually.

8. Identify the promising transformations you may want to apply.

9. Identify extra data that would be useful (go back to “Get the Data”).

10. Document what you have learned.
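Two of the checks above, the percentage of missing values and the correlation with the target, are easy to script. The following is a minimal sketch using only the standard library; it assumes missing values are encoded as `None`, and the column names (`rooms`, `income`, `price`) are made up for the example. Note that Pearson's coefficient only captures linear relationships.

```python
import math


def missing_percentage(values):
    """Percentage of entries that are None (treated here as 'missing')."""
    return 100.0 * sum(v is None for v in values) / len(values)


def pearson_correlation(xs, ys):
    """Pearson's r, computed over the rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical columns, for illustration only.
    data = {
        "rooms": [3, 4, None, 5, 2, 6],
        "income": [2.5, 3.8, 4.1, None, 1.9, 6.2],
        "price": [150, 210, 230, 260, 120, 340],
    }
    for name, column in data.items():
        print(f"{name}: {missing_percentage(column):.1f}% missing")
    for name in ("rooms", "income"):
        r = pearson_correlation(data[name], data["price"])
        print(f"corr({name}, price) = {r:.2f}")
```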

Prepare the Data

Notes:

• Work on copies of the data (keep the original dataset intact).

• Write functions for all data transformations you apply, for five reasons:

- So you can easily prepare the data the next time you get a fresh dataset

- So you can apply these transformations in future projects

- To clean and prepare the test set

- To clean and prepare new data instances once your solution is live

- To make it easy to treat your preparation choices as hyperparameters

1. Data cleaning:

• Fix or remove outliers (optional).

• Fill in missing values (e.g., with zero, mean, median…) or drop their rows (or columns).

2. Feature selection (optional):

• Drop the attributes that provide no useful information for the task.

3. Feature engineering, where appropriate:

• Discretize continuous features.

• Decompose features (e.g., categorical, date/time, etc.).

• Add promising transformations of features (e.g., log(x), sqrt(x), x^2, etc.).

• Aggregate features into promising new features.

4. Feature scaling:

• Standardize or normalize features.
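The advice to write functions for every transformation can be sketched as a fit/apply pair: parameters are learned from the training data only, then reused unchanged on the test set and on live data. This is an illustrative sketch for a single numeric column, assuming missing values are `None`; `fit_preparation`, `apply_preparation`, and the `fill_strategy` argument are hypothetical names, not part of any library.

```python
import statistics


def fit_preparation(train_column, fill_strategy="median"):
    """Learn imputation and scaling parameters from the training data only."""
    present = [v for v in train_column if v is not None]
    if fill_strategy == "median":
        fill_value = statistics.median(present)
    elif fill_strategy == "mean":
        fill_value = statistics.fmean(present)
    elif fill_strategy == "zero":
        fill_value = 0.0
    else:
        raise ValueError(f"unknown fill_strategy: {fill_strategy!r}")
    filled = [fill_value if v is None else v for v in train_column]
    return {
        "fill_value": fill_value,
        "mean": statistics.fmean(filled),
        "std": statistics.pstdev(filled) or 1.0,  # avoid dividing by zero
    }


def apply_preparation(column, params):
    """Impute, then standardize, using previously fitted parameters."""
    filled = [params["fill_value"] if v is None else v for v in column]
    return [(v - params["mean"]) / params["std"] for v in filled]


if __name__ == "__main__":
    train = [1.0, None, 3.0, 8.0]
    test = [2.0, None]
    params = fit_preparation(train, fill_strategy="median")
    print(apply_preparation(train, params))
    print(apply_preparation(test, params))  # same parameters, no refitting
```

Exposing `fill_strategy` as an argument is what makes the preparation choice tunable later, in the same way as a model hyperparameter.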

Shortlist Promising Models

Notes:

• If the data is huge, you may want to sample smaller training sets so you can train many different models in a reasonable time (be aware that this penalizes complex models such as large neural nets or Random Forests).

• Once again, try to automate these steps as much as possible.

1. Train many quick-and-dirty models from different categories (e.g., linear, naive Bayes, SVM, Random Forest, neural net, etc.) using standard parameters.

2. Measure and compare their performance.

• For each model, use N-fold cross-validation and compute the mean and standard deviation of the performance measure on the N folds.

3. Analyze the most significant variables for each algorithm.

4. Analyze the types of errors the models make.

• What data would a human have used to avoid these errors?

5. Perform a quick round of feature selection and engineering.

6. Perform one or two more quick iterations of the five previous steps.

7. Shortlist the top three to five most promising models, preferring models that make different types of errors.
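The N-fold comparison in step 2 can be illustrated with two deliberately simple models. The sketch below is standard-library only and uses synthetic data; the two "models" (a mean baseline and a one-feature least-squares line) and all function names are assumptions made for the example, standing in for whatever quick-and-dirty models you actually train.

```python
import math
import random
import statistics


def k_fold_indices(n_samples, n_folds, seed=42):
    """Yield (train_idx, valid_idx) pairs covering a shuffled index range."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, valid_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, valid_idx


def fit_mean(xs, ys):
    """Baseline: always predict the training mean."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """One-feature least-squares line."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)


def cross_val_rmse(fit, xs, ys, n_folds=5):
    """Return the per-fold RMSE of a model built by `fit`."""
    scores = []
    for train_idx, valid_idx in k_fold_indices(len(xs), n_folds):
        predict = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [(predict(xs[i]) - ys[i]) ** 2 for i in valid_idx]
        scores.append(math.sqrt(statistics.fmean(errors)))
    return scores


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.uniform(0, 10) for _ in range(200)]
    ys = [3.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]  # synthetic data
    for name, fit in [("mean baseline", fit_mean), ("linear", fit_line)]:
        scores = cross_val_rmse(fit, xs, ys)
        mean, std = statistics.fmean(scores), statistics.stdev(scores)
        print(f"{name}: RMSE {mean:.3f} +/- {std:.3f}")
```

Reporting the standard deviation alongside the mean shows how much the score varies between folds, which helps judge whether a gap between two models is meaningful.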

Fine-Tune the System

Notes:

• You will want to use as much data as possible for this step, especially as you move toward the end of fine-tuning.

• As always, automate what you can.

1. Fine-tune the hyperparameters using cross-validation:

• Treat your data transformation choices as hyperparameters, especially when you are not sure about them (e.g., if you’re not sure whether to replace missing values with zeros or with the median value, or to just drop the rows).

• Unless there are very few hyperparameter values to explore, prefer random search over grid search. If training is very long, you may prefer a Bayesian optimization approach (e.g., using Gaussian process priors, as described by Jasper Snoek et al.).

2. Try Ensemble methods. Combining your best models will often produce better performance than running them individually.

3. Once you are confident about your final model, measure its performance on the test set to estimate the generalization error.

TIPS:

Don’t tweak your model after measuring the generalization error: you would just start overfitting the test set.
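Random search, with a preparation choice treated as a hyperparameter, can be sketched in a few lines. The example below is a toy under stated assumptions: a one-feature ridge regression stands in for the real model, missing inputs are encoded as `None`, and the search space (a log-uniform `alpha` and a `fill` strategy of zero or median) is invented for illustration. In practice you would more likely use a library implementation of random search.

```python
import math
import random
import statistics

FILL_STRATEGIES = {
    "zero": lambda present: 0.0,
    "median": statistics.median,
}


def fit_ridge(xs, ys, alpha, fill):
    """Fit y ~ mean_y + w * (x - mean_x) with an L2 penalty `alpha` on w."""
    fill_value = FILL_STRATEGIES[fill]([x for x in xs if x is not None])
    xs = [fill_value if x is None else x for x in xs]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / (sxx + alpha)

    def predict(x):
        x = fill_value if x is None else x
        return mean_y + w * (x - mean_x)

    return predict


def cv_rmse(xs, ys, params, n_folds=5):
    """Mean RMSE over interleaved folds (data assumed already shuffled)."""
    scores = []
    for k in range(n_folds):
        train = [i for i in range(len(xs)) if i % n_folds != k]
        valid = [i for i in range(len(xs)) if i % n_folds == k]
        predict = fit_ridge(
            [xs[i] for i in train], [ys[i] for i in train], **params
        )
        squared = [(predict(xs[i]) - ys[i]) ** 2 for i in valid]
        scores.append(math.sqrt(statistics.fmean(squared)))
    return statistics.fmean(scores)


def random_search(xs, ys, n_iter=20, seed=0):
    """Sample hyperparameters at random and keep the best CV score."""
    rng = random.Random(seed)
    best_params, best_score = None, math.inf
    for _ in range(n_iter):
        params = {
            "alpha": 10 ** rng.uniform(-3, 3),  # log-uniform, 1e-3 .. 1e3
            "fill": rng.choice(sorted(FILL_STRATEGIES)),  # preparation choice
        }
        score = cv_rmse(xs, ys, params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


if __name__ == "__main__":
    rng = random.Random(1)
    xs = [rng.uniform(5, 15) for _ in range(100)]
    ys = [2.0 * x + rng.gauss(0, 1) for x in xs]  # synthetic target
    xs = [None if rng.random() < 0.1 else x for x in xs]  # knock out ~10%
    params, score = random_search(xs, ys)
    print(params, round(score, 3))
```

The search only ever looks at cross-validation scores on the training data; the test set stays untouched until the final model is chosen.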

Present Your Solution

1. Document what you have done.

2. Create a nice presentation.

• Make sure you highlight the big picture first.

3. Explain why your solution achieves the business objective.

4. Don’t forget to present interesting points you noticed along the way.

• Describe what worked and what did not.

• List your assumptions and your system’s limitations.

5. Ensure your key findings are communicated through beautiful visualizations or easy-to-remember statements (e.g., “the median income is the number-one predictor of housing prices”).

Launch!

1. Get your solution ready for production (plug into production data inputs, write unit tests, etc.).

2. Write monitoring code to check your system’s live performance at regular intervals and trigger alerts when it drops.

• Beware of slow degradation: models tend to “rot” as data evolves.

• Measuring performance may require a human pipeline (e.g., via a crowdsourcing service).

• Also monitor your inputs’ quality (e.g., a malfunctioning sensor sending random values, or another team’s output becoming stale). This is particularly important for online learning systems.

3. Retrain your models on a regular basis on fresh data (automate as much as possible).
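The monitoring step can be sketched as a rolling-window check on recent labeled predictions. This is a minimal illustration, not a production design: the class name, the default window size, and the thresholds are all assumptions, and it presumes that true labels eventually become available (for example through the human pipeline mentioned above).

```python
from collections import deque


class PerformanceMonitor:
    """Track accuracy over the most recent labeled predictions."""

    def __init__(self, window_size=500, alert_threshold=0.9, min_samples=50):
        self.outcomes = deque(maxlen=window_size)  # 1 = correct, 0 = wrong
        self.alert_threshold = alert_threshold
        self.min_samples = min_samples

    def record(self, prediction, label):
        """Store one outcome; return True if an alert should fire."""
        self.outcomes.append(int(prediction == label))
        return self.should_alert()

    def accuracy(self):
        """Accuracy over the current window, or None if it is empty."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self):
        if len(self.outcomes) < self.min_samples:
            return False  # too little evidence to judge
        return self.accuracy() < self.alert_threshold


if __name__ == "__main__":
    monitor = PerformanceMonitor(window_size=10, alert_threshold=0.8, min_samples=5)
    stream = [(1, 1)] * 6 + [(1, 0)] * 4  # (prediction, label) pairs
    for prediction, label in stream:
        if monitor.record(prediction, label):
            print(f"ALERT: accuracy dropped to {monitor.accuracy():.2f}")
```

A fixed-size window reacts to recent behavior only, so a slow decline still triggers an alert once enough recent predictions are wrong, rather than being averaged away by a long history of good results.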

posted @ 2020-05-07 18:12 1101011
