Lessons learned developing a practical large scale machine learning system
原文:http://googleresearch.blogspot.jp/2010/04/lessons-learned-developing-practical.html
Lessons learned developing a practical large scale machine learning system
Tuesday, April 06, 2010
- Binary classification (produces a probability estimate of the class label)
- Parallelized
- Scales to process hundreds of billions of instances and beyond
- Scales to billions of features and beyond
- Automatically identifies useful combinations of features
- Accuracy is competitive with state-of-the-art classifiers
- Reacts to new data within minutes
Training set size | Unique features | |
Mean | 100 Billion | 1 Billion |
Median | 1 Billion | 10 Million |
Having good accuracy across a variety of domains is very important, and we were tempted to focus exclusively on this aspect of the algorithm. However, in a practical system there are several other aspects of an algorithm that are equally critical:
Seti is typically used in places where a machine learning system will provide a significant improvement in accuracy over the existing system. The gains are usually large enough that most teams do not care about the small differences in accuracy between different flavors of algorithms. And, in practice, the small differences are often washed out by other effects such as better data filtering, adding another useful feature, parameter tuning, etc. Teams much prefer having a stable, scalable and easy-to-use classification system. We found that these other aspects can be the difference between a deployable system and one that gets abandoned.
It is perhaps less academically interesting to design an algorithm that is slightly worse in accuracy, but that has greater ease of use and system reliability. However, in our experience, it is very valuable in practice.
- Ease of use: Teams are more willing to experiment with a machine learning system that is simple to set up and use. Those teams are not necessarily die-hard machine learning experts, and so they do not want to waste much time figuring out how to get a system up and running.
- System reliability: Teams are much more willing to deploy a reliable machine learning system in a live environment. They want a system that is dependable and unlikely to crash or need constant attention. Early versions of Seti had marginally better accuracy on large data sets, but were complex, stressed the network and GFS architecture considerably, and needed constant babysitting. The number of teams willing to deploy these versions was low.
Seti is typically used in places where a machine learning system will provide a significant improvement in accuracy over the existing system. The gains are usually large enough that most teams do not care about the small differences in accuracy between different flavors of algorithms. And, in practice, the small differences are often washed out by other effects such as better data filtering, adding another useful feature, parameter tuning, etc. Teams much prefer having a stable, scalable and easy-to-use classification system. We found that these other aspects can be the difference between a deployable system and one that gets abandoned.
It is perhaps less academically interesting to design an algorithm that is slightly worse in accuracy, but that has greater ease of use and system reliability. However, in our experience, it is very valuable in practice.
It was tempting to build a learning system without focusing on any particular application. After all, our goal was to create a large scale system that would be useful on a wide variety of present and future classification tasks. Nevertheless, we decided to focus primarily on a small handful of initial applications. We believe this decision was useful in several ways:
- We could examine what the small number of domains had in common. By building something that would work for a few domains, it was likely the resulting system would be useful for others.
- More importantly, it helped us quickly decide what aspects were unnecessary. We noticed that it was surprisingly easy to over-generalize or over-engineer a machine learning system. The domains grounded our project in reality and drove our decision making. Without them, even deciding how broad to make the input file format would have been harder (e.g., is it important to permit binary/categorical/real-valued features? Multiple classes? Fractional labels? Weighted instances?).
- Working with a few different teams as initial guinea pigs allowed us to learn about common teething problems, and helped us smooth the process of deployment for future teams.
We have a hammer, but we don't want to end up with bent screws. Being machine learning practitioners, it was very tempting for us to always recommend using machine learning for a problem. We saw very early on that, despite its many significant benefits, machine learning typically adds complexity, opacity and unpredictability to a system. In reality, simpler techniques are sometimes good enough for the task at hand. And in the long run, the extra effort that would have been spent integrating, maintaining and diagnosing issues with a live machine learning system could be spent on other way of improving the system instead.
Seti is often used in places where there is a good chance of significantly improving predictive accuracy over the incumbent system. And we usually advise teams against trying the system when we believe there is likely to be only a small improvement.
Seti is often used in places where there is a good chance of significantly improving predictive accuracy over the incumbent system. And we usually advise teams against trying the system when we believe there is likely to be only a small improvement.