An Introductory Guide to Python & Machine Learning
Getting started with Python & Machine Learning
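(Reader’s note: this is an introductory guide to machine learning. The author outlines the pros and cons of using Python to get started with machine learning, and which Python packages to start with.)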
Machine learning is eating the world right now. Everyone and their mother is learning about machine learning models, classification, neural networks, and Andrew Ng. You’ve decided you want to be a part of it, but where to start?
In this article we’ll cover some important characteristics of Python and why it’s great for machine learning. We’ll also cover some of the most important libraries it has for ML, and if it piques your interest, some places where you can learn more.
Why is Python used for machine learning?
Python is a great choice for machine learning for several reasons. First and foremost, it’s a simple language on the surface; even if you’re not familiar with Python, getting up to speed is quick if you’ve ever used another mainstream programming language. Second, Python has a great community, which results in good documentation and friendly, comprehensive answers on Stack Overflow (fundamental!). Third, also stemming from the great community, there are plenty of useful libraries for Python (both “batteries included” and third party), which solve almost any problem you might have (including machine learning).
But I heard Python is slow!
Yeah and it’s true. Python isn’t the fastest language out there: all those handy abstractions come at a cost.
But here’s the trick: libraries can and do offload the expensive calculations to the much more performant (but harder to use) C and C++. For instance, there’s NumPy, which is a library for numerical computation. It’s written in C, and it’s fast. Practically every library out there that involves intensive calculations uses it — almost all the libraries listed next use it in some form. So if you read NumPy, think fast.
Therefore, as long as the heavy number crunching stays inside these libraries, your scripts can run nearly as fast as if you had written them in a lower-level language. So for most machine learning work, speed is not much to worry about.
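To make the speed point concrete, here is a minimal sketch that computes the same sum of squares twice: once with a plain Python loop and once with NumPy. The function names and the array size are made up for illustration, and the timings will vary from machine to machine, so treat it as a rough demonstration rather than a benchmark.

```python
import time

import numpy as np


def sum_of_squares_python(values):
    """Sum of squares with an explicit Python loop."""
    total = 0.0
    for value in values:
        total += value * value
    return total


def sum_of_squares_numpy(array):
    """Same computation, but the loop runs inside NumPy's compiled code."""
    return float(np.dot(array, array))


# Illustrative input: 200,000 floating point numbers.
data = np.arange(200_000, dtype=np.float64)
data_as_list = data.tolist()

start = time.perf_counter()
loop_result = sum_of_squares_python(data_as_list)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
numpy_result = sum_of_squares_numpy(data)
numpy_seconds = time.perf_counter() - start

print(f"Python loop: {loop_seconds:.4f} s")
print(f"NumPy:       {numpy_seconds:.4f} s")
```

Both versions should agree on the result up to floating point rounding; on most machines the NumPy version finishes noticeably sooner.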
Python libraries to check out
Scikit-learn
Are you starting out in machine learning? Want something that covers everything from feature engineering to training and testing a model? Look no further than scikit-learn! This fantastic piece of free software provides most of the tools you need for machine learning and data mining. It’s the de facto standard library for machine learning in Python, recommended for most of the ‘old’ ML algorithms.
This library does both classification and regression, supporting basically every algorithm out there (support vector machines, random forest, naive bayes, and so on). It’s built in such a way that allows easy switching of algorithms, so experimentation is easy. These ‘older’ algorithms are surprisingly resilient and work very well in a lot of cases.
But that’s not all! Scikit-learn also does dimensionality reduction, clustering, you name it. It’s also blazingly fast since it runs on NumPy and SciPy (meaning that all the heavy number crunching runs in C instead of Python).
Check out some examples to see everything this library is capable of, and the tutorials if you want to learn how it works.
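The easy switching of algorithms mentioned above can be sketched as follows. This is a small illustration, not a recommended workflow: the choice of the bundled iris dataset, the 70/30 split and the three classifiers are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Small dataset that ships with scikit-learn (chosen for illustration).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Every estimator exposes the same fit/score interface,
# so trying another algorithm means changing a single line.
models = {
    "support vector machine": SVC(),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "naive Bayes": GaussianNB(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                   # train
    scores[name] = model.score(X_test, y_test)    # accuracy on held-out data
    print(f"{name}: {scores[name]:.2f}")
```

Because the loop body never mentions a specific algorithm, adding a fourth classifier only requires a new entry in the dictionary.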
NLTK
While not a machine learning library per se, NLTK is a must when working with natural language processing (NLP). It comes with a bundle of datasets and other lexical resources (useful for training models) in addition to libraries for working with text — for functions such as classification, tokenization, stemming, tagging, parsing and more.
The usefulness of having all of this stuff neatly packaged can’t be overstated. So if you are interested in NLP, check out some tutorials!
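As a taste of the text-processing helpers mentioned above, here is a minimal sketch of tokenization and stemming. It deliberately sticks to `RegexpTokenizer` and `PorterStemmer`, which should not need any extra downloaded NLTK data; the sample sentence is invented for the example.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Split on runs of word characters; a deliberately simple tokenizer.
tokenizer = RegexpTokenizer(r"\w+")
stemmer = PorterStemmer()

text = "Cats are running and jumping"
tokens = tokenizer.tokenize(text.lower())

# Stemming strips common suffixes so related word forms collapse together.
stems = [stemmer.stem(token) for token in tokens]

for token, stem in zip(tokens, stems):
    print(f"{token} -> {stem}")
```

Other NLTK tools, such as the taggers and some of the other tokenizers, rely on datasets that have to be fetched separately with `nltk.download`.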
Theano
Used widely in research and academia, Theano is the grandfather of all deep learning frameworks. Written in Python, it’s tightly integrated with NumPy. Theano allows you to create neural networks, which are represented as mathematical expressions with multi-dimensional arrays. Theano handles this for you so you don’t have to worry about the actual implementation of the math involved.
It supports offloading calculations to the much faster GPU. Every framework does this today, but that wasn’t the case when Theano introduced it. The library is very mature at this point and supports a very wide range of operations, which is a great plus when comparing it with other similar libraries.
The biggest complaint out there is that the API may be unwieldy for some, making the library hard to use for beginners. However, there are wrappers that ease the pain and make working with Theano simple, such as Keras, Blocks and Lasagne.
Interested in learning about Theano? Check out this Jupyter Notebook tutorial.
TensorFlow
The Google Brain team created TensorFlow for internal use in machine learning applications, and open sourced it in late 2015. They wanted something that could replace their older, closed source machine learning framework, DistBelief, which they said wasn’t flexible enough and too tightly coupled to their infrastructure to be shared with other researchers around the world.
And so TensorFlow was created. It learns from the mistakes of the past, and many consider it an improvement over Theano, claiming more flexibility and a more intuitive API. It can be used not only for research but also in production environments, and it supports huge clusters of GPUs for training. While it doesn’t support as wide a range of operations as Theano, it has better computational graph visualizations.
TensorFlow is very popular nowadays. In fact, if you’ve heard about a single library on this list, it’s probably this one: not a day goes by without a new blog post or paper mentioning TensorFlow being published. This popularity translates into a lot of new users and a lot of tutorials, making it very welcoming to beginners.
Keras
Keras is a fantastic library that provides a high-level API for neural networks and is capable of running on top of either Theano or TensorFlow. It makes harnessing the full power of these complex pieces of software much easier than using them directly. It’s very user-friendly, putting user experience as a top priority. It manages this with simple APIs and excellent feedback on errors.
It’s also modular, meaning that different modules (neural layers, cost functions, and so on) can be plugged together with few restrictions. This also makes it very easy to extend, since it’s simple to add new modules and connect them with the existing ones.
Some people have called Keras so good that it is effectively cheating in machine learning. So if you’re starting out with deep learning, go through the examples and documentation to get a feel for what you can do with it. And if you want to learn, start out with this tutorial and see where you can go from there.
Two similar alternatives are Lasagne and Blocks, but they only run on Theano. So if you tried Keras and are unhappy with it, maybe try out one of these alternatives to see if they work out for you.
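To give a feel for the plug-together style described above, here is a minimal sketch of a tiny Keras model. It assumes the Keras that ships inside TensorFlow (`tensorflow.keras`) rather than the standalone package running on Theano, and the layer sizes are arbitrary values picked for illustration.

```python
from tensorflow import keras

# Layers are independent modules stacked in order:
# 4 input features -> 8 hidden units -> 3 output classes.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),
])

# The loss function and optimizer are plugged in the same way, by name.
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

model.summary()
```

Swapping the activation, adding another layer or changing the loss is a one-line change, which is a large part of why Keras is comfortable for beginners.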
PyTorch
Another popular deep learning framework is Torch, which is written in Lua. Facebook open-sourced a Python implementation of Torch called PyTorch, which allows you to conveniently use the same low-level libraries that Torch uses, but from Python instead of Lua.
PyTorch is much better for debugging, since one of the biggest differences between Theano/TensorFlow and PyTorch is that the former use symbolic computation while the latter doesn’t. Symbolic computation means that when you code an operation (say, ‘x + y’), it is not computed when that line is interpreted. Before it gets executed it has to be compiled (translated to CUDA or C). This makes debugging harder in Theano/TensorFlow, since an error is much harder to associate with the line of code that caused it. Of course, doing things this way has its advantages, but debugging isn’t one of them.
If you want to start out with PyTorch, the official tutorials are very friendly to beginners but get to advanced topics as well.
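The debugging point above is easier to see in code. In this minimal sketch, the tensor values and the deliberate shape mismatch are invented for illustration: the addition is computed as soon as its line runs, and a mistake raises an error on the line that caused it.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([10.0, 20.0, 30.0])

# The addition runs as soon as this line executes; z already holds numbers.
z = x + y
print(z)

# A mistake (adding tensors of sizes 3 and 4) fails right here,
# so the traceback points at the offending line.
error_message = None
try:
    x + torch.ones(4)
except RuntimeError as error:
    error_message = str(error)

print(error_message)
```

Since intermediate results are ordinary values, you can also inspect them with `print` or a regular Python debugger at any point.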
First steps in machine learning?
Alright, you’ve presented me with a lot of alternatives for machine learning libraries in Python. What should I choose? How do I compare these things? Where do I start?
Our Ape Advice™ for beginners is to try not to get bogged down by details. If you’ve never done anything machine learning related, try out scikit-learn. You’ll get an idea of how the cycle of tagging, training and testing works and how a model is developed.
Now, if you want to try out deep learning, start out with Keras — which is widely agreed to be the easiest framework — and see where that takes you. After you have more experience, you will start to see what it is that you actually want from the framework: greater speed, a different API, or maybe something else, and you’ll be able to make a more informed decision.
And even then, there is an endless supply of articles out there comparing Theano, Torch, and TensorFlow. There’s no real way to tell which one is the best. It’s important to take into account that all of them have wide support and are improving constantly, which makes comparisons harder. A six-month-old benchmark may be outdated, and a year-old claim that framework X doesn’t support operation Y may no longer be valid.
Finally, if you’re interested in doing machine learning specifically applied to NLP, why not check out MonkeyLearn! Our platform provides a unique UX that makes it super easy to build, train and improve NLP models. You can either use pre-trained models for common use cases (like sentiment analysis, topic detection or keyword extraction) or train custom algorithms using your particular data. Also, you don’t have to worry about the underlying infrastructure or deploying your models: our scalable cloud does this for you. You can start for free and integrate right away with our beautiful API.
Want to learn more?
There are plenty of online resources out there to learn about machine learning! Here are a few:
- A comprehensive guide of a machine learning project on a Jupyter Notebook, if you want to see what some code looks like.
- Our Gentle Guide to Machine Learning, if you want to read more about the concepts of machine learning.
- Andrew Ng’s Stanford CS229 on Coursera, if you’re ready to get serious about this machine learning thing. If you are looking for a course on practical deep learning, check out the one at fast.ai.
Final words
So that was a brief intro to machine learning in Python and some of its libraries. The important part is not getting bogged down by details and just trying stuff out. Follow your curiosity, and don’t be afraid to experiment.
Know about a Python library that was left out? Share it in the comments below!