Getting started with Python & Machine Learning
(Reader’s note: this is an introductory guide to machine learning. The author outlines the pros and cons of starting machine learning with Python, and which Python packages to start with.)
Machine learning is eating the world right now. Everyone and their mother is learning about machine learning models, classification, neural networks, and Andrew Ng. You’ve decided you want to be a part of it, but where to start?
In this article we’ll cover some important characteristics of Python and why it’s great for machine learning. We’ll also cover some of the most important libraries it has for ML, and if it piques your interest, some places where you can learn more.
Why is Python used for machine learning?
Python is a great choice for machine learning for several reasons. First and foremost, it’s a simple language on the surface; even if you’re not familiar with Python, getting up to speed is quick if you’ve ever used any other language with C-like syntax (which covers most mainstream languages). Second, Python has a great community, which results in good documentation and friendly, comprehensive answers on Stack Overflow (fundamental!). Third, also stemming from the great community, there are plenty of useful libraries for Python (both “batteries included” in the standard library and third party), which solve almost any problem you can have (including machine learning).
But I heard Python is slow!
Yeah, and it’s true. Python isn’t the fastest language out there: all those handy abstractions come at a cost.
But here’s the trick: libraries can and do offload the expensive calculations to the much more performant (but harder to use) C and C++. For instance, there’s NumPy, a library for numerical computation. Its core is written in C, and it’s fast. Practically every library out there that involves intensive calculations uses it, and almost all the libraries listed next use it in some form. So if you read NumPy, think fast.
Therefore, as long as the heavy number crunching happens inside these libraries rather than in Python loops, your scripts can run nearly as fast as if you had written them in a lower-level language. So for most machine learning work, there’s little to worry about when it comes to speed.
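To make the offloading concrete, here is a minimal sketch that computes the same sum of squares twice: once with a plain Python loop and once with a single NumPy call. The function names and the array size are made up for illustration, and the timings you see will depend on your machine.

```python
import time

import numpy as np


def sum_of_squares_loop(values):
    """Pure Python: every multiplication and addition runs in the interpreter."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_numpy(values):
    """NumPy: the loop over the elements runs inside compiled code."""
    return float(np.dot(values, values))


data = np.arange(200_000, dtype=np.float64)

start = time.perf_counter()
loop_result = sum_of_squares_loop(data)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
numpy_result = sum_of_squares_numpy(data)
numpy_seconds = time.perf_counter() - start

print(f"loop:  {loop_seconds:.4f} s")
print(f"numpy: {numpy_seconds:.4f} s")
```

Both versions give the same answer (up to floating point rounding); the difference is only in where the per-element work is done.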
Python libraries to check out
Scikit-learn
Are you starting out in machine learning? Want something that covers everything from feature engineering to training and testing a model? Look no further than scikit-learn! This fantastic piece of free software provides most of the tools you need for machine learning and data mining. It’s the de facto standard library for machine learning in Python, recommended for most of the ‘old’ ML algorithms.
This library does both classification and regression, supporting most of the classic algorithms (support vector machines, random forest, naive Bayes, and so on). It’s built in a way that makes switching algorithms easy, so experimentation is easy too. These ‘older’ algorithms are surprisingly resilient and work very well in a lot of cases.
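As a rough illustration of how easy switching algorithms is, the sketch below evaluates three of the classifiers mentioned above on the iris dataset that ships with scikit-learn. The choice of dataset, the dictionary of models, and the 5-fold setup are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Every estimator exposes the same fit/predict interface, so swapping
# algorithms only means changing which object sits in this dictionary.
models = {
    "support vector machine": SVC(),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "naive Bayes": GaussianNB(),
}

scores = {}
for name, model in models.items():
    # 5-fold cross-validation returns one accuracy value per fold.
    fold_scores = cross_val_score(model, X, y, cv=5)
    scores[name] = fold_scores.mean()
    print(f"{name}: mean accuracy {scores[name]:.3f}")
```

The loop body never needs to know which algorithm it is evaluating, which is what makes this kind of side-by-side experimentation cheap.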
But that’s not all! Scikit-learn also does dimensionality reduction, clustering, you name it. It’s also fast, since it runs on NumPy and SciPy (meaning that the heavy number crunching runs in compiled code instead of Python).
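For a taste of the dimensionality reduction and clustering side, here is a small sketch that chains the two. Using PCA and k-means on the iris data is just one possible pairing, chosen here for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # labels are ignored: this is unsupervised

# Dimensionality reduction: project the 4 original features onto 2 components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Clustering: group the projected points into 3 clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_2d)

print(X_2d.shape)
print(sorted(set(cluster_ids.tolist())))
```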
Check out some examples to see everything this library is capable of, and the tutorials if you want to learn how it works.
NLTK
While not a machine learning library per se, NLTK is a must when working with natural language processing (NLP). It comes with a bundle of datasets and other lexical resources (useful for training models) in addition to libraries for working with text, covering tasks such as classification, tokenization, stemming, tagging, parsing and more.
The usefulness of having all of this stuff neatly packaged can’t be overstated. So if you are interested in NLP, check out some tutorials!
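Here is a minimal sketch of two of the tasks listed above, tokenization and stemming. It uses `wordpunct_tokenize` and `PorterStemmer`, picked because they should work without downloading any extra NLTK data; taggers, parsers and most corpora do need a separate download step. The example sentence is made up.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

sentence = "NLTK makes tokenizing and stemming sentences easy."

# Tokenization: split the raw string into word and punctuation tokens.
tokens = wordpunct_tokenize(sentence)

# Stemming: strip suffixes so related word forms map to a common stem.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

for token, stem in zip(tokens, stems):
    print(f"{token} -> {stem}")
```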
Theano
Used widely in research and academia, Theano is the grandfather of all deep learning frameworks. Written in Python, it’s tightly integrated with NumPy. Theano allows you to create neural networks, which are represented as mathematical expressions with multi-dimensional arrays. Theano handles this for you so you don’t have to worry about the actual implementation of the math involved.
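To give a feel for what “a network as a mathematical expression over multi-dimensional arrays” means, here is a minimal sketch of a single fully connected layer. It is written in plain NumPy rather than Theano, and the layer sizes and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A batch of 4 inputs with 3 features each.
x = rng.normal(size=(4, 3))

# Parameters of one fully connected layer with 2 output units.
W = rng.normal(size=(3, 2))
b = np.zeros(2)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# The whole layer is one array expression: sigmoid(x . W + b)
output = sigmoid(x @ W + b)

print(output.shape)
```

NumPy evaluates this expression immediately. Theano instead treats an expression like this one symbolically, which lets it derive gradients and compile the computation before running it.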
It supports offloading calculations to the much faster GPU, a feature that every major framework supports today but that wasn’t common back when Theano introduced it. The library is very mature at this point and supports a very wide range of operations, which is a great plus when comparing it with other similar libraries.
The biggest complaint out there is that the API may be unwieldy for some, making the library hard to use for beginners. However, there are wrappers that ease the pain and make working with Theano simple, such as Keras, Blocks and Lasagne.
Interested in learning about Theano? Check out this Jupyter Notebook tutorial.
TensorFlow
The Google Brain team created TensorFlow for internal use in machine learning applications, and open sourced it in late 2015. They wanted something that could replace their older, closed source machine learning framework, DistBelief, which they said wasn’t flexible enough and was too tightly coupled to their infrastructure to be shared with other researchers around the world.
And so TensorFlow was created. Because it learned from the mistakes of the past, many consider this library to be an improvement over Theano, claiming more flexibility and a more intuitive API. It can be used not only for research but also in production environments, and it supports huge clusters of GPUs for training. While it doesn’t support as wide a range of operations as Theano, it has better computational graph visualizations.
TensorFlow is very popular nowadays. In fact, if you’ve heard about a single library on this list, it’s probably this one: there isn’t a day that goes by without a new blog post or paper mentioning TensorFlow being published. This popularity translates into a lot of new users and a lot of tutorials, making it very welcoming to beginners.
Keras
Keras is a fantastic library that provides a high-level API for neural networks and is capable of running on top of either Theano or TensorFlow. It makes harnessing the full power of these complex pieces of software much easier than using them directly. It’s very user-friendly, putting user experience as a top priority, which it achieves through simple APIs and excellent feedback on errors.
It’s also modular, meaning that different modules (neural layers, cost functions, and so on) can be plugged together with few restrictions. This also makes it very easy to extend, since it’s simple to add new modules and connect them with the existing ones.
Some people have called Keras so good that it is effectively cheating in machine learning. So if you’re starting out with deep learning, go through the examples and documentation to get a feel for what you can do with it. And if you want to learn, start out with this tutorial and see where you can go from there.
Two similar alternatives are Lasagne and Blocks, but they only run on Theano. So if you tried Keras and are unhappy with it, maybe try out one of these alternatives to see if they work out for you.
PyTorch
Another popular deep learning framework is Torch, which is written in Lua. Facebook open-sourced a Python implementation of Torch called PyTorch, which allows you to conveniently use the same low-level libraries that Torch uses, but from Python instead of Lua.
PyTorch is much easier to debug. One of the biggest differences between Theano/TensorFlow and PyTorch is that the former use symbolic computation while the latter doesn’t. Symbolic computation means that when you code an operation (say, ‘x + y’), it isn’t computed when that line is interpreted; it first has to be compiled (translated to CUDA or C) and is only executed afterwards. This makes debugging harder in Theano/TensorFlow, since an error is much harder to associate with the line of code that caused it. Of course, doing things this way has its advantages, but debugging isn’t one of them.
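The difference between the two styles can be sketched in a few lines of plain Python. This is a toy model written for this explanation, not how Theano, TensorFlow or PyTorch are implemented; the `Node`, `Placeholder` and `run` names are hypothetical.

```python
class Node:
    """A deferred operation: it remembers what to compute, not the result."""

    def __init__(self, op, *inputs):
        self.op = op
        self.inputs = inputs

    def __add__(self, other):
        return Node("add", self, other)


class Placeholder(Node):
    """A named input whose value is supplied only at run time."""

    def __init__(self, name):
        super().__init__("input")
        self.name = name


def run(node, feed):
    """Walk the graph and compute it, given values for the placeholders."""
    if isinstance(node, Placeholder):
        return feed[node.name]
    left, right = (run(i, feed) for i in node.inputs)
    if node.op == "add":
        return left + right
    raise ValueError(f"unknown op: {node.op}")


# Symbolic style: this line only builds a graph; nothing is added yet.
x, y = Placeholder("x"), Placeholder("y")
z = x + y
symbolic_result = run(z, {"x": 2, "y": 3})

# Eager style: the addition happens on this very line.
eager_result = 2 + 3

print(type(z).__name__, symbolic_result, eager_result)
```

If a bad value is fed in (say, the string "3" for y), the resulting TypeError is raised from inside `run`, not from the line `z = x + y` where the operation was written. Real frameworks are far more sophisticated than this toy, but that is the gist of why errors in symbolic graphs are harder to trace back to their source.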
If you want to start out with PyTorch, the official tutorials are very friendly to beginners but get to advanced topics as well.
First steps in machine learning?
Alright, you’ve presented me with a lot of alternatives for machine learning libraries in Python. What should I choose? How do I compare these things? Where do I start?
Our Ape Advice™ for beginners is to try not to get bogged down by details. If you’ve never done anything machine learning related, try out scikit-learn. You’ll get an idea of how the cycle of tagging, training and testing works and how a model is developed.
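The tagging, training and testing cycle can be sketched end to end with scikit-learn. The tiny hand-tagged dataset below is invented for this example, and the choice of a word-count model with naive Bayes is just one reasonable starting point; with so little data, the reported accuracy means very little.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# 1. Tagging: each text gets a label by hand (a made-up toy dataset).
texts = [
    "I love this product",
    "Great service and fast delivery",
    "Absolutely fantastic experience",
    "Very happy with the result",
    "I hate this product",
    "Terrible service and slow delivery",
    "Absolutely awful experience",
    "Very unhappy with the result",
]
tags = ["positive"] * 4 + ["negative"] * 4

# Hold out a quarter of the tagged data for testing.
train_texts, test_texts, train_tags, test_tags = train_test_split(
    texts, tags, test_size=0.25, stratify=tags, random_state=0
)

# 2. Training: turn texts into word counts, then fit a classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_tags)

# 3. Testing: compare predictions against the held-out tags.
predicted = model.predict(test_texts)
accuracy = accuracy_score(test_tags, predicted)
print(f"accuracy on {len(test_texts)} held-out texts: {accuracy:.2f}")
```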
Now, if you want to try out deep learning, start out with Keras — which is widely agreed to be the easiest framework — and see where that takes you. After you have more experience, you will start to see what it is that you actually want from the framework: greater speed, a different API, or maybe something else, and you’ll be able to make a more informed decision.
And even then, there is an endless supply of articles out there comparing Theano, Torch, and TensorFlow. There’s no real way to tell which one is the best. Keep in mind that all of them have wide support and are improving constantly, which makes comparisons hard. A six-month-old benchmark may be outdated, and a year-old claim that framework X doesn’t support operation Y may no longer be valid.
Finally, if you’re interested in doing machine learning specifically applied to NLP, why not check out MonkeyLearn! Our platform provides a unique UX that makes it super easy to build, train and improve NLP models. You can either use pre-trained models for common use cases (like sentiment analysis, topic detection or keyword extraction) or train custom algorithms using your particular data. Also, you don’t have to worry about the underlying infrastructure or deploying your models: our scalable cloud does this for you. You can start for free and integrate right away with our beautiful API.
Want to learn more?
There are plenty of online resources out there to learn about machine learning! Here are a few:
- A comprehensive guide to a machine learning project in a Jupyter Notebook, if you want to see what some code looks like.
- Our Gentle Guide to Machine Learning, if you want to read more about the concepts of machine learning.
- Andrew Ng’s Stanford CS229 on Coursera, if you’re ready to get serious about this machine learning thing. If you are looking for a course on practical deep learning, check out the one at fast.ai.
Final words
So that was a brief intro to machine learning in Python and some of its libraries. The important part is not getting bogged down by details and just trying stuff out. Follow your curiosity, and don’t be afraid to experiment.
Know about a Python library that was left out? Share it in the comments below!