Journey from a Python noob to a Kaggler on Python

So, you want to become a data scientist, or maybe you are already one and want to expand your tool repository. You have landed at the right place. The aim of this page is to provide a comprehensive learning path for people who are new to Python for data analysis, covering the steps you need to learn along the way. If you already have some background, or don’t need all the components, feel free to adapt the path to your needs and let us know what you changed.

You can also check the mini version of this learning path –> Infographic: Quick Guide to learn Data Science in Python

Step 0: Warming up

Before starting your journey, the first question to answer is:

Why use Python?

or

How would Python be useful?

Watch the first 30 minutes of this talk from Jeremy, Founder of DataRobot, at PyCon 2014, Ukraine to get an idea of how useful Python can be.

Step 1: Setting up your machine

Now that you have made up your mind, it is time to set up your machine. The easiest way to proceed is to download Anaconda from Continuum.io. It comes packaged with most of the things you will ever need. The major downside of taking this route is that you have to wait for Continuum to update their packages, even when a newer version of an underlying library is already available. If you are just starting out, that should hardly matter.

If you face any challenges in installing, you can find more detailed instructions for various OS here

Step 2: Learn the basics of Python language

You should start by understanding the basics of the language, its libraries and its data structures. The Python track from Codecademy is one of the best places to start your journey. By the end of this course, you should be comfortable writing small scripts in Python and also understand classes and objects.

Specifically learn: Lists, Tuples, Dictionaries, List comprehensions, Dictionary comprehensions 
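As a quick orientation, here is a minimal sketch of the structures listed above. All names and values are made up for illustration; they do not come from any of the courses mentioned here.

```python
# Core built-in containers and comprehensions (all values are illustrative).

# List: ordered and mutable.
scores = [72, 88, 95, 61]
scores.append(79)

# Tuple: ordered and immutable, handy for small fixed-size records.
point = (3, 4)
x, y = point  # tuple unpacking

# Dictionary: maps keys to values.
ages = {"alice": 31, "bob": 25}
ages["carol"] = 40

# List comprehension: build a new list from an iterable, with an optional filter.
passing = [s for s in scores if s >= 70]

# Dictionary comprehension: build a dict from key and value expressions.
name_lengths = {name: len(name) for name in ages}

print(passing)
print(name_lengths)
```

Comprehensions are usually preferred over an explicit loop with `append` when the result is a simple transformation or filter of another collection.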

Assignment: Solve the Python tutorial questions on HackerRank. These should get your brain thinking about Python scripting.

Alternate resources: If interactive coding is not your style of learning, you can also look at the Google Class for Python. It is a 2-day class series and also covers some of the parts discussed later.

Step 3: Learn Regular Expressions in Python

You will need to use them a lot for data cleansing, especially if you are working on text data. The best way to learn regular expressions is to go through the Google class and keep this cheat sheet handy.
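To give a feel for what regex-based cleansing looks like, here is a small sketch using the standard `re` module. The sample string and the patterns are assumptions chosen for illustration; in particular, the email pattern is deliberately simple and is not a complete validator.

```python
import re

# Hypothetical messy input, made up for this example.
raw = "  Contact: John.Doe@Example.com ; phone 555-123-4567,  alt: 555.987.6543 "

# Collapse runs of whitespace into single spaces and trim both ends.
cleaned = re.sub(r"\s+", " ", raw).strip()

# Simple email pattern: local part, '@', domain, '.', suffix.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", cleaned)

# Phone numbers written as 3-3-4 digits separated by '-' or '.'.
phones = re.findall(r"\d{3}[-.]\d{3}[-.]\d{4}", cleaned)

# Normalise the separators so every number uses '-'.
normalised = [p.replace(".", "-") for p in phones]

print(cleaned)
print(emails)
print(normalised)
```

`re.sub` rewrites matching text and `re.findall` extracts every match; between them they cover a large share of everyday cleaning tasks.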

Assignment: Do the baby names exercise

If you still need more practice, follow this tutorial for text cleaning. It will challenge you on various steps involved in data wrangling.

Step 4: Learn Scientific libraries in Python – NumPy, SciPy, Matplotlib and Pandas

This is where the fun begins! Here is a brief introduction to various libraries. Let’s start practicing some common operations.

  • Practice the NumPy tutorial thoroughly, especially NumPy arrays. This will form a good foundation for things to come.
  • Next, look at the SciPy tutorials. Go through the introduction and the basics, and do the remaining ones based on your needs.
  • If you guessed Matplotlib tutorials next, you are wrong! They are too comprehensive for our needs here. Instead, look at this ipython notebook till Line 68 (i.e. till animations).
  • Finally, let us look at Pandas. Pandas provides DataFrame functionality (like R) for Python. This is also where you should spend a good amount of time practicing. Pandas will become your most effective tool for all mid-size data analysis. Start with a short introduction, 10 minutes to pandas. Then move on to a more detailed tutorial on pandas.
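Before working through the tutorials, it may help to see what a few common NumPy array operations look like. This is a minimal sketch with made-up numbers, assuming NumPy is installed (it ships with Anaconda).

```python
import numpy as np

# Build a 2x3 array: [[0, 1, 2], [3, 4, 5]].
a = np.arange(6).reshape(2, 3)

# Aggregate along an axis: axis=0 gives one mean per column.
col_means = a.mean(axis=0)

# Broadcasting: the 1-D row of means is subtracted from every row of `a`.
centered = a - col_means

# Boolean masking: keep only the elements greater than 2.
selected = a[a > 2]

print(col_means)
print(centered)
print(selected)
```

Vectorised operations like these replace explicit Python loops and are the foundation that Pandas and Scikit-learn build on.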

You can also look at Exploratory Data Analysis with Pandas and Data munging with Pandas
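As a taste of exploratory analysis and data munging, here is a short sketch with Pandas. The tiny dataset, the column names and the choice of filling missing values with the median are all assumptions for illustration, not a recommendation taken from the linked tutorials.

```python
import pandas as pd

# Hypothetical sales data with one missing value.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Delhi", "Mumbai"],
    "sales": [120.0, 90.0, None, 110.0, 75.0],
})

# Exploration: count the missing values in the column.
missing = df["sales"].isna().sum()

# Munging: fill the gap with the column median (one of several possible choices).
df["sales"] = df["sales"].fillna(df["sales"].median())

# Aggregation: total sales per city.
totals = df.groupby("city")["sales"].sum()

print(missing)
print(totals)
```

Whether the median, the mean or dropping the row is the right treatment depends on the data, so treat the `fillna` step as a placeholder for your own decision.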

Additional Resources:

  • If you need a book on Pandas and NumPy, try “Python for Data Analysis” by Wes McKinney.
  • There are a lot of tutorials as part of Pandas documentation. You can have a look at them here

Assignment: Solve this assignment from the CS109 course from Harvard.

Step 5: Effective Data Visualization

Go through this lecture from CS109. You can ignore the initial 2 minutes, but what follows after that is awesome! Follow this lecture up with this assignment.

Step 6: Learn Scikit-learn and Machine Learning

Now, we come to the meat of this entire process. Scikit-learn is the most useful library in Python for machine learning. Here is a brief overview of the library. Go through lecture 10 to lecture 18 from the CS109 course from Harvard. You will go through an overview of machine learning, supervised learning algorithms like regression, decision trees and ensemble modeling, and unsupervised learning algorithms like clustering. Follow individual lectures with the assignments from those lectures.
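To make the supervised and unsupervised distinction concrete, here is a minimal sketch with Scikit-learn. It assumes Scikit-learn and NumPy are installed, and it uses the small iris dataset bundled with the library; the model choices and parameters (tree depth, three clusters, the random seeds) are illustrative assumptions, not settings from the CS109 lectures.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small dataset that ships with scikit-learn (no download needed).
X, y = load_iris(return_X_y=True)

# Supervised learning: hold out a test set, fit on the rest, score on the held-out part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))

# Unsupervised learning: cluster the same features without using the labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

print(f"Decision tree test accuracy: {accuracy:.2f}")
print(f"Cluster sizes: {np.bincount(cluster_labels)}")
```

Most Scikit-learn estimators follow this same `fit` and `predict` pattern, so swapping in another algorithm usually means changing only the line that constructs the model.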

Additional Resources:

Assignment: Try out this challenge on Kaggle

Step 7: Practice, practice and practice

Congratulations, you made it!

You now have all the technical skills you need. It is a matter of practice, and what better place to practice than competing with fellow Data Scientists on Kaggle? Go, dive into one of the live competitions currently running on Kaggle and give all that you have learnt a try!

Step 8: Deep Learning

Now that you have learnt most of the machine learning techniques, it is time to give Deep Learning a shot. There is a good chance that you already know what Deep Learning is, but if you still need a brief intro, here it is.

I am myself new to deep learning, so please take these suggestions with a pinch of salt. The most comprehensive resource is deeplearning.net. You will find everything here – lectures, datasets, challenges, tutorials. You can also give the course from Geoff Hinton a try to understand the basics of Neural Networks.

Get Started with Python: A Complete Tutorial To Learn Data Science with Python From Scratch

P.S. In case you need to use Big Data libraries, give Pydoop and PyMongo a try. They are not included here as the Big Data learning path is an entire topic in itself.
