【Repost】A Practical Intro to Data Science
Are you a interested in taking a course with us? Learn about our programs or contact us at hello@zipfianacademy.com.
There are plenty of articles and discussions on the web about what data science is, what qualitiesdefine a data scientist, how to nurture them, and how you should position yourself to be acompetitive applicant. There are far fewer resources out there about the steps to take in order to obtain the skills necessary to practice this elusive discipline. Here we will provide a collection of freely accessible materials and content to jumpstart your understanding of the theory and tools of Data Science.
At Zipfian Academy, we believe that everyone learns at different paces and in different ways. If you prefer a more structured and intentional learning environment, we run a 12 week immersive bootcamp training people to become data scientists through hands-on projects and real-world applications.
We would love to hear your opinions on what qualities make great data scientists, what a data science curriculum should cover, and what skills are most valuable for data scientists to know.
While the information contained in these resources is a great guide and reference, the best way to become a data scientist is to make, create, and share!
Environment
While the emerging field of data science is not tied to any specific tools, there are certain languages and frameworks that have become the bread and butter for those working in the field. We recommend Python as the programming language of choice for aspiring data scientists due to its general purpose applicability, a gentle (or firm) learning curve, and — perhaps the most compelling reason — the rich ecosystem of resources and libraries actively used by the scientific community.
Development
When learning a new language in a new domain, it helps immensely to have an interactive environment to explore and to receive immediate feedback. IPython provides an interactive REPL which also allows you to integrate a wide variety of frameworks (including R) into your Python programs.
Statistics
It is often said that a data scientist is someone who is better at software engineering than a statistician and better at statistics than any software engineer. As such, statistical inference underpins much of the theory behind data analysis and a solid foundation of statistical methods and probability serves as a stepping stone into the world of data science.
Courses
edx: Introduction to Statistics: A basic introductory statistics course.
Coursera: Statistics, Making sense of Data: A applied Statistics course that teaches the complete pipeline of statistical analysis.
MIT: Statistical Thinking and Data Analysis: Introduction to probability, sampling, regression, common distributions, and inference.
While R is the de facto standard for performing statistical analysis, it has quite a high learning curve and there are other areas of data science for which it is not well suited. To avoid learning a new language for a specific problem domain, we recommend trying to perform the exercises of these courses with Python and its numerous statistical libraries. You will find that much of the functionality of R can be replicated with NumPy, SciPy, matplotlib, and pandas.
Books
Well written books can be a great reference (and supplement) to these courses, and also provide a more independent learning experience. These may be useful if you already have some knowledge of the subject or just need to fill in some gaps in your understanding:
O’Reilly: Think Stats: An introduction to Probability and Statistics for Python programmers.
Introduction to Probability: Textbook for Berkeley’s Stats 134 class, an introductory treatment of probability with complementary exercises.
Lecture notes for Introduction to Probability: Compiled lecture notes of above textbook, complete with exercises.
OpenIntro: Statistics: Introductory text book with supplementary exercises and labs in an online portal.
Think Bayes: An simple introduction to Bayesian Statistics with Python code examples.
Machine Learning/Algorithms
A solid base of Computer Science and algorithms is essential for an aspiring data scientist. Luckily there are a wealth of great resources online, and machine learning is one of the more lucrative (and advanced) skills of a data scientist.
Courses
Coursera: Machine Learning: Stanford’s famous machine learning course taught by Andrew Ng.
Coursera: Computational Methods for Data Analysis: Statistical methods and data analysis applied to physical, engineering, and biological sciences.
MIT: Data Mining: An introduction to the techniques of data mining and how to apply ML algorithms to garner insights.
edx: Introduction to Artificial Intelligence: The first half of Berkeley’s popular AI course that teaches you to build autonomous agents to efficiently make decisions in stochastic and adversarial settings.
edx: Introduction to Computer Science and Programming: MIT’s introductory course to the theory and application of Computer Science.
Books
A first encounter with Machine Learning: An introduction to machine learning concepts focusing on the intuition and explanation behind why they work.
A Programmer’s Guide to Data Mining: A web based book complete with code samples (in Python) and exercises.
Data Structures and Algorithms: An introduction to computer science with code examples in Python — covers algorithm analysis, data structures, sorting algorithms, and object oriented design.
An Introduction to Data Mining: An interactive Decision Tree guide (with hyperlinked lectures) to learning data mining and ML.
Elements of Statistical Learning: One of the most comprehensive treatments of data mining and ML, often used as a university textbook.
An Introduction to Information Retrieval: Textbook from a Stanford course on NLP and information retrieval with sections on text classification, clustering, indexing, and web crawling.
Data ingestion and cleaning
One of the most under-appreciated aspects of data science is the cleaning and munging of data that often represents the most significant time sink during analysis. While there is never a silver bullet for such a problem, knowing the right tools, techniques, and approaches can help minimize time spent wrangling data.
Courses
- School of Data: A gentle introduction to cleaning data: A hands on approach to learning to clean data, with plenty of exercises and web resources.
Tutorials
- Predictive Analytics: Data Preparation: An introduction to the concepts and techniques of sampling data, accounting for erroneous values, and manipulating the data to transform it into acceptable formats.
Tools
OpenRefine (formerly Google Refine): A powerful tool for working with messy data, cleaning, transforming, extending it with web services, and linking to databases. Think Excel on steroids.
DataWrangler: Stanford research project that provides an interactive tool for data cleaning and transformation.
sed: “The ultimate stream editor” — used to process files with regular expressions often used for substitution.
awk: “Another cornerstone of UNIX shell programming” — used for processing rows and columns of information.
Visualization
The most insightful data analysis is useless unless you can effectively communicate your results. The art of visualization has a long history, and while being one of the most qualitative aspects of data science its methods and tools are well documented.
Courses
UC Berkeley: Visualization: Graduate class on the techniques and algorithms for creating effective visualizations.
Rice: Data Visualization: A treatment of data visualization and how to meaningfully present information from the perspective of Statistics.
Harvard: Introduction to Computing, Modeling, and Visualization: Connects the concepts of computing with data to the process of interactively visualizing results.
Books
- Tufte: The Visual Display of Quantitative Information: Not freely available, but perhaps the most influential text for the subject of data visualization. A classic that defined the field.
Tutorials
School of Data: From Data to Diagrams: A gentle introduction to plotting and charting data, with exercises.
Predictive Analytics: Overview and Data visualization: An introduction to the process of predictive modeling, and a treatment of the visualization of its results.
Tools
D3.js: Data-Driven Documents — Declarative manipulation of DOM elements with data dependent functions (with Python port).
Vega: A visualization grammer built on top of D3 for declarative visualizations in JSON. Released by the dream team at Trifacta, it provides a higher level abstraction than D3 for creating “ or SVG based graphics.
Rickshaw: A charting library built on top of D3 with a focus on interactive time series graphs.
modest maps: A lightweight library with a simple interface for working with maps in the browser (with ports to multiple languages).
Chart.js: Very simple (only six charts) HTML5 “ based plotting library with beautiful styling and animation.
Computing at Scale
When you start operating with data at the scale of the web (or greater), the fundamental approach and process of analysis must change. To combat the ever increasing amount of data, Google developed the MapReduce paradigm. This programming model has become the de facto standard for large scale batch processing since the release of Apache Hadoop in 2007, the open-source MapReduce framework.
Courses
UC Berkeley: Analyzing Big Data with Twitter: A course — taught in close collaboration with Twitter — that focuses on the tools and algorithms for data analysis as applied to Twitter microblog data (with project based curriculum).
Coursera: Web Intelligence and Big Data: An introduction to dealing with large quantities of data from the web; how the tools and techniques for acquiring, manipulating, querying, and analyzing data change at scale.
CMU: Machine Learning with Large Datasets: A course on scaling machine learning algorithms on Hadoop to handle massive datasets.
U of Chicago: Large Scale Learning: A treatment of handling large datasets through dimensionality reduction, classification, feature parametrization, and efficient data structures.
UC Berkeley: Scalable Machine Learning: A broad introduction to the systems, algorithms, models, and optimizations necessary at scale.
Books
Mining Massive Datasets: Stanford course resources on large scale machine learning and MapReduce with accompanying book.
Data-Intensive Text Processing with MapReduce: An introduction to algorithms for the indexing and processing of text that teaches you to “think in MapReduce.”
Hadoop: The Definitive Guide: The most thorough treatment of the Hadoop framework, a great tutorial and reference alike.
Programming Pig: An introduction to the Pig framework for programming data flows on Hadoop.
Putting it all together
Data Science is an inherently multidisciplinary field that requires a myriad of skills to be a proficient practitioner. The necessary curriculum has not fit into traditional course offerings, but as awareness of the need for individuals who have such abilities is growing, we are seeing universities and private companies creating custom classes.
Courses
UC Berkeley: Introduction to Data Science: A course taught by Jeff Hammerbacher and Mike Franklin that highlights each of the varied skills that a Data Scientist must be proficient with.
How to Process, Analyze and Visualize Data: A lab oriented course that teaches you the entire pipeline of data science; from acquiring datasets and analyzing them at scale to effectively visualizing the results.
Coursera: Introduction to Data Science: A tour of the basic techniques for Data Science including SQL and NoSQL databases, MapReduce on Hadoop, ML algorithms, and data visualization.
Columbia: Introduction to Data Science: A very comprehensive course that covers all aspects of data science, with an humanistic treatment of the field.
Columbia: Applied Data Science (with book): Another Columbia course — teaches applied software development fundamentals using real data, targeted towards people with mathematical backgrounds.
Coursera: Data Analysis (with notes and lectures): An applied statistics course that covers algorithms and techniques for analyzing data and interpreting the results to communicate your findings.
Books
- An Introduction to Data Science: The companion textbook to Syracuse University’s flagship course for their new Data Science program.
Tutorials
- Kaggle: Getting Started with Python for Data Science: A guided tour of setting up a development environment, an introduction to making your first competition submission, and validating your results.
Conclusion
Now this just scratches the surface of the infinitely deep field of Data Science and we encourage everyone to go out and try your hand at some science! We would love for you to join theconversation over @zipfianacademy and let us know if you want to learn more about any of these topics.
Blogs
Data Beta: Professor Joe Hellerstein’s blog about education, computing, and data.
Dataists: Hilary Mason and Vince Buffalo’s old blog that has a wealth of information and resources about the field and practice of data science.
Five Thirty Eight: Nate Silver’s famous NYT blog where he discusses predictive modeling and political forecasts.
grep alex: Alex Holmes’s blog about distributed computing and the intricacies of Hadoop.
Data Science 101: One man’s personal journey to becoming a data scientist (with plenty of resources)
no free hunch: Kaggle’s blog about the practice of data science and its competition highlights.
Resources
Berkeley: Introduction to Data Science: One of the most comprehensive lists of resources about all things data science.
Cloudera: New to Data Science: Resources about data science from Cloudera’s introduction to data science course/certification.
Kaggle: Tutorials: A set of tutorials, books, courses, and competitions for statistics, data analysis, and machine learning.
If you made it this far, you should check out our 12-week intensive program. Apply here!
Source: http://blog.zipfianacademy.com/post/46864003608/a-practical-intro-to-data-scienc
【Repost】A Practical Intro to Data Science的更多相关文章
- 【12c】扩展数据类型(Extended Data Types)-- MAX_STRING_SIZE
[12c]扩展数据类型(Extended Data Types)-- MAX_STRING_SIZE 在12c中,与早期版本相比,诸如VARCHAR2, NAVARCHAR2以及 RAW这些数据类型的 ...
- 【repost】H5总结
1.新增的语义化标签: <nav>: 导航 <header>: 页眉 <footer>: 页脚 <section>:区块 <article> ...
- 【LeetCode】170. Two Sum III – Data structure design
Difficulty:easy More:[目录]LeetCode Java实现 Description Design and implement a TwoSum class. It should ...
- 【MongoDB】mongoimport and mongoexport of data (一)
In the software development, we usually are faced with a common question of exporting or importing d ...
- 【repost】如何学好编程 (精挑细选编程教程,帮助现在在校学生学好编程,让你门找到编程的方向)四个方法总有一个学好编程的方法适合你
方法(一)编了这么久的程序,一直想找机会总结下其中的心得和方法,但回想我这段编程道路,又很难说清楚,如果按照我走过的所有路来说,显然是不可能的!当我看完了云风的<游戏之旅--编程感悟>和梁 ...
- 【转】Jmeter中使用CSV Data Set Config参数化不重复数据执行N遍
Jmeter中使用CSV Data Set Config参数化不重复数据执行N遍 要求: 今天要测试上千条数据,且每条数据要求执行多次,(模拟多用户多次抽奖) 1.用户id有175个,且没有任何排序规 ...
- 【hbase】Unable to read additional data from client sessionid 0x15c92bd1fca0003, likely client has closed socket
启动hbase ,验证出错 Master is initializing 查看zk日志,发现Unable to read additional data from client sessionid 0 ...
- 【repost】H5的新特性及部分API详解
h5新特性总览 移除的元素 纯表现的元素: basefont.big.center.font等 对可用性产生负面影响的元素: frame.frameset.noframes 新增的API 语义: 能够 ...
- 【repost】JavaScript 基本语法
JavaScript 基本语法,JavaScript 引用类型, JavaScript 面向对象程序设计.函数表达式和异步编程 三篇笔记是对<JavaScript 高级程序设计>和 < ...
随机推荐
- 使用MediaPlayer和SurfaceView播放视频
使用VideoView播放视频简单.方便,丹有些早期的开发者更喜欢使用MediaPlayer来播放视频,但由于MediaPlayer主要用于播放音频,因此它没有提供图像输出界面,此时 需要借助于Sur ...
- 使用drawBitmapMesh扭曲图像
Canvas提供了一个drawBitmapMesh(bitmap, meshWidth, meshHeight, verts, vertOffset, colors, colorOffset, pai ...
- Operating System Concepts with java 项目: Shell Unix 和历史特点
线程间通信,fork(),waitpid(),signal,捕捉信号,用c执行shell命令,共享内存,mmap 实验要求: 1.简单shell: 通过c实现基本的命令行shell操作,实现两个函数, ...
- 2440 lcd10分钟休眠修改
在我们的系统中,LCD的虚拟控制台和控制台TTY不是同一个设备,也就是说,如果在程序里单纯的printf是不行的!这样只能修改你正在使用的TTY的blankinterval,而你用的却是文本方式的设备 ...
- 面试题目-c和c++的区别
在很大程度上,标准C++是标准C的超集.实际上,所有C程序也是C++程序,然而,两者之间有少量区别.下面简要介绍一下最重要的区别. 1. 在C++中,局部变量可以在一个程序块内在任何地方声明,在 ...
- 基于K2 BPM的大型连锁企业开关店选址管理解决方案
业内有句名言:“门店最重要的是什么?第一是选址,第二是选址,第三还是选址” 选址是一个很复杂的综合性商业决策过程,需要定性考虑和定向分析.K2开关店&选址管理方案重点关注:如何开出更好的店?在 ...
- 【转发】CentOS 7 巨大变动之 systemd 取代 SysV的Init
1 systemd是什么 首先systmed是一个用户空间的程序,属于应用程序,不属于Linux内核范畴,Linux内核的主要特征在所有发行版中是统一的,厂商可以自由改变的是用户空间的应用程序. ...
- APP store 官方统计工具的常见的Q&A
Apple最近在iTunesConnect里最新发布了官方统计工具,提供了现有友盟统计平台和自有统计平台无法统计的数据,具有自己的独有特点,尤其是下面几个最让人头疼的流量分析转化,可以在App Ana ...
- html练习——个人简介
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/ ...
- hdu1116 欧拉回路
//Accepted 248 KB 125 ms //欧拉回路 //以26个字母为定点,一个单词为从首字母到末尾字母的一条边 //下面就是有向图判断欧拉回路 //连通+节点入度和==出度和 或者 存在 ...