Deep Data
A lot of great pieces have been written about the relatively recent surge in interest in big data and data science, but in this piece I want to address the importance of deep data analysis: what we can learn from the statistical outliers by drilling down and asking, “What’s different here? What’s special about these outliers and what do they tell us about our models and assumptions?”
The reason that big data proponents are so excited about the burgeoning data revolution isn’t just because of the math. Don’t get me wrong, the math is fun, but we’re excited because we can begin to distill patterns that were previously invisible to us due to a lack of information.
That’s big data.
Of course, data are just a collection of facts: bits of information that are only given context — assigned meaning and importance — by human minds. It’s not until we do something with the data that any of it matters. You can have the best machine learning algorithms, the tightest statistics, and the smartest people working on them, but none of that means anything until someone makes a story out of the results.
And therein lies the rub.
Do all these data tell us a story about ourselves and the universe in which we live, or are we simply hallucinating patterns that we want to see?
(Semi)Automated science
In 2010, Cornell researchers Michael Schmidt and Hod Lipson published a groundbreaking paper in “Science” titled, “Distilling Free-Form Natural Laws from Experimental Data”. The premise was simple, and it essentially boiled down to the question, “can we algorithmically extract models to fit our data?”
So they hooked up a double pendulum — a seemingly chaotic system whose movements are governed by classical mechanics — and trained a machine learning algorithm on the motion data.
Their results were astounding.
In a matter of minutes the algorithm converged on Newton’s second law of motion: f = ma. What took humanity tens of thousands of years to accomplish was completed on a 32-core machine in essentially no time at all.
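To give a flavor of how this kind of automated equation discovery works, here is a toy symbolic regression sketch — not Schmidt and Lipson’s actual system — using the open-source gplearn library to recover f = ma from noisy synthetic measurements:

```python
# Toy symbolic regression (not Schmidt & Lipson's actual system):
# search over candidate expressions to recover f = m * a.
import numpy as np
from gplearn.genetic import SymbolicRegressor

rng = np.random.default_rng(0)
m = rng.uniform(0.5, 5.0, size=500)            # masses
a = rng.uniform(-10.0, 10.0, size=500)         # accelerations
X = np.column_stack([m, a])
y = m * a + rng.normal(scale=0.01, size=500)   # forces, slightly noisy

est = SymbolicRegressor(population_size=2000, generations=20,
                        function_set=('add', 'sub', 'mul'),
                        stopping_criteria=0.001, random_state=0)
est.fit(X, y)
print(est._program)  # typically converges on mul(X0, X1), i.e. f = m*a
```

The search is genetic: candidate equations are mutated and recombined, and the ones that best fit the data survive — which is also, roughly, how the original double-pendulum result was found.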
In 2011, some neuroscience colleagues of mine, led by Tal Yarkoni, published a paper in “Nature Methods” titled “Large-scale automated synthesis of human functional neuroimaging data”. In this paper the authors sought to extract patterns from the overwhelming flood of brain imaging research.
To do this they algorithmically extracted the 3D coordinates of significant brain activations from thousands of neuroimaging studies, along with words that frequently appeared in each study. Using these two pieces of data along with some simple (but clever) mathematical tools, they were able to create probabilistic maps of brain activation for any given term.
In other words, you type in a word such as “learning” on their website search and visualization tool, NeuroSynth, and they give you back a pattern of brain activity that you should expect to see during a learning task.
But that’s not all. Given a pattern of brain activation, the system can perform a reverse inference, asking, “given the data that I’m observing, what is the most probable behavioral state that this brain is in?”
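Mathematically, that reverse inference is just Bayes’ rule: P(term | activation) ∝ P(activation | term) × P(term). Here’s a toy sketch with made-up numbers — NeuroSynth’s real computation runs over tens of thousands of studies and voxels, not four coarse regions:

```python
# Toy reverse inference in the spirit of NeuroSynth (made-up numbers):
# given an observed activation pattern, which term is most probable?
import numpy as np

terms = ["learning", "pain", "language"]
# Hypothetical P(region active | term) for four coarse brain regions.
p_active = np.array([
    [0.8, 0.2, 0.1, 0.3],   # learning
    [0.1, 0.7, 0.6, 0.2],   # pain
    [0.2, 0.1, 0.3, 0.9],   # language
])
prior = np.full(3, 1 / 3)            # uniform prior over terms
observed = np.array([1, 0, 0, 1])    # the activation pattern we see

# Likelihood of the observation under each term, regions independent.
likelihood = np.prod(np.where(observed == 1, p_active, 1 - p_active), axis=1)
posterior = likelihood * prior
posterior /= posterior.sum()

for term, p in zip(terms, posterior):
    print(f"P({term} | data) = {p:.2f}")
```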
Similarly, in late 2010, my wife (Jessica Voytek) and I undertook a project to algorithmically discover associations between concepts in the peer-reviewed neuroscience literature. As a neuroscientist, the goal of my research is to understand relationships between the human brain, behavior, physiology, and disease. Unfortunately, the facts that tie all that information together are locked away in more than 21 million static peer-reviewed scientific publications.
How many undergrads would I need to hire to read through that many papers? Any volunteers?
Even more mind-boggling, each year more than 30,000 neuroscientists attend the annual Society for Neuroscience conference. If we assume that only two-thirds of those people actually do research, and if we assume that they only work a meager (for the sciences) 40 hours a week, that’s around 40 million person-hours (20,000 researchers × 40 hours × 50 weeks) dedicated to but one branch of the sciences.
Annually.
This means that in the 10 years I’ve been attending that conference, more than 400 million person-hours have gone toward the pursuit of understanding the brain. Humanity built the pyramids in 30 years. The Apollo Project got us to the moon in about eight.
So my wife and I said to ourselves, “there has to be a better way”.
Which led us to create brainSCANr, a simple (simplistic?) tool (currently itself under peer review) that makes the assumption that the more often that two concepts appear together in the titles or abstracts of published papers, the more likely they are to be associated with one another.
For example, if 10,000 papers mention “Alzheimer’s disease” that also mention “dementia,” then Alzheimer’s disease is probably related to dementia. In fact, there are 17,087 papers that mention Alzheimer’s and dementia, whereas there are only 14 papers that mention Alzheimer’s and, for example, creativity.
From this, we built what we’re calling the “cognome”, a mapping between brain structure, function, and disease.
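The core computation behind this is nothing more exotic than co-occurrence counting. Here’s a minimal sketch of the idea over a few hypothetical abstracts — the real pipeline runs over millions of PubMed records:

```python
# Minimal sketch of co-occurrence association (hypothetical abstracts):
# two terms are associated if they appear together in many papers.
from collections import Counter
from itertools import combinations

abstracts = [
    "alzheimer's disease is the most common cause of dementia",
    "serotonin signaling in the striatum modulates activity",
    "dementia and alzheimer's disease share tau pathology",
    "serotonin dysregulation is implicated in migraine",
]
terms = ["alzheimer's disease", "dementia", "serotonin",
         "striatum", "migraine"]

pair_counts = Counter()
for text in abstracts:
    present = sorted(t for t in terms if t in text)
    for a, b in combinations(present, 2):
        pair_counts[(a, b)] += 1

for (a, b), n in pair_counts.most_common():
    print(f"{a} <-> {b}: {n} shared abstracts")
```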
Big data, data mining, and machine learning are becoming critical tools in the modern scientific arsenal. Examples abound: text mining recipes to find cultural food taste preferences, analyzing cultural trends via word use in books (“culturomics”), identifying seasonality of mood from tweets, and so on.
But so what?
Deep data
What those three studies show us is that it’s possible to automate, or at least semi-automate, critical aspects of the scientific method itself. Schmidt and Lipson show that it is possible to extract equations that perfectly model even seemingly chaotic systems. Yarkoni and colleagues show that it is possible to infer a complex behavioral state given input brain data.
My wife and I wanted to show that brainSCANr could be put to work for something more useful than just quantifying relationships between terms. So we created a simple algorithm to perform what we’re calling “semi-automated hypothesis generation,” which is predicated on a basic “the friend of a friend should be a friend” concept.
For example, the neurotransmitter “serotonin” has thousands of shared publications with “migraine,” as well as with the brain region “striatum.” However, migraine and striatum share only 16 publications.
That’s very odd, because in medicine there is a serotonin hypothesis for the root cause of migraines. And we (neuroscientists) know that serotonin is released in the striatum to modulate brain activity in that region. Given that those two things are true, why is there so little research regarding the role of the striatum in migraines?
Perhaps there’s a missing connection?
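In graph terms, we’re hunting for open triangles: two terms that each share many publications with a common neighbor, but few with each other. A minimal sketch with illustrative counts (the thresholds are arbitrary, not the ones brainSCANr actually uses):

```python
# "Friend of a friend" hypothesis generation (illustrative counts):
# flag term pairs that each share many papers with a common hub term
# but few papers with each other.
cooccurrence = {
    ("serotonin", "migraine"): 2327,   # illustrative, not real counts
    ("serotonin", "striatum"): 4782,
    ("migraine", "striatum"): 16,
}

def count(a, b):
    return cooccurrence.get((a, b), cooccurrence.get((b, a), 0))

terms = {"serotonin", "migraine", "striatum"}
STRONG, WEAK = 1000, 50  # arbitrary thresholds for this sketch

for hub in terms:
    others = sorted(terms - {hub})
    for i, a in enumerate(others):
        for b in others[i + 1:]:
            if count(hub, a) > STRONG and count(hub, b) > STRONG \
                    and count(a, b) < WEAK:
                print(f"candidate hypothesis: {a} <-> {b} (via {hub})")
```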
Such missing links and other outliers in our models are the essence of deep data analytics. Sure, any data scientist worth their salt can take a mountain of data and reduce it down to a few simple plots. And such plots are important because they tell a story. But those aren’t the only stories that our data can tell us.
For example, in my geoanalytics work as the data evangelist for Uber, I put some of my (definitely rudimentary) neuroscience network analytic skills to work to figure out how people move from neighborhood to neighborhood in San Francisco.
At one point, I checked to see if men and women moved around the city differently. A very simple regression model showed that the number of men who go to any given neighborhood significantly predicts the number of women who go to that same neighborhood.
No big deal.
But what was cool was seeing where the outliers were. When I looked at the model’s residuals, that’s where I found the far more interesting story. While it’s good to have a model that fits your data, knowing where the model breaks down is not only important for internal metrics, but it also makes for a more interesting story:
What’s happening in the Marina district that so many more women want to go there? And why are there so many more men in SoMa?
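Mechanically, this kind of residual analysis is simple: fit the regression, then sort neighborhoods by how far they fall from the line. A sketch with made-up trip counts (not Uber data):

```python
# Sketch of the residual analysis (made-up trip counts, not Uber data):
# fit women ~ men per neighborhood, then rank by the model's misses.
import numpy as np

neighborhoods = ["Marina", "SoMa", "Mission", "Sunset", "Nob Hill"]
men   = np.array([120.0, 400.0, 310.0, 90.0, 150.0])
women = np.array([210.0, 260.0, 300.0, 85.0, 160.0])

slope, intercept = np.polyfit(men, women, 1)   # ordinary least squares
residuals = women - (slope * men + intercept)

# The interesting story lives at the extremes of the residuals.
for i in np.argsort(residuals):
    print(f"{neighborhoods[i]:10s} residual = {residuals[i]:+7.1f}")
```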
The paradox of information
The interpretation of big data analytics can be a messy game. Maybe there are more men in SoMa because that’s where AT&T Park is. But maybe there are just five guys who live in SoMa who happen to take Uber 100 times more often than average.
While data-driven posts make for fun reading (and writing), in the sciences we need to be more careful that we don’t fall prey to ad hoc, just-so stories that sound perfectly reasonable and plausible, but which we cannot conclusively prove.
In 2008, psychologists David McCabe and Alan Castel published a paper in the journal “Cognition,” titled, “Seeing is believing: The effect of brain images on judgments of scientific reasoning”. In that paper, they showed that summaries of cognitive neuroscience findings that are accompanied by an image of a brain scan were rated as more credible by the readers.
This should cause any data scientist serious concern. In fact, I’ve formulated three laws of statistical analyses:
- The more advanced the statistical methods used, the fewer critics are available to be properly skeptical.
- The more advanced the statistical methods used, the more likely the data analyst will be to use math as a shield.
- Any sufficiently advanced statistics can trick people into believing the results reflect truth.
The first law is closely related to the “bike shed effect” (also known as Parkinson’s Law of Triviality) which states that, “the time spent on any item of the agenda will be in inverse proportion to the sum involved.”
In other words, if you try to build a simple thing such as a public bike shed, there will be endless town hall discussions wherein people argue over trivial details such as the color of the door. But if you want to build a nuclear power plant — a project so vast and complicated that most people can’t understand it — people will defer to expert opinion.
Such is the case with statistics.
If you make the mistake of going into the comments section of any news piece discussing a scientific finding, invariably someone will leave the comment, “correlation does not equal causation.”
We’ll go ahead and call that truism Voytek’s fourth law.
But people rarely have the capacity to argue against the methods and models used by, say, neuroscientists or cosmologists.
But sometimes we get perfect models without any understanding of the underlying processes. What do we learn from that?
The always fantastic Radiolab did a follow-up story on the Schmidt and Lipson “automated science” research in an episode titled “Limits of Science”. It turns out, a biologist contacted Schmidt and Lipson and gave them data to run their algorithm on, hoping to figure out the principles governing the dynamics of a single-celled bacterium. The result?
Well, sometimes the stories we tell with data … they just don’t make sense to us.
They found, “two equations that describe the data.”
But they didn’t know what the equations meant. They had no context. Their variables had no meaning. Or, as Radiolab co-host Jad Abumrad put it, “the more we turn to computers with these big questions, the more they’ll give us answers that we just don’t understand.”
So while big data projects are creating ridiculously exciting new vistas for scientific exploration and collaboration, we have to take care to avoid the Paradox of Information wherein we can know too many things without knowing what those “things” are.
Because at some point, we’ll have so much data that we’ll stop being able to discern the map from the territory. Our goal as (data) scientists should be to distill the essence of the data into something that tells as true a story as possible while being as simple as possible to understand. Or, to operationalize that sentence better, we should aim to find balance between minimizing the residuals of our models and maximizing our ability to make sense of those models.
Recently, Stephen Wolfram released the results of a 20-year-long experiment in personal data collection, including every keystroke he’s typed and every email he’s sent. In response, Robert Krulwich, the other co-host of Radiolab, concludes by saying “I’m looking at your data [Dr. Wolfram], and you know what’s amazing to me? How much of you is missing.”
Personally, I disagree; I believe that there’s a humanity in those numbers and that Mr. Krulwich is falling prey to the idea that science somehow ruins the magic of the universe. Quoth Dr. Sagan:
“It is sometimes said that scientists are unromantic, that their passion to figure out robs the world of beauty and mystery. But is it not stirring to understand how the world actually works — that white light is made of colors, that color is the way we perceive the wavelengths of light, that transparent air reflects light, that in so doing it discriminates among the waves, and that the sky is blue for the same reason that the sunset is red? It does no harm to the romance of the sunset to know a little bit about it.”
So go forth and create beautiful stories, my statistical friends. See you after peer-review.