Deep Data
There is no information on this topic in Chinese yet, and very little in English. Link to the original article:
A lot of great pieces have been written about the relatively recent surge in interest in big data and data science, but in this piece I want to address the importance of deep data analysis: what we can learn from the statistical outliers by drilling down and asking, “What’s different here? What’s special about these outliers and what do they tell us about our models and assumptions?”
The reason that big data proponents are so excited about the burgeoning data revolution isn’t just because of the math. Don’t get me wrong, the math is fun, but we’re excited because we can begin to distill patterns that were previously invisible to us due to a lack of information.
That’s big data.
Of course, data are just a collection of facts; bits of information that are only given context — assigned meaning and importance — by human minds. It’s not until we do something with the data that any of it matters. You can have the best machine learning algorithms, the tightest statistics, and the smartest people working on them, but none of that means anything until someone makes a story out of the results.
And therein lies the rub.
Do all these data tell us a story about ourselves and the universe in which we live, or are we simply hallucinating patterns that we want to see?
(Semi)Automated science
In 2009, Cornell researchers Michael Schmidt and Hod Lipson published a groundbreaking paper in “Science” titled “Distilling Free-Form Natural Laws from Experimental Data”. The premise was simple, and it essentially boiled down to the question, “Can we algorithmically extract models to fit our data?”
So they hooked up a double pendulum — a seemingly chaotic system whose movements are governed by classical mechanics — and trained a machine learning algorithm on the motion data.
Their results were astounding.
In a matter of minutes the algorithm converged on Newton’s second law of motion: f = ma. What took humanity tens of thousands of years to accomplish was completed on 32 cores in essentially no time at all.
In 2011, some neuroscience colleagues of mine, led by Tal Yarkoni, published a paper in “Nature Methods” titled “Large-scale automated synthesis of human functional neuroimaging data”. In this paper the authors sought to extract patterns from the overwhelming flood of brain imaging research.
To do this they algorithmically extracted the 3D coordinates of significant brain activations from thousands of neuroimaging studies, along with words that frequently appeared in each study. Using these two pieces of data along with some simple (but clever) mathematical tools, they were able to create probabilistic maps of brain activation for any given term.
In other words, you type in a word such as “learning” on their website search and visualization tool, NeuroSynth, and they give you back a pattern of brain activity that you should expect to see during a learning task.
But that’s not all. Given a pattern of brain activation, the system can perform a reverse inference, asking, “given the data that I’m observing, what is the most probable behavioral state that this brain is in?”
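The reverse inference described above is, at heart, an application of Bayes' rule. What follows is a minimal sketch with invented numbers, not NeuroSynth's code or data; the terms, the probabilities, and the function name `reverse_inference` are all hypothetical.

```python
# Toy reverse inference via Bayes' rule.
# All numbers are invented for illustration; they are not NeuroSynth data.

# P(term): assumed prior probability that a study is about each term.
prior = {"learning": 0.2, "pain": 0.1, "language": 0.7}

# P(activation | term): assumed probability that one particular brain
# region is reported as active in studies about each term.
likelihood = {"learning": 0.6, "pain": 0.1, "language": 0.2}


def reverse_inference(prior, likelihood):
    """Return P(term | activation) for every term using Bayes' rule."""
    # P(activation) = sum over terms of P(activation | term) * P(term)
    evidence = sum(likelihood[t] * prior[t] for t in prior)
    return {t: likelihood[t] * prior[t] / evidence for t in prior}


posterior = reverse_inference(prior, likelihood)
most_probable = max(posterior, key=posterior.get)
print(most_probable, round(posterior[most_probable], 3))
```

With these made-up inputs the most probable term is "language", even though the region is more likely to activate during "learning" studies, because the assumed prior for "language" is much larger. Base rates matter as much as the activation pattern itself.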
Similarly, in late 2010, my wife (Jessica Voytek) and I undertook a project to algorithmically discover associations between concepts in the peer-reviewed neuroscience literature. As a neuroscientist, the goal of my research is to understand relationships between the human brain, behavior, physiology, and disease. Unfortunately, the facts that tie all that information together are locked away in more than 21 million static peer-reviewed scientific publications.
How many undergrads would I need to hire to read through that many papers? Any volunteers?
Even more mind-boggling, each year more than 30,000 neuroscientists attend the annual Society for Neuroscience conference. If we assume that only two-thirds of those people actually do research, and if we assume that they only work a meager (for the sciences) 40 hours a week, that’s around 40 million person-hours dedicated to but one branch of the sciences.
Annually.
This means that in the 10 years I’ve been attending that conference, more than 400 million person-hours have gone toward the pursuit of understanding the brain. Humanity built the pyramids in 30 years. The Apollo Project got us to the moon in about eight.
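As a rough check of the arithmetic above, here is the estimate spelled out. The attendance, the two-thirds fraction, and the 40-hour week come from the text; the 50 working weeks per year is an assumption the text does not state.

```python
# Back-of-the-envelope estimate of neuroscience person-hours.
attendees = 30_000      # annual conference attendance (from the text)
hours_per_week = 40     # from the text
weeks_per_year = 50     # assumption: not stated in the text

# Two-thirds of attendees are assumed to do research (from the text).
researchers = attendees * 2 // 3

annual_hours = researchers * hours_per_week * weeks_per_year
decade_hours = annual_hours * 10

print(f"{annual_hours:,} person-hours per year")
print(f"{decade_hours:,} person-hours per decade")
```

Under those assumptions the figures come out to 40,000,000 and 400,000,000, matching the numbers quoted above.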
So my wife and I said to ourselves, “there has to be a better way”.
Which led us to create brainSCANr, a simple (simplistic?) tool (currently itself under peer review) that makes the assumption that the more often two concepts appear together in the titles or abstracts of published papers, the more likely they are to be associated with one another.
For example, if 10,000 papers mention both “Alzheimer’s disease” and “dementia,” then Alzheimer’s disease is probably related to dementia. In fact, there are 17,087 papers that mention Alzheimer’s and dementia, whereas there are only 14 papers that mention Alzheimer’s and, for example, creativity.
From this, we built what we’re calling the “cognome”, a mapping between brain structure, function, and disease.
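The co-occurrence idea can be sketched in a few lines. This is an illustration under stated assumptions, not brainSCANr's implementation: the abstracts are invented, term matching is plain substring search, and the Jaccard-style normalization in `association` is one plausible choice of score that the text does not specify.

```python
from collections import Counter
from itertools import combinations

# Toy "abstracts", invented for illustration. A real run would use
# titles and abstracts pulled from a literature database.
abstracts = [
    "alzheimer's disease is the most common cause of dementia",
    "dementia risk and alzheimer's disease progression",
    "serotonin and migraine",
    "creativity in healthy adults",
    "alzheimer's disease, dementia, and memory",
]

terms = ["alzheimer's disease", "dementia", "creativity", "migraine"]


def cooccurrence_counts(abstracts, terms):
    """Count how many abstracts mention each term and each pair of terms."""
    single = Counter()
    pair = Counter()
    for text in abstracts:
        present = sorted(t for t in terms if t in text.lower())
        single.update(present)
        pair.update(combinations(present, 2))
    return single, pair


def association(a, b, single, pair):
    """Jaccard-style score: shared papers / papers mentioning either term."""
    shared = pair[tuple(sorted((a, b)))]
    either = single[a] + single[b] - shared
    return shared / either if either else 0.0


single, pair = cooccurrence_counts(abstracts, terms)
strong = association("alzheimer's disease", "dementia", single, pair)
weak = association("alzheimer's disease", "creativity", single, pair)
print(strong, weak)
```

Normalizing by the number of papers that mention either term keeps very common terms from looking associated with everything simply because they appear everywhere.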
Big data, data mining, and machine learning are becoming critical tools in the modern scientific arsenal. Examples abound: text mining recipes to find cultural food taste preferences, analyzing cultural trends via word use in books (“culturomics”), identifying seasonality of mood from tweets, and so on.
But so what?
Deep data
What those three studies show us is that it’s possible to automate, or at least semi-automate, critical aspects of the scientific method itself. Schmidt and Lipson show that it is possible to extract equations that perfectly model even seemingly chaotic systems. Yarkoni and colleagues show that it is possible to infer a complex behavioral state given input brain data.
My wife and I wanted to show that brainSCANr could be put to work for something more useful than just quantifying relationships between terms. So we created a simple algorithm to perform what we’re calling “semi-automated hypothesis generation,” which is predicated on a basic “the friend of a friend should be a friend” concept.
In the example below, the neurotransmitter “serotonin” has thousands of shared publications with “migraine,” as well as with the brain region “striatum.” However, migraine and striatum only share 16 publications.
That’s very odd. Because in medicine there is a serotonin hypothesis for the root cause of migraines. And we (neuroscientists) know that serotonin is released in the striatum to modulate brain activity in that region. Given that those two things are true, why is there so little research regarding the role of the striatum in migraines?
Perhaps there’s a missing connection?
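The “friend of a friend should be a friend” rule can be sketched as a search for two terms that are each strongly linked to a common third term but only weakly linked to each other. This is a hypothetical illustration rather than the algorithm actually used: only the count of 16 comes from the text, the other counts are placeholders standing in for “thousands”, and the `strong` and `weak` thresholds are arbitrary assumptions.

```python
from itertools import combinations

# Toy shared-publication counts between pairs of terms.
# Only the 16 is from the text; 4000 and 3000 are placeholders.
shared_papers = {
    frozenset({"serotonin", "migraine"}): 4000,
    frozenset({"serotonin", "striatum"}): 3000,
    frozenset({"migraine", "striatum"}): 16,
}


def candidate_hypotheses(shared_papers, strong=1000, weak=50):
    """Find triples (a, b, c) where a and c are weakly linked to each
    other but both strongly linked to the common term b."""
    terms = set().union(*shared_papers)
    found = []
    for a, c in combinations(sorted(terms), 2):
        if shared_papers.get(frozenset({a, c}), 0) > weak:
            continue  # a and c are already well studied together
        for b in sorted(terms - {a, c}):
            ab = shared_papers.get(frozenset({a, b}), 0)
            bc = shared_papers.get(frozenset({b, c}), 0)
            if ab >= strong and bc >= strong:
                found.append((a, b, c))
    return found


hypotheses = candidate_hypotheses(shared_papers)
print(hypotheses)
```

Each returned triple reads as “a and c are both tied to b, yet rarely studied together”, which is a prompt for a human to judge, not a finding in itself.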
Such missing links and other outliers in our models are the essence of deep data analytics. Sure, any data scientist worth their salt can take a mountain of data and reduce it down to a few simple plots. And such plots are important because they tell a story. But those aren’t the only stories that our data can tell us.
For example, in my geoanalytics work as the data evangelist for Uber, I put some of my (definitely rudimentary) neuroscience network analytic skills to work to figure out how people move from neighborhood to neighborhood in San Francisco.
At one point, I checked to see if men and women moved around the city differently. A very simple regression model showed that the number of men who go to any given neighborhood significantly predicts the number of women who go to that same neighborhood.
No big deal.
But what’s cool was seeing where the outliers were. When I looked at the models’ residuals, that’s where I found the far more interesting story. While it’s good to have a model that fits your data, knowing where the model breaks down is not only important for internal metrics, but it also makes for a more interesting story:
What’s happening in the Marina district that so many more women want to go there? And why are there so many more men in SoMa?
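Looking at residuals in this way can be sketched with an ordinary least-squares fit. The trip counts below are invented for illustration and are not Uber data; the neighborhood figures were chosen only so that two of them sit far from the fitted line.

```python
# Toy trip counts per neighborhood (invented numbers, not Uber data).
trips = {
    # neighborhood: (men, women)
    "Mission": (100, 100),
    "Marina": (80, 115),
    "SoMa": (120, 90),
    "Castro": (60, 60),
    "Richmond": (40, 40),
    "Sunset": (140, 140),
    "Haight": (20, 20),
}


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


men = [m for m, _ in trips.values()]
women = [w for _, w in trips.values()]
slope, intercept = fit_line(men, women)

# Residual = observed women minus women predicted from men.
residuals = {
    name: w - (slope * m + intercept) for name, (m, w) in trips.items()
}

# Neighborhoods ordered from "more women than predicted" to "fewer".
ranked = sorted(residuals, key=residuals.get, reverse=True)
for name in ranked:
    print(f"{name:10s} {residuals[name]:+7.2f}")
```

The fit itself is unremarkable; the ordering of the residuals is what points to the neighborhoods worth a closer look.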
The paradox of information
The interpretation of big data analytics can be a messy game. Maybe there are more men in SoMa because that’s where AT&T Park is. But maybe there are just five guys who live in SoMa who happen to take Uber 100 times more often than average.
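One quick way to tell those two explanations apart is to compare the number of trips with the number of distinct riders behind them. A small sketch, using an invented trip log with hypothetical rider IDs:

```python
from collections import Counter

# Toy trip log: one rider ID per trip into a neighborhood (invented data).
soma_trips = ["r1"] * 100 + ["r2"] * 100 + ["r3", "r4", "r5", "r6"]

trips_per_rider = Counter(soma_trips)
total_trips = len(soma_trips)
unique_riders = len(trips_per_rider)

# Share of all trips that come from the two heaviest riders.
top_two = sum(count for _, count in trips_per_rider.most_common(2))
top_share = top_two / total_trips

print(total_trips, unique_riders, round(top_share, 2))
```

If a handful of riders account for nearly all of the trips, the aggregate count says more about those individuals than about the neighborhood.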
While data-driven posts make for fun reading (and writing), in the sciences we need to be more careful that we don’t fall prey to ad hoc, just-so stories that sound perfectly reasonable and plausible, but which we cannot conclusively prove.
In 2008, psychologists David McCabe and Alan Castel published a paper in the journal “Cognition,” titled “Seeing is believing: The effect of brain images on judgments of scientific reasoning”. In that paper, they showed that summaries of cognitive neuroscience findings that are accompanied by an image of a brain scan were rated as more credible by the readers.
This should cause any data scientist serious concern. In fact, I’ve formulated three laws of statistical analyses:
- The more advanced the statistical methods used, the fewer critics are available to be properly skeptical.
- The more advanced the statistical methods used, the more likely the data analyst will be to use math as a shield.
- Any sufficiently advanced statistics can trick people into believing the results reflect truth.
The first law is closely related to the “bike shed effect” (also known as Parkinson’s Law of Triviality) which states that, “the time spent on any item of the agenda will be in inverse proportion to the sum involved.”
In other words, if you try to build a simple thing such as a public bike shed, there will be endless town hall discussions wherein people argue over trivial details such as the color of the door. But if you want to build a nuclear power plant — a project so vast and complicated that most people can’t understand it — people will defer to expert opinion.
Such is the case with statistics.
If you make the mistake of going into the comments section of any news piece discussing a scientific finding, invariably someone will leave the comment, “correlation does not equal causation.”
We’ll go ahead and call that truism Voytek’s fourth law.
But people rarely have the capacity to argue against the methods and models used by, say, neuroscientists or cosmologists.
But sometimes we get perfect models without any understanding of the underlying processes. What do we learn from that?
The always fantastic Radiolab did a follow-up story on the Schmidt and Lipson “automated science” research in an episode titled “Limits of Science”. It turns out, a biologist contacted Schmidt and Lipson and gave them data to run their algorithm on. They wanted to figure out the principles governing the dynamics of a single-celled bacterium. Their result?
Well sometimes the stories we tell with data … they just don’t make sense to us.
They found, “two equations that describe the data.”
But they didn’t know what the equations meant. They had no context. Their variables had no meaning. Or, as Radiolab co-host Jad Abumrad put it, “the more we turn to computers with these big questions, the more they’ll give us answers that we just don’t understand.”
So while big data projects are creating ridiculously exciting new vistas for scientific exploration and collaboration, we have to take care to avoid the Paradox of Information wherein we can know too many things without knowing what those “things” are.
Because at some point, we’ll have so much data that we’ll stop being able to discern the map from the territory. Our goal as (data) scientists should be to distill the essence of the data into something that tells as true a story as possible while being as simple as possible to understand. Or, to operationalize that sentence better, we should aim to find balance between minimizing the residuals of our models and maximizing our ability to make sense of those models.
Recently, Stephen Wolfram released the results of a 20-year-long experiment in personal data collection, including every keystroke he’s typed and every email he’s sent. In response, Robert Krulwich, the other co-host of Radiolab, concludes by saying “I’m looking at your data [Dr. Wolfram], and you know what’s amazing to me? How much of you is missing.”
Personally, I disagree; I believe that there’s a humanity in those numbers and that Mr. Krulwich is falling prey to the idea that science somehow ruins the magic of the universe. Quoth Dr. Sagan:
“It is sometimes said that scientists are unromantic, that their passion to figure out robs the world of beauty and mystery. But is it not stirring to understand how the world actually works — that white light is made of colors, that color is the way we perceive the wavelengths of light, that transparent air reflects light, that in so doing it discriminates among the waves, and that the sky is blue for the same reason that the sunset is red? It does no harm to the romance of the sunset to know a little bit about it.”
So go forth and create beautiful stories, my statistical friends. See you after peer-review.