What does it mean for an algorithm to be fair
In 2014 the White House commissioned a 90-day study that culminated in a report (pdf) on the state of “big data” and related technologies. The authors give many recommendations, including this central warning.
Warning: algorithms can facilitate illegal discrimination!
Here’s a not-so-imaginary example of the problem. A bank wants people to take out loans with high interest rates, and it also serves ads for these loans. A modern idea is to use an algorithm to decide, based on the sliver of known information about a user visiting a website, which advertisement to present that gives the largest chance of the user clicking on it. There’s one problem: these algorithms are trained on historical data, and poor uneducated people (often racial minorities) have a historical trend of being more likely to succumb to predatory loan advertisements than the general population. So an algorithm that is “just” trying to maximize clickthrough may also be targeting black people, de facto denying them opportunities for fair loans. Such behavior is illegal.

On the other hand, even if algorithms are not making illegal decisions, by training algorithms on data produced by humans, we naturally reinforce prejudices of the majority. This can have negative effects, like Google’s autocomplete finishing “Are transgenders” with “going to hell?” Even if this is the most common question being asked on Google, and even if the majority think it’s morally acceptable to display this to users, this shows that algorithms do in fact encode our prejudices. People are slowly coming to realize this, to the point where it was recently covered in the New York Times.
There are many facets to the algorithmic fairness problem, and it is one that has not even been widely acknowledged as a problem, despite the Times article. The message has been echoed by machine learning researchers but mostly ignored by practitioners. In particular, “experts” continually make ignorant claims such as “equations can’t be racist,” as in the following quote from the above-linked article about how the Chicago Police Department has been using algorithms to do predictive policing.
Wernick denies that [the predictive policing] algorithm uses “any racial, neighborhood, or other such information” to assist in compiling the heat list [of potential repeat offenders].
Why is this ignorant? Because of the well-known fact that removing explicit racial features from data does not eliminate an algorithm’s ability to learn race. If racial features disproportionately correlate with crime (as they do in the US), then an algorithm which learns race is actually doing exactly what it is designed to do! One needs to be very thorough to say that an algorithm does not “use race” in its computations. Algorithms are not designed in a vacuum, but rather in conjunction with the designer’s analysis of their data. There are two points of failure here: the designer can unwittingly encode biases into the algorithm based on a biased exploration of the data, and the data itself can encode biases due to human decisions made to create it. Because of this, the burden of proof is (or should be!) on the practitioner to guarantee they are not violating discrimination law. Wernick should instead prove mathematically that the policing algorithm does not discriminate.
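To see how easily this happens, here is a minimal sketch in Python. The data is entirely synthetic and the feature names (zip code, income) are hypothetical stand-ins for real proxy variables; the point is only that a model never shown a race column can still recover race from correlated features.

```python
# A minimal sketch (synthetic data, hypothetical features) showing that
# dropping an explicit race column does not stop a model from learning race
# through correlated proxies like zip code.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 10_000

# Synthetic population: race is never shown to the model...
race = rng.integers(0, 2, size=n)
# ...but zip code is strongly correlated with race (residential segregation),
# and income is mildly correlated with it.
zip_code = race + rng.normal(0, 0.3, size=n)
income = 50 - 10 * race + rng.normal(0, 5, size=n)

X = np.column_stack([zip_code, income])  # note: no race column

# Train a model to predict race from the supposedly "race-blind" features.
X_train, X_test, y_train, y_test = train_test_split(X, race, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print(f"race recovered from proxies: {clf.score(X_test, y_test):.0%} accuracy")
# Typically well above 90%: the proxies redundantly encode race, so any model
# trained on them can discriminate by race without ever "using" it.
```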
While that viewpoint is idealistic, it’s a bit naive because there is no accepted definition of what it means for an algorithm to be fair. In fact, from a precise mathematical standpoint, there isn’t even a precise legal definition of what it means for any practice to be fair. In the US the existing legal theory is called disparate impact, which states that a practice can be considered illegal discrimination if it has a “disproportionately adverse” effect on members of a protected group. Here “disproportionate” is precisely defined by the 80% rule, but this is somehow not enforced as stated. As with many legal issues, laws are broad assertions that are challenged on a case-by-case basis. In the case of fairness, the legal decision usually hinges on whether an individual was treated unfairly, because the individual is the one who files the lawsuit. Our understanding of the law is cobbled together, essentially through anecdotes slanted by political agendas. A mathematician can’t make progress with that. We want the mathematical essence of fairness, not something that can be interpreted depending on the court majority.
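For concreteness, the 80% rule (the EEOC’s “four-fifths rule”) flags a practice when the selection rate for the protected group falls below 80% of the rate for the most-favored group. Here is a toy check, with made-up numbers:

```python
# A minimal sketch of the "80% rule" (four-fifths rule) from US disparate
# impact doctrine: flag a practice if the selection rate for the protected
# group is less than 80% of the rate for the most-favored group.
def passes_four_fifths_rule(selected_protected, total_protected,
                            selected_majority, total_majority):
    rate_protected = selected_protected / total_protected
    rate_majority = selected_majority / total_majority
    return rate_protected / rate_majority >= 0.8

# Example: 30 of 100 protected applicants approved vs. 60 of 100 others.
# The ratio is 0.30 / 0.60 = 0.5 < 0.8, so this practice would be flagged.
print(passes_four_fifths_rule(30, 100, 60, 100))  # False
```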
The problem is exacerbated for data mining because the practitioners often demonstrate a poor understanding of statistics, the management doesn’t understand algorithms, and almost everyone is lulled into a false sense of security via abstraction (remember, “equations can’t be racist”). Experts in discrimination law aren’t trained to audit algorithms, and engineers aren’t trained in social science or law. The speed with which research becomes practice far outpaces the speed at which anyone can keep up. This is especially true at places like Google and Facebook, where teams of in-house mathematicians and algorithm designers bypass the delay between academia and industry.
And perhaps the worst part is that even the world’s best mathematicians and computer scientists don’t know how to interpret the output of many popular learning algorithms. This isn’t just a problem of stupid people not listening to smart people; it’s that everyone is “stupid.” A more politically correct way to say it: transparency in machine learning is a wide open problem. Take, for example, deep learning. A far-removed adaptation of neuroscience to data mining, deep learning has become the flagship technique spearheading modern advances in image tagging, speech recognition, and other classification problems.
A typical example of how a deep neural network learns to tag images. Image source: http://engineering.flipboard.com/2015/05/scaling-convnets/
The picture above shows how low level “features” (which essentially boil down to simple numerical combinations of pixel values) are combined in a “neural network” to more complicated image-like structures. The claim that these features represent natural concepts like “cat” and “horse” have fueled the public attention on deep learning for years. But looking at the above, is there any reasonable way to say whether these are encoding “discriminatory information”? Not only is this an open question, but we don’t even know what kinds of problems deep learning can solve! How can we understand to what extent neural networks can encode discrimination if we don’t have a deep understanding of why a neural network is good at what it does?
What makes this worse is that there are only about ten people in the world who understand the practical aspects of deep learning well enough to achieve record results. This means they spent a ton of time tinkering with the model to make it domain-specific, and nobody really knows whether the subtle differences between the top models correspond to genuine advances, slight overfitting, or luck. Who is to say whether the fiasco with Google tagging images of black people as apes was caused by the data, by the deep learning algorithm, or by some obscure tweak made by the designer? I doubt even the designer could tell you with any certainty.
Opacity and a lack of interpretability is the rule more than the exception in machine learning. Celebrated techniques like Support Vector Machines, Boosting, and the recently popular “tensor methods” are all highly opaque. This means that even if we knew what fairness meant, it would still be a challenge (though one we’d be well suited for) to modify existing algorithms to become fair. But with recent success stories in theoretical computer science connecting security, trust, and privacy, computer scientists have started to take up the call of nailing down what fairness means, and how to measure and enforce fairness in algorithms. There is now a yearly workshop called Fairness, Accountability, and Transparency in Machine Learning (FAT-ML, an awesome acronym), and some famous theory researchers are starting to get involved, as are social scientists and legal experts. Full disclosure: two days ago I gave a talk as part of this workshop on modifications to AdaBoost that seem to make it more fair. More on that in a future post.
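To give a flavor of what a “fairness intervention” even looks like, here is a sketch of one standard preprocessing idea from the literature, Kamiran and Calders’ “reweighing.” To be clear, this is not the AdaBoost modification from my talk, just an illustrative baseline: reweight training examples so that, under the weighted distribution, the protected attribute and the label look statistically independent.

```python
# A sketch of one standard preprocessing intervention ("reweighing", due to
# Kamiran and Calders) -- NOT the AdaBoost modification from the talk. Each
# training example gets a weight so that, under the weighted distribution,
# the protected attribute s and the label y are statistically independent.
import numpy as np

def reweighing_weights(s, y):
    """s, y: binary arrays. Returns one weight per example,
    w(s_i, y_i) = P(s_i) * P(y_i) / P(s_i, y_i)."""
    s, y = np.asarray(s), np.asarray(y)
    n = len(s)
    w = np.ones(n)
    for sv in (0, 1):
        for yv in (0, 1):
            mask = (s == sv) & (y == yv)
            if mask.any():
                p_joint = mask.sum() / n
                w[mask] = (s == sv).mean() * (y == yv).mean() / p_joint
    return w

# These weights can be passed to most learners, e.g.
#   clf.fit(X, y, sample_weight=reweighing_weights(s, y))
# which upweights the examples that are "surprising" given the historical bias.
```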
From our perspective, we the computer scientists and mathematicians, the central obstacle is still that we don’t have a good definition of fairness.
In the next post I want to get a bit more technical. I’ll describe the parts of the fairness literature I like (which will be biased), I’ll hypothesize about the tension between statistical fairness and individual fairness, and I’ll entertain ideas on how someone designing a controversial algorithm (such as a predictive policing algorithm) could maintain transparency and accountability over its discriminatory impact. In subsequent posts I want to explain in more detail why it seems so difficult to come up with a useful definition of fairness, and to describe some of the ideas I and my coauthors have worked on.
Until then!