WTF is computer vision?

Posted Nov 13, 2016 by Devin Coldewey, Contributor
 
Someone across the room throws you a ball and you catch it. Simple, right?

Actually, this is one of the most complex processes we’ve ever attempted to comprehend – let alone recreate. Inventing a machine that sees like we do is a deceptively difficult task, not just because it’s hard to make computers do it, but because we’re not entirely sure how we do it in the first place.

What actually happens is roughly this: light from the ball passes through your eye and strikes your retina, which does some elementary analysis and sends the signal along to the brain, where the visual cortex analyzes the image more thoroughly. The visual cortex then sends the image out to the rest of the cortex, which compares it to everything it already knows, classifies the objects and dimensions, and finally decides on something to do: raise your hand and catch the ball (having predicted its path). This takes place in a tiny fraction of a second, with almost no conscious effort, and almost never fails. So recreating human vision isn’t just a hard problem, it’s a set of them, each of which relies on the others.

Well, no one ever said this would be easy. Except, perhaps, AI pioneer Marvin Minsky, who famously instructed a graduate student in 1966 to “connect a camera to a computer and have it describe what it sees.” Pity the kid: 50 years later, we’re still working on it.

Serious research began in the 1950s and proceeded along three distinct lines: replicating the eye (difficult); replicating the visual cortex (very difficult); and replicating the rest of the brain (arguably the most difficult problem ever attempted).

To see

Reinventing the eye is the area where we’ve had the most success. Over the past few decades, we have created sensors and image processors that match and in some ways exceed the human eye’s capabilities. With larger, optically superior lenses and semiconductor subpixels fabricated at nanometer scales, the precision and sensitivity of modern cameras are nothing short of incredible. Cameras can also record thousands of images per second and detect distances with great precision.

An image sensor one might find in a digital camera.

Yet despite the high fidelity of their outputs, these devices are in many ways no better than a pinhole camera from the 19th century: They merely record the distribution of photons coming in a given direction. The best camera sensor ever made couldn’t recognize a ball — much less be able to catch it.

The hardware, in other words, is severely limited without the software — which, it turns out, is by far the greater problem to solve. But modern camera technology does provide a rich and flexible platform on which to work.

To describe

This isn’t the place for a complete course on visual neuroanatomy, but suffice it to say that our brains are built from the ground up with seeing in mind, so to speak. More of the brain is dedicated to vision than to any other task, and that specialization goes all the way down to the cells themselves. Billions of them work together to extract patterns from the noisy, disorganized signal from the retina.

Sets of neurons excite one another if there’s contrast along a line at a certain angle, say, or rapid motion in a certain direction. Higher-level networks aggregate these patterns into meta-patterns: a circle, moving upwards. Another network chimes in: the circle is white, with red lines. Another: it is growing in size. A picture begins to emerge from these crude but complementary descriptions.

A “histogram of oriented gradients,” finding edges and other features using a technique like that found in the brain’s visual areas.
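
To make the idea concrete, here is a minimal sketch of computing such a histogram of oriented gradients with the scikit-image library (my choice of tool, not one named in the article), using a sample image that ships with it:

    # A minimal sketch: histogram of oriented gradients (HOG) with scikit-image.
    from skimage import color, data
    from skimage.feature import hog

    image = color.rgb2gray(data.astronaut())  # sample image bundled with skimage

    features, hog_image = hog(
        image,
        orientations=9,          # bin gradient directions into 9 angles
        pixels_per_cell=(8, 8),  # one histogram per 8x8 patch of pixels
        cells_per_block=(2, 2),  # normalize histograms over 2x2 blocks of cells
        visualize=True,          # also return an image of the dominant edges
    )
    print(features.shape)  # a long vector describing local edge orientations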

Early research into computer vision, regarding these networks as unfathomably complex, took a different approach: “top-down” reasoning — a book looks like /this/, so watch for /this/ pattern, unless it’s on its side, in which case it looks more like /this/. A car looks like /this/ and moves like /this/.
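
In code, that top-down idea amounts to something like template matching: store a picture of the object, then scan the scene for it. A minimal sketch with OpenCV (my choice of library; the image files are hypothetical placeholders):

    # A minimal sketch of "top-down" recognition via template matching.
    import cv2

    # Hypothetical placeholder files: any grayscale photo, plus a cropped
    # picture of the object to look for.
    scene = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)
    template = cv2.imread("book_template.png", cv2.IMREAD_GRAYSCALE)

    # Slide the template across the scene, scoring the match at each position.
    scores = cv2.matchTemplate(scene, template, cv2.TM_CCOEFF_NORMED)
    _, best_score, _, best_loc = cv2.minMaxLoc(scores)

    if best_score > 0.8:  # arbitrary confidence threshold
        print("Book-like pattern at", best_loc, "score", round(best_score, 2))
    else:
        print("No match: a tilted or oddly lit book would fail this test.")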


For a few objects in controlled situations, this worked well, but imagine trying to describe every object around you, from every angle, with variations for lighting and motion and a hundred other things. It became clear that to achieve even toddler-like levels of recognition would require impractically large sets of data.

An object-recognition system from Purdue University’s E-lab, highlighting the objects it has identified in a scene.

A “bottom-up” approach mimicking what is found in the brain is more promising. A computer can apply a series of transformations to an image and discover edges, the objects they imply, perspective and movement when presented with multiple pictures, and so on. The processes involve a great deal of math and statistics, but they amount to the computer trying to match the shapes it sees with shapes it has been trained to recognize — trained on other images, the way our brains were.
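
As a minimal sketch of one such transformation (Sobel filtering, my choice of example rather than anything the article specifies), a computer can convolve an image with small filters that respond to contrast along particular directions, loosely analogous to the orientation-sensitive cells described earlier:

    # A minimal sketch of one "bottom-up" transformation: edge detection
    # by convolving an image with Sobel filters.
    import numpy as np
    from scipy.ndimage import convolve

    image = np.random.rand(64, 64)  # stand-in for a grayscale photograph

    sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    sobel_y = sobel_x.T

    gx = convolve(image, sobel_x)  # response to vertical contrast
    gy = convolve(image, sobel_y)  # response to horizontal contrast

    edge_strength = np.hypot(gx, gy)     # how sharp the contrast is
    edge_direction = np.arctan2(gy, gx)  # which way the edge is oriented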

An image like the one above shows the computer reporting that, by its calculations, the highlighted objects look and act like other examples of those objects, to a certain level of statistical certainty.

Proponents of bottom-up architecture might have said “I told you so.” Except that until recent years, the creation and operation of artificial neural networks was impractical because of the immense amount of computation they require. Advances in parallel computing have eroded those barriers, and the last few years have seen an explosion of research into and using systems that imitate — still very approximately — the ones in our brain. The process of pattern recognition has been sped up by orders of magnitude, and we’re making more progress every day.
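
For a sense of what such a network looks like in practice, here is a minimal sketch of a tiny convolutional network in PyTorch (my choice of framework; the layer sizes are illustrative, not drawn from the article):

    # A minimal sketch of an artificial neural network for images.
    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(1, 8, kernel_size=3, padding=1),   # learn edge-like filters
        nn.ReLU(),
        nn.MaxPool2d(2),                             # pool features spatially
        nn.Conv2d(8, 16, kernel_size=3, padding=1),  # combine edges into shapes
        nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(16 * 7 * 7, 10),                   # score ten object classes
    )

    fake_image = torch.randn(1, 1, 28, 28)  # one 28x28 grayscale image
    print(model(fake_image).shape)          # torch.Size([1, 10])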

To understand

Of course, you could build a system that recognizes every variety of apple, from every angle, in any situation, at rest or in motion, with bites taken out of it, anything — and it wouldn’t be able to recognize an orange. For that matter, it couldn’t even tell you what an apple is, whether it’s edible, how big it is or what it’s used for.

The problem is that even good hardware and software aren’t much use without an operating system.

For us, that’s the rest of our minds: short- and long-term memory, input from our other senses, attention and cognition, a billion lessons learned from a trillion interactions with the world, written with methods we barely understand to a network of interconnected neurons more complex than anything we’ve ever encountered.


This is where the frontiers of computer science and more general artificial intelligence converge — and where we’re currently spinning our wheels. Among computer scientists, engineers, psychologists, neuroscientists and philosophers, we can barely come up with a working definition of how our minds work, much less how to simulate it.

That doesn’t mean we’re at a dead end. The future of computer vision is in integrating the powerful but specific systems we’ve created with broader ones that are focused on concepts that are a bit harder to pin down: context, attention, intention.

That said, computer vision even in its nascent stage is still incredibly useful. It’s in our cameras, recognizing faces and smiles. It’s in self-driving cars, reading traffic signs and watching for pedestrians. It’s in factory robots, monitoring for problems and navigating around human co-workers. There’s still a long way to go before they see like we do — if it’s even possible — but considering the scale of the task at hand, it’s amazing that they see at all.

FEATURED IMAGE: BRYCE DURBIN

 
 
