Pushing state-of-the-art in 3D content understanding

2019-10-31 06:34:08

This blog is copied from: https://ai.facebook.com/blog/pushing-state-of-the-art-in-3d-content-understanding/

In order to interpret the world around us, AI systems must understand visual scenes in three dimensions. This need extends beyond robotics, navigation, and even augmented reality applications. Even with 2D photos and videos, the scenes and objects depicted are themselves three-dimensional, of course, and truly intelligent content-understanding systems must be able to recognize the geometry of a cup’s handle when it’s being rotated in a video, or identify which objects are in the foreground and background of a photo.

Today, we’re sharing details on several new Facebook AI research projects that advance the state of the art in 3D image understanding in different but complementary ways. This work, which is being presented at the International Conference on Computer Vision (ICCV) in Seoul, addresses a variety of use cases and circumstances, with different types and amounts of training data and inputs.

  • Mesh R-CNN is a novel, state-of-the-art method to predict the most accurate 3D shapes in a wide range of real-world 2D images. This method, which leverages our general Mask R-CNN framework for object instance segmentation, can detect even complex objects, such as the legs of a chair or overlapping furniture.

  • Using an alternative and complementary approach to Mesh R-CNN, termed C3DPO, we’re the first to achieve a successful large-scale 3D reconstruction of nonrigid shapes on three benchmarks for more than 14 object categories by interpreting 3D geometry. We achieve this using only 2D keypoints and zero 3D annotations.

  • We’ve introduced a novel method to learn associations between images and 3D shapes while significantly reducing the need for annotated training examples. This brings us closer to self-supervised systems that can create 3D representations for more kinds of objects.

  • We’ve developed a novel technique, called VoteNet, to perform object detection when 3D input from LIDAR or other sensors is available. While most traditional systems for this task depend on 2D image signals, ours is based purely on 3D point clouds and achieves higher precision than prior work.

This research builds on recent advances in using deep learning to predict and localize objects in an image, as well as new tools and architectures for 3D shape understanding, like voxels, point clouds, and meshes. The field of computer vision extends to a wide range of tasks, but 3D understanding will play a central role in advancing the ability of AI systems to more closely understand, interpret, and operate in the real world.

Achieving state-of-the-art in predicting 3D shapes of unconstrained, obstructed objects

Perception systems like Mask R-CNN are powerful and versatile tools for understanding images. But because they make predictions in 2D, they ignore the 3D structure of the world. Leveraging the advances in 2D perception, we designed a 3D object reconstruction model that predicts 3D object shapes from unconstrained real-world images with a range of optical challenges, including objects with occlusion, clutter, and diverse topologies. Adding a third dimension to object detection systems that are robust against such complexities requires stronger engineering capabilities, and current engineering frameworks have hindered progress in this area.

 
Mesh R-CNN takes an input image, predicts object instances in that image, and infers their 3D shape. To capture diversity in geometries and topologies, it first predicts coarse voxels, which are refined for accurate mesh predictions.

To address these challenges, we augmented Mask R-CNN’s 2D object segmentation system with a mesh prediction branch, and we built Torch3d, a PyTorch library with highly optimized 3D operators, to implement the system. Mesh R-CNN uses Mask R-CNN to detect and classify the various objects in an image. It then infers 3D shapes with a novel mesh predictor, which takes a hybrid approach of voxel prediction followed by mesh refinement. This two-step process enables us to achieve better results than prior work for predicting fine-grained 3D structures. Torch3d helps make this possible by enabling efficient, flexible, and modular implementation of complex operations, like chamfer distance, differentiable mesh sampling, and a differentiable renderer.
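To make one of the operations named above concrete, here is a minimal sketch of a chamfer distance between two point sets. It is a brute-force, pure-Python illustration and not Torch3d’s API: the function name `chamfer_distance` and the choice of averaged squared distances summed over both directions are assumptions made for this example, and other conventions exist.

```python
def chamfer_distance(points_a, points_b):
    """Symmetric chamfer distance between two non-empty 3D point sets.

    Assumed convention (illustrative only): for each point, take the squared
    distance to its nearest neighbour in the other set, average over the set,
    and sum the two directions.
    """
    def one_way(src, dst):
        total = 0.0
        for p in src:
            # Squared Euclidean distance from p to its nearest point in dst.
            total += min(sum((pi - qi) ** 2 for pi, qi in zip(p, q)) for q in dst)
        return total / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


# Two point sets sampled from a "predicted" and a "target" surface.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
distance = chamfer_distance(predicted, target)
```

This version costs O(N·M) comparisons, which is why an optimized library implementation matters when the point sets are sampled densely from meshes.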

We use Detectron2 to implement the resulting system, which uses RGB images as input in order to both detect objects and predict 3D shapes. Similar to Mask R-CNN’s use of supervised learning for strong 2D perception, our novel approach learns 3D prediction using fully supervised learning with pairs of images and meshes. For training, we use the Pix3D data set, composed of 10,000 pairs of images and meshes, which is significantly smaller than typical 2D benchmarks, which contain hundreds of thousands of images and object annotations.

We evaluated Mesh R-CNN on two data sets and achieved strong results on both. On the Pix3D data set, Mesh R-CNN is the first system to be able to jointly detect objects of all categories and estimate their full 3D shape across diverse, cluttered, and occluded scenes of furniture. Previous work focused on evaluating models that were trained on perfectly cropped, unoccluded image segments. And on the ShapeNet data set, our hybrid approach of voxel prediction and mesh refinement outperforms prior work by a 7 percent relative margin.

 
System overview of Mesh R-CNN. We augment Mask R-CNN with 3D shape inference.

Accurately predicting and reconstructing the shapes of unconstrained scenes in the real world is an important step toward enhancing new experiences, like virtual reality and other forms of telepresence. Still, gathering annotated data for 3D images is substantially more complex and time-consuming than doing so for 2D images, which is why data sets for 3D shape prediction have lagged compared with their 2D counterparts. We’re therefore exploring different approaches to leveraging both supervised and self-supervised learning for reconstructing objects in 3D.

Read the full paper on Mesh R-CNN here.

Reconstructing 3D object categories with 2D keypoints

For scenarios where meshes and corresponding images are not available for training and full reconstruction of static objects or scenes is not necessary, we’ve developed an alternative approach. Our new C3DPO (Canonical 3D Pose Networks) system builds reconstructions of 3D keypoint models and achieves state-of-the-art reconstruction results using the more widely accessible and abundant 2D keypoint supervision. C3DPO helps us understand the 3D geometry of objects in a weakly supervised fashion suitable for large-scale deployment.

 
C3DPO generates 3D keypoints from detected 2D keypoints for a range of object categories, accurately differentiating between viewpoint changes and shape deformations.

2D keypoints, which track specific parts of the object category (e.g., human joints or bird wings), provide a complete set of cues about the object geometry and its deformations, or viewpoint changes. The resulting 3D keypoints are useful, for instance, in modeling 3D faces and full-body meshes for more lifelike avatar graphics in VR. Similar to Mesh R-CNN, C3DPO reconstructs 3D objects using unconstrained images with occlusions and missing values.

C3DPO is the first method capable of reconstructing data sets consisting of hundreds of thousands of images with several thousand 2D keypoints. We achieve state-of-the-art reconstruction accuracy on three different data sets for more than 14 diverse nonrigid object categories. And we’ve made the code for this work available here.

Our model has two important innovations. First, given a set of monocular 2D keypoints, our new 3D reconstruction network predicts the parameters of the corresponding camera viewpoint as well as the 3D keypoint locations in a canonical orientation. Second, we introduce a novel regularization technique termed canonicalization, which consists of a second auxiliary deep network that learns alongside the 3D reconstruction network. This technique addresses the ambiguity that comes with factorizing 3D viewpoint and shape. These two innovations enable us to capture much better statistical models of the data than is possible with traditional approaches.
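The factorization into viewpoint and canonical shape, and the ambiguity it brings, can be illustrated with a toy sketch. Everything here is a simplifying assumption for illustration rather than the C3DPO implementation: the camera is orthographic, the viewpoint is a single rotation angle about the y axis, and the helper names `rotate_y`, `project_orthographic`, and `reprojection_error` are hypothetical.

```python
import math


def rotate_y(points, angle):
    """Rotate 3D points about the y axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in points]


def project_orthographic(points):
    """Assumed camera model: drop the depth coordinate."""
    return [(x, y) for x, y, _ in points]


def reprojection_error(keypoints_2d, canonical_3d, angle):
    """Mean squared distance between observed 2D keypoints and the
    projection of the canonical shape seen from viewpoint `angle`."""
    projected = project_orthographic(rotate_y(canonical_3d, angle))
    return sum(
        (u - pu) ** 2 + (v - pv) ** 2
        for (u, v), (pu, pv) in zip(keypoints_2d, projected)
    ) / len(keypoints_2d)


canonical = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
observed = project_orthographic(rotate_y(canonical, 0.7))

# The true (shape, viewpoint) pair explains the observation...
error_true = reprojection_error(observed, canonical, 0.7)
# ...but so does a pre-rotated shape paired with a compensating viewpoint.
error_alt = reprojection_error(observed, rotate_y(canonical, 0.3), 0.4)
```

Both pairs reproduce the observed keypoints equally well, so a reprojection objective alone cannot tell them apart. That is the kind of ambiguity a regularizer such as canonicalization is meant to resolve by favoring one consistent orientation per shape.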

Such reconstructions were previously unachievable mainly because of memory constraints with the previous matrix-factorization-based methods, which, unlike our deep network, cannot operate in a “minibatch” regime. Previous methods addressed the modeling of deformations by leveraging multiple simultaneous images and establishing correspondences between instantaneous 3D reconstructions, which requires hardware that’s mostly found in special labs. The efficiencies introduced by C3DPO make it possible to enable 3D reconstruction in cases where employing hardware for 3D capture isn’t feasible, such as with large-scale objects like airplanes. Read the full paper on C3DPO here.

Learning pixel-to-surface mappings from image collections

 
Our system learns a parameterized convolutional neural network (CNN) that takes an image as input and predicts a per-pixel canonical surface map that indicates the corresponding point on the template shape. The similar coloring of the predicted canonical surface mapping between the 2D image and 3D shape implies correspondence.

We take a step further toward reducing the supervision required for developing 3D understanding for generic classes of objects. We introduce an approach that can leverage unannotated image collections with approximate automatic instance segmentations. Instead of explicitly predicting the 3D structure underlying an image, we tackle a complementary task of mapping pixels in an image to the surface of a category-level template for 3D shapes.

Not only does this mapping allow us to understand the image in the context of a category-level 3D shape, but it also gives us the ability to generalize correspondences between objects of the same class or category. For instance, when we see the highlighted beak of the bird in the left image, we can easily locate the corresponding point in the image on the right.

This is possible because we intuitively understand the shared 3D structure across these instances. Our novel approach of mapping pixels of images to a canonical 3D surface enables our learned system to have this capability as well. When evaluating our approach by measuring its accuracy of transferring correspondences across instances, we achieved results that are about twice as accurate as previous self-supervised methods that did not leverage the underlying 3D structure of the task.

Our key insight – which allows learning with significantly less supervision – is that mapping from pixel to 3D surface can be paired with the inverse operation (going from 3D to pixel) in order to complete a cycle. Our novel approach operationalizes this and can learn using only unannotated, free, publicly available image collections with approximate segmentations from a detection method. Our resulting system can be used off the shelf, applied generally alongside other methods of top-down 3D prediction to provide a complementary pixelwise 3D understanding, and we’ve released the code here.

As demonstrated by the consistency of the colors of the cars that are moving in the video above, our system yields an invariant pixelwise embedding for objects undergoing motion and rotation. This consistency extends beyond a specific instance and can be useful in scenarios where we need to understand the commonalities across objects.

 
Instead of learning the 2D-to-2D correspondence between two images directly, we learn a 2D-to-3D correspondence and ensure consistency with a 3D-to-2D reprojection. This consistent cycle serves as a supervisory signal for learning the 2D-to-3D correspondence.
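The cycle can be sketched as a loss that maps each pixel to the 3D surface, projects it back, and penalizes how far it lands from where it started. This is a toy illustration under stated assumptions: in the real system the pixel-to-surface map is a learned CNN and the reprojection depends on a camera, whereas here both are plain Python functions, and the names `cycle_consistency_loss`, `good_map`, `shifted_map`, and `project` are hypothetical.

```python
def cycle_consistency_loss(pixels, pixel_to_surface, surface_to_pixel):
    """Mean squared distance between each pixel and its round trip
    pixel -> 3D surface point -> reprojected pixel."""
    total = 0.0
    for u, v in pixels:
        point_3d = pixel_to_surface((u, v))
        u_back, v_back = surface_to_pixel(point_3d)
        total += (u - u_back) ** 2 + (v - v_back) ** 2
    return total / len(pixels)


# Stand-in camera: orthographic reprojection onto the image plane.
def project(point_3d):
    x, y, _ = point_3d
    return (x, y)


# Stand-in predictors mapping a pixel onto a flat template at depth 1.
def good_map(pixel):
    u, v = pixel
    return (u, v, 1.0)


def shifted_map(pixel):
    u, v = pixel
    return (u + 0.5, v, 1.0)  # systematically lands on the wrong surface point


pixels = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0)]
loss_good = cycle_consistency_loss(pixels, good_map, project)
loss_shifted = cycle_consistency_loss(pixels, shifted_map, project)
```

The consistent mapping incurs zero loss while the shifted one is penalized, and no correspondence annotation is needed to compute either value, which is what makes the cycle usable as a training signal.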

For instance, if we train a system to learn the correct place to sit on a chair or where to grasp a mug, our representation can be useful the next time the system needs to understand where to sit on a different chair or how to grasp another mug. Such tasks can not only help deepen our understanding of traditional 2D images and video content, but also enhance AR/VR experiences by transferring representations of objects. Read more about canonical surface mapping here.

Improving the fundamentals of object detection in current 3D systems

As leading-edge technologies, like autonomous agents and systems to scan 3D spaces, continue to advance, we need to push forward the mechanisms for detecting objects when 3D data is readily available. In these cases, a 3D scene understanding system needs to know what objects are in a scene and where they are in order to support high-level tasks like navigation. We’ve improved upon existing systems by constructing VoteNet, a highly accurate end-to-end 3D object detection network tailored for point clouds, which was nominated for the Best Paper Award at ICCV 2019. Unlike traditional systems for this task, which depend on 2D image signals, ours is one of the first systems based purely on 3D point clouds. This approach is more efficient and achieves much higher recognition precision than previous works.

Our model, which we’ve open-sourced here, achieves state-of-the-art 3D detection, improving mAP (mean average precision) over all previous 3D object detection methods by at least 3.7 on SUN RGB-D and 18.4 on ScanNet. VoteNet does so using only geometric information, without relying on standard color images.

VoteNet has a simple design, compact model size, and high efficiency, processing a full scene in about 100 milliseconds with a smaller memory footprint than previous methods designed for research. Our algorithm takes in 3D point clouds from depth cameras and returns 3D bounding boxes of objects with their semantic classes.

 
Illustration of the VoteNet architecture for 3D object detection in point clouds.

We introduce a voting mechanism that’s inspired by the classical Hough voting algorithm. Using this method, we essentially generate new points that lie close to object centers, and these points can then be grouped and aggregated to generate box proposals. With the basic idea of voting, which is learned through deep neural networks, a set of 3D seed points vote for object centers in order to recover where the objects are and what they are.
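The vote-then-group idea can be sketched in a few lines. This is only an illustration under stated assumptions: in VoteNet the offsets are predicted by a deep network and the grouping and aggregation are part of the learned model, whereas here the offsets are hand-set and the grouping is a simple greedy radius rule. The names `cast_votes` and `group_votes` are hypothetical.

```python
import math


def cast_votes(seeds, offsets):
    """Each seed point votes for an object center at seed + offset."""
    return [
        tuple(s + o for s, o in zip(seed, offset))
        for seed, offset in zip(seeds, offsets)
    ]


def centroid(points):
    """Coordinate-wise mean of a non-empty list of 3D points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def group_votes(votes, radius):
    """Greedy grouping (stand-in for learned aggregation): a vote joins the
    first cluster whose centroid is within `radius`, else starts a new one.
    Returns one proposed center per cluster."""
    clusters = []
    for vote in votes:
        for cluster in clusters:
            if math.dist(vote, centroid(cluster)) <= radius:
                cluster.append(vote)
                break
        else:
            clusters.append([vote])
    return [centroid(cluster) for cluster in clusters]


# Seeds on the surfaces of two objects centered at (0, 0, 0) and (5, 0, 0).
seeds = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (6, 0, 0), (4, 0, 0)]
# Hand-set offsets standing in for network predictions.
offsets = [(-1, 0, 0), (1, 0, 0), (0, -1, 0), (-1, 0, 0), (1, 0, 0)]

votes = cast_votes(seeds, offsets)
centers = group_votes(votes, radius=0.5)
```

Surface points rarely sit near an object’s center, so moving them there first makes the later grouping step much easier, which is the motivation for voting before proposing boxes.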

As the use of 3D scanners grows in the real world — already common in applications from autonomous vehicles to biomedicine — it’s important for us to be able to achieve semantic understanding of the 3D content by localizing and classifying objects of a 3D scene. Supplementing 2D cameras with more advanced depth camera sensors for 3D recognition allows us to capture a more robust view of any given scene. With VoteNet, systems can better recognize major objects in a scene, supporting tasks like placing a virtual object, or navigation and LiveMap construction.

Developing systems with richer understanding of the real world

3D computer vision has many open research questions, and we are experimenting with multiple problem statements, techniques, and methods of supervision to find the best way to push the field forward, as we did for 2D understanding. As the digital world adapts and shifts to use products like 3D Photos and immersive AR and VR experiences, we need to keep pushing sophisticated systems to more accurately understand and interact with objects in a visual scene.

It’s also part of Facebook AI’s long-term goal of developing AI systems that understand and interact with the real world as humans do. We have been creating scientific breakthroughs across a broad range of capabilities focused on narrowing the gap between physical and virtual spaces. Our latest 3D-focused research can also help improve and better populate 3D objects in Facebook AI’s simulation platform, which is important for training virtual agents to operate in the real world. In the same way that robotics pushes us to address complex challenges that come from conducting experiments in the physical world, where conditions are more unpredictable, 3D research is important for teaching systems how to understand all viewpoints of objects, even when they’re occluded, hidden, or have other optical challenges.

When combined with other senses, like tactile sensing and natural language understanding, AI systems, such as virtual assistants, can function in a way that’s more seamless and useful. Collectively, this leading-edge research helps us move one step closer to building AI systems that can more intuitively understand three dimensions in the same way that humans do.

The research papers described in this blog post are being presented at ICCV 2019, along with other new work in computer vision, including:

  • SlowFast, a method for extracting information from video using input at two different frame rates.

  • TensorMask, an alternate method of object segmentation using the dense, sliding-window technique.

Written by

Georgia Gkioxari

Research Scientist

Shubham Tulsiani

Research Scientist

David Novotny

Research Scientist
