Curse of Dimensionality

The curse of dimensionality refers to non-intuitive properties of data observed when working in high-dimensional space *, specifically related to the usability and interpretation of distances and volumes. This is one of my favourite topics in Machine Learning and Statistics: it has broad applications (it is not specific to any machine learning method), it is very counter-intuitive and hence awe-inspiring, it has profound implications for any analytics technique, and it has a ‘cool’ scary name, like some Egyptian curse!
For a quick grasp, consider this example: say you dropped a coin on a 100-meter line. How do you find it? Simple, just walk along the line and search. But what if it’s a 100 x 100 sq. m. field? It’s already getting tough, trying to search (roughly) a football ground for a single coin. But what if it’s a 100 x 100 x 100 cu. m. space?! You know, the football ground now has thirty-story height. Good luck finding a coin there! That, in essence, is the “curse of dimensionality”.

Many ML methods use Distance Measure

Most segmentation and clustering methods rely on computing distances between observations. The well-known k-Means segmentation assigns points to the nearest center. DBSCAN and hierarchical clustering also require distance metrics. Distribution- and density-based outlier detection algorithms also use a point's distance relative to other distances to mark outliers.

Supervised classification solutions like the k-Nearest Neighbours method also use the distance between observations to assign a class to an unknown observation. Support Vector Machines with distance-based kernels (the RBF kernel, for example) transform observations based on the distance between each observation and selected reference points.

A common form of recommendation system relies on distance-based similarity among user and item attribute vectors. Even when other forms of distance are used, the number of dimensions plays a role in analytic design.

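One of the most common distance metrics is the Euclidean distance, which is simply the straight-line distance between two points in multi-dimensional hyper-space. The Euclidean distance between point i and point j in n-dimensional space can be computed as:

$$d(i, j) = \sqrt{\sum_{k=1}^{n} \left(x_{ik} - x_{jk}\right)^2}$$

where x_{ik} is the value of point i along dimension k.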
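As a minimal sketch, the Euclidean distance can be computed in Python with NumPy as below. The function name `euclidean_distance` and the sample points are illustrative assumptions, not something taken from the original post.

```python
import numpy as np

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Square the per-dimension differences, sum them, take the square root.
    return float(np.sqrt(np.sum((p - q) ** 2)))

# Illustrative points (assumed for this example).
print(euclidean_distance([0, 0], [3, 4]))              # 2-D points
print(euclidean_distance([1, 2, 3, 4], [2, 3, 4, 5]))  # 4-D points
```

The same function works for any number of dimensions, which is exactly why the number of dimensions ends up mattering so much for everything built on top of it.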

Distance plays havoc in high-dimension

Consider the simple process of data sampling. Suppose the black outer box in Fig. 1 is the data universe, with data points uniformly distributed across its whole volume, and that we want to sample 1% of the observations, as enclosed by the red inner box. The black box is a hyper-cube in multi-dimensional space, with each side representing the range of values in that dimension. For the simple 3-dimensional example in Fig. 1, we may have the following ranges:

Figure 1 : Sampling

What proportion of each range should we sample to obtain that 1% sample? For 2 dimensions, 10% of each range achieves an overall 1% sample, so we may select x∈(0,10) and y∈(0,50) and expect to capture 1% of all observations. This is because (10%)² = 1%. Do you expect this proportion to be higher or lower for 3 dimensions?

Even though our search now extends in an additional direction, the proportion actually increases to 21.5%, because (21.5%)³ ≈ 1%. And it does not just increase: for just one additional dimension, it more than doubles! You can see that we have to cover more than one-fifth of each dimension just to get one-hundredth of the whole! In 10 dimensions this proportion is 63%, and in 100 dimensions, which is not an uncommon number of dimensions in real-life machine learning, one has to sample about 95% of the range along each dimension to sample 1% of the observations! In general, capturing a fraction p of the volume in d dimensions requires a fraction p^(1/d) of each range. This mind-bending result happens because volume grows exponentially with the number of dimensions, so data points become ever more spread out even when they are uniformly distributed.

This has consequences for the design of experiments and for sampling. The process becomes very computationally expensive, to the extent that the sampled region approaches the full range of the population along every dimension even though the sample itself remains much smaller than the population.
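The percentages above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the helper name `side_fraction` is an assumption made for this example, and it assumes uniformly distributed data in a hyper-cube, as in the text.

```python
def side_fraction(sample_fraction, n_dims):
    """Fraction of each dimension's range needed to enclose
    `sample_fraction` of a uniformly filled hyper-cube."""
    # A sub-cube with side fraction s holds s ** n_dims of the volume,
    # so we invert that relationship.
    return sample_fraction ** (1.0 / n_dims)

for d in (1, 2, 3, 10, 100):
    print(f"{d:>3} dimensions: {side_fraction(0.01, d):.1%} of each range")
```

Changing the `0.01` to another target fraction shows the same pattern: the required share of each range creeps towards 100% as dimensions are added.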

Consider another huge consequence of high dimensionality. Many algorithms measure the distance between two data points to define some sort of near-ness (DBSCAN, kernels, k-Nearest Neighbours) in reference to some pre-defined distance threshold. In 2 dimensions, we can imagine that two points are near if one falls within a certain radius of the other. Consider the left image in Fig. 2. What share of uniformly spread points within the black square falls inside the red circle? For a circle of radius r inside a square of side 2r, that is about

$$\frac{\pi r^2}{(2r)^2} = \frac{\pi}{4} \approx 78.5\%$$

Figure 2 : Near-ness

So if you fit the biggest possible circle inside the square, you cover about 78% of the square. Yet the biggest possible sphere inside the cube covers only

$$\frac{\frac{4}{3}\pi r^3}{(2r)^3} = \frac{\pi}{6} \approx 52.4\%$$

of the volume. This share shrinks rapidly, to about 0.25% for just 10 dimensions! What this essentially means is that in the high-dimensional world nearly every data point sits in the corners and hardly anything is at the center of the volume; in other words, the central volume shrinks to nothing because there is (almost) no center! This has huge consequences for distance-based clustering algorithms. All distances start looking the same, and any distance being more or less than another is more a random fluctuation in the data than a measure of dissimilarity!
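To check these shares for any number of dimensions, one can use the standard formula for the volume of an n-dimensional ball, π^(n/2) / Γ(n/2 + 1) · rⁿ, divided by the volume (2r)ⁿ of the enclosing hyper-cube. The formula is not given in the original post; it is brought in here as an assumption, and the function name `inscribed_ball_share` is hypothetical.

```python
import math

def inscribed_ball_share(n_dims):
    """Share of a hyper-cube's volume covered by the largest ball inside it."""
    # Volume of a unit-radius n-ball: pi^(n/2) / Gamma(n/2 + 1).
    unit_ball_volume = math.pi ** (n_dims / 2) / math.gamma(n_dims / 2 + 1)
    # The enclosing cube has side 2 (twice the radius).
    cube_volume = 2.0 ** n_dims
    return unit_ball_volume / cube_volume

for d in (2, 3, 5, 10, 20):
    print(f"{d:>2} dimensions: {inscribed_ball_share(d):.4%}")
```

The radius cancels out of the ratio, so the share depends only on the number of dimensions.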

Fig. 3 shows randomly generated 2-D data and the corresponding all-to-all distances. The Coefficient of Variation of the distances, computed as Standard Deviation divided by Mean, is 45.9%. The corresponding number for similarly generated 5-D data is 26.5%, and for 10-D data it is 19.1%. Admittedly this is one sample, but the trend supports the conclusion that in high dimensions every distance is about the same, and none is near or far!

Figure 3 : Distance Clustering
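A rough way to repeat this kind of experiment is sketched below. The original post does not say how its data was generated, so uniform random points in the unit hyper-cube, 200 points per run, and the seed are all assumptions; the exact percentages will therefore differ from the ones quoted above, but the downward trend should be visible.

```python
import numpy as np

def distance_cv(n_points, n_dims, rng):
    """Coefficient of variation (std / mean) of all pairwise
    Euclidean distances among uniformly random points."""
    points = rng.random((n_points, n_dims))
    # Pairwise differences via broadcasting: shape (n_points, n_points, n_dims).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Keep each unordered pair once and drop the zero self-distances.
    pairs = dists[np.triu_indices(n_points, k=1)]
    return pairs.std() / pairs.mean()

rng = np.random.default_rng(0)  # assumed seed, for repeatability only
for d in (2, 5, 10, 100):
    print(f"{d:>3}-D: coefficient of variation = {distance_cv(200, d, rng):.1%}")
```

Because the distances bunch together around their mean as dimensions grow, a fixed distance threshold separates "near" from "far" less and less meaningfully.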

High-dimension affects other things too

Apart from distances and volumes, the number of dimensions creates other practical problems. Solution run-time and system-memory requirements often escalate non-linearly as the number of dimensions increases. Due to the exponential increase in feasible solutions, many optimization methods cannot reach the global optimum and have to make do with a local optimum. Further, instead of a closed-form solution, optimization often has to rely on search-based algorithms like gradient descent, genetic algorithms and simulated annealing. More dimensions also introduce the possibility of correlation among variables, which can make parameter estimation difficult in regression approaches.

Dealing with High-dimension

This will be a separate blog post in itself, but correlation analysis, clustering, information value, variance inflation factor and principal component analysis are some of the ways in which the number of dimensions can be reduced.
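As one illustration of the techniques listed above, here is a minimal sketch of principal component analysis using NumPy's singular value decomposition. The function name `pca_reduce` and the synthetic data (10 observed features driven by 2 hidden factors plus a little noise) are assumptions made purely for this example.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X onto its top principal components.

    Returns the reduced data and the share of total variance
    explained by each kept component."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    return X_centered @ Vt[:n_components].T, explained[:n_components]

# Synthetic data (assumed): 2 hidden factors mixed into 10 features.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 10))

Z, explained = pca_reduce(X, 2)
print(Z.shape)                        # reduced from 10 columns to 2
print(f"{explained.sum():.2%} of variance kept")
```

When the features are strongly correlated, as they are by construction here, a handful of components keeps nearly all of the variance, so distances can be computed in far fewer dimensions.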

* The number of variables or features a data point is made up of is called the dimension of the data. For instance, any point in space can be represented using 3 co-ordinates (length, breadth and height), and so has 3 dimensions.
