A Beginner’s Guide to Eigenvectors, PCA, Covariance and Entropy

Content:

This post introduces eigenvectors and their relationship to matrices in plain language and without a great deal of math. It builds on those ideas to explain covariance, principal component analysis, and information entropy.

The eigen in eigenvector comes from German, and it means something like “very own.” For example, in German, “mein eigenes Auto” means “my very own car.” So eigen denotes a special relationship between two things. Something particular, characteristic and definitive. This car, or this vector, is mine and not someone else’s.

Matrices, in linear algebra, are simply rectangular arrays of numbers, a collection of scalar values between brackets, like a spreadsheet. All square matrices (e.g. 2 x 2 or 3 x 3) have eigenvectors, and they have a very special relationship with them, a bit like Germans have with their cars.

Linear Transformations

We’ll define that relationship after a brief detour into what matrices do, and how they relate to other numbers.

Matrices are useful because you can do things with them like add and multiply. If you multiply a vector v by a matrix A, you get another vector b, and you could say that the matrix performed a linear transformation on the input vector.

Av = b

It maps one vector v to another, b.

We’ll illustrate with a concrete example. (You can see how this type of matrix multiply, called a dot product, is performed here.)

So A turned v into b. In the graph below, we see how the matrix mapped the short, low line v, to the long, high one, b.

You could feed one positive vector after another into matrix A, and each would be projected onto a new space that stretches higher and farther to the right.

Imagine that all the input vectors v live in a normal grid, like this:

And the matrix projects them all into a new space like the one below, which holds the output vectors b:

Here you can see the two spaces juxtaposed:

(Credit: William Gould, Stata Blog)

And here’s an animation that shows the matrix’s work transforming one space to another:

The blue lines are eigenvectors.

You can imagine a matrix like a gust of wind, an invisible force that produces a visible result. And a gust of wind must blow in a certain direction. The eigenvector tells you the direction the matrix is blowing in.

So out of all the vectors affected by a matrix blowing through one space, which one is the eigenvector? It’s the one that doesn’t change direction; that is, the eigenvector is already pointing in the same direction that the matrix is pushing all vectors toward. An eigenvector is a like a weathervane. An eigenvane, as it were.

The definition of an eigenvector, therefore, is a vector that responds to a matrix as though that matrix were a scalar coefficient. In this equation, A is the matrix, x the vector, and lambda the scalar coefficient, a number like 5 or 37 or pi.

You might also say that eigenvectors are axes along which linear transformation acts, stretching or compressing input vectors. They are the lines of change that represent the action of the larger matrix.

Notice we’re using the plural – axes and lines. Just as a German may have a Volkswagen for grocery shopping, a Mercedes for business travel, and a Porsche for joy rides (each serving a distinct purpose), square matrices have as many eigenvectors as they have linearly independent variables; i.e. a 2 x 2 matrix could have two eigenvectors, a 3 x 3 matrix three, and an n x n matrix could have n eigenvectors, each one representing its line of action in one dimension.

Because eigenvectors distill the axes of principal force that a matrix moves input along, they are useful in matrix decomposition; i.e. the diagonalization of a matrix along its eigenvectors. Because those eigenvectors are representative of the matrix, they perform the same task as the autoencoders employed by deep neural networks.

To quote Yoshua Bengio:

Many mathematical objects can be understood better by breaking them into constituent parts, or finding some properties of them that are universal, not caused by the way we choose to represent them.

For example, integers can be decomposed into prime factors. The way we represent the number 12 will change depending on whether we write it in base ten or in binary, but it will always be true that 12 = 2 × 2 × 3. 

From this representation we can conclude useful properties, such as that 12 is not divisible by 5, or that any integer multiple of 12 will be divisible by 3.

Much as we can discover something about the true nature of an integer by decomposing it into prime factors, we can also decompose matrices in ways that show us information about their functional properties that is not obvious from the representation of the matrix as an array of elements.

One of the most widely used kinds of matrix decomposition is called eigen-decomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues.

Principal Component Analysis (PCA)

PCA is a tool for finding patterns in high-dimensional data such as images. Machine-learning practitioners sometimes use PCA to preprocess data for their neural networks. By centering, rotating and scaling data, PCA prioritizes dimensionality (allowing you to drop some low-variance dimensions) and can improve the neural network’s convergence speed and the overall quality of results.

To get to PCA, we’re going to quickly define some basic statistical ideas – mean, standard deviation, variance and covariance – so we can weave them together later. Their equations are closely related.

Mean is simply the average value of all x’s in the set X, which is found by dividing the sum of all data points by the number of data points, n.

Standard deviation is the square root of the average square distance of data points to the mean. In the equation below, the numerator contains the sum of the differences between each datapoint and the mean, and the denominator is simply the number of data points (minus one), producing the average distance.

Variance is the measure of the data’s spread. If I take a team of Dutch basketball players and measure their height, those measurements won’t have a lot of variance. They’ll all be grouped above six feet.

But if I throw the Dutch basketball team into a classroom of psychotic kindergartners, then the combined group’s height measurements will have a lot of variance. Variance is the spread, or the amount of difference that data expresses.

Variance is simply standard deviation squared, and is often expressed as s^2.

For both variance and standard deviation, squaring the differences between data points and the mean makes them positive, so that values above and below the mean don’t cancel each other out.

Let’s assume you plotted the age (x axis) and height (y axis) of those individuals (setting the mean to zero) and came up with an oblong scatterplot:

PCA attempts to draw straight, explanatory lines through data, like linear regression.

Each straight line represents a “principal component,” or a relationship between an independent and dependent variable. While there are as many principal components as there are dimensions in the data, PCA’s role is to prioritize them.

The first principal component bisects a scatterplot with a straight line in a way that explains the most variance; that is, it follows the longest dimension of the data. (This happens to coincide with the least error, as expressed by the red lines…) In the graph below, it slices down the length of the “baguette.”

The second principal component cuts through the data perpendicular to the first, fitting the errors produced by the first. There are only two principal components in the graph above, but if it were three-dimensional, the third component would fit the errors from the first and second principal components, and so forth.

Covariance Matrix

While we introduced matrices as something that transformed one set of vectors into another, another way to think about them is as a description of data that captures the the forces at work upon it, the forces by which two variables might relate to each other as expressed by their variance and covariance.

Imagine that we compose a square matrix of numbers that describe the variance of the data, and the covariance among variables. This is the covariance matrix. It is an empirical description of data we observe.

Finding the eigenvectors and eigenvalues of the covariance matrix is the equivalent of fitting those straight, principal-component lines to the variance of the data.

Eigenvalues are simply the coefficients attached to eigenvectors, which give the axes magnitude. In this case, they are the measure of the data’s covariance. By ranking your eigenvectors in order of their eigenvalues, highest to lowest, you get the principal components in order of significance.

For a 2 x 2 matrix, a covariance matrix might look like this:

The numbers on the upper left and lower right represent the variance of the x and y variables, respectively, while the identical numbers on the lower left and upper right represent the covariance between x and y. Because of that identity, such matrices are known as symmetrical. As you can see, the covariance is positive, since the graph near the top of the PCA section points up and to the right.

If two variables increase and decrease together (a line going up and to the right), they have a positive covariance, and if one decreases while the other increases, they have a negative covariance (a line going down and to the right).

(Credit: Vincent Spruyt)

Notice that when one variable or the other doesn’t move at all, and the graph shows no diagonal motion, there is no covariance whatsoever. Covariance answers the question: do these two variables dance together? If one remains null while the other moves, the answer is no.

Also, in the equation below, you’ll notice that there is only a small difference between covariance and variance.

vs.

The great thing about calculating covariance is that, in a high-dimensional space where you can’t eyeball intervariable relationships, you can know how two variables move together by the positive, negative or non-existent character of their covariance. (Correlation is a kind of normalized covariance, with a value between -1 and 1.)

To sum up, the covariance matrix defines the shape of the data. Diagonal spread along eigenvectors is expressed by the covariance, while x-and-y-axis-aligned spread is expressed by the variance.

Causality has a bad name in statistics, so take this with a grain of salt:

While not entirely accurate, it may help to think of each component as a causal force in the Dutch basketball player example above, with the first principal component being age; the second possibly gender; the third nationality (implying nations’ differing healthcare systems), and each of those occupying its own dimension in relation to height. Each acts on height to different degrees. You can read covariance as traces of possible cause.

Change of Basis

Because the eigenvectors of the covariance matrix are orthogonal to each other, they can be used to to reorient the data from the x and y axes to the axes represented by the principal components. You re-base the coordinate system for the dataset in a new space defined by its lines of greatest variance.

The x and y axes we’ve shown above are what’s called the basis of a matrix; that is, they provide the points of the matrix with x, y coordinates. But it is possible to recast a matrix along other axes; for example, the eigenvectors of a matrix can serve as the foundation of a new set of coordinates for the same matrix. Matrices and vectors are animals in themselves, independent of the numbers linked to a specific coordinate system like x and y.

In the graph above, we show how the same vector v can be situated differently in two coordinate systems, the x-y axes in black, and the two other axes shown by the red dashes. In the first coordinate system, v = (1,1), and in the second, v = (1,0), but v itself has not changed. Vectors and matrices can therefore be abstracted from the numbers that appear inside the brackets.

This has profound and almost spiritual implications, one of which is that there exists no natural coordinate system, and mathematical objects in n-dimensional space are subject to multiple descriptions. (Changing matrices’ bases also makes them easier to manipulate.)

Entropy & Information Gain

In information theory, the term entropy refers to information we don’t have (normally people define “information” as what they know, and jargon has triumphed once again in turning plain language on its head to the detriment of the uninitiated). The information we don’t have about a system, its entropy, is related to its unpredictability: how much it can surprise us.

If you know that a certain coin has heads embossed on both sides, then flipping the coin gives you absolutely no information, because it will be heads every time. You don’t have to flip it to know.

A balanced, two-sided coin does contain an element of surprise with each coin toss. And a six-sided die, by the same argument, contains even more surprise with each roll, which could produce any one of six results with equal frequency. Both those objects contain information in the technical sense.

Now let’s imagine the die is loaded, it comes up “three” on five out of six rolls, and we figure out the game is rigged. Suddenly the amount of surprise produced with each roll by this die is greatly reduced. We understand a trend in the die’s behavior that gives us greater predictive capacity.

Understanding the die is loaded is analogous to finding a principal component in a dataset. You simply identify an underlying pattern.

That transfer of information, from what we don’t know about the system to what we know, represents a change in entropy. Insight decreases the entropy of the system. Get information, reduce entropy. This is information gain. And yes, this type of entropy is subjective, in that it depends on what we know about the system at hand. (Fwiw, information gain is synonymous with Kullback-Leibler divergence, which we explored briefly in this tutorial on restricted Boltzmann machines.)

So each principal component cutting through the scatterplot represents a decrease in the system’s entropy, in its unpredictability.

It so happens that explaining the shape of the data one principal component at a time, beginning with the component that accounts for the most variance, is similar to walking data through a decision tree. The first component of PCA, like the first split in a properly formed decision tree, will be along the dimension that reduces unpredictability the most.

Other Resources

A Beginner’s Guide to Eigenvectors, PCA, Covariance and Entropy的更多相关文章

  1. A Beginner's Guide to Paxos

    Google Drive: A Beginner's Guide to Paxos The code ideas of Paxos protocol: 1) Optimistic concurrenc ...

  2. Beginner's Guide to Python-新手指导

    Refer English Version: http://wiki.python.org/moin/BeginnersGuide New to programming? Python is free ...

  3. A Beginner's Guide To Understanding Convolutional Neural Networks(转)

    A Beginner's Guide To Understanding Convolutional Neural Networks Introduction Convolutional neural ...

  4. (转)A Beginner's Guide To Understanding Convolutional Neural Networks Part 2

    Adit Deshpande CS Undergrad at UCLA ('19) Blog About A Beginner's Guide To Understanding Convolution ...

  5. (转)A Beginner's Guide To Understanding Convolutional Neural Networks

    Adit Deshpande CS Undergrad at UCLA ('19) Blog About A Beginner's Guide To Understanding Convolution ...

  6. 新手教程之:循环网络和LSTM指南 (A Beginner’s Guide to Recurrent Networks and LSTMs)

    新手教程之:循环网络和LSTM指南 (A Beginner’s Guide to Recurrent Networks and LSTMs) 本文翻译自:http://deeplearning4j.o ...

  7. Photography theory: a beginner's guide(telegraph.co.uk)

    By Diane Smyth, Tim Clark, Rachel Segal Hamilton and Lewis Bush 11:00AM BST 09 Jun 2014   Have you r ...

  8. A Beginner’s Guide to the OUTPUT Clause in SQL Server

    原文 A Beginner’s Guide to the OUTPUT Clause in SQL Server T-SQL supports the OUTPUT clause after the ...

  9. The Beginner’s Guide to iptables, the Linux Firewall

    Iptables is an extremely flexible firewall utility built for Linux operating systems. Whether you’re ...

随机推荐

  1. 福大软工1816:Beta(2/7)

    Beta 冲刺 (2/7) 队名:第三视角 组长博客链接 本次作业链接 团队部分 团队燃尽图 工作情况汇报 张扬(组长) 过去两天完成了哪些任务 文字/口头描述 为utils_wxpy.py添加注释 ...

  2. 技嘉主板+AMD CPU开启CPU虚拟化方法

    硬件环境:技嘉AB350+AMD Ryzen 5 1600X 由于安装虚拟机的需要,所以要开启CPU的虚拟化. 首先进入BIOS. 然后如图:(M.I.T-高级频率设定-CPU超频进阶设置-SVM M ...

  3. 使用docker国内镜像解决方案

    1:蜂巢镜像 https://c.163yun.com/hub#/m/library/ 例如: docker pull hub.c.163.com/library/nginx:1.8 再次执行dock ...

  4. (十一)Jmeter另一种调试工具 HTTP Mirror Server

    之前我介绍过Jmeter的一种调试工具Debug Sampler,它可以输出Jmeter的变量.属性甚至是系统属性而不用发送真实的请求到服务器.既然这样,那么HTTP Mirror Server又是做 ...

  5. Snapseed玩出新高度,分分钟让你成p图大神! 转

    (,,・∀・)ノ゛嗨呀 小阔爱们! 不知道大家记不记得~ 上周我们的副条发了一篇: <看过他的照片,我才知道什么是创意摄影> 德国仅22岁超现实主义艺术家Justin Peters 创造了 ...

  6. Linux的压缩/解压缩文件处理 zip & unzip

    Linux的压缩/解压缩命令详解及实例 压缩服务器上当前目录的内容为xxx.zip文件 zip -r xxx.zip ./* 解压zip文件到当前目录 unzip filename.zip 另:有些服 ...

  7. [cnbeta]微软最强数据中心级操作系统

    微软近日发表了一篇介绍Windows系统内核的博文,期间为了展示Windows的强大扩展性,放出了一张非常震撼的Windows任务管理器截图:乍一看似乎没啥特别的,几十甚至上百个逻辑核心的系统并不罕见 ...

  8. HTML5资源站

    前端里:http://www.yyyweb.com/ http://www.cnblogs.com/html5tricks/p/3925844.html

  9. (转)REST无状态的理解

    转至http://lelglin.iteye.com/blog/1852092 Representational State Transfer的缩写.我对这个词组的翻译是"表现层状态转化&q ...

  10. http://www.pythonchallenge.com/ 网站题解

    在知乎中无意发现了这个网站,做了几题发现挺有趣的,这里记录下自己的解题思路,顺便对比下答案中的思路 网页:http://www.pythonchallenge.com/ 目前只做了几题,解题的方法就是 ...