Linear Regression and Maximum Likelihood Estimation

Imagination is an outcome of what you learned. If you can imagine the world, that means you have learned what the world is about.

Actually, we don't know how we see, or at least it is very hard to know, so we can't write a program that tells a machine how to see.

One of the most important parts of machine learning is introspecting on how our brain learns subconsciously. If we can't introspect, it is fairly hard to replicate a brain.

Linear Models

Supervised learning of linear models can be divided into 2 phases:

Training:
1. Read training data points with labels $\left\{\mathbf{x}_{1:n},y_{1:n}\right\}$, where $\mathbf{x}_i \in \mathbb{R}^{1 \times d}, \ y_i \in \mathbb{R}^{1 \times c}$;
2. Estimate the model parameters $\hat{\theta}$ with a learning algorithm.
  Note: The parameters are the information the model learned from data.
Prediction:
1. Read a new data point without a label, $\mathbf{x}_{n+1}$ (typically one the model has never seen before);
2. Using the learned parameters $\hat{\theta}$, estimate the unknown label $\hat{y}_{n+1}$.

1-D example:
First of all, we create a linear model:
\[
\hat{y}_i = \theta_0 + \theta_1 x_{i}
\]
Both $x$ and $y$ are scalars in this case.

Then we take, for example, the SSE (Sum of Squared Errors) as our objective / loss / cost / energy / error function¹:

\[
J(\theta)=\sum_{i=1}^n \left( \hat{y}_i - y_i\right)^2
\]
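As a rough illustration, the 1-D model and its SSE can be evaluated with NumPy. The toy data and the helper names `predict_1d` and `sse` are assumptions made up for this sketch; they are not part of the original notes.

```python
import numpy as np

def predict_1d(theta0, theta1, x):
    """Linear model y_hat = theta0 + theta1 * x for scalar inputs."""
    return theta0 + theta1 * x

def sse(y_hat, y):
    """Sum of squared errors between predictions and labels."""
    return float(np.sum((y_hat - y) ** 2))

# Toy data lying exactly on y = 1 + 2x (illustrative values only).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

perfect = sse(predict_1d(1.0, 2.0, x), y)  # parameters that generated the data
worse = sse(predict_1d(0.0, 2.0, x), y)    # intercept off by 1 at every point
```

With the generating parameters the cost is zero; shifting the intercept by 1 adds a squared error of 1 for each of the four points.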

Linear Prediction Model

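In general, each data point $\mathbf{x}_i$ has $d$ dimensions, and the model correspondingly has $(d+1)$ parameters: one weight per dimension plus the intercept $\theta_0$.

The scalar form of the linear model, with the convention $x_{i0} = 1$ for the intercept, is: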
\[
\hat{y}_i = \sum_{j=0}^{d} \theta_jx_{ij}
\]

The matrix form of the linear model is:
\[
\begin{bmatrix}
\hat{y}_1 \\
\hat{y}_2 \\
\vdots \\
\hat{y}_n
\end{bmatrix}=
\begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1d} \\
1 & x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nd}
\end{bmatrix}
\begin{bmatrix}
\theta_0 \\
\theta_1 \\
\theta_2 \\
\vdots \\
\theta_d
\end{bmatrix}
\]
Or in a more compact way:
\[
\mathbf{\hat{y}} = \mathbf{X\theta}
\]
Note that the matrix form is widely used not only because it is a concise way to represent the model, but also because it maps directly onto code in MATLAB or Python (NumPy).
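A minimal NumPy sketch of the matrix form might look like the following. The helper `design_matrix` and the numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

def design_matrix(features):
    """Prepend a column of ones so that theta_0 acts as the intercept."""
    features = np.asarray(features, dtype=float)
    if features.ndim == 1:                 # treat a 1-D array as n points with d = 1
        features = features[:, None]
    ones = np.ones((features.shape[0], 1))
    return np.hstack([ones, features])

# n = 3 points, d = 2 features (illustrative values only).
features = np.array([[1.0, 2.0],
                     [3.0, 4.0],
                     [5.0, 6.0]])
theta = np.array([0.5, 1.0, -1.0])         # (d + 1) parameters

X = design_matrix(features)                # shape (n, d + 1)
y_hat = X @ theta                          # all n predictions in one product
```

The single product `X @ theta` replaces the explicit loop over data points and over the sum index $j$.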

Optimization Approach

In order to optimize the model prediction, we need to minimize the quadratic cost:
\[
\begin{align*}\notag
J(\mathbf{\theta}) &= \sum_{i=1}^n \left( \hat{y}_i - y_i\right)^2 \\
&= \left( \mathbf{y-X\theta} \right)^\mathtt T\left( \mathbf{y-X\theta} \right)
\end{align*}
\]

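by setting the derivative w.r.t. the vector $\mathbf{\theta}$ to zero. This gives the global minimum because the cost function is convex (strictly convex when $\mathbf{X}^\mathtt T\mathbf{X}$ is invertible, i.e. when $\mathbf{X}$ has full column rank) and the domain of $\theta$ is convex².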

\[
\begin{align*}\notag
\frac{\partial J(\mathbf{\theta})}{\partial \mathbf{\theta}} &= \frac{\partial}{ \partial \mathbf{\theta} } \left( \mathbf{y-X\theta} \right)^\mathtt T\left( \mathbf{y-X\theta} \right) \\
&=\frac{\partial}{ \partial \mathbf{\theta} } \left( \mathbf{y}^\mathtt T\mathbf{y} + \mathbf{\theta}^\mathtt T \mathbf{X}^\mathtt T\mathbf{X\theta} -2\mathbf{y}^\mathtt T\mathbf{X\theta} \right) \\
&=\mathbf{0}+2 \left( \mathbf{X}^\mathtt T\mathbf{X} \right)^\mathtt T \mathbf{\theta} - 2 \left( \mathbf{y}^\mathtt T\mathbf{X} \right)^\mathtt T \\
&=2 \left( \mathbf{X}^\mathtt T\mathbf{X} \right) \mathbf{\theta} - 2 \left( \mathbf{X}^\mathtt T\mathbf{y} \right) \\
&\triangleq\mathbf{0}
\end{align*}
\]

So we get $\mathbf{\hat{\theta}}$ as an analytical solution:
\[
\mathbf{\hat{\theta}} = \left( \mathbf{X}^\mathtt T\mathbf{X} \right)^{-1} \left( \mathbf{X}^\mathtt T\mathbf{y} \right)
\]
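The closed-form solution can be sketched in NumPy as follows. The synthetic data, the seed, and the name `fit_least_squares` are assumptions for this example. The sketch solves the linear system $\left(\mathbf{X}^\mathtt T\mathbf{X}\right)\theta = \mathbf{X}^\mathtt T\mathbf{y}$ rather than forming the inverse explicitly, which is usually the numerically safer choice, and it assumes $\mathbf{X}^\mathtt T\mathbf{X}$ is invertible.

```python
import numpy as np

def fit_least_squares(X, y):
    """Solve the normal equations (X^T X) theta = X^T y without forming the inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
n, d = 50, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
theta_true = np.array([1.0, 2.0, -3.0, 0.5])       # hypothetical parameters
y = X @ theta_true + 0.1 * rng.normal(size=n)      # noisy labels

theta_hat = fit_least_squares(X, y)
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # library least-squares solver

# Gradient 2 X^T X theta - 2 X^T y from the derivation; it vanishes at the minimizer.
gradient = 2 * X.T @ X @ theta_hat - 2 * X.T @ y
```

Both routes should agree up to floating-point error, and the gradient evaluated at `theta_hat` should be numerically zero.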

Having gone through this procedure, we can see that learning amounts to adjusting the model parameters so as to minimize the objective function.
Thus, the prediction function can be rewritten as:
\[
\begin{align*}\notag
\mathbf{\hat{y}} &= \mathbf{X\hat{\theta}}\\
&=\mathbf{X}\left( \mathbf{X}^\mathtt T\mathbf{X} \right)^{-1} \mathbf{X}^\mathtt T\mathbf{y}
\triangleq \mathbf{Hy}
\end{align*}
\]
where $\mathbf{H}$ is called the hat matrix because it puts the hat on $\mathbf{y}$.
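As a small numerical check, the hat matrix can be built directly and compared with the fitted values. The random data and the seed are assumptions for this sketch. The symmetry, idempotence, and trace properties checked in the comments are standard facts about projection matrices and are not stated in the notes above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = rng.normal(size=n)

# H = X (X^T X)^{-1} X^T, computed with a linear solve instead of an explicit inverse.
H = X @ np.linalg.solve(X.T @ X, X.T)

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = H @ y                      # same fitted values as X @ theta_hat

# H is an orthogonal projection onto the column space of X:
# it is symmetric, satisfies H @ H == H, and has trace d + 1.
```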

Multidimensional Label $\mathbf{y_i}$

So far we have assumed $y_i$ to be a scalar. But what if the model has multiple outputs (e.g. $c$ outputs)? Simply give $\theta$ one column of parameters per output, $c$ columns in total:
\[
\begin{bmatrix}
y_{11} & \cdots & y_{1c} \\
y_{21} & \cdots & y_{2c} \\
\vdots & \ddots & \vdots \\
y_{n1} & \cdots & y_{nc}
\end{bmatrix}=
\begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1d} \\
1 & x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nd}
\end{bmatrix}
\begin{bmatrix}
\theta_{01} & \cdots & \theta_{0c}\\
\theta_{11} & \cdots & \theta_{1c}\\
\theta_{21} & \cdots & \theta_{2c}\\
\vdots & \ddots & \vdots \\
\theta_{d1}& \cdots & \theta_{dc}
\end{bmatrix}
\]
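A sketch of the multi-output case, assuming the same least-squares solution is applied with a label matrix of shape $n \times c$ (the synthetic data and variable names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, c = 40, 3, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
Theta_true = rng.normal(size=(d + 1, c))            # hypothetical parameters
Y = X @ Theta_true + 0.05 * rng.normal(size=(n, c))

# Solving with a matrix right-hand side fits all c outputs at once.
Theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)       # shape (d + 1, c)

# Fitting each output column separately gives the same parameters.
columns = [np.linalg.solve(X.T @ X, X.T @ Y[:, k]) for k in range(c)]
Theta_by_column = np.column_stack(columns)
```

Under the squared-error cost, the $c$ outputs decouple, so the joint fit is just $c$ independent scalar regressions sharing the same $\mathbf{X}$.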

Linear Regression with Maximum Likelihood

Suppose that each label $y_i$ is Gaussian distributed with mean $x_i^{\mathtt{T}} \theta$ and variance $\sigma^2$:
\[
p(y_i|x_i,\theta,\sigma^2) = N(y_i|x_i^{\mathtt{T}}\theta, \sigma^2) = \left( 2\pi\sigma^2 \right)^{-1/2} e^{ -\frac{\left( y_i-x_i^{\mathtt{T}}\theta \right)^2}{2\sigma^2} }
\]

Likelihood

Assuming the labels in $\mathbf{y}$ are independent given the inputs and share the same noise variance, the joint likelihood factorizes into a product:
\[
\begin{align*}\notag
p( \mathbf{y}|\mathbf{X,\theta,\sigma^2} ) &= \prod_{i=1}^n {p(y_i|\mathbf{x}_i,\theta,\sigma^2} ) \\
&=\prod_{i=1}^n \left( 2\pi\sigma^2 \right)^{-1/2} e^{ -\frac{\left( y_i-x_i^{\mathtt{T}}\theta \right)^2}{2\sigma^2} } \\
&=\left( 2\pi\sigma^2 \right)^{-n/2} e^{-\frac{\sum_{i=1}^n \left( y_i-x_i^{\mathtt{T}}\theta \right)^2}{2\sigma^2}} \\
&= \left( 2\pi\sigma^2 \right)^{-n/2} e^{-\frac{ (\mathbf{y-X\theta})^{\mathtt{T}} (\mathbf{y-X\theta}) } {2\sigma^2}}
\end{align*}
\]

Maximum Likelihood Estimation

Our goal is then to maximize the probability of the labels under our Gaussian linear regression model w.r.t. $\theta$ and $\sigma$.

Instead of minimizing the SSE cost function (the lengths of the blue lines), this time we maximize the likelihood (the lengths of the green lines) to optimize the model parameters.

Since the $\log$ function is monotonically increasing and turns the product of exponentials into a sum, we work with the log-likelihood:
\[
\log p( \mathbf{y}|\mathbf{X,\theta}, \sigma^2 ) = -\frac{n}{2} \log \left( 2\pi\sigma^2 \right) -\frac{ (\mathbf{y-X\theta})^{\mathtt{T}} (\mathbf{y-X\theta}) } {2\sigma^2}
\]
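The step from the product of per-point densities to the closed-form log-likelihood can be checked numerically. This is only a sketch: the data, the parameter values, and the function names are assumptions for illustration.

```python
import numpy as np

def per_point_density(theta, sigma2, X, y):
    """Gaussian density of each label y_i around its mean x_i^T theta."""
    residual = y - X @ theta
    return (2 * np.pi * sigma2) ** -0.5 * np.exp(-residual ** 2 / (2 * sigma2))

def log_likelihood(theta, sigma2, X, y):
    """Closed form: -(n/2) log(2 pi sigma^2) - (y - X theta)^T (y - X theta) / (2 sigma^2)."""
    residual = y - X @ theta
    n = y.shape[0]
    return -0.5 * n * np.log(2 * np.pi * sigma2) - residual @ residual / (2 * sigma2)

# Four points near the line y = 1 + 2x (illustrative values only).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.1, 2.9, 5.2, 6.8])
theta = np.array([1.0, 2.0])
sigma2 = 0.25

log_of_product = np.log(np.prod(per_point_density(theta, sigma2, X, y)))
closed_form = log_likelihood(theta, sigma2, X, y)
```

For large $n$ the raw product underflows to zero in floating point, which is a practical reason to prefer the log form in code.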

MLE of $\theta$:
\[
\begin{align*}\notag
\frac{\partial {\log p( \mathbf{y}|\mathbf{X,\theta,\sigma^2} )} }{\partial {\theta}} &= \frac{\partial}{\partial \theta} \left[ -\frac{n}{2} \log \left( 2\pi\sigma^2 \right) -\frac{ (\mathbf{y-X\theta})^{\mathtt{T}} (\mathbf{y-X\theta}) } {2\sigma^2} \right] \\
&= 0 - \frac{1}{2\sigma^2} \frac{\partial{(\mathbf{y-X\theta})^{\mathtt{T}} (\mathbf{y-X\theta})}}{\partial{\theta}} \\
&= -\frac{1}{2\sigma^2} \frac{ \partial{ \left( \mathbf{y}^{\mathtt{T}}\mathbf{y} + \theta^{\mathtt{T}} \mathbf{X}^{\mathtt{T}} \mathbf{X\theta} - 2\mathbf{y}^{\mathtt{T}}\mathbf{X\theta} \right) } }{\partial{\theta}} \\
&= -\frac{1}{2\sigma^2} \left[ 0+ 2\left( \mathbf{X^{\mathtt{T}}X} \right)^{\mathtt{T}}\theta - 2\left( \mathbf{y}^{\mathtt{T}}\mathbf{X} \right)^{\mathtt{T}} \right] \\
&= -\frac{1}{2\sigma^2} \left[ 2\mathbf{X^{\mathtt{T}}X\theta} - 2\mathbf{X}^{\mathtt{T}}\mathbf{y} \right] \triangleq 0
\end{align*}
\]
Unsurprisingly, the maximum likelihood estimate is identical to the least-squares estimate:
\[
\hat\theta_{MLE} = \left( \mathbf{X}^{\mathtt{T}}\mathbf{X} \right)^{-1} \mathbf{X}^{\mathtt{T}} \mathbf{y}
\]

Besides telling us where the "line" is, MLE with a Gaussian model also gives us the uncertainty, or confidence, of the prediction $\mathbf{\hat y}$ as another parameter.
MLE of $\sigma^2$:
\[
\begin{align*}\notag
\frac{\partial {\log p( \mathbf{y}|\mathbf{X,\theta}, \sigma^2 )} }{\partial {\sigma}} &= \frac{\partial}{\partial \sigma} \left[ -\frac{n}{2} \log \left( 2\pi\sigma^2 \right) -\frac{ (\mathbf{y-X\theta})^{\mathtt{T}} (\mathbf{y-X\theta}) } {2\sigma^2} \right] \\
&= -\frac{n}{2} \frac{1}{2\pi\sigma^2} 4\pi\sigma + 2 \frac{ (\mathbf{y-X\theta})^{\mathtt{T}} (\mathbf{y-X\theta}) }{2\sigma^3} \\
&= -\frac{n}{\sigma} + \frac{ (\mathbf{y-X\theta})^{\mathtt{T}} (\mathbf{y-X\theta}) }{\sigma^3} \triangleq 0
\end{align*}
\]
Thus, we get:
\[
\begin{align*}\notag
\hat\sigma_{MLE}^2 &= \frac1n (\mathbf{y-X}\hat\theta_{MLE})^{\mathtt{T}} (\mathbf{y-X}\hat\theta_{MLE}) \\
&= \frac1n \sum_{i=1}^n \left(y_i-\mathbf{x}_i^\mathtt{T}\hat\theta_{MLE} \right)^2
\end{align*}
\]
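which is the mean squared error (MSE) of the residuals, i.e. the standard (biased) estimate of the noise variance.
However, this uncertainty estimator does not work very well: it is a single constant, so every prediction gets the same uncertainty regardless of the input. We'll see another, much more powerful uncertainty estimator later.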
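The two estimates can be computed together in a short sketch. The synthetic data (noise standard deviation 0.5, so true variance 0.25), the seed, and the names are assumptions for this example. The last lines compare the log-likelihood at the estimated variance with a few rescaled variances.

```python
import numpy as np

def log_likelihood(theta, sigma2, X, y):
    """Gaussian log-likelihood of the labels y given X, theta and sigma^2."""
    r = y - X @ theta
    return -0.5 * len(y) * np.log(2 * np.pi * sigma2) - r @ r / (2 * sigma2)

rng = np.random.default_rng(3)
n = 200
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 1))])
y = X @ np.array([1.0, 2.0]) + 0.5 * rng.normal(size=n)

theta_mle = np.linalg.solve(X.T @ X, X.T @ y)    # same as the least-squares fit
residual = y - X @ theta_mle
sigma2_mle = residual @ residual / n             # mean squared residual

# The log-likelihood at sigma2_mle should beat any rescaled variance.
best = log_likelihood(theta_mle, sigma2_mle, X, y)
nearby = [log_likelihood(theta_mle, s * sigma2_mle, X, y) for s in (0.5, 0.9, 1.1, 2.0)]
```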

Again, we analytically obtain the optimal parameters for the model to describe labeled data points.

Prediction

Now that we have the optimal parameters $\left(\theta_{MLE},\sigma_{MLE}^2\right)$ of our linear regression model, making a prediction simply means taking the mean of the Gaussian at a given test data point $\mathbf x_*$:
\[
\hat y_* = \mathbf x_*^{\mathtt T}\theta_{MLE}
\]
with uncertainty $\sigma_{MLE}^2$.
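Putting training and prediction together, a minimal sketch could look like this. The functions `fit_mle` and `predict` and the four data points are hypothetical, introduced only for illustration.

```python
import numpy as np

def fit_mle(X, y):
    """Return (theta_mle, sigma2_mle) for the Gaussian linear model."""
    theta = np.linalg.solve(X.T @ X, X.T @ y)
    residual = y - X @ theta
    return theta, residual @ residual / y.shape[0]

def predict(x_star, theta, sigma2):
    """Predictive mean x_*^T theta and variance sigma^2 (the same for every x_*)."""
    return x_star @ theta, sigma2

# Four training points near y = 1 + 2x (illustrative values only).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.1, 2.9, 5.1, 6.9])

theta_mle, sigma2_mle = fit_mle(X, y)
x_star = np.array([1.0, 4.0])          # the leading 1 is the intercept feature
y_star, var_star = predict(x_star, theta_mle, sigma2_mle)
```

For this toy data the fit is $\theta_{MLE} = (1.06, 1.96)$, so the predicted mean at $x = 4$ is $8.9$, with variance $0.008$.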

Frequentist Learning

Maximum likelihood learning is a form of frequentist learning.

Frequentist learning assumes there is a true model with parameter $\theta_{true}$, and that with enough data we would be able to recover this truth. The core of learning in this case is to guess / estimate / learn a parameter $\hat \theta$ that approximates the true one, given a finite amount of training data.

Maximum likelihood essentially tries to approximate the model parameter $\theta_{true}$ by maximizing the likelihood (the joint probability of the data given the parameter). That is, given $n$ data points $\mathbf X = [\mathbf x_1, \cdots,\mathbf x_n]$ with corresponding labels $\mathbf y = [y_1, \cdots, y_n]$, we choose the value of the model parameter $\theta$ that is most likely to have generated these data.

Also note that frequentist learning relies on the Law of Large Numbers.

KL Divergence and MLE

Assume the data $\mathbf X$ are drawn i.i.d. from the distribution $p(\mathbf x|\theta_{true})$:
\[
\begin{align*}
p(\mathbf X|\theta_{true}) &= \prod_{i=1}^n p(\mathbf x_i|\theta_{true}) \\
\theta_{MLE} &= \arg \underset {\theta}{\max} \prod_{i=1}^n p(\mathbf x_i|\theta) \\
&= \arg \underset {\theta}{\max}\sum_{i=1}^n \log p(\mathbf x_i|\theta)
\end{align*}
\]
Then we add the term $-\sum_{i=1}^n \log p(\mathbf x_i|\theta_{true})$, which is constant w.r.t. $\theta$, and divide by the constant $n$; neither step changes the maximizer:
\[
\begin{align*}
\theta_{MLE} &= \arg \underset {\theta}{\max} \left[ \frac1 n\sum_{i=1}^n \log p(\mathbf x_i|\theta) -\frac1 n\sum_{i=1}^n \log p(\mathbf x_i|\theta_{true}) \right] \\
&= \arg \underset {\theta} {\max} \frac 1 n \sum_{i=1}^n \log \frac{p(\mathbf x_i|\theta)}{p(\mathbf x_i|\theta_{true})}
\end{align*}
\]

Recall the Law of Large Numbers: as $n\rightarrow \infty$,
\[
\frac 1 n\sum_{i=1}^nx_i\rightarrow\int xp(x)\mathrm dx=\mathbb E[x]
\]
where each $x_i$ is drawn independently from $p(x)$.
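The Law of Large Numbers is easy to observe by simulation. The sketch below assumes $p(x)$ is a Gaussian with mean 2 and unit variance; the distribution, seed, and sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
true_mean = 2.0
samples = rng.normal(loc=true_mean, scale=1.0, size=100_000)

# Running average (1/n) * sum_{i<=n} x_i for n = 1, ..., len(samples).
running_mean = np.cumsum(samples) / np.arange(1, samples.size + 1)

error_small_n = abs(running_mean[9] - true_mean)    # after n = 10 samples
error_large_n = abs(running_mean[-1] - true_mean)   # after n = 100000 samples
```

The error of the running average typically shrinks at a rate of about $1/\sqrt{n}$, although for any single run the decrease is not guaranteed to be monotone.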

Again, frequentist learning assumes that each data point is drawn from the true model, $\mathbf x_i\sim p(\mathbf x|\theta_{true})$. Hence, as $n \rightarrow \infty$, the MLE of $\theta$ becomes
\[
\begin{align*}
\theta_{MLE}&=\arg \underset{\theta}{\max} \int_{\mathbf x} \log \frac{p(\mathbf x|\theta)}{p(\mathbf x|\theta_{true})} p(\mathbf x|\theta_{true}) \mathrm dx \\
&=\arg \underset{\theta}{\min} \int_{\mathbf x} \log \frac{p(\mathbf x|\theta_{true})}{p(\mathbf x|\theta)} p(\mathbf x|\theta_{true}) \mathrm dx \\
&=\arg \underset{\theta}{\min}\ \mathbb E_{p(\mathbf x|\theta_{true})} \left[ \log \frac{p(\mathbf x|\theta_{true})}{p(\mathbf x|\theta)} \right] \\
&=\arg \underset{\theta}{\min}\ \mathrm {KL} \left[ p(\mathbf x|\theta_{true})\ ||\ p(\mathbf x|\theta) \right]
\end{align*}
\]
Therefore, in the limit of infinite data, maximizing the likelihood is equivalent to minimizing the KL divergence between the true distribution and the model distribution.
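This equivalence can be illustrated with a simple case that is not in the notes above: a Gaussian with known unit variance and unknown mean, for which $\mathrm{KL}\left[ N(\mu_{true},1)\ ||\ N(\mu,1) \right] = \frac12(\mu-\mu_{true})^2$. The sketch below (the model, grid, and seed are all assumptions) compares the averaged log-likelihood ratio with the negative KL divergence over a grid of candidate means.

```python
import numpy as np

def log_density(x, mu):
    """Log of the N(mu, 1) density."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2

rng = np.random.default_rng(5)
mu_true = 1.0
x = rng.normal(loc=mu_true, scale=1.0, size=200_000)

candidates = np.linspace(-1.0, 3.0, 41)     # grid of candidate means, step 0.1

# (1/n) sum_i log[p(x_i | mu) / p(x_i | mu_true)] for each candidate mu.
avg_log_ratio = np.array([np.mean(log_density(x, mu) - log_density(x, mu_true))
                          for mu in candidates])

kl_exact = 0.5 * (candidates - mu_true) ** 2   # KL[N(mu_true, 1) || N(mu, 1)]
mu_best = candidates[np.argmax(avg_log_ratio)]
```

With this many samples, the averaged log-ratio should track `-kl_exact` closely, and the grid point that maximizes it should coincide with the one that minimizes the KL divergence.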

Entropy and MLE

In the last part, we obtained
\[
\begin{align*}
\theta_{MLE}&=\arg \underset{\theta}{\min} \int_{\mathbf x} \log \frac{p(\mathbf x|\theta_{true})}{p(\mathbf x|\theta)} p(\mathbf x|\theta_{true}) \mathrm dx \\
&=\arg \underset{\theta}{\min} \left[ \int_{\mathbf x} \log p(\mathbf x|\theta_{true}) p(\mathbf x|\theta_{true}) \mathrm dx - \int_{\mathbf x} \log p(\mathbf x|\theta) p(\mathbf x|\theta_{true}) \mathrm dx \right]
\end{align*}
\]
The first integral in the equation above is the negative entropy of the true distribution $p(\mathbf x|\theta_{true})$, i.e. the information in the world, while the second integral is the negative cross entropy between the true distribution and the model distribution $p(\mathbf x|\theta)$, i.e. the information from the model. The equation says that if the information in the world matches the information from the model, then the model has learned!
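The same decomposition can be checked on a small discrete example, where the integrals become sums. The two probability vectors below are made up for illustration, and the discrete setting is a simplifying assumption.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_k p_k log p_k."""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k log q_k."""
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """KL[p || q] = sum_k p_k log(p_k / q_k)."""
    return np.sum(p * np.log(p / q))

p_true = np.array([0.5, 0.3, 0.2])     # "world" distribution (illustrative)
q_model = np.array([0.4, 0.4, 0.2])    # model distribution (illustrative)

kl = kl_divergence(p_true, q_model)
gap = cross_entropy(p_true, q_model) - entropy(p_true)   # equals kl
kl_when_matched = kl_divergence(p_true, p_true)          # zero when the model matches
```

Since the entropy term does not depend on the model, minimizing the KL divergence over the model is the same as minimizing the cross entropy.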

Statistical Quantities of Frequentist Learning

There are 2 quantities that frequentists often estimate:

bias
variance

Refer: CPSC540, UBC

¹ SSE is widely known, but it works poorly under certain circumstances; e.g. if the training data contain outliers, the fitted model can be seriously distorted by them.
² See one of several interesting explanations here.