Welcome to Machine-Learning-Model-Analysis’s documentation!

Mathematical notation

  • \(x\) denotes a scalar
  • \(\bar{x}\) denotes an estimated value
  • \(\mathbf{x}\) denotes a column vector
  • \(\mathbf{x}^T\) denotes a row vector
  • \(\mathbf{X}\) denotes a matrix; for a feature matrix, rows correspond to samples and columns to features
  • \(n\) denotes the number of columns
  • \(m\) denotes the number of rows
  • \(\mathbf{x}_i\) denotes the i-th element of a vector
  • \(\mathbf{X}_{i}\) denotes the i-th row of a matrix
  • \(\mathbf{X}^{j}\) denotes the j-th column of a matrix
  • \(\mathbf{X}_{i}^{j}\) denotes the entry in row i, column j of a matrix
  • \(\epsilon\) denotes the error term between the predicted and true values
  • \(;\) quantities after a semicolon are hyperparameters
  • \(w, h, l\) denote width, height, and length

Linear Model

Linear regression

Illustration

Prediction function

\[\begin{split}f(\mathbf{x}) = \begin{bmatrix} \theta_0 \\ \theta_1 \\ .. \\ \theta_n \end{bmatrix}^T \begin{bmatrix} \mathbf{x}_0 \\ \mathbf{x}_1 \\ .. \\ \mathbf{x}_n \end{bmatrix} =\mathbf{\theta}^T \mathbf{x}\end{split}\]

Loss function

\[L(y,\bar{y}) = (y - \bar{y})^2\]

Objective function

\[O(\mathbf{y},\mathbf{X}) = \frac{1}{2} \sum_{i=1}^{m}L(\mathbf{y}_{i},f(\mathbf{X}_i))=\frac{1}{2}(\mathbf{y} - \mathbf{X}\mathbf{\theta})^T(\mathbf{y} - \mathbf{X}\mathbf{\theta})\]

Optimizing

\[\mathbf{\theta} = \mathop{\mathbf{\arg\min}}_{\mathbf{\theta}} \ \ O(\mathbf{y},\mathbf{X})\]
Normal equations
\[\begin{split}\nabla_{\theta} {O(\mathbf{y},\mathbf{X};\mathbf{\theta})} & = \nabla_{\theta} \ \ \frac{1}{2}(\mathbf{y} - \mathbf{X}\mathbf{\theta})^T (\mathbf{y} - \mathbf{X}\mathbf{\theta}) \\ & = \frac{1}{2} \nabla_{\theta} \ \ (\mathbf{\theta}^T \mathbf{X}^T \mathbf{X} \mathbf{\theta} - \mathbf{\theta}^T \mathbf{X}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\mathbf{\theta} + \mathbf{y}^T\mathbf{y}) \\ & = \frac{1}{2} \nabla_{\theta} \ \ \text{Tr}(\mathbf{\theta}^T \mathbf{X}^T \mathbf{X} \mathbf{\theta} - \mathbf{\theta}^T \mathbf{X}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\mathbf{\theta} + \mathbf{y}^T\mathbf{y}) \\ & = \frac{1}{2} \nabla_{\theta} \ \ (\text{Tr}(\mathbf{\theta}^T \mathbf{X}^T \mathbf{X}\mathbf{\theta}) - 2 \text{Tr}(\mathbf{y}^T\mathbf{X}\mathbf{\theta})) \\ & = \frac{1}{2} (\mathbf{X}^T \mathbf{X} \mathbf{\theta} + \mathbf{X}^T \mathbf{X}\mathbf{\theta} - 2 \mathbf{X}^T \mathbf{y}) \\ & = \mathbf{X}^T \mathbf{X} \mathbf{\theta} - \mathbf{X}^T \mathbf{y}\end{split}\]

Setting \(\nabla_{\theta}{O(\mathbf{y},\mathbf{X};\mathbf{\theta})} = 0\) gives

\[\begin{split}\mathbf{X}^T \mathbf{X} \mathbf{\theta} & = \mathbf{X}^T \mathbf{y} \\ \mathbf{\theta} & = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \qquad \text{when } \mathbf{X}^T\mathbf{X} \text{ is invertible}\end{split}\]

Warning

The case where the feature matrix is not square, and more generally where \(\mathbf{X}^T\mathbf{X}\) is not invertible, still needs to be investigated; related material remains to be consulted.

Note

The derivation requires some knowledge of matrix trace identities.
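
As a concrete reference for the normal-equation solution, here is a minimal NumPy sketch; the synthetic data, variable names, and the note about the pseudo-inverse are illustrative assumptions, not part of this project:

    import numpy as np

    # Synthetic data: m samples, n features, plus an intercept column.
    rng = np.random.default_rng(0)
    m, n = 100, 3
    X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # first column models the bias
    true_theta = np.array([2.0, -1.0, 0.5, 3.0])
    y = X @ true_theta + rng.normal(scale=0.1, size=m)          # y = X theta + noise

    # Normal equations: theta = (X^T X)^{-1} X^T y.
    # np.linalg.solve is preferred over forming the inverse explicitly;
    # np.linalg.pinv covers the case where X^T X is singular.
    theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(theta_hat)   # close to true_theta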

Gradient-based
\[\begin{split}\nabla_{\theta} {O(\mathbf{y},\mathbf{X};\mathbf{\theta})} & = \nabla_{\theta} \ \ \frac{1}{2} \sum_{i=1}^{m} L(\mathbf{y}_i,f(\mathbf{X}_i)) \\ & = \nabla_{\theta} \ \ \frac{1}{2} \sum_{i=1}^{m} (\mathbf{y}_i - \mathbf{\theta}^T \mathbf{X}_i)^2\end{split}\]

Considering a single sample \(i\) (as in stochastic gradient descent):

\[\begin{split}\nabla_{\theta} \ \ \frac{1}{2} (\mathbf{y}_i - \mathbf{\theta}^T \mathbf{X}_i )^2 & = \frac{1}{2} \cdot 2 \, (\mathbf{y}_i - \mathbf{\theta}^T \mathbf{X}_i) \frac{\partial{(-\mathbf{\theta}^T \mathbf{X}_i)}}{\partial{\mathbf{\theta}}} \\ & = - (\mathbf{y}_i - \mathbf{\theta}^T \mathbf{X}_i) \mathbf{X}_{i}\end{split}\]

From the expression above, the parameter update rule follows:

\[\mathbf{\theta} \leftarrow \mathbf{\theta} - \alpha (- (\mathbf{y}_i - f(\mathbf{X}_i)) \mathbf{X}_{i} )\]
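
The update rule translates directly into a per-sample gradient descent loop. A minimal NumPy sketch follows; the learning rate, epoch count, and toy data are arbitrary choices for illustration:

    import numpy as np

    def sgd_linear_regression(X, y, alpha=0.01, epochs=50, seed=0):
        """Per-sample SGD: theta <- theta + alpha * (y_i - theta^T x_i) * x_i."""
        rng = np.random.default_rng(seed)
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(epochs):
            for i in rng.permutation(m):        # visit samples in random order
                error = y[i] - X[i] @ theta     # y_i - f(X_i)
                theta += alpha * error * X[i]   # gradient step
        return theta

    # Usage on toy data
    rng = np.random.default_rng(1)
    X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
    y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=200)
    print(sgd_linear_regression(X, y))          # close to [1, 2, -3]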

Linear regression: why

Why does the prediction function take this form?

Answer: It holds under the three assumptions of the generalized linear model (GLM) construction. Under assumption one, \(P(y|x;\theta) \sim \mathcal{N}(\mu, \sigma^2)\) (taking \(\sigma = 1\) below for simplicity), which gives the following derivation:

\[\begin{split}P(y;\mu) =& \frac{1}{\sqrt{2\pi}} \exp(-\frac{1}{2}(y-\mu)^2) \\ =& \frac{1} {\sqrt{2\pi} } \exp(-\frac{1}{2}y^2).\exp(\mu y - \frac{1}{2}\mu^2)\end{split}\]

Hence \(\eta = \mu\). Under assumption two, we obtain:

\[\begin{split}f(x) & =\mathbf{E}[y|x;\theta] \\ & = \mu\end{split}\]

Assumption three states \(\eta=\theta^T \mathbf{x}\). Combining assumptions one, two, and three gives:

\[f(x) = \theta^T \mathbf{x}\]

Why is the loss function quadratic?

Answer: Assume the error term \(\epsilon \sim \mathcal{N}(0,\sigma^2)\). Since

\[y = \mathbf{\theta}^T \mathbf{x} + \epsilon\]

it follows that

\[P(y|x;\mathbf{\theta}) = \frac{1}{\sqrt{2 \pi} \sigma} \exp ({- \frac{(y-\mathbf{\theta}^T \mathbf{x} )^2 }{2 \sigma^2}})\]

Maximum likelihood estimation then gives:

\[\begin{split}L(\mathbf{\theta})= & \prod_{i=1}^{m} P( \mathbf{y}_{i}| \mathbf{X}_{i} ; \mathbf{\theta}) \\ \ell = & \log { \prod_{i=1}^{m}P(\mathbf{y}_{i}|\mathbf{X}_{i};\mathbf{\theta}) } \\ = & \log { \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\sigma} \exp{(- \frac{ (\mathbf{y}_{i} - \mathbf{\theta}^T\mathbf{X}_{i})^2}{2 \sigma^2})} } \\ = & \sum_{i=1}^{m} \log { \frac{1}{\sqrt{2\pi} \sigma} \exp(- \frac{(\mathbf{y}_i - \mathbf{\theta}^T \mathbf{X}_i)^2}{2 \sigma^2}) } \\ = & m \log{\frac{1}{\sqrt{2 \pi} \sigma } } - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{m} (\mathbf{y}_i - \mathbf{\theta}^T \mathbf{X}_i)^2\end{split}\]

The first term is a constant and \(\sigma^2 > 0\), so maximizing the likelihood is equivalent to minimizing the squared loss \(\frac{1}{2} \sum_{i=1}^{m} (\mathbf{y}_i - \mathbf{\theta}^T \mathbf{X}_i)^2\).
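
A short numeric check of this equivalence: for fixed \(\sigma\), the Gaussian negative log-likelihood and the squared loss differ only by a constant offset and a positive scale, so they rank parameter vectors identically (the toy data and \(\sigma\) below are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)
    sigma = 0.3

    def neg_log_likelihood(theta):
        r = y - X @ theta
        return -(len(y) * np.log(1.0 / (np.sqrt(2 * np.pi) * sigma))
                 - 0.5 * np.sum(r ** 2) / sigma ** 2)

    def squared_loss(theta):
        return 0.5 * np.sum((y - X @ theta) ** 2)

    theta_a = np.array([1.0, -2.0, 0.5])
    theta_b = np.zeros(3)
    # The two differences agree up to the factor 1 / sigma^2.
    print(neg_log_likelihood(theta_a) - neg_log_likelihood(theta_b))
    print((squared_loss(theta_a) - squared_loss(theta_b)) / sigma ** 2)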

Linear regression: extensions

Lasso

Ridge

LWR(locally weighted linear regression)

Logistic regression

Illustration

Prediction function

\[f(x) = g(\mathbf{\theta}^T \mathbf{x}) = \frac{1}{ 1 + \exp{(- \mathbf{\theta}^T \mathbf{x})}}\]

Loss function

\[L(y,\bar{y}) = - (y \log {\bar{y}} + (1-y) \log {(1 - \bar{y})}), \qquad \bar{y} = f(\mathbf{x})\]

Note

This can be simplified further into a \(\log(1 + \exp(\cdot))\) form; to be completed.

Note

This loss function has a standard name: it is the (binary) cross-entropy, also known as the log loss.

Objective function

\[O(\mathbf{y},\mathbf{X}) = \sum_{i=1}^{m} L(\mathbf{y}_i, f(\mathbf{X}_i))=- \sum_{i=1}^{m} {\mathbf{y}_i \log {f(\mathbf{X}_i)} + (1-\mathbf{y}_i) \log {(1 - f(\mathbf{X}_i))}}\]

Optimizing

\[\mathbf{\theta} = \mathop{\mathbf{\arg\min}}_{\mathbf{\theta}} \ \ O(\mathbf{y},\mathbf{X})\]
Gradient-based
\[\begin{split}\nabla_{\theta} {O(\mathbf{y},\mathbf{X};\mathbf{\theta})} & = \nabla_{\theta} \ \ \sum_{i=1}^{m} L(\mathbf{y}_i,f(\mathbf{X}_i)) \\ & = \nabla_{\theta} \ \ \left(- \sum_{i=1}^{m} {\mathbf{y}_i \log {f(\mathbf{X}_i)} + (1 - \mathbf{y}_i) \log {(1 - f(\mathbf{X}_i ))}}\right)\end{split}\]

Considering a single sample \(i\) (as in stochastic gradient descent), with \(f(\mathbf{X}_i) = g(\mathbf{\theta}^T \mathbf{X}_i)\) and \(g'(z) = g(z)(1-g(z))\):

\[\begin{split}\nabla_{\theta} L(\mathbf{y}_i, f(\mathbf{X}_i)) & = - \frac{\partial{ \left( \mathbf{y}_i \log {f(\mathbf{X}_i)} + (1 - \mathbf{y}_i) \log {(1 - f(\mathbf{X}_i ))} \right) }}{ \partial{(\mathbf{\theta}^T \mathbf{X}_i)}} \frac{\partial{(\mathbf{\theta}^T \mathbf{X}_i)}}{\partial{\mathbf{\theta}}} \\ & = - \left(\mathbf{y}_i \frac{1}{f(\mathbf{X}_i)} - (1-\mathbf{y}_i) \frac{1}{1 - f(\mathbf{X}_i)}\right) g(\mathbf{\theta}^T \mathbf{X}_i) ( 1 - g(\mathbf{\theta}^T \mathbf{X}_i)) \mathbf{X}_i \\ & = - (\mathbf{y}_i - f(\mathbf{X}_i)) \mathbf{X}_i\end{split}\]

From the expression above, the parameter update rule follows:

\[\mathbf{\theta} \leftarrow \mathbf{\theta} - \alpha (- (\mathbf{y}_i - f(\mathbf{X}_i)) \mathbf{X}_{i} )\]
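
A minimal NumPy sketch of the same per-sample update for logistic regression; the sigmoid, the hyperparameters, and the toy data are the only ingredients, and all names are illustrative:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sgd_logistic_regression(X, y, alpha=0.1, epochs=100, seed=0):
        """Per-sample SGD: theta <- theta + alpha * (y_i - sigmoid(theta^T x_i)) * x_i."""
        rng = np.random.default_rng(seed)
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(epochs):
            for i in rng.permutation(m):
                error = y[i] - sigmoid(X[i] @ theta)   # y_i - f(X_i)
                theta += alpha * error * X[i]
        return theta

    # Toy binary data: the label depends on the sign of a noisy linear score
    rng = np.random.default_rng(2)
    X = np.hstack([np.ones((300, 1)), rng.normal(size=(300, 2))])
    y = (X @ np.array([0.5, 2.0, -1.5]) + rng.normal(scale=0.5, size=300) > 0).astype(float)
    theta = sgd_logistic_regression(X, y)
    print(theta, np.mean((sigmoid(X @ theta) > 0.5) == y))   # weights and training accuracy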

Logistic regression: why

Why does the prediction function take this form?

It holds under the three assumptions of the generalized linear model construction. Under assumption one, \(P(y|x;\theta) \sim \text{Bernoulli}(\phi)\), which gives the following derivation:

\[\begin{split}P(y;\phi) = & \phi^y(1-\phi)^{1-y} \\ = & \exp{ (y\log{\phi} + (1-y)\log{(1 - \phi)} ) } \\ = & \exp{ \left( \left( \log{ \frac{\phi}{1 - \phi} } \right) y + \log{(1-\phi)} \right)}\end{split}\]

From the above, \(\eta=\log{\frac{\phi}{1 - \phi}}\). Under assumption two, we obtain:

\[\begin{split}f(x) & =\mathbf{E}[y|x;\theta] \\ & = \phi\end{split}\]

Assumption three states \(\eta=\theta^T \mathbf{x}\). Combining assumptions one, two, and three (solving \(\eta=\log{\frac{\phi}{1-\phi}}\) for \(\phi\)) gives:

\[f(x) = \frac{1}{1 + \exp{(- \mathbf{\theta}^T \mathbf{x})}}\]

Why is the loss function the cross-entropy?

From the log-odds:

\[\log{\frac{y}{1-y}} = \mathbf{\theta}^T \mathbf{x}\]

it follows that:

\[\begin{split}P(y=1|x;\mathbf{\theta}) & = f(\mathbf{x}) \\ P(y=0|x;\mathbf{\theta}) & = 1 - f(\mathbf{x})\end{split}\]

which can be written compactly as:

\[P(y|x;\mathbf{\theta}) = f(\mathbf{x})^y (1 - f(\mathbf{x}))^{1-y}\]

Maximum likelihood estimation then gives:

\[\begin{split}L(\mathbf{\theta}) & = \prod_{i=1}^{m}P(\mathbf{y}_i|\mathbf{X}_i;\mathbf{\theta}) \\ \ell & = \log{\prod_{i=1}^{m}P(\mathbf{y}_i|\mathbf{X}_i;\mathbf{\theta})} \\ & = \log{ \prod_{i=1}^{m} f(\mathbf{X}_i)^{\mathbf{y}_i} (1 - f(\mathbf{X}_i))^{1-\mathbf{y}_i} } \\ & = \sum_{i=1}^{m} {\mathbf{y}_i \log{f(\mathbf{X}_i)} + {(1-\mathbf{y}_i)} \log{(1 - f(\mathbf{X}_i))}}\end{split}\]

Hence maximizing the likelihood is equivalent to minimizing the cross-entropy loss.

Logistic regression: extensions

Softmax regression

Illustration

Prediction function

\[f(\mathbf{x},k;\mathbf{\Theta}) = \frac{ \exp({\mathbf{\Theta}_k \mathbf{x}}) } {\sum\limits_{j=1}^{K}{\exp({\mathbf{\Theta}_j \mathbf{x}}) }}\]

where \(K\) is the number of classes and \(\mathbf{\Theta}_k\) is the \(k\)-th row of the parameter matrix \(\mathbf{\Theta}\).

Note

There should be a more concise way to express this; to be looked up.

Loss function

\[L(\mathbf{y},\mathbf{\bar{y}}) = - \sum_{k=1}^{K}{\mathbf{y}_k \log{\mathbf{\bar{y}}_k}}\]

Objective function

\[\begin{split}\begin{align*} O(\mathbf{X},\mathbf{Y};\mathbf{\Theta}) & = \sum_{i=1}^{m} L(\mathbf{Y}_i,\mathbf{\bar{Y}}_i) \\ & = - \sum_{i=1}^{m} \sum_{k=1}^{K} {\mathbf{Y}}_{i}^{k} \log{\mathbf{\bar{Y}}_{i}^{k}} \end{align*}\end{split}\]

Optimizing

Gradient-based
(Figure: softmax gradient derivation; see _images/softmax-Gradient-based.png)

Note

In the derivation, \((\exp({\Theta_k \mathbf{X}_i}))^{'}\) and \((\sum\limits_{k=1}^{K} \exp({\Theta_k \mathbf{X}_i}))^{'}\) denote derivatives with respect to \(\Theta_k \mathbf{X}_i\).
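
Since the derivation above is only available as an image, here is a minimal NumPy sketch of the prediction function together with the standard gradient of the cross-entropy objective with respect to the parameter matrix; subtracting the row maximum inside the softmax is a common numerical-stability trick added in this sketch, not part of the formulas above:

    import numpy as np

    def softmax(scores):
        """Row-wise softmax; shifting by the max leaves the result unchanged but avoids overflow."""
        shifted = scores - scores.max(axis=1, keepdims=True)
        exp = np.exp(shifted)
        return exp / exp.sum(axis=1, keepdims=True)

    def softmax_gradient(X, Y, Theta):
        """Gradient of the cross-entropy objective with respect to Theta.

        X: (m, n) features, Y: (m, K) one-hot labels, Theta: (n, K) parameters.
        Uses the standard result d(cross-entropy)/d(scores) = probabilities - Y.
        """
        probs = softmax(X @ Theta)      # (m, K) predicted class probabilities
        return X.T @ (probs - Y)        # (n, K)

    # Tiny usage example
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))
    Y = np.eye(4)[rng.integers(0, 4, size=5)]   # one-hot labels for 4 classes
    Theta = np.zeros((3, 4))
    print(softmax(X @ Theta))                   # uniform probabilities at Theta = 0
    print(softmax_gradient(X, Y, Theta))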

Softmax regression: why

Why does the prediction function take this form?

It holds under the three assumptions of the generalized linear model construction. Under assumption one, \(P(\mathbf{y}|x;\phi) \sim \text{Categorical}(\phi_1,\phi_2,...,\phi_k)\), which gives the following derivation:

\[\begin{split}P(\mathbf{y};\phi,k) & = {\phi_1}^{\mathbf{y}_1} {\phi_2}^{\mathbf{y}_2}...{\phi_i}^{\mathbf{y}_i}...{\phi_k}^{ 1 - \sum\limits_{i=1}^{k-1}\mathbf{y}_i} \\ & = \exp{\left( \mathbf{y}_1 \log{\phi_1} + \mathbf{y}_2 \log{\phi_2}+...+\mathbf{y}_i \log{\phi_i}+...+\left(1 - \sum\limits_{i=1}^{k-1}\mathbf{y}_i\right) \log{\phi_k}\right)} \\ & = \exp{\left( \mathbf{y}_1 \log{ \frac{\phi_1}{\phi_k}} + \mathbf{y}_2 \log{ \frac{\phi_2}{\phi_k}} +...+\mathbf{y}_{k-1} \log{ \frac{\phi_{k-1}}{\phi_k}} + \log{\phi_k} \right)} \\ & = \exp{\left( \begin{bmatrix} \log{ \frac{\phi_1}{\phi_k}} \\ \log{ \frac{\phi_2}{\phi_k}} \\ ... \\ \log{\frac{\phi_{k-1}}{\phi_k}} \\ \log{\frac{\phi_k}{\phi_k}} \end{bmatrix}^T \mathbf{y} + \log{\phi_k}\right)}\end{split}\]

From the above, \(\eta=\begin{bmatrix} \log{ \frac{\phi_1}{\phi_k}} \\ \log{ \frac{\phi_2}{\phi_k}} \\ ... \\ \log{\frac{\phi_{k-1}}{\phi_k}} \\ \log{\frac{\phi_k}{\phi_k}} \end{bmatrix}\). It further follows that:

\[\begin{split}\eta_i & = \log{\frac{\phi_i}{\phi_k}} \\ \phi_i & = \phi_k \exp({\eta_i}) \\ 1 & = \sum\limits_{i=1}^{k} \phi_i = \phi_k \sum\limits_{i=1}^{k} \exp({\eta_i}) \\ \phi_k & = \frac{1}{\sum\limits_{i=1}^{k} \exp({\eta_i})}\end{split}\]

Substituting \(\phi_k = \frac{1}{\sum\limits_{j=1}^{k} \exp({\eta_j})}\) into \(\phi_i = \phi_k \exp({\eta_i})\) gives the following result:

\[\phi_i = \frac{ \exp({\eta_i}) }{\sum\limits_{j=1}^{k} \exp({\eta_j})}\]

From assumption two we can derive:

\[\begin{split}f(x) & =\mathbf{E}[\mathbf{y}|x;\phi] \\ & = \begin{bmatrix} \phi_1 \\ \phi_2 \\ ... \\ \phi_k \end{bmatrix}\end{split}\]

Assumption three states \(\eta_i=\theta_{i}^{T} \mathbf{x}\). Combining the results of assumptions one, two, and three gives:

\[\begin{split}f(x) & = \begin{bmatrix} \frac{ \exp({\theta_{1}^{T} \mathbf{x}}) }{\sum\limits_{i=1}^{k} \exp({\theta_{i}^{T} \mathbf{x}})} \\ \frac{ \exp({\theta_{2}^{T} \mathbf{x}}) }{\sum\limits_{i=1}^{k} \exp({\theta_{i}^{T} \mathbf{x}})} \\ ... \\ \frac{ \exp({\theta_{k}^{T} \mathbf{x}}) }{\sum\limits_{i=1}^{k} \exp({\theta_{i}^{T} \mathbf{x}})} \end{bmatrix}\end{split}\]

Note

\(1 \leq i \leq k\)

Neural layers

Convolution layer

Illustration

Expressions

1-D
\[O(n) = \sum\limits_{i=n}^{n+w-1}I(i)K(i-n)\]
\[\frac{\partial{O(n)}}{\partial{K}} = I[n:w+n]\]
Total gradient
\[\frac{\partial{\left(\sum\limits_{n=0}^{N}O(n)\right)}}{\partial{K}} = \sum\limits_{n=0}^{N} I[n:w+n]\]
2-D
\[O(m,n) = \sum\limits_{i=m}^{m+w-1}\sum\limits_{j=n}^{n+h-1} I(i,j)K(i-m,j-n)\]
\[\frac{\partial{O(m,n)}}{\partial{K}} = I[m:w+m;n:h+n]\]
Total gradient
\[\frac{\partial{\left(\sum\limits_{m=0}^{M}\sum\limits_{n=0}^{N} O(m,n)\right)}}{\partial{K}} = \sum\limits_{m=0}^{M}\sum\limits_{n=0}^{N} I[m:w+m;n:h+n]\]
3-D
\[O(m,n,v) = \sum\limits_{i=m}^{m+w-1}\sum\limits_{j=n}^{n+h-1}\sum\limits_{k=v}^{v+l-1} I(i,j,k)K(i-m,j-n,k-v)\]
\[\frac{\partial{O(m,n,v)}}{\partial{K}} = I[m:w+m;n:h+n;v:l+v]\]
Total gradient
\[\frac{\partial{\left(\sum\limits_{m=0}^{M}\sum\limits_{n=0}^{N} \sum\limits_{v=0}^{V} O(m,n,v)\right)}}{\partial{K}} = \sum\limits_{m=0}^{M}\sum\limits_{n=0}^{N}\sum\limits_{v=0}^{V} I[m:w+m;n:h+n;v:l+v]\]
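
As a concrete reference for the 2-D expressions, a naive NumPy sketch of the forward pass and the kernel gradient is given below; it assumes plain cross-correlation with stride 1 and no padding, and takes the upstream gradient as all ones to match the total gradient above:

    import numpy as np

    def conv2d_forward(I, K):
        """O(m, n) = sum_{i, j} I(m + i, n + j) * K(i, j): slide K over I, stride 1, no padding."""
        H, W = I.shape
        kh, kw = K.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for m in range(out.shape[0]):
            for n in range(out.shape[1]):
                out[m, n] = np.sum(I[m:m + kh, n:n + kw] * K)
        return out

    def conv2d_kernel_grad(I, K, dO):
        """dL/dK = sum_{m, n} dO(m, n) * I[m:m+kh, n:n+kw]: input windows weighted by the upstream gradient."""
        kh, kw = K.shape
        dK = np.zeros_like(K)
        for m in range(dO.shape[0]):
            for n in range(dO.shape[1]):
                dK += dO[m, n] * I[m:m + kh, n:n + kw]
        return dK

    I = np.arange(25, dtype=float).reshape(5, 5)
    K = np.ones((3, 3))
    O = conv2d_forward(I, K)
    dK = conv2d_kernel_grad(I, K, np.ones_like(O))   # total gradient with dO = 1 everywhere
    print(O.shape, dK)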

Warning

The gradient computations here still need to be checked against references.

Efficient implementation

Fully connected layer

Illustration

Expressions

1-D (vector)
\[O(n) = \sum\limits_{i=1}^{x}I(i)K(i)_n\]
\[\frac{\partial{O(n)}}{\mathrm{d}K_n} = I\]
2-D (matrix)
\[O(n) = \sum\limits_{i=1}^{x}\sum\limits_{j=1}^{y} I(i,j)K(i,j)_n\]
\[\frac{\partial{O(n)}}{\mathrm{d}K_n} = I\]
3-D (tensor)
\[O(n) = \sum\limits_{i=1}^{x}\sum\limits_{j=1}^{y} \sum\limits_{k=1}^{z}I(i,j,k)K(i,j,k)_n\]
\[\frac{\partial{O(n)}}{\mathrm{d}K_n} = I\]
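
A minimal sketch of the vector case above: each output neuron \(n\) takes a dot product of the input with its own weight vector, and the gradient of \(O(n)\) with respect to those weights is the input itself; the shapes and names here are illustrative:

    import numpy as np

    def fc_forward(I, K):
        """O(n) = sum_i I(i) * K(n, i); K stores one weight row per output neuron."""
        return K @ I

    def fc_weight_grad(I, dO):
        """dL/dK(n, i) = dO(n) * I(i): each row of the weight gradient is a scaled copy of the input."""
        return np.outer(dO, I)

    I = np.array([1.0, 2.0, 3.0])                  # input vector
    K = np.arange(6, dtype=float).reshape(2, 3)    # 2 output neurons, 3 inputs each
    O = fc_forward(I, K)
    print(O)                                       # forward outputs
    print(fc_weight_grad(I, np.ones_like(O)))      # with upstream gradient 1, each row equals I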

Efficient implementation

Pooling layer

Illustration

Expressions

Max pooling
1-D
\[O(m) = \sum\limits_{i=m}^{w+m} I(i) K(i)\]
\[\begin{split}K(i) = \begin{cases} 1 & \text{if } I(i) \text{ is the maximum in the window} \\ 0 & \text{otherwise} \end{cases}\end{split}\]
2-D
\[O(m,n) = \sum\limits_{i=m}^{w+m}\sum\limits_{j=n}^{h+n} I(i,j) K(i,j)\]
\[\begin{split}K(i,j) = \begin{cases} 1 & \text{if } I(i,j) \text{ is the maximum in the window} \\ 0 & \text{otherwise} \end{cases}\end{split}\]
3-D
\[O(m,n,v) = \sum\limits_{i=m}^{w+m}\sum\limits_{j=n}^{h+n} \sum\limits_{k=v}^{l+v} I(i,j,k) K(i,j,k)\]
\[\begin{split}K(i,j,k) = \begin{cases} 1 & \text{if } I(i,j,k) \text{ is the maximum in the window} \\ 0 & \text{otherwise} \end{cases}\end{split}\]
Average pooling
1-D
\[O(m) = \sum\limits_{i=m}^{w+m} I(i) K(i)\]
\[K(i) = \frac{1}{w}\]
2-D
\[O(m,n) = \sum\limits_{i=m}^{w+m}\sum\limits_{j=n}^{h+n} I(i,j) K(i,j)\]
\[K(i,j) = \frac{1}{w \times h}\]
3-D
\[O(m,n,v) = \sum\limits_{i=m}^{w+m}\sum\limits_{j=n}^{h+n} \sum\limits_{k=v}^{l+v} I(i,j,k) K(i,j,k)\]
\[K(i,j,k) = \frac{1}{w \times h \times l}\]
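
A short sketch of the 1-D case: both pooling types are window reductions, an indicator of the maximum for max pooling and a constant \(1/w\) kernel for average pooling. The non-overlapping windows (stride equal to the window size) are an assumption of this sketch:

    import numpy as np

    def max_pool_1d(I, w):
        """Non-overlapping 1-D max pooling with window size w (stride = w)."""
        trimmed = I[: (len(I) // w) * w]          # drop the tail that does not fill a window
        return trimmed.reshape(-1, w).max(axis=1)

    def avg_pool_1d(I, w):
        """Non-overlapping 1-D average pooling: equivalent to K(i) = 1 / w inside each window."""
        trimmed = I[: (len(I) // w) * w]
        return trimmed.reshape(-1, w).mean(axis=1)

    I = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.0, 7.0])
    print(max_pool_1d(I, 2))   # [3. 5. 4.]
    print(avg_pool_1d(I, 2))   # [2.  3.5 2. ]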

Efficient implementation

  • Describe the relationship between the current layer and the next layer
  • The trade-off between the number of parameters and the number of neurons
  • A unified framework (mathematical formulation) describing all layers

Neural network models

LeNet-5

Illustration

Characteristics

MLP

Illustration

Advantages

Todo

  • Reorganize the relevant knowledge points of the rst documents (OK)
  • Initialize the Machine-Learning-Model-Analysis repository and write the related content (OK)
  • Process the machine learning e-book (OK)
  • gradle notes (OK)
  • For solving linear regression, look into the case where the input matrix is not square.
  • Add an appendix on matrix trace identities
  • Linear regression extensions (Lasso, Ridge, locally weighted linear regression)
  • Look up extensions of logistic regression
  • Solve logistic regression with the IRLS method
  • Describe the maximum entropy model
  • Describe the relationship between the current layer and the next layer
  • The trade-off between the number of parameters and the number of neurons
  • A unified framework (mathematical formulation) describing all layers
  • Efficient implementation of the layers

MxNet-Hands-On-Deep-Learning

Lesson 1: From getting started to multiclass classification (video not yet watched)

  • Linear regression: theory preliminarily complete
  • Logistic regression: theory preliminarily complete
  • Softmax regression: theory preliminarily complete

Lesson 2: Overfitting, multilayer perceptrons, GPUs, and convolutional neural networks (watched)

  • MLP implementation: implemented, not yet tested
  • Why is test accuracy sometimes higher than training accuracy?
  • What causes overfitting?
  • LeNet model implementation: partially implemented, not yet tested
  • Analysis of convolution, fully connected, and pooling layers: theory preliminarily complete
  • Convolution, fully connected, and pooling layers: applicable scenarios

Indices and tables