# Brief overview of machine learning

## Content

### NN

First, we use a model of the form

$h\left(x,\theta \right)$

$h$ can be a single layer or multiple layers.

$[\mathcal{X},\mathcal{Y}]=[x^{(i)},y^{(i)}]_{i=1,2,\dots ,m},\qquad x^{(i)}\in \mathbb{R}^{n},\quad y^{(i)}\in \mathbb{R}^{1}$

Loss function

$J(\theta)=\frac{1}{2}\sum_{i=1}^{m}\left(h(x^{(i)},\theta)-y^{(i)}\right)^{2}$

Derivatives

$\frac{\partial J(\theta)}{\partial \theta}=\sum_{i=1}^{m}\left[\left(h(x^{(i)},\theta)-y^{(i)}\right)\frac{\partial h(x^{(i)},\theta)}{\partial \theta}\right]$

$\frac{\partial^{2} J(\theta)}{\partial \theta^{2}}=\sum_{i=1}^{m}\left[\left(\frac{\partial h(x^{(i)},\theta)}{\partial \theta}\right)^{2}+\left(h(x^{(i)},\theta)-y^{(i)}\right)\frac{\partial^{2} h(x^{(i)},\theta)}{\partial \theta^{2}}\right]$

When $h$ is a linear regression model, the second-derivative term vanishes because $\frac{\partial^{2} h}{\partial \theta^{2}}=0$, so

$\frac{\partial^{2} J(\theta)}{\partial \theta^{2}}\ge 0$

It follows that the linear regression loss is convex.

We define the residual

$\epsilon^{(i)}=h(x^{(i)},\theta)-y^{(i)}$

and assume that it obeys a normal distribution with mean 0.

$\epsilon \sim N(0,\sigma^{2})$

$P(\epsilon^{(i)})=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(\epsilon^{(i)}-0)^{2}}{2\sigma^{2}}\right)$

It implies that

$P\left(y^{(i)}\mid x^{(i)};\theta\right)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\left(h(x^{(i)},\theta)-y^{(i)}\right)^{2}}{2\sigma^{2}}\right)$

The step between the two formulas is just the substitution $\epsilon^{(i)}=h(x^{(i)},\theta)-y^{(i)}$: for a fixed $x^{(i)}$, the target $y^{(i)}$ is normally distributed around $h(x^{(i)},\theta)$.

Likelihood function

$L(\theta)=\prod_{i=1}^{m}P\left(y^{(i)}\mid x^{(i)};\theta\right)$

The larger $L$ is, the closer $h(x)$ is to the underlying function $f(x)$.

Then we can derive the following formula from $L$.

Maximizing $\log(L)$ is equivalent to maximizing $L$, because $\log$ is monotonically increasing.

$\begin{aligned}\log(L(\theta))&=\log\left(\prod_{i=1}^{m}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(h(x^{(i)},\theta)-y^{(i)})^{2}}{2\sigma^{2}}\right)\right)\\&=\sum_{i=1}^{m}\log\left[\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(h(x^{(i)},\theta)-y^{(i)})^{2}}{2\sigma^{2}}\right)\right]\\&=\sum_{i=1}^{m}\log\frac{1}{\sqrt{2\pi}\,\sigma}-\sum_{i=1}^{m}\frac{(h(x^{(i)},\theta)-y^{(i)})^{2}}{2\sigma^{2}}\end{aligned}$

Conclusion

1. The smaller the difference between $h(x,\theta)$ and $y$, the larger the likelihood $L$ is; in other words, the model fits $f(x)$ well near the points $[x_{i},y_{i}]$.
2. When the difference between $h(x,\theta)$ and $y$ is small enough, the smaller the variance $\sigma^{2}$ is, the larger $L$ becomes. Note that in this case the model fits $f(x)$ well overall, not just near the points $[x_{i},y_{i}]$.
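To make the equivalence concrete, here is a small Python sketch (the 1-D dataset and the linear model $h(x,\theta)=\theta x$ are made-up assumptions) showing that the $\theta$ minimizing the squared loss $J(\theta)$ is the same $\theta$ that maximizes the Gaussian log-likelihood:

```python
import math

# Hypothetical 1-D dataset and linear model h(x, theta) = theta * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
sigma = 1.0  # assumed fixed noise standard deviation

def squared_loss(theta):
    # J(theta) = 1/2 * sum (h(x) - y)^2
    return 0.5 * sum((theta * x - y) ** 2 for x, y in zip(xs, ys))

def log_likelihood(theta):
    # log L(theta) = sum log N(y | theta * x, sigma^2)
    return sum(
        -math.log(math.sqrt(2 * math.pi) * sigma)
        - (theta * x - y) ** 2 / (2 * sigma ** 2)
        for x, y in zip(xs, ys)
    )

# Over a grid of candidate parameters, the theta that minimizes J
# is exactly the theta that maximizes log L.
grid = [i / 100 for i in range(100, 300)]
best_by_loss = min(grid, key=squared_loss)
best_by_ll = max(grid, key=log_likelihood)
assert best_by_loss == best_by_ll
```

With $\sigma$ fixed, the $\log\frac{1}{\sqrt{2\pi}\,\sigma}$ term is constant, so maximizing $\log L$ reduces to minimizing the sum of squared residuals.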

### Cross-Entropy

Let us consider binary classification.

Sigmoid

$x,\theta \in \mathbb{R}^{n}$

$h:\mathbb{R}^{n}\to \left[0,1\right]$

$h_{\theta}(x)=\frac{1}{1+e^{-\theta^{T}x}}$

Let us assume that

$P(y=1\mid x;\theta)=h_{\theta}(x),\qquad P(y=0\mid x;\theta)=1-h_{\theta}(x)$

Then

$P(y\mid x;\theta)=\left[h_{\theta}(x)\right]^{y}\left[1-h_{\theta}(x)\right]^{1-y}$

Likelihood function

$\begin{aligned}L&=\prod_{i=1}^{m}P\left(y^{(i)}\mid x^{(i)};\theta\right)\\&=\prod_{i=1}^{m}\left[h_{\theta}(x^{(i)})\right]^{y^{(i)}}\left[1-h_{\theta}(x^{(i)})\right]^{1-y^{(i)}}\end{aligned}$

Log likelihood function

$\mathrm{log}L=\sum _{i=1}^{m}{y}^{\left(i\right)}\mathrm{log}{h}_{\theta }\left({x}^{\left(i\right)}\right)+\left(1-{y}^{\left(i\right)}\right)\mathrm{log}\left(1-{h}_{\theta }\left({x}^{\left(i\right)}\right)\right)$

Iterative (gradient ascent) update formula

${\theta }_{j}={\theta }_{j}+\alpha \left({y}^{\left(i\right)}-{h}_{\theta }\left({x}^{\left(i\right)}\right)\right){x}_{j}^{\left(i\right)}$
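The update rule above can be sketched as stochastic gradient ascent on $\log L$; the toy dataset, learning rate, and iteration count below are illustrative assumptions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy 1-D binary classification data (hypothetical); x[0] is a bias feature.
data = [([1.0, 0.5], 0), ([1.0, 1.0], 0), ([1.0, 2.5], 1), ([1.0, 3.0], 1)]

theta = [0.0, 0.0]
alpha = 0.1  # learning rate

# Stochastic gradient ascent: theta_j += alpha * (y - h_theta(x)) * x_j
for _ in range(1000):
    for x, y in data:
        h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for j in range(len(theta)):
            theta[j] += alpha * (y - h) * x[j]

# The learned model separates the two classes.
assert sigmoid(theta[0] + theta[1] * 0.5) < 0.5
assert sigmoid(theta[0] + theta[1] * 3.0) > 0.5
```

Note the sign: because we are *maximizing* the log likelihood, the update adds the gradient instead of subtracting it.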

Softmax

$\theta^{(i)}\in \mathbb{R}^{n},\quad i=1,\dots ,k$

$p\left(y^{(i)}\mid x^{(i)};\theta\right)=\frac{\exp\left(\{\theta^{(y^{(i)})}\}^{T}x^{(i)}\right)}{\sum_{j=1}^{k}\exp\left(\{\theta^{(j)}\}^{T}x^{(i)}\right)}$

$L=\prod_{i=1}^{m}\frac{\exp\left(\{\theta^{(y^{(i)})}\}^{T}x^{(i)}\right)}{\sum_{j=1}^{k}\exp\left(\{\theta^{(j)}\}^{T}x^{(i)}\right)}$

$\log L=\sum_{i=1}^{m}\log\frac{\exp\left(\{\theta^{(y^{(i)})}\}^{T}x^{(i)}\right)}{\sum_{j=1}^{k}\exp\left(\{\theta^{(j)}\}^{T}x^{(i)}\right)}$
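A minimal sketch of the softmax probability $p(y\mid x;\theta)$, with hypothetical parameters ($k=3$ classes, $n=2$ features); subtracting the maximum score before exponentiating is a standard trick to avoid overflow:

```python
import math

def softmax_probs(thetas, x):
    # p(y = c | x) = exp(theta_c . x) / sum_j exp(theta_j . x)
    # Subtract the max score first for numerical stability.
    scores = [sum(t * xi for t, xi in zip(theta_c, x)) for theta_c in thetas]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# k = 3 classes, n = 2 features (made-up parameters).
thetas = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
probs = softmax_probs(thetas, [2.0, 1.0])
assert abs(sum(probs) - 1.0) < 1e-12
assert probs[0] == max(probs)  # class 0 has the largest score theta_0 . x = 2
```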

### Hidden Markov Model (HMM)

$\begin{array}{ccccccccc} y_{1} & \to & y_{2} & \to & \cdots & \to & y_{i} & \to & y_{n}\\ \downarrow & & \downarrow & & & & \downarrow & & \downarrow \\ x_{1} & & x_{2} & & \cdots & & x_{i} & & x_{n}\end{array}$

$y_{1},y_{2},\cdots ,y_{n}$ : state variables

$s_{1},s_{2},\cdots ,s_{N}$ : state space

$\mathcal{Y}=\left\{s_{1},s_{2},\dots ,s_{N}\right\},\quad y_{i}\in \mathcal{Y}$

$x_{1},x_{2},\cdots ,x_{n}$ : observable variables

$o_{1},o_{2},\cdots ,o_{M}$ : observable space

$\mathcal{X}=\left\{o_{1},o_{2},\dots ,o_{M}\right\},\quad x_{i}\in \mathcal{X}$

joint probability distribution:

$P\left({x}_{1},{y}_{1},\dots ,{x}_{n},{y}_{n}\right)=P\left({y}_{1}\right)P\left({x}_{1}|{y}_{1}\right)\prod _{i=2}^{n}P\left({y}_{i}|{y}_{i-1}\right)P\left({x}_{i}|{y}_{i}\right)$

state-transition matrix:

$A=\left[a_{ij}\right]_{N\times N},\qquad a_{ij}=P\left(y_{t+1}=s_{j}\mid y_{t}=s_{i}\right)$

observable probability matrix:

$B=\left[b_{ij}\right]_{N\times M},\qquad b_{ij}=P\left(x_{t}=o_{j}\mid y_{t}=s_{i}\right)$

Initial state probability:

$\pi =\left(\pi_{1},\pi_{2},\dots ,\pi_{N}\right),\qquad \pi_{i}=P\left(y_{1}=s_{i}\right)$

HMM:

$\lambda =\left[A,B,\pi \right]$

How to generate the observable series $[x_{1},x_{2},\cdots ,x_{n}]$:

1. Set $t = 1$ and draw $y_{1}$ according to the initial state probability $\pi$.
2. Draw $x_{t}$ from $y_{t}$ according to $B$, the observable probability matrix.
3. Draw $y_{t+1}$ from $y_{t}$ according to $A$, the state-transition matrix.
4. If $t < n$, set $t = t + 1$ and go to step 2; otherwise stop.
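The generation procedure above can be sketched in Python; the matrices $A$, $B$, and $\pi$ below are made-up examples with $N = M = 2$:

```python
import random

# A small hypothetical HMM: N = 2 states, M = 2 observations.
A  = [[0.7, 0.3], [0.4, 0.6]]   # state-transition matrix, a_ij = P(s_j | s_i)
B  = [[0.9, 0.1], [0.2, 0.8]]   # observation matrix, b_ij = P(o_j | s_i)
pi = [0.6, 0.4]                 # initial state probabilities

def draw(dist):
    # Sample an index from a discrete distribution.
    r, acc = random.random(), 0.0
    for i, p in enumerate(dist):
        acc += p
        if r < acc:
            return i
    return len(dist) - 1

def generate(n):
    # Follow the four steps above: draw the initial state, then emit/transition.
    y = draw(pi)
    xs = []
    for _ in range(n):
        xs.append(draw(B[y]))   # emit x_t from y_t using B
        y = draw(A[y])          # move to y_{t+1} using A
    return xs

obs = generate(10)
assert len(obs) == 10 and all(o in (0, 1) for o in obs)
```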

### Markov Random Field (MRF)

Clique ($\mathcal{C}$): any fully connected subset of nodes. The maximal cliques ($\mathcal{C}^{\ast}$) of the example graph are $\{x_{1},x_{2}\},\{x_{1},x_{3}\},\{x_{2},x_{4}\},\{x_{3},x_{5}\},\{x_{2},x_{5},x_{6}\}$.

$P(x)=\frac{1}{Z}\prod_{Q\in \mathcal{C}}\psi_{Q}\left(x_{Q}\right),\qquad Z=\sum_{x}\left[\prod_{Q\in \mathcal{C}}\psi_{Q}\left(x_{Q}\right)\right]$

For convenience, we express the probability using only the maximal cliques ($\mathcal{C}^{\ast}$).

$P(x)=\frac{1}{Z^{\ast}}\prod_{Q\in \mathcal{C}^{\ast}}\psi_{Q}\left(x_{Q}\right),\qquad Z^{\ast}=\sum_{x}\left[\prod_{Q\in \mathcal{C}^{\ast}}\psi_{Q}\left(x_{Q}\right)\right]$

Following the example graph above:

$P\left(x\right)=\frac{1}{Z}{\psi }_{12}\left({x}_{1},{x}_{2}\right){\psi }_{13}\left({x}_{1},{x}_{3}\right){\psi }_{24}\left({x}_{2},{x}_{4}\right){\psi }_{35}\left({x}_{3},{x}_{5}\right){\psi }_{256}\left({x}_{2},{x}_{5},{x}_{6}\right)$
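A brute-force sketch of $P(x)$ and $Z^{\ast}$ for the maximal cliques above, assuming binary variables and a made-up potential function $\psi$ that rewards agreement within a clique:

```python
from itertools import product

# Hypothetical potential: reward cliques whose variables all agree.
def psi(assignment):
    return 2.0 if len(set(assignment)) == 1 else 1.0

# Maximal cliques of the example graph, with x_1..x_6 stored 0-indexed.
cliques = [(0, 1), (0, 2), (1, 3), (2, 4), (1, 4, 5)]

def unnormalized(x):
    p = 1.0
    for q in cliques:
        p *= psi(tuple(x[i] for i in q))
    return p

# Z* sums the product of clique potentials over all 2^6 assignments.
Z = sum(unnormalized(x) for x in product([0, 1], repeat=6))

def prob(x):
    return unnormalized(x) / Z

# The probabilities sum to 1 by construction.
total = sum(prob(x) for x in product([0, 1], repeat=6))
assert abs(total - 1.0) < 1e-9
```

Enumerating all assignments is only feasible for tiny graphs; it is used here purely to illustrate the role of the partition function $Z^{\ast}$.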

### MLP

$\text{input}: x_{1}, x_{2}\qquad \text{output}: y_{1}\qquad \text{activation}: \phi(x)$

### maxout

A maxout layer supersedes the activation layer (e.g. ReLU, Sigmoid).

### maxpooling

### RBF network

The middle layer is the difference between an RBF network and a BP network:

• BP: affine (linear)

• RBF: Gaussian RBF units (nonlinear):

$\exp\left(-\gamma\,\|x-c_{i}\|^{2}\right)$

An RBF network has only one middle (hidden) layer. Each unit's response is maximal when the input is near its center $c_{i}$, and decays the farther the input moves from the center.
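A one-function sketch of a Gaussian RBF unit (the center $c_{i}$ and width $\gamma$ below are arbitrary choices):

```python
import math

def rbf_unit(x, c, gamma=1.0):
    # Gaussian RBF response: exp(-gamma * ||x - c||^2)
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return math.exp(-gamma * sq_dist)

center = [0.0, 0.0]
assert rbf_unit(center, center) == 1.0  # maximal response at the center
# The response decays with distance from the center.
assert rbf_unit([1.0, 1.0], center) > rbf_unit([2.0, 2.0], center)
```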

### autoencoder

https://www.cs.toronto.edu/~lczhang/360/lec/w05/autoencoder.html

### AlexNet

ImageNet Classification with Deep Convolutional Neural Networks, 2012

Prevent overfitting:

1. Data Augmentation

Generate new data from the original data by simple transformations, e.g. rotation.

2. Dropout

### Regularization

As usual, we have the following optimization problem:

$\min_{\theta}\; J(\theta)$

If some of the model's parameters are very large, the model tends to overfit.

So we add a weight-decay term (with coefficient $\lambda$) to keep the parameter values from growing too large.
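One common concrete form of the regularized problem (assuming an L2 penalty, i.e. weight decay) is:

```latex
\min_{\theta}\; J(\theta) + \lambda \sum_{j} \theta_{j}^{2}
```

A larger $\lambda$ penalizes large parameter values more strongly, trading training fit for a smoother model.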

### Decision tree

Information entropy

$Ent\left(D\right)=-\sum _{k=1}^{|\mathcal{Y}|}{p}_{k}{\mathrm{log}}_{2}{p}_{k}$

Information gain

$\max\quad Gain(D,a)=Ent(D)-\sum_{v=1}^{V}\frac{|D^{v}|}{|D|}Ent(D^{v})$

ID3 algorithm

${a}^{\ast }=\mathrm{arg}\underset{a\in A}{max}Gain\left(D,a\right)$

It is biased toward attributes with many possible values.

C4.5 algorithm, gain ratio

$IV(a)=-\sum_{v=1}^{V}\frac{|D^{v}|}{|D|}\log_{2}\frac{|D^{v}|}{|D|}$

$GainRatio(D,a)=\frac{Gain(D,a)}{IV(a)}$

Conversely, the gain ratio is biased toward attributes with few values. As a heuristic, C4.5 first selects the attributes whose information gain is above average, and then chooses the one with the highest gain ratio among them.

CART

$\begin{array}{rl}Gini\left(D\right)& =\sum _{k=1}^{|\mathcal{Y}|}\sum _{{k}^{\prime }\ne k}{p}_{k}{p}_{{k}^{\prime }}\\ & =1-\sum _{k=1}^{|\mathcal{Y}|}{p}_{k}^{2}\end{array}$
$GiniIndex\left(D,a\right)=\sum _{v=1}^{V}\frac{|{D}^{v}|}{|D|}Gini\left({D}^{v}\right)$
${a}^{\ast }=\mathrm{arg}\underset{a\in A}{min}GiniIndex\left(D,a\right)$
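The splitting criteria above can be computed directly; a small Python sketch with a made-up four-example dataset and a perfectly predictive attribute:

```python
import math
from collections import Counter

def entropy(labels):
    # Ent(D) = -sum_k p_k * log2(p_k)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(labels, attr_values):
    # Gain(D, a) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v)
    n = len(labels)
    groups = {}
    for v, y in zip(attr_values, labels):
        groups.setdefault(v, []).append(y)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

def gini(labels):
    # Gini(D) = 1 - sum_k p_k^2
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

labels = [1, 1, 0, 0]
attr   = ['a', 'a', 'b', 'b']  # perfectly predictive attribute
assert entropy(labels) == 1.0  # two balanced classes: 1 bit
assert gain(labels, attr) == 1.0  # splitting on attr removes all uncertainty
assert gini(labels) == 0.5
```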

Requirement

1. It requires labeled data.

### Bagging and Random Forest

In bagging, we draw $T$ bootstrap subsets from the dataset, each with $m$ examples sampled with replacement. Then $T$ prototype models are trained independently, one per subset. When a test point needs to be predicted, we feed it to all $T$ models and combine their outputs, e.g. by majority voting (classification) or averaging (regression).

A random forest is bagging with decision trees as the base learners; in addition, each split considers only a random subset of the attributes.
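A minimal bagging sketch with bootstrap sampling and majority voting; the 1-nearest-neighbour base learner, the toy dataset, and $T = 11$ are illustrative choices, not prescribed by the text:

```python
import random
from collections import Counter

# A toy "prototype model": 1-nearest-neighbour on a bootstrap sample.
def train_1nn(sample):
    def predict(x):
        return min(sample, key=lambda p: abs(p[0] - x))[1]
    return predict

data = [(0.0, 0), (0.5, 0), (1.0, 0), (4.0, 1), (4.5, 1), (5.0, 1)]

random.seed(0)
T = 11
models = []
for _ in range(T):
    # Bootstrap: draw m examples with replacement.
    boot = [random.choice(data) for _ in range(len(data))]
    models.append(train_1nn(boot))

def bagged_predict(x):
    # Combine the T models by majority vote.
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

assert bagged_predict(0.2) == 0
assert bagged_predict(4.8) == 1
```

An odd $T$ avoids ties in the binary vote; individual bootstrap models may be poor, but the vote is robust.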

### Knowledge Graph

In knowledge representation and reasoning, a knowledge graph is a knowledge base that uses a graph-structured data model or topology to integrate data. Knowledge graphs are often used to store interlinked descriptions of entities - objects, events, situations, or abstract concepts - while also encoding the semantics underlying the terminology used.

Since the development of the Semantic Web, knowledge graphs have often been associated with linked open data projects, focusing on the connections between concepts and entities. They are also prominently associated with and used by search engines such as Google, Bing, and Yahoo; knowledge engines and question-answering services such as WolframAlpha, Apple's Siri, and Amazon Alexa; and social networks such as LinkedIn and Facebook.


## Appendices

### Model averaging

Each neural node's weight is averaged, which prevents overly dominant ("heavy") nodes from arising.

We want to get a model such that

$P(AB)=P(A)\,P(B\mid A)$

$P(A\mid B)\,P(B)=P(A)\,P(B\mid A)$