Brief overview of machine learning

Content

NN

First, we use the model

$$h(x, \theta)$$

Here $h$ can be a single layer or multiple layers.

$$[X, Y] = \left[x^{(i)}, y^{(i)}\right],\ i = 1, 2, \dots, m, \quad x^{(i)} \in \mathbb{R}^n,\ y^{(i)} \in \mathbb{R}^1$$

Loss function

$$J(\theta) = \frac{1}{2}\left[\sum_{i=1}^{m}\left(h(x^{(i)}, \theta) - y^{(i)}\right)^2\right]$$

Derivatives

$$\frac{\partial J(\theta)}{\partial \theta} = \sum_{i=1}^{m}\left[\left(h(x^{(i)}, \theta) - y^{(i)}\right)\frac{\partial h(x^{(i)}, \theta)}{\partial \theta}\right]$$

$$\frac{\partial^2 J(\theta)}{\partial \theta^2} = \sum_{i=1}^{m}\left[\left(\frac{\partial h(x^{(i)}, \theta)}{\partial \theta}\right)^2 + \left(h(x^{(i)}, \theta) - y^{(i)}\right)\frac{\partial^2 h(x^{(i)}, \theta)}{\partial \theta^2}\right]$$

When $h$ is just linear regression, the second derivative of $h$ with respect to $\theta$ vanishes, so only the squared first-derivative terms remain and we get

$$\frac{\partial^2 J(\theta)}{\partial \theta^2} \geq 0$$

It follows that the linear regression loss $J(\theta)$ is convex.
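
A minimal sketch of this setup, assuming $h(x, \theta) = \theta^T x$ (linear regression), NumPy, and illustrative toy data; the variable names and learning rate are not from the original notes:

```python
import numpy as np

def loss(theta, X, y):
    # J(theta) = 1/2 * sum_i (h(x_i, theta) - y_i)^2
    residual = X @ theta - y
    return 0.5 * np.sum(residual ** 2)

def gradient(theta, X, y):
    # dJ/dtheta = sum_i (h(x_i, theta) - y_i) * x_i
    return X.T @ (X @ theta - y)

# Toy data: y = 2*x + noise
rng = np.random.default_rng(0)
X = np.c_[np.ones(50), rng.uniform(-1, 1, 50)]   # bias column + one feature
y = 2.0 * X[:, 1] + 0.1 * rng.normal(size=50)

theta = np.zeros(2)
alpha = 0.01                                     # learning rate (illustrative)
for _ in range(500):
    theta -= alpha * gradient(theta, X, y)       # batch gradient descent on J(theta)

print(theta)                                     # approximately [0, 2]
```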

Define the residual

$$\varepsilon^{(i)} = h(x^{(i)}, \theta) - y^{(i)}$$

and assume it follows a normal distribution with mean 0.

$$\varepsilon \sim N(0, \sigma^2), \quad P(\varepsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(\varepsilon^{(i)} - 0)^2}{2\sigma^2}\right)$$

It implies that

$$P(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\left(\left(h(x^{(i)}, \theta) - y^{(i)}\right) - 0\right)^2}{2\sigma^2}\right)$$

To see how the previous formula turns into this one: given $x^{(i)}$ and $\theta$, the value $h(x^{(i)}, \theta)$ is fixed, so all the randomness in $y^{(i)}$ comes from $\varepsilon^{(i)}$; substituting $\varepsilon^{(i)} = h(x^{(i)}, \theta) - y^{(i)}$ into the density of $\varepsilon$ gives the conditional density of $y^{(i)}$.

Likelihood function

$$L(\theta) = \prod_{i=1}^{m}P(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\left(h(x^{(i)}, \theta) - y^{(i)}\right)^2}{2\sigma^2}\right)$$

The larger $L$ is, the closer $h(x)$ is to the true function $f(x)$.

We can then derive the following from $L$.

Maximizing $\log(L)$ is equivalent to maximizing $L$.

$$\log(L(\theta)) = \log\left(\prod_{i=1}^{m}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\left(h(x^{(i)}, \theta) - y^{(i)}\right)^2}{2\sigma^2}\right)\right) = \sum_{i=1}^{m}\log\left[\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\left(h(x^{(i)}, \theta) - y^{(i)}\right)^2}{2\sigma^2}\right)\right] = \sum_{i=1}^{m}\log\frac{1}{\sqrt{2\pi}\,\sigma} + \sum_{i=1}^{m}\left(-\frac{\left(h(x^{(i)}, \theta) - y^{(i)}\right)^2}{2\sigma^2}\right)$$

Since the first sum does not depend on $\theta$, maximizing $\log L$ amounts to minimizing $\sum_{i=1}^{m}\left(h(x^{(i)}, \theta) - y^{(i)}\right)^2$, i.e., the squared-error loss $J(\theta)$.

Conclusion

  1. The smaller the difference between $h(x, \theta)$ and $y$, the larger $L$ is; in other words, the model fits $f(x)$ well near the points $(x^{(i)}, y^{(i)})$.
  2. When the difference between $h(x, \theta)$ and $y$ is small enough, the smaller the variance $\sigma^2$ is, the larger $L$ becomes. Note that in this case the model fits $f(x)$ well overall, not only near the points $(x^{(i)}, y^{(i)})$.

Cross-Entropy

Let us consider binary classification.

Sigmoid

$$x, \theta \in \mathbb{R}^n, \quad h: \mathbb{R}^n \to [0, 1], \quad h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

Let us assume that

$$P(y = 1 \mid x; \theta) = h_\theta(x), \quad P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$

Then

$$P(y \mid x; \theta) = [h_\theta(x)]^{y}\,[1 - h_\theta(x)]^{1-y}$$

For $y = 1$: $P(y = 1 \mid x; \theta) = [h_\theta(x)]^{1}[1 - h_\theta(x)]^{0} = h_\theta(x)$; for $y = 0$ it reduces to $1 - h_\theta(x)$ in the same way.

Likelihood function

$$L = \prod_{i=1}^{m}P(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m}\left[h_\theta(x^{(i)})\right]^{y^{(i)}}\left[1 - h_\theta(x^{(i)})\right]^{1-y^{(i)}}$$

Log likelihood function

$$\log L = \sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right]$$

Iterative update formula

$$\theta_j := \theta_j + \alpha\left(y^{(i)} - h_\theta(x^{(i)})\right)x_j^{(i)}$$
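
A minimal sketch of this update rule, assuming NumPy and an illustrative toy dataset (the data, learning rate, and number of epochs are not from the original notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # illustrative labels

theta = np.zeros(2)
alpha = 0.1
for epoch in range(20):
    for x_i, y_i in zip(X, y):
        h = sigmoid(theta @ x_i)
        theta += alpha * (y_i - h) * x_i     # the update formula above, one example at a time

pred = (sigmoid(X @ theta) > 0.5).astype(float)
print("accuracy:", (pred == y).mean())
```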

Softmax

$$\theta^{(i)} \in \mathbb{R}^n, \quad i = 1, \dots, k$$

$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{\exp\left(\{\theta^{(y^{(i)})}\}^T x^{(i)}\right)}{\sum_{j=1}^{k}\exp\left(\{\theta^{(j)}\}^T x^{(i)}\right)}$$

$$L = \prod_{i=1}^{m}\frac{\exp\left(\{\theta^{(y^{(i)})}\}^T x^{(i)}\right)}{\sum_{j=1}^{k}\exp\left(\{\theta^{(j)}\}^T x^{(i)}\right)}, \quad \log L = \sum_{i=1}^{m}\log\frac{\exp\left(\{\theta^{(y^{(i)})}\}^T x^{(i)}\right)}{\sum_{j=1}^{k}\exp\left(\{\theta^{(j)}\}^T x^{(i)}\right)}$$
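
A minimal sketch of the softmax probabilities and one example's contribution to $-\log L$, assuming NumPy and illustrative parameter values:

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max()     # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

theta = np.array([[1.0, -1.0],         # theta^(1)
                  [0.5,  0.5],         # theta^(2)
                  [-1.0, 1.0]])        # theta^(3)
x = np.array([2.0, 1.0])
y = 0                                  # true class index (illustrative)

p = softmax(theta @ x)                 # p(y = j | x; theta) for each class j
nll = -np.log(p[y])                    # this example's contribution to -log L
print(p, nll)
```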

Hidden Markov Model (HMM)

(Graphical model: a chain of hidden states $y_1 \to y_2 \to \dots \to y_n$, where each state $y_i$ emits an observation $x_i$.)

$y_1, y_2, \dots, y_n$: state variables

$s_1, s_2, \dots, s_N$: state space

$$Y = \{s_1, s_2, \dots, s_N\}, \quad y_i \in Y$$

$x_1, x_2, \dots, x_n$: observation variables

$o_1, o_2, \dots, o_M$: observation space

$$X = \{o_1, o_2, \dots, o_M\}, \quad x_i \in X$$

Joint probability distribution:

$$P(x_1, y_1, \dots, x_n, y_n) = P(y_1)\,P(x_1 \mid y_1)\prod_{i=2}^{n}P(y_i \mid y_{i-1})\,P(x_i \mid y_i)$$

State-transition matrix:

$$A = [a_{ij}]_{N \times N}, \quad a_{ij} = P(y_{t+1} = s_j \mid y_t = s_i)$$

observable probability matrix:

B=[bij]N×Mbij=P(xt=oj|yt=si)

Initial state probability:

$$\pi = (\pi_1, \pi_2, \dots, \pi_N), \quad \pi_i = P(y_1 = s_i)$$

HMM:

$$\lambda = [A, B, \pi]$$

How to generate an observation sequence $[x_1, x_2, \dots, x_n]$ (see the sketch after the list):

  1. Set $t = 1$ and draw $y_1$ according to the initial state probability $\pi$.
  2. Draw $x_t$ from $y_t$ according to $B$, the observation probability matrix.
  3. Draw $y_{t+1}$ from $y_t$ according to $A$, the state-transition matrix.
  4. If $t < n$, set $t = t + 1$ and go back to step 2; otherwise stop.
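
A minimal sketch of this generation procedure, assuming NumPy and small illustrative matrices $A$, $B$, and $\pi$:

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[0.7, 0.3],          # state-transition matrix, a_ij = P(y_{t+1} = s_j | y_t = s_i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],          # observation probability matrix, b_ij = P(x_t = o_j | y_t = s_i)
              [0.2, 0.8]])
pi = np.array([0.6, 0.4])          # initial state probabilities

def sample_hmm(n):
    states, observations = [], []
    y = rng.choice(len(pi), p=pi)              # step 1: t = 1, draw y_1 from pi
    for t in range(n):
        x = rng.choice(B.shape[1], p=B[y])     # step 2: draw x_t from y_t using B
        states.append(int(y))
        observations.append(int(x))
        y = rng.choice(A.shape[1], p=A[y])     # step 3: draw y_{t+1} from y_t using A
    return states, observations                # step 4: stop once t = n

print(sample_hmm(10))
```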

Markov Random Field (MRF)

Clique (C):

(Figure: an example MRF graph over $x_1, \dots, x_6$.)

Maximal cliques ($C^*$): $\{x_1, x_2\}, \{x_1, x_3\}, \{x_2, x_4\}, \{x_3, x_5\}, \{x_2, x_5, x_6\}$

$$P(x) = \frac{1}{Z}\prod_{Q \in C}\psi_Q(x_Q), \quad Z = \sum_x\left[\prod_{Q \in C}\psi_Q(x_Q)\right]$$

For convenience, we can express the probability using only the maximal cliques ($C^*$).

$$P(x) = \frac{1}{Z^*}\prod_{Q \in C^*}\psi_Q(x_Q), \quad Z^* = \sum_x\left[\prod_{Q \in C^*}\psi_Q(x_Q)\right]$$

Following the graph above:

$$P(x) = \frac{1}{Z}\,\psi_{12}(x_1, x_2)\,\psi_{13}(x_1, x_3)\,\psi_{24}(x_2, x_4)\,\psi_{35}(x_3, x_5)\,\psi_{256}(x_2, x_5, x_6)$$
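
A minimal sketch of this factorization, assuming binary variables and illustrative potential tables; $Z$ is computed by brute-force enumeration, which is only feasible for tiny graphs:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
psi12  = rng.uniform(0.5, 2.0, size=(2, 2))      # illustrative potentials for each maximal clique
psi13  = rng.uniform(0.5, 2.0, size=(2, 2))
psi24  = rng.uniform(0.5, 2.0, size=(2, 2))
psi35  = rng.uniform(0.5, 2.0, size=(2, 2))
psi256 = rng.uniform(0.5, 2.0, size=(2, 2, 2))

def unnormalized(x):
    x1, x2, x3, x4, x5, x6 = x
    return (psi12[x1, x2] * psi13[x1, x3] * psi24[x2, x4]
            * psi35[x3, x5] * psi256[x2, x5, x6])

Z = sum(unnormalized(x) for x in product([0, 1], repeat=6))   # sum over all 2^6 assignments
x = (0, 1, 0, 1, 0, 1)
print(unnormalized(x) / Z)    # P(x) = (1/Z) * product of maximal-clique potentials
```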

MLP

Input: $x_1, x_2$; output: $y_1$; activation: $\varphi(x)$

(Figure: MLP architecture.)

maxout

A maxout layer replaces the activation layer (e.g., ReLU, sigmoid).

(Figure: maxout.)

maxpooling

(Figure: max pooling.)

RBF network

(Figure: RBF network.)

The hidden (middle) layer is what distinguishes an RBF network from a BP network.

  • BP: affine (linear)

  • RBF: Gaussian RBF units (nonlinear):

$$\exp\left(-\gamma \lVert x - c_i \rVert^2\right)$$

An RBF network has only one hidden layer.

(Figure: response curve of a Gaussian RBF unit.)

The response is maximal near the center and becomes weaker the farther the input is from it.
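
A minimal sketch of a Gaussian RBF hidden layer followed by a linear output layer, assuming NumPy; the centers, $\gamma$, and output weights are illustrative:

```python
import numpy as np

def rbf_layer(x, centers, gamma):
    # One activation per center: exp(-gamma * ||x - c_i||^2),
    # maximal when x is at the center and decaying with distance.
    return np.exp(-gamma * np.sum((centers - x) ** 2, axis=1))

centers = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
w = np.array([0.5, -1.0, 2.0])             # output-layer (linear) weights

x = np.array([0.9, 1.1])
hidden = rbf_layer(x, centers, gamma=1.0)
output = w @ hidden                         # network output: linear combination of RBF units
print(hidden, output)
```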

autoencoder

(Figure: autoencoder.)

https://www.cs.toronto.edu/~lczhang/360/lec/w05/autoencoder.html

AlexNet

Krizhevsky, Sutskever, and Hinton, "ImageNet Classification with Deep Convolutional Neural Networks", 2012

Prevent overfitting:

  1. Data Augmentation

    Generate new data from the original data by simple transformations, e.g., rotation.

  2. Dropout
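
A minimal sketch of (inverted) dropout, corresponding to item 2 above, assuming NumPy; the keep probability and activation shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, keep_prob=0.5, training=True):
    if not training:
        return activations                        # at test time, use all units unchanged
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob         # rescale so the expected activation is preserved

h = rng.normal(size=(4, 8))                       # hidden-layer activations for a batch of 4
print(dropout(h))
```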

Regularization

As usual, we have the following optimization problem:

$$x \in \mathbb{R}^n,\ \theta \in \mathbb{R}^m, \quad L(\theta) = \frac{1}{2n}\left[\sum_{i=1}^{n}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2\right], \quad \min_\theta L(\theta)$$

If some of the model's parameters become very large, the model tends to overfit.

So we add a weight-decay term (with coefficient $\lambda$) to keep the parameter values from growing too large.

$$x \in \mathbb{R}^n,\ \theta \in \mathbb{R}^m, \quad L(\theta) = \frac{1}{2n}\left[\sum_{i=1}^{n}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{i=1}^{m}\theta_i^2\right], \quad \min_\theta L(\theta)$$
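
A minimal sketch of this regularized loss and its gradient for a linear model, assuming NumPy and illustrative data; the learning rate and $\lambda$ are arbitrary choices:

```python
import numpy as np

def loss(theta, X, y, lam):
    n = len(y)
    residual = X @ theta - y
    return (np.sum(residual ** 2) + lam * np.sum(theta ** 2)) / (2 * n)

def gradient(theta, X, y, lam):
    # Gradient of the regularized loss: (X^T (X theta - y) + lambda * theta) / n
    n = len(y)
    return (X.T @ (X @ theta - y) + lam * theta) / n

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=100)

theta = np.zeros(5)
for _ in range(2000):
    theta -= 0.1 * gradient(theta, X, y, lam=1.0)
print(theta)     # shrunk toward zero compared with the unregularized solution
```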

Decision tree

Information entropy

$$\mathrm{Ent}(D) = -\sum_{k=1}^{|\mathcal{Y}|}p_k\log_2 p_k$$

Information gain

$$\max\ \mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V}\frac{|D^v|}{|D|}\mathrm{Ent}(D^v)$$
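
A minimal sketch of $\mathrm{Ent}(D)$ and $\mathrm{Gain}(D, a)$ on a tiny illustrative dataset (the attribute values and labels are made up):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(values, labels):
    total = entropy(labels)
    n = len(labels)
    for v in set(values):
        subset = [l for a, l in zip(values, labels) if a == v]
        total -= len(subset) / n * entropy(subset)   # subtract |D^v|/|D| * Ent(D^v)
    return total

values = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]   # attribute a
labels = ["no",    "no",    "yes",  "no",   "yes",      "yes"]        # class labels
print(entropy(labels), information_gain(values, labels))
```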

ID3 algorithm

$$a_* = \underset{a \in A}{\arg\max}\ \mathrm{Gain}(D, a)$$

Information gain is biased toward attributes with many possible values.

C4.5 algorithm, gain ratio

$$\mathrm{IV}(a) = -\sum_{v=1}^{V}\frac{|D^v|}{|D|}\log_2\frac{|D^v|}{|D|}, \quad \mathrm{GainRatio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}$$

Gain ratio is biased toward attributes with few possible values. In practice, C4.5 first keeps the attributes whose information gain is above average, and then chooses the one with the largest gain ratio among them.

CART

$$\mathrm{Gini}(D) = \sum_{k=1}^{|\mathcal{Y}|}\sum_{k' \neq k}p_k\,p_{k'} = 1 - \sum_{k=1}^{|\mathcal{Y}|}p_k^2$$
$$\mathrm{GiniIndex}(D, a) = \sum_{v=1}^{V}\frac{|D^v|}{|D|}\mathrm{Gini}(D^v)$$
$$a_* = \underset{a \in A}{\arg\min}\ \mathrm{GiniIndex}(D, a)$$
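
A minimal sketch of $\mathrm{Gini}(D)$ and $\mathrm{GiniIndex}(D, a)$, reusing the same illustrative data as the entropy sketch above:

```python
import numpy as np
from collections import Counter

def gini(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_index(values, labels):
    n = len(labels)
    total = 0.0
    for v in set(values):
        subset = [l for a, l in zip(values, labels) if a == v]
        total += len(subset) / n * gini(subset)      # |D^v|/|D| * Gini(D^v)
    return total

values = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
labels = ["no",    "no",    "yes",  "no",   "yes",      "yes"]
print(gini(labels), gini_index(values, labels))   # CART picks the attribute with the smallest Gini index
```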

Requirement

  1. It needs labeled data.

Bagging and Random Forest

Bagging: draw T bootstrap samples from the dataset, each with m examples (sampled with replacement), and train T base models independently, one per sample. To predict on test data, feed it to all T models and combine their outputs, typically by majority voting (classification) or averaging (regression).

Random forest: the same as bagging, except that it uses decision trees as the base models and additionally samples a random subset of attributes at each split.
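
A minimal sketch of bagging with decision trees, assuming NumPy and scikit-learn's `DecisionTreeClassifier`; the data, T, and the majority-vote rule are illustrative choices:

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

T, m = 10, len(y)
models = []
for _ in range(T):
    idx = rng.choice(m, size=m, replace=True)          # bootstrap sample of m examples
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def predict(x):
    votes = [int(model.predict(x.reshape(1, -1))[0]) for model in models]
    return Counter(votes).most_common(1)[0][0]         # majority vote over the T models

print(predict(np.array([1.0, 1.0, 0.0, 0.0])))
```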

Knowledge Graph

In knowledge representation and reasoning, a knowledge graph is a knowledge base that uses a graph-structured data model or topology to integrate data. Knowledge graphs are often used to store interlinked descriptions of entities - objects, events, situations or abstract concepts - while also encoding the semantics underlying the terminology used.

Since the development of the Semantic Web, knowledge graphs have often been associated with linked open data projects, focusing on the connections between concepts and entities. They are also prominently associated with and used by search engines such as Google, Bing, and Yahoo; knowledge engines and question-answering services such as WolframAlpha, Apple's Siri, and Amazon Alexa; and social networks such as LinkedIn and Facebook.

  1. Singhal, Amit (May 16, 2012). "Introducing the Knowledge Graph: things, not strings". Official Google Blog. Retrieved 21 March 2017.
  2. [Jens Lehmann et al., 2015] DBpedia: A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia.
  3. [Fabian, M. S. et al. 2007] Yago: A core of semantic knowledge unifying wordnet and wikipedia
  4. [Roberto Navigli et al., 2012] BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network
  5. [Banko et al. 2007] "Open information extraction from the web." IJCAI. Vol. 7. 2007.
  6. [Newell, Allen et al. 1976] “Computer Science as Empirical Inquiry: Symbols and Search”, Communications of the ACM, 19 (3)

Appendices

Model averaging

Model averaging keeps the weights spread evenly across the neural nodes and prevents a few "heavy" nodes from dominating.

We want to get a model like this:

(Figure: Averaging.)

instead of:

(Figure: Averaging2.)

Bayes theorem

$$P(AB) = P(A)\,P(B \mid A), \qquad P(A \mid B)\,P(B) = P(A)\,P(B \mid A)$$
$$\Rightarrow\ P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(B)}$$
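
A minimal numeric illustration of the theorem, with assumed probabilities (a toy diagnostic-test example, not from the original notes):

```python
# P(A) is a prior, P(B|A) a likelihood, and P(B) comes from the law of total probability.
p_a = 0.01                 # P(A): prior, e.g. prevalence of a condition (assumed)
p_b_given_a = 0.9          # P(B|A): likelihood, e.g. test sensitivity (assumed)
p_b_given_not_a = 0.05     # P(B|not A): false positive rate (assumed)

p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)    # P(B) by total probability
p_a_given_b = p_a * p_b_given_a / p_b                    # P(A|B) = P(A) P(B|A) / P(B)
print(p_a_given_b)         # about 0.154
```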