Content
NN
First, we use a model that can have one layer or multiple layers.
Loss function
Derivation
When the model is only linear regression, we get $\hat{y} = w^{\top}x + b$.
It follows that the squared-error loss of linear regression is convex.
We assume the difference (error) $\epsilon^{(i)} = y^{(i)} - \hat{y}^{(i)}$ obeys a normal distribution with mean 0.
It implies that
$$p\big(\epsilon^{(i)}\big) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\big(\epsilon^{(i)}\big)^{2}}{2\sigma^{2}}\right),
\qquad\text{i.e.}\qquad
p\big(y^{(i)} \mid x^{(i)}; w\big) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\big(y^{(i)} - w^{\top}x^{(i)} - b\big)^{2}}{2\sigma^{2}}\right).$$
It may not be obvious at first how the former formula converts into the latter one.
Likelihood function
The larger the likelihood $L(w)$ is, the closer the predictions $\hat{y}^{(i)}$ are to the targets $y^{(i)}$.
Then we can get the following formula from the conditional density above:
$$L(w) = \prod_{i=1}^{n} p\big(y^{(i)} \mid x^{(i)}; w\big).$$
Maximizing $L(w)$ is equivalent to maximizing $\log L(w)$:
$$\log L(w) = n\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}\big(y^{(i)} - w^{\top}x^{(i)} - b\big)^{2}.$$
Conclusion
- The smaller the difference between $y^{(i)}$ and $\hat{y}^{(i)}$ is, the larger the likelihood becomes; in other words, the model fits well near the training points $(x^{(i)}, y^{(i)})$.
- When the difference between $y^{(i)}$ and $\hat{y}^{(i)}$ is small enough, the smaller the variance $\sigma^{2}$ is, the larger the likelihood also becomes. Note that in this case the model fits well overall, not just near the training points.
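A minimal numpy sketch of this derivation (all data below is synthetic): for any fixed $\sigma$, gradient descent on the mean squared error is the same as maximizing the Gaussian log-likelihood above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # 100 samples, 3 features (synthetic)
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)   # targets with Gaussian noise

w = np.zeros(3)
lr = 0.1
for _ in range(500):
    err = X @ w - y                 # prediction error (y_hat - y)
    grad = X.T @ err / len(y)       # gradient of the MSE loss
    w -= lr * grad

mse = np.mean((X @ w - y) ** 2)
# Up to constants, log L(w) = n*log(1/(sqrt(2*pi)*sigma)) - n*mse/(2*sigma^2),
# so minimizing the MSE maximizes the likelihood for any fixed sigma.
print(w, mse)
```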
Cross-Entropy
Let us consider binary classification.
Sigmoid
Let us assume that
$$P(y = 1 \mid x; w) = h_{w}(x) = \frac{1}{1 + e^{-w^{\top}x}}, \qquad P(y = 0 \mid x; w) = 1 - h_{w}(x).$$
Then
$$P(y \mid x; w) = h_{w}(x)^{y}\,\big(1 - h_{w}(x)\big)^{1 - y}.$$
Likelihood function
Log likelihood function
iterative formula
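The iterative update can be written as gradient ascent on the log-likelihood. A hedged numpy sketch with synthetic data (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # synthetic binary labels

w = np.zeros(2)
lr = 0.1
for _ in range(1000):
    p = sigmoid(X @ w)            # P(y = 1 | x; w)
    grad = X.T @ (y - p)          # gradient of the log-likelihood
    w += lr * grad / len(y)       # gradient ascent (the iterative formula)

print(w)
```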
Softmax
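Softmax generalizes the sigmoid to more than two classes. A small, numerically stable sketch:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

print(softmax(np.array([1.0, 2.0, 3.0])))       # probabilities summing to 1
```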
Hidden Markov Model (HMM)
$y_{i}$: state variable (the hidden state at step $i$)
$\mathcal{Y} = \{s_{1}, s_{2}, \dots, s_{N}\}$: state space
$x_{i}$: observation variable at step $i$
$\mathcal{X} = \{o_{1}, o_{2}, \dots, o_{M}\}$: observation space
joint probability distribution: $P(x_{1}, y_{1}, \dots, x_{n}, y_{n}) = P(y_{1})\,P(x_{1} \mid y_{1})\prod_{i=2}^{n} P(y_{i} \mid y_{i-1})\,P(x_{i} \mid y_{i})$
state-transition matrix: $A = [a_{ij}]_{N \times N}$, where $a_{ij} = P(y_{t+1} = s_{j} \mid y_{t} = s_{i})$
observation probability matrix: $B = [b_{ij}]_{N \times M}$, where $b_{ij} = P(x_{t} = o_{j} \mid y_{t} = s_{i})$
Initial state probability: $\pi = (\pi_{1}, \dots, \pi_{N})$, where $\pi_{i} = P(y_{1} = s_{i})$
HMM: $\lambda = (A, B, \pi)$
How to generate an observation sequence $x_{1}, x_{2}, \dots, x_{n}$:
- Set $i = 1$ and sample the first state $y_{1}$ according to the initial state probability $\pi$.
- Sample $x_{i}$ from $y_{i}$ according to $B$, the observation probability matrix.
- Sample $y_{i+1}$ from $y_{i}$ according to $A$, the state-transition matrix.
- If $i < n$, set $i = i + 1$ and repeat from the second step; otherwise stop.
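A minimal numpy sketch of this generation procedure; the matrices $A$, $B$ and the vector $\pi$ below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[0.7, 0.3],     # state-transition matrix: A[i, j] = P(y_{t+1}=j | y_t=i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],     # observation probability matrix: B[i, k] = P(x_t=k | y_t=i)
              [0.2, 0.8]])
pi = np.array([0.6, 0.4])     # initial state probability

def sample_hmm(n):
    xs, ys = [], []
    y = rng.choice(len(pi), p=pi)            # draw y_1 from pi
    for _ in range(n):
        x = rng.choice(B.shape[1], p=B[y])   # draw x_i from y_i using B
        xs.append(x)
        ys.append(y)
        y = rng.choice(len(A), p=A[y])       # draw y_{i+1} from y_i using A
    return xs, ys

print(sample_hmm(10))
```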
Markov Random Field (MRF)
Clique ($C$): a subset of nodes in which every pair of nodes is connected by an edge.
Maximal clique: a clique to which no further node can be added while it remains a clique.
For convenience, we express the joint probability with potential functions over the set of maximal cliques $\mathcal{C}^{*}$:
$$P(x) = \frac{1}{Z}\prod_{Q \in \mathcal{C}^{*}} \psi_{Q}(x_{Q}).$$
Following the figure above:
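As an illustration (with an invented example graph standing in for the figure), networkx can enumerate the maximal cliques over which the joint distribution factorizes:

```python
import networkx as nx

# Invented undirected graph; the edges define the MRF structure.
G = nx.Graph([(1, 2), (1, 3), (2, 3), (2, 4), (4, 5)])

# Each maximal clique Q gets a potential psi_Q; the joint distribution is
# P(x) = (1/Z) * prod_Q psi_Q(x_Q).
for clique in nx.find_cliques(G):
    print(sorted(clique))    # e.g. [1, 2, 3], [2, 4], [4, 5]
```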
MLP
maxout
A maxout layer replaces the activation layer (e.g. ReLU, Sigmoid).
maxpooling
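A numpy sketch of a maxout unit (the maximum over $k$ affine pieces) and of 2x2 max pooling; the shapes and the number of pieces are arbitrary choices:

```python
import numpy as np

def maxout(x, W, b):
    """Maxout unit: max over k affine pieces.
    W has shape (k, d_out, d_in), b has shape (k, d_out)."""
    return np.max(W @ x + b, axis=0)

def maxpool2x2(img):
    """2x2 max pooling with stride 2 on an (H, W) array (H, W even)."""
    H, W = img.shape
    return img.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.ones(4)
W = np.random.randn(3, 2, 4)   # k=3 pieces, 2 outputs, 4 inputs
b = np.zeros((3, 2))
print(maxout(x, W, b))
print(maxpool2x2(np.arange(16.0).reshape(4, 4)))
```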
RBF network
The hidden (middle) layer is the main difference between an RBF network and a BP network.
- BP: affine (linear) units
- RBF: Gaussian RBF units (nonlinear): $\rho(x, c_{i}) = e^{-\beta_{i}\,\|x - c_{i}\|^{2}}$
The RBF network has only one hidden layer.
The response is maximal near the center $c_{i}$ and becomes weaker the farther the input is from it.
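A small numpy sketch of a Gaussian RBF hidden unit and a forward pass through a tiny RBF network; the centers, widths, and output weights are invented:

```python
import numpy as np

def gaussian_rbf(x, center, beta):
    """Response is maximal when x is at the center and decays with distance."""
    return np.exp(-beta * np.sum((x - center) ** 2))

# Tiny RBF network forward pass: one hidden layer of RBF units, linear output.
centers = np.array([[0.0, 0.0], [1.0, 1.0]])   # invented centers
betas = np.array([1.0, 2.0])                   # invented widths
w_out = np.array([0.5, -0.3])                  # invented output-layer weights

def rbf_forward(x):
    h = np.array([gaussian_rbf(x, c, b) for c, b in zip(centers, betas)])
    return w_out @ h

print(rbf_forward(np.array([0.2, 0.1])))
```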
autoencoder
https://www.cs.toronto.edu/~lczhang/360/lec/w05/autoencoder.html
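A minimal PyTorch sketch of an autoencoder: the encoder compresses the input, the decoder reconstructs it, and training minimizes the reconstruction error. The layer sizes and the random batch are placeholders:

```python
import torch
import torch.nn as nn

# Minimal autoencoder: encoder compresses 784 -> 32, decoder reconstructs 32 -> 784.
model = nn.Sequential(
    nn.Linear(784, 32), nn.ReLU(),     # encoder
    nn.Linear(32, 784), nn.Sigmoid()   # decoder
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(64, 784)                # a fake batch standing in for real images
for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x), x)        # reconstruction loss: output vs. input
    loss.backward()
    opt.step()
print(loss.item())
```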
AlexNet
Krizhevsky, A., Sutskever, I., Hinton, G. E. "ImageNet Classification with Deep Convolutional Neural Networks", NIPS 2012.
Techniques used to prevent overfitting:
- Data Augmentation: generate new data from the original data by simple transformations, e.g. rotation.
- Dropout: randomly zero a fraction of the activations during training (a sketch follows below).
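A hedged numpy sketch of (inverted) dropout: activations are randomly zeroed during training and rescaled so the expected value is unchanged; nothing is dropped at test time. The drop probability here is an arbitrary choice:

```python
import numpy as np

def dropout(h, p=0.5, training=True):
    """Inverted dropout: randomly zero activations and rescale so the
    expected value stays the same; do nothing at test time."""
    if not training:
        return h
    mask = (np.random.rand(*h.shape) > p).astype(h.dtype)
    return h * mask / (1.0 - p)

h = np.ones((2, 4))
print(dropout(h, p=0.5))
```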
Regularization
As usual, we have the following optimization problem:
$$\min_{w} \sum_{i} L\big(y^{(i)}, f(x^{(i)}; w)\big).$$
If some of the model's parameters become very large, the model tends to overfit.
So we append a weight-decay term $\lambda\,\|w\|^{2}$ to keep the parameter values from growing too large.
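A small numpy sketch of the idea, using ridge-style L2 weight decay added to an MSE loss (the data and $\lambda$ are invented):

```python
import numpy as np

def regularized_loss_and_grad(w, X, y, lam):
    """MSE loss plus an L2 weight-decay term lam * ||w||^2 (ridge regression)."""
    err = X @ w - y
    loss = np.mean(err ** 2) + lam * np.sum(w ** 2)
    grad = 2 * X.T @ err / len(y) + 2 * lam * w
    return loss, grad

# Larger lam pushes the parameters toward zero and reduces overfitting.
X = np.random.randn(50, 3)
y = np.random.randn(50)
print(regularized_loss_and_grad(np.ones(3), X, y, lam=0.1)[0])
```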
Decision tree
Information entropy: $\mathrm{Ent}(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_{k}\log_{2} p_{k}$
Information gain: $\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V}\frac{|D^{v}|}{|D|}\,\mathrm{Ent}(D^{v})$
ID3 algorithm
It is biased toward attributes that have many possible values.
C4.5 algorithm, gain ratio
It is biased toward attributes that have few possible values. In practice, C4.5 first keeps the attributes whose information gain is above average, and then chooses among them the one with the maximum gain ratio.
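A numpy sketch of these three quantities (entropy, information gain, and gain ratio) on an invented toy attribute:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Information entropy: Ent(D) = -sum_k p_k * log2(p_k)."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute):
    """Gain(D, a) = Ent(D) - sum_v |D_v|/|D| * Ent(D_v)."""
    labels, attribute = np.asarray(labels), np.asarray(attribute)
    total = entropy(labels)
    for v in np.unique(attribute):
        mask = attribute == v
        total -= mask.mean() * entropy(labels[mask])
    return total

def gain_ratio(labels, attribute):
    """C4.5's gain ratio: Gain(D, a) / IV(a), where IV(a) is the entropy of the split."""
    return information_gain(labels, attribute) / entropy(attribute)

y = ["yes", "yes", "no", "no", "yes"]
a = ["sunny", "rain", "rain", "sunny", "sunny"]
print(information_gain(y, a), gain_ratio(y, a))
```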
CART
Requirement
- It needs labeled data.
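CART commonly splits on the Gini index instead of entropy; a small sketch (the toy labels are invented):

```python
import numpy as np
from collections import Counter

def gini(labels):
    """Gini index: 1 - sum_k p_k^2; smaller means purer."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(["yes", "yes", "no"]))   # 1 - (4/9 + 1/9) = 0.444...
```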
Bagging and Random Forest
Bagging: we draw bootstrap subsets of $m$ examples each from the dataset and train a base model on each subset independently. When test data need to be predicted, every base model makes a prediction and the results are combined, typically by majority voting for classification or by averaging for regression.
Random forest: the same as bagging, except that the base learners are decision trees and each split considers only a random subset of the features.
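A hedged sketch using scikit-learn (assuming it is available): the default base learner of BaggingClassifier is a decision tree, and RandomForestClassifier adds the random feature subsets at each split. The toy data is generated on the fly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)   # toy data

# Bagging: bootstrap samples + independently trained base models + voting.
bag = BaggingClassifier(n_estimators=20, random_state=0)
# Random forest: bagging with decision trees plus random feature subsets at each split.
forest = RandomForestClassifier(n_estimators=20, random_state=0)

for model in (bag, forest):
    print(model.fit(X, y).score(X, y))   # training accuracy, just for illustration
```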
Knowledge Graph
In knowledge representation and reasoning, a knowledge graph is a knowledge base that uses a graph-structured data model or topology to integrate data. Knowledge graphs are often used to store interlinked descriptions of entities - objects, events, situations or abstract concepts - while also encoding the semantics underlying the terminology used.
Since the development of the Semantic Web, knowledge graphs have often been associated with linked open data projects, focusing on the connections between concepts and entities. They are also prominently associated with and used by search engines such as Google, Bing, and Yahoo; knowledge engines and question-answering services such as WolframAlpha, Apple's Siri, and Amazon Alexa; and social networks such as LinkedIn and Facebook.
- [Singhal, Amit, 2012] "Introducing the Knowledge Graph: things, not strings". Official Google Blog, May 16, 2012. Retrieved 21 March 2017.
- [Jens Lehmann et al., 2015] DBpedia: A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia.
- [Fabian, M. S. et al., 2007] Yago: A Core of Semantic Knowledge Unifying WordNet and Wikipedia.
- [Roberto Navigli et al., 2012] BabelNet: The Automatic Construction, Evaluation and Application of a Wide-Coverage Multilingual Semantic Network.
- [Banko et al., 2007] Open Information Extraction from the Web. IJCAI, Vol. 7.
- [Newell, Allen et al., 1976] "Computer Science as Empirical Inquiry: Symbols and Search", Communications of the ACM, 19 (3).
Appendices
Model averaging
Each neural node carries an averaged weight, which prevents heavy (dominant) nodes from arising.
We want the resulting model to spread its weight evenly across nodes instead of concentrating it in a few heavy ones.
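A small numpy sketch of prediction averaging over several models (the "models" here are just random linear predictors standing in for trained networks):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))

# Three "models": random linear predictors standing in for independently trained networks.
weights = [rng.normal(size=3) for _ in range(3)]
predictions = np.stack([x @ w for w in weights])

# Model averaging: the ensemble output is the mean over models,
# so no single model (or node) dominates the result.
print(predictions.mean(axis=0))
```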