supervised learning - input X and output Y are both known and analyzed
unsupervised learning - only input X is known (eg. clustering)
reinforcement learning - providing 'rewards' on correct output
evolutionary learning - many generations and adaptations of algorithms
perceptron - a simulated neuron with weighted inputs, an output, and a threshold
\(Y = WX + b\)
where \(W\) is the input weight, \(X\) is the input, \(Y\) is the output, and \(b\) is bias.
The inputs are weighted, summed, and shifted. This value is then mapped through a function, often a step function.
Error
\(E_i = t_i - y_i\)
Mean Square Error \(MSE = \frac{1}{N} \sum E_i^2\)
The goal of optimizing a perceptron is to minimize the mean square error. This is done by developing an algorithm which adjusts input weights based on output error.
\(w_{ij} = w_{ij} + \eta (t_{ij} - y_{ij}) x_1\)
The function \(W = X^+Y\) finds the weights that result in the smallest MSE for a given \(X\) and \(Y\)
where \(X^+ = (X^tX)^{-1}X^t\)
The delta rule is used to determine how to adjust weights of a perceptron based on output error. The idea is to find the change in error with respect to weights, then move the weight in the opposite direction. This is called gradient descent.
First we define the error as an absolute quantity.
Error
\(E(w) = \frac{1}{2}\left[Y - T \right]^2\)
where \(Y\) is the output of the activation function, and \(T\) is the target
Note that the activation function must be differentiable.
Change in a specific weight for a neuron
\(\Delta w_{x} = - \alpha (y - t) f'(\sum w_i x_i + b)\)
\(E^2 = |Ax - y|^2\) is to be minimized
\(x = A^+y\), where \(A^+ = [A^tA]^{-1}A^t\) is called \(A\) pseudo-inverse
Doing last example using pseudo-inverse:
\[\left[ \begin{matrix} 1 & 1 \\ 2 & 1 \\ -1 & 1 \end{matrix} \right] \left[ \begin{matrix} w \\ b \\ \end{matrix} \right] = \left[ \begin{matrix} 2 \\ 1 \\ 0 \end{matrix} \right] \]
Let \[A = \left[ \begin{matrix} 1 & 1 \\ 2 & 1 \\ -1 & 1 \end{matrix} \right]\], \[w = \left[ \begin{matrix} w \\ b \end{matrix}\right]\], \[y = \left[ \begin{matrix} 2 \\ 1 \\ 0 \end{matrix} \right]\]
\(w = A^+y\) where \(A^+ = (A^tA)^{-1}A^t\)
\[w = \left[\begin{matrix} 0.4286 \\ 0.7143 \end{matrix} \right]\]
ipe
Linear Nonseparable
When a set of data cannot be accurately clustered with linear separation, it is called linearly nonseparable
Sometimes, the data can be transformed with an extra dimension so that it becomes linearly separable.
\(y_m(k) = \sum_{j=0}^{N_{m-1}} w_m(k,j)z_{m-1}(j)\)
\(z_m(k) = f(y_m(k))\)
Error function: \(E_T = \frac{1}{2L}\sum_{l=0}^{L-1} \sum_{k=0}^{N_m-1}(d_l(k) - o_l(k))^2\)
Begin by training output layer, since we know the desired output for only this layer.
Assign random values to all weights
Run X through the system to get Y at the output layer
Use \(\delta_f(i) = (d_f(i) - o_f(i) * f_f(y(i))\)
Back propogate error: \(\delta_{f-1}(i) = f_i'(y_n(i))\sum_{k=1}^{N_{m+1}}\delta_{m+1}(k)w_{m+1}(k,i)\)
Change weights: \(\Deltaw_m(i,j) = \alpha\delta_m(i) z_{m-1}(i)\)
Repeat for all output pairs.
Repeat entire procedure until weights settle.
Taylor Expansion
$$ f(x) = f(x0) + ∇ f(x)
Jacobian Matrix
\[\mathbf J = \frac{d\mathbf f}{d\mathbf x} = \begin{bmatrix} \dfrac{\partial \mathbf{f}}{\partial x_1} & \cdots & \dfrac{\partial \mathbf{f}}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n}\\ \vdots & \ddots & \vdots\\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{bmatrix}\]
where \(f(x) = \langle f_1(x_1,\hdots,x_n),\hdots,f_m(x_1,\hdots,x_n) \rangle\)
Hessian Matrix
\[\bold H = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1\,\partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1\,\partial x_n} \\[2.2ex] \dfrac{\partial^2 f}{\partial x_2\,\partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2\,\partial x_n} \\[2.2ex] \vdots & \vdots & \ddots & \vdots \\[2.2ex] \dfrac{\partial^2 f}{\partial x_n\,\partial x_1} & \dfrac{\partial^2 f}{\partial x_n\,\partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix}\]
Newton's Method
\(P_k = -(\nabla^2f(x_k))^{-1}\nabla f(x_k)\)
Singular Value Decomposition
Let \(A\) be a mxn matrix.
Then \(A = USV^t\)
where \(U\) is an mxm orthogonal matrix, \(V\) is an nxn orthogonal matrix, and
\[S = \left[ \begin{matrix} s_1 & \hdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \hdots & s_n \end{matrix} \right]^n\]
Let \(f(x) = ax_1^2 + bx_2^2 + cx_3^2\)
Determine Jordanian. (same as gradient when \(f\) is scalar
\(\nabla f(x) = \langle 2ax_1, 2bx_2, 2cx_3 \rangle\)
Determine Hessian
\[H = \left[ \begin{matrix} 2a & 0 & 0 \\ 0 & 2b & 0 \\ 0 & 0 & 2c \end{matrix} \right]\]
Determine residual, \(r(x) = [ ax_1^2 bx_2^2 cx_3^2 ]\)
\((J^tJ)dx = -J^tr\)
\[-J^tr = \begin{matrix} ax_1 & 0 & 0 \\ 0 & bx_2 & 0 \\ 0 & cx_2 & 0 \end{matrix}\]
\(\Delta x = -[J^tJ + rI]^{-1} J^t r\)
\(x(n+1) = x(x) + \Delta x\)
Let \(f(x) = 100(x_2 - x_1^2)^2 + (1 - x_1)^2\), Rosenbrock's function
\(r = \left[ \begin{matrix} 10(x_2 - x_1^2) \\ 1 - x_2 \end{matrix} \right]\)
\(J = \left[ \begin{matrix} -20x_1 & 10 \\ -1 & 0 \end{matrix} \right]\)
then the answer can be calculated with
\(\Delta x = -[J^tJ + rI]^{-1}J^tr\)
Faster descent than gradient descent.
Let \(f(x) = 0.5x_1^2 + 0.2x_2^2 + 0.6x_3^2\)
$α = .931
\[x(1) = \left( \begin{matrix} -2 \\ 2 \\ 0 \end{matrix} \right) + .931 \left( \begin{matrix} 2 \\ -.8 \\ 2.4 \right) = \left( \begin{matrix} -.138 \\ 1.255 \\ 2.35 \end{matrix} \right)\]
\(\Beta = 0.0337\)
$p(1) = ( \begin{matrix} .138brute force - evaluate every possible solution greedy search - 'cheapest link' - choose the cheapest path from the current point hill climbing - randomly pick a solution/path. If the error decreases, accept this change. Otherwise, discard it.
Algorithm
Let \(E_n\) be the current error, called energy, T be be a changing scaling value, called temperature.
If \(E_n - E_{n-1} < 0\), accept.
Else if \[e^{\frac{E_{n-1} - E_n}{T}}\] is bigger than a random value between x and 1, accept. (x is usally 0.8)
\(T \leftarrow T\), for \(0 < c \leq 1\)
Else, reject change.
The simulated annealing algorithm is a type of hill climbing. It is different from steepest descent in that small positive changes in error are allowed in the hopes that local minima can be escaped.
A genetic algorithm is an iterative process which modifies solution parameters in a random way in each iteration.
chromosome - parameters or weights to be changed organism - a set of parameters child - an organism derived from another organism
Most genetic algorithms select the most effective organisms in a single iteration then randomise and combine their chromosomes (crossing over).
Let \(C = \{p_1, ..., p_n\}\)
proportional - \(C_{new} = \beta C_{p1} + (1 - \beta C_{p2})\) for \(0 \leq \beta \leq 1\)
difference - \(C_{new} = C_{p1} + \beta (C_{p1} - C_{p2})\) for \(0 \leq \beta \leq 1\)
uniform crossover - one of the parent chromosomes is randomly selected and transferred to the child
Mutations increase variability between generations and decrease the chance that organisms will converge on a local extrema, rather than the global one. Typically, a percentage called the mutation rate specifies the fraction of parameters to mutate either through replacement by random number or bit flipping.
unsupervised learning
Training data is only the inputs. There are no error/cost functions to determine the fitness of parameters.
The most common unsupervised learning is clustering, which is the process of separating inputs into groups (or clusters) of similar input.
Several implementations of clustering are given below.
Determine mean \(m_x = \frac{1}{K} \sum_{i=1}^K x_i\)
\(C_x = \frac{1}{K-1} \sum_{k=1}^k (x_k - m_x)(x_k - m_x)^t\) where \(K\) is num pixels \(y = A(x - m_x)\) where \(A\) is KLT transform
Define \(Y'\) by setting some parts of \(Y\) to 0
It is useful to know if a dataset can be easily separated by a line so that we can choose the correct method for classsification. This is called linear separability.
Linear Separability
\(\text{separability} = S_B/S_W\),
where \(S_W = \sum_\text{class} p_c (x_j - \mu)(x_j - \mu)^T\) is within-class scatter
\(S_B = \sum_\text{class} (\mu_c - \mu)(\mu_c - \mu)^T\) is between-class scatter
Rayleigh Quotient rule
LDA is used as a preprocessing algorithm to simplify labeled data before it is fed into a machine learning algorithm. It looks at which parameters have the largest effect on the class for a datapoint, then combines parameters and reduces dimensionality.
PSO is an iterative algorithm which models the movements of birds or fish in a swarm. Namely, the movements of individuals within the population are influenced by both what the individual perceives as the best move, and what the swarm perceives as a best move. This components are then weighted and summed together to determine how the individual should move.
\(v_\text{new} = v_\text{old} + \Gamma_l \times r_l \times (p_\text{local best} - p_\text{old}) + \Gamma_g \times r_g \times (p_\text{global best} - p_\text{old})\)
\(p_\text{new} = p_\text{old} + v_\text{new}\)
where
\(v\) - particle velocity \(p\) - particle position (parameters/weights) \(r_l, r_g\) - uniform random numbers \(\Gamma_l, \Gamma_g\) - learning factor \(p_\text{global best}\) - best position ever found by any particle \(p_\text{local best}\) - best position ever found by this particle
\(f(x_i,z_i) = w_1d_{max}(z_i,x_i) + w_2(z_{max} - d_{min}(x_i))\)
zmax - max pixel val
\(d(a,b) = \sqrt{\sum_{k=1}^N (a_k - b_k)^2}\)