Types of Learning

supervised learning - input X and output Y are both known and analyzed

unsupervised learning - only input X is known (e.g. clustering)

reinforcement learning - providing 'rewards' for correct outputs

evolutionary learning - many generations and adaptations of algorithms


Perceptron

perceptron - a simulated neuron with weighted inputs, an output, and a threshold

\(Y = WX + b\)

where \(W\) is the input weight, \(X\) is the input, \(Y\) is the output, and \(b\) is bias.

The inputs are weighted, summed, and shifted. This value is then mapped through a function, often a step function.
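As a minimal sketch of this forward pass (numpy assumed; the weights and inputs below are hypothetical):

```python
import numpy as np

def perceptron(x, w, b):
    """Weight the inputs, sum, shift by the bias, then map through a step function."""
    net = np.dot(w, x) + b           # Y = WX + b
    return 1 if net >= 0 else 0      # step activation (threshold at 0)

# Hypothetical weights/inputs: 0.5*1 + (-0.4)*2 + 0.1 = -0.2 -> 0
print(perceptron(np.array([1.0, 2.0]), np.array([0.5, -0.4]), 0.1))
```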

Minimizing Error

Error

\(E_i = t_i - y_i\)

Mean Square Error \(MSE = \frac{1}{N} \sum_{i=1}^{N} E_i^2\)

The goal of optimizing a perceptron is to minimize the mean square error. This is done by developing an algorithm which adjusts input weights based on output error.

\(w_{ij} \leftarrow w_{ij} + \eta (t_j - y_j) x_i\)

The solution \(W = X^+Y\) gives the weights that result in the smallest MSE for a given \(X\) and \(Y\),

where \(X^+ = (X^tX)^{-1}X^t\)

Delta Rule Learning

The delta rule is used to determine how to adjust weights of a perceptron based on output error. The idea is to find the change in error with respect to weights, then move the weight in the opposite direction. This is called gradient descent.

First we define the error as a non-negative (squared) quantity.

Error

\(E(w) = \frac{1}{2}\left[Y - T \right]^2\)

where \(Y\) is the output of the activation function, and \(T\) is the target

Note that the activation function must be differentiable.

Change in a specific weight for a neuron

\(\Delta w_{i} = - \alpha (y - t) f'\left(\sum_j w_j x_j + b\right) x_i\)
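A sketch of one delta-rule update with a sigmoid activation (the sigmoid choice, learning rate, and example values are assumptions for illustration):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def delta_rule_step(w, b, x, t, alpha=0.1):
    """One gradient-descent step: move the weights against dE/dw."""
    net = np.dot(w, x) + b                  # weighted, summed, shifted input
    y = sigmoid(net)                        # differentiable activation
    fprime = y * (1.0 - y)                  # f'(net) for the sigmoid
    w = w - alpha * (y - t) * fprime * x    # dw_i = -alpha (y - t) f'(net) x_i
    b = b - alpha * (y - t) * fprime        # bias acts as a weight on a constant 1
    return w, b

w, b = delta_rule_step(np.array([0.5, -0.4]), 0.1, np.array([1.0, 2.0]), t=1.0)
print(w, b)
```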

Derivation

Method of Least Squares

\(E^2 = |Ax - y|^2\) is to be minimized

\(x = A^+y\), where \(A^+ = [A^tA]^{-1}A^t\) is called the pseudo-inverse of \(A\)

Properties of Pseudo Inverse

  • the pseudo-inverse always exists
  • \(AA^+A = A\)
  • \(A^+AA^+ = A^+\)
  1. Doing the last example using the pseudo-inverse:

    \[\left[ \begin{matrix} 1 & 1 \\ 2 & 1 \\ -1 & 1 \end{matrix} \right] \left[ \begin{matrix} w \\ b \\ \end{matrix} \right] = \left[ \begin{matrix} 2 \\ 1 \\ 0 \end{matrix} \right] \]

    Let \(A = \left[ \begin{matrix} 1 & 1 \\ 2 & 1 \\ -1 & 1 \end{matrix} \right]\), \(w = \left[ \begin{matrix} w \\ b \end{matrix}\right]\), \(y = \left[ \begin{matrix} 2 \\ 1 \\ 0 \end{matrix} \right]\)

    \(w = A^+y\) where \(A^+ = (A^tA)^{-1}A^t\)

    \[w = \left[\begin{matrix} 0.4286 \\ 0.7143 \end{matrix} \right]\]
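The same numbers can be checked numerically (a sketch using numpy; `np.linalg.pinv` computes \(A^+\) directly):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [-1.0, 1.0]])
y = np.array([2.0, 1.0, 0.0])

A_pinv = np.linalg.inv(A.T @ A) @ A.T   # A+ = (A^t A)^-1 A^t
print(A_pinv @ y)                       # -> [0.4286 0.7143]
print(np.linalg.pinv(A) @ y)            # same result from the built-in pseudo-inverse
```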

Nonseparability


Linearly Nonseparable

When a set of data cannot be accurately separated by a linear boundary, it is called linearly nonseparable.

Sometimes the data can be transformed with an extra dimension so that it becomes linearly separable. For example, the XOR points \((x_1, x_2)\) cannot be separated by a line in the plane, but mapping them to \((x_1, x_2, x_1 x_2)\) makes them separable by a plane.

Multilayer Networks

\(y_m(k) = \sum_{j=0}^{N_{m-1}} w_m(k,j)z_{m-1}(j)\)

\(z_m(k) = f(y_m(k))\)

Error function: \(E_T = \frac{1}{2L}\sum_{l=0}^{L-1} \sum_{k=0}^{N_m-1}(d_l(k) - o_l(k))^2\)

Begin by training the output layer, since it is the only layer for which the desired output is known.

  1. Assign random values to all weights

  2. Run X through the system to get Y at the output layer

  3. Use \(\delta_f(i) = (d_f(i) - o_f(i)) f'(y_f(i))\) to compute the output-layer deltas

  4. Backpropagate the error: \(\delta_m(i) = f'(y_m(i))\sum_{k=1}^{N_{m+1}}\delta_{m+1}(k)w_{m+1}(k,i)\)

  5. Change weights: \(\Delta w_m(i,j) = \alpha\delta_m(i) z_{m-1}(j)\)

  6. Repeat for all input/output pairs.

  7. Repeat entire procedure until weights settle.
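A minimal numpy sketch of steps 1-7 with one hidden layer (the sigmoid activations and the XOR dataset are assumptions for illustration; it should learn XOR for most random initializations):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def train_mlp(X, D, hidden=4, alpha=0.5, epochs=5000):
    """Sequential backprop for one hidden layer, following steps 1-7."""
    rng = np.random.default_rng(0)
    Xb = np.hstack([X, np.ones((len(X), 1))])        # constant input serves as bias
    W1 = rng.normal(size=(hidden, Xb.shape[1]))      # step 1: random weights
    W2 = rng.normal(size=(D.shape[1], hidden + 1))   # +1 for a hidden bias unit
    for _ in range(epochs):
        for x, d in zip(Xb, D):
            z = np.append(sigmoid(W1 @ x), 1.0)      # step 2: run x through the system
            o = sigmoid(W2 @ z)
            d2 = (d - o) * o * (1 - o)               # step 3: output-layer delta
            d1 = z[:-1] * (1 - z[:-1]) * (W2[:, :-1].T @ d2)  # step 4: backpropagate
            W2 += alpha * np.outer(d2, z)            # step 5: change weights
            W1 += alpha * np.outer(d1, x)
    return W1, W2                                    # steps 6-7: loop over pairs/epochs

# XOR - the classic linearly nonseparable problem
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = train_mlp(X, D)
for x, d in zip(X, D):
    z = np.append(sigmoid(W1 @ np.append(x, 1.0)), 1.0)
    print(x, sigmoid(W2 @ z).round(2), d)            # outputs approach the targets
```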

Exam Review

  • sequential training
    • one input comes in, the output is calculated, and the weights are updated immediately
  • batch training
    • all inputs come in, all outputs are calculated, and weights are updated at the end
  • adaline
  • mlp learning algorithm
  • error correction
    • delta rule
    • gradient descent
    • least mean square error
  • exponential smoothing

Optimization

Taylor Expansion

\[f(x) = f(x_0) + \nabla f(x_0)^t (x - x_0) + \frac{1}{2}(x - x_0)^t \mathbf{H}(x_0)(x - x_0) + \cdots\]

Jacobian Matrix

\[\mathbf J = \frac{d\mathbf f}{d\mathbf x} = \begin{bmatrix} \dfrac{\partial \mathbf{f}}{\partial x_1} & \cdots & \dfrac{\partial \mathbf{f}}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n}\\ \vdots & \ddots & \vdots\\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{bmatrix}\]

where \(f(x) = \langle f_1(x_1,\hdots,x_n),\hdots,f_m(x_1,\hdots,x_n) \rangle\)

Hessian Matrix

\[\mathbf H = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1\,\partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1\,\partial x_n} \\[2.2ex] \dfrac{\partial^2 f}{\partial x_2\,\partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2\,\partial x_n} \\[2.2ex] \vdots & \vdots & \ddots & \vdots \\[2.2ex] \dfrac{\partial^2 f}{\partial x_n\,\partial x_1} & \dfrac{\partial^2 f}{\partial x_n\,\partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix}\]

Newton's Method

\(P_k = -(\nabla^2f(x_k))^{-1}\nabla f(x_k)\)
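A sketch of one Newton step on a hypothetical quadratic (for which Newton's method reaches the minimum in a single step):

```python
import numpy as np

def newton_step(grad, hess, x):
    """P_k = -(H(x_k))^{-1} grad f(x_k), solved rather than explicitly inverted."""
    return -np.linalg.solve(hess(x), grad(x))

# Hypothetical quadratic f(x) = 0.5*x1^2 + 0.2*x2^2
grad = lambda x: np.array([x[0], 0.4 * x[1]])
hess = lambda x: np.diag([1.0, 0.4])
x = np.array([3.0, -2.0])
print(x + newton_step(grad, hess, x))   # -> [0. 0.], the minimum, in one step
```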

Levenberg-Marquardt Algorithm

Singular Value Decomposition

Let \(A\) be an \(m \times n\) matrix.

Then \(A = USV^t\)

where \(U\) is an \(m \times m\) orthogonal matrix, \(V\) is an \(n \times n\) orthogonal matrix, and \(S\) is the \(m \times n\) diagonal matrix of singular values

\[S = \left[ \begin{matrix} s_1 & \hdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \hdots & s_n \end{matrix} \right]\]

Derivation

Let \(f(x) = ax_1^2 + bx_2^2 + cx_3^2\)

  1. Determine the Jacobian. (same as the gradient when \(f\) is scalar)

    \(\nabla f(x) = \langle 2ax_1, 2bx_2, 2cx_3 \rangle\)

    Determine Hessian

    \[H = \left[ \begin{matrix} 2a & 0 & 0 \\ 0 & 2b & 0 \\ 0 & 0 & 2c \end{matrix} \right]\]

    Determine the residual, \(r(x) = \left[ \sqrt{a}\,x_1,\ \sqrt{b}\,x_2,\ \sqrt{c}\,x_3 \right]^t\), so that \(f(x) = |r(x)|^2\)

    \((J^tJ)\Delta x = -J^tr\)

    \[-J^tr = -\left[ \begin{matrix} ax_1 \\ bx_2 \\ cx_3 \end{matrix} \right]\]

    \(\Delta x = -[J^tJ + \lambda I]^{-1} J^t r\), where \(\lambda\) is the damping factor

    \(x(n+1) = x(n) + \Delta x\)

  2. Let \(f(x) = 100(x_2 - x_1^2)^2 + (1 - x_1)^2\), Rosenbrock's function

    \(r = \left[ \begin{matrix} 10(x_2 - x_1^2) \\ 1 - x_1 \end{matrix} \right]\)

    \(J = \left[ \begin{matrix} -20x_1 & 10 \\ -1 & 0 \end{matrix} \right]\)

    then the answer can be calculated with

    \(\Delta x = -[J^tJ + \lambda I]^{-1}J^tr\)
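A sketch of the iteration on Rosenbrock's function, using the \(r\) and \(J\) above (the fixed damping \(\lambda\) is an assumption; the full algorithm adapts \(\lambda\) each step):

```python
import numpy as np

def lm_rosenbrock(x, lam=1e-3, iters=50):
    """Damped Gauss-Newton steps dx = -[J^tJ + lam*I]^{-1} J^t r on Rosenbrock."""
    for _ in range(iters):
        r = np.array([10 * (x[1] - x[0]**2), 1 - x[0]])   # residuals, f = |r|^2
        J = np.array([[-20 * x[0], 10.0],
                      [-1.0, 0.0]])                       # Jacobian of r
        dx = -np.linalg.solve(J.T @ J + lam * np.eye(2), J.T @ r)
        x = x + dx
    return x

print(lm_rosenbrock(np.array([-1.2, 1.0])))   # -> approximately [1. 1.]
```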

Conjugate Gradients

Converges faster than plain gradient descent.

  1. Let \(f(x) = 0.5x_1^2 + 0.2x_2^2 + 0.6x_3^2\)

    \(\alpha = 0.931\)

    \[x(1) = \left( \begin{matrix} -2 \\ 2 \\ 0 \end{matrix} \right) + 0.931 \left( \begin{matrix} 2 \\ -.8 \\ 2.4 \end{matrix} \right) = \left( \begin{matrix} -.138 \\ 1.255 \\ 2.35 \end{matrix} \right)\]

    \(\beta = 0.0337\)

    \(p(1) = -\nabla f(x(1)) + \beta\, p(0)\)
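A sketch of linear conjugate gradients on this quadratic, with exact line search for \(\alpha\) and the Fletcher-Reeves \(\beta\) (the starting point is an assumption):

```python
import numpy as np

def conjugate_gradient(Q, x, iters=3):
    """Linear CG on f(x) = 0.5 x^t Q x: exact alpha, Fletcher-Reeves beta."""
    g = Q @ x                              # gradient of the quadratic
    p = -g                                 # first direction: steepest descent
    for _ in range(iters):
        if g @ g < 1e-12:                  # already at the minimum
            break
        alpha = (g @ g) / (p @ Q @ p)      # exact minimizing step along p
        x = x + alpha * p
        g_new = Q @ x
        beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves coefficient
        p = -g_new + beta * p              # next direction, conjugate to the last
        g = g_new
    return x

Q = np.diag([1.0, 0.4, 1.2])               # f(x) = 0.5x1^2 + 0.2x2^2 + 0.6x3^2
print(conjugate_gradient(Q, np.array([-2.0, 2.0, 2.0])))  # -> [0 0 0] in <= 3 steps
```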

Search

Types

brute force - evaluate every possible solution

greedy search - 'cheapest link': choose the cheapest path from the current point

hill climbing - randomly pick a solution/path. If the error decreases, accept this change. Otherwise, discard it.

Simulated Annealing

Algorithm

Let \(E_n\) be the current error, called energy, and let \(T\) be a changing scaling value, called temperature.

If \(E_n - E_{n-1} < 0\), accept the change.

Else if \(e^{\frac{E_{n-1} - E_n}{T}}\) is bigger than a random value between \(x\) and 1, accept. (\(x\) is usually 0.8)

Else, reject the change.

After each step, cool the temperature: \(T \leftarrow cT\), for \(0 < c \leq 1\).

The simulated annealing algorithm is a type of hill climbing. It is different from steepest descent in that small positive changes in error are allowed in the hopes that local minima can be escaped.
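A minimal sketch of the acceptance rule above (the 1-D energy and neighbor functions are hypothetical stand-ins):

```python
import math
import random

def simulated_annealing(energy, neighbor, s0, T=1.0, c=0.95, iters=1000):
    """Hill climbing that occasionally accepts worse solutions, per the rule above."""
    s, E = s0, energy(s0)
    for _ in range(iters):
        s_new = neighbor(s)                      # random candidate change
        E_new = energy(s_new)
        if E_new - E < 0:                        # lower energy: always accept
            s, E = s_new, E_new
        elif math.exp((E - E_new) / T) > random.uniform(0.8, 1.0):
            s, E = s_new, E_new                  # small uphill move, accepted
        T *= c                                   # cool: T <- cT, 0 < c <= 1
    return s

# Hypothetical example: a 1-D energy with several local minima
f = lambda x: x * x + 10 * math.sin(x)
print(simulated_annealing(f, lambda x: x + random.uniform(-1, 1), s0=5.0))
```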

Genetic Algorithms

A genetic algorithm is an iterative process which modifies solution parameters in a random way in each iteration.

chromosome - parameters or weights to be changed

organism - a set of parameters

child - an organism derived from another organism

Most genetic algorithms select the most effective organisms in each iteration, then randomize and combine their chromosomes (crossing over).

Crossing Over Methods

Let \(C = \{p_1, ..., p_n\}\)

  • proportional - \(C_{new} = \beta C_{p1} + (1 - \beta) C_{p2}\) for \(0 \leq \beta \leq 1\)

  • difference - \(C_{new} = C_{p1} + \beta (C_{p1} - C_{p2})\) for \(0 \leq \beta \leq 1\)

  • uniform crossover - one of the parent chromosomes is randomly selected and transferred to the child
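A sketch of the three crossover methods (numpy assumed; \(\beta\) drawn uniformly from [0, 1]):

```python
import numpy as np

rng = np.random.default_rng()

def proportional(c1, c2):
    """C_new = beta*C_p1 + (1-beta)*C_p2, a weighted blend of the parents."""
    beta = rng.uniform(0, 1)
    return beta * c1 + (1 - beta) * c2

def difference(c1, c2):
    """C_new = C_p1 + beta*(C_p1 - C_p2), extrapolating along the parents' difference."""
    beta = rng.uniform(0, 1)
    return c1 + beta * (c1 - c2)

def uniform_crossover(c1, c2):
    """Each gene is copied from a randomly chosen parent."""
    mask = rng.uniform(size=c1.shape) < 0.5
    return np.where(mask, c1, c2)

p1, p2 = np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])
print(proportional(p1, p2), difference(p1, p2), uniform_crossover(p1, p2))
```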

Mutations

Mutations increase variability between generations and decrease the chance that organisms will converge on a local extremum rather than the global one. Typically, a percentage called the mutation rate specifies the fraction of parameters to mutate, either by replacement with a random number or by bit flipping.

Unsupervised Learning

unsupervised learning - training data consists only of inputs; there are no error/cost functions to determine the fitness of parameters

The most common unsupervised learning is clustering, which is the process of separating inputs into groups (or clusters) of similar input.

Several implementations of clustering are given below.

K-means Algorithm

  1. Pick N cluster centers and assign them to random coordinates. (choosing the number of cluster centers is a problem in itself)
  2. Assign each datapoint to the nearest cluster.
  3. For each cluster, move it to the average location of all datapoints assigned to it.
  4. Repeat 2-3 until clusters stop moving.
  5. The system is now trained and ready for classifying input.
Related terms:

  • covariance matrix
  • semi-supervised learning
  • Lloyd's algorithm
  • Linde-Buzo-Gray algorithm (LBG)
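A minimal numpy sketch of K-means steps 1-5 above (empty clusters are not handled; the sample data is hypothetical):

```python
import numpy as np

def kmeans(X, n_clusters, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_clusters, replace=False)]  # step 1: initial centers
    for _ in range(iters):
        # step 2: assign each datapoint to the nearest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: move each center to the mean of its assigned datapoints
        moved = np.array([X[labels == k].mean(axis=0) for k in range(n_clusters)])
        if np.allclose(moved, centers):    # step 4: stop when the centers settle
            break
        centers = moved
    return centers, labels                 # step 5: ready to classify new input

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centers, labels = kmeans(X, 2)
print(np.round(centers, 2))                # one center near (0,0), one near (5,5)
```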

PCA

Determine mean \(m_x = \frac{1}{K} \sum_{i=1}^K x_i\)

\(C_x = \frac{1}{K-1} \sum_{k=1}^{K} (x_k - m_x)(x_k - m_x)^t\), where \(K\) is the number of samples (pixels)

\(y = A(x - m_x)\), where \(A\) is the KLT transform whose rows are the eigenvectors of \(C_x\)

Define \(y'\) by setting some components of \(y\) to 0 (this is the dimensionality reduction).
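A sketch of the PCA/KLT steps above (numpy assumed; `keep` controls how many components of \(y\) survive, and the sample data is hypothetical):

```python
import numpy as np

def pca_reduce(X, keep):
    """KLT: project mean-centered data onto eigenvectors of C_x, zero the rest."""
    m_x = X.mean(axis=0)                        # sample mean m_x
    C_x = np.cov(X, rowvar=False)               # covariance with 1/(K-1) factor
    _, vecs = np.linalg.eigh(C_x)               # eigenvectors, ascending eigenvalue
    A = vecs[:, ::-1].T                         # rows of A = eigenvectors, largest first
    y = (X - m_x) @ A.T                         # y = A(x - m_x), applied per row
    y[:, keep:] = 0                             # y': drop trailing components
    return y @ A + m_x                          # reconstruct from the reduced y'

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])
print(np.round(pca_reduce(X, keep=2)[:3], 3))   # X approximated from 2 components
```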

Linear Separability

It is useful to know if a dataset can be easily separated by a line so that we can choose the correct method for classification. This is called linear separability.

Linear Separability

\(\text{separability} = S_B/S_W\),

where \(S_W = \sum_c p_c \sum_{j \in c} (x_j - \mu_c)(x_j - \mu_c)^T\) is the within-class scatter and

\(S_B = \sum_c (\mu_c - \mu)(\mu_c - \mu)^T\) is the between-class scatter

Rayleigh Quotient rule - the projection \(w\) maximizing \(\frac{w^T S_B w}{w^T S_W w}\) (a Rayleigh quotient) is the leading eigenvector of \(S_W^{-1} S_B\)

Linear Discriminant Analysis (LDA)

LDA is used as a preprocessing algorithm to simplify labeled data before it is fed into a machine learning algorithm. It looks at which parameters have the largest effect on the class for a datapoint, then combines parameters and reduces dimensionality.

Particle Swarm Optimization (PSO)

PSO is an iterative algorithm which models the movements of birds or fish in a swarm. Namely, the movements of individuals within the population are influenced by both what the individual perceives as the best move and what the swarm perceives as the best move. These components are then weighted and summed together to determine how the individual should move.

\(v_\text{new} = v_\text{old} + \Gamma_l \times r_l \times (p_\text{local best} - p_\text{old}) + \Gamma_g \times r_g \times (p_\text{global best} - p_\text{old})\)

\(p_\text{new} = p_\text{old} + v_\text{new}\)

where

  • \(v\) - particle velocity
  • \(p\) - particle position (parameters/weights)
  • \(r_l, r_g\) - uniform random numbers
  • \(\Gamma_l, \Gamma_g\) - learning factors
  • \(p_\text{local best}\) - best position ever found by this particle
  • \(p_\text{global best}\) - best position ever found by any particle

\(f(x_i,z_i) = w_1 d_{max}(z_i,x_i) + w_2(z_{max} - d_{min}(x_i))\)

where \(z_{max}\) is the maximum pixel value.
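A sketch of the PSO update above on a hypothetical sphere function (the velocity clamp is an added assumption to keep the bare update rule stable; it is not part of the formula above):

```python
import numpy as np

def pso(f, dim, n=30, iters=200, gamma_l=2.0, gamma_g=2.0):
    rng = np.random.default_rng(0)
    p = rng.uniform(-5, 5, (n, dim))             # particle positions
    v = np.zeros((n, dim))                       # particle velocities
    best_p = p.copy()                            # each particle's local best
    best_f = np.apply_along_axis(f, 1, p)
    g_best = best_p[best_f.argmin()].copy()      # the swarm's global best
    for _ in range(iters):
        r_l = rng.uniform(size=(n, dim))         # uniform random numbers
        r_g = rng.uniform(size=(n, dim))
        v = v + gamma_l * r_l * (best_p - p) + gamma_g * r_g * (g_best - p)
        v = np.clip(v, -1, 1)                    # clamp: an added stabilizing assumption
        p = p + v
        vals = np.apply_along_axis(f, 1, p)
        better = vals < best_f                   # update each particle's best
        best_p[better], best_f[better] = p[better], vals[better]
        g_best = best_p[best_f.argmin()].copy()
    return g_best

print(pso(lambda x: np.sum(x**2), dim=2))        # -> approximately [0, 0]
```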

Distance

\(d(a,b) = \sqrt{\sum_{k=1}^N (a_k - b_k)^2}\)