Machine Learning - Local Methods
Instance-based learning
The idea is to store the training examples \(D = \{(\vec{x^n}, \vec{t^n})\}\), \(\vec{x^n}\in\Re^{d_{in}}\), \(\vec{t^n}\in\Re^{d_{out}}\)
This leads to the nearest neighbor algorithm: Training consists of memorizing all examples; for an unknown input \(\vec{x}\), find the best match \(\vec{x}^n\) among the stored training samples and output the corresponding \(\vec{t^n}\).
K-nearest neighbor
For an unknown input \(\vec{x}\), find the set \(S\) of the \(k\) nearest neighbors among the stored samples.
- for discrete-valued output: vote among the \(k\) nearest neighbors
- for real-valued output: use the mean of the \(k\) nearest neighbors, \(\vec{y} = \frac{1}{k} \sum_{i\in S}\vec{t^i}\) (see the sketch below)
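A minimal sketch of this prediction rule in NumPy (the function name, array layout, and Euclidean distance are illustrative assumptions):

```python
import numpy as np

def knn_predict(X_train, T_train, x, k, discrete=False):
    """k-nearest-neighbor prediction for a single query x (illustrative sketch).

    X_train: (N, d_in) stored inputs; T_train: (N, d_out) stored outputs,
    or (N,) class labels when discrete=True.
    """
    # Euclidean distances from the query to all stored examples
    dists = np.linalg.norm(X_train - x, axis=1)
    S = np.argsort(dists)[:k]  # indices of the k nearest neighbors

    if discrete:
        # discrete-valued output: majority vote among the k neighbors
        labels, counts = np.unique(T_train[S], return_counts=True)
        return labels[np.argmax(counts)]
    # real-valued output: mean of the k neighbors' targets
    return T_train[S].mean(axis=0)
```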
Properties:
- the plain nearest neighbor approach assigns the stored output to the complete Voronoi tessellation cell around the sample input \(\rightarrow\) hard boundaries
- k-nearest neighbors allows continuous transitions
- suitable choice for \(k\) depends on the local intrinsic data dimensionality
- Training:
- very fast
- requires memory (roughly proportional to the # of examples)
- does not waste information
- does not require parameter settings or complex procedures
- Application:
- may be slow (for many stored examples)
- sensitive to errors and noise
Distance-weighted k-nearest neighbors: The idea is that nearer neighbors are more important than farther ones. Note that the chosen \(k\) is global and difficult to choose, as the intrinsic dimension is unknown and may change locally.
Improve k-nearest neighbors by weighting with the distance to the input \(\vec{x}\) (\(S\) is the set of the indices of the \(k\) nearest neighbors): \(\vec{y} = \frac{1}{\sum_{i\in S}w_i} \sum_{i\in S} w_i \vec{t^i}\) with the inverse distance as weight, \(w_i = 1 / \| \vec{x} - \vec{x^i}\|\). Precautions for a “direct hit” \(\vec{x} = \vec{x^i}\) are necessary. Note that the neighborhood can now be the entire set of examples!
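A sketch of the distance-weighted variant under the same assumptions, with the entire data set as neighborhood when `k` is not given and a simple precaution for direct hits (the tolerance `eps` is an arbitrary choice):

```python
import numpy as np

def weighted_knn_predict(X_train, T_train, x, k=None, eps=1e-12):
    """Distance-weighted k-NN; with k=None all stored examples contribute."""
    dists = np.linalg.norm(X_train - x, axis=1)

    # precaution for a "direct hit": return the stored target directly
    hit = np.argmin(dists)
    if dists[hit] < eps:
        return T_train[hit]

    S = np.arange(len(X_train)) if k is None else np.argsort(dists)[:k]
    w = 1.0 / dists[S]  # inverse-distance weights
    return (w[:, None] * T_train[S]).sum(axis=0) / w.sum()
```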
Locally weighted regression: K-nearest neighbors approximates \(\vec{y}(\vec{x})\) locally for each sample point. The idea is to construct a better local approximation of \(\vec{y}(\vec{x})\) by fitting a function in the region surrounding the query point.
There are two choices to make:
- Fit function, e.g., linear or quadratic
- Error function which will be minimized, e.g., by gradient descent, to get the best parameters of the fit function. The error function should be local.
Possible error functions:
- squared error over only the \(k\) nearest neighbors: \(E_1(\vec{x}^n) = \frac{1}{2} \sum_{\vec{x}^i\in \{k \text{ nearest neighbors of } \vec{x}^n\}}(\vec{t^i} - \vec{y}(\vec{x^i}))^2\)
- error over the entire data set \(D\), where the error of each training sample \(\vec{x^i}\) is weighted by a decreasing function \(K\) of its distance to \(\vec{x^n}\) (this variant is sketched below)
- combine 1 and 2
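As an illustration of the second choice, the following sketch fits a local linear model whose squared error over the entire data set is weighted by a Gaussian kernel of the distance to the query point \(\vec{x}^n\). It uses the closed-form weighted least-squares solution instead of gradient descent, and the bandwidth `tau` is an assumed parameter:

```python
import numpy as np

def locally_weighted_linear_fit(X_train, T_train, x_q, tau=1.0):
    """Fit a linear model around the query x_q, weighting every training
    example by a Gaussian kernel of its distance to x_q, and return the
    prediction at x_q (weighted least squares, a sketch)."""
    # design matrix with a bias column
    A = np.hstack([np.ones((len(X_train), 1)), X_train])
    a_q = np.concatenate([[1.0], x_q])

    # kernel weights K(d), decreasing with distance to the query point
    d2 = np.sum((X_train - x_q) ** 2, axis=1)
    K = np.exp(-d2 / (2.0 * tau ** 2))

    # minimize the weighted squared error via the normal equations
    W = np.diag(K)
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ T_train)
    return a_q @ beta
```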
Radial basis functions
RBFs provide a global approximation of a target function by a linear combination of local approximations. The method is related to distance-weighted regression and to neural networks. Like MLPs, RBFs represent a mapping \(\vec{x} \rightarrow \vec{y}\), \(\vec{x} \in \Re^{d_{in}}\), \(\vec{y}\in\Re^{d_{out}}\)
Architecture:
- single layer of units / neurons
- each neuron gets the same input
- activation of a neuron according to match between input and weights
- activation function is unimodal (usually a Gaussian), not sigmoid!
- activation function is usually called kernel function, since it defines an “area of responsibility” in the input space
- neurons contribute to the vector-valued output according to their weights
- highly activated neurons contribute more
- thus the output function is represented by local functions with “compact support”
Output of the RBF network with \(N\) neurons: \(\vec{y} = \vec{w_0} + \sum_{i=1}^{N}\vec{w_i}K_i(\|\vec{x}-\vec{\xi_i}\|)\) with the kernel function \(K_i(\|\vec{x}-\vec{\xi_i}\|) = \exp\left(-\frac{1}{2\sigma_i^2}\|\vec{x}-\vec{\xi_i}\|^2\right)\).
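A minimal sketch of this forward pass (argument names and shapes are assumptions):

```python
import numpy as np

def rbf_forward(x, centers, sigmas, W, w0):
    """y = w0 + sum_i w_i * K_i(||x - xi_i||) for a single input x.

    centers: (N, d_in) kernel centers xi_i; sigmas: (N,) radii;
    W: (N, d_out) output weights w_i; w0: (d_out,) bias.
    """
    d2 = np.sum((centers - x) ** 2, axis=1)   # ||x - xi_i||^2
    K = np.exp(-d2 / (2.0 * sigmas ** 2))     # Gaussian kernel activations
    return w0 + K @ W
```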
Training:
- find suitable “centers” \(\vec{\xi_i}\in\Re^{d_{in}}\)
- find suitable “radii of influence” \(\sigma_i\)
- find output weights \(\vec{w_i}\in\Re^{d_{out}}\) to form output
Several methods exist to solve these tasks:
Finding suitable input weights \(\vec{\xi_i}\):
- use examples (instances): \(\vec{\xi_i} = \vec{x^i}\)
- clustering on input part of examples
Finding the radii:
E.g., define the radius by the distance to the nearest neighboring center, controlled by \(\gamma\): \(\sigma_i = \gamma \cdot \min_{k \neq i} \| \vec{\xi_i} - \vec{\xi_k} \|\)
Output weights \(\vec{w_i}\) are trained like perceptrons (delta rule): \(\Delta\vec{w_i} = \eta (\vec{t} - \vec{y})K_i(\|\vec{x}-\vec{\xi_i}\|)\)
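Putting these pieces together, a sketch of RBF training that takes the centers directly from the examples, sets the radii from the nearest-center distance, and adapts the output weights with the delta rule (the learning rate `eta`, the scale `gamma`, the number of epochs, and the fixed bias are illustrative choices):

```python
import numpy as np

def train_rbf(X_train, T_train, gamma=1.0, eta=0.1, epochs=100):
    """Sketch of RBF training; X_train: (N, d_in), T_train: (N, d_out)."""
    centers = X_train.copy()                  # xi_i = x^i (one neuron per example)
    N, d_out = len(centers), T_train.shape[1]

    # sigma_i = gamma * min_{k != i} ||xi_i - xi_k||
    D = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    sigmas = gamma * D.min(axis=1)

    W = np.zeros((N, d_out))
    w0 = T_train.mean(axis=0)                 # bias held fixed at the mean target
    for _ in range(epochs):
        for x, t in zip(X_train, T_train):
            K = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * sigmas ** 2))
            y = w0 + K @ W
            W += eta * np.outer(K, t - y)     # Delta w_i = eta (t - y) K_i
    return centers, sigmas, W, w0
```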
Compare RBF and MLP:
- effect of an adaptation step:
- RBF: an input acts locally on only one or a few basis functions \(\rightarrow\) affects only the performance on data in this input area
- MLP: input-output pair may change all weights \(\rightarrow\) may affect performance on all data
- both have architectural parameters:
- RBF: one easy to interpret parameter (# basis functions)
- MLP: # layers, # neurons in each layer
- both have adaptation parameters:
- RBF:
- clustering parameters
- radii
- stepsize for supervised training
- MLP:
- stepsize
- various others such as momentum
- \(\rightarrow\) parameters of RBFs are decoupled and easy to interpret
- \(\rightarrow\) the effect of MLP parameters is difficult to predict, as they interact in a complex way during the minimization
Self-organizing maps
One of the big questions: Given signal data, how do we get an abstract, symbolic representation? Some important aspects of this task:
- concept formation: as long as we cannot achieve that, at least find reasonable prototypes
- filtering relevant from irrelevant information
- finding structure, in particular, relations between concepts/prototypes
\(\rightarrow\) A highly useful tool would be a topology-preserving mapping from signals to a higher level. In 1982, Teuvo Kohonen came up with the Kohonen net, also known as the self-organizing map (SOM).
SOMs are made of a 2-dimensional layer of neurons.
- for the first time, we consider the spatial (physical) arrangement of neurons in a layer
- all neurons receive the same input \(\vec{x}\in\Re^d\)
- competition:
- the neuron at location \(\vec{s}\) with best matching weights \(\vec{w_{\vec{s}}}\) “wins”, i.e., has highest excitation \(y_{\vec{s}}\in\Re\)
- the best-matching neuron adapts its weights, but lateral interaction causes neighboring neurons to adapt, too
Notation:
- input: \(\vec{x}\in\Re^d\)
- weights of neuron at grid location \(\vec{r}\): \(\vec{w_{\vec{r}}}\)
- excitation of neuron at grid location \(\vec{r}\): \(y_{\vec{r}}\)
- grid location of maximum excitation: \(\vec{s}\), determined by \(\vec{w_{\vec{s}}} \cdot \vec{x} > \vec{w_{\vec{r}}} \cdot \vec{x}\) for all \(\vec{r} \neq \vec{s}\)
Excitation over the layer caused by lateral interactions between the excitation center \(\vec{s}\) and surrounding locations \(\vec{r}\) is modeled by a unimodal function, usually a Gaussian: \(h_{\vec{r}\vec{s}} = \exp(-\|\vec{r}-\vec{s}\|^2 / (2\sigma^2))\). Adaptation rule (Kohonen’s rule): \(\Delta \vec{w_{\vec{r}}} = \eta \, h_{\vec{r}\vec{s}} \, (\vec{x} - \vec{w_{\vec{r}}})\)
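A compact sketch of SOM training with Kohonen’s rule (the grid size, learning rate `eta`, neighborhood width `sigma`, and number of epochs are assumed parameters; the winner is chosen by smallest Euclidean distance, which agrees with the dot-product criterion for normalized weights):

```python
import numpy as np

def train_som(X, grid_h, grid_w, eta=0.1, sigma=1.0, epochs=10, seed=0):
    """Sketch of a self-organizing map on a grid_h x grid_w neuron grid."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(size=(grid_h, grid_w, d))  # weights w_r per grid location r

    # grid coordinates r of every neuron, shape (grid_h, grid_w, 2)
    R = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w),
                             indexing="ij"), axis=-1).astype(float)

    for _ in range(epochs):
        for x in X:
            # winner s: best-matching weights (smallest distance to x here)
            dists = np.linalg.norm(W - x, axis=2)
            s = np.unravel_index(np.argmin(dists), dists.shape)

            # lateral interaction h_rs and Kohonen's adaptation rule
            h = np.exp(-np.sum((R - np.array(s, dtype=float)) ** 2, axis=2)
                       / (2.0 * sigma ** 2))
            W += eta * h[..., None] * (x - W)
    return W
```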