Introduction to Extreme Learning Machines

Published 13 May 2020 · 29 min read

What is ELM?

ELM (Extreme Learning Machine) is a type of feedforward neural network, “invented” in 2006 by Guang-Bin Huang.

As said in the original paper:

this algorithm tends to provide good generalization performance at extremely fast learning speed

Hence the word “Extreme” in ELM.

How ELM differs from a standard neural network

ELM doesn’t require gradient-based backpropagation to work. It uses the Moore-Penrose generalized inverse to set its output weights.

First, let’s look at a standard SLFN (Single hidden Layer Feedforward Neural network):

SLFN
Single hidden Layer Feedforward Neural network, Source: Shifei Ding under CC BY 3.0

It’s pretty straightforward:

  1. multiply inputs by weights
  2. add bias
  3. apply the activation function
  4. repeat steps 1-3 number of layers times
  5. calculate output
  6. backpropagate
  7. repeat everything

ELM removes step 4 (because it’s always an SLFN), replaces step 6 with a matrix inverse, and does everything only once, so step 7 goes away as well.
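To make steps 1–4 concrete, here is a minimal sketch of a standard feedforward pass in NumPy (all sizes and names are illustrative, with tanh standing in for the activation function):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(X, layers):
    """Steps 1-4: for each layer, multiply inputs by weights, add bias, activate."""
    for W, b in layers:
        X = np.tanh(X @ W + b)  # steps 1-3 for one layer
    return X

# Hypothetical 2-layer network: 3 inputs -> 8 hidden units -> 2 outputs
layers = [(rng.normal(size=(3, 8)), rng.normal(size=8)),
          (rng.normal(size=(8, 2)), rng.normal(size=2))]
out = forward(rng.normal(size=(5, 3)), layers)
print(out.shape)  # (5, 2)
```

In ELM the loop body runs exactly once (single hidden layer), and there is no backward pass to follow it.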

More details

Before going into details, we need to look at how the ELM output is calculated:

f_L(x_j) = \sum_{i=1}^{L}\beta_i g_i(x_j) = \sum_{i=1}^{L}\beta_i g(w_i \cdot x_j + b_i), \quad j = 1,\dots,N

Where:

  • L is the number of hidden units
  • N is the number of training samples
  • \beta_i is the weight vector between the i-th hidden unit and the output
  • w_i is the weight vector between the input and the i-th hidden unit
  • g is an activation function
  • b_i is a bias
  • x_j is an input vector

It is quite similar to what’s going on in a standard NN with backpropagation, but if you look closely you can see that the weight between the hidden layer and the output is named \beta. This \beta matrix is special because it is the one we will compute with the pseudo-inverse. We can shorten the equation and write it as:

T = H\beta

Where:

H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \dots & g(w_L \cdot x_1 + b_L) \\ \vdots & \dots & \vdots \\ g(w_1 \cdot x_N + b_1) & \dots & g(w_L \cdot x_N + b_L) \end{bmatrix}_{N \times L} \quad \beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_L^T \end{bmatrix}_{L \times m} \quad T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}_{N \times m}

Where:

  • m is the number of outputs
  • H is called the Hidden Layer Output Matrix
  • T is the training data target matrix
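The matrix H can be built directly in NumPy; a short sketch, assuming a sigmoid activation and randomly drawn w_i and b_i (all sizes here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

N, n, L = 6, 4, 10                   # samples, input features, hidden units

X = rng.normal(size=(N, n))          # rows are x_1 ... x_N
W = rng.normal(size=(n, L))          # columns are w_1 ... w_L
b = rng.normal(size=(L,))            # b_1 ... b_L

def g(z):
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid activation

# H[j, i] = g(w_i . x_j + b_i), so H is N x L as in the definition above
H = g(X @ W + b)
print(H.shape)  # (6, 10)
```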

The theory behind the learning (You can skip this section if you want)

Now we have to dig deeper into the theory behind the network to decide what to do next.

First theorem:

Given a standard SLFN with N hidden nodes and an activation function g: R \rightarrow R which is infinitely differentiable in any interval, for N arbitrary distinct samples (x_i, t_i), where x_i \in R^n and t_i \in R^m, for any w_i and b_i randomly chosen from any intervals of R^n and R, respectively, according to any continuous probability distribution, then with probability one, the hidden layer output matrix H of the SLFN is invertible and

|| H\beta - T || = 0

A function is infinitely differentiable if it is a smooth function.

Second theorem:

Given any small positive value \epsilon > 0 and an activation function g: R \rightarrow R which is infinitely differentiable in any interval, there exists L \leq N such that for N arbitrary distinct samples (x_i, t_i), where x_i \in R^n and t_i \in R^m, for any w_i and b_i randomly chosen from any intervals of R^n and R, respectively, according to any continuous probability distribution, then with probability one,

|| H_{N \times L}\beta_{L \times m} - T_{N \times m} || < \epsilon

Third theorem:

(Serre; Rao and Mitra in Generalized Inverse of Matrices and its Applications). A matrix G of order n \times m is the Moore-Penrose generalized inverse of a matrix A of order m \times n, if:

AGA = A, \quad GAG = G, \quad (AG)^T = AG, \quad (GA)^T = GA
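These four conditions can be checked numerically; a quick sketch using NumPy’s `np.linalg.pinv`, which computes the Moore-Penrose inverse via SVD:

```python
import numpy as np

rng = np.random.default_rng(2)

A = rng.normal(size=(5, 3))   # any rectangular matrix
G = np.linalg.pinv(A)         # Moore-Penrose generalized inverse, 3 x 5

# The four defining conditions from the theorem:
assert np.allclose(A @ G @ A, A)        # AGA = A
assert np.allclose(G @ A @ G, G)        # GAG = G
assert np.allclose((A @ G).T, A @ G)    # (AG)^T = AG
assert np.allclose((G @ A).T, G @ A)    # (GA)^T = GA
print("all four Moore-Penrose conditions hold")
```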

I’m not going to prove those theorems, but if you’re interested please refer to page 3 of ELM-NC-2006 for further explanation.

Now what we have to do is define our cost function. Basing our assumptions on Capabilities of a four-layered feedforward neural network: four layers versus three, we can see that an SLFN is a linear system if the input weights and the hidden layer biases can be chosen randomly.

Because our ELM is a linear system, we can state the optimization objective:

|| H\hat\beta - T || = \min_{\beta}|| H\beta - T ||

To approximate the solution we need to use Rao’s and Mitra’s work again:

p. 51 of Rao and Mitra, Generalized Inverse of Matrices and its Applications:

Let there exist a matrix GG such that GyGy is a minimum norm least-squares solution of a linear system Ax=yAx = y. Then it is necessary and sufficient that G=AG = A^{\dagger}, the Moore–Penrose generalized inverse of matrix AA.

Now we can see that, because H is invertible, we can calculate \hat\beta as:

\hat\beta = H^{\dagger}T

Learning algorithm

After going through some difficult math, we can now define the learning algorithm. The algorithm itself is relatively simple:

  1. Randomly assign weights w_i and biases b_i, i = 1,...,L
  2. Calculate the hidden layer output matrix H
  3. Calculate the output weight matrix \hat\beta = H^{\dagger}T
  4. Use \hat\beta to make predictions on new data: T = H\hat\beta
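The four steps above translate almost line for line into NumPy. A minimal sketch, assuming a sigmoid activation and a toy regression target (the class name, sizes, and data are all illustrative, not the repository’s code):

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ELM:
    def __init__(self, n_features, n_hidden):
        # Step 1: randomly assign input weights w_i and biases b_i (never trained)
        self.W = rng.normal(size=(n_features, n_hidden))
        self.b = rng.normal(size=(n_hidden,))

    def fit(self, X, T):
        # Step 2: hidden layer output matrix H (N x L)
        H = sigmoid(X @ self.W + self.b)
        # Step 3: output weights via the Moore-Penrose inverse
        self.beta = np.linalg.pinv(H) @ T
        return self

    def predict(self, X):
        # Step 4: T = H * beta
        return sigmoid(X @ self.W + self.b) @ self.beta

# Toy regression problem: learn y = x1 + x2 + x3
X = rng.normal(size=(200, 3))
y = X.sum(axis=1, keepdims=True)
model = ELM(n_features=3, n_hidden=50).fit(X, y)
err = np.abs(model.predict(X) - y).mean()
print(err)
```

Note that there is no loop over epochs anywhere: `fit` is a single pseudo-inverse, which is where the speed comes from.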

If you’re interested in seeing a Python implementation, please check this repository:

https://github.com/burnpiro/elm-pure

And here is a preview of how the model works on MNIST dataset:

https://github.com/burnpiro/elm-pure/blob/master/ELM%20example.ipynb

As you can see, this simple version of ELM achieves >91% accuracy on the MNIST dataset, and it takes around 3 s to train the network on an Intel i7-7820X CPU.

Performance comparison

I’m going to use metrics from the original paper in this section, and it might surprise you how long some of the training takes compared with the previous MNIST example; remember, though, that the original paper was published in 2006 and the networks were trained on a Pentium 4 1.9 GHz CPU.

Datasets

| Problem | Training samples | Testing samples | Attributes | Classes |
| --- | --- | --- | --- | --- |
| Satellite image | 4400 | 2000 | 36 | 7 |
| Image segmentation | 1500 | 810 | 18 | 7 |
| Shuttle | 43500 | 14500 | 9 | 7 |
| Banana | 5200 | 490000 | 2 | 2 |

Results

| Problem | Algorithm | Training (s) | Testing (s) | Acc Train (%) | Acc Test (%) | Nodes |
| --- | --- | --- | --- | --- | --- | --- |
| Satellite image | ELM | 14.92 | 0.34 | 93.52 | 89.04 | 500 |
| Satellite image | BP | 12561 | 0.08 | 95.26 | 82.34 | 100 |
| Image segment | ELM | 1.40 | 0.07 | 97.35 | 95.01 | 200 |
| Image segment | BP | 4745.7 | 0.04 | 96.92 | 86.27 | 100 |
| Shuttle | ELM | 5.740 | 0.23 | 99.65 | 99.40 | 50 |
| Shuttle | BP | 6132.2 | 0.22 | 99.77 | 99.27 | 50 |
| Banana | ELM | 2.19 | 20.06 | 92.36 | 91.57 | 100 |
| Banana | BP | 6132.2 | 21.10 | 90.26 | 88.09 | 100 |

We can ignore training time for now because it’s obvious that gradient descent takes longer than a matrix inversion. The most important information from this results table is the accuracy and the number of nodes. In the first two datasets, you can see that the authors used different sizes of BP networks to achieve the same results as ELM. The BP network was 5x smaller in the first case and 2x smaller in the second, which affects testing times (it’s faster to run a 100-node network than a 500-node one). That tells us how accurate our method is at approximating the dataset.

It is hard to find any tests of ELM networks on popular datasets, but I’ve managed to do so. Here is a benchmark on CIFAR-10 and MNIST:

| Dataset | Algorithm | Acc Test (%) |
| --- | --- | --- |
| CIFAR-10 | ELM 1000 (1x) | 10.64 |
| CIFAR-10 | ELM 3000 (20x) | 71.40 |
| CIFAR-10 | ELM 3500 (30x) | 87.55 |
| CIFAR-10 | ReNet (2015) | 87.65 |
| CIFAR-10 | EfficientNet (2019) | 98.90 |
| MNIST | ELM 512 | 92.15 |
| MNIST | DELM 15000 | 99.43 |
| MNIST | RNN | 99.55 |
| MNIST | BP 6-layer 5700 | 99.65 |

Where:

  • DELM is a deep ELM
  • ReNet is described in this paper
  • RNN is a Recurrent Neural Network
  • EfficientNet is described in this paper

I didn’t find training times for ELMs, so there was no way to compare them with the results from other networks, but all those multipliers (20x, 30x) are relative differences in training time, measured against training ELM 1000 on CIFAR-10. If there is a 30x time increase between ELM 1000 and ELM 3500, you can imagine how long it would take to train a DELM, which has 15000 neurons.

Conclusion

ELMs are not as accurate as traditional neural networks, but they can be useful for problems that require real-time retraining of the network. I’m going to write another article describing the evolution and usage of ELMs soon. For now, it’s up to you to form your own opinion about these networks.

There were a lot of controversies around ELMs, and I’m not the best person to judge. I’m just going to forward you to the Wikipedia page with the description.

References:

  • Guang-Bin Huang, Qin-Yu Zhu, Chee-Kheong Siew. Extreme learning machine: Theory and applications, 2006 https://www.ntu.edu.sg/home/egbhuang/pdf/ELM-NC-2006.pdf
  • C.R. Rao, S.K. Mitra. Generalized Inverse of Matrices and its Applications, Wiley, New York, 1971.
  • D. Serre. Matrices: Theory and Applications, Springer, New York, 2002.
  • S. Tamura, M. Tateishi, Capabilities of a four-layered feedforward neural network: four layers versus three, IEEE Trans. Neural Networks 8 (2) (1997) 251–255.