What is ELM?
ELM (Extreme Learning Machines) are feedforward neural networks. “Invented” in 2006 by G. Huang.
As said in the original paper:
this algorithm tends to provide good generalization performance at extremely fast learning speed
Hence the phrase “Extreme” in ELM.
Why ELM is different from standard Neural Network
ELM doesn’t require gradient based backpropagation to work. It uses Moore-Penrose generalized inverse to set its weights.
First, we look at standard SLFN (Single hidden Layer Feedforward Neural network):
It’s pretty straightforward:
- multiply inputs by weights
- add bias
- apply the activation function
- repeat steps 1-3 number of layers times
- calculate output
- repeat everything
ELM removes step 4 (because it’s always SLFN), replaces step 6 with matrix inverse, and does it only once, so step 7 goes away as well.
Before going into details we need to look how ELM output is calculated:
- L is a numbe of hidden units
- N is a number of training samples
- is weight vector between th hidden layer and output
- w is a weight vector between input and hidden layer
- g is an activation function
- b is a vias vector
- x in an input vector
It is quite similar to what’s going one in standard NN with backpropagation but if you look closely you can see that we’re naming weight between hidden layer and output as . This matrix is a special matrix because that is our pseudo-inverse. We can shorten the equation and write it as:
- m is a number of outputs
- H is called Hidden Layer Output Matrix
- T is a training data target matrix
The theory behind the learning (You can skip this section if you want)
Now we have to dig dipper into theories behind the network to decide what to do next.
Given a standard SLFN with N hidden nodes and activation function
which is infinitely differentiable in any interval, for N arbitrary distinct samples , where and , for any and randomly chosen from any intervals of and , respectively, according to any continuous probability distribution, then with probability one, the hidden layer output matrix of the SLFN is invertible and
Function is infinitely differentiable if it’s a smooth function
Given any small positive value and activation function which is infinitely differentiable in any interval, there exists such that for arbitrary distinct samples , where and , for any and randomly chosen from any intervals of and , respectively, according to any continuous probability distribution, then with probability one,
(Serre, Rao, and Mitra in Generalized Inverse of Matrices and its Applications). A matrix G of order is a Moore-Penrose generalized inverse of matrix A of order , if:
I’m not going to prove those theorems but if you’re interested please refer Page 3, ELM-NC-2006 for further explanation.
Now what we have to do is to define our cost function. Bassing our assumptions on Capabilities of a four-layered feedforward neural network: four layers versus three we can see that SLFN is a lineary system if the input weights and the hidden layer biases can be chosen randomly.
Because our ELM is a linear system then we can create optimization objective:
To aproximate the solution we need to use Rao’s and Mitra’s work again:
p. 51 of Rao and Mitra Generalized Inverse of Matrices and its Applications
Let there exist a matrix such that is a minimum norm least-squares solution of a linear system . Then it is necessary and sufficient that , the Moore–Penrose generalized inverse of matrix .
Now we can figure out that because H is invertable we can calculate as:
After going through some difficult math we can define learning algorith now. Algorithm itself is relatively easy:
- Randomly assign weight and bias ,
- Calculate hidden layer output H
- Calculate output weight matrix
- Use to make a prediction on new data
If you’re interested in seening python implementation please check this repository:
And here is a preview of how the model works on MNIST dataset:
As you can see, simple version of ELM achieves >91% accuracy on MNIST dataset and it takes around 3s to train the network on intel i7 7820X CPU.
I’m going to use metrics from the original paper in this section and it might surprise you how long some training is done in compare with previous MNIST example, but remember that original paper was published in 2006 and networks were trained on Pentium 4 1.9GHz CPU.
We can ignore training time for now because it’s obvous that gradient descent takes longer than matrix invert. The most important information form this result table is Accuracy and Nodes. In the first two datasets, you can see that author used different size of BP to achieve the same results as ELM. Size of BP network in first case was 5x smaller and 2x smaller in second case. That affects testing times (it’s faster to run 100 nodes NN than 500 nodes NN). That tells us how accurate is our method in aproximating dataset.
It is hard to find any tests of ELM networks on popular datasets but i’ve manage to do so. Here is a benchmark on CIFAR-10 and MNIST
- DELM is a deep ELM
- ReNet is described in this paper
- RNN is a Recurrent Neural Network
- EfficientNet is described in this paper
I didn’t find training times for ELMs so there was no way to compare them with results from other networks but all those multiplaiers (20x, 30x) are relative differences in training time based on training of ELM 1000 on CIFAR-10. If there is a 30x time increase between ELM 1000 and ELM 3500 then you can imagine how long it would take to train DELM which has 15000 neurons.
ELMs are not as accurate as traditional neural networks, but they can be used when dealing with problems which require real-time retraining of the network. I’m going to write another article describing evolution and usage of ELMs soon. For now, it’s up to you to create an opinion about those networks.
There was a lot of controvesies behind ELMs and I’m not the best person to juge. I’m just going to forward you to wikipedia page with the description.
- Guang-Bin Huang, Qin-Yu Zhu, Chee-Kheong Siew. Extreme learning machine: Theory and applications, 2006 https://www.ntu.edu.sg/home/egbhuang/pdf/ELM-NC-2006.pdf
- C.R. Rao, S.K. Mitra, Generalized Inverse of Matrices and its Applications, Wiley, New York, 1971.
- D. Serre, Matrices: Theory and Applications, Springer, New York, 2002.
- S. Tamura, M. Tateishi, Capabilities of a four-layered feedforward neural network: four layers versus three, IEEE Trans. Neural Networks 8 (2) (1997) 251–255.