Tricks of training neural nets

Training neural networks has been a major challenge for researchers since the earliest days of deep learning. Unlike most classic machine learning algorithms, neural networks, especially deep ones, need more data and have many more parameters to tune. This makes them more versatile, powerful and scalable at modeling the distribution of real-world data, but it also makes them much harder to train.

Neural networks, or artificial neural networks (ANNs), are said to be inspired by biological neurons. They were first introduced in 1943 (surprisingly early, huh?), but progress was slow. Training remained challenging until the introduction of back propagation in 1986. In short, it is still gradient descent, as in many classic machine learning algorithms, but back propagation uses the chain rule to compute the gradients layer by layer. Back propagation is still widely used today, and it is without doubt one of the biggest breakthroughs in the history of deep learning. With this algorithm, people began to believe for the first time that training a neural network is feasible. However, because of the complex hypothesis space, training a neural net still requires a lot of experience and some fancy tricks, and this is why I am writing this blog post today.

Most of this article comes from the wonderful book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd edition. Many thanks to the author.

## Basics

### Tools for optimization

Grid search and randomized search are universal solutions for fine-tuning the hyperparameters of most machine learning algorithms, and luckily they can be applied to deep learning too. If you are using Keras or tf.keras, you are in luck: you can wrap your Keras models with keras.wrappers.scikit_learn and then use GridSearchCV and RandomizedSearchCV from scikit-learn.
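To make the idea concrete without depending on any framework, here is a minimal plain-Python sketch of randomized search. The function `validation_score` is a hypothetical stand-in for "train a model with these hyperparameters and return its validation score", and the search ranges are assumptions chosen only for illustration. In real use you would let RandomizedSearchCV do this loop for you.

```python
import math
import random


def validation_score(params):
    """Hypothetical stand-in for training a model and scoring it on a validation set.

    This toy score peaks at learning_rate = 0.01 and n_neurons = 100.
    """
    lr_term = (math.log10(params["learning_rate"]) + 2.0) ** 2
    size_term = ((params["n_neurons"] - 100) / 100.0) ** 2
    return -(lr_term + size_term)


def random_search(score_fn, n_iter=50, seed=42):
    """Sample n_iter random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {
            "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform sample
            "n_neurons": rng.randint(10, 300),
            "n_hidden": rng.randint(1, 5),
        }
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = random_search(validation_score)
```

Sampling the learning rate on a log scale is a common choice because good values can span several orders of magnitude.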

You can also use Python libraries such as Hyperopt, Hyperas and Scikit-Optimize to optimize hyperparameters. Companies like Google now offer hyperparameter optimization services too, such as the hyperparameter tuning service of Google Cloud ML Engine. Moreover, AutoML is a hot topic these days and such services are already available: maybe the era of hand-tuned training will end soon?

It is possible, and very promising. But we have to admit there is still a long way to go, and that is why you still need to learn about training tricks today.

### Tips for tuning hyperparameters

We all know deeper networks can model more complex functions, but they are harder to train. So when we design a model, it can be tough to decide on the number of hidden layers. The best (and maybe easiest) practice is to gradually ramp up the number of layers until the network starts to overfit.

Another hyperparameter you have to consider is the number of neurons per layer. People have found that using the same number of neurons in all hidden layers performs well enough. This is convenient because we then have only one hyperparameter to tune instead of one per layer. Just like for the number of layers, we can try increasing the number of neurons gradually until the network starts overfitting. But experience tells us this is hard to do in practice. A simpler approach is to pick a model with more layers and neurons than we actually need, and then use early stopping and other regularization techniques such as dropout to prevent the network from overfitting.

Other hyperparameters include the learning rate, the batch size and so on. As a rule of thumb, the optimal learning rate is about half of the maximum learning rate, i.e. the rate above which training diverges. So we can start with a learning rate large enough to make training diverge, divide it by 3 and try again, and repeat until training stops diverging.

The optimal batch size is usually lower than 32. A larger batch gives a more precise estimate of the gradients, but since we normally do not use vanilla gradient descent, that precision matters less than it does for other machine learning algorithms. On the other hand, a larger batch allows a higher degree of parallelism and therefore faster training. So the batch size normally does not exceed 32, but it should not be too small either. Note that if you use batch normalization, the batch size should generally be larger than 20.

Finally, we don’t really need to tweak the number of training iterations, because early stopping is preferred.
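The divide-by-3 search for a usable learning rate can be sketched in a few lines of plain Python. The toy loss, the starting rate and the helper names `diverges` and `find_stable_learning_rate` are all assumptions made for illustration. With a real network, `diverges` would run a short training job and watch the loss instead.

```python
def diverges(learning_rate, curvature=10.0, steps=50, w0=1.0):
    """Run gradient descent on the toy loss 0.5 * curvature * w**2.

    Returns True if the weight moves away from the minimum at w = 0.
    """
    w = w0
    for _ in range(steps):
        grad = curvature * w              # derivative of the toy loss
        w -= learning_rate * grad
        if abs(w) > 1e6 * abs(w0):        # blown up, no need to continue
            return True
    return abs(w) > abs(w0)


def find_stable_learning_rate(start=10.0, factor=3.0, max_tries=20):
    """Start from a deliberately large rate and divide by `factor` until training stops diverging."""
    lr = start
    for _ in range(max_tries):
        if not diverges(lr):
            return lr
        lr /= factor
    raise RuntimeError("no stable learning rate found")


lr = find_stable_learning_rate()
```

On this toy problem gradient descent diverges for any rate above 0.2, so the search tries 10, 3.33, 1.11 and 0.37 before settling on about 0.123.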

### Activation functions

Vanishing gradients are a long-standing problem when training deep neural nets. The term refers to the phenomenon that the gradients get smaller and smaller as back propagation progresses down to the lower layers. When this happens, the weights of the lower layers remain virtually unchanged because their gradients are negligible.

In 2010, a paper by Xavier Glorot and Yoshua Bengio (one of the 2018 Turing Award recipients) dug deep into this problem and found a few suspects. One of them is the combination of the sigmoid activation function and the normal-distribution weight initialization scheme. They found that under this setting the variance of each layer's outputs is much greater than the variance of its inputs, and it keeps increasing until the sigmoid function saturates, i.e. its output gets close to 0 or 1. When it saturates, the derivative is close to zero, so there is virtually no gradient to propagate back through the network, and the gradients vanish.

Glorot and Bengio argue that, to solve this problem, the variance of the outputs of each layer needs to be equal to the variance of its inputs, and the gradients need to have equal variance before and after flowing through a layer in the reverse direction. They further proposed a novel initialization method, called Xavier initialization or Glorot initialization. The pseudo-code is as below:

    for-each input-hidden weight
        variance = 2.0 / (fan-in + fan-out)
        stddev = sqrt(variance)
        weight = gaussian(mean=0.0, stddev)
    end-for

As you can see, it is just a normal distribution with mean 0 and variance $\sigma^2 = \dfrac{2}{\text{fan}_{in} + \text{fan}_{out}}$. There is also a uniform version, whose range is from $-\sqrt{\dfrac{6}{\text{fan}_{in} + \text{fan}_{out}}}$ to $\sqrt{\dfrac{6}{\text{fan}_{in} + \text{fan}_{out}}}$. Glorot initialization works well for activation functions like linear (i.e. no activation), tanh, sigmoid and softmax.

There are also some variants, which differ only in the scale of the variance and in whether they use $\text{fan}_{in}$ or $\text{fan}_{avg} = \dfrac{\text{fan}_{in} + \text{fan}_{out}}{2}$. In this notation the Glorot variance is $\dfrac{1}{\text{fan}_{avg}}$. He initialization is suitable for ReLU and its variants like ELU, and its variance is $\dfrac{2}{\text{fan}_{in}}$. LeCun initialization differs from Glorot only in that it uses $\text{fan}_{in}$ instead of $\text{fan}_{avg}$, giving a variance of $\dfrac{1}{\text{fan}_{in}}$, and it is the one to use with SELU. All of these methods have a uniform version, whose range $r$ follows from $r = \sqrt{3\sigma^2}$.

Keras uses Glorot initialization with a uniform distribution by default.
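As a framework-free illustration, the three schemes and their uniform versions can be sketched with the standard library alone. The helper names `init_stddev`, `uniform_limit` and `init_weights` are made up for this sketch and are not Keras APIs. In Keras you would simply pick an initializer for the layer.

```python
import math
import random


def init_stddev(fan_in, fan_out, scheme="glorot"):
    """Standard deviation of the normal-distribution version of each scheme."""
    if scheme == "glorot":
        variance = 2.0 / (fan_in + fan_out)   # same as 1 / fan_avg
    elif scheme == "he":
        variance = 2.0 / fan_in
    elif scheme == "lecun":
        variance = 1.0 / fan_in
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return math.sqrt(variance)


def uniform_limit(stddev):
    """Range r of the uniform version: U(-r, r) has variance r**2 / 3."""
    return math.sqrt(3.0) * stddev


def init_weights(fan_in, fan_out, scheme="glorot", uniform=False, rng=None):
    """Return a fan_in x fan_out weight matrix as a list of rows."""
    rng = rng or random.Random(0)
    std = init_stddev(fan_in, fan_out, scheme)
    if uniform:
        r = uniform_limit(std)
        return [[rng.uniform(-r, r) for _ in range(fan_out)] for _ in range(fan_in)]
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


# Glorot uniform for a layer with 300 inputs and 100 outputs.
weights = init_weights(300, 100, scheme="glorot", uniform=True)
```

For this layer the uniform range is $\sqrt{6/400} \approx 0.122$, so every sampled weight lies within that bound.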

Thanks in part to Glorot and Bengio's work, people came to realize that ReLU is a better activation function than sigmoid for deep networks, since it does not saturate for positive values and is quite fast to compute. But ReLU has its own drawback, known as the dying ReLU problem: during training, some neurons stop outputting anything but 0, so they are effectively dead. To solve this, people came up with variants of ReLU, for example the leaky ReLU, $\max(\alpha z, z)$, and the randomized leaky ReLU (RReLU). The parametric leaky ReLU (PReLU) treats $\alpha$ as a learnable parameter. Experiments show it strongly outperforms ReLU on large image datasets but tends to overfit on smaller datasets.

In 2015, a new activation function called the exponential linear unit (ELU) was proposed. It was reported to outperform all ReLU variants in both training time and test-set performance. The only difference between ELU and leaky ReLU is that its negative part is $\alpha(\exp(z) - 1)$ instead of $\alpha z$. Note that the exponential function makes it slower to compute than ReLU and its variants, which shows up at test time. SELU is just a scaled version of ELU.

As a rule of thumb from practice, SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > sigmoid.
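For reference, the ReLU family discussed above can be written as scalar functions in plain Python. This is only a sketch to make the formulas concrete. The default `alpha` values are common choices rather than anything prescribed here, and the two SELU constants are the ones given in the SELU paper.

```python
import math

# Constants from the SELU paper (Klambauer et al., 2017).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def relu(z):
    """max(0, z): does not saturate for positive inputs, outputs 0 for negative ones."""
    return max(0.0, z)


def leaky_relu(z, alpha=0.01):
    """max(alpha * z, z): keeps a small slope for negative inputs (assumes 0 < alpha < 1)."""
    return max(alpha * z, z)


def elu(z, alpha=1.0):
    """Like leaky ReLU, but the negative part is alpha * (exp(z) - 1)."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)


def selu(z):
    """A scaled ELU with fixed alpha and scale."""
    return SELU_SCALE * elu(z, SELU_ALPHA)


values = [f(-1.0) for f in (relu, leaky_relu, elu, selu)]
```

Evaluating all four at a negative input shows the difference: ReLU returns 0, while the others keep a non-zero output and therefore a non-zero gradient.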

### Optimizers

Since we are talking about tricks here, I will not go into the algorithmic details of each optimizer. For those, refer to the book mentioned above, which has a wonderful introduction to these optimizers.

• When using momentum, a momentum value of 0.9 usually works well.
• Nesterov Accelerated Gradient (NAG) is a small variant of momentum optimization, and is almost always faster than vanilla momentum.
• RMSProp fixes the problem of AdaGrad by accumulating only the gradients from the most recent iterations. The decay rate is typically set to 0.9, which works well in almost all cases, so you rarely have to tune it. Except on very simple problems, it almost always beats AdaGrad.
• Adam combines the ideas of momentum and RMSProp. The momentum decay hyperparameter is typically set to 0.9, while the scaling decay hyperparameter is often set to 0.999. Since Adam is an adaptive learning rate algorithm, it requires less tuning of the learning rate.
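To show where these default values enter, here is a minimal sketch of a momentum step and an Adam step for a single scalar weight, using the textbook update rules. The function names, the learning rates, the `eps` value and the toy loss are assumptions for illustration. Real optimizers apply the same rule to every weight at once.

```python
import math


def momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One momentum update: accumulate a velocity, then move along it."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity


def adam_step(w, grad, m, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # momentum-style average of gradients
    s = beta2 * s + (1.0 - beta2) * grad * grad   # RMSProp-style average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                # bias correction for the zero initial values
    s_hat = s / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(s_hat) + eps)
    return w, m, s


# Minimize the toy loss (w - 3)**2, whose gradient is 2 * (w - 3).
w_adam, m, s = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2.0 * (w_adam - 3.0)
    w_adam, m, s = adam_step(w_adam, grad, m, s, t, lr=0.05)

w_mom, velocity = 0.0, 0.0
for _ in range(500):
    grad = 2.0 * (w_mom - 3.0)
    w_mom, velocity = momentum_step(w_mom, grad, velocity, lr=0.01)
```

Both runs end close to the minimum at 3. Note how Adam divides the step by the running root mean square of the gradients, which is why it is less sensitive to the raw learning rate.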

### Batch normalization

In a 2015 paper, Sergey Ioffe and Christian Szegedy proposed a technique called Batch Normalization (BN) to address the vanishing/exploding gradients problems.

The technique consists of adding an operation in the model just before or after the activation function of each hidden layer, simply zero-centering and normalizing each input, then scaling and shifting the result using two new parameter vectors per layer: one for scaling, the other for shifting. In other words, this operation lets the model learn the optimal scale and mean of each of the layer’s inputs.

This method is thoroughly discussed all over the internet, and the steps are quite clear and simple, so I will not talk about the algorithm details here.
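Still, as a quick reminder, the training-time computation described above fits in a few lines. The sketch below uses plain Python lists. The function name `batch_norm` and the `eps` value are assumptions, and the moving averages that a real layer keeps for use at test time are left out.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Training-time batch normalization for one mini-batch.

    batch: list of samples, each a list of feature values.
    gamma, beta: per-feature scale and shift parameters (learned in a real layer).
    """
    n = len(batch)
    n_features = len(batch[0])
    # Per-feature mean and (biased) variance over the mini-batch.
    means = [sum(x[j] for x in batch) / n for j in range(n_features)]
    variances = [sum((x[j] - means[j]) ** 2 for x in batch) / n
                 for j in range(n_features)]
    # Zero-center and normalize, then scale and shift.
    return [
        [gamma[j] * (x[j] - means[j]) / math.sqrt(variances[j] + eps) + beta[j]
         for j in range(n_features)]
        for x in batch
    ]


batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
shifted = batch_norm(batch, gamma=[2.0, 2.0], beta=[5.0, 5.0])
```

With `gamma` set to 1 and `beta` set to 0, both features end up with roughly zero mean and unit variance even though their original scales differ by a factor of ten.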

The authors reported that this technique considerably improved all the DNNs they experimented with. The vanishing gradients problem was strongly reduced and the networks became less sensitive to the weight initialization. They were also able to use larger learning rates to speed up training.

Though it achieves very good performance, batch normalization is tricky to use in RNNs. Another popular technique for dealing with exploding gradients is gradient clipping.

In Keras, implementing gradient clipping can be achieved by just setting the clipvalue or clipnorm argument when creating an optimizer:

    from tensorflow import keras
    # Clip every gradient component to the range [-1.0, 1.0].
    optimizer = keras.optimizers.SGD(clipvalue=1.0)
    model.compile(loss="mse", optimizer=optimizer)


With the above code, every component of the gradient vector is clipped to a value between -1.0 and 1.0. In practice this works quite well, but you may object that it can change the direction of the gradient vector drastically. For example, if the gradient vector is [0.9, 100], after clipping you get [0.9, 1.0]. To avoid this, you can use the clipnorm argument instead, which rescales the whole gradient if its L2 norm is greater than the threshold, so the direction is preserved. As mentioned, simply clipping by value is good enough in most cases. In practice you may try both and compare the results on the validation set to decide.
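The difference between the two options is easy to see on the example vector. The two helpers below are a plain-Python sketch of what clipping by value and clipping by norm do to a single gradient vector. They are illustrative functions, not the Keras implementation.

```python
import math


def clip_by_value(grad, clip_value):
    """Clip every component independently to [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grad]


def clip_by_norm(grad, clip_norm):
    """Rescale the whole vector if its L2 norm exceeds clip_norm, preserving its direction."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clip_norm:
        return list(grad)
    return [g * clip_norm / norm for g in grad]


grad = [0.9, 100.0]
by_value = clip_by_value(grad, 1.0)   # [0.9, 1.0]: the two components are now comparable
by_norm = clip_by_norm(grad, 1.0)     # about [0.009, 1.0]: the second component still dominates
```

Clipping by norm keeps the ratio between the components, at the cost of almost eliminating the first one.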

## Others

The book actually covers much more than what I have mentioned above, for example reusing pretrained layers, dropout and MC dropout. See chapters 10 and 11 if you need more guidance.