Data Science at Home
History and applications of Deep Learning: A New Podcast Episode
What is deep learning?
If you have no patience: deep learning is the result of training many layers of non-linear processing units for feature extraction and data transformation, e.g. from pixels to edges, to shapes, to object classification, to scene description and captioning.
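If code speaks to you more than words, here is a minimal sketch of that idea: a stack of non-linear layers, each turning the previous layer's output into a (hopefully) higher-level representation. The layer sizes and the random input are illustrative assumptions, not a real trained network.

```python
# A minimal sketch of "many layers of non-linear processing units".
# Sizes and random input are illustrative assumptions, not a trained model.
import numpy as np

rng = np.random.default_rng(0)

def layer(x, n_out):
    """One non-linear processing layer: affine transform + non-linearity."""
    w = rng.normal(scale=0.1, size=(x.shape[-1], n_out))
    b = np.zeros(n_out)
    return np.tanh(x @ w + b)

x = rng.normal(size=(1, 784))   # e.g. a flattened 28x28 image (raw "pixels")
h1 = layer(x, 256)              # low-level features (think: edges)
h2 = layer(h1, 64)              # mid-level features (think: shapes)
h3 = layer(h2, 10)              # high-level features (think: object classes)
print(h3.shape)                 # (1, 10)
```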
History
As old as the 80s! Then why was this approach abandoned for a while?
The answer lies in the lack of large training datasets and computing power in the early days.
However, five major events occurred along the way, and each of them helped define, and make possible, what we today call deep learning.
Fukushima’s Neocognitron (1980) introduced convolutional neural networks, partially trained by unsupervised learning, with human-directed features in the neural plane.
Backpropagation: Yann LeCun et al. (1989) applied supervised backpropagation to such architectures. Weng et al. (1992) published the Cresceptron, a convolutional neural network for 3-D object recognition from images of cluttered scenes and for segmenting such objects from those images.
Max-pooling (1992) appears to have been first proposed by the Cresceptron, to enable the network to tolerate small-to-large deformations in a hierarchical way while using convolution. Max-pooling helps, but does not guarantee, shift-invariance at the pixel level.
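For readers following along in code, here is a tiny numpy sketch of 2x2 max-pooling on a single-channel feature map; the array and the helper function are illustrative assumptions, not Cresceptron's original implementation.

```python
# A minimal sketch of 2x2 max-pooling on a single-channel feature map.
import numpy as np

def max_pool_2x2(fmap):
    """Downsample by taking the max over non-overlapping 2x2 windows."""
    h, w = fmap.shape
    return fmap[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 1, 5, 6],
                 [2, 2, 7, 8]], dtype=float)
print(max_pool_2x2(fmap))
# [[4. 2.]
#  [2. 8.]]
```

Each 2x2 block is collapsed to its strongest activation, so small shifts of a feature inside a block do not change the output.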
People tried to train deep networks and they mostly failed. Why?
Sepp Hochreiter’s diploma thesis of 1991 formally identified the reason for this failure as the vanishing gradient problem, which affects many-layered feedforward networks and recurrent neural networks.
What are vanishing gradients?
Recurrent networks are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time step of the input sequence processed by the network. As errors propagate from layer to layer, they shrink exponentially with the number of layers, impeding the tuning of the neuron weights that is based on those errors (LSTMs were proposed as a solution in 1997).
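To see this with numbers, here is a tiny sketch of how the error signal shrinks as it is multiplied, at every unfolded time step, by a local derivative and a recurrent weight smaller than one. The weight value, activation and sequence length are made-up assumptions.

```python
# A minimal sketch of vanishing gradients in an unfolded recurrent network:
# the backpropagated error is repeatedly multiplied by the recurrent weight
# and the local tanh derivative, both below 1, so it shrinks exponentially.
import numpy as np

w = 0.9          # recurrent weight (magnitude < 1), an illustrative assumption
h = 0.5          # a hidden pre-activation, so tanh'(h) = 1 - tanh(h)**2 < 1
grad = 1.0       # error signal at the last time step

for t in range(1, 51):
    grad *= w * (1 - np.tanh(h) ** 2)   # one step of backpropagation through time
    if t % 10 == 0:
        print(f"after {t:2d} steps: gradient ~ {grad:.2e}")
```

After a few dozen steps the gradient is vanishingly small, which is why the early layers (or early time steps) barely learn anything.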
Pre-training (Geoffrey Hinton)
Other methods use unsupervised pre-training to structure a neural network, making it first learn generally useful feature detectors. The network is then trained further by supervised back-propagation to classify labeled data. The deep model of Hinton et al. (2006) involves learning the distribution of a high-level representation using successive layers of binary or real-valued latent variables. It uses a restricted Boltzmann machine (Smolensky, 1986) to model each new layer of higher-level features. Each new layer guarantees an increase in the lower bound of the log-likelihood of the data, thus improving the model, if trained properly.
Looks like a tongue-twister, right? Well, it basically says that, if trained well, a network can generate data similar to the ones it was fed from the training set. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data ...
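For the curious, here is a minimal sketch of the greedy layer-wise idea, under illustrative assumptions (toy binary data, made-up layer sizes and hyper-parameters). It is not Hinton's original code: each layer is a restricted Boltzmann machine trained with one step of contrastive divergence (CD-1) on the representation produced by the layer below.

```python
# A minimal sketch of greedy layer-wise pre-training with RBMs and CD-1.
# Data, layer sizes and hyper-parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.1):
    """Train a binary RBM with CD-1; return (weights, hidden_bias, visible_bias)."""
    n_visible = data.shape[1]
    w = rng.normal(scale=0.01, size=(n_visible, n_hidden))
    b_h = np.zeros(n_hidden)
    b_v = np.zeros(n_visible)
    for _ in range(epochs):
        v0 = data
        p_h0 = sigmoid(v0 @ w + b_h)                        # positive phase
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)  # sample hidden units
        p_v1 = sigmoid(h0 @ w.T + b_v)                      # reconstruction
        p_h1 = sigmoid(p_v1 @ w + b_h)                      # negative phase
        w += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(data)
        b_h += lr * (p_h0 - p_h1).mean(axis=0)
        b_v += lr * (v0 - p_v1).mean(axis=0)
    return w, b_h, b_v

# Toy binary "dataset": 200 samples of 20 binary features.
data = (rng.random((200, 20)) < 0.3).astype(float)

# Greedy stacking: the hidden activations of one RBM become the input of the next.
representation = data
for n_hidden in [15, 10]:
    w, b_h, _ = train_rbm(representation, n_hidden)
    representation = sigmoid(representation @ w + b_h)

print(representation.shape)   # (200, 10): the learned higher-level representation
```

After pre-training, the stacked weights would typically be used to initialize a feedforward network that is then fine-tuned with supervised back-propagation, as described above.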