# Deep learning


Deep learning (also known as deep structured learning or hierarchical learning) is a set of machine learning algorithms that attempt to learn at multiple levels, corresponding to different levels of abstraction. Consider a simple case with two sets of neurons: one set receives an input signal and the other sends an output signal. When the input layer receives a signal, it passes a modified version of the input to the next layer. In a deep network, there are many layers between the input and output, allowing the algorithm to use multiple processing layers composed of multiple linear and nonlinear transformations.
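As a minimal sketch of that layer-to-layer flow, assuming illustrative sizes (4 inputs, 3 hidden units, 2 outputs) and a tanh activation, a two-layer pass in NumPy might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 4 inputs -> 3 hidden units -> 2 outputs.
x = rng.normal(size=4)            # input signal
W1 = rng.normal(size=(3, 4))      # weights of the first (hidden) layer
W2 = rng.normal(size=(2, 3))      # weights of the output layer

h = np.tanh(W1 @ x)               # hidden layer passes a modified (nonlinear) version of the input
y = W2 @ h                        # output layer combines the hidden activations

print(y.shape)  # (2,)
```

Stacking more such hidden layers between `x` and `y` is what makes the network "deep".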

During the past several years, techniques developed from deep learning research have been affecting a wide range of signal- and information-processing work, within both the traditional and the newly widened scope of the field, including key aspects of machine learning and artificial intelligence.

*Three important reasons for the popularity of deep learning today are:*

- The drastically increased chip processing abilities (e.g., general purpose graphical processing units or GPGPUs),
- The significantly lowered cost of computing hardware, and
- The recent advances in machine learning and signal/information processing research.

These advances have enabled the deep learning methods to effectively exploit complex, compositional nonlinear functions, to learn distributed and hierarchical feature representations, and to make effective use of both labeled and unlabeled data.

Deep learning typically uses artificial neural networks. The levels in these learned statistical models correspond to distinct levels of concepts, where higher level concepts are defined from lower level ones, and the same lower level concepts can help to define many higher level concepts.

*Deep learning has two key aspects:*

- models consisting of multiple layers or stages of nonlinear information processing, and
- methods for supervised or unsupervised learning of feature representations at successively higher, more abstract layers.

Deep learning lies at the intersection of the research areas of neural networks, artificial intelligence, graphical modeling, optimization, pattern recognition, and signal processing.

Historically, the concept of deep learning originated in artificial neural network research. Feed-forward neural networks, or MLPs, with many hidden layers, often referred to as deep neural networks (DNNs), are good examples of models with a deep architecture. Back-propagation (BP), popularized in the 1980s, has been a well-known algorithm for learning the parameters of these networks. Unfortunately, back-propagation alone did not work well in practice.

Using hidden layers with many neurons significantly improves a DNN's modeling power and creates many near-optimal configurations. Even if parameter learning gets trapped in a local optimum, the resulting DNN can still perform quite well, since the chance of landing in a poor local optimum is lower than when the network has only a small number of neurons. Using deep and wide neural networks, however, places great demands on computational power during training.

Most machine learning and signal processing techniques have exploited shallow structured architectures, which typically contain at most one or two layers of nonlinear feature transformations. Examples of shallow architectures include Gaussian mixture models (GMMs), support vector machines (SVMs), logistic regression, kernel regression, and multilayer perceptrons (MLPs) with a single hidden layer, including extreme learning machines (ELMs). For instance, SVMs use a shallow linear pattern-separation model with one or zero feature-transformation layers. Shallow architectures have proven effective on many simple or well-constrained problems, but their limited modeling and representational power causes difficulties with more complicated real-world applications involving natural signals such as human speech, natural sound and language, and natural images and visual scenes.

Deep learning algorithms are contrasted with shallow learning algorithms by the number of parameterized transformations a signal encounters as it propagates from the input layer to the output layer. Here a parameterized transformation is a processing unit that has trainable parameters, such as weights and thresholds.
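Under that definition, depth can be counted as the number of stages with trainable parameters; the sketch below, using hypothetical layer records, makes the distinction between parameterized layers and parameter-free activations concrete:

```python
import numpy as np

def depth(layers):
    """Count parameterized transformations: stages that carry trainable weights."""
    return sum(1 for layer in layers if layer.get("weights") is not None)

# A toy three-stage stack: two parameterized layers plus a parameter-free activation.
layers = [
    {"name": "dense1", "weights": np.zeros((8, 4))},
    {"name": "relu",   "weights": None},           # activation: no trainable parameters
    {"name": "dense2", "weights": np.zeros((2, 8))},
]
print(depth(layers))  # 2
```

By this count, the toy stack has depth 2, even though the signal passes through three stages.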

For simplicity, we can think of DNNs as decision-making black boxes. They take an array of numbers (which can represent pixels, audio waveforms, or words), run a series of functions on that array, and produce one or more numbers as outputs. The outputs are usually a prediction of some property we’re trying to guess from the input, for example whether or not an image is a picture of a cat.

The functions that run inside the black box are controlled by the memory of the neural network: arrays of numbers known as weights that define how the inputs are combined and recombined to produce the results. Dealing with real-world problems like cat detection requires very complex functions, which means these arrays are very large, containing around 60 million numbers in the case of one recent computer vision network. The biggest obstacle to using neural networks has been figuring out how to set these massive arrays to values that do a good job of transforming the input signals into output predictions.

One of the theoretical properties of neural networks that has kept researchers working on them is that they should be teachable. It is fairly simple to show, on a small scale, how you can supply a series of example inputs and expected outputs and go through a mechanical process that takes the weights from initial random values to progressively better numbers that produce more accurate predictions.
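That mechanical process is gradient descent. A minimal sketch, assuming a one-weight "network" and a toy task of learning y = 2x, shows a random initial weight improving step by step:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy supervised task: example inputs with expected outputs y = 2x.
xs = rng.normal(size=32)
ys = 2.0 * xs

w = rng.normal()                           # weight starts at a random value
for step in range(200):
    pred = w * xs
    grad = np.mean(2 * (pred - ys) * xs)   # gradient of mean squared error w.r.t. w
    w -= 0.1 * grad                        # nudge the weight toward better predictions

print(round(w, 3))  # converges close to 2.0
```

Back-propagation is the same idea applied layer by layer, computing each weight's gradient through the chain rule.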

**How do deep learning neural network architectures differ from "normal" neural networks?**
Deep learning neural network architectures differ from "normal" neural networks because they have more hidden layers, and they can be trained in an UNSUPERVISED or SUPERVISED manner for both UNSUPERVISED and SUPERVISED learning tasks. Moreover, people often train a deep network in an unsupervised manner before training it in a supervised manner.

**Why so many layers in DNN?**
Deep learning works because of the architecture of the network AND the optimization routine applied to that architecture. The network is a directed graph, meaning that each hidden unit is connected to many other hidden units below it. Each hidden layer going further into the network is therefore a NON-LINEAR combination of the layers below it, because of all the combining and recombining of the outputs from the previous units through their activation functions. When the OPTIMIZATION routine is applied to the network, each hidden layer becomes an OPTIMALLY WEIGHTED, NON-LINEAR combination of the layer below it. When each successive hidden layer has fewer units than the one below it, each hidden layer also becomes a LOWER-DIMENSIONAL PROJECTION of the layer below it. So the information from the layer below is nicely summarized by a NON-LINEAR, OPTIMALLY WEIGHTED, LOWER-DIMENSIONAL PROJECTION in each subsequent layer of the deep network.
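The shrinking-layer idea can be sketched in plain NumPy; the layer widths (64 → 32 → 16 → 8) and tanh activations below are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical layer widths that shrink at each stage.
widths = [64, 32, 16, 8]
x = rng.normal(size=widths[0])

for n_in, n_out in zip(widths, widths[1:]):
    W = rng.normal(size=(n_out, n_in)) / np.sqrt(n_in)  # untrained weights for the sketch
    x = np.tanh(W @ x)   # nonlinear, lower-dimensional projection of the layer below
    print(x.shape)       # (32,), then (16,), then (8,)
```

After optimization, each `W` would be the "optimally weighted" part; here the weights are random, so only the shape of the computation is shown.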

**Problems with deep neural networks**
If DNNs are naively trained, many issues can arise. Two common issues are overfitting and computation time.

DNNs are prone to overfitting because the added layers of abstraction allow them to model rare dependencies in the training data. Regularization methods can be applied during training to help combat overfitting. A more recent regularization method applied to DNNs is dropout: some number of units are randomly omitted from the hidden layers during training, which helps break up the rare dependencies that can occur in the training data.
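A minimal sketch of dropout, assuming the common "inverted" variant where `p` is the drop rate and survivors are rescaled so the expected activation is unchanged:

```python
import numpy as np

rng = np.random.default_rng(3)

def dropout(h, p=0.5, training=True):
    """Inverted dropout: randomly zero units during training, rescale the rest."""
    if not training:
        return h                           # at test time the layer is left untouched
    mask = rng.random(h.shape) >= p        # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)            # rescale so the expected value matches

h = np.ones(10)
print(dropout(h, p=0.5))           # roughly half the units zeroed, survivors scaled to 2.0
print(dropout(h, training=False))  # unchanged
```

Because a different random subset of units is dropped on every training step, no single unit can rely on a rare co-occurrence of other units being present.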

There are many training parameters to be considered with a DNN, such as the size (number of layers and number of units per layer), the learning rate, and the initial weights. Sweeping through the parameter space for optimal values may not be feasible due to the cost in time and computational resources.
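The cost grows multiplicatively with each parameter swept; a small sketch with hypothetical grid values shows why an exhaustive sweep quickly becomes infeasible:

```python
import itertools

# Hypothetical hyperparameter grid; each entry multiplies the number of runs.
grid = {
    "n_layers":        [2, 4, 8],
    "units_per_layer": [64, 256, 1024],
    "learning_rate":   [1e-2, 1e-3, 1e-4],
}
runs = list(itertools.product(*grid.values()))
print(len(runs))  # 27 full training runs for just three values per parameter
```

Adding one more three-valued parameter (say, the weight-initialization scale) would triple this to 81 runs, each of which may take hours or days of training.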
