The development of the perceptron was an important step toward the goal of creating useful connectionist networks capable of learning complex relations between inputs and outputs. By the late 1950s, the connectionist community understood that further progress required a mathematically derived (and thus potentially more flexible and powerful) learning rule. In the early 1960s, Widrow and Hoff introduced the Delta Rule [also known as the Widrow-Hoff learning rule or the Least Mean Square (LMS) rule]. This rule is similar to the perceptron learning rule (McClelland & Rumelhart, 1988), but it has a mathematical utility and elegance missing from the perceptron and other early learning rules.
The Delta Rule uses the difference between target activation (i.e., the target output values) and obtained activation to drive learning. For reasons discussed below, the threshold activation function (as used in both the McCulloch-Pitts network and the perceptron) is dropped; instead, a linear sum of products is used to calculate the activation of the output neuron (alternative activation functions can also be applied). This is called a linear activation function: the output node's activation is simply the sum of the network's respective input/weight products. The strengths of the network connections (i.e., the values of the weights) are adjusted to reduce the difference between target and actual output activation (i.e., the error).
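As an illustration, here is a minimal sketch of this update in Python. The learning rate, epoch count, and AND-like training data are illustrative assumptions, not from the original text; the update itself moves each weight in proportion to the error times its input.

    import numpy as np

    def train_delta_rule(inputs, targets, lr=0.1, epochs=50):
        # One weight per input component; start from zero.
        weights = np.zeros(inputs.shape[1])
        for _ in range(epochs):
            for x, t in zip(inputs, targets):
                activation = np.dot(weights, x)  # linear activation: sum of input/weight products
                error = t - activation           # target activation minus obtained activation
                weights += lr * error * x        # Delta Rule (Widrow-Hoff / LMS) update
        return weights

    # Illustrative data: two inputs plus a constant bias input of 1.0,
    # with targets mimicking logical AND.
    X = np.array([[0, 0, 1.0], [0, 1, 1.0], [1, 0, 1.0], [1, 1, 1.0]])
    t = np.array([0.0, 0.0, 0.0, 1.0])
    print(train_delta_rule(X, t))

Note that no threshold is applied when computing the activation: the error is measured against the raw linear output, which is what makes the rule mathematically tractable.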
A set of data points is said to be linearly separable if the two classes can be divided by a straight line (or, in higher dimensions, a hyperplane). If no such line exists, the data points are said to be non-linearly separable. For example, the XOR function is not linearly separable: no straight line can separate its positive cases, (0,1) and (1,0), from its negative cases, (0,0) and (1,1).
Although the perceptron rule finds a successful weight vector when the training examples are linearly separable, it can fail to converge if the examples are not linearly separable.
A second training rule, called the delta rule, is designed to overcome this difficulty.
If the training examples are not linearly separable, the delta rule converges toward a best-fit approximation to the target concept.
The key idea behind the delta rule is to use gradient descent to search the hypothesis space of possible weight vectors to find the weights that best fit the training examples.
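To make this concrete, the following sketch performs batch gradient descent on the squared-error surface E(w) = 1/2 Σ (t − w·x)², again with illustrative data and hyperparameters that are assumptions rather than part of the original text. Because the XOR targets are not linearly separable, the weights cannot reach a perfect solution; instead they converge toward the best-fit (least-squares) approximation, which here outputs 0.5 for every input.

    import numpy as np

    def gradient_descent(X, t, lr=0.1, epochs=200):
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            outputs = X @ w            # linear outputs for every training example
            errors = t - outputs       # per-example error
            gradient = -X.T @ errors   # gradient of E(w) = 1/2 * sum of squared errors
            w -= lr * gradient         # step downhill on the error surface
        return w

    # Illustrative data: XOR targets, which are NOT linearly separable.
    X = np.array([[0, 0, 1.0], [0, 1, 1.0], [1, 0, 1.0], [1, 1, 1.0]])
    t = np.array([0.0, 1.0, 1.0, 0.0])
    w = gradient_descent(X, t)
    print("best-fit weights:", w)   # approaches (0, 0, 0.5)
    print("outputs:", X @ w)        # approaches 0.5 for every input

Unlike the per-example updates above, this version sums the gradient over all training examples before taking a step; both variants descend the same error surface.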
This rule is important because gradient descent provides the basis for the BACKPROPAGATION algorithm, which can learn networks with many interconnected units.