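gradient can be obtained by differentiating $E$ from Equation (4.2), as

$$
\begin{aligned}
\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \\
&= \sum_{d \in D} (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d) \\
&= \sum_{d \in D} (t_d - o_d)(-x_{id})
\end{aligned}
\tag{4.6}
$$

where $x_{id}$ denotes the single input component $x_i$ for training example $d$. We now have an equation that gives $\partial E / \partial w_i$ in terms of the linear unit inputs $x_{id}$, outputs $o_d$, and target values $t_d$ associated with the training examples. Substituting Equation (4.6) into Equation (4.5) yields the weight update rule for gradient descent

$$
\Delta w_i = \eta \sum_{d \in D} (t_d - o_d)\, x_{id} \tag{4.7}
$$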
To summarize, the gradient descent algorithm for training linear units is as follows: Pick an initial random weight vector. Apply the linear unit to all training examples, then compute $\Delta w_i$ for each weight according to Equation (4.7). Update each weight $w_i$ by adding $\Delta w_i$, then repeat this process. This algorithm is given in Table 4.1. Because the error surface contains only a single global minimum, this algorithm will converge to a weight vector with minimum error, regardless of whether the training examples are linearly separable, provided a sufficiently small learning rate $\eta$ is used. If $\eta$ is too large, the gradient descent search runs the risk of overstepping the minimum in the error surface rather than settling into it. For this reason, one common modification to the algorithm is to gradually reduce the value of $\eta$ as the number of gradient descent steps grows.
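The summarized procedure can be sketched in a few lines of Python. This is a minimal illustration, not the text's own implementation: the function names (`linear_output`, `gradient_descent`, `squared_error`), the fixed step count used as the termination condition, the initialization range, and the toy data set are all assumptions made for the example.

```python
import random


def linear_output(weights, x):
    """Output of a linear unit: the dot product of weights and inputs."""
    return sum(w * xi for w, xi in zip(weights, x))


def squared_error(weights, training_examples):
    """E(w) = 1/2 * sum over examples of (t - o)^2."""
    return 0.5 * sum((t - linear_output(weights, x)) ** 2
                     for x, t in training_examples)


def gradient_descent(training_examples, eta=0.05, n_steps=2000, seed=0):
    """Batch gradient descent for a single linear unit.

    Assumption: the termination condition is a fixed number of steps.
    """
    rng = random.Random(seed)
    n_inputs = len(training_examples[0][0])
    # Initialize each weight to a small random value (range is an assumption).
    weights = [rng.uniform(-0.05, 0.05) for _ in range(n_inputs)]
    for _ in range(n_steps):
        # Accumulate the weight changes over ALL training examples first.
        delta = [0.0] * n_inputs
        for x, t in training_examples:
            o = linear_output(weights, x)
            for i in range(n_inputs):
                delta[i] += eta * (t - o) * x[i]
        # Only then update each weight by adding its accumulated change.
        weights = [w + dw for w, dw in zip(weights, delta)]
    return weights


# Hypothetical toy data: x[0] = 1 is a constant bias input, and t = 1 + 2 * x[1].
examples = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0),
            ([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)]
weights = gradient_descent(examples, eta=0.05, n_steps=2000)
print(weights, squared_error(weights, examples))
```

On this toy data the weights approach the generating values (1 and 2). Raising `eta` to 0.2 on the same data makes each update overstep the minimum, so the error grows instead of shrinking, which is the risk described above.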
4.4.3.3 STOCHASTIC APPROXIMATION TO GRADIENT DESCENT
Gradient descent is an important general paradigm for learning. It is a strategy for searching through a large or infinite hypothesis space that can be applied whenever (1) the hypothesis space contains continuously parameterized hypotheses (e.g., the weights in a linear unit), and (2) the error can be differentiated with respect to these hypothesis parameters. The key practical difficulties in applying gradient descent are (1) converging to a local minimum can sometimes be quite slow (i.e., it can require many thousands of gradient descent steps), and (2) if there are multiple local minima in the error surface, then there is no guarantee that the procedure will find the global minimum.
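The second difficulty can be illustrated with a one-parameter example. The sketch below runs plain gradient descent on a hypothetical non-convex error function, $E(w) = w^4 - 3w^2 + w$, chosen only for illustration (it is not an error surface from the text). The function names, step size, and starting points are likewise assumptions.

```python
def error(w):
    """Hypothetical non-convex error surface with two minima."""
    return w ** 4 - 3 * w ** 2 + w


def error_gradient(w):
    """Derivative of the hypothetical error surface with respect to w."""
    return 4 * w ** 3 - 6 * w + 1


def descend(w, eta=0.01, n_steps=1000):
    """Plain gradient descent on a single parameter w."""
    for _ in range(n_steps):
        w = w - eta * error_gradient(w)
    return w


# Same procedure, two different starting points.
w_from_right = descend(2.0)   # settles in the shallower minimum near w = 1.13
w_from_left = descend(-2.0)   # settles in the deeper minimum near w = -1.30
print(w_from_right, error(w_from_right))
print(w_from_left, error(w_from_left))
```

Both runs converge, but only the run started at $w = -2$ reaches the lower of the two minima. Which minimum is found depends entirely on the starting point.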
CHAPTER 4 ARTIFICIAL NEURAL NETWORKS
Each training example is a pair of the form $\langle \vec{x}, t \rangle$, where $\vec{x}$ is the vector of input values, and $t$ is the target output value. $\eta$ is the learning rate (e.g., 0.05).

• Initialize each $w_i$ to some small random value
• Until the termination condition is met, Do
    • Initialize each $\Delta w_i$ to zero
    • For each $\langle \vec{x}, t \rangle$ in training_examples, Do
        • Input the instance $\vec{x}$ to the unit and compute the output $o$
        • For each linear unit weight $w_i$, Do