Adaptive initial step size selection for Simultaneous Perturbation Stochastic Approximation

In this section, we compare the original SPSA and our modified SPSA, as described
in Algorithm 2, using ten analytical test functions and a parameter estimation example
for a nonlinear dynamic system.

Test functions

To see the effect of the new adaptive stepping algorithm in SPSA, the minimum points
of ten different mathematical test functions were sought. Except for the Griewank function,
the following conditions were applied. The functions’ responses were minimized from
arbitrary starting points drawn from the D-dimensional product space with lower bound -2 and upper bound 2 in each dimension. If the parameter vector exceeded the bounds in any of its D dimensions, that coordinate was replaced by -10 if it was less than -10, or by 10
if it was larger than 10. For the Griewank function, the optimization was started from random
points in its domain; any coordinate below -600 was replaced by -600, and any coordinate
above 600 was replaced by 600. For all ten functions, the iteration was stopped
when 2000 evaluations of the objective function were reached. For convenience, we
label our proposed algorithm “A_SPSA” and the standard SPSA “SPSA”.

The optimizations for each of the ten objective functions were started from 20 different
starting points. After the 2000 function evaluations, the distributions of the objective values
were plotted with respect to the initial step size a. Eleven different values of a (with a larger maximum for the Griewank function) were used to make the plot. The dimension of the functions was set
to 20, i.e. D = 20.
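The clamping and multi-start protocol described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the variable names and the use of NumPy are our assumptions.

```python
import numpy as np

# Sketch of the evaluation protocol: 20 random starting points in
# [-2, 2]^D with D = 20, and any coordinate that leaves [-10, 10]
# replaced by the nearest bound (±600 for Griewank).
D = 20                      # problem dimension used in the experiments
LOWER, UPPER = -10.0, 10.0  # clamping bounds for most test functions

def clamp(theta):
    """Replace any coordinate outside [LOWER, UPPER] by the nearest bound."""
    return np.clip(theta, LOWER, UPPER)

rng = np.random.default_rng(0)          # illustrative seed
starts = rng.uniform(-2.0, 2.0, size=(20, D))  # 20 starting points
```

Each SPSA variant would then be run from every row of `starts` until 2000 objective evaluations are spent, clamping the iterate after every update.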

The definitions of the ten functions are given in the following. The Rosenbrock function
is described as

f(\theta) = \sum_{i=1}^{D-1} \left[ 100\,(\theta_{i+1} - \theta_i^2)^2 + (1 - \theta_i)^2 \right]    (7)

The Sphere function is described as

f(\theta) = \sum_{i=1}^{D} \theta_i^2    (8)

The Schwefel function is described as

(9)

The Rastrigin function is described as

f(\theta) = 10D + \sum_{i=1}^{D} \left[ \theta_i^2 - 10\cos(2\pi\theta_i) \right]    (10)

The Skewed Quartic function (Spall 2003, ex. 6.6) is described as

f(\theta) = \theta^\top B^\top B \theta + 0.1 \sum_{i=1}^{D} (B\theta)_i^3 + 0.01 \sum_{i=1}^{D} (B\theta)_i^4    (11)

where the matrix B in the Skewed Quartic function is a square matrix whose upper triangular elements
are set to 1 and whose lower triangular elements are set to zero. The Griewank function is described
as

f(\theta) = 1 + \frac{1}{4000} \sum_{i=1}^{D} \theta_i^2 - \prod_{i=1}^{D} \cos\!\left( \frac{\theta_i}{\sqrt{i}} \right)    (12)

The Ackley function is described as

f(\theta) = -20 \exp\!\left( -0.2 \sqrt{ \frac{1}{D} \sum_{i=1}^{D} \theta_i^2 } \right) - \exp\!\left( \frac{1}{D} \sum_{i=1}^{D} \cos(2\pi\theta_i) \right) + 20 + e    (13)

The Manevich function is described as

(14)

The Ellipsoid function is described as

(15)

The Rotated Ellipsoid function is described as

f(\theta) = \sum_{i=1}^{D} \left( \sum_{j=1}^{i} \theta_j \right)^2    (16)
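For reference, the textbook forms of a few of the benchmarks above can be written compactly in NumPy. These are the standard definitions of the named functions; the paper's exact parameterizations may differ slightly.

```python
import numpy as np

def sphere(x):
    """Sum of squares."""
    return float(np.sum(x**2))

def rosenbrock(x):
    """Classic banana-valley function over consecutive coordinate pairs."""
    return float(np.sum(100.0 * (x[1:] - x[:-1]**2)**2 + (1.0 - x[:-1])**2))

def rastrigin(x):
    """Highly multimodal cosine-modulated quadratic."""
    return float(10.0 * x.size + np.sum(x**2 - 10.0 * np.cos(2 * np.pi * x)))

def ackley(x):
    """Nearly flat outer region with a deep central basin."""
    d = x.size
    return float(-20.0 * np.exp(-0.2 * np.sqrt(np.sum(x**2) / d))
                 - np.exp(np.sum(np.cos(2 * np.pi * x)) / d) + 20.0 + np.e)

def griewank(x):
    """Quadratic bowl with an oscillatory product term; wide domain."""
    i = np.arange(1, x.size + 1)
    return float(1.0 + np.sum(x**2) / 4000.0
                 - np.prod(np.cos(x / np.sqrt(i))))
```

All five have their global minimum value 0 at the origin, except Rosenbrock, whose minimum is at the all-ones vector.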

Each of Figs. 2, 3, 4, 5, 6, 7, 8, 9, 10 and 11 shows three different cases of noisy measurements of the outputs. Subfigures (a)
have no noise added; subfigures (b) and (c) have Gaussian noise added to the true
output with standard deviations of 0.1 and 1.0, respectively. The same settings were used
for all three noise levels of the ten functions.

A general trend observed from the figures is that when the initial step size is large,
the original SPSA tends to diverge to large objective values. The SPSA with the proposed
initial step size reduction, on the other hand, effectively mitigates this divergence
problem, producing smaller objective values in general as the (a priori) initial step
size is increased. This is because, if the two function evaluations in an iteration
are not smaller than the best value found so far, the algorithm reduces the step size (by halving a) and restarts from the point that gave the smallest output in the history of iterations. However,
note that the iteration index k in a_k and c_k is not reinitialized. For the ten functions tested, A_SPSA achieved its best performance
when a was close to 10 (or 100 for the Griewank function). This indicates that one can simply
set a close to the magnitude of the difference between the upper and lower bounds of the parameter
in consideration. This does not guarantee the best results, but doing so does
not cause the optimization to diverge to large responses, and the results achieved
are not substantially worse than the cases with the best settings for a.
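The halve-and-restart rule just described can be sketched as follows. This is an illustrative reimplementation, not the authors' exact Algorithm 1/2; the gain sequences a_k = a/(k+1+A)^α and c_k = c/(k+1)^γ with the usual SPSA default exponents are our assumption.

```python
import numpy as np

def a_spsa(f, theta0, a=1.0, c=0.1, A=10.0, alpha=0.602, gamma=0.101,
           max_evals=2000, rng=None):
    """SPSA with the adaptive halving of the step-size coefficient a."""
    rng = np.random.default_rng(0) if rng is None else rng
    theta = np.asarray(theta0, dtype=float).copy()
    best_theta, best_y = theta.copy(), f(theta)
    evals, k = 1, 0
    while evals + 2 <= max_evals:
        ak = a / (k + 1 + A) ** alpha          # step-size sequence
        ck = c / (k + 1) ** gamma              # perturbation sequence
        delta = rng.choice([-1.0, 1.0], size=theta.size)  # Rademacher
        tp, tm = theta + ck * delta, theta - ck * delta
        yp, ym = f(tp), f(tm)
        evals += 2
        if min(yp, ym) >= best_y:
            # Neither probe improved on the best point seen so far:
            # halve the step-size coefficient and restart from that
            # best point. The index k is NOT reset.
            a *= 0.5
            theta = best_theta.copy()
        else:
            # Standard simultaneous-perturbation gradient estimate.
            theta = theta - ak * (yp - ym) / (2.0 * ck * delta)
            if yp < best_y:
                best_theta, best_y = tp.copy(), yp
            if ym < best_y:
                best_theta, best_y = tm.copy(), ym
        k += 1
    return best_theta, best_y

# Demo on a simple quadratic bowl starting from the all-ones vector.
theta_hat, y_hat = a_spsa(lambda t: float(np.sum(t**2)), np.ones(5))
```

When no halving is ever triggered, the loop above reduces to plain SPSA, which is why the two algorithms coincide for small initial step sizes.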

As mentioned earlier, the value of c is important when the measurements of y contain noise. Figure 12 shows how the choice of c affects the outcome of the optimizations. The figure shows the case of the 20-dimensional
Sphere function with Gaussian noise. Among the three values of c tested, namely 0.01, 0.1 and 1.0, one setting gave the best results for A_SPSA, while at another, A_SPSA showed little improvement in the objective value regardless of the magnitude of a. This is caused by a becoming prematurely too small in the divergent early iterations. On the other hand,
the standard SPSA showed a good reduction at two of the settings, one of them being c = 1.0. This implies that for A_SPSA, the range of good c values can be narrower than that of the standard SPSA. On the other hand, the choice of
the initial step size (and therefore a) is much easier for A_SPSA. We can, for example, set it to min(U − L), where min(U − L) is the minimum difference between the upper and lower bounds of the domain of the parameter
vector. In practice, it is better to scale all the input dimensions to fall in similar or
equal intervals.

Figure 13 shows the results of optimizing the Rosenbrock and Rastrigin functions using three
different values of the multiplication factor of a: 0.1, 0.5, and 0.9. The difference in the multiplication factor does not change the general
trend that a larger a produces better results and that divergence does not occur. One could tune the value
of the multiplication factor, but the default value of 0.5 that we showed in Algorithm
1 generally produces satisfactory results compared to other values of the multiplication
factor between 0 and 1. Fig. 13(b) also shows that the largest a may not be an optimal setting, since a smaller value is shown to produce better optimization results when the reduction factor is slow at
0.9. This implies that in a bumpy (highly multimodal) function like Rastrigin, a
slow decrease in a can adversely affect the minimization of the objective value through a large number of
resets to the best point found so far. The opposite is true for the Rosenbrock function in (a), in which the slow reduction
factor 0.9 gave the best result.

For all the mathematical functions tested in this paper, optimization using SPSA diverges
almost surely if the initial step size is large. However, A_SPSA and SPSA give closely matching results when the initial
step sizes are relatively small (i.e., the left-hand side of the plots in Figs. 2, 3, 4, 5, 6, 7, 8, 9, 10 and 11). This is because, in cases where divergence does not happen, the adaptation of a does not take place in A_SPSA, and therefore SPSA and A_SPSA have identical behavior.
This confirms that Algorithm 1 does not alter, in any significant way, the
finite-sample convergence characteristics of the original SPSA when divergence
does not manifest.

Fig. 2. Initial parameter change and distribution of responses after 2000 function evaluations for “Rosenbrock”. a No noise, b σ = 0.10, c σ = 1.0

Fig. 3. Initial parameter change and distribution of responses after 2000 function evaluations for “Sphere”. a No noise, b σ = 0.10, c σ = 1.0

Fig. 4. Initial parameter change and distribution of responses after 2000 function evaluations for “Schwefel”. a No noise, b σ = 0.10, c σ = 1.0

Fig. 5. Initial parameter change and distribution of responses after 2000 function evaluations for “Rastrigin”. a No noise, b σ = 0.10, c σ = 1.0

Fig. 6. Initial parameter change and distribution of responses after 2000 function evaluations for “Skewed Quartic”. a No noise, b σ = 0.10, c σ = 1.0

Fig. 7. Initial parameter change and distribution of responses after 2000 function evaluations for “Griewank”. a No noise, b σ = 0.10, c σ = 1.0

Fig. 8. Initial parameter change and distribution of responses after 2000 function evaluations for “Ackley”. a No noise, b σ = 0.10, c σ = 1.0

Fig. 9. Initial parameter change and distribution of responses after 2000 function evaluations for “Manevich”. a No noise, b σ = 0.10, c σ = 1.0

Fig. 10. Initial parameter change and distribution of responses after 2000 function evaluations for “Ellipsoid”. a No noise, b σ = 0.10, c σ = 1.0

Fig. 11. Initial parameter change and distribution of responses after 2000 function evaluations for “Rotated Ellipsoid”. a No noise, b σ = 0.10, c σ = 1.0

Fig. 12. Effect of the choice of c on the final response of “Sphere” with Gaussian noise after 2000 function evaluations. a c = 0.01, b c = 0.10, c c = 1.00

Fig. 13. Effect of the choice of the reduction factor of a on the responses after 2000 function evaluations. a Rosenbrock (no noise), b Rastrigin (no noise)

Nonlinear dynamics example

We consider a parameter estimation problem for the Lorenz attractor. Its nonlinear dynamics
are described as

\dot{x} = s (y - x)    (17)

\dot{y} = x (r - z) - y    (18)

\dot{z} = x y - b z    (19)

We seek to identify the system parameters s, r, and b by minimizing the one-time-step-ahead prediction error of the state given the current state. We use the fourth-order Runge–Kutta method to obtain the one-step-ahead prediction.

Let us denote by \hat{x}(t_{k+1}) the one-time-step-ahead prediction given by the estimated system with parameters \theta. Then, we can define the prediction error as

e_k(\theta) = \left\| x(t_{k+1}) - \hat{x}(t_{k+1}) \right\|^2    (20)

Thus, the optimization to be solved is

\min_{\theta} e_k(\theta)    (21)

The index k above is the same as the index k in the SPSA algorithms, so the SPSA iteration proceeds along with the time steps
of the dynamic system to compute the prediction error.
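The one-step-ahead prediction error can be sketched as below. The parameter names s, r, and b follow the text; the classical RK4 step, the squared-norm form of the error, and the concrete numeric values used in the demo are our assumptions, not the paper's settings.

```python
import numpy as np

def lorenz(state, s, r, b):
    """Right-hand side of the Lorenz system (Eqs. 17-19)."""
    x, y, z = state
    return np.array([s * (y - x), x * (r - z) - y, x * y - b * z])

def rk4_step(state, dt, s, r, b):
    """One classical fourth-order Runge-Kutta step of length dt."""
    k1 = lorenz(state, s, r, b)
    k2 = lorenz(state + 0.5 * dt * k1, s, r, b)
    k3 = lorenz(state + 0.5 * dt * k2, s, r, b)
    k4 = lorenz(state + dt * k3, s, r, b)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def prediction_error(theta, state_k, state_k1, dt):
    """Squared norm between the measured next state and the one-step
    prediction under candidate parameters theta = (s, r, b)."""
    pred = rk4_step(state_k, dt, *theta)
    return float(np.sum((state_k1 - pred) ** 2))

# Demo with example (not the paper's) parameter values and states.
theta_true = np.array([10.0, 28.0, 8.0 / 3.0])
state0 = np.array([1.0, 1.0, 1.0])
dt = 0.01
state1 = rk4_step(state0, dt, *theta_true)
```

An SPSA iterate would evaluate `prediction_error` at perturbed parameter vectors, advancing `state_k` by one time step per iteration as described above.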

We set the true parameters and pretend not to know them. We set the time increment, simulate from t = 0 to 20, and obtain the target states. We vary a, and at each value of a we run both A_SPSA and SPSA 20 times.

For this problem, we set the parameter space as a three-dimensional product space. The initial state is fixed, and the initial guess (starting point) of the parameter set is a random pick from the parameter space.

Figure 14 shows the box plots of the final prediction error when started from different values of a. The smallest median of the final error is obtained at an interior value of a for SPSA and at large values of a, including 1000, for A_SPSA. The best median of the final error obtained for A_SPSA is smaller than that of SPSA. However, both SPSA and A_SPSA had some runs that did not converge to the above-mentioned near-zero values even at these settings of a.

Again, for A_SPSA, the best settings were obtained when a was set to large values near the order of magnitude of the distance between the upper
and lower bounds of the domain, while for SPSA, the best a was at an interior value of the tested range.

Fig. 14. Initial parameter change and distribution of the final prediction error (after 8000 function evaluations)

Figure 15 shows the trajectory of the reference Lorenz attractor and the simulation of the
Lorenz attractor whose system parameters s, r, and b were successfully identified by A_SPSA. The time t runs from 0 to 20, starting from the same initial condition used in the identification.
The figure shows an excellent match.

Fig. 15. State evolution of the target and identified Lorenz attractor, t = 0 to 20

Figure 16 shows the box plots of the parameters estimated by A_SPSA and SPSA starting at their
best settings. The corresponding statistics are shown in Tables 1 and 2. The boxes appear collapsed into single horizontal lines at the medians since the ranges
between the first and third quartiles are very narrow. Some non-converging cases
are visible as dots in the figure. The figure and the tables show that the parameter
estimates are more consistent from run to run for A_SPSA than for SPSA, as A_SPSA
has narrower interquartile ranges.

Fig. 16. Distribution of the parameters identified by A_SPSA and SPSA. a A_SPSA with its best setting of a, b SPSA with its best setting of a

Table 1. Statistics of identified Lorenz attractor parameters by 20 SPSA runs at the best setting of a

Table 2. Statistics of identified Lorenz attractor parameters by 20 A_SPSA runs at the best setting of a