Abstract
We review the methods and applications of automatic differentiation, a research and development activity which has evolved in various computational fields since the mid 1950's. Starting from very simple basic principles that are familiar from school, one arrives at various theoretical and practical challenges. The resulting activity encompasses mathematical research and software development; it is now often referred to as algorithmic differentiation. From a geometrical and algebraic point of view, differentiation amounts to linearization, a concept that naturally extends to infinite dimensional spaces. In contrast to other surveys, we emphasize this interpretation, as it has become more important recently and also facilitates the treatment of nonsmooth problems by piecewise linearization.
Keywords: Jacobians; Taylor expansions; piecewise linearization.
1 INTRODUCTION AND MOTIVATION
1.1 Functions in their prime
Mathematicians are fond of writing statements like F ∈ C^{1,1}(Ω). This shorthand means that the mapping or function F : Ω ⊆ X → Y has a locally Lipschitz-continuous derivative on the domain Ω ⊆ X, with X and the range Y being linear spaces. Moreover, the derivative F'(x) at some fixed point x is then interpreted as a linear mapping between X and Y. This abstract notion seems to be very far removed from the differentiation taught in school, where one learns for example that
Here we have a very constructive way of computing the derivative function F'(x), which can then be easily evaluated at any given point x. In contrast, attaching a prime as a superscript to F in order to obtain F'(x) is little more than the declaration that such a so-called Fréchet derivative exists. Only in rather rare circumstances can it be written down in terms of a procedure for evaluating it. Before discussing a more constructive approach let us first reconcile the two notions in terms of linearization.
Throughout the paper we will denote by y = f(x) real-valued functions, so that Y = ℝ, whereas y = F(x) denotes vector-valued functions, usually with range Y = ℝᵐ, the m-dimensional Euclidean space. The domain will always be X = ℝⁿ, i.e., the independent variables form a vector x = (x_j)_{j=1...n} with x_j ∈ ℝ.
1.2 Drawing the line
Suppose that for a given point ẋ we wish to estimate the value of F(ẋ + Δx) for small increments Δx ∈ X that belong to the same linear space as ẋ. Then local Lipschitz-continuous differentiability of F at ẋ is equivalent to the first order Taylor expansion
Here ‖Δx‖ is a norm that measures the size of elements Δx ∈ X, and O(ρᵖ) denotes a function of 0 ≤ ρ ∈ ℝ that is bounded by cρᵖ for some constant c ≥ 0. With ‖·‖ simultaneously denoting a norm on the range space Y we can write more precisely
In other words, the original function F(x) is approximated by its linearization F(ẋ) + F'(ẋ)Δx up to the quadratic error term c‖Δx‖². For the case when X = ℝ = Y, so that x, Δx and the function values are real numbers, we may have the situation depicted in Figure 1.
Here f(ẋ) + f'(ẋ)Δx represents a straight line with the slope f'(ẋ) ∈ ℝ and the fixed reference value f(ẋ). Obviously the tangent line provides useful information about the behavior of the function f(x) for x near ẋ. For example, it is well understood that ẋ can only be a minimum of f(x) if f'(ẋ) = 0. Here the derivative serves the role of verifying whether a certain optimality criterion is satisfied. If it is not, one can go downhill by stepping from a current point ẋ to a next point given by
Here the step multiplier α needs to be chosen positive and sufficiently small to ensure progress towards a minimum. This so-called descent approach is the core method of optimization and also works when n > 1 with ƒ'(ẋ) the gradient vector as discussed below.
1.3 Newton's method and Jacobians
When n = 1 = m, the slope F'(ẋ) can also be used in Newton's method where, given a current point ẋ, a new point is computed as
For vector functions F with n = m one uses instead the formulation
which is equivalent to (1.4) in the scalar case m =1 =n.
For general m = n > 1 the computation of the new iterate requires the solution of a linear system of equations, which can be done for example by Gaussian elimination with pivoting. The key assumption for that is that the first derivative of y = F(x) is available as the so-called Jacobian matrix
where the partial derivatives ∂F_i/∂x_j = ∂y_i/∂x_j are evaluated at the current point ẋ. In the scalar case m = 1 = n and for rather simple vector functions F one may sometimes be able to derive F' = f' by hand. However, in general that is a very tedious and error-prone task best left to a computer.
Evaluating Jacobians of general dimensions m × n with minimal, or at least reasonable, computational effort is the core task of automatic differentiation. Trying to absolutely minimize the number of arithmetic operations is impractical because it leads to a combinatorial problem that is very difficult, more specifically NP-hard.
Before we go on let us offer a few apologetic remarks to true mathematicians, who wish to have things defined precisely. Throughout the paper we will restrict our consideration to the finite dimensional spaces X = ℝⁿ and Y = ℝᵐ, and correspondingly identify the Fréchet derivative F'(x) with its matrix representation in terms of the Cartesian bases of X and Y. Moreover, we will always interpret differentiability in the sense of local Lipschitz-continuous differentiability, so that there are no order o(ρ) terms at all. Thus we can get by without any explicit limit process, but propagate first and higher order Taylor expansions forward and backward.
In any case, there is a sometimes confusing range of ways to denote derivatives, starting with the controversy between the Newton followers and the Leibniz camp in the early 18th century. Apart from Newton's prime superscript we will also use the Leibniz expression ∂v/∂x to denote the derivative of some scalar or vector quantity v with respect to the variable vector x ∈ ℝⁿ, and the dot notation v̇ to denote its directional derivative with respect to some given direction ẋ ∈ ℝⁿ.
1.4 Differentiation rules
Automatic differentiation is based on the classical rules expressing the derivatives of the composition of two functions F and G in terms of their individual derivatives, namely:
(i)
(ii)
(iii)
(iv)
(v)
Here it is assumed that F and G are once locally Lipschitz-continuously differentiable on their respective domains, which coincide except for the concatenation (ii) and the chain rule (iii). Then their composites will have the same differentiability properties. In the final section we will consider generalizations of these rules to the piecewise smooth scenario.
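For orientation, rules of this kind take the following standard forms; which rule carries which label (i)–(v) above is our assumption rather than something fixed by the text.

```latex
% Standard differentiation rules; the grouping below is one common convention.
\begin{align*}
(\alpha F + \beta G)'(x) &= \alpha F'(x) + \beta G'(x)        && \text{(linearity)}\\
(F,G)'(x) &= \bigl(F'(x),\, G'(x)\bigr)                       && \text{(concatenation)}\\
(F\circ G)'(x) &= F'\bigl(G(x)\bigr)\, G'(x)                  && \text{(chain rule)}\\
(F^{\top} G)'(x) &= G(x)^{\top} F'(x) + F(x)^{\top} G'(x)     && \text{(product rule)}\\
(1/f)'(x) &= -\,f'(x)/f(x)^2, \qquad f(x)\neq 0               && \text{(reciprocal)}
\end{align*}
```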
1.5 Composite function model
Every realistic computational model from the sciences, engineering, and economics is built up from simple building blocks. The specification typically involves several layers of abstraction, but we may think of it as a single computer program in an imperative language like Fortran or C, or in systems like Mathematica and Maple. By applying the above simple rules recursively we must eventually arrive at the derivatives of the elementary operations and functions that are part of the programming environment. In other words, we will assume that the function in question is evaluated by a sequence of ℓ ≫ 1 instructions
Here max (j,k) < i, each ◦ is an addition + or a multiplication *, and
is an elemental function from a given library Φ, which must include in particular the constant setting v = c, the reciprocal v = rec(u) = 1/u and typically the square root v = sqrt(u) = √u. Note that subtractions u − w = u + (−1)*w and divisions u/w = u*rec(w) can be performed as an addition or multiplication after multiplying the second argument by −1 or computing its reciprocal, respectively. Identifying the first v_k ≡ x_{k+n} for k = 1−n ... 0 with the independent variables and the last v_k ≡ y_{k−ℓ+m} for k = ℓ−m+1 ... ℓ with the dependents, we are left with the vector z = (v_k)_{k=1...ℓ−m} of intermediate quantities. The total set of variables is therefore given by
As an example, we may specify a vector function y = F(x) : ℝ3 → ℝ2 first by the formula
Then we may decompose the formula into the following sequence of elemental operations:
More generally, we will consider three-part function evaluation procedures of the form given in Table 2. Here the precedence relation j ≺ i indicates that the variable v_i depends directly on the variable v_j. This relation must be acyclic, so that each intermediate is unambiguously defined by known quantities, and we may order the v_i such that j ≺ i ⇒ j < i. In actual computer programs some of the intermediate quantities v_i will share the same memory location because they are not needed at the same time. For example, in the simple example above v_4 and v_5 may overwrite v_1 or v_3.
That makes no difference for the forward mode of differentiation, but poses a challenge to the reverse mode as we will see. For the time being we will stay with our single assignment assumption that each variable vi has its own memory location and occurs exactly once on the left-hand side of an instruction. For details see the standard reference [15].
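To make the three-part structure concrete, here is a small sketch in Python of such an evaluation trace; the function F(x1, x2, x3) = (x1·x2·sin(x3), exp(x1·x2)) and all variable names are invented for illustration and are not the example used in the tables.

```python
import math

def eval_trace(x1, x2, x3):
    """Hypothetical three-part evaluation procedure for
    F(x1, x2, x3) = (x1*x2*sin(x3), exp(x1*x2));
    this function is an invented stand-in, not the paper's example."""
    # (1) copy the independents:  v_{1-n}, ..., v_0
    v_m2, v_m1, v_0 = x1, x2, x3
    # (2) elemental operations v_i = phi_i(v_j, j < i)
    v_1 = v_m2 * v_m1          # x1 * x2
    v_2 = math.sin(v_0)        # sin(x3)
    v_3 = v_1 * v_2            # x1*x2*sin(x3)
    v_4 = math.exp(v_1)        # exp(x1*x2)
    # (3) extract the dependents: y_1, ..., y_m
    return v_3, v_4

print(eval_trace(1.0, 2.0, 0.5))
```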
1.6 Cost of elemental derivatives
Our key assumption is that all elemental functions φi () are at least once Lipschitz continuously differentiable on some neighborhood Di of their current argument
Here ni, the number of arguments, is equal to 1 for unary nonlinear functions and to 2 for binary arithmetic operations. Other elementary functions like atan2(·) or, for example, basic linear algebra subroutines can be included. The gradient of the φi will be denoted by
Within the forward and reverse mode to be discussed below we will need to compute not so much the gradient φ'_i itself, but rather its inner product with a given vector u̇_i ∈ ℝ^{n_i}, or the incremental addition of v̄_i ∈ ℝ times φ'_i to a given vector ū_i ∈ ℝ^{n_i}. In other words, we need to compute
Given any finite library of elementary functions Φ and any reasonable measure of computational effort OPS we will assume the existence of a constant ω such that
More specifically, we assume that the cost of providing the gradient φ'_i(u_i) at the current u_i and multiplying it by the vector u̇_i ∈ ℝ^{n_i} from the right or by the scalar v̄_i ∈ ℝ from the left is at most ω times as expensive as evaluating φ_i(u_i) by itself. For example, in the case of a multiplication
we have φ'_i = (v_k, v_j) with u_i = (v_j, v_k), and the corresponding operations are
Thus we have in both cases two extra multiplications and either one or two extra additions. Hence, a reasonable value for ω would be 4, since multiplications and additions take about the same time on modern processors. The complexity growth factor of 4 also covers the extra data movements, namely four fetches and one store compared to two fetches and one store for the original operation v_i = φ_i(v_j, v_k). It just so happens that the multiplication operation is in some sense the worst case, in that ω = 4 is in fact a rather conservative estimate for the derivative complexities of the other elemental functions from the typical library Φ.
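Written out for the multiplication v_i = v_j * v_k, the tangent and adjoint operations just counted look as follows (a plain Python sketch with our own names):

```python
def mul_tangent(vj, vk, dvj, dvk):
    """Tangent operation for v = vj*vk: two multiplications, one addition."""
    v  = vj * vk
    dv = dvj * vk + vj * dvk      # v_dot = v_j_dot*v_k + v_j*v_k_dot
    return v, dv

def mul_adjoint(vj, vk, vbar, vjbar, vkbar):
    """Incremental adjoint operation for v = vj*vk:
    two multiplications, two additions."""
    vjbar += vbar * vk            # v_j_bar += v_bar * v_k
    vkbar += vbar * vj            # v_k_bar += v_bar * v_j
    return vjbar, vkbar
```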
2 THEORY OF FORWARD AND REVERSE
After these preparations we can now develop the basic modes of algorithmic differentiation.
2.1 The forward mode
Let us look at a straight line x(t) = x + ẋ · t in the domain ℝn. Then each corresponding intermediate value vi has a linearization
The local Lipschitz continuous differentiability follows from the chain rule, and the dot quantities v̇_i can be obtained from the ẋ_j = v̇_{j−n} by the following procedure.
In the little example above we obtain the extended procedure depicted in Table 4.
Notice that in differentiating the square root v_5, the reciprocal v_6 = 1/v_2, and the exponential v_7 = exp(v_3) we have reused the function values themselves to simplify the derivative calculation, avoiding a second root, a second division, and a second exponential, respectively. It is quite clear that the computational effort for propagating the directional derivatives v̇_i on top of the v_i is just a small multiple of that for propagating the v_i by themselves. More precisely, in terms of the complexity measure OPS we obtain by (1.9) for the cost of evaluating y = F(x) and ẏ = F'(x)ẋ the bound
In this estimate we have assumed that computational cost is essentially additive, which ignores for example delays due to the scheduling between the various subtasks and gains that might be made on a multicore machine by parallel executions of several threads. The upper bound of 1 + ω ≈ 5 is certainly rather pessimistic.
Moreover, it is also significantly worse than the penalty factor 2, which one obtains if one approximates ẏ by divided differences, i.e., utilizes the linearization
Here one only needs two values of F at neighboring points in order to estimate ẏ = F'(x) ẋ, but it is well understood that even for the optimal step multiplier ε half the number of significant digits is lost so that the accuracy is degraded and hard to predict.
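A minimal sketch of this comparison in Python may help: the dual-number class below (an assumption of ours, not a tool interface from the paper) propagates exact directional derivatives, while the one-sided divided difference with a near-optimal step loses roughly half the significant digits.

```python
import math

class Dual:
    """Pair (value, directional derivative) propagated through elementals."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    __rmul__ = __mul__

def sin(u):
    return Dual(math.sin(u.val), math.cos(u.val) * u.dot) if isinstance(u, Dual) \
           else math.sin(u)

def f(x1, x2):                 # accepts floats or Dual numbers alike
    return x1 * x2 + sin(x1)

x1, x2 = 1.3, 0.7
ad = f(Dual(x1, 1.0), Dual(x2, 0.0)).dot          # d f / d x1, exact up to round-off
h = 1e-8                                          # near-optimal step in double precision
fd = (f(x1 + h, x2) - f(x1, x2)) / h              # divided difference, ~8 digits
print(ad, fd, x2 + math.cos(x1))                  # analytic value for comparison
```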
If one wishes to compute the whole Jacobian, one may employ the procedure above with ẋ ranging over the n Cartesian basis vectors in ℝⁿ. The resulting complexity estimate is
This bound can be reduced in theory and practice if one executes Table 3 in vector mode, i.e., with the scalars v̇_i replaced by a vector of directional derivatives with respect to several directions ẋ, possibly even the whole gradient ∂v_i/∂x ∈ ℝⁿ. Then common intermediate quantities can be reused and the vector operations are likely to run quite fast for a moderate number n of independent variables. Moreover, if F'(x) never has more than p nonzeros in any one of its rows, one can define a seed matrix S ∈ ℝ^{n×p} such that the Jacobian B = F'(x)S ∈ ℝ^{m×p} of the reduced function F(x + Sz) with respect to z ∈ ℝᵖ contains enough information to reconstruct F'(x) from B. This technique of row compression is described in some detail in [15] and reduces the complexity growth to a multiple of p ≤ n rather than n.
2.2 The reverse mode
Recently, there has been steadily growing interest in a process called the adjoining of scientific and industrial codes. The concept of adjoints applies originally to algebraic and differential equations on function spaces. Its discrete analog is what is called the reverse mode of differentiation. Instead of propagating forward the dot quantities v̇_i, which represent sensitivities of intermediates with respect to the independent variables, the reverse mode propagates backwards the bar quantities
In other words, with ȳ ∈ ℝᵐ a given weight vector we consider the sensitivities of the inner product ȳᵀy with respect to variations in the intermediate v_i, which may be due to round-off. For v_{j−n} = x_j the adjoint quantities v̄_{j−n} = x̄_j for j = 1 ... n form the gradient
When F = f is scalar valued, so that m = 1, we may set ȳ = 1 ∈ ℝ and obtain directly the gradient x̄ᵀ = f'(x). What at first may seem just some notational manipulation turns out to be an exciting and fundamental result. Namely, one discovers that the transposed Jacobian-vector product x̄ = F'(x)ᵀȳ can be computed, just like ẏ = F'(x)ẋ, at a cost that is a small multiple of that for evaluating F itself, which is certainly impossible by differencing. More specifically, the adjoint quantities v̄_j and their combinations ū_i = (v̄_j)_{j≺i} ∈ ℝ^{n_i} can be propagated backwards by the reverse procedure listed in Table 5.
Here we use the implicit assumption that all v_i for −n < i ≤ ℓ − m have the values they were assigned in Table 2 and that the v̄_i for −n < i ≤ ℓ − m are initialized to zero.
As one can see, the adjoint statements ū_i += v̄_i · φ'_i(u_i) are executed in the opposite order to that in which the underlying statements v_i = φ_i(u_i) occurred in the original program. The same is true for the initialization of the v̄_{ℓ−i}, which are input variables, and the deinitialization of the x̄_j, which are output variables. Apparently the first author to write down this reverse procedure was Seppo Linnainmaa, who listed it in Fortran at the end of his Master's thesis [23], which is otherwise written in Finnish. He interpreted and used the quantities v̄_i to estimate the propagation of errors in complicated programs. For more information on the history of the reverse mode see [16].
The validity of Table 5 can be derived from the classical rules of differentiation using either directed acyclic graphs or matrix products [15]. It should be noted that Table 5 is, in contrast to Table 3, not a single assignment code. Starting from their initial value v̄_i = 0 for i ≤ ℓ − m, the adjoint quantities may be incremented repeatedly until they reach their final value, and then occur once on the right-hand side as pre-factor of φ'_i. This can be seen in the adjoint procedure of Table 6, which needs to follow our little example program of Table 1. There we see that the v̄_i for i = 3, 0, −2 are each incremented twice before they reach their final value and can then occur on the right-hand side.
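A compact sketch of the reverse mode in Python with an explicit tape may also help; the Tape class and driver names are ours, not any tool's API. The forward sweep records each elemental together with its local partials c_ij, and the reverse sweep executes the incremental adjoint statements in the opposite order, exactly as in Table 5.

```python
import math

class Tape:
    def __init__(self):
        self.vals, self.ops = [], []          # ops[i]: [(argument index, partial), ...]
    def new_var(self, val, parents=()):
        self.vals.append(val)
        self.ops.append(parents)
        return len(self.vals) - 1

def add(t, i, j):
    return t.new_var(t.vals[i] + t.vals[j], [(i, 1.0), (j, 1.0)])

def mul(t, i, j):
    return t.new_var(t.vals[i] * t.vals[j], [(i, t.vals[j]), (j, t.vals[i])])

def sin(t, i):
    return t.new_var(math.sin(t.vals[i]), [(i, math.cos(t.vals[i]))])

def reverse(t, out_index, ybar=1.0):
    """Backward sweep: propagate adjoints in reverse order of recording."""
    vbar = [0.0] * len(t.vals)
    vbar[out_index] = ybar
    for i in range(len(t.vals) - 1, -1, -1):
        for j, cij in t.ops[i]:
            vbar[j] += vbar[i] * cij          # u_bar += v_bar * partial
    return vbar

t = Tape()
x1 = t.new_var(1.3); x2 = t.new_var(0.7)      # independents
y  = add(t, mul(t, x1, x2), sin(t, x1))       # y = x1*x2 + sin(x1)
grad = reverse(t, y)
print(grad[x1], grad[x2])                     # x2 + cos(x1),  x1
```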
With respect to the computational cost we find by (1.9), in analogy to the forward mode, for x̄ = F'(x)ᵀȳ
Here we account for executing the forward sweep of Table 2 and the reverse sweep of Table 5 one after another. With ȳ ranging over all m Cartesian basis vectors of the range ℝᵐ we obtain the complete Jacobian at the cost
Obviously this operations count is advantageous compared to the forward mode if m < n, and we may also employ a corresponding vector mode with v̄_i a vector of adjoints for several weightings ȳ. In the scalar case m = 1 with f(x) = F(x), we have the striking result that
In other words, as Wolfe [36] observed in 1982, gradients can 'always' be computed at a small multiple of the cost of computing the underlying function, irrespective of n, the number of independent variables, which may be huge. Since m = 1, we may also interpret the scalars v̄_i as Lagrange multipliers of the defining relations v_i − φ_i(u_i) = 0 with respect to the single dependent y = v_ℓ viewed as objective function. This interpretation was used, amongst others, by the oceanographer Thacker in [32]. It might be used to identify critical and calm parts of an evaluation process, possibly suggesting certain simplifications, e.g. the local coarsening of meshes.
2.3 The Sedgewick-Speelpenning example
To highlight the properties of the reverse mode, let us consider a very simple example of variable dimension that was originally suggested by the late Arthur Sedgewick, the PhD supervisor of Bert Speelpenning at the University of Illinois at Urbana-Champaign. They considered a simple product and its gradient
Computing each gradient component independently would require n² − O(n) multiplications, but there are cheaper alternatives. Firstly, once f has been evaluated at the cost of n − 1 multiplications, the whole gradient g(x) = f'(x) can be obtained using n divisions. However, this approach is not entirely convincing, since divisions are much more expensive than multiplications and may lead to a NaN if some component x_j is zero. Now let us apply the reverse mode. Starting from the natural, forward evaluation loop
we obtain the combination of forward and reverse sweep listed in Table 7.
Here we have eliminated the assignments from the x_j to the v_{j−n}, and vice versa from the v̄_{j−n} to the x̄_j, for the sake of readability.
We could also have replaced the incremental assignments by direct assignments since no intermediate occurs more than once as an argument. That would save n zero initializations and the same number of additions.
Of course, the key observation is that this gradient procedure requires at most 4 times as many arithmetic operations as the evaluation of the function itself and involves no tests and branching. This effect was observed by Arthur S. who suggested to Bert S. rather daringly that something like that should be possible for much more general functions. That suggestion led Bert S. indeed to the reverse mode, which can be implemented in a completely automatic line by line fashion and yields gradients with optimal complexity, at least up to a small constant. The AD community only learnt in 2012 at a conference in Fort Collins about Arthur S. and usually simply refers to Speelpenning's example.
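As a concrete rendering of the combined sweeps just described, here is a sketch in Python (our own naming; the original procedures are those of Tables 1 and 7): the gradient of the product costs only O(n) multiplications and involves no divisions, tests or branches.

```python
def product_and_gradient(x):
    """f(x) = x[0]*x[1]*...*x[n-1] and its gradient by the reverse mode."""
    n = len(x)
    v = [0.0] * n                 # partial products v[i] = x[0]*...*x[i]
    v[0] = x[0]
    for i in range(1, n):         # forward sweep: n-1 multiplications
        v[i] = v[i - 1] * x[i]
    g = [0.0] * n
    vbar = 1.0                    # adjoint of the running product, y_bar = 1
    for i in range(n - 1, 0, -1): # reverse sweep: 2(n-1) multiplications
        g[i] = vbar * v[i - 1]    # d f / d x[i]
        vbar = vbar * x[i]        # adjoint of v[i-1]
    g[0] = vbar
    return v[n - 1], g

y, grad = product_and_gradient([2.0, 3.0, 4.0])
print(y, grad)                    # 24.0, [12.0, 8.0, 6.0]
```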
2.4 Storing the trajectory
When m and n are similar, the forward mode is still preferable because of the following memory issue. The forward mode as specified in Table 3 can be performed more or less in place, with the storage essentially doubled, as the scalars v̇_i ∈ ℝ must be stored in addition to the v_i. The dot quantities v̇_i and v̇_j can and should share the same memory location exactly when that is true for v_i and v_j. The results will still be consistent, in that v̇_i is the directional derivative of v_i, even if there were some unintended overwriting, for example through the aliasing of calling parameters.
The situation is radically different in the reverse mode. Here the quantity v_j may be used last in calculating some v_i with j ≺ i and then be overwritten by some v_k with k > i, all still during the forward sweep of Table 2. Then the old value of v_j will no longer be available when we have to perform the calculation ū_i += v̄_i · φ'_i(u_i) with v_j one of the components of u_i ∈ ℝ^{n_i}. If φ_i is nonlinear, the partials c_ij(u_i) cannot then be evaluated. Hence, the old value of v_j must be written onto a stack just before it is overwritten and then recuperated on the reverse sweep.
Alternatively, some AD implementations prefer to store the partials cij. For either strategy and even if linear operations are identified, we must expect that on average about one floating point number needs to be stored per elemental function. Consequently, the maximal memory requirement for the combined forward and reverse sweep will be much larger than that for the original evaluation of F in that
Here MEM denotes the memory requirement of the combined forward and reverse sweep, most of which can be held on a stack.
In the case of the Sedgewick-Speelpenning example, the function itself can be evaluated using just O(1) memory locations apart from the independents x_j for j = 1 ... n. More specifically, all partial products v_i can be stored in the same memory cell as y. However, for the sake of the reverse sweep the v_i must be kept around, either in an array or on some stack on a remote storage device.
Note that the memory estimate above applies to the vector and scalar cases m > 1 and m = 1 alike. Hence, from a memory point of view it is advantageous to propagate several adjoints simultaneously backward, for example in an optimization calculation with a handful of active constraints. Originally, memory usage was a big concern because memory size was severely limited. Today the issue is more the delay caused by large data movements from and to external storage devices, whose size seems almost unlimited. As already suggested by Bennett [1] and Ostrovskii et al. [29], the memory can be reduced by orders of magnitude through an appropriate compromise between storage and recomputation of intermediates, described as checkpointing in [15].
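A toy sketch (ours) of such a storage/recomputation compromise for a chain of identical steps: only the state entering every segment of length k is checkpointed on the forward sweep, and each segment is recomputed forward from its checkpoint before its adjoints are propagated backward. Realistic schemes such as binomial checkpointing [15] place the checkpoints far more cleverly; the step and step_adjoint callables are an assumed interface.

```python
import math

def adjoint_with_checkpoints(step, step_adjoint, x0, n_steps, k):
    """step: v -> v_next;  step_adjoint: (v, vbar_next) -> vbar.
    Returns the adjoint of the initial state while storing only ~n_steps/k states."""
    starts = list(range(0, n_steps, k))          # first index of each segment
    checkpoints = {}
    v = x0
    for i in range(n_steps):                     # forward sweep
        if i in starts:
            checkpoints[i] = v                   # keep the state entering the segment
        v = step(v)
    vbar = 1.0                                   # seed adjoint of the final state
    for s in reversed(starts):                   # reverse sweep, segment by segment
        end = min(s + k, n_steps)
        states = [checkpoints[s]]
        for i in range(s, end - 1):              # recompute the segment forward
            states.append(step(states[-1]))
        for i in range(end - 1, s - 1, -1):      # propagate adjoints backward
            vbar = step_adjoint(states[i - s], vbar)
    return vbar

# example: v_{i+1} = sin(v_i); the adjoint of one step multiplies by cos(v_i)
g = adjoint_with_checkpoints(math.sin, lambda v, b: math.cos(v) * b, 0.5, 10, 3)
print(g)    # derivative of the 10-fold composition of sin, evaluated at 0.5
```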
3 OTHER ASPECTS OF THE BASIC MODES
3.1 Gradients and adjoint dynamics
The cheapness of gradients is of great importance for nonlinear optimization, but still not widely understood, except in the time dependent context. There we may have, on the unit time interval 0 ≤ t ≤ 1, the primal dual pair of evolutions
Here the state v belongs to some Euclidean or Banach space and v̄ to its topological dual. Correspondingly, the right-hand side F(v) and its dual F'(v)ᵀv̄ may be strictly algebraic or involve differential operators.
Then it has been well understood since Pontryagin that the gradient of a scalar function y = f(v(1)) with respect to the initial point x is given by v̄(0). It can be computed at maximally ω = 2 times the computational effort of the forward calculation of v(t) by additionally integrating the second, linear evolution equation backward. In the simplest mode, without checkpointing, this requires the storage of the full trajectory v(t), unless the right-hand side F is largely linear. Also, for each t the adjoint state v̄(t) represents the sensitivity of the final value y = f with respect to perturbations of the primal state v(t).
Of course, the same observations apply to appropriate discretizations, which implies again the proportionality between the operations count of the forward sweep and memory need of the reverse sweep for the gradient calculation. To avoid the full trajectory storage one may keep only selected checkpoints during the forward sweep and then recuperate the primal trajectory in pieces on the way back, when the primal states are actually needed.
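As an illustration of the discrete analogue (our sketch, with explicit Euler chosen only for simplicity): the primal states are stored on the forward sweep, and the linear adjoint recursion v̄_k = (I + hF'(v_k))ᵀ v̄_{k+1} is integrated backward, yielding the gradient of y = f(v_N) with respect to the initial state.

```python
import numpy as np

def euler_with_adjoint(F, dFdv, v0, h, N, dfdvN):
    """Explicit Euler v_{k+1} = v_k + h*F(v_k) and its discrete adjoint.
    Returns the gradient of y = f(v_N) with respect to the initial state v0."""
    traj = [np.asarray(v0, dtype=float)]
    for _ in range(N):                            # forward sweep, trajectory stored
        v = traj[-1]
        traj.append(v + h * F(v))
    vbar = np.asarray(dfdvN, dtype=float)         # adjoint seed df/dv at the final state
    for k in range(N - 1, -1, -1):                # backward sweep of the linear adjoint
        vbar = vbar + h * dFdv(traj[k]).T @ vbar  # vbar_k = (I + h F'(v_k))^T vbar_{k+1}
    return vbar

# example with linear dynamics F(v) = A v and f(v) = sum(v):
# the exact gradient is ((I + h A)^N)^T @ [1, 1]
A = np.array([[0.0, 1.0], [-1.0, -0.1]])
g = euler_with_adjoint(lambda v: A @ v, lambda v: A, [1.0, 0.0], 0.01, 100, [1.0, 1.0])
print(g)
```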
In some sense the reverse mode is just a discrete analog of the maximum principle going back to Pontryagin. Naturally, the discretizations of dynamical systems have more structure than our general evaluation loop, but the key characteristics of the reverse mode are the same.
3.2 Numerical stability
We have demonstrated that, and how, the forward and reverse mode can be used to compute first order derivatives of functions given by evaluation procedures. Moreover, the computational complexity was found to be quite reasonable. Now the natural question is how numerically accurate the results are, especially in view of the fact that numerical analysts strongly believe differentiation to be an ill-conditioned process. However, this conviction is based on the notion that all we know about the function f : ℝⁿ → ℝ is an oracle that evaluates it with a certain absolute accuracy.
Under our assumptions we know a lot more about f(x), in that we have its decomposition into a sequence of elemental function evaluations v_i = φ_i(u_i). The key question is how accurately we can evaluate the φ_i and their derivatives on a given computing platform. The celebrated IEEE 754 standard prescribes in detail the computation of arithmetic operations, but only makes recommendations regarding the accuracy of special function evaluations.
As discussed in [33], we can reasonably assume that the values ṽ_i and the corresponding derivative values actually computed from ũ_i = (ṽ_j)_{j≺i} and its dot analogue by the forward differentiation procedure listed in Table 3 satisfy the relations
where the three perturbations are all of the order of the machine precision eps. While a forward error analysis would be, as usual, quite complicated, we can apply a backward error analysis showing that algorithmic differentiation yields the exact derivative values for a slightly perturbed problem.
To obtain the computed pair of value and derivative as the exact result, we define for each univariate elemental function φ_i(u_i) a modification φ̃_i given by
We contend that this artificially constructed φ̃_i may be considered a small perturbation of φ_i, given the machine precision eps. It is easy to check that it indeed satisfies exactly the relations
A very similar small perturbation can be applied to the binary operations and other elementary functions with several variables.
Hence, we have shown that the forward mode of algorithmic differentiation is indeed backward stable in the sense of Wilkinson [35]. In the same way we can also establish the backward stability of the reverse mode, but then there may be some O(eps) terms multiplied by the maximal number of times that any intermediate v_j belongs to some u_i, i.e., occurs as argument of some φ_i with j ≺ i. Nevertheless, this is a very satisfactory backward stability result, and in our by now twenty-five years of experience with AD we have never heard a user complain about poor accuracy of derivative values, once the logical bugs had been overcome.
3.3 Software and application
So far we have not really explained why AD became known as automatic differentiation. We have seen how the original evaluation procedure needs to be augmented with additional instructions to calculate the correct derivative values. In principle, this program transformation can be done by hand, but that is very time-consuming and error-prone. It is also hard to manage: if the underlying program is subject to continual change, then keeping track of these changes in order to maintain the integrity of the transformed version is also time consuming. Therefore, it is usually desirable to automate the transformation process at least partially. For a computer science oriented introduction to AD consult the excellent book [27]. Here we will just try to convey the basic ideas.
Conceptually, we may think of the AD process as a transformation of the form
Here we assume that the original subprogram eval(x, y, z) evaluates the output vector y as a function of the input vectors x and z. The inputs and outputs are marked by small arrows on top and bottom, respectively. Whereas z is considered a constant parameter, x is nominated as the vector of independent variables with respect to which y is to be differentiated. Consequently, a dot derivative object dx is associated with x as an additional input, and a corresponding output dy associated with the dependent variables y is also included as a parameter in the tangent procedure deval corresponding to Table 3. In the simplest case dx is the directional derivative ẋ and dy is the directional derivative ẏ described in the section on the forward mode. Then the format of the arrays dx and dy would be exactly the same as that of the underlying independent and dependent variables x and y, respectively. In vector mode dx and dy could contain several directional derivatives. The reverse mode function beval corresponding to Table 5 has the additional input parameter by and the new output parameter bx, which means that there is a reversal of the information flow. Again by and bx may correspond to the adjoint vectors ȳ and x̄ or be larger derivative objects of compatible format.
The program extension can be performed essentially line-by-line, although the reverse mode requires the trajectory transfer between the forward and reverse sweep. Also, at least on a subroutine level it might be worthwhile to perform some code analysis in order to improve the derivative code or even the original procedure. Traditionally, AD was developed for codes written in the classical procedural languages Fortran and C++. This was done using two computer science concepts employed in one guise or another, namely source transformation and operator overloading.
Source transformation requires an elaborate transformation tool that takes a certain computer code, analyzes it like a compiler, and then generates a new, extended source code. While the development and maintenance of such a tool is a significant effort for the developer, the user needs to do very little other than making his or her input code truly conform to the language standard. He or she also has to nominate the independent and dependent variables and select the mode of differentiation as well as the order of the desired derivatives. The only drawback for the user may be a certain lack of flexibility in terms of what orders and modes of differentiation the tool provides.
There may also be restrictions regarding recently added language features like for example in the continuous evolution of Fortran. Generally, there is the reasonable expectation that source transformation yields faster derivative codes, if only because of the application of standard compiler optimization. However, this advantage applies mainly to clearly structured numerical codes that also parallelize well. Recursions, pointers and indirect addressing can render static analysis at compile time rather ineffective.
Operator overloading tools consist mainly of a header file and a runtime library, which are both completely problem independent. The user has to retype all variables that are involved in the differentiation process to a new tool-specific type, called for example adouble in ADOL-C. The standard language compiler will then look up in the header file what needs to be done for all elemental operations and functions. So the extra derivative related instructions will only be included in the object file, but not in the source files. The user must also declare and initialize the independent and dependent variables in some tool-dependent way. When the code compiles without diagnostics and there are no runtime errors, one can be pretty sure that the derivative results are correct. For tools that provide the forward and reverse mode one can check the Lagrange identity
for random vectors ẋ and ȳ to see if the results are consistent up to numerical round-off. If they are not, something has gone wrong in the transformation process.
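The identity in question is ȳᵀ(F'(x)ẋ) = (F'(x)ᵀȳ)ᵀẋ, and a test along these lines can be coded in a few lines. In the sketch below (Python with numpy) the drivers jvp and vjp stand for the tool-provided tangent and adjoint routines; they are an assumed interface, not any tool's fixed API.

```python
import numpy as np

def check_lagrange(jvp, vjp, x, m, trials=5, tol=1e-8):
    """Check ybar^T (F'(x) xdot) == (F'(x)^T ybar)^T xdot for random vectors.
    jvp(x, xdot) -> F'(x) xdot   (forward mode driver, assumed interface)
    vjp(x, ybar) -> F'(x)^T ybar (reverse mode driver, assumed interface)"""
    rng = np.random.default_rng(0)
    for _ in range(trials):
        xdot = rng.standard_normal(len(x))
        ybar = rng.standard_normal(m)
        lhs = ybar @ jvp(x, xdot)
        rhs = vjp(x, ybar) @ xdot
        if abs(lhs - rhs) > tol * (abs(lhs) + abs(rhs) + 1.0):
            return False
    return True

# toy check with F(x) = (x0*x1, sin(x0)) and hand-coded Jacobian products
F_jac = lambda x: np.array([[x[1], x[0]], [np.cos(x[0]), 0.0]])
jvp = lambda x, xd: F_jac(x) @ xd
vjp = lambda x, yb: F_jac(x).T @ yb
print(check_lagrange(jvp, vjp, np.array([1.3, 0.7]), m=2))   # True
```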
Table 8 lists some of the tools in the order of their first appearance. Some of them are still available, some have metamorphosed into successor systems, and others have fallen by the wayside altogether. Of those still existent all except TAF/TAC are available under the GNU public license. The oldest one still under development with its original name and basic design is the overloading tool ADOL-C.
Table 8: Source transformation (C) and operator overloading (O) AD tools. The modes are forward (→), reverse (←), both (↔), and both with checkpointing (⇔).
There is some indication that more recent projects like Adept and the NAG-compiler integrated AD utilities can achieve a significant speed-up using expression templates and other modern language constructs. Unfortunately, no large commercial software vendor has ever invested significant resources to develop a professional computing environment with AD capabilities. The modeling languages AMPL, GAMS and also the optimization test environment CUTEr have provided AD capabilities under the hood for a long time.
There has also been a string of Matlab implementations, in particular ADiMat [4]. For these developments and the current state of AD tools in general one should consult the community website www.autodif.org.
Of course, there is also a close connection to the fully symbolic differentiation capabilities of computer algebra packages. A very exciting recent development has been the integration of AD capabilities into the problem solving environment FEniCS [24] for PDEs based on finite elements. See also the discussion by Korelc in [21].
Application studies with the major tools can be found in the proceedings of the workshops in Breckenridge 91 [12], Santa Fe 96 [2], Nice 00 [9], Chicago 04 [5], Bonn 08 [3], and Fort Collins 12 [10]. The sheer size of a computer model is no longer an objection to the successful application of AD. However, problems arise with multi-layer codes in different languages and possibly involving precompiled proprietary software. Unfortunately, there are, as yet, no agreed upon interfaces for the transfer of sensitivities from one part of a computer model to another. In particular the function signatures of derived subprograms like deval(x, dx, y, dy, z) suggested above will vary from tool to tool. Fortunately, terminology and notation in AD have been unified to a large extent, at least compared to the early days.
4 FURTHER DEVELOPMENTS
Many well established results, techniques and aspects of AD could not be elaborated in this survey paper. We have covered here only first derivative calculations in the forward and reverse mode. This is of central importance, but certainly not the whole game.
4.1 Higher derivatives
By combining the forward and reverse mode one obtains a method for evaluating second order adjoints of the form
Their cost is about ω² ≈ 16 times as much as that of evaluating F by itself. When ẋ ranges over the Cartesian basis vectors in ℝⁿ, one obtains the complete Hessian (ȳᵀF(x))'' ∈ ℝ^{n×n} at a cost of about nω² ≈ 16n times that of F. Such second order information is particularly useful in optimization.
One of the earliest applications of AD was the solution of ODEs by Taylor series methods [34]. Up to 30-50 terms are really no problem and may yield very accurate solution trajectories (see [6] and [30]). There have also been some preliminary studies concerning Differential Algebraic Equations (DAEs), where AD has in my view still a large untapped potential.
Using the fact that exp(), sin (), cos() and all the other intrinsic elemental functions are solutions of linear ODEs one can cheaply compute univariate Taylor expansions
Cheap means here that the effort only grows like O(d²) with respect to the degree d, and it could even be lowered to O(d log d) using FFT based polynomial arithmetic.
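As an illustration of why the effort stays at O(d²), the sketch below (ours) propagates Taylor coefficients through v = exp(u(t)) using the recurrence k v_k = Σ_{j=1}^{k} j u_j v_{k−j}, which follows from the linear ODE v' = u'v satisfied by the exponential.

```python
import math

def taylor_exp(u, d):
    """Taylor coefficients of v = exp(u(t)) up to degree d, given those of u.
    Uses k*v_k = sum_{j=1..k} j*u_j*v_{k-j}, derived from v' = u' v."""
    v = [0.0] * (d + 1)
    v[0] = math.exp(u[0])
    for k in range(1, d + 1):
        v[k] = sum(j * u[j] * v[k - j] for j in range(1, k + 1)) / k
    return v

# u(t) = t  (i.e. u_0 = 0, u_1 = 1): the coefficients of exp(t) are 1/k!
print(taylor_exp([0.0, 1.0, 0.0, 0.0, 0.0], 4))   # [1.0, 1.0, 0.5, 0.1667, 0.0417]
```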
Apparently that possibility has not been seriously utilized since the cross-over happens too late. Based on the coefficients of a carefully selected family of univariate Taylor expansions one may then also interpolate the entries of derivative tensors of arbitrary order [13]. All this happens in the forward mode and one may then also compute cheaply the gradients of these high order mixed partials with respect to a different set of parameters, e.g. in design optimization.
4.2 Combinatorial aspects
There is a natural tendency to consider differentiation as a mechanical process, and so far the only truly surprising thing that we have encountered is the reverse mode. In fact, there are many combinatorial aspects of AD that arise if one really wishes to minimize the computational effort.
Firstly, the forward and reverse mode are only the extreme variants of a general elimination approach on linearized computational graphs. Here one considers the variables v_i, and equivalently their indices i for i = 1 ... ℓ, as the vertices of a directed acyclic graph (DAG) with (j, i) an edge iff j ≺ i. Then the minimal and maximal vertices with respect to that partial ordering are exactly the independents v_{j−n} for j = 1 ... n and the dependents v_{ℓ−i} for i = 0 ... m − 1. The edges can be labeled with the elementary partials c_ij(u_i) evaluated at the fixed current point x. When the graph is bipartite, i.e., ℓ = m, the edge values c_ij represent exactly the nonzero entries of the Jacobian F'(x). Otherwise the ℓ − m intermediate vertices can be successively eliminated according to the chain rule in the style of sparse Gaussian elimination.
More specifically, as explained in [14], the DAG can be successively simplified by vertex, edge, or face eliminations until it becomes the bipartite representation of the Jacobian. The result of this accumulation or elimination process is, up to round-off, independent of the order in which the individual vertices, edges, or faces are eliminated, but the operations count and memory requirement can vary drastically. Not surprisingly, efforts to minimize either the temporal or spatial complexity lead to NP-hard combinatorial meta-problems [26]. We call them meta-problems because the underlying task of accumulating Jacobians from the elementary partials c_ij can be achieved in strongly polynomial time with respect to the problem size ℓ ≥ max(n, m).
A restricted version of the accumulation problem is that of finding, for a given sparsity pattern of F'(x), a seed matrix S ∈ ℝ^{n×p} with a minimal number of columns p ≤ n such that the compressed Jacobian F'(x)S allows the identification of all nonzero entries of F'(x). It has been known for a long time that the minimal p is the chromatic number of an associated column coincidence graph [7]. Using the reverse mode one may look for so-called adjoint seeds W ∈ ℝ^{q×m} with minimal q ≤ m such that the column compressed Jacobian W F'(x) reveals all nonzero entries of F'(x). Moreover, the two compressions can be combined to further lower the number of directional derivatives or adjoint vectors needed for computing F'(x) by a so-called Coleman-Verma partition [8]. Finally, there are special considerations and heuristic algorithms for the symmetric case when F'(x) is in fact a Hessian. Suitable colorings are provided by the fairly comprehensive package ColPack [11]. An alternative approach, called Newsam-Ramsdell seeding in [15], which absolutely minimizes p and/or q, is described in [19].
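A sketch (ours) of the compression just described: structurally orthogonal columns (sharing no nonzero row) are grouped greedily in Curtis-Powell-Reid style, each group contributes one seed column of S, and the compressed Jacobian B = F'(x)S is obtained from p tangent products. The jvp routine supplying F'(x)s is an assumed interface.

```python
import numpy as np

def greedy_column_groups(pattern):
    """pattern: boolean m-by-n array of the Jacobian sparsity.
    Greedily group structurally orthogonal columns."""
    m, n = pattern.shape
    groups, covered = [], []
    for j in range(n):
        for g, rows in zip(groups, covered):
            if not np.any(rows & pattern[:, j]):     # no shared nonzero row
                g.append(j); rows |= pattern[:, j]
                break
        else:
            groups.append([j]); covered.append(pattern[:, j].copy())
    return groups

def compressed_jacobian(jvp, x, pattern):
    """Recover the sparse Jacobian from p = #groups forward-mode products."""
    m, n = pattern.shape
    J = np.zeros((m, n))
    for group in greedy_column_groups(pattern):
        s = np.zeros(n); s[group] = 1.0              # one seed column per group
        col = jvp(x, s)                              # F'(x) s
        for j in group:
            J[pattern[:, j], j] = col[pattern[:, j]] # entries do not get mixed up
    return J

# toy example: tridiagonal sparsity, three column groups suffice
n = 6
pattern = np.abs(np.subtract.outer(np.arange(n), np.arange(n))) <= 1
J_true = pattern * 1.0
np.fill_diagonal(J_true, 2.0)
jvp = lambda x, s: J_true @ s                        # exact products for the test
print(np.allclose(compressed_jacobian(jvp, np.ones(n), pattern), J_true))  # True
```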
The full storage of the forward value trajectory can be avoided by checkpointing and similar store-recompute trade-offs at a finer grain. All these tasks are combinatorial in nature, and of course AD would also benefit from a host of other optimizations like expression simplification, instruction scheduling, and register allocation. It is well understood that these compiler tasks are NP-hard and can thus only be attacked by suitable heuristics. A particularly nice example is the pebble game, where one wishes to propagate adjoint values backward through the computational graph using only a fixed number of vertices in memory, smaller than ℓ, while minimizing the recomputations.
4.3 Generalized differentiation
In the first section we have noted that the evaluation of gradients and Jacobians is essential for the efficient numerical solution of unconstrained optimization and nonlinear equation problems. There are many other algorithms in nonlinear scientific computing that rely on successive linearizations via first order Taylor expansions. On the other hand, many realistic computer models are nondifferentiable in that the functional relation between input and output variables is not everywhere smooth. Also, from a theoretical point of view it has been realized that the solution operators of many computational problems are often nonsmooth, and may even have jumps like bang-bang controls. By now there is a very significant body of nonsmooth analysis, based on concepts of generalized differentiation (see e.g. [25] and [22]). This development was largely ignored by the AD community, mostly because the generalized differentiation rules did not seem amenable to an automatic implementation.
The key concepts of nonsmooth analysis on finite dimensional spaces are the following: If F : ℝn → ℝm is locally Lipschitz continuous, it follows from Rademacher's theorem that it has a Fréchet derivative F'(x) at all x ∈ℝn \ S with S a set of measure zero such that
Moreover, the induced norm ║F'(x)║ is bounded above by any local Lipschitz constant L with respect to suitable vector norms. Then we may define, for all x ∈ ℝn, the set
which we call the limiting Jacobian. It is never empty due to the density of the Fréchet differentiable points and the uniform boundedness of the nearby Jacobians.
As an example we obtain for the mapping F(x) = |x|
The convex hull
is called the Clarke generalized derivative. For the example above ∂LF(x) and ∂F(x) differ only at x = 0, with the latter expanding to ∂F(0) = [-1, 1].
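Written out, the two sets for this example are (standard facts about the absolute value; ∂_L denotes the limiting and ∂ the Clarke generalized Jacobian as in the text):

```latex
\partial_L |\cdot|(x) \;=\;
\begin{cases}
\{-1\}      & x < 0,\\
\{-1,\,+1\} & x = 0,\\
\{+1\}      & x > 0,
\end{cases}
\qquad
\partial |\cdot|(x) \;=\; \operatorname{conv}\,\partial_L |\cdot|(x) \;=\;
\begin{cases}
\{-1\}   & x < 0,\\
[-1,\,1] & x = 0,\\
\{+1\}   & x > 0.
\end{cases}
```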
For the limiting Jacobian we obtain the following generalized differentiation rules, which immediately imply corresponding set inclusions for the Clarke Jacobian.
In contrast to the list of corresponding smooth differentiation rules listed in Section 1.4, we have added the last rule applying to the absolute value function abs(x) = |x|. Of course, in general all of the set inclusions above are proper, i.e., not satisfied as equalities even if we restrict our attention to functions defined by evaluation procedures with elementals φi ∈Φ ∪{abs}. In other words, the only nonsmooth elemental function that we allow is the absolute value function, which yields immediately max and min as
The resulting functions are called composite piecewise smooth in [20] and will be studied in the remainder of this survey paper. The more general concept of piecewise smoothness has been examined for example in [31].
4.4 Between the lines
Clearly, composite piecewise smooth functions are locally Lipschitz continuous and thus differentiable at almost all points. However, the resulting linearizations (1.1) will only be valid on a small ball about the base point x̊ whose radius does not exceed the distance to the next point where the Fréchet derivative F'(x) is undefined. Instead, we believe that generalized algorithmic differentiation should provide an approximating function ΔF(x̊; Δx) such that
or, again more precisely, for a constant c
In other words, we are looking for a generalized Taylor expansion with uniform quadratic error term. To get an idea how this is constructively possible, let us consider a univariate function of the form F(x) = max (F1(x),F2(x)) as depicted in Figure 2.
Since F1 and F2 are assumed to be smooth, the function F is everywhere differentiable except at the kink point x*, where the two values tie. There the generalized gradient ∇C F(x*) = [F'1 (x*), F'2 (x*)] in the sense of Clarke is the interval spanned by the negative slope F'1 (x*) of the red branch and the positive slope F'2 (x*) of the blue branch. This reflects the fact that the set-valued Clarke derivative is just the convex and outer semicontinuous hull of the classical derivative F'(x), which is undefined at x = x* itself.
At any x ≠ x* Clarke's and all the other derivative concepts reduce simply to the slope, which gives no indication whatsoever of the nearby kink. If F is repeatedly evaluated as part of a larger iterative computation in floating point arithmetic, the kink will almost certainly never be hit exactly, and modeling F by the tangent line of either F₁ or F₂ may of course yield rather poor results. All we are suggesting is to model F by the dashed green function F(x̊) + ΔF(x̊; Δx). As we will see later, this piecewise linear function is obtained by approximating F₁(x) as well as F₂(x) by their tangent lines at the base point x̊ and then taking the maximum afterwards.
It is intuitively clear that this approximating function varies continuously with the base point x̊ and yields a rather good approximation to the original function F on both sides of its kink. In general, the second order approximation of a composite piecewise smooth function by a piecewise linear function can be achieved according to the following extremely simple recipe.
Replace all smooth elemental functions by their tangent line or plane, and the piecewise linear elementals abs, max and min by themselves.
That means we set for smooth elementals
and for the absolute value function
Then we can evaluate the composite incremental function Δy = ΔF(x̊; Δx) by the modification of Table 3 listed in Table 9.
Obviously, very little has changed and the computational complexity for evaluating Δy given Δx is again at most ω ≈ 4 times that of evaluating the composite piecewise smooth function F itself. When there are no nonsmooth elementals at all, Table 9 is equivalent to Table 3. So far we know of no straightforward generalization of the reverse mode listed for the smooth case in Table 5.
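A sketch of this modified propagation in Python (class and function names ours): value and increment are carried as a pair, smooth elementals contribute their tangent, and abs is kept as itself via the increment rule Δv = |u + Δu| − |u|, which is how we read the rule stated above for the absolute value; max and min then come for free through max(u, w) = (u + w + |u − w|)/2.

```python
import math

class PL:
    """Pair (value at the base point, piecewise linear increment)."""
    def __init__(self, val, inc=0.0):
        self.val, self.inc = val, inc
    def __add__(self, o):
        o = o if isinstance(o, PL) else PL(o)
        return PL(self.val + o.val, self.inc + o.inc)
    __radd__ = __add__
    def __sub__(self, o):
        o = o if isinstance(o, PL) else PL(o)
        return PL(self.val - o.val, self.inc - o.inc)
    def __mul__(self, o):
        o = o if isinstance(o, PL) else PL(o)
        # smooth elemental replaced by its tangent plane
        return PL(self.val * o.val, self.inc * o.val + self.val * o.inc)
    __rmul__ = __mul__

def pl_sin(u):                       # tangent line of a smooth elemental
    return PL(math.sin(u.val), math.cos(u.val) * u.inc)

def pl_abs(u):                       # piecewise linear elemental kept as itself
    return PL(abs(u.val), abs(u.val + u.inc) - abs(u.val))

def pl_max(u, w):                    # max(u, w) = (u + w + |u - w|)/2
    return 0.5 * (u + w + pl_abs(u - w))

# F(x) = max(sin(x), x*x) linearized at the base point x0 with increment dx
x0, dx = 0.8, -0.3
z = pl_max(pl_sin(PL(x0, dx)), PL(x0, dx) * PL(x0, dx))
print(z.val + z.inc, max(math.sin(x0 + dx), (x0 + dx) ** 2))   # model vs. true value
```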
It should be noted that ΔF(x̊; ·) is not just a matrix, but a continuous piecewise linear map from ℝⁿ to ℝᵐ. That situation might take a little getting used to, since one tends to expect that the end product of any differentiation process is some collection of derivative vectors or matrices. In fact, one can describe ΔF(x̊; ·) in terms of matrices and vectors by the so-called abs-normal form described in [18]. It can be generated directly by a minor extension of standard AD tools, which has for example been provided for ADOL-C.
In sharp contrast to the standard concepts of generalized differentiation, which are very volatile as functions of the development point x̊, the piecewise linearization ΔF(x̊; ·) varies locally Lipschitz continuously with respect to x̊. It also satisfies exact propagation rules; namely, we have for F and G with compatible domains and dimensions
Now, what can we do using the piecewise linear models of composite piecewise smooth functions described above? Well, mostly all the things that one does with linearizations of smooth functions, such as solving equations, (un)constrained minimization and the solution of discretized ordinary and partial differential equations. The uniform second order approximation property ensures rapid convergence of such successive piecewise linearization procedures from within sizable domains of attraction.
One instant benefit of piecewise linearization is the capability to compute one or more elements of the limiting Jacobian set
More precisely, we can compute limiting Jacobians of ΔF(x̊; ·) by lexicographic differentiation [28], and it has been shown in [20] and [17] that these are then also limiting Jacobians of the underlying CPS function F. Furthermore, they have a desirable feature called conic activity at x̊, which is a little stronger than the concept of essential activity used by Scholtes [31].
5 SUMMARY
In this survey we have considered the problem of locally approximating a function F : ℝⁿ → ℝᵐ defined as a composite of elemental functions φ_i as described in Section 1. If all these elementals are smooth, i.e., Lipschitz continuously differentiable, we obtain a classical Taylor expansion in terms of the Jacobian F'(x̊) : ℝⁿ → ℝᵐ evaluated at the current point x̊.
In Section 2 it is described how, rather than evaluating all entries of F'(x̊) simultaneously, one prefers to compute the matrix-vector products F'(x̊)ẋ or F'(x̊)ᵀȳ in the forward and reverse mode of algorithmic differentiation, respectively.
The procedures for doing that are listed in Tables 3 and 5, with the I/O characteristics indicated in (3.2). The temporal complexity of both modes is essentially the same, namely about 3-4 times as many operations are needed as for evaluating the underlying function F. However, there is a large gap in the spatial complexity, since the reverse mode requires the recuperation of the forward value trajectory on the way back, in one way or another. At the beginning of Section 3 we consider this memory aspect and answer the question of numerical accuracy quite satisfactorily in the sense of Wilkinson backward stability.
The two basic modes have been implemented in almost a hundred different software tools adapted to the most important languages and computing environments. To give a flavor of this evolution we have listed some of them in Table 8 without any ambition of being representative. The continual tool development can be tracked on the website www.autodif.org. There is a lot more to AD than the basic modes, as we have indicated in Section 4. That applies in particular to the evaluation of second or higher derivatives and also certain combinatorial tasks that arise in the handling of problems that are sparse or otherwise structured. Much remains to be done in this respect, also in view of parallel and other modern computer architectures.
In the final Subsection 4.4 we have sketched how piecewise smoothness arising from the Lipschitz elementals abs, min, and max can be handled very naturally by piecewise linearization. In effect one obtains a generalized Taylor approximation by a piecewise linear model with a uniform error of order 2 in terms of the distance to the reference point x̊. This model varies Lipschitz continuously w.r.t. x̊ and can be used for successive piecewise linearization techniques, to solve piecewise smooth equations and other fundamental tasks of scientific computing. So far the model is computed exclusively in the forward mode, and it is not at all clear how the concept of adjoints should be generalized and implemented for nonsmooth F. That scenario is of particular interest in design optimization and optimal control, where good piecewise linear approximations may need to be discontinuous.
REFERENCES
[1] BENNETT CH. 1973. Logical reversibility of computation. IBM J. Research and Development, 17:525-532.
[2] BERZ M, BISCHOF CH, CORLISS GF & GRIEWANK A. (Eds). 1996. Computational Differentiation: Techniques, Applications and Tools. Proceedings of the SIAM Workshop on the Automatic Differentiation of Algorithms in Santa Fe, New Mexico, SIAM, Philadelphia.
[3] BISCHOF CH H, BÜCKER HM, HOVLAND PD, NAUMANN U & UTKE J. (Eds). 2008. Advances in Automatic Differentiation. Lecture Notes in Computational Science and Engineering, 64, Springer Berlin.
[4] BISCHOF CH H, BÜCKER HM, LANG B, RASCH A & VEHRESCHILD A. 2002. Combining Source Transformation and Operator Overloading Techniques to Compute Derivatives for MATLAB Programs. Proceedings of the Second IEEE International Workshop on Source Code Analysis and Manipulation (SCAM 2002), IEEE Computer Society.
[5] BÜCKER HM, CORLISS GF, HOVLAND PD, NAUMANN U & NORRIS B. (Eds). 2005. Automatic Differentiation: Applications, Theory, and Implementations. Lecture Notes in Computational Science and Engineering, 50, Springer New York.
[6] CHANG YF & CORLISS GF. 1994. ATOMFT: Solving ODEs and DAEs using Taylor series. Comput. Math. Appl., 28:209-233.
[7] COLEMAN TF & MORÉ JJ. 1983. Estimation of sparse Jacobian matrices and graph coloring problems. SIAM J. Numer. Anal., 20:187-209.
[8] COLEMAN TF & VERMA A. 1996. Structure and efficient Jacobian calculation, in [2], 149-159.
[9] CORLISS GF, FAURE CH, GRIEWANK A, HASCOËT L & NAUMANN U. (Eds.) 2002. Automatic Differentiation of Algorithms: From Simulation to Optimization. Conference proceedings, Nice 2000, Computer and Information Science, Springer New York.
[10] FORTH SH, HOVLAND PD, PHIPPS E, UTKE J & WALTHER A. (Eds.) 2012. Recent Advances in Algorithmic Differentiation. Lecture Notes in Computational Science and Engineering, 87, Springer Berlin.
[11] GEBREMEDHIN AH, NGUYEN D, PATWARY MMA & POTHEN A. 2013. ColPack: Software for Graph Coloring and Related Problems in Scientific Computing. ACM Transactions on Mathematical Software, 40(1):Art. 1, 31.
[12] GRIEWANK A & CORLISS GF. (Eds.) 1991. Automatic Differentiation of Algorithms: Theory, Implementation, and Application. Proceedings of the SIAM Workshop on the Automatic Differentiation of Algorithms, Breckenridge, Colorado, SIAM, Philadelphia.
[13] GRIEWANK A, UTKE J & WALTHER A. 2000. Evaluating higher derivative tensors by forward propagation of univariate Taylor series. Mathematics of Computation, 69(231):1117-1130.
[14] GRIEWANK A & NAUMANN U. 2002. Accumulating Jacobians by Vertex, Edge or Face Elimination. Proceedings of CARI'02, 375-383.
[15] GRIEWANK A & WALTHER A. 2008. Principles and Techniques of Algorithmic Differentiation, Second Edition, SIAM.
[16] GRIEWANK A. 2012. Who invented the Reverse Mode of Differentiation? In: MARTIN GRÖTSCHEL (Ed.). Optimization Stories, Documenta Mathematica, Extra Volume ISMP (2012): 389-400.
[17] GRIEWANK A. 2013. On Stable Piecewise Linearization and Generalized Differentiation. Optimization Methods and Software, 28(6):1139-1178.
[18] GRIEWANK A, BERNT J-U, RADONS M & STREUBEL T. 2014. Solving piecewise linear equations in abs-normal form. Submitted to Linear Algebra and its Applications.
[19] GRIEWANK A & STROGIES N. 2014. Efficient Evaluation of Sparse Jacobians by Matrix Compression, Part I: Motivation and Theory. Proceedings of the 2nd International Conference on Mathematics, Computational and Statistical Sciences, pp. 33-40.
[20] KHAN KA & BARTON PI. 2013. Evaluating an element of the Clarke generalized Jacobian of a composite piecewise differentiable function. ACM Transactions on Mathematical Software (TOMS), 39(4):23:1-23:28.
[21] KORELC J. 1997. Automatic generation of finite-element code by simultaneous optimization of expressions. Theoretical Computer Science, 187:231-248.
[22] KLATTE D & KUMMER B. 2002. Nonsmooth equations in optimization, volume 60 of Nonconvex Optimization and its Applications. Kluwer Academic Publishers, Dordrecht, 2002. Regularity, calculus, methods and applications.
[23] LINNAINMAA S. 1970. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding error. Master's thesis (in Finnish), Department of Computer Science, University of Helsinki.
[24] LOGG A, MARDAL K-A & WELLS GN ET AL. 2012. Automated Solution of Differential Equations by the Finite Element Method. Springer.
[25] MORDUKHOVICH BS. 2006. Variational analysis and generalized differentiation. I, volume 330 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin, 2006. Basic theory.
[26] NAUMANN U. 2008. Optimal Jacobian accumulation is NP-complete. Math. Program., 112:427-441.
[27] NAUMANN U. 2012. The Art of Differentiating Computer Programs: An Introduction to Algorithmic Differentiation, SIAM, Software, Environments, and Tools, Philadelphia.
[28] NESTEROV YU. 2005. Lexicographic differentiation of nonsmooth functions. Mathematical Programming, 104(2-3):669-700, Springer.
[29] OSTROVSKII GM, VOLIN YU M & BORISOV WW. 1971. Über die Berechnung von Ableitungen. Wiss. Z. Tech. Hochschule für Chemie, 13:382-384.
[30] PHIPPS E, CASEY R & GUCKENHEIMER J. 2005. Periodic Orbits of Hybrid Systems and Parameter Estimation via AD. In: [5], 211-223.
[31] SCHOLTES ST. 2012. Introduction to piecewise differentiable equations. SpringerBriefs in Optimization, Springer Heidelberg.
[32] THACKER WC. 1991. Automatic differentiation from an oceanographer's perspective, in [12], pp. 191-201.
[33] WALTHER A, KULSHRESHTHA K & GRIEWANK A. 2012. On the Numerical Stability of Algorithmic Differentiation. Computing, 94:125-149.
[34] WANNER G. 1969. Integration gewöhnlicher Differentialgleichungen, Lie Reihen, Runge-Kutta-Methoden XI, B.I-Hochschulskripten, no. 831/831a, Bibliogr. Inst., Mannheim-Zürich, Germany.
[35] WILKINSON J. 1971. Modern error analysis. SIAM Rev., 13:548-568.
[36] WOLFE P. 1982. Checking the calculation of gradients. ACM Trans. Math. Softw., 8:337-343.