Implementing QR solve and nonlinear optimization in D

Author: Dr Chibisi Chima-Okereke Created: November 3, 2020 17:54:14 GMT Published: November 4, 2020 00:54:00 GMT

Introduction

This article presents techniques in linear algebra and nonlinear optimization. It draws on several past Active Analytics articles that implement various techniques in D, and uses them to construct a linear solver, and Newton and Quasi-Newton optimizers. It begins by taking the algorithms developed in a previous article on QR decomposition \(A = QR\) and converting them into functions that solve a linear system \(Ax = b\) for \(x\). This solver is then used to implement gradient and Newton-based optimization techniques for minimizing a function \(f(\mathbf{x})\), where at each step \(k\) we solve \(d_{k} = -G_{k}^{-1}g_{k}\) for the direction \(d_{k}\), with \(g_{k} = \nabla{f(\mathbf{x}_{k})}\) and \(G_{k} = \nabla^{2}{f(\mathbf{x}_{k})}\). We then implement selected Quasi-Newton methods and finish with the L-BFGS algorithm commonly used for large scale optimization problems. These optimization techniques are demonstrated by minimizing the Powell Singular Function, given by the equations:

$$ f(\mathbf{x}) = (x_{1} + 10x_{2})^{2} + 5(x_{3} - x_{4})^{2} + (x_{2} - 2x_{3})^{4} + 10(x_{1} - x_{4})^{4} $$

$$ x^{(0)} = [3, -1, 0, 1]^{T}; \quad f(\mathbf{x}^{*}) = 0; \quad x^{*} = [0, 0, 0, 0]^{T}; $$

The equations for the gradient vector and Hessian will be implemented in the code later on.

This article references some previous articles written on implementing tools that are used here. These are:

We shall refer to them periodically in this article. The following books were very useful references when writing the code for this article:

The latter reference is a useful and well known text in the numerical optimization literature; however, I found the former reference particularly useful for implementing the code here. The book follows an ordered pattern of theory, algorithms, theorems, and proofs. Whether you are a practitioner or a student, it has useful content throughout and is a detailed, thorough, yet approachable text. There is no doubt that any future articles on this subject will refer to these two wonderful books.

Please note that the code presented here is for demonstration purposes. It serves to inform and educate only, and does not aim to be production ready, high performance code, though hopefully the process of performance tuning in D will be visited at some point in the future. Enjoy.

Implementing a QR solver from QR decomposition

A previous article implemented the Gram-Schmidt, Modified Gram-Schmidt, and Householder QR decompositions. They will be used here to create matrix solve functions. The long and short of a QR solve for a linear system \(Ax = b\), where \(A = QR\) is an \(m \times n\) matrix, is:

$$ y = Q^{T}b; \quad Rx = y; $$

then solve:

$$ Rx = Q^{T}b $$

for \(x\) using back substitution, since \(R\) is an upper triangular matrix. In the Gram-Schmidt \(QR\), the columns of \(Q\) are generated one at a time, so \(Q^{T}b\) is easy to construct iteratively. The case for the Householder QR is given in Algorithm 10.2 of Numerical Linear Algebra, L. N. Trefethen, D. Bau III, and by Algorithm 5.3.2 of Matrix Computations, 4th edition, by G. H. Golub and C. F. Van Loan. The algorithm is given below:

for \(i = 1 \dots n\):
    \(b_{i:m} = b_{i:m} - 2v_{i}(v_{i}^{T}b_{i:m})\)

Then solve \(R_{1:n, 1:n} x_{1:n} = b_{1:n}\) using back substitution.
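
A minimal D sketch of this step, reusing the article's helpers (dot, map, backSubstitution) and assuming the normalized Householder vectors produced during the decomposition are available in an array vs; the same update appears inside the Householder solver further below:

// apply Q^T to b in place using the stored Householder vectors
foreach(i; 0..n)
{
  double tmp = 2*dot(vs[i], b[i..$]);
  b[i..$] -= map!((a) => a*tmp)(vs[i])[];
}
// then back substitution on the leading n x n block of R
double[] x = backSubstitution!(Upper)(R, b[0..n]);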

Back Substitution

Back-substitution in this context is the process by which we solve for \(x\) when we have a square upper triangular matrix of coefficients \(R\) of size \(n \times n\) that multiplies our unknown vector \(x\) of size \(n\), giving us a known vector \(b\) also of size \(n\). In this case we simply start at the bottom right, dividing \(b_{n}\) by \(R_{n, n}\) to get \(x_{n}\), then use this value to begin solving for the next element. For a \(5 \times 5\) system, the process is given by:

$$ \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} & a_{15}\\
0 & a_{22} & a_{23} & a_{24} & a_{25}\\
0 & 0 & a_{33} & a_{34} & a_{35}\\
0 & 0 & 0 & a_{44} & a_{45}\\
0 & 0 & 0 & 0 & a_{55}\\
\end{bmatrix} \begin{bmatrix} x_{1}\\
x_{2}\\
x_{3}\\
x_{4}\\
x_{5}\\
\end{bmatrix} = \begin{bmatrix} b_{1}\\
b_{2}\\
b_{3}\\
b_{4}\\
b_{5}\\
\end{bmatrix} $$

We first set \(x := b\), then

$$ \begin{matrix} x_{5} & = & \frac{x_{5}}{a_{55}}\\
x_4 & = & x_4 - a_{45} x_{5}\\
x_3 & = & x_3 - a_{35} x_{5}\\
x_2 & = & x_2 - a_{25} x_{5}\\
x_1 & = & x_1 - a_{15} x_{5}\\
\end{matrix} $$

then

$$ \begin{matrix} x_{4} & = & \frac{x_{4}}{a_{44}}\\
x_3 & = & x_3 - a_{34} x_{4}\\
x_2 & = & x_2 - a_{24} x_{4}\\
x_1 & = & x_1 - a_{14} x_{4}\\
\end{matrix} $$

then

$$ \begin{matrix} x_{3} & = & \frac{x_{3}}{a_{33}}\\
x_2 & = & x_2 - a_{23} x_{3}\\
x_1 & = & x_1 - a_{13} x_{3}\\
\end{matrix} $$

then

$$ \begin{matrix} x_{2} & = & \frac{x_{2}}{a_{22}}\\
x_1 & = & x_1 - a_{12} x_{2}\\
\end{matrix} $$

and finally \(x_{1} = \frac{x_{1}}{a_{11}}\).

This is a very easy algorithm to implement (and to parallelize and vectorize if necessary). A simple implementation of the upper triangular substitution in D is given below:

enum UPLO{Upper, Lower}
alias Upper = UPLO.Upper;
alias Lower = UPLO.Lower;


T[] backSubstitution(UPLO uplo, T)(Matrix!(T) R, T[] y)
if(uplo == Upper)
{
  size_t n = R.nrow;
  T[] x = y.dup;
  x[$ - 1] /= R[$ - 1, $ - 1];
  for(long i = (n - 1); i >= 1; --i)
  {
    for(long j = i - 1; j >= 0; --j)
    {
      x[j] -= R[j, i]*x[i];
    }
    x[i - 1] /= R[i - 1, i - 1];
  }
  return x;
}
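
As a quick illustrative check (a minimal sketch, assuming the column-major Matrix constructor used for the example data later in this article), solving a small upper triangular system recovers the expected solution:

// R*x = y with R upper triangular and x = [1, 2, 3]
auto R = Matrix!(double)([2.0, 0.0, 0.0,  // data laid out column by column
                          1.0, 3.0, 0.0,
                          1.0, 2.0, 4.0], 3, 3);
double[] y = [7.0, 12.0, 12.0];
writeln(backSubstitution!(Upper)(R, y)); // expected: [1, 2, 3]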

The QR solvers

Below is a simple object for containing the artifacts from the \(QR\) solve.

struct QRSolve(T)
{
  Matrix!(T) Q;
  Matrix!(T) R;
  T[] x;
  this(ref Matrix!(T) Q, ref Matrix!(T) R, ref T[] x)
  {
    this.Q = Q;
    this.R = R;
    this.x = x;
  }
}

The code for implementing the Gram-Schmidt QR solver discussed above is given below:

auto qrSolver(QRAlgorithm algo, T)(Matrix!(T) A, T[] b)
if(isFloatingPoint!(T) && (algo == GramSchmidt))
{
  auto Q = Matrix!(T)(A.nrow, A.ncol);
  auto R = zeros!(T)(A.ncol, A.ncol);
  T _norm = norm(A[0]);
  R[0, 0] = _norm;
  Q[0] = map!((a) => a/_norm)(A[0]);
  long n = A.ncol;
  T[] y = new T[A.ncol];
  for(long i = 1; i < n; ++i)
  {
    T[] qi = A[i].dup;
    T[] ai = A[i];
    for(long j = 0; j < i; ++j)
    {
      auto rji = dot(Q[j], ai);
      R[j, i] = rji;
      auto tmp = map!((a) => a*rji)(Q[j]);
      qi[] -= tmp[];
    }
    _norm = norm(qi);
    R[i, i] = _norm;
    Q[i] = map!((a) => a/_norm)(qi);
    if(allZero(Q[i]))
    {
      assert(0, "Error, qi = 0");
    }
    // Accumulate y here
    y[i] = dot(Q[i], b);
  }
  y[0] = dot(Q[0], b);
  // Call to back-substitution here
  y = backSubstitution!(Upper)(R, y);
  return QRSolve!(T)(Q, R, y);
}

The code for the Modified Gram-Schmidt QR solver is given below:

auto qrSolver(QRAlgorithm algo, T)(Matrix!(T) A, T[] b)
if(isFloatingPoint!(T) && (algo == ModifiedGramSchmidt))
{
  auto Q = Matrix!(T)(A.nrow, A.ncol);
  auto V = A.dup; long n = A.ncol;
  auto R = zeros!(T)(A.ncol, A.ncol);
  T[] y = new T[A.ncol];
  for(long i = 0; i < n; ++i)
  {
    T rii = norm(V[i]);
    R[i, i] = rii;
    Q[i] = map!((a) => a/rii)(V[i]);
    y[i] = dot(Q[i], b);
    for(long j = i + 1; j < n; ++j)
    {
      T rij = dot(Q[i], V[j]);
      R[i, j] = rij;
      V[j] = mapply!((a, b) => a - b*rij)(V[j], Q[i]);
    }
  }
  y = backSubstitution!(Upper)(R, y);
  return QRSolve!(T)(Q, R, y);
}

The code for the Householder \(QR\) solver described above is given below:

auto qrSolver(QRAlgorithm algo, T)(Matrix!(T) A, T[] b)
if(isFloatingPoint!(T) && (algo == Householder))
{
  auto _R = A.dup;
  long n = A.ncol;
  Matrix!(T) Q;
  T[] y = b.dup;
  for(long i = 0; i < n; ++i)
  {
    T[] v = _R[i..$, i].array;
    v[0] += norm(v)*sign(v[0]);
    T normV = norm(v);
    v = map!((a) => a/normV)(v);
    T tmp = 2*dot(v, y[i..$]);
    y[i..$] -= map!((a) => a*tmp)(v)[];
    if(i == 0)
    {
      Q = F(v);
    }else{
      updateQ(Q, F(v), i);
    }
    T[] twoV = map!((a) => 2*a)(v);
    _R[i..$, i..$] -= outer(twoV, gemv!(Column)(_R[i..$, i..$], v));
  }
  auto R = zeros!(T)(n, n);
  for(long i = 0; i < n; ++i)
  {
    for(long j = i; j < n; ++j)
    {
      R[i, j] = _R[i, j];
    }
  }
  y = backSubstitution!(Upper)(R, y[0..n]);
  return QRSolve!(T)(Q, R, y);
}

Below is the data for the problem to be solved:

auto A = Matrix!(double)([0.8594598509, 0.8886035203, 0.8149294811, 
                          0.7431045200, 0.8032585254, 0.0587533356, 
                          0.7245921139, 0.5380305406, 0.7342256338, 
                          0.6982547215, 0.7176400044, 0.0539911194, 
                          0.3670289037, 0.9701228316, 0.8404100032, 
                          0.4112932913, 0.3075223914, 0.5798244230, 
                          0.0015286701, 0.7890766996, 0.9781337455, 
                          0.2921431712, 0.0432923459, 0.9428416709, 
                          0.9646959945, 0.0354323143, 0.4898468039, 
                          0.4513681016, 0.2107982126, 0.4445287671, 
                          0.8115565467, 0.7058405790, 0.5527189195, 
                          0.5410537042, 0.9117912347, 0.1149175267, 
                          0.8406228190, 0.6040554044, 0.4260203703, 
                          0.2376075180, 0.2164094832, 0.1800869710, 
                          0.7479251262, 0.0009715103, 0.8810979640, 
                          0.8647838791, 0.5856765260, 0.0127644690, 
                          0.5744975219, 0.1985024847], 10, 5);
auto b = [1.7540145, 1.1653482, 1.1352384, 1.5483096, 
          2.0712038, 0.5012161, 1.5209896, 1.1975549, 
          0.8785516, 1.0212622];
writeln("A: ", A);
writeln("b: ", b);

output:

A: Matrix(10 x 5)
0.8595    0.7176    0.9781    0.8116    0.2164    
0.8886    0.05399   0.2921    0.7058    0.1801    
0.8149    0.367     0.04329   0.5527    0.7479    
0.7431    0.9701    0.9428    0.5411    0.0009715 
0.8033    0.8404    0.9647    0.9118    0.8811    
0.05875   0.4113    0.03543   0.1149    0.8648    
0.7246    0.3075    0.4898    0.8406    0.5857    
0.538     0.5798    0.4514    0.6041    0.01276   
0.7342    0.001529  0.2108    0.426     0.5745    
0.6983    0.7891    0.4445    0.2376    0.1985
b: [1.75401, 1.16535, 1.13524, 1.54831, 2.0712, 0.501216, 1.52099, 1.19755, 0.878552, 1.02126]

Solvers:

writeln("qr solve for GramSchmidt: ", qrSolver!(GramSchmidt)(A, b));
writeln("qr solve for Modified GramSchmidt: ", qrSolver!(ModifiedGramSchmidt)(A, b));
writeln("qr solve for Householder: ", qrSolver!(Householder)(A, b));

truncated output:

qr solve for GramSchmidt: ... [0.344246, 0.427925, 0.356672, 0.979463, 0.213499]

qr solve for Modified GramSchmidt: ... [0.344246, 0.427925, 0.356672, 0.979463, 0.213499]

qr solve for Householder: ... [0.344246, 0.427925, 0.356672, 0.979463, 0.213499]

Now that we can successfully solve for \(x\), we can use this to solve for our direction \(d_{k}\) in Newton's Method.

Newton and Quasi-Newton nonlinear optimization methods

The code for the Powell Singular Function introduced above is given below.

The objective function:

T powell(T)(T[] x)
{
  T value = 0;
  value += pow(x[0] + 10*x[1], 2.0);
  value += 5*pow(x[2] - x[3], 2.0);
  value += pow(x[1] - 2*x[2], 4.0);
  value += 10*pow(x[0] - x[3], 4.0);
  return value;
}

the gradient function:

T[] powellGradient(T)(T[] x)
{
  T[] grad = new T[x.length];
  grad[0] = 2*(x[0] + 10*x[1]) + 40*pow(x[0] - x[3], 3.0);
  grad[1] = 20*(x[0] + 10*x[1]) + 4*pow(x[1] - 2*x[2], 3.0);
  grad[2] = 10*(x[2] - x[3]) - 8*pow(x[1] - 2*x[2], 3.0);
  grad[3] = -10*(x[2] - x[3]) - 40*pow(x[0] - x[3], 3.0);
  return grad;
}

the Hessian function:

Matrix!(T) powellHessian(T)(T[] x)
{
  Matrix!(T) hessian = Matrix!(T)(4, 4);
  hessian[0, 0] = 2*(1 + 60*pow(x[0] - x[3], 2.0));
  hessian[0, 1] = 20;
  hessian[1, 0] = hessian[0, 1];
  hessian[0, 2] = 0; 
  hessian[2, 0] = 0;
  hessian[0, 3] = -120*pow(x[0] - x[3], 2.0);
  hessian[3, 0] = hessian[0, 3];
  
  hessian[1, 1] = 4*(50 + 3*pow(x[1] - 2*x[2], 2.0));
  hessian[1, 2] = -24*pow(x[1] - 2*x[2], 2.0);
  hessian[2, 1] = hessian[1, 2];
  hessian[1, 3] = 0;
  hessian[3, 1] = hessian[1, 3];

  hessian[2, 2] = 10 + 48*pow(x[1] - 2*x[2], 2.0);
  hessian[2, 3] = -10;
  hessian[3, 2] = hessian[2, 3];

  hessian[3, 3] = 10*(1 + 12*pow(x[0] - x[3], 2.0));

  return hessian;
}
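
As a quick sanity check on these implementations (the expected values below are hand-computed from the formulas above), the objective and gradient at the starting point \(x^{(0)} = [3, -1, 0, 1]^{T}\) should come out as \(f = 215\) and \(g = [306, -144, -2, -310]^{T}\):

writeln("f(x0): ", powell([3.0, -1.0, 0.0, 1.0]));         // 215
writeln("g(x0): ", powellGradient([3.0, -1.0, 0.0, 1.0])); // [306, -144, -2, -310]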

The convergence criterion used will be \(\lVert g_{k}\rVert < \epsilon\), where \(\epsilon = \sqrt{\epsilon_{T}}\) and \(\epsilon_{T}\) is the machine epsilon of the floating point type \(T\) (T.epsilon in D). For the double floating point type this is much stricter than commonly used values such as \(\epsilon = 10^{-5}\).
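
For example, with T = double the threshold works out to roughly 1.49e-08; a quick check:

import std.math: sqrt;
import std.stdio: writeln;
writeln(sqrt(double.epsilon)); // 1.49012e-08, versus the common choice of 1e-5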

Barzilai and Borwein Steepest Descent

The update for the parameter \(x\) in gradient descent toward the minimum of the function \(f(x)\) is given by \(x_{k + 1} = x_{k} + \alpha_{k}d_{k}\). Barzilai and Borwein descent uses the direction \(d_{k} = -g_{k}\), and the step length at iteration \(k\) is given by \(\alpha_k = \frac{s_{k - 1}^{T} s_{k - 1}}{s_{k - 1}^{T} y_{k - 1}}\), where \(s_{k - 1} = x_{k} - x_{k - 1}\) and \(y_{k - 1} = g_{k} - g_{k - 1}\). We use the standard line search techniques discussed in a previous article to calculate the first step length \(\alpha_{1}\), which gives us the increments required to obtain \(\alpha_{k}\) thereafter.

The code that does this is given below. First, a simple gradient descent object to hold the output:

struct GradientDescent(T)
{
  long niter;
  T obj;
  T[] params;
  T[] grad;
}

then the implementation of the gradient descent function

auto gradientDescent(alias fun, alias funGrad, T, 
          T eps = T.epsilon^^0.5)(T[] x)
if(isFloatingPoint!(T))
{
  T alpha;
  T[] d = new T[x.length];
  T[] si = new T[x.length];
  T[] yi = new T[x.length];
  T[] grad = funGrad(x);
  long i = 0;
  while(norm(grad) >= eps)
  {
    d[] = -grad[];
    
    // line search or calculate alpha
    if(i != 0)
    {
      alpha = dot(si, si)/dot(si, yi);
    }else{
      alpha = lineSearch!(Cubic, StrongWolfe, fun, funGrad)(1.0, x, d);
    }
    
    // compute new x and new grad
    T[] x_new = mapply!((a, b) => a + alpha*b)(x, d);
    T[] grad_new = funGrad(x_new);
    si[] = x_new[] - x[];
    yi[] = grad_new[] - grad[];
    x = x_new;
    grad = grad_new;
    ++i;
  }
  return GradientDescent!(T)(i, fun(x), x, grad);
}

where fun and funGrad are the objective and gradient functions. The code and output for this method are given below:

auto x0 = [3, -1, 0, 1.0];
writeln("Gradient descent (Barzilai and Borwein): ",
      gradientDescent!(powell, powellGradient)(x0));

output:

Gradient descent (Barzilai and Borwein): 
GradientDescent!double(213, 1.68817e-12, 
[0.000844147, -8.44146e-05, 0.000517916, 
0.000517917], [2.34175e-09, 3.90636e-09, 
4.92903e-09, 4.92904e-09])

As you can see, it took 213 iterations to get below the threshold criterion. While this is a lot, it is actually many times better than vanilla gradient descent.

Newton descent method

The Newton Method is a special case of descent where the direction is given by \(d_{k} = -G_{k}^{-1}g_{k}\), and as discussed above, we solve for it using the \(QR\) decomposition. In this particular case we also use a line search. The function parameters fun, funGrad, and funHessian will be used to pass the objective function, its gradient, and its Hessian respectively. We will also include a case where finite differences are used to calculate the gradient and the Hessian. The details of how to do this are discussed in a previous article.

The object to hold the output for Newton Methods is given below:

struct NewtonDescent(T)
{
  long niter;
  T obj;
  T[] params;
  T[] grad;
  Matrix!(T) hessian;
}

then the function for the Newton Method:

auto newton(alias fun, alias funGrad, alias funHessian,
          T, T eps = T.epsilon^^0.5)(T[] x)
if(isFloatingPoint!(T))
{
  T alpha;
  T[] d = new T[x.length];
  T[] grad = funGrad(x);
  Matrix!(T) hessian;
  long i = 0;
  while(norm(grad) >= eps)
  {
    hessian = funHessian(x);
    d[] = -qrSolver!(Householder)(hessian, grad).x[];
    // line search: the value of c2 matters a lot here -
    // it forces better alpha choices; compare 0.9 with 0.2
    alpha = lineSearch!(Cubic, StrongWolfe, 
                    fun, funGrad, 1E-4, 0.2)(1.0, x, d);
    // compute new x and new grad
    T[] x_new = mapply!((a, b) => a + alpha*b)(x, d);
    T[] grad_new = funGrad(x_new);
    x = x_new;
    grad = grad_new;
    ++i;
  }
  return NewtonDescent!(T)(i, fun(x), x, grad, hessian);
}

The line search parameters matter a lot for all the Newton and Quasi-Newton methods. A balance must be struck between getting good reductions with each iteration and not forcing too many evaluations or causing the optimization process to fail altogether.

The code is executed with:

writeln("Newton: ", newton!(powell, powellGradient, 
            powellHessian)(x0));
  
writeln("Finite Difference Newton: ", newton!(
            powell, 
            (a) => ND!(powell, 1)(a), 
            (a) => ND!(powell, 2)(a))(x0));

output:

Newton: NewtonDescent!double(8, 1.39016e-12, 
[0.000725789, -7.25789e-05, 0.000116126, 0.000116126], 
[9.0642e-09, -1.13339e-10, 2.26753e-10, -9.06436e-09], 
Matrix(4 x 4)
2          20         0          -0.0004014 
20         200        -2.007e-05 0          
0          -2.007e-05 10         -10        
-0.0004014 0          -10        10         
)

Finite Difference Newton: NewtonDescent!double(10, 1.18041e-14, 
[6.71259e-06, -6.71258e-07, -0.000143474, -0.000143473], 
[1.42546e-10, 1.64274e-10, -2.74336e-10, -4.88582e-11], 
Matrix(4 x 4)
2          20         0          -6.052e-06 
20         200        -5.34e-06  0          
0          -5.34e-06  10         -10        
-6.052e-06 0          -10        10         
)

As expected, these methods have much better convergence statistics; in fact, they have the best convergence statistics of all the methods described here.

Iterative Hessian construction using Quasi-Newton Methods

Explicit calculation of Hessian matrices can be expensive, and sometimes there is no analytical form of the Hessian that can be easily implemented. Some Quasi-Newton methods provide alternative constructions of estimates of the Hessian \(B_{k}\), and of its inverse \(H_{k}\). These involve iterative updating methods:

$$ B_{k + 1} = F_{B}(B_{k}, s_{k}, y_{k}) $$

$$ H_{k + 1} = F_{H}(H_{k}, s_{k}, y_{k}) $$

where \(F_{B}\) and \(F_{H}\) are updating functions for \(B_{k}\) and \(H_{k}\) respectively. The theory for how these are constructed is covered in the referenced texts, and we will show implementation details for the DFP and BFGS cases.

A simple framework for Quasi-Newton solvers

The technique used here is to leverage a little of the D language's extensive metaprogramming capability to implement a function that can be statically dispatched for any Quasi-Newton solver (once the preliminary items are implemented).

Choice of whether we want to use \(B_{k}\) or \(H_{k}\):

enum HessianType{Hessian, inverseHessian}
alias Hessian = HessianType.Hessian;
alias inverseHessian = HessianType.inverseHessian;

This choice of Hessian or inverseHessian will cause the respective matrix to be returned in the NewtonDescent object.

These are the methods we shall use:

enum QuasiNewton{DFP,/* could contain more method names */BFGS}
alias DFP = QuasiNewton.DFP;
alias BFGS = QuasiNewton.BFGS;

Initialization for \(H_{0}\) or \(B_{0}\) as described in the references:

Matrix!(T) initHessian(HessianType hType, T)(T[] s, T[] y)
{
  static if(hType == inverseHessian)
  {
    T m = dot(s, y)/dot(y, y);
    return m*eye!(T)(s.length);
  }else{
    T m = dot(y, y)/dot(s, y);
    return m*eye!(T)(s.length);
  }
}

If we use \(B_{k}\) we need to obtain the direction \(d_{k}\) by solving (\(d_{k} = -B_{k}^{-1}g_{k}\)) and if we use \(H_{k}\) we need to obtain the direction by matrix multiplication (\(d_{k} = -H_{k}g_{k}\)):

T[] direction(HessianType hType, T)(Matrix!(T) hessian, T[] grad)
{
  static if(hType == inverseHessian)
  {
    T[] d = new T[grad.length];
    d[] = -gemv!(Row)(hessian, grad)[];
  }else{
    T[] d = new T[grad.length];
    d[] = -qrSolver!(Householder)(hessian, grad).x[];
  }
  return d;
}

These are the functions to update \(B_{k}\) and \(H_{k}\) for DFP:

$$ B_{k + 1} = (I - \rho_{k} y_{k} s_{k}^{T}) B_{k} (I - \rho_{k} s_{k} y_{k}^{T}) + \rho_{k} y_{k} y_{k}^{T} $$

where \(\rho_{k} = \frac{1}{y_{k}^T s_{k}}\), and

$$ H_{k + 1} = H_{k} + \frac{s_{k}s_{k}^{T}}{s_{k}^{T}y_{k}} - \frac{H_{k} y_{k}y_{k}^{T} H_{k}}{y_{k}^{T} H_{k} y_{k}} $$

Matrix!(T) qHessian(HessianType htype, QuasiNewton quasi, T)(
                  Matrix!(T) B, T[] s, T[] y)
if((htype == Hessian) && (quasi == DFP))
{
  T rho = 1/dot(y, s);
  auto I = eye!(T)(y.length);
  auto term2 = I - rho*outer(s, y);
  return gemm(gemm!(CblasTrans)(term2, B), term2) + rho*outer(y, y);
}
Matrix!(T) qHessian(HessianType htype, QuasiNewton quasi, T)(
                  Matrix!(T) H, T[] s, T[] y)
if((htype == inverseHessian) && (quasi == DFP))
{
  auto term2 = outer(s, s)/dot(s, y);
  auto tmp = gemv!(Row)(H, y);
  auto term3 = outer(tmp, tmp)/dot(y, gemv!(Row)(H, y));
  return H + term2 - term3;
}

then the same for BFGS:

$$ B_{k + 1} = B_{k} + \frac{y_{k} y_{k}^{T}}{y_{k}^{T} s_{k}} - \frac{B_{k} s_{k} s_{k}^{T} B_{k}}{s_{k}^{T} B_{k} s_{k}} $$

$$ H_{k + 1} = (I - \rho_{k} s_{k} y_{k}^{T}) H_{k} (I - \rho_{k} y_{k} s_{k}^{T}) + \rho_{k} s_{k} s_{k}^{T} $$

Matrix!(T) qHessian(HessianType htype, QuasiNewton quasi, T)(
                  Matrix!(T) B, T[] s, T[] y)
if((htype == Hessian) && (quasi == BFGS))
{
  auto term2 = outer(y, y)/dot(y, s);
  auto tmp = gemv!(Row)(B, s);
  auto term3 = outer(tmp, tmp)/dot(s, gemv!(Row)(B, s));
  return B + term2 - term3;
}
Matrix!(T) qHessian(HessianType htype, QuasiNewton quasi, T)(
                  Matrix!(T) H, T[] s, T[] y)
if((htype == inverseHessian) && (quasi == BFGS))
{
  T rho = 1/dot(y, s); auto I = eye!(T)(y.length);
  auto term2 = (I -  rho*outer(y, s));
  return gemm(gemm!(CblasTrans)(term2, H), term2) + rho*outer(s, s);
}

Now we present the implementation of the Quasi-Newton function:

auto qnewton(alias fun, alias funGrad, HessianType hType,
          QuasiNewton quasi,
          T, T eps = T.epsilon^^0.5)(T[] x)
if(isFloatingPoint!(T))
{
  T alpha; size_t n = x.length;
  T[] d = new T[n];
  T[] si = new T[n];
  T[] yi = new T[n];
  T[] grad = funGrad(x);
  auto hessian = eye!(T)(n);
  long i = 0;
  T gradNorm = norm(grad);
  while(gradNorm >= eps)
  {
    if(i > 1)
    {
      hessian = qHessian!(hType, quasi)(hessian, si, yi);
      d = direction!(hType)(hessian, grad);
    }else if(i == 1){
      hessian = initHessian!(hType)(si, yi);
      d = direction!(hType)(hessian, grad);
    }else{
      d[] = -grad[];
    }
    
    if(i != 0)
    {
      // c2 in [0.5, 0.7] depending on update method
      alpha = lineSearch!(Cubic, StrongWolfe, 
                  fun, funGrad, 1E-4, 0.7)(1.0, x, d);
    }else{
      alpha = lineSearch!(Cubic, StrongWolfe, 
                  fun, funGrad, 1E-4, 0.9)(1.0, x, d);
    }
    
    // compute new x and new grad
    T[] x_new = mapply!((a, b) => a + alpha*b)(x, d);
    T[] grad_new = funGrad(x_new);
    si[] = x_new[] - x[];
    yi[] = grad_new[] - grad[];
    x = x_new;
    grad = grad_new;
    gradNorm = norm(grad);
    ++i;
  }
  
  return NewtonDescent!(T)(i, fun(x), x, grad, hessian);
}

The code that runs these methods for our given problem for DFP is:

writeln("Quasi Newton (DFP): ", qnewton!(powell, 
              powellGradient, Hessian, DFP)(x0));
writeln("Quasi Newton (DFP inv): ", qnewton!(powell, 
              powellGradient, inverseHessian, DFP)(x0));

output:

Quasi Newton (DFP): NewtonDescent!double(155, 2.62291e-16, 
[-6.17912e-05, 6.17916e-06, 9.75811e-06, 9.75802e-06], 
[7.681e-10, 7.8275e-09, 8.38457e-10, -8.23786e-10], 
Matrix(4 x 4)
12.52      126.2      11.11      -11.11     
126.2      1273       112.2      -112.2     
11.11      112.2      22.67      -22.67     
-11.11     -112.2     -22.67     22.68      
)
Quasi Newton (DFP inv): NewtonDescent!double(155, 2.62203e-16, 
[-6.17642e-05, 6.17646e-06, 9.77592e-06, 9.77583e-06], 
[8.59114e-10, 8.73758e-09, 9.35866e-10, -9.21201e-10], 
Matrix(4 x 4)
7236       -718.6     2351       2340       
-718.6     71.36      -233       -231.9     
2351       -233       3.085e+04  3.085e+04  
2340       -231.9     3.085e+04  3.084e+04  
)

For BFGS:

writeln("Quasi Newton  (BFGS): ", qnewton!(powell, 
              powellGradient, Hessian, BFGS)(x0));
writeln("Quasi Newton (BFGS inv): ", qnewton!(powell, 
              powellGradient, inverseHessian, BFGS)(x0));

output:

Quasi Newton  (BFGS): NewtonDescent!double(66, 2.79176e-16, 
[0.000105412, -1.05412e-05, 5.53775e-05, 5.53775e-05], 
[-3.7871e-10, -3.84434e-09, -1.55817e-10, 1.65083e-10], 
Matrix(4 x 4)
3.26       32.61      0.2004     -0.2009    
32.61      326.2      2.008      -2.013     
0.2004     2.008      13.96      -13.96     
-0.2009    -2.013     -13.96     13.96      
)

Quasi Newton (BFGS inv): NewtonDescent!double(66, 2.79176e-16, 
[0.000105412, -1.05412e-05, 5.53775e-05, 5.53775e-05], 
[-3.7871e-10, -3.84434e-09, -1.55816e-10, 1.65083e-10], 
Matrix(4 x 4)
5.976e+04  -5973      6.275e+04  6.275e+04  
-5973      597.1      -6265      -6265      
6.275e+04  -6265      5.257e+05  5.257e+05  
6.275e+04  -6265      5.257e+05  5.257e+05 
)

The L-BFGS method

The L-BFGS method, or Limited Memory BFGS, allows us to solve nonlinear problems whose Hessian matrix is so large that it does not fit into, or can not be processed in, memory, so the usual Newton or Quasi-Newton methods can not be used. New descent directions \(d_{k}\) are calculated from the \(m\) previous \(s_{k}\) and \(y_{k}\) vectors, which are stored in matrices \(S\) and \(Y\).

The direction \(d_{k}\) is calculated from the two loop recursion given in Algorithm 7.4 in Nocedal and Wright:

\(q \leftarrow \nabla{f(x)}\);
\(\textbf{for}\) \(i = k - 1, k-2, \ldots, k-m\)
    \(\alpha_{i}\leftarrow \rho_{i}s_{i}^{T}q\);
    \(q\leftarrow q - \alpha_{i}y_{i}\);
\(\textbf{end(for)}\)
\(r \leftarrow H_k^{0} q\)
\(\textbf{for}\) \(i = k - m, k - m + 1, \ldots, k-1\)
    \(\beta \leftarrow \rho_{i} y_{i}^{T} r\);
    \(r \leftarrow r + s_{i}(\alpha_{i} - \beta) \);
\(\textbf{end(for)}\)
\(\textbf{stop}\) with results \(H_{k}\nabla{f_{k}} = r\).

The direction is given by \(d_{k} = -r\). The rest of the algorithm is as before, and the matrices \(S\) and \(Y\) are updated at each iteration along with the vector \(\rho\). L-BFGS can be initialized in a number of ways, for instance by using the previous Barzilai and Borwein calculation for \(\alpha_{k}\) after the first iteration; in the implementation below, however, we fill \(S\), \(Y\), and \(\rho\) using line search assisted gradient descent. Once \(m\) iterations are done, we switch to the algorithm above. The latest \(s_{k}\), \(y_{k}\), and \(\rho\) values are prepended to the respective data structures and the oldest values are discarded from the end, so as to keep only \(m\) items. This is the opposite indexing arrangement to the reference text, which arranges them \(k - m, k - m + 1, \ldots, k - 1\), but it does not matter in this case.

The data structure for the L-BFGS algorithm is given below:

struct LBFGS(T)
{
  long niter;
  T obj;
  T[] params;
  T[] grad;
}

and the function for the algorithm is given below:

auto lbfgs(alias fun, alias funGrad,
          T, T eps = T.epsilon^^0.5)(T[] x, size_t m = 5)
if(isFloatingPoint!(T))
{
  T alpha; size_t n = x.length;
  T[] d = new T[n];
  T[] si = new T[n];
  T[] yi = new T[n];
  
  /* Arranged k - 1, ..., k - m */
  auto S = Matrix!(T)(n, m);
  auto Y = Matrix!(T)(n, m);

  T[] grad = funGrad(x);
  long i = 0;
  T gradNorm = norm(grad);
  T[] rho = new T[m];
  T[] alpha_j = new T[m];
  while(gradNorm >= eps)
  {
    if(i >= m)
    {
      // the two-loop recursion happens here
      T[] q = grad.dup;
      for(long j = 0; j < m; ++j)
      {
        T tmp = rho[j]*dot(S[j], q);
        q = mapply!((a, b) => a - tmp*b)(q, Y[j]);
        alpha_j[j] = tmp;
      }

      T H0 = dot(si, yi)/dot(yi, yi);
      d = map!((a) => a*H0)(q);
      for(long j = (m - 1); j >= 0; --j)
      {
        T tmp = alpha_j[j] - rho[j]*dot(Y[j], d);
        d = mapply!((a, b) => a + tmp*b)(d, S[j]);
      }
      d[] = -d[];
    }else{
      d[] = -grad[];
    }
    
    if(i != 0)
    {
      alpha = lineSearch!(Cubic, StrongWolfe, 
                  fun, funGrad, 1E-4, 0.7)(1.0, x, d);
    }else{
      alpha = lineSearch!(Cubic, StrongWolfe, 
                  fun, funGrad, 1E-4, 0.9)(1.0, x, d);
    }
    
    T[] x_new = mapply!((a, b) => a + alpha*b)(x, d);
    T[] grad_new = funGrad(x_new);
    si[] = x_new[] - x[];
    yi[] = grad_new[] - grad[];
    
    // Update S and Y matrices and rho
    if(i < m)
    {
      S[m - i - 1] = si;
      Y[m - i - 1] = yi;
      rho[m - i - 1] = 1/dot(yi, si);
    }else{
      // remove the last column
      S.refColumnRemove(m - 1);
      Y.refColumnRemove(m - 1);
      rho[1..$] = rho[0..($ - 1)].dup;
      // prepend the new column
      S.prependColumn(si);
      Y.prependColumn(yi);
      rho[0] = 1/dot(yi, si);
    }

    x = x_new;
    grad = grad_new;
    gradNorm = norm(grad);
    ++i;
  }
  
  return LBFGS!(T)(i, fun(x), x, grad);
}

The L-BFGS function is executed (with m = 10) as:

writeln("L-BFGS: ", lbfgs!(powell, powellGradient)(
            x0, 10));

with output:

L-BFGS: LBFGS!double(43, 5.07079e-14, 
[0.000295664, -2.95663e-05, 2.88896e-05, 2.88897e-05], 
[1.93657e-09, 1.17687e-08, -1.49693e-09, 7.42831e-10])

This is a better result than the Quasi-Newton methods presented earlier, though certainly not as good as the earlier Newton Methods.

Summary

In this article, the implementation of a linear algebra algorithm was presented, and we scratched the surface of implementing nonlinear optimization algorithms. These algorithms form the basis of many analytical techniques used today in fields such as science, engineering, energy, finance, and many more. This article built upon implementations of various algorithms presented in earlier Active Analytics articles, and has hopefully presented useful and informative techniques.

Implementing code on these topics is interesting and useful, and hopefully more numerical analysis topics will be visited in future articles.

Thank you for reading.