Recursive Subspace and Quadratic Quasi-Newton Optimization

Several practical issues arise when applying second-order (quasi-Newton) strategies such as L-BFGS to neural network training:

  1. Second-order derivative estimates are very sensitive to the assumption that the function is smooth and deterministic: any re-sampling of the data (for example, drawing a new mini-batch) invalidates the stored history of gradients, and numerical precision issues can interfere with training.
  2. In regions of high nonlinearity, which are fairly common, the strategy may suggest a step that results in a fitness decrease (i.e., a worse objective value). This is why L-BFGS prescribes the Armijo and Wolfe line search conditions to guarantee convergence: in the event of a “bad” step, simply try a smaller one. As long as the search direction is a descent direction, we are guaranteed to find an acceptable step even if it is very small; a minimal backtracking sketch follows this list.
  3. Estimating the inverse Hessian is a nontrivial operation. Various strategies trade off memory, CPU, and accuracy, but the cost can still be significant, and the machinery adds complexity and additional hyperparameters; the two-loop recursion sketched after this list is the standard way L-BFGS keeps this cost manageable.
  4. At a sufficiently small step size, not only is the extra computation wasted, but we also know the chosen direction is not locally optimal, because it lies at an angle to the known steepest-descent vector.
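
To make item 2 concrete, here is a minimal backtracking line search sketch in Python. It checks only the Armijo sufficient-decrease condition (a full Wolfe search also enforces a curvature condition), and the function names and constants are illustrative choices, not the article's actual implementation.

```python
import numpy as np

def backtracking_line_search(f, grad, x, direction,
                             alpha0=1.0, c1=1e-4, shrink=0.5, max_iter=30):
    """Shrink the step until the Armijo sufficient-decrease test passes.

    f         -- objective function, f(x) -> float
    grad      -- gradient of f at x (already evaluated)
    x         -- current point
    direction -- proposed step direction (e.g., the quasi-Newton step)
    """
    fx = f(x)
    slope = np.dot(grad, direction)   # directional derivative; must be negative
    assert slope < 0, "not a descent direction"
    alpha = alpha0
    for _ in range(max_iter):
        # Accept the step if it reduces f by at least a small fraction
        # of the decrease predicted by the linear model.
        if f(x + alpha * direction) <= fx + c1 * alpha * slope:
            return alpha
        alpha *= shrink               # "bad" step: try a smaller one
    return alpha                      # give up and return the tiny final step

# Toy usage on f(x) = ||x||^2 with gradient 2x.
x = np.array([3.0, -4.0])
g = 2.0 * x
alpha = backtracking_line_search(lambda v: float(np.dot(v, v)), g, x, -g)
print(alpha, x - alpha * g)   # 0.5, and the step lands on the minimizer [0, 0]
```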
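
Item 3 concerns the cost of the inverse-Hessian estimate. The standard L-BFGS answer is the two-loop recursion, sketched below under the usual assumptions (a short history of (s, y) pairs with positive curvature, y·s > 0); it never forms the n-by-n matrix, so the per-step cost is O(m·n) for m stored pairs.

```python
import numpy as np

def lbfgs_direction(grad, s_history, y_history):
    """Two-loop recursion: apply the implicit inverse-Hessian estimate to grad.

    s_history[i] = x_{i+1} - x_i and y_history[i] = g_{i+1} - g_i; only these
    vector pairs are stored, never a dense n-by-n matrix.
    """
    q = np.array(grad, dtype=float)
    stack = []
    # First loop: walk the history from newest pair to oldest.
    for s, y in zip(reversed(s_history), reversed(y_history)):
        rho = 1.0 / np.dot(y, s)          # assumes curvature condition y.s > 0
        a = rho * np.dot(s, q)
        q -= a * y
        stack.append((a, rho, s, y))
    # Scale by an initial inverse-Hessian guess, gamma * I.
    if s_history:
        s, y = s_history[-1], y_history[-1]
        q *= np.dot(s, y) / np.dot(y, y)
    # Second loop: walk back from oldest pair to newest.
    for a, rho, s, y in reversed(stack):
        b = rho * np.dot(y, q)
        q += (a - b) * s
    return -q                             # search direction, approx. -H^{-1} g

# Toy usage: one stored pair from a quadratic with Hessian diag(2, 10).
s = np.array([1.0, 1.0])
y = np.array([2.0, 10.0])             # y = H s for that Hessian
g = np.array([4.0, 20.0])             # g = 2 y lies in the span of the pair,
print(lbfgs_direction(g, [s], [y]))   # so this recovers the exact Newton step [-2, -2]
```

This is the memory/CPU/accuracy tradeoff mentioned above: a larger history m generally improves the approximation, at a linear cost in memory and time.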
The optimizer configurations considered are:

  1. L-BFGS with a normalized network
  2. L-BFGS with a non-normalized network
  3. Modified L-BFGS with a normalized network, using a strong line search
  4. Recursive Subspace with a non-normalized network, using L-BFGS as the inner optimizer
  5. Quadratic Quasi-Newton with a normalized network
  6. Recursive Subspace with a non-normalized network, using QQN as the inner optimizer
