Performance and Optimizations

We can use the MNIST tutorial as an example to compare automatic differentiation with "manual" differentiation:

benchmarks

In the above, we compare:

  1. "Manual" differentiation of a 784 x 300 x 100 x 10 fully-connected feed-forward ANN.
  2. Automatic differentiation using backprop and the lens-based accessor interface
  3. Automatic differentiation using backprop and the "higher-kinded data"-based pattern matching interface
  4. A hybrid approach that manually provides gradients for individual layers but uses automatic differentiation for chaining the layers together. See the section "Dealing with Overhead from Redundant Updates" for details.

Sources of Overhead

One immediate result is that simply running the network and functions (using evalBP) incurs virtually zero overhead. This means that library authors could actually export only backprop-lifted functions, and users would be able to use them without losing any performance.

As for computing gradients, there exists some associated overhead. There are three main sources:

  1. The construction and traversal of the Wengert tape used to implement automatic differentiation. However, this overhead is typically negligible for backpropagating any numerical computations of non-trivial complexity.

  2. Redundant updates of entire data types during gradient accumulation. This will be, by far, the dominating source of any overhead compared to manual differentiation for any numerical computation of non-trivial complexity.

  3. Inefficiencies associated with "naive" differentiation, compared to manual symbolic differentiation. However, this inefficiency is typically negligible except in edge cases.

In addition, usage of the "Higher-Kinded Data"-based pattern matching interface (over the lens-based accessor interface) incurs some penalties from the inefficient nature of GHC Generics in general.

Optimization Techniques

Dealing with Overhead from Redundant Updates

By far the dominating source of overhead when using backprop is the redundant update of data type fields when accumulating gradients.

Example

That is, if we had a data type like:

data MyType = MT { _mtX :: Double
                 , _mtY :: Double
                 , _mtZ :: Double
                 }
    deriving (Show, Generic)

makeLenses ''MyType

instance Backprop MyType

and we use all three fields somehow:

myFunc :: Reifies s W => BVar s MyType -> BVar s Double
myFunc mt = (mt ^^. mtX) * (mt ^^. mtY) + (mt ^^. mtZ)

and we compute its gradient:

gradBP myFunc (MT 5 7 2) => 
MT {_mtX = 7.0, _mtY = 5.0, _mtZ = 1.0}

The library will first compute the derivative of the first field, and embed it into MyType:

MT { _mtX = 7.0, _mtY = 0.0, _mtZ = 0.0 }

Then it'll compute the derivative of the second field and embed it:

MT { _mtX = 0.0, _mtY = 5.0, _mtZ = 0.0 }

And finally compute the derivative of the third field and embed it:

MT { _mtX = 0.0, _mtY = 0.0, _mtZ = 1.0 }

And it'll compute the final derivative by add-ing all three of those together.

This is not too bad with Doubles, but when you have huge matrices, there will be six redundant addition of zeroes for a data type with three fields...and those additions of zero matrices can incur a huge cost.

In general, for a data type with \(n\) fields where you use \(m\) of those fields, you will have something on the order of \(\mathcal{O}(n m)\) redundant additions by zero.

Mitigating

One way to mitigate these redundant updates is to prefer data types with less fields if possible, or re-factor your data types into multiple "levels" of nesting, to reduce the amount of redundant additions by zero. That is, instead of having a giant ten-field data type, have two five-field data types, and one type having a value of each type. This also works well with recursive "linked list" data types, as well, as long as you write functions on your linked lists inductively.

You can also be careful in how many times you use ^^. (viewVar), because each usage site incurs another addition-by-zero in the gradient accumulation. If possible, refactor all of your ^^. into a single binding, and share it within your expression, instead of using it again several times for the same field in the same expression.

You can also use clever lenses too "simulate" having a data type with less fields than you actually have. For example, you can have a lens on the first two fields:

mtXY :: Lens' MyType (Double, Double)
mtXY f (MT x y z) = (\(x', y') -> MT x' y' z) <$> f (x, y)

This treats accessing both fields as effectively a single access to a single tuple field, and so cuts out an extra addition by zero.

As a last resort, you can completely eliminate redundant additions by zero by providing manual gradients to functions using your data type.

myFunc' :: Reifies s W => BVar s MyType -> BVar s Double
myFunc' = liftOp1 . op1 $ \(MT x y z) ->
    ( (x * y) + z
    , \d -> MT (d * y) (x * d) d
    )
gradBP myFunc' (MT 5 7 2) => 
MT {_mtX = 7.0, _mtY = 5.0, _mtZ = 1.0}

See the writing manual gradients page for more information on exactly how to specify your operations with manual gradients.

Once you do this, you can use myFunc' as a part of any larger computation; backpropagation will still work the same, and you avoid any redundant additions of zero:

gradBP (negate . sqrt . myFunc) (MT 5 7 2) => 
MT {_mtX = -0.5753964555687505, _mtY = -0.41099746826339323, _mtZ = -8.219949365267865e-2}
gradBP (negate . sqrt . myFunc') (MT 5 7 2) => 
MT {_mtX = -0.5753964555687505, _mtY = -0.41099746826339323, _mtZ = -8.219949365267865e-2}

When you use myFunc' in a function, it will be efficiently backpropagated by the backprop library.

This is useful for situations like optimizing artificial neural networks that are a composition of multiple "layers": you can manually specify the derivative of each layer, but let the backprop library take care of finding the derivative of their composition. This is exactly the "hybrid" mode mentioned in the benchmarks above. As can be seen by benchmark results, this brings the manual and automatic backprop results to almost within range of random variance of each other.

However, I don't recommend doing this, unless as a last resort for optimization. This is because:

  1. The whole point of the backprop library is to allow you to never have to specify manual gradients
  2. It is very very easy to make a mistake in your gradient computation and introduce subtle bugs
  3. It is difficult to modify your function if you want to tweak what it returns. Compare changing the multiplication to division in the original myFunc vs. the manual myFunc'
  4. It makes it harder to read and understand (and subsequently refactor) your code.

However, this option is available as a low-level performance hack.

Dealing with Overhead from Naive Differentiation

Automatic differentiation is a mechanical process that is nothing more than glorified book-keeping and accumulation. It essentially "hitches a ride" on your normal computation in order to automatically accumulate its gradient. It isn't aware of the analytical nature of computations, and cannot do any symbolic or analytical simplifications like re-associating additions or canceling out factors that humans might perform if manually differentiating.

In most cases, this is "good enough" and will not be any significant source of inefficiency in the larger picture. At least, it won't be worth the cognitive overhead in squeezing out a one or two percent increase in performance. However, there are some edge cases where this might become a concern worth looking at.

A common example is the composition of the softmax activation function and the cross-entropy error function often used in deep learning. Together, their derivatives are somewhat complex, computationally. However, the derivative of their composition, crossEntropy x . softMax actually has an extremely "simple" form, because of how some factors cancel out. To get around this, libraries like tensorflow offer an optimized version of the composition with manually computed gradients.

-- import Numeric.LinearAlgebra.Static.Backprop

softMax
    :: (KnownNat n, Reifies s W)
    => BVar s (R n)
    -> BVar s (R n)
softMax x = konst (1 / totx) * expx
  where
    expx = exp x
    totx = sumElements expx

crossEntropy
    :: (KnownNat n, Reifies s W)
    => R n
    -> BVar s (R n)
    -> BVar s Double
crossEntropy x y = -(log y `dot` auto x)

(Note the usage of auto :: a -> BVar s a to lift a normal value into a BVar)

Now, you can use crossEntropy x . softMax as a BVar s (R n) -> BVar s Double function, and the result and gradient would be correct. It would backpropagate the gradient of crossEntropy into softMax. However, you can take advantage of the fact that some factors in the result "cancel out", and you can drastically simplify the computation.

Their normal composition would naively be:

softMaxCrossEntropy
    :: (KnownNat n, Reifies s W)
    => R n
    -> BVar s (R n)
    -> BVar s Double
softMaxCrossEntropy x y = -(log softMaxY `dot` auto x)
  where
    expy     = exp y
    toty     = sumElements expy
    softMaxY = konst (1 / toty) * expy

Which you can probably guess has a decently complex gradient, just from all of the chained operations we have going on.

However, if you work things out on pencil and paper, you'll find a nice form for the gradient of the cross entropy composed with softmax, \(f(x,y)\):

\[ \nabla_y f(\mathbf{x}, \mathbf{y}) = \mathrm{softmax}(\mathbf{y}) - \mathbf{x} \]

Basically, the gradient is just the result of softMax vector-subtracted from the target.

After computing the gradient by hand, we can write softMaxCrossEntropy with our manual gradient:

-- using the non-lifted interfaces
-- import qualified Numeric.LinearAlgebra        as HU
-- import qualified Numeric.LinearAlgebra.Statuc as H

softMaxCrossEntropy'
    :: (KnownNat n, Reifies s W)
    => R n
    -> BVar s (R n)
    -> BVar s Double
softMaxCrossEntropy' x = liftOp1 . op1 $ \y ->
    let expy     = exp y
        toty     = HU.sumElements (H.extract expy)
        softMaxY = H.konst (1 / toty) * expy
        smce     = -(log softMaxY `H.dot` x)
    in  ( smce
        , \d -> H.konst d * (softMaxY - x)
        )

Our gradient is now just softMaxY - x, which I can assure you is much, much simpler than the automatic differentiation-derived gradient. This is because a lot of factors show up on the top and bottom of functions and cancel out, and a lot of positive and negative additions also end up canceling out.

Again, refer to the writing manual gradients page for more information on exactly how to specify your operations with manual gradients.

Once you do this, softMaxCrossEntropy' is now a function you can use normally and compose with other backpropagatable functions. You won't be able to functionally tell apart crossEntropy x . softMax from softMaxCrossEntropy', and the two will behave identically, propagating gradients with other BVar functions:

gradBP ((**2) . crossEntropy (H.vec3 1 0 0) . softMax) (H.vec3 0.9 0.2 0.3) => 
[-0.7314742091958236,0.3474654731904,0.38400873600542373]
gradBP ((**2) . softMaxCrossEntropy (H.vec3 1 0 0)) (H.vec3 0.9 0.2 0.3) => 
[-0.7314742091958236,0.3474654731904,0.38400873600542373]
gradBP ((**2) . softMaxCrossEntropy' (H.vec3 1 0 0)) (H.vec3 0.9 0.2 0.3) => 
[-0.7314742091958235,0.34746547319040005,0.38400873600542373]

softMaxCrossEntropy' will be more efficient in computing gradients.

Again, I don't recommend doing this in most cases, and this should always be a last resort. To me, this is even less warranted than the situation above (mentioning redundant additions) because any losses due to naive AD should be negligible. Only doing this after profiling and benchmarking, when you are sure that a particular function composition is causing your bottleneck. Don't do this for any ol' composition you write, because:

  1. Again, the whole point of this library is to allow you to avoid computing gradients by hand.
  2. Computing gradients by hand is very tricky and there are many places where you could introduce a bug in a subtle way that might not be apparent even through initial testings.
  3. This is very fragile, and any future changes to your function will require you to completely re-compute and re-write your giant lifted function.
  4. It is again much harder to read and understand your code.

But, if you profile and benchmark and conclude that a bad composition is bottleneck, know that this path is available.