Algorithmic Differentiation of QR decompositions

Published in AlgoDiff 2024

Slides from the talk are linked here. They contain some additional context that is not present in the preprint, although some items (such as the flop count derivations) were left out of the talk for reasons of time.

The flop counts and other updates are in the version to be published by Springer.

What is a QR decomposition and why should you care?

A QR decomposition factors a matrix A into an orthogonal (or unitary) factor Q and an upper triangular factor R, and it is a fundamental decomposition in matrix analysis. It appears as a subroutine in eigenvalue and eigenvector calculations, in solving least squares problems, and in many applications built on Krylov subspaces.
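For instance, here is a minimal NumPy sketch (my own, not from the paper or slides) of the least-squares use case: factor A = QR and solve the resulting triangular system.

```python
import numpy as np

# Minimal illustration (not from the paper): solve min_x ||A x - b||_2
# via the reduced QR factorization A = Q R, so x satisfies R x = Q^T b.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))        # tall matrix: 100 observations, 5 unknowns
b = rng.standard_normal(100)

Q, R = np.linalg.qr(A, mode="reduced")   # Q: 100x5 with orthonormal columns, R: 5x5 upper triangular
x_qr = np.linalg.solve(R, Q.T @ b)

# Agrees with the library's least-squares routine
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x_qr, x_ref)
```

Solving via QR avoids forming the normal equations A^T A, which would square the condition number of the problem.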

If you’ve done any matrix algebra calculations in an application, you’ve probably used a QR decomposition somewhere, perhaps without knowing it.

The first portion of the slides gives a brief background on the QR decomposition and how it is typically implemented or derived. The emphasis in the slides is on intuition for the QR calculations rather than formal rigor.

The Results

We present a new equation for propagating a gradient backwards through a compute DAG that contains a QR decomposition. We derive the mathematical equivalence between our formula and an existing one, and we argue that our formula is better and should therefore be preferred.

The new equation is (3.8) and the existing equation is (3.3); they are given on slides 15 and 14 respectively. Although the two are mathematically equivalent (which we also prove), our equation (3.8) is more efficient to compute, so we advocate preferring it over (3.3).
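I won’t reproduce either equation here (see the slides), but the setting is easy to demonstrate. Below is a minimal PyTorch sketch, illustrative only, showing reverse-mode differentiation through a QR node and checking the resulting gradient against finite differences; it relies on the library’s built-in QR backward rather than spelling out either formula.

```python
import torch

# A scalar-valued compute DAG containing a QR node: A -> (Q, R) -> loss.
# Reverse mode must map the incoming adjoints (dloss/dQ, dloss/dR) back to
# dloss/dA, which is the quantity equations (3.3) and (3.8) express.
A = torch.randn(6, 4, dtype=torch.float64, requires_grad=True)  # tall, full rank almost surely

Q, R = torch.linalg.qr(A, mode="reduced")
loss = Q.sum() + (R ** 2).sum()
loss.backward()                       # fills A.grad via the library's QR backward rule
print(A.grad.shape)                   # torch.Size([6, 4])

# Sanity check of the backward rule against finite differences.
def f(M):
    Q, R = torch.linalg.qr(M, mode="reduced")
    return Q.sum() + (R ** 2).sum()

assert torch.autograd.gradcheck(f, (A.detach().clone().requires_grad_(True),))
```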

The Recommendations

Use our formula for matrix gradient backpropagation through QR decompositions. It is more efficient both in theory and in practice: we show simulation studies with timings and derive flop counts for asymptotic comparisons.
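The flop count derivations live in the revised paper rather than here, but on the timing side a harness along the following lines (my own illustrative sketch, not the paper’s benchmark code) is enough to compare shapes; it times a forward plus reverse pass through PyTorch’s QR node.

```python
import time
import torch

def time_qr_backward(m, n, repeats=10):
    """Average wall-clock time of a forward + reverse pass through a QR node."""
    A = torch.randn(m, n, dtype=torch.float64, requires_grad=True)
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        Q, R = torch.linalg.qr(A, mode="reduced")
        (Q.sum() + R.sum()).backward()   # arbitrary scalar loss to trigger backprop
        total += time.perf_counter() - start
        A.grad = None                    # clear the accumulated gradient between runs
    return total / repeats

for shape in [(512, 512), (2048, 512), (4096, 512)]:
    print(shape, f"{time_qr_backward(*shape):.4f} s")
```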

Critiques

Some critiques from the AlgoDiff review and conference:

  1. We did not include flop counts in our comparisons of the two formulae
  2. We needed example matrices for an empirical study
  3. Applications were not fleshed out in enough detail

These three critiques are valid and helpful, and we addressed each one in the revised version that has been submitted for publication. A section with flop count derivations was added; the finding is that our formula has a lower flop count and is therefore faster in general for larger matrices. The flop count derivations are not in the slides, but the empirical results are on slides 21-23 from the talk.

To address the second comment, a subsection with a table was added, comparing against several matrices from the Matrix Market database. This is a standard collection of test matrices and it makes comparisons easy to reproduce. If you’re looking for these results, they are on slide 25 from the talk.

To address the third comment, potential applications were added. The paper is really focused on the theory, on demonstrating and arguing that our formula is better, rather than on applications; however, some applications were included in the revised version. For example, the occurrence of wide matrices was a point that was raised: some discussants suggested that these matrices are rare. In the revised version I discuss when they occur. For instance, any physics application where the Buckingham Pi theorem is used involves a wide matrix, because the dimensional matrix has one row per base dimension and one column per physical variable, and there are usually far more variables than dimensions. The theorem applies in many applications, and our results let you propagate a matrix derivative through a compute DAG that leverages the Buckingham Pi theorem. Common physics applications that leverage it include computational fluid dynamics, amongst many others.
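As a concrete illustration (my own, not taken from the paper), here is the dimensional matrix for the classic pipe-flow example: three base dimensions against five physical variables gives a 3x5 (wide) matrix, and the Buckingham Pi theorem says the number of dimensionless groups is the number of variables minus its rank.

```python
import numpy as np

# Dimensional matrix for the classic pipe-flow example (illustrative only).
# Rows: base dimensions (M, L, T). Columns: variables (dp, rho, mu, v, d).
# Exponents: dp ~ M L^-1 T^-2, rho ~ M L^-3, mu ~ M L^-1 T^-1, v ~ L T^-1, d ~ L.
dim_matrix = np.array([
    [ 1,  1,  1,  0,  0],   # mass M
    [-1, -3, -1,  1,  1],   # length L
    [-2,  0, -1, -1,  0],   # time T
], dtype=float)              # 3 x 5: wide, as is typical for these problems

n_vars = dim_matrix.shape[1]
rank = np.linalg.matrix_rank(dim_matrix)
print("dimensionless groups:", n_vars - rank)   # 5 - 3 = 2 (e.g. Reynolds and Euler numbers)
```

A null-space basis of this matrix, which can be obtained from a factorization of its transpose, gives the exponents of the dimensionless groups, so a pipeline that constructs these groups contains exactly the kind of wide-matrix node the paper is concerned with.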

Also, you see wide matrices in a lot of deep learning workflows, where you multiply a tall matrix by another matrix to get a less tall (or square) matrix. Leveraging our results can make these nodes in the DAG much more efficient.

Conclusion

If you are writing gradient propagation software that supports matrices, then when you develop support for QR decompositions I’d recommend reading our paper and leveraging our findings.

If you are interested in doing further research in this area, feel free to drop me an email. I have some ideas and am always happy to speak to potential research partners, both for theoretical research (new formulae and algorithms) and for applications (using QR and backprop in your application).

If you are using PyTorch or TensorFlow, you already have access to these algorithms; we contributed them as part of this work.
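For instance (an illustrative sketch of my own, shown for a tall matrix), differentiating a QR-based least-squares solve with respect to the input matrix works out of the box in PyTorch:

```python
import torch

# Illustrative only: a least-squares solve via QR inside an autograd DAG.
A = torch.randn(50, 8, dtype=torch.float64, requires_grad=True)
b = torch.randn(50, dtype=torch.float64)

Q, R = torch.linalg.qr(A, mode="reduced")   # QR node in the compute DAG
x = torch.linalg.solve(R, Q.T @ b)          # (a dedicated triangular solve would be the efficient choice)
residual = (A @ x - b).pow(2).sum()

residual.backward()                         # gradient flows back through the QR node
print(A.grad.norm())
```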

This work has been used in several subsequent papers, including a nice paper at NeurIPS 2024 on scaling gradients of functions of matrices. In those setups you often use eigenvalue and eigenvector calculations as subroutines, and those in turn often use QR.

Updates

Once the submitted paper is published by Springer, I’ll add a link here.

Recommended citation: Roberts D, Roberts L. QR and LQ Decomposition Matrix Backpropagation Algorithms for Square, Wide, and Deep - Real or Complex - Matrices and Their Software Implementation. arXiv preprint arXiv:2009.10071. 2024
Download Paper