√(2/ℓ)·∥A∥_F with high probability. Typically in computer science we are given an error tolerance ε and want to know how many matrix-vector products we need to compute (i.e. how small can we make ℓ while still being accurate?). In order to build a guarantee here, we assume A is positive semi-definite (PSD), which in turn implies that ∥A∥_F ≤ tr(A). With this in mind, we can prove the standard result for Hutchinson's Estimator:

|H_ℓ(A) − tr(A)| ≤ √(2/ℓ)·∥A∥_F        (line 1)
                 ≤ √(2/ℓ)·tr(A)         (line 2)
                 ≤ ε·tr(A)              (line 3)

where line 1 holds with high probability, line 2 uses ∥A∥_F ≤ tr(A), and line 3 follows by setting ℓ = O(1/ε²).
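As a concrete reference point, H_ℓ(A) can be sketched in a few lines of NumPy (the function name and interface here are our own, not from any particular library):

```python
import numpy as np

def hutchinson(matvec, n, ell, rng=None):
    """H_ell(A): average of g^T A g over ell Gaussian probes g ~ N(0, I_n).

    `matvec` computes A @ v, so the estimator uses exactly ell
    matrix-vector products with A.
    """
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for _ in range(ell):
        g = rng.standard_normal(n)
        total += g @ matvec(g)  # g^T A g
    return total / ell
```

For any symmetric A this is an unbiased estimate of tr(A), with variance (2/ℓ)·∥A∥_F² when Gaussian probes are used.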
- The analysis of line 1 is always tight, since we know the variance of H_ℓ(A) exactly.
- The analysis of line 3 is always tight, since we just set ℓ to the smallest value that achieves error ε.
- The analysis of line 2 is not always tight.
The second line is tight only if ∥A∥_F ≈ tr(A). That is, the analysis above only tells us that Hutchinson's Estimator needs O(1/ε²) matrix-vector products if A is the kind of matrix that has ∥A∥_F ≈ tr(A). So, what kind of matrix is that?
If A is the kind of matrix that is hard for Hutchinson's Estimator to handle, then tr(A) is well approximated by the sum of the top few eigenvalues of A.
This leads us to pick the following rough design of an algorithm:

1. Find a good low-rank approximation Ã_k.
2. Notice that tr(A) = tr(Ã_k) + tr(A − Ã_k).
3. Compute tr(Ã_k) exactly.
4. Approximate tr(A − Ã_k) with Hutchinson's Estimator.
In the next section, we state a formal version of Claim 1 (see Lemma 3), show how to compute such a matrix Ã_k (see QQ⊺A in Theorem 1), and bound the complexity of the Hutch++ estimator (see Theorem 2).
Special thanks to Joel A. Tropp for working most of this out, and encouraging us to look into sharpening the constants in the analysis of Hutch++.
By expressing the variance of Hutch++ with all of its constants laid bare, and by using a very simple analysis, this page will hopefully allow practitioners to easily choose the exact parameters in Hutch++ code.
Before bounding the variance of Hutch++, we include the proof of one lemma and import another theorem:
Let A∈R^{n×n} be a PSD matrix. Let A_k be the best rank-k approximation to A. Then, ∥A−A_k∥_F ≤ tr(A)/(2√k).
Proof. Let v = [λ_1, …, λ_n] be the vector of eigenvalues of A, sorted so that λ_1 ≥ λ_2 ≥ … ≥ λ_n ≥ 0. Then v_k := [λ_1, …, λ_k, 0, …, 0] is the vector of eigenvalues of A_k. With this notation, we just need to prove that ∥v−v_k∥_2 ≤ ∥v∥_1/(2√k). The rest of this proof is just Lemma 7 from this paper.
Hölder's inequality states that ∑_{i=1}^n a_i·b_i ≤ ∥a∥_p·∥b∥_q for any vectors a, b, so long as 1/p + 1/q = 1. If we take a = b = v−v_k, p = 1, and q = ∞, we find that ∥v−v_k∥_2² ≤ ∥v−v_k∥_1·∥v−v_k∥_∞. Further, note that by our construction of v_k, ∥v∥_1 = ∥v_k∥_1 + ∥v−v_k∥_1. Then, since ∥v−v_k∥_∞ = λ_{k+1} ≤ (1/k)∥v_k∥_1, and since the product of two nonnegative numbers with a fixed sum is at most the square of half that sum,

∥v−v_k∥_2² ≤ ∥v−v_k∥_1 · (1/k)∥v_k∥_1 ≤ (1/k)·(∥v∥_1/2)² = ∥v∥_1²/(4k).

Taking square roots completes the proof. ∎
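The inequality is easy to sanity-check numerically. The following sketch (ours, not from the paper) verifies ∥v−v_k∥_2 ≤ ∥v∥_1/(2√k) on random nonnegative sorted vectors:

```python
import numpy as np

# Empirically check ||v - v_k||_2 <= ||v||_1 / (2 sqrt(k)) for random
# eigenvalue vectors sorted so that lambda_1 >= ... >= lambda_n >= 0.
rng = np.random.default_rng(0)
worst_ratio = 0.0
for _ in range(100):
    v = np.sort(rng.exponential(size=50))[::-1]
    for k in range(1, 50):
        lhs = np.linalg.norm(v[k:])           # ||v - v_k||_2
        rhs = v.sum() / (2 * np.sqrt(k))      # ||v||_1 / (2 sqrt(k))
        worst_ratio = max(worst_ratio, lhs / rhs)
print(worst_ratio)  # never exceeds 1, as the lemma guarantees
```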
Sample S∈R^{n×(k+p)} with i.i.d. N(0,1) entries. Let Q be any orthonormal basis for the columns of AS (e.g. from a QR decomposition). Then, E[∥(I−QQ⊺)A∥_F²] ≤ (1 + k/(p−1))·∥A−A_k∥_F².
Proof. This is almost exactly the statement of Theorem 10.5 in Halko et al. (2011), but not exactly: Theorem 10.5 bounds the expected Frobenius norm (not its square). Instead, at the bottom of page 57 and the top of page 58 of this version of the paper, we see that the proof of Theorem 10.5 actually proves

E[∥(I−QQ⊺)A∥_F²] ≤ (1 + k/(p−1))·∥Σ_2∥_F²,

where ∥Σ_2∥_F² = ∥A−A_k∥_F², completing the import of this theorem. ∎
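The construction in Theorem 1 is the standard randomized range finder from Halko et al.; a minimal sketch (function name ours) looks like this:

```python
import numpy as np

def range_finder(A, k, p, rng=None):
    """Return Q, an orthonormal basis for the range of A @ S,
    where S is an n x (k+p) matrix with i.i.d. N(0,1) entries."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    S = rng.standard_normal((n, k + p))
    Q, _ = np.linalg.qr(A @ S)  # costs k+p matrix-vector products with A
    return Q
```

On average, the residual ∥(I−QQ⊺)A∥_F² is then within a (1 + k/(p−1)) factor of the best rank-k error.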
Now, we proceed to the analysis of the variance of Hutch++:
Fix parameters k and ℓ. Let p = k+1 and construct Q∈R^{n×(k+p)} = R^{n×(2k+1)} as in Theorem 1. Then let

Hutch++(A) := tr(Q⊺AQ) + H_ℓ((I−QQ⊺)A)
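Putting the pieces together, here is a minimal NumPy sketch of the estimator just defined (interface ours; it assumes a dense symmetric A for simplicity):

```python
import numpy as np

def hutch_plus_plus(A, k, ell, rng=None):
    """Hutch++(A) = tr(Q^T A Q) + H_ell((I - Q Q^T) A), with p = k + 1."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    S = rng.standard_normal((n, 2 * k + 1))  # k + p = 2k + 1 sketch columns
    Q, _ = np.linalg.qr(A @ S)               # orthonormal basis for range(A S)
    exact_part = np.trace(Q.T @ (A @ Q))     # tr(Q^T A Q), computed exactly
    total = 0.0
    for _ in range(ell):                     # Hutchinson on (I - Q Q^T) A
        g = rng.standard_normal(n)
        Ag = A @ g
        total += g @ (Ag - Q @ (Q.T @ Ag))   # g^T (I - Q Q^T) A g
    return exact_part + total / ell
```

The total cost is (2k+1) products to form AS, (2k+1) to form AQ, and ℓ for the Hutchinson phase, matching the budget accounting below.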
We first look at the expectation by using the Tower Law, the expectation of Hutchinson's Estimator, the cyclic property of trace, and the linearity of trace:

E[Hutch++(A)] = E[tr(Q⊺AQ) + H_ℓ((I−QQ⊺)A)]
             = E[tr(Q⊺AQ)] + E[E[H_ℓ((I−QQ⊺)A) | Q]]
             = E[tr(Q⊺AQ)] + E[tr((I−QQ⊺)A)]
             = E[tr(AQQ⊺)] + E[tr(A − QQ⊺A)]
             = E[tr(AQQ⊺) + tr(A − QQ⊺A)]
             = tr(A)

We now move on to the variance bound. First, recall the Conditional Variance Formula, which says that for any random variables X, Y,

Var[Y] = E[Var[Y | X]] + Var[E[Y | X]]
Taking Y = Hutch++(A) and X = Q, we can bound the variance of Hutch++:

Var[Hutch++(A)] = E[Var[Hutch++(A) | Q]] + Var[E[Hutch++(A) | Q]]
The second term above is always zero:

Var[E[Hutch++(A) | Q]] = Var[E[tr(Q⊺AQ) + H_ℓ((I−QQ⊺)A) | Q]]
                       = Var[tr(Q⊺AQ) + tr((I−QQ⊺)A)]
                       = Var[tr(A)]
                       = 0

So, we just need to bound the first term from the Conditional Variance Formula:

E[Var[Hutch++(A) | Q]] = E[Var[tr(Q⊺AQ) + H_ℓ((I−QQ⊺)A) | Q]]
                       = E[Var[H_ℓ((I−QQ⊺)A) | Q]]
                       ≤ (2/ℓ)·E[∥(I−QQ⊺)A∥_F²]
                       ≤ (2/ℓ)·(1 + k/(p−1))·∥A−A_k∥_F²

where the first inequality uses the variance of Hutchinson's Estimator, and the second uses Theorem 1. Recall that we set p = k+1. After substituting and simplifying the expression, we find

Var[Hutch++(A)] ≤ (4/ℓ)·∥A−A_k∥_F² ≤ tr²(A)/(kℓ)

where the last inequality uses Lemma 3.
The variance in Theorem 2 still has two parameters in it, which leave a bit of ambiguity for practitioners. So, we quickly analyze the choice of k and ℓ that minimizes our bound on the variance.
Formally, suppose we are allowed to compute exactly m matrix-vector products with A. Then,
- Computing Q uses k+p = 2k+1 products
- Computing tr(Q⊺AQ) uses k+p = 2k+1 products
- Computing H_ℓ((I−QQ⊺)A) uses ℓ products
So, we have m = 2(2k+1) + ℓ = 4k + 2 + ℓ. We can then verify that k = (m−2)/8 and ℓ = (m−2)/2 minimize the variance. Notably, this is equivalent to setting ℓ = 4k, which looks more intuitive. This produces a final variance of

Var[Hutch++(A)] ≤ 16·tr²(A)/(m−2)².
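In code, this budget split amounts to a small helper (the function name and the choice of rounding are our own):

```python
def hutch_pp_params(m):
    """Split a budget of m matrix-vector products into (k, ell),
    following k = (m - 2)/8 and ell = m - 2(2k + 1), i.e. roughly ell = 4k."""
    k = max(1, round((m - 2) / 8))
    ell = max(1, m - 2 * (2 * k + 1))  # leftover budget goes to Hutchinson
    return k, ell
```

For example, a budget of m = 34 products yields k = 4 and ℓ = 16 = 4k.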
We analyze variance on this page for convenience, but all of these guarantees also hold with high probability (i.e. with only logarithmic dependence on the failure probability); this is analyzed in the formal publications.