The randomized truncated SVD
Or: How I learned to stop worrying and love the possibility of failure
There are many things that make a great algorithm: Providing a solution to an actual problem is definitely important. Additional points if the algorithm is fast or easy to explain. However, not many algorithms contain a piece of life advice.
In this post, I'll talk about one algorithm that ticks all these boxes: the randomized truncated singular-value decomposition (SVD). I'll cover the idea and a basic implementation in 15 lines of Python.
Introduction
The singular value decomposition of an $m \times n$ real matrix $M$ is a factorization
$$ M = U \Sigma V^\mathrm{T}, $$ where $\Sigma$ is an $r \times r$ diagonal matrix with non-negative entries called singular values (this is the reduced SVD, with $r = \min(m, n)$) and $V^\mathrm{T}$ denotes the transpose of $V$. Additionally, the columns of $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$ are orthonormal and are called left-singular vectors and right-singular vectors, respectively.
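To make the definition concrete, here is a quick numerical check (a sketch using NumPy; the truncated implementation later in this post uses PyTorch, but the factorization is the same):
import numpy as np

M = np.random.randn(6, 4)                         # an example m x n matrix
U, s, Vt = np.linalg.svd(M, full_matrices=False)  # reduced SVD with r = min(m, n)
print(U.shape, s.shape, Vt.shape)                 # (6, 4) (4,) (4, 4)
print(np.allclose(M, U @ np.diag(s) @ Vt))        # the factorization holds: True
print(np.allclose(U.T @ U, np.eye(4)))            # columns of U are orthonormal: True
print(np.allclose(Vt @ Vt.T, np.eye(4)))          # columns of V are orthonormal: True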
One common use case of the SVD is principal component analysis (PCA), one of the most fundamental dimensionality reduction techniques. The PCA can be computed from the SVD of the centered data matrix $X$, which has $N$ rows, each representing one sample of an $n$-dimensional feature vector. $X$ being centered means that the column means have been subtracted and are now equal to zero.
The relationship between SVD and PCA can be summarized as follows: Denote by $\sigma_1$ the largest singular value of $\frac{1}{\sqrt{N - 1}} X$ and by $V_1$ the corresponding right-singular vector. Then, $V_1$ gives the direction with the largest variance of the data and $\sigma_1$ is the explained variance in this direction. The right-singular vector $V_2$ of the next-largest singular value then represents the direction of largest variance in the dataset not explained by $V_1$, etc. In the case of a dataset with only $n = 2$ features, we can visualize these directions together with the original data points:
from collections import namedtuple

import altair as alt
import numpy as np
import pandas as pd
from scipy.stats import multivariate_normal

N = 200
# The samples are drawn from a zero-mean distribution, so the data is (approximately) centered already
samples = multivariate_normal(cov=np.array([[1, 0.4], [0.4, 1]])).rvs(size=N)
df = pd.DataFrame({"x": samples[:, 0], "y": samples[:, 1]})

SVD = namedtuple("SVD", "U, S, V")
svd = SVD(*np.linalg.svd(samples / np.sqrt(N - 1)))

scatter_plot = (
    alt.Chart(df)
    .mark_point(shape="cross", size=40)
    .encode(
        x=alt.X("x", scale=alt.Scale(domain=(-2, 2))),
        y=alt.Y("y", scale=alt.Scale(domain=(-2, 2))),
    )
).properties(
    width=400,
    height=400,
).interactive()

df = pd.DataFrame({
    "x": [0, svd.S[0] * svd.V[0, 0], 0, svd.S[1] * svd.V[1, 0]],
    "y": [0, svd.S[0] * svd.V[0, 1], 0, svd.S[1] * svd.V[1, 1]],
    "Legend": ["σ₁V₁", "σ₁V₁", "σ₂V₂", "σ₂V₂"],
})

arrow_plot = (
    alt.Chart(df).mark_line().encode(
        x="x", y="y", detail="Legend", color="Legend"
    )
).interactive()

alt.layer(scatter_plot, arrow_plot)
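As a small numerical sanity check of this relationship (a sketch reusing samples and N from the cell above): the squared singular values of the centered data matrix divided by $\sqrt{N - 1}$ should coincide with the eigenvalues of the sample covariance matrix.
X = samples - samples.mean(axis=0)                               # explicitly center the data
_, s, _ = np.linalg.svd(X / np.sqrt(N - 1), full_matrices=False)
cov_eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # eigenvalues, largest first
print(np.allclose(s**2, cov_eigvals))                            # True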
For higher dimensional datasets, the first few largest singular values of $X$ and the corresponding singular vectors often capture a large fraction of the structure of $X$. This, combined with its simplicity, makes PCA such a widely used tool for dimensionality reduction and data visualization.
The example of PCA highlights that for many use cases, computing the full SVD is unnecessary. Instead, the truncated SVD, i.e. computing only the largest singular values and the corresponding singular vectors, is often sufficient. By computing only a small subset of the full SVD, the truncated SVD can also be much faster. However, efficient algorithms for the truncated SVD such as Krylov subspace methods tend to be complex and challenging to implement. That's why most scientific computing packages, including SciPy, wrap libraries like ARPACK -- a more than 20 year old Fortran77 implementation.
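For reference, using that wrapped implementation from SciPy looks roughly like this (a sketch; svds does not guarantee a particular ordering of the returned singular values, so we sort them ourselves):
import numpy as np
from scipy.sparse.linalg import svds

A = np.random.randn(1000, 30) @ np.random.randn(30, 1000)  # a large, (numerically) low-rank matrix
U, s, Vt = svds(A, k=5)                                     # five largest singular values/vectors
order = np.argsort(s)[::-1]                                 # sort them in descending order
U, s, Vt = U[:, order], s[order], Vt[order, :]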
This is where the randomized truncated SVD gets to shine: not only can we implement a basic version in 15 lines of Python, that implementation also performs just as well as the much more intricate algorithm used in SciPy. And as the cherry on top, we even benefit from GPU acceleration using the exact same implementation in PyTorch.
Randomized Truncated SVD
The idea behind randomized matrix decompositions is quite simple [1]: First, we need to find an $m \times p$ matrix $Q$ with orthonormal columns such that $$ M \approx Q Q^{\mathrm{T}} M. $$ We'll refer to this as the Q-condition. Next, we compute the (full) SVD of $B$: $$ B = Q^{\mathrm{T}} M = \hat U \Sigma V^{\mathrm{T}}. $$ The Q-condition now implies that $U = Q \hat U$ together with $\Sigma$ and $V$ is an SVD for $M$ (up to the approximation error in the Q-condition): $M \approx Q Q^{\mathrm{T}} M = Q B = Q \hat U \Sigma V^{\mathrm{T}} = U \Sigma V^{\mathrm{T}}$, and the columns of $U = Q \hat U$ are orthonormal because those of $Q$ and $\hat U$ are. To put this into words: we use the matrix $Q^{\mathrm{T}}$ to "project" $M$, compute the standard SVD of the projection, and then transform the left-singular vectors back using $Q$.
So why would this algorithm be any faster than simply computing the full SVD, especially since the second step computes the full SVD of $B$? Indeed, the version presented above can only provide a speedup if $p$ is much smaller than the dimensions of $M$, so that $B$ has far fewer entries than $M$. To make things worse, it turns out that without further restrictions on $M$, the Q-condition is actually impossible [2] to fulfill unless $p \ge n$. However, if we are only interested in the $k$ largest singular values and the matrix $M$ fulfills certain technical conditions, the following construction fulfills a (slightly modified) version of the Q-condition for $p = k + 10$ with high probability [1]:
import torch

def randomized_range_finder(M, *, size, n_iter, n_oversamples=10):
    # Initialize Q randomly with normally distributed components
    Q = torch.randn(M.shape[1], size + n_oversamples, device=M.device)
    # Perform a few power iterations
    for _ in range(n_iter):
        Q = M.transpose(-1, -2) @ (M @ Q)
    # Orthonormalize the columns of Q
    Q, _ = torch.linalg.qr(M @ Q)
    return Q
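As a quick sanity check (a sketch with a synthetic rank-5 matrix), we can verify numerically that the returned $Q$ has orthonormal columns and satisfies the Q-condition:
A, B = torch.randn(500, 5), torch.randn(5, 300)
M = A @ B / 500                                   # rank-5 matrix with singular values of order one
Q = randomized_range_finder(M, size=5, n_iter=7)

print(Q.shape)                                                            # torch.Size([500, 15])
print(torch.allclose(Q.transpose(-1, -2) @ Q, torch.eye(15), atol=1e-4))  # orthonormal columns: True
print(torch.linalg.norm(M - Q @ Q.transpose(-1, -2) @ M) / torch.linalg.norm(M))  # relative error close to zero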
As previously mentioned, the matrix $Q$ returned by randomized_range_finder only needs $p$ to be marginally larger than the number of singular values $k$ we are trying to compute.
This results in a tremendous speed-up compared to the full SVD whenever $k$ is much smaller than $n$, as is commonly the case, e.g. for PCA.
We will confirm this in the experiments section below.
Before we discuss the elephant in the room, namely the fact that the above statement only holds with high probability, let us complete the implementation by turning the truncated SVD algorithm from the beginning of this section into code:
SVD = namedtuple("SVD", "U, S, V")

def randomized_svd(M, n_components):
    Q = randomized_range_finder(M, size=n_components, n_iter=7)
    # Project M
    B = Q.transpose(-1, -2) @ M
    # Compute the full SVD of B
    Uhat, s, Vt = torch.linalg.svd(B)
    # Un-project U
    U = Q @ Uhat
    return SVD(U[:, :n_components], s[:n_components], Vt[:n_components, :])
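Before looking at the failure probability, here is a minimal usage sketch that checks that the truncated factors reconstruct a low-rank matrix (note that, just like torch.linalg.svd, the field named V of the returned namedtuple holds $V^{\mathrm{T}}$):
A = torch.randn(800, 5)
M = A @ A.transpose(-1, -2) / 800                      # rank-5 test matrix
svd = randomized_svd(M, n_components=5)
M_approx = svd.U @ torch.diag(svd.S) @ svd.V           # the V field holds Vᵀ
print(torch.linalg.norm(M - M_approx) / torch.linalg.norm(M))  # small, i.e. exact up to floating point error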
The qualifier "with high probability" for randomized_range_finder
implies that there is a small chance that $Q$ computed in this way does not fulfill the Q-condition and the resulting truncated SVD is wrong.
This probabilistic nature is in stark contrast to the majority of linear algebra algorithms, which are deterministic.
It also implies that we can never be 100% sure that the result we obtained is correct.
However, the reason why the randomized truncated SVD is so powerful in practice is that we have full control over its failure probability:
We can make the failure probability smaller by increasing the constant in the formula for $p = k + \ldots$.
In fact, the choice $p = k + 10$ ensures that the probability of failure is so small that it can be considered to be zero for all practical purposes.
In other words, even though there is a small chance of the algorithm failing, it is so small that we will never encounter this in practice.
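In terms of the code above, this safety margin is just the n_oversamples argument of randomized_range_finder; a minimal sketch of dialing it up (at the cost of a slightly larger $Q$) could look like this:
M = torch.randn(1000, 20) @ torch.randn(20, 1000) / 1000   # some test matrix
# p = k + 30 instead of the default p = k + 10: an even smaller failure probability
Q = randomized_range_finder(M, size=10, n_iter=7, n_oversamples=30)
print(Q.shape)                                             # torch.Size([1000, 40])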
And this is where a simple algorithm turns into life advice: If we require a solution for the truncated SVD that's perfect – both fast and 100% reliable – we end up with a very complex solution. The kind of solution I would have never chosen to write a blog post about voluntarily. However, by giving ourselves a little bit of room for failure, we end up with a much more elegant and simple solution. And by controlling the probability of these failure cases, we can make sure that they remain a theoretical possibility only and that they will never occur in practice.
Testing it out
As mentioned above, the randomized truncated SVD as implemented here only works if the matrix $M$ fulfills certain technical conditions. The most important one is that the singular values we're discarding are small enough to be clearly separated from the singular values we're interested in. In practice this is often the case, since otherwise we would not want to discard those singular values in the first place (e.g. in the case of PCA). There are also further improvements to the algorithm that make it more stable when this is not the case, so those versions rely less on this assumption (see for example the implementation used in scikit-learn).
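If you need that extra robustness in practice, a reasonable option is to simply call scikit-learn's implementation (a sketch, assuming scikit-learn is installed; keyword defaults may differ between versions):
import numpy as np
from sklearn.utils.extmath import randomized_svd as sk_randomized_svd

A = np.random.randn(1000, 20) @ np.random.randn(20, 1000)
U, s, Vt = sk_randomized_svd(A, n_components=5, n_iter=7, random_state=0)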
Here, we take the easy way out and simply generate matrices that fulfill these conditions using the following function:
def sample_lowrank_matrix(size, max_rank, *, noise=0.0):
    # Low-rank part A Aᵀ (at most max_rank non-zero singular values) plus optional noise
    A = torch.randn(size, max_rank)
    M = A @ A.transpose(-1, -2) / size + noise * torch.randn(size, size)
    return M
These are matrices that have only max_rank non-zero singular values when noise == 0.0. If we set noise to some non-zero value, but keep it small enough, the singular values will still be sufficiently different for the truncated algorithm to work well.
M = sample_lowrank_matrix(1000, 5)
svd_random = randomized_svd(M, 10)
# Wrap in our SVD namedtuple so the helper functions below can access the .V field
svd_standard = SVD(*torch.linalg.svd(M))
def to_numpy(svd: SVD):
    return SVD(*map(lambda t: t.cpu().numpy(), svd))

def svd_flip(svd: SVD):
    # Resolve the sign ambiguity of the singular vectors so that two SVDs can be compared
    max_abs_cols = torch.argmax(torch.abs(svd.U), dim=0)
    signs = torch.sign(svd.U[max_abs_cols, torch.arange(svd.U.shape[1])])
    svd.U[:] *= signs
    svd.V[:] *= signs[:, None]
    return svd
def plot_singular_values(limit_ranks, **kwargs):
    kwargs = {name: to_numpy(val) for name, val in kwargs.items()}
    data = [
        pd.DataFrame(
            {
                "Rank": np.arange(len(svd.S))[:limit_ranks],
                "Value": svd.S[:limit_ranks],
                "Algorithm": name,
            }
        )
        for name, svd in kwargs.items()
    ]
    data = pd.concat(data)
    return (
        alt.Chart(data)
        .mark_point(size=40)
        .encode(
            x="Rank",
            y=alt.Y("Value", axis=alt.Axis(title="Singular Value")),
            color="Algorithm",
            shape="Algorithm",
            tooltip=["Algorithm", "Value"],
        )
    ).interactive()
def plot_singular_vector_errors(svd1, svd2, *, limit_ranks):
    svd1, svd2 = to_numpy(svd_flip(svd1)), to_numpy(svd_flip(svd2))
    errors = np.linalg.norm(svd1.U[:, :limit_ranks] - svd2.U[:, :limit_ranks], axis=0)
    data = pd.DataFrame(
        {
            "Rank": np.arange(len(errors)),
            "Error": errors,
        }
    )
    return (
        alt.Chart(data)
        .mark_line(color="gray", strokeDash=[2, 2])
        .encode(
            x="Rank",
            y=alt.Y(
                "Error",
                axis=alt.Axis(title="Singular Vector Error", titleColor="gray"),
                # scale=alt.Scale(domain=(0, 1.414))
            ),
            tooltip=["Error"],
        )
    ).interactive()
alt.layer(
    plot_singular_values(limit_ranks=10, **{"Randomized SVD": svd_random, "SVD": svd_standard}),
    plot_singular_vector_errors(svd_random, svd_standard, limit_ranks=10)
).resolve_scale(y="independent")
This plot shows the singular values in descending order on the left y-axis, with their rank on the x-axis.
We compute the singular values either with our randomized truncated SVD implementation (blue circles) or with the full SVD (orange squares).
Both methods agree very well.
Since we set max_rank for the matrix M to 5, only the first five singular values are non-zero.
The dotted line, which is scaled w.r.t. the right y-axis, shows the distance between the singular vectors computed with each method. As long as the singular values are non-zero, both methods agree very well. However, for the vanishing singular values, the corresponding singular vectors show a very large discrepancy. This is because there is no longer a natural order for the vanishing singular values, so the $i$-th singular vector computed with one method need not match the $i$-th singular vector computed with a different method [3].
Finally, let us compare the performance of the different approaches. The plot below shows the runtime for a variety of matrix sizes for the following approaches:
- "cpu" is the randomized algorithm implemented above run on CPU
- "gpu" is the randomized algorithm implemented above run on GPU
- "full" is the standard SVD included in pytorch run on CPU
- "scipy" is the truncated SVD
scipy.sparse.svds
All runtimes were measured on a Colab GPU instance with an Nvidia T4 and 2 CPU cores. You may need to zoom out to see the times for "full" at the largest matrix sizes.
from collections import defaultdict
from scipy.sparse.linalg import svds

matrix_sizes = [2**8, 2**9, 2**10, 2**11, 2**12]
results = defaultdict(list)
results["sizes"] = matrix_sizes
n_components = 2

for size in matrix_sizes:
    M = sample_lowrank_matrix(size, 3)
    time = %timeit -o -q randomized_svd(M, n_components=n_components)
    results["cpu"].append(time.best)

for size in matrix_sizes:
    M = sample_lowrank_matrix(size, 3)
    time = %timeit -q -o torch.svd(M)
    results["full"].append(time.best)

for size in matrix_sizes:
    M = sample_lowrank_matrix(size, 3).numpy()
    time = %timeit -q -o svds(M, k=n_components)
    results["scipy"].append(time.best)

if torch.cuda.is_available():
    for size in matrix_sizes:
        M = sample_lowrank_matrix(size, 3)
        time = %timeit -q -o randomized_svd(M.cuda(), n_components=n_components)
        results["gpu"].append(time.best)
df = pd.DataFrame({**results, "k": n_components})
value_vars = list(filter(lambda s: s not in {"sizes", "k"}, df.columns))
df = df.melt(id_vars=["sizes", "k"], value_vars=value_vars)
df.rename(columns={"value": "runtime", "variable": "algorithm"}, inplace=True)
(
    alt.Chart(df)
    .mark_point(size=40)
    .encode(
        x=alt.X("sizes", scale=alt.Scale(type="sqrt", zero=False), axis=alt.Axis(title="Matrix Size")),
        y=alt.Y("runtime", scale=alt.Scale(domain=(0, 0.55)), axis=alt.Axis(title="Runtime [s]")),
        color="algorithm",
        shape="algorithm",
    )
).interactive()
Note that for all truncated SVDs, we set $k=2$, that is, we only computed the two largest singular values. Naturally, for larger values of $k$ the difference between the full SVD and the truncated versions would become smaller.
We clearly save a lot of time simply by using an algorithm that only computes the necessary singular values and vectors. For the largest matrix size, we see a speedup of over 50x between the full SVD and our simple implementation run on CPU. When we compare the slowest version to the randomized truncated SVD run on GPU, we see an even larger speedup of over 400x.
Conclusion
I hope this blog post could convey why I consider the randomized truncated SVD my favorite algorithm. It solves a very common and important problem, it's fast, it easily runs on GPUs, and it's simple enough to explain in a few minutes. But the most important factor is the insight it provides into the power of probabilistic algorithms in general: if we give up the requirement that an algorithm should work 100% of the time in theory, we have a lot to gain in simplicity and performance. And as long as failure is unlikely enough, it will not make a difference in practice. So stop worrying and start loving the possibility of failure.
1. N. Halko, P.-G. Martinsson, and J. A. Tropp, “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions,” arXiv:0909.4061 [math], Dec. 2010.
2. One way to show this is to plug in the identity matrix for $M$.
3. In this situation, the singular values are called degenerate. This is an extreme case of singular values not being “distinguishable enough”.