My focus is on the fast matrix multiplication algorithm for the congested clique model, which is described in Section 2 of this work. I was intrigued by how an $n^\omega$-line bilinear program for multiplying $n$-by-$n$ matrices yields round complexity $n^{1-(2/\omega)}$ for $n$ processors (rather than the natural $n^{\omega-2}$ bound).
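For concreteness, plugging the current bound $\omega < 2.3728639$ into the two exponents shows how far apart they are:
\[ 1 - \frac{2}{\omega} \;<\; 1 - \frac{2}{2.3728639} \;<\; 0.158, \qquad \text{whereas} \qquad \omega - 2 \;<\; 0.373 . \]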
In a nutshell, the answer is that the same bilinear program is applicable to multiplying matrices of submatrices, where the basic operation is multiplication of submatrices. Specifically, for $d$ that divides $n$, a bilinear program for multiplying $d$-by-$d$ matrices is applicable to multiplying $n$-by-$n$ matrices, viewed as $d$-by-$d$ matrices with elements that are $n/d$-by-$n/d$ matrices.
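As a concrete (if standard) illustration of this point, not taken from the paper: the following Python/NumPy sketch applies Strassen's $7$-multiplication bilinear program for $2$-by-$2$ matrices verbatim to $n$-by-$n$ matrices viewed as $2$-by-$2$ matrices of $(n/2)$-by-$(n/2)$ blocks (here $d=2$, and the seven `@` products are the basic submatrix multiplications; the function name is mine).

```python
import numpy as np

def strassen_on_blocks(A, B):
    """One level of Strassen's 2x2 bilinear program, applied to n-by-n
    matrices viewed as 2-by-2 matrices of (n/2)-by-(n/2) blocks.
    The 7 block products are the 7 lines of the bilinear program."""
    n = A.shape[0]
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # Each product is (linear comb. of A-blocks) @ (linear comb. of B-blocks).
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    # Each output block is a linear combination of the 7 products.
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.randn(8, 8)
B = np.random.randn(8, 8)
assert np.allclose(strassen_on_blocks(A, B), A @ B)
```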
Now, setting $d$ such that the bilinear program has length at most $n$ (i.e., $d = n^{1/\omega}$, so that the program's $d^\omega$ multiplications number at most $n$), the different multiplications can be assigned to different processors. Thus, the foregoing task reduces to computing $\max(n,d^2)$ different linear combinations of $\max(d^2,n)$ different $(n/d)^2$-long Boolean vectors: when encoding, $n$ combinations of the $d^2$ input submatrices, and when decoding, $d^2$ combinations of the $n$ products. The latter task can be performed by splitting each vector into $n$ pieces of length $n/d^2$ and assigning each processor the task of computing the corresponding piece of each linear combination; that is, each processor computes $\max(n,d^2)$ different linear combinations of $\max(d^2,n)$ different $(n/d^2)$-long Boolean vectors, which requires a redistribution of the input bits. The latter redistribution can be performed in $O(n/d^2)$ rounds, which yields the stated bound, since $d = n^{1/\omega}$ implies $n/d^2 = n^{1-(2/\omega)}$.
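The following sketch (mine, and a model of the computation only, not of the actual message scheduling) makes this reduction explicit for an arbitrary length-$m$ bilinear program, given by hypothetical coefficient tensors `U`, `V`, `W`: the encode step forms the $m$ left/right linear combinations of the $d^2$ input blocks, product $t$ is the work that the congested-clique algorithm would assign to processor $t$, and the decode step forms the $d^2$ output blocks as linear combinations of the $m$ products.

```python
import numpy as np

def bilinear_block_product(A, B, U, V, W, d):
    """Evaluate a length-m bilinear program for d-by-d matrix product,
    given by coefficients U, V (shape (m, d, d)) and W (shape (d, d, m)),
    on n-by-n matrices viewed as d-by-d matrices of (n/d)-by-(n/d) blocks."""
    n = A.shape[0]
    s = n // d
    Ab = A.reshape(d, s, d, s).transpose(0, 2, 1, 3)  # Ab[i, j] is block (i, j)
    Bb = B.reshape(d, s, d, s).transpose(0, 2, 1, 3)
    m = U.shape[0]

    # Encode: m left/right linear combinations of the d^2 input blocks.
    L = np.einsum('tij,ijxy->txy', U, Ab)
    R = np.einsum('tij,ijxy->txy', V, Bb)

    # Multiply: product t is the work of processor t in the clique algorithm.
    P = np.array([L[t] @ R[t] for t in range(m)])

    # Decode: each output block is a linear combination of the m products.
    Cb = np.einsum('ijt,txy->ijxy', W, P)
    return Cb.transpose(0, 2, 1, 3).reshape(n, n)

# Sanity check with the trivial length-d^3 program (the classical algorithm):
# product t = (i, k, j) computes the block product A[i, k] @ B[k, j].
d = 2
m = d ** 3
U, V, W = np.zeros((m, d, d)), np.zeros((m, d, d)), np.zeros((d, d, m))
for t, (i, k, j) in enumerate(np.ndindex(d, d, d)):
    U[t, i, k] = V[t, k, j] = W[i, j, t] = 1.0

A = np.random.randn(6, 6)
B = np.random.randn(6, 6)
assert np.allclose(bilinear_block_product(A, B, U, V, W, d), A @ B)
```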
For a more detailed description, see the memo.
In this work, we use algebraic methods for studying distance computation and subgraph detection tasks in the congested clique model. Specifically, we adapt parallel matrix multiplication implementations to the congested clique, obtaining an $O(n^{1-(2/\omega)})$-round matrix multiplication algorithm, where $\omega < 2.3728639$ is the exponent of matrix multiplication. In conjunction with known techniques from centralised algorithmics, this gives significant improvements over previous best upper bounds in the congested clique model. The highlight results include: (1) triangle and 4-cycle counting in $O(n^{0.158})$ rounds, improving upon the $O(n^{1/3})$-round triangle detection algorithm of Dolev et al. [DISC 2012]; (2) a $(1+o(1))$-approximation of all-pairs shortest paths in $O(n^{0.158})$ rounds, improving upon the $\tilde{O}(n^{1/2})$-round $(2+o(1))$-approximation of Nanongkai [STOC 2014]; and (3) computing the girth in $O(n^{0.158})$ rounds, which is the first non-trivial solution in this model.
Available from Springer Link (in Distributed Computing).