On Approximating Total Variation Distance

by A. Bhattacharyya, S. Gayen, K.S. Meel, D. Myrisiotis, A. Pavan, and N.V. Vinodchandran.

Oded's comments

What puzzled me in reading the abstract is the question of how are these distributions given to us? I somehow missed the natural answer: These are product distributions over 0-1 variables, and so they are described by their margianls. So the question studied here is, given $(p_1,...,p_n)$ and $(q_1,...,q_n)$, compute (or approximate) the difference

$$\sum_{I\subseteq[n]} 
   |(\prod_{i\in I}p_i)\cdot(\prod_{i\in[n]-I}(1-p_i))
    -(\prod_{i\in I}q_i)\cdot(\prod_{i\in[n]-I}(1-q_i))|$$
Interestingly, this probalem is $\#P$-complete, whereas it has a FPTAS.

Both claims rely on the fact that if the expression is positive then it can be lower-bounded in terms of the input length. Since the hardness claim is more suprrising, it is adequate to hint that it is proved by reduction to the problem of determining, when given $u_1,...,u_n$ and $v$, the number of $I$'s that satisfy

$$(\prod_{i\in I}u_i)\cdot(\prod_{i\in[n]-I}(1-u_i)) = v,$$
which can be realted to a subset sum problem (by devising by $\prod_{i\in[n]} (1-u_i)$ and taking logs).

The approximation algorithm works by using (multiplicative) bucketing (of contributions to an expression corresponding to the TVD), and approximating the number of elements in each bucket.

The original abstract

Total variation distance (TV distance) is a fundamental notion of distance between probability distributions. In this work, we introduce and study the computational problem of determining the TV distance between two product distributions over the domain {0,1}n. We establish the following results.

  1. Exact computation of TV distance between two product distributions is #P-complete. This is in stark contrast with other distance measures such as KL, Chi-square, and Hellinger which tensorize over the marginals.
  2. Given two product distributions P and Q with marginals of P being at least 1/2 and marginals of Q being at most the respective marginals of P, there exists a fully polynomial-time randomized approximation scheme (FPRAS) for computing the TV distance between P and Q. In particular, this leads to an efficient approximation scheme for the interesting case when P is an arbitrary product distribution and Q is the uniform distribution.
We pose the question of characterizing the complexity of approximating the TV distance between two arbitrary product distributions as a basic open problem in computational statistics.

See arxiv 2206.07209.


Back to list of Oded's choices.