Instance-by-instance optimal identity testing

by Gregory Valiant and Paul Valiant

Oded's comments

For any fixed distribution $p$, in the problem of testing identity to $p$, for a given proximity parameter $\eps$, one is given samples from an unknown distribution $q$ and should distinguish the case that $q=p$ from the case that $q$ is $\eps$-far from $p$ (in total variation distance). The forgoing formulation views "testing identity" to a given distribution as a class of problems, each parameterized by a given distribution $p$, which indeed fits nicely to the framework of massively parameterized problems.

This specific formulation begs the question of what is the complexity of each of these individual problems, as oppose to upper and lower bound that hold for the entire class (i.e., the parametrized class of problems of "testing identity" to some fixed distribution over $[n]$). Indeed, prior results asserted that (1) all problems in the class can be solved by using $tildeO(sqrt(n))/\eps^4$ samples, and (2) solving some problems in the class (e.g., for $p$ being uniform over $[n]$) require $Omega(sqrt(n))/\eps^2$ samples. But this leaves open the question of what is the complexity of each of the individual problems.

The current work provides a simple algorithm that for any given distribution $p$ uses $s_p(\eps)$ samples of the unknown distribution $q$, whereas $Omega(s_p(O(\eps)))$ is a lower bound on the complexity of any algorithm for "testing identity" to $p$. The function $s_p(\eps)$ is quite weird, but it is the real complexity of the problem parametrized by $p$, not a weakness of the current work. In "typical" cases, $s_p(\eps) = f(p)/\eps^2$, where $f(p)$ is the $2/3$-norm of $p$ [sic!], and, indeed, the $2/3$-norm of the uniform distribution over $[n]$ equals $sqrt(n)$.

N.B.: These problems are fundamentally different form the problem in which one is given samples from two unknown distributions and is required to tell whenther the distributions are identical or $\eps$-far.

Addendum (Oct 2014): A weak explanation for the 2/3-norm is that it is the $Lp$-norm that assigns the uniform $n$-vector $(1/n,...,1/n)$ the value $\sqrt(n)$; that is, $(n\cdot(1/n)^p)^{1/p} = n^{1/2}$ solves to $p=2/3$. The algorithm establishing the bound is a version of the standard chi-square test, which evaluates $\sum_{i\in[k]}((X_i - k p_i)^2- k p_i)/p_i$, where $X_i$ denotes the number of occurances of $i$ in $k$ samples and $p_i$ denote the probability of $i$ in the target distribution. The version used by the algorithm is $\sum_{i\in[k]}((X_i - k p_i)^2- X_i)/p_i^{2/3}$; that is, the shift is by $X_i$ rather than by $k p_i$, and the normalization is by $p_i^{2/3}$ rather than by $p_i$.

The original abstract

See ECCC TR13-111.

Back to list of Oded's choices.