Two works related to testing uniformity of distributions

  1. Sample-Optimal Identity Testing with High Probability by Ilias Diakonikolas, Themis Gouleakis, John Peebles, and Eric Price
  2. Sharp Bounds for Generalized Uniformity Testing by Ilias Diakonikolas, Daniel Kane, and Alistair Stewart

Oded's comments

I have lumped together these two remotely related papers, since they both take a second look at the problem of ``testing whether an unknown distribution is uniform''. This computational problem has received a lot of attention, and one may wonder whether there is anything new and interesting left to say about it. Well, the answer is a double yes.

The first paper takes me back to 1996, when the paper Property Testing and its connection to Learning and Approximation was written. While our definitions and results included an error probability parameter, denoted $\delta$, I felt that this parameter is redundant, since error reduction is a standard practice (allowing one to reduce an error probability of 1/3 to any desired $\delta>0$ at the cost of increasing the complexities by a $\log(1/\delta)$ factor). But eliminating the error probability as a parameter does not allow one to ask whether or not one may improve over this straightforward error reduction. The first paper provides a positive answer to this question in the context of distribution testing. Specifically, it is shown that, for $\delta>2^{-n}$, the complexity of testing uniformity grows proportionally to $\sqrt{\log(1/\delta)}$ rather than to $\log(1/\delta)$.
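For concreteness, here is a minimal Python sketch of that straightforward error reduction; the black box base_tester and its interface are hypothetical placeholders, not taken from the paper. One runs a tester that errs with probability at most 1/3 for $O(\log(1/\delta))$ independent repetitions and takes a majority vote; a Chernoff bound then caps the overall error probability at $\delta$.

    import math

    def amplify(base_tester, delta):
        # base_tester() is a hypothetical black box that draws fresh
        # samples on each invocation and returns a Boolean verdict,
        # erring with probability at most 1/3.
        # By a Chernoff bound, t = O(log(1/delta)) repetitions suffice;
        # the constant 18 is one convenient, non-optimized choice.
        t = max(1, math.ceil(18 * math.log(1 / delta)))
        votes = sum(1 for _ in range(t) if base_tester())
        return votes > t / 2  # majority vote

Note that this amplification multiplies the sample complexity by $\Theta(\log(1/\delta))$, which is exactly the overhead that the first paper shows to be avoidable.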

The second paper provides tight bounds for a natural testing problem that generalizes (in some sense) the standard notion of ``uniformity testing''. The standard testing problem refers to testing whether an unknown distribution is identical to the uniform distribution over $[n]$, whereas the generalized problem refers to testing whether an unknown distribution is identical to a uniform distribution over some subset of $[n]$. (The generalized problem was presented by Batu and Canonne [CoRR 1708.04696], but I missed its announcement.)

The original abstract of the first paper (i.e., Sample-Optimal Identity Testing with High Probability)

We study the problem of testing identity against a given distribution (a.k.a. goodness-of-fit) with a focus on the high confidence regime. More precisely, given samples from an unknown distribution $p$ over $n$ elements, an explicitly given distribution $q$, and parameters $\epsilon, \delta\in(0,1]$, we wish to distinguish, with probability at least $1-\delta$, whether the distributions are identical versus $\epsilon$-far in total variation (or statistical) distance. Existing work has focused on the constant confidence regime, i.e., the case that $\delta = \Omega(1)$, for which the sample complexity of identity testing is known to be $\Theta(\sqrt{n}/\epsilon^2)$.

Typical applications of distribution property testing require small values of the confidence parameter $\delta$ (which correspond to small ``$p$-values'' in the statistical hypothesis testing terminology). Prior work achieved arbitrarily small values of $\delta$ via black-box amplification, which multiplies the required number of samples by $\Theta(\log(1/\delta))$. We show that this upper bound is suboptimal for any $\delta = o(1)$, and give a new identity tester that achieves the optimal sample complexity. Our new upper and lower bounds show that the optimal sample complexity of identity testing is
\[ \Theta\left(\epsilon^{-2}\cdot\left(\sqrt{n\log(1/\delta)}+\log(1/\delta)\right)\right) \]
for any $n, \epsilon$, and $\delta$. For the special case of uniformity testing, where the given distribution is the uniform distribution $U_n$ over the domain, our new tester is surprisingly simple: to test whether $p = U_n$ versus $d_{\mathrm{TV}}(p, U_n) \geq \epsilon$, we simply threshold $d_{\mathrm{TV}}(\widehat{p}, U_n)$, where $\widehat{p}$ is the empirical probability distribution. We believe that our novel analysis techniques may be useful for other distribution testing problems as well.

See ECCC TR17-133.
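As an aside, the uniformity tester described in the abstract is simple enough to sketch in a few lines of Python. The following illustrates only the plug-in statistic, not the paper's analysis: the parameter threshold is a placeholder, which the paper's analysis would set as a function of $n$, $\epsilon$, $\delta$, and the number of samples.

    from collections import Counter

    def empirical_tv_tester(samples, n, threshold):
        # Accept "p = U_n" iff d_TV(hat_p, U_n) <= threshold, where
        # hat_p is the empirical distribution of the samples over [n].
        m = len(samples)
        counts = Counter(samples)
        # d_TV(hat_p, U_n) = (1/2) * sum_{i in [n]} |hat_p(i) - 1/n|;
        # each of the n - len(counts) unseen elements contributes 1/n.
        seen = sum(abs(c / m - 1 / n) for c in counts.values())
        unseen = (n - len(counts)) / n
        return 0.5 * (seen + unseen) <= threshold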

The original abstract of the second paper (i.e., Sharp Bounds for Generalized Uniformity Testing)

We study the problem of generalized uniformity testing [BC17] of a discrete probability distribution: Given samples from a probability distribution $p$ over an unknown discrete domain $\mathbf{\Omega}$, we want to distinguish, with probability at least $2/3$, between the case that $p$ is uniform on some subset of $\mathbf{\Omega}$ versus $\epsilon$-far, in total variation distance, from any such uniform distribution.

We establish tight bounds on the sample complexity of generalized uniformity testing. In more detail, we present a computationally efficient tester whose sample complexity is optimal, up to constant factors, and a matching information-theoretic lower bound. Specifically, we show that the sample complexity of generalized uniformity testing is $\Theta\left(1/(\epsilon^{4/3}\|p\|_3) + 1/(\epsilon^{2} \|p\|_2) \right)$.

See ECCC TR17-132.
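To get a feel for this bound, one may plug in a distribution $p$ that is uniform over a subset of size $m$ (an illustrative calculation of mine, not taken from the paper): then $\|p\|_2 = m^{-1/2}$ and $\|p\|_3 = m^{-2/3}$, so the bound reads
\[ \Theta\left(m^{2/3}/\epsilon^{4/3} + \sqrt{m}/\epsilon^{2}\right), \]
where the first term dominates whenever $\epsilon \geq m^{-1/4}$ (up to constant factors).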


Back to list of Oded's choices.