## Estimating the Effective Support Size in Constant Query Complexity

by Shyam Narayanan and Jakub Tetek

This paper resolves a problem that has haunted me for a while. In my paper on the subject, I presented a few algorithms for estimating the effective support size of a distribution, when using sampling and evlauation oracles, where the epsilon-effective support size of a distribution D equals the smallest support size of a distribution that is epsilon-close in total variation distance from D. (See a formulation in the original (revised) abstract.)

While getting close to achieving an algorithm with complexity that only depends on the three proximity parameters (i.e., espilon, beta and gamma below), I failed to achieve this goal. In fact, at times, I was trying to prove that such a result was impossible (while failing on that front too).

The algorithm presented in the current paper is indeed very simple, but its analysis is not simple at all. The pivotal observation is that taking a sample of size $s=O(1/(\beta^2\eps))$, ordering its elements in incraesing order of probabilities, and letting $z$ denote the $(1+0.5\beta)\eps\cdot s$th element, the following function $f$ is an unbaised estimator of the desired value:

$f(x)=1/p(x)$ if $p(x)\geq p(z)$ and $f(x)=0$
where $p(x)$ denotes the probability of element $x$.

Indeed, for $S=\{x: p(x) \geq p(z)\}$, it holds that
$E[f(X)] = \sum_{x} p(x)\cdot f(x) = \sum_{x\in S} 1 = |S|$.

The problem, as usual, is bounding the variance. This works in one natural case, but not in another. Instead, a different analysis is applied in the second case. In particular, the deviation from the expectation is bounded using specific features of the setting at hand.

Comment: As stated in the paper itself, the reduction of the case of gamma=0 to the case of positive gamma was observed in the initial paper. The bicriteria approximation was introduced in that paper in order to offer greater flexibility in presenting partial progress towards the goal achieved in the current paper. In light of the results obtained in the current paper, it seems that the bicriteria formulation is still useful in keeping track of different types of error.

#### The original abstract (with minor revisions)

Estimating the support size of a distribution is a well-studied problem in statistics. Motivated by the fact that this problem is highly non-robust (as small perturbations in the distributions can drastically affect the support size) and thus hard to estimate, Goldreich [ECCC 2019] studied the query complexity of estimating the epsilon-effective support size}, Ess_eps of a distribution P, which is equal to the smallest support size of a distribution that is eps-close in total variation distance from P.

In his paper, he shows an algorithm in the dual access setting (where we may both receive random samples and query the sampling probability p(x) for any x) for a bicriteria approximation, giving an answer in [Ess_{(1+beta)eps},(1+gamma)Ess_eps] for various values beta,gamma. However, his algorithm has either super-constant query complexity in the support size or super-constant approximation ratio 1+gamma. He then asked if this is necessary, or if it is possible to get a constant-factor approximation in a number of queries independent of the effective support size, denoted n.

We answer his question by showing that not only is complexity independent of n possible for positive gamma, but also for gamma=0, that is, that the bicriteria relaxation is not necessary. Specifically, we show an algorithm with query complexity O(1/(beta eps)^3). That is, for any eps, beta in (0,1), we output in this complexity a number in [Ess_{(1+beta)eps},Ess_eps]. We also show that it is possible to solve the approximate version with approximation ratio 1+gamma in complexity O((1/beta^2 eps)+(1/beta eps gamma^2)). Our algorithm is very simple, and has 4 short lines of pseudocode.

See arXiv2211.11344.

Back to list of Oded's choices.