(1) A CLT and tight lower bounds for estimating entropy

(2) Estimating the unseen: A sublinear-sample canonical estimator of distributions

by Gregory Valiant and Paul Valiant.

Oded's comments

I did not get to study these two related works yet, but the results sound very impressive. Rather than waiting another month (or so) till I get to study them, let me take a small risk and recommend them somewhat blindly (i.e., based on the abstracts).

Note that these results refer to approximations up to an additive error term, whereas most previous works referred to constant-factor approximations. In the case of entropy, the additive term is an arbitrarily small constant.
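To make the distinction between the two notions concrete, here is a minimal illustrative sketch (not taken from the papers) of what an additive versus a constant-factor guarantee promises for an entropy estimate:

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)

H = entropy([0.5, 0.25, 0.125, 0.125])  # = 1.75 * ln(2) nats

# Additive guarantee: the estimate lands in [H - eps, H + eps].
eps = 0.1
additive_window = (H - eps, H + eps)

# Constant-factor guarantee: the estimate lands in [H / c, c * H],
# which is a much weaker promise when H is large.
c = 2.0
multiplicative_window = (H / c, c * H)
```

For a distribution with entropy around 7 nats, a factor-2 guarantee allows estimates anywhere in [3.5, 14], whereas an additive guarantee with eps = 0.1 pins the estimate to within 0.1 of the truth.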

The papers are available as TR10-179 and TR10-180 of ECCC.

The original abstract of the second paper

We introduce a new approach to characterizing the unobserved portion of a distribution, which provides sublinear-sample additive estimators for a class of properties that includes entropy and distribution support size. Together with the lower bounds proven in the companion paper [29], this settles the longstanding question of the sample complexities of these estimation problems (up to constant factors). Our algorithm estimates these properties up to an arbitrarily small additive constant, using O(n/log n) samples; [29] shows that no algorithm on o(n/log n) samples can achieve this (where n is a bound on the support size, or, in the case of estimating the support size, 1/n is a lower bound on the probability of any element of the domain). Previously, no explicit sublinear-sample algorithms for either of these problems were known. Additionally, our algorithm runs in time linear in the number of samples used.
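For contrast with the paper's O(n/log n)-sample estimator, here is the naive baseline it improves upon: the plug-in estimator, which simply reports the entropy of the empirical distribution. This sketch is illustrative only (it is not the paper's canonical estimator); the plug-in approach needs on the order of n samples, and with fewer samples it is biased downward because unseen elements contribute nothing:

```python
import math
import random
from collections import Counter

def plugin_entropy(samples):
    """Naive plug-in estimator: the Shannon entropy (in nats) of the
    empirical distribution of the samples. This baseline requires a
    number of samples on the order of the support size n, whereas the
    paper achieves additive estimation from O(n/log n) samples."""
    counts = Counter(samples)
    m = len(samples)
    return -sum((c / m) * math.log(c / m) for c in counts.values())

# Uniform distribution on n = 1000 elements: true entropy is log(1000).
random.seed(0)
n = 1000
samples = [random.randrange(n) for _ in range(20 * n)]
est = plugin_entropy(samples)
# With many samples per element, est approaches log(n) ≈ 6.91 nats.
```

With only o(n) samples, most of the support is unseen and the plug-in estimate falls well short of log(n); closing that gap in the sublinear regime is exactly what the paper's estimator accomplishes.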
