VC dimension and distribution-free sample-based testing

by Eric Blais, Renato Ferreira Pinto Jr., and Nathaniel Harms

Oded's comments

I somehow missed this paper, but it seems that Property Testing Review missed it too. Is it the effect of COVID?

Anyhow, the main result of the paper is nicely abstracted in the first paragraph of the abstract. To be more specific, one is forced to recall the definition of VC dimension and present the definition of LVC dimension. Considering properties of $n$-bit long strings, one typically presents the VC dimension as the largest $k$ such that some $k$-subset of $[n]$ is shattered by the property (i.e., for any $k$-bit pattern, there is a string in the property whose projection on this $k$-subset yields these values). The LVC dimension with respect to size s is the largest $k$ such that for some $s$-subset $S$ of $[n]$ every $k$-subset of $S$ is shattered by the property.

The main result actually considers the property that is obtained by projecting the strings of the original property $\Pi$ on the subset $S$; that is $\Pi'=\{x_S:x\in S\}$. Denoting the VC dimension of $\Pi'$ by $\VC(\Pi')$ and the LVC dimension of $\Pi'$ (w.r.t. size |S|) by $\LVC(\Pi')$, its simpler form (stated in Corollary 1.2) asserts that if $v \eqdef \LVC(\Pi') = \VC(\Pi') \leq |S|/5$, then distribution-free sample-based testing $\Pi'$ (and so also $\Pi$) is $\Omega(v/\log v)$. Furthermore, in some cases this bound is tight. The take-home message is that in these cases (distribution-free sample-based) testing may only be logarithmically faster than learning.

Looking at the foregoing bound and take-home message, and recalling that similar bounds and take-home messages occur in the context of distribution testing, one wonder whether this is a coincidence. The answer is that it is not a coincidence, since the main result is prove by a reduction from the problem of estimating the support size of a distribution. (See Section 1.2.)

The original abstract

We consider the problem of determining which classes of functions can be tested more efficiently than they can be learned, in the distribution-free sample-based model that corresponds to the standard PAC learning setting. Our main result shows that while VC dimension by itself does not always provide tight bounds on the number of samples required to test a class of functions in this model, it can be combined with a closely-related variant that we call lower VC (or LVC) dimension to obtain strong lower bounds on this sample complexity.

We use this result to obtain strong and in many cases nearly optimal bounds on the sample complexity for testing unions of intervals, halfspaces, intersections of halfspaces, polynomial threshold functions, and decision trees. Conversely, we show that two natural classes of functions, juntas and monotone functions, can be tested with a number of samples that is polynomially smaller than the number of samples required for PAC learning.

Finally, we also use the connection between VC dimension and property testing to establish new lower bounds for testing radius clusterability and testing feasibility of linear constraint systems.

Available from STOC'21.

Back to list of Oded's choices.