Generative Models of Huge Objects

by Lunjia Hu, Inbal Livni-Navon, and Omer Reingold

Oded's comments

I think that the computational learning theory perspective is the one that is most suitable for this work. In PAC learning, for a concept $c:\bitset^n\to\bitset$ taken from a concept class $C$ and a hypothesis class $H$, given a (labeled) sample drawn from $(X,c(X))$, one is required to find $h\in H$ such that, whp, the distribution $(X,h(X))$ is close to $(X,c(X))$. In proper learning one considers $H=C$ (and may also insist on a specific representation of functions in $H$), whereas in agnostic learning one considers the class of all functions (i.e., $C$ consists of all $f:\bitset^n\to\bitset$). In all these cases, the notion of being `close' is the statistical one (i.e., total variation distance).

The main step taken by this work is relaxing the foregoing requirement. That is, rather than requiring that for every $d:\bitset^{n+1}\to\bitset$ it holds that $\Prob[d(X,c(X))=1]$ is close to $\Prob[d(X,h(X))=1]$, this is required only for distinguishers in a predetermined class, denoted $D$. It seems adequate to call an algorithm that meets such a requirement a $D$-robust learner (or $D$-learner), and a general term such as crude learner or pseudo-learner may be used too.
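To make the relaxed requirement concrete, here is a minimal Python sketch (the names, such as estimated_gap, are merely illustrative and not taken from the paper) of checking $D$-indistinguishability from a sample of $X$; taking $D$ to be the class of all functions of $(x,b)$ recovers the statistical requirement of standard PAC learning discussed above.

    # Illustrative sketch: estimate the gap of a distinguisher d(x, b) between
    # the target (X, c(X)) and the model (X, h(X)), and check the relaxed,
    # D-restricted indistinguishability requirement. Here h(x) may be
    # randomized (a fresh bit per call), whereas c(x) is the fixed target bit.
    def estimated_gap(d, sample_x, c, h):
        """Estimate |Pr[d(X, c(X)) = 1] - Pr[d(X, h(X)) = 1]| from a sample of X."""
        real = sum(d(x, c(x)) for x in sample_x) / len(sample_x)
        model = sum(d(x, h(x)) for x in sample_x) / len(sample_x)
        return abs(real - model)

    def is_indistinguishable_wrt(D, sample_x, c, h, eps):
        """The relaxed requirement: no distinguisher in the fixed class D has a gap above eps."""
        return all(estimated_gap(d, sample_x, c, h) <= eps for d in D)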

Another relaxation is that $h$ is allowed to be a randomized process (or a distribution) rather than a function; in this case, the authors view the learner as generating a random object (or a model). However, my view is that the fact that the hypothesis is a randomized process rather than a function is a relaxation, not a requirement. In fact, the authors place no restriction (say, a min-entropy bound) on the randomness of the output.

A striking result, which appears explicitly in Thm 3 of Calibration for the (Computationally-Identifiable) Masses (but, as mentioned there, was implicit in prior works), states that any function can be learned in the foregoing sense within complexity that is related to the complexity of (agnostically) learning the class of distinguishers $D$, provided that every $d\in D$ satisfies $d(x,b)=d'(x)\cdot b$ for some $d':\bitset^n\to\bitset$. The learning algorithm (captured by the proof of Thm 4.2, which instantiates Alg 3.2) proceeds in iterations, and is actually easier to explain in terms of the probabilities $p_i(x)\eqdef\Prob[h_i(x)=1]$, where $h_i$ is the randomized hypothesis generated in the $i$th iteration (and $h_0$ is rather arbitrary).

  1. Using an agnostic learning algorithm, find a distinguisher $d\in D$ such that $|\Prob[d(X,c(X))=1]-\Prob[d(X,h_i(X))=1]|>\eps$ (equiv., $|\Exp[d(X,c(X))]-\Exp[d(X,p_i(X))]|>\eps$), and output $h_i$ if no such distinguisher is found.
  2. Correct $h_i$, obtaining a process $h_{i+1}$ such that $h_{i+1}(x)=h_i(x)$ if $d'(x)=0$ and $h_{i+1}(x)=(c(X)|d'(X)=1)$ otherwise, where $d'(x)=d(x,1)$ and $(c(X)|d'(X)=1)$ denotes the distribution of $c(X)$ conditioned on $d'(X)=1$. In terms of the $p_j$'s, we have $p_{i+1}(x)=p_i(x)$ if $d'(x)=0$ and $p_{i+1}(x)=\Exp[c(X)|d'(X)=1]$ otherwise.
If the agnostic learner is guaranteed to find (whp) a distinguisher with gap $\eps$ whenever a distinguisher with gap $\eps'$ exists, then (whp) there can be at most $1/\eps^2$ iterations, and the output hypothesis $h$ satisfies $|\Prob[d(X,c(X))=1]-\Prob[d(X,h(X))=1]| \leq \eps'$ for every $d\in D$. A code sketch of this procedure follows.
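For concreteness, the following minimal Python sketch renders the foregoing procedure in terms of the $p_i$'s over an explicit (small) domain, with $X$ uniform; the agnostic learner is replaced by exhaustive search over a finite class of $d'$'s, and all names are illustrative rather than taken from the paper. The quantity computed in step 1 below equals $\Exp[d(X,c(X))]-\Exp[d(X,h_i(X))]$ for $d(x,b)=d'(x)\cdot b$.

    # A sketch of the iterative correction procedure described above, over an
    # explicit finite domain with X uniform. The agnostic learner is replaced
    # by exhaustive search over a finite class D_prime of functions
    # d': domain -> {0,1}; the names are illustrative, not the paper's.
    def learn_model(domain, c, D_prime, eps, max_iter=1000):
        """Return p, where p[x] stands for Pr[h(x) = 1], such that no d' in
        D_prime exhibits a gap larger than eps (max_iter is just a safety cap)."""
        p = {x: 0.5 for x in domain}              # h_0 is rather arbitrary
        for _ in range(max_iter):
            found = None
            for d in D_prime:                     # step 1: search for a distinguisher
                gap = sum(d(x) * (c(x) - p[x]) for x in domain) / len(domain)
                if abs(gap) > eps:
                    found = d
                    break
            if found is None:                     # no distinguisher found: output h
                return p
            S = [x for x in domain if found(x) == 1]
            v = sum(c(x) for x in S) / len(S)     # step 2: E[c(X) | d'(X) = 1]
            for x in S:
                p[x] = v                          # correct h_i on the set {d' = 1}
        return p

For instance, with domain = range(8), c = lambda x: x % 2, and D_prime consisting of indicator functions of a few subsets, the procedure halts once no subset in the class witnesses a (density-weighted) discrepancy above eps between the averages of c and of p on that subset.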

The truthfulness condition that is considered in the current work is analogous to the proper learning condition. Specifically, if the target function $c$ belongs to some class, then the randomized hypothesis $h$ is required to be supported on that class (or to be outside it with small probability).

Actually, the general framework put forward in the paper allows going beyond the standard requirement of PAC learning, because the class of distinguishers is not necessarily restricted to algorithms (of a certain class) that get a single labeled sample (i.e., drawn either from $(X,c(X))$ or from $(X,h(X))$). Indeed, the foregoing case is a special case that is captured by the notion of sample-access (see Def 2.1), but one may also consider query-access (see Def 2.3), while the learner and the distinguishers may be of various types (a point not clear enough in the text). I think it may also make sense to distinguish between adaptive and non-adaptive queries (esp., wrt the distinguisher). Hence, such learning algorithms may be termed testing-robust, with respect to various classes of testers that may be restricted both in their query access and in their computational resources.
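The difference between the two access types may be illustrated schematically as follows (a Python sketch of the interfaces only; the formal notions are Def 2.1 and Def 2.3 of the paper, and the function names and toy decision rules below are mine). Note that the query-access distinguisher in this sketch is adaptive: its second query depends on the answer to its first one.

    # Schematic interfaces (names and toy decision rules are illustrative).
    # A sample-access distinguisher only sees labeled samples drawn from the
    # object, whereas a query-access distinguisher queries the object at
    # points of its own choice, possibly adaptively.
    from typing import Callable, List, Tuple

    def sample_access_distinguisher(samples: List[Tuple[int, int]]) -> int:
        """Decides from i.i.d. labeled samples (x, f(x)); here, by majority of the labels."""
        return 1 if 2 * sum(label for _, label in samples) > len(samples) else 0

    def query_access_distinguisher(f: Callable[[int], int], n: int) -> int:
        """Decides by querying f at chosen points; the second query is chosen
        adaptively, depending on the answer to the first one."""
        first = f(0)
        second = f(1 if first == 1 else (1 << n) - 1)
        return first & second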

The authors claim to be inspired by prior work on implementing huge random objects and do borrow notions (of indistinguishability, truthfulness and types of implementation) from there; but, in my opinion, the current work is far closer to computational learning theory. In particular, while the starting point of the prior work is a precise specification of the desired random object, the current work is focused on learning an unknown object (which may be confined to some class) and generating a (possibly random) model for it.

The original abstract [with a single clarification]

This work initiates the systematic study of explicit distributions that are indistinguishable from a single exponential-size combinatorial object. In this we extend the work of Goldreich, Goldwasser and Nussboim (SICOMP 2010) that focused on the implementation of huge objects that are indistinguishable from the uniform distribution, satisfying some global properties (which they coined truthfulness). Indistinguishability from a single object is motivated by the study of generative models in learning theory and regularity lemmas in graph theory. Problems that are well understood in the setting of pseudorandomness present significant challenges and at times are impossible when considering generative models of huge objects.

We demonstrate the versatility of this study by providing a learning algorithm for huge indistinguishable objects in several natural settings including: dense functions and graphs with a truthfulness requirement on the number of ones in the function or edges in the graphs, and a version of the weak regularity lemma for sparse graphs that satisfy some global properties. These and other results generalize basic pseudorandom objects as well as notions introduced in algorithmic fairness. The results rely on notions and techniques from a variety of areas including learning theory, complexity theory, cryptography, and game theory.

See arXiv 2302.12823.


Back to list of Oded's choices.