Property Testing: Current Research and Surveys

[edited by Oded Goldreich]

This is not the webpage of Oded's forthcoming book on property testing, titled Introduction to Property Testing.
Ditto regarding Oded's lecture notes on property testing, which serve as a basis for the aforementioned book.


Appeared (in 2010) as LNCS Vol. 6390, in the LNCS State-of-the-Art Survey series.

LNCS 6390 is now available online. You can find information about it or access the online version.

Below is the book's tentative preface and table of contents (incl. drafts of most papers).

Tentative image for cover [regular format, high resolution format] and final cover.


Preface (by O.G., tentative, January 2010)

Property Testing is the study of super-fast (randomized) algorithms for approximate decision making. These algorithms are given direct access to items of a huge data set, and determine whether this data set has some predetermined (global) property or is far from having this property. Remarkably, this approximate decision is made by accessing a small portion of the data set.

Property Testing has been a subject of intensive research in the last couple of decades (see, e.g., recent surveys [R1,R2]), with hundreds of studies conducted in it and in closely related areas. Indeed, Property Testing is closely related to Probabilistically Checkable Proofs (PCPs), and is related to Coding Theory, Combinatorics, Statistics, Computational Learning Theory, Computational Geometry, and more.

The mini-workshop, hosted by the Institute for Computer Science (ITCS) at Tsinghua University (Beijing), brought together a couple of dozen leading international researchers in Property Testing and related areas. At the end of this mini-workshop it was decided to compile a collection of extended abstracts and surveys that reflect the program of the mini-workshop. The result is the current volume.

Property Testing at a Glance

Property testing is a relaxation of decision problems and it focuses on algorithms that can only read parts of the input. Thus, the input is represented as a function (to which the tester has oracle access) and the tester is required to accept functions that have some predetermined property (i.e., reside in some predetermined set) and reject any function that is ``far'' from the set of functions having the property. Distances between functions are defined as the fraction of the domain on which the functions disagree, and the threshold determining what is considered far is presented as a proximity parameter, which is explicitly given to the tester.
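To make the distance measure concrete, the following minimal Python sketch computes the fraction of the domain on which two functions disagree. The names `relative_distance` and `is_eps_far` are illustrative and do not appear in the volume. Note that the sketch reads the entire domain, which is exactly what a tester tries to avoid, so it only serves to pin down the definition.

```python
from typing import Callable, Hashable, Sequence


def relative_distance(f: Callable, g: Callable, domain: Sequence[Hashable]) -> float:
    """Fraction of points in `domain` on which f and g disagree."""
    if len(domain) == 0:
        raise ValueError("domain must be non-empty")
    disagreements = sum(1 for e in domain if f(e) != g(e))
    return disagreements / len(domain)


def is_eps_far(f: Callable, g: Callable, domain: Sequence[Hashable], eps: float) -> bool:
    """True if f and g disagree on more than an eps fraction of the domain."""
    return relative_distance(f, g, domain) > eps


# Toy example on D_10 = {1,...,10}: f is 1 on odd points, g is all-zero.
domain = range(1, 11)
f = lambda e: e % 2
g = lambda e: 0
print(relative_distance(f, g, domain))  # 0.5
```

Being far from a property (rather than from a single function) means being far from every function having the property, which would require minimizing this quantity over the whole set.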

An asymptotic analysis is enabled by considering an infinite sequence of domains, functions, and properties. That is, for any $n$, we consider functions from $D_n$ to $R_n$, where $|D_n|=n$. (Often, one just assumes that $D_n=[n]\eqdef\{1,2,...,n\}$.) Thus, in addition to the input oracle, representing a function $f:D_n\to R_n$, the tester is explicitly given two parameters: a size parameter, denoted $n$, and a proximity parameter, denoted $\eps$.

Definition: Let $\Pi=\bigcup_{n\in\N}\Pi_n$, where $\Pi_n$ contains functions defined over the domain $D_n$. A tester for a property $\Pi$ is a probabilistic oracle machine $T$ that satisfies the following two conditions:
  1. The tester accepts each $f\in\Pi$ with probability at least $2/3$; that is, for every $n\in\N$ and $f\in\Pi_n$ (and every $\eps>0$), it holds that $\prob[T^f(n,\eps)\!=\!1]\geq2/3$.
  2. Given $\eps>0$ and oracle access to any $f$ that is $\eps$-far from $\Pi$, the tester rejects with probability at least $2/3$; that is, for every $\eps>0$ and $n\in\N$, if $f:D_n\to R_n$ is $\eps$-far from $\Pi_n$, then $\prob[T^f(n,\eps)\!=\!0]\geq2/3$, where $f$ is $\eps$-far from $\Pi_n$ if, for every $g\in\Pi_n$, it holds that $|\{e\in D_n:f(e)\neq g(e)\}|>\eps\cdot n$.
If the tester accepts every function in $\Pi$ with probability 1, then we say that it has one-sided error; that is, $T$ has one-sided error if for every $n\in\N$, every $f\in\Pi_n$ and every $\eps>0$, it holds that $\prob[T^f(n,\eps)\!=\!1]=1$. A tester is called non-adaptive if it determines all its queries based solely on its internal coin tosses (and the parameters $n$ and $\eps$); otherwise it is called adaptive.
This definition does not specify the query complexity of the tester, and indeed an oracle machine that queries the entire domain of the function qualifies as a tester (with zero error probability...). Needless to say, we are interested in testers that have significantly lower query complexity.
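As an illustration of the definition (a toy example, not one taken from the volume), consider the property in which $\Pi_n$ contains only the all-zero function on $D_n=[n]$. The Python sketch below queries $\lceil 2/\eps\rceil$ uniformly chosen points and rejects if and only if it sees a non-zero value. The name `all_zero_tester` and the specific number of queries are assumptions made for this sketch. The all-zero function is always accepted, so the error is one-sided, and all queries are chosen before any answer is seen, so the tester is non-adaptive. If $f$ is $\eps$-far from $\Pi_n$, then more than $\eps\cdot n$ points are non-zero, so all $q=\lceil 2/\eps\rceil$ queries miss them with probability less than $(1-\eps)^q\leq e^{-2}<1/3$.

```python
import math
import random
from typing import Callable, Optional


def all_zero_tester(f: Callable[[int], int], n: int, eps: float,
                    rng: Optional[random.Random] = None) -> int:
    """Toy tester for the property 'f is identically 0' on D_n = {1,...,n}.

    Returns 1 (accept) or 0 (reject). The query count depends only on eps.
    """
    if n < 1 or eps <= 0:
        raise ValueError("need n >= 1 and eps > 0")
    rng = rng or random.Random()
    q = math.ceil(2 / eps)  # assumed query budget for this sketch
    # Non-adaptive: all query points are fixed by the coin tosses alone.
    queries = [rng.randint(1, n) for _ in range(q)]
    return 0 if any(f(e) != 0 for e in queries) else 1


# The all-zero function is accepted with probability 1.
print(all_zero_tester(lambda e: 0, n=10**6, eps=0.25))  # 1
```

The number of queries does not grow with $n$, in contrast to the trivial tester that queries the entire domain.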

Research in property testing is often categorized according to the type of functions and properties being considered. In particular, algebraic property testing focuses on the case that the domain and range are associated with some algebraic structures (e.g., groups, fields, and vector spaces) and studies algebraic properties such as being a polynomial of low degree (see, e.g., [BLR,RS]). In the context of testing graph properties (see, e.g., [GGR]), the functions represent graphs or rather allow certain queries to such graphs (e.g., in the adjacency matrix model, graphs are represented by their adjacency relation and queries correspond to pairs of vertices where the answers indicate whether or not the two vertices are adjacent in the graph).
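For a flavor of algebraic property testing, the following is a minimal sketch in the spirit of the linearity test of [BLR], specialized (as an assumption of this sketch) to functions $f:\{0,1\}^k\to\{0,1\}$, so that $n=2^k$. Elements of the domain are encoded as $k$-bit integers and addition is bitwise XOR. Each iteration picks random $x,y$ and checks whether $f(x)+f(y)=f(x+y)$. The function name, the integer encoding, and the choice of $\lceil 2/\eps\rceil$ repetitions are illustrative and should not be read as the exact procedure or parameters of [BLR].

```python
import math
import random
from typing import Callable, Optional


def blr_linearity_tester(f: Callable[[int], int], k: int, eps: float,
                         rng: Optional[random.Random] = None) -> int:
    """Sketch of a linearity test for f: {0,1}^k -> {0,1} (k >= 1).

    Points are k-bit integers; addition in the domain and range is XOR.
    Returns 1 (accept) or 0 (reject). Linear functions are never rejected.
    """
    if k < 1 or eps <= 0:
        raise ValueError("need k >= 1 and eps > 0")
    rng = rng or random.Random()
    repetitions = math.ceil(2 / eps)  # assumed repetition count for this sketch
    for _ in range(repetitions):
        x = rng.getrandbits(k)
        y = rng.getrandbits(k)
        # Check f(x) + f(y) = f(x + y) over GF(2).
        if (f(x) ^ f(y)) != f(x ^ y):
            return 0
    return 1


# A linear function: parity of the bits of x selected by a fixed mask.
linear = lambda x: bin(x & 0b1011).count("1") % 2
print(blr_linearity_tester(linear, k=8, eps=0.1))  # 1
```

Each iteration makes three queries, and all query points are determined by the coin tosses alone, so this sketch is non-adaptive and has one-sided error.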

Ramifications

While most research in property testing refers to distances with respect to the uniform distribution on the function's domain, other distributions and even distribution-free models were also considered. That is, for a (known or unknown) distribution $\mu$ on the domain, we say that $f$ is $\eps$-far from $g$ (w.r.t. $\mu$) if $\prob_{e\sim\mu}[f(e)\!\neq\!g(e)]>\eps$. Indeed, the foregoing definition refers to the case that $\mu$ is uniform over the domain (i.e., $D_n$).
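The effect of the distribution on the distance can be seen in a small Python sketch. Here $\mu$ is given explicitly as a dictionary of probabilities, which is an assumption made for illustration (in the distribution-free model $\mu$ is unknown to the tester). The name `mu_distance` is hypothetical.

```python
from typing import Callable, Dict, Hashable


def mu_distance(f: Callable, g: Callable, mu: Dict[Hashable, float]) -> float:
    """Probability, over e drawn from mu, that f(e) != g(e)."""
    if abs(sum(mu.values()) - 1.0) > 1e-9:
        raise ValueError("mu must sum to 1")
    return sum(p for e, p in mu.items() if f(e) != g(e))


# f and g disagree only on the point 1.
f = lambda e: 1 if e == 1 else 0
g = lambda e: 0

uniform = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}
skewed = {1: 0.7, 2: 0.1, 3: 0.1, 4: 0.1}

print(mu_distance(f, g, uniform))  # 0.25
print(mu_distance(f, g, skewed))   # 0.7
```

The same pair of functions is thus $0.5$-far under the skewed distribution but not under the uniform one.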

A somewhat related model is one in which the tester obtains random pairs $(e,f(e))$, where each sample $e$ is drawn (independently) from the aforementioned distribution. Such random ($f$-labeled) examples can be obtained either on top of the queries to $f$ or instead of them. This is also the context of testing distributions, where the examples are actually unlabeled and the aim is testing properties of the underlying distribution (rather than properties of the labeling, which is null here).

A third ramification refers to the related notions of tolerant testing and distance approximation (cf. [PRR]). In the latter, the algorithm is required to estimate the distance of the input (i.e., $f$) from the predetermined set of instances having the property (i.e., $\Pi$). Tolerant testing usually means only a crude distance approximation that guarantees that inputs close to $\Pi$ (rather than only inputs in $\Pi$) are accepted while inputs that are far from $\Pi$ are rejected (as usual).
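The relation between distance approximation and tolerant testing can be sketched for a toy property in which $\Pi_n$ contains only the all-zero function on $[n]$, so that the distance of $f$ from $\Pi_n$ is simply the fraction of non-zero points. This toy property, the function names, and the two thresholds `eps1 < eps2` are assumptions of the sketch and are not taken from [PRR]. The sample size follows the standard Hoeffding bound: $m\geq\ln(2/\delta)/(2t^2)$ samples give additive error below $t$ with probability at least $1-\delta$.

```python
import math
import random
from typing import Callable, Optional


def estimate_distance_to_all_zero(f: Callable[[int], int], n: int, accuracy: float,
                                  rng: Optional[random.Random] = None,
                                  failure_prob: float = 1 / 3) -> float:
    """Estimate the fraction of non-zero points of f on {1,...,n} by sampling."""
    rng = rng or random.Random()
    m = math.ceil(math.log(2 / failure_prob) / (2 * accuracy ** 2))
    hits = sum(1 for _ in range(m) if f(rng.randint(1, n)) != 0)
    return hits / m


def tolerant_all_zero_tester(f: Callable[[int], int], n: int, eps1: float, eps2: float,
                             rng: Optional[random.Random] = None) -> int:
    """Accept (1) if f is eps1-close to all-zero, reject (0) if it is eps2-far.

    Each guarantee holds with probability at least 2/3.
    """
    if not 0 <= eps1 < eps2:
        raise ValueError("need 0 <= eps1 < eps2")
    gap = (eps2 - eps1) / 2
    estimate = estimate_distance_to_all_zero(f, n, gap, rng)
    return 1 if estimate <= eps1 + gap else 0


# A function that is non-zero on 5% of the domain is accepted with high
# probability when eps1 = 0.1 and eps2 = 0.3, although it is not in the property.
sparse = lambda e: 1 if e % 20 == 0 else 0
print(tolerant_all_zero_tester(sparse, n=10**6, eps1=0.1, eps2=0.3))
```

For richer properties the distance from $\Pi$ is a minimum over many functions, and estimating it is typically far less direct than in this toy case.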

On the current focus on query complexity

Current research in property testing focuses mainly on query (and/or sample) complexity, while either ignoring time complexity or considering it a secondary issue. The current focus on these information-theoretic measures is justified by the fact that even the latter are far from being understood. (Indeed, this stands in contrast to the situation in, say, PAC learning.)

On the importance of representation

The representation of problem instances is crucial to any study of computation, since the representation determines the type of information that is explicit in the input. This issue becomes much more acute when one is only allowed partial access to the input (i.e., making a number of queries that result in answers that do not fully determine the input). An additional issue, which is unique to property testing, is that the representation may affect the distance measure (i.e., the definition of distances between inputs). This is crucial because property testing problems are defined in terms of this distance measure.

The importance of representation is forcefully demonstrated in the gap between the complexity of testing numerous natural graph properties in two natural representations: the adjacency matrix representation (cf. [GGR]) and the incidence lists representation (cf. [GR1]).
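The following Python sketch illustrates how the two representations differ both in the queries they support and in the way distances are normalized. It assumes simple undirected graphs on vertices $1,\dots,n$, a degree bound $d$ for the incidence lists, the value 0 as a "no neighbor" answer, and one common normalization (edge modifications counted twice, over $n^2$ and over $d\cdot n$, respectively). All names are hypothetical, and the exact conventions in [GGR] and [GR1] may differ in such details.

```python
from typing import Callable, Iterable, Set, Tuple

Edge = Tuple[int, int]


def edge_set(edges: Iterable[Edge]) -> Set[frozenset]:
    """Undirected edges as a set of unordered pairs."""
    return {frozenset(e) for e in edges}


def adjacency_oracle(edges: Iterable[Edge]) -> Callable[[int, int], int]:
    """Adjacency matrix model: query (u, v), answer 1 iff u and v are adjacent."""
    es = edge_set(edges)
    return lambda u, v: 1 if frozenset((u, v)) in es else 0


def incidence_oracle(edges: Iterable[Edge], n: int, d: int) -> Callable[[int, int], int]:
    """Incidence lists model: query (v, i), answer the i-th neighbor of v or 0."""
    nbrs = {v: [] for v in range(1, n + 1)}
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    if any(len(lst) > d for lst in nbrs.values()):
        raise ValueError("degree bound d exceeded")
    return lambda v, i: nbrs[v][i - 1] if 1 <= i <= len(nbrs[v]) else 0


def matrix_model_distance(e1: Iterable[Edge], e2: Iterable[Edge], n: int) -> float:
    """Differing adjacency-matrix entries as a fraction of n^2."""
    return 2 * len(edge_set(e1) ^ edge_set(e2)) / n ** 2


def lists_model_distance(e1: Iterable[Edge], e2: Iterable[Edge], n: int, d: int) -> float:
    """Edge modifications (counted at both endpoints) as a fraction of d*n."""
    return 2 * len(edge_set(e1) ^ edge_set(e2)) / (d * n)


# A path on 100 vertices versus the same path with the edge {50, 51} removed.
path = [(v, v + 1) for v in range(1, 100)]
cut = [e for e in path if e != (50, 51)]

print(adjacency_oracle(path)(50, 51), adjacency_oracle(cut)(50, 51))  # 1 0
print(matrix_model_distance(path, cut, 100))    # 0.0002
print(lists_model_distance(path, cut, 100, 2))  # 0.01
```

Under these assumptions the same single-edge change is fifty times larger in relative terms in the incidence lists representation, so for a proximity parameter such as $\eps=0.001$ the two graphs are close in one model and far in the other.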

Things get to the extreme in the study of locally testable codes, which may be viewed as revolving around testing whether the input is ``well formed'' with respect to some fixed error-correcting code. Interestingly, the general study of locally testable codes seeks an arbitrary succinct representation (i.e., a code of good rate) such that well-formed inputs (i.e., codewords) are far apart and testing well-formedness is easy (i.e., there exists a low complexity codeword test).

Caveat

Needless to say, this section falls very short of providing a comprehensive account of research in property testing. Furthermore, the presentation is biased towards models, and the numerous results obtained in the various models are barely mentioned. The same holds with respect to the next section.

A Brief Historical Perspective

Property testing first appeared as a tool towards program checking (see the linearity tester of [BLR]) and the construction of PCPs (see the low-degree tests and their relation to locally testable codes, as discussed in [RS]). In these settings it was natural to view the tested object as a function, and this convention continued also in [GGR], which defined property testing in relation to PAC learning. More importantly, in [GGR] property testing is promoted as a new type of computational problem, which transcends all its natural applications.

While [BLR,RS] focused on algebraic properties, the focus of [GGR] was on graph properties. From this perspective the choice of representation became less obvious, and oracle access was viewed as allowing local inspection of the graph rather than being the graph itself. The distinction between objects and their representations became clearer when an alternative representation of graphs was studied in [GR1,GR2]. At this point, query complexity that is polynomially related to the size of the object (e.g., its square root) was no longer considered prohibitive. This shift in scale is discussed next.

Recall that initially property testing was viewed as referring to functions that are implicitly defined by some succinct programs (as in the context of program checking) or by ``transcendental'' entities (as in the context of PAC learning). From this perspective the yardstick for efficiency is being polynomial in the length of the query, which means being polylogarithmic in the size of the object. However, when viewing property testing as being applied to (huge) objects that may exist in explicit form in reality, it is evident that any sub-linear complexity may be beneficial.

The realization that property testing may mean any algorithm that does not inspect its entire input seems crucial to the study of testing distributions, which emerged with [BFRSW]. In general, property testing became identified as a study of a special type of sublinear-time algorithms.

Another consequence of the aforementioned shift in scale is the decoupling of the representation from the query types. In the context of graph properties, this culminated in the model of [KKR].

Nevertheless, the study of testing properties within query complexity that only depends on the proximity parameter (and is thus totally independent of the size of the object) remains an appealing and natural direction. A remarkable result in this direction is the characterization of graph properties that are testable within such complexity in the adjacency matrix model [AFNS].

References


Table of contents (tentative, April 2010)


Related Material Available on-line


©Copyright 2010 by Oded Goldreich.
Permission to make copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that new copies bear this notice and the full citation on the first page. Abstracting with credit is permitted.

