This work refers to the model discussed in one of my choices from FOCS'14 (i.e., Preventing False Discovery in Interactive Data Analysis is Hard by Moritz Hardt and Jonathan Ullman). While that work established the hardness of having a mechanism that supports more than a quadratic (in the sample size) number of adaptively chosen queries, the current work shows that such a number of queries can indeed be supported. More generally, it reduces the task of reusing a holdout sample to the task of providing a privacy-preserving mechanism for the relevant queries (i.e., those required by the learning algorithm).
A great deal of effort has been made to reduce the risk of spurious scientific discoveries, from the use of holdout sets and sophisticated cross-validation techniques, to procedures for controlling the false discovery rate in multiple hypothesis testing. However, there is a fundamental disconnect between the theoretical results and the practice of science: The theory assumes a fixed collection of hypotheses to be tested, or learning algorithms to be applied, selected non-adaptively before the data are gathered, whereas science is by definition an adaptive process, in which data are shared and re-used, and hypotheses and new studies are generated on the basis of data exploration and previous outcomes.
Surprisingly, the challenges of adaptivity can be addressed using insights from differential privacy, a field of study supporting a definition of privacy tailored to private data analysis. As a corollary we show how to safely reuse a holdout set a great many times without undermining its power of "correctness protection," even when hypotheses and computations are chosen adaptively. Armed with this technique, the analyst is free to explore the data ad libitum, generating and evaluating hypotheses, verifying results on the holdout, and backtracking as needed.
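To make the reduction concrete, here is a minimal Python sketch of a noise-addition holdout mechanism in the spirit of the differential-privacy approach (a "Thresholdout"-style mechanism from the related literature). The function name, threshold, and noise scale are illustrative assumptions, not the paper's exact construction: the holdout is consulted only when the training estimate and the holdout estimate of a query disagree noticeably, and any value released from the holdout is perturbed with Laplace noise.

```python
import numpy as np

def noisy_holdout_answer(train, holdout, query,
                         threshold=0.04, sigma=0.01, rng=None):
    """Answer one adaptively chosen query (a function mapping a data
    point to [0, 1]) while limiting information leaked from the holdout.

    Hypothetical parameter choices; threshold and sigma are illustrative.
    """
    rng = rng if rng is not None else np.random.default_rng()
    train_val = float(np.mean([query(x) for x in train]))
    holdout_val = float(np.mean([query(x) for x in holdout]))
    # Noisy threshold test: if the training estimate already agrees with
    # the holdout estimate, answer from the training set alone, so the
    # holdout reveals nothing about this query.
    if abs(train_val - holdout_val) <= threshold + rng.laplace(0.0, sigma):
        return train_val
    # Otherwise release the holdout estimate perturbed with Laplace
    # noise, the standard differential-privacy noise-addition step.
    return holdout_val + rng.laplace(0.0, sigma)
```

Because each adaptively chosen query either never touches the holdout or is answered with calibrated noise, the holdout's "correctness protection" degrades far more slowly than under naive reuse, which is the point of the reduction to a privacy-preserving mechanism.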