Oblivious dimension reduction, à la the Johnson-Lindenstrauss (JL) Lemma, is a fundamental approach for processing high-dimensional data. We study this approach for Uniform Facility Location (UFL) on a Euclidean input $X\subset \R^d$, where facilities can lie in the ambient space (not restricted to $X$). Our main result is that target dimension $m=\tilde{O}(\epsilon^{-2} ddim)$ suffices to $(1+\epsilon)$-approximate the optimal value of UFL on inputs whose doubling dimension is bounded by ddim. This significantly improves over previous results, which could only achieve $O(1)$-approximation [Narayanan, Silwal, Indyk, and Zamir, ICML 2021] or dimension $m=O(\epsilon^{-2}\log n)$ for $n=|X|$, which follows from [Makarychev, Makarychev, and Razenshteyn, STOC 2019].
Our oblivious dimension reduction has immediate implications for streaming and offline algorithms, by employing known algorithms for low dimension. In dynamic geometric streams, it implies a $(1+\epsilon)$-approximation algorithm that uses $O(\epsilon^{-1}\log n)^{\tilde{O}(ddim/\epsilon^2)}$ bits of space, which is the first streaming algorithm for UFL to utilize the doubling dimension. In the offline setting, it implies a $(1+\epsilon)$-approximation algorithm, which we further refine to run in time $((1/\epsilon)^{\tilde{O}(ddim)} d + 2^{(1/\epsilon)^{\tilde{O}(ddim)}}) \cdot \tilde{O}(n)$. Prior work achieves a similar running time but requires some restriction on the facilities [Cohen-Addad, Feldmann and Saulpic, JACM 2021].
Our main technical contribution is a fast procedure to decompose an input X into several k-median instances for small k. This decomposition is inspired by, but differs in several significant ways from, [Czumaj, Lammersen, Monemizadeh and Sohler, SODA 2013], and is key to both our dimension reduction and our PTAS.
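To make the setup concrete, here is a minimal sketch (ours, not the paper's construction) of an oblivious map: a random linear projection drawn independently of the data, which one applies to X before running any low-dimensional UFL algorithm.

```python
import numpy as np

def oblivious_jl_map(d, m, rng=np.random.default_rng(0)):
    """An oblivious (data-independent) linear map R^d -> R^m: a random
    Gaussian matrix, drawn once before seeing any input point."""
    return rng.normal(size=(m, d)) / np.sqrt(m)

d, m = 1000, 20                    # m would be ~ eps^{-2} * ddim (up to logs)
G = oblivious_jl_map(d, m)
X = np.random.default_rng(1).normal(size=(50, d))  # toy input X in R^d
X_low = X @ G.T                    # solve UFL on X_low instead of X
```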
The Johnson-Lindenstrauss (JL) Lemma introduced the concept of dimension reduction via a random linear map, which has become a fundamental technique in many computational settings. For a set of n points in R^d and any fixed ϵ>0, it reduces the dimension d to O(log n) while preserving, with high probability, all the pairwise Euclidean distances within factor 1+ϵ. Perhaps surprisingly, the target dimension can be lower if one only wishes to preserve the optimal value of a certain problem, e.g., max-cut or k-means. However, for some notorious problems, like diameter (aka furthest pair), dimension reduction via the JL map to below O(log n) does not preserve the optimal value within factor 1+ϵ.
We propose to focus on another regime, of moderate dimension reduction, where a problem's value is preserved within factor α=O(1) (or even larger) using target dimension log n/poly(α). We establish the viability of this approach and show that the famous k-center problem is α-approximated when reducing to dimension O(log n/α^2+log k). Along the way, we address the diameter problem via the special case k=1. Our result extends to several important variants of k-center (with outliers, capacities, or fairness constraints), and the bound improves further with the input's doubling dimension.
While our poly(α)-factor improvement in the dimension may seem small, it actually has significant implications for streaming algorithms, and easily yields an algorithm for k-center in dynamic geometric streams that achieves O(α)-approximation using space poly(kd n^{1/α^2}). This is the first algorithm to beat O(n) space in high dimension d, as all previous algorithms require space at least exp(d). Furthermore, it extends to the k-center variants mentioned above.
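As a toy illustration of how one could estimate the k-center value in the reduced space, here is the classical greedy 2-approximation of Gonzalez (a standard algorithm, not this paper's contribution), applied after projecting the points:

```python
import numpy as np

def greedy_kcenter_value(P, k, rng=np.random.default_rng(0)):
    """Gonzalez's farthest-point greedy: returns a k-center cost that is
    at most twice the optimum of the point set P (an n x dim array)."""
    dist = np.linalg.norm(P - P[rng.integers(len(P))], axis=1)
    for _ in range(k - 1):
        i = int(np.argmax(dist))            # farthest point opens a center
        dist = np.minimum(dist, np.linalg.norm(P - P[i], axis=1))
    return dist.max()
```

Running this on the projected points (dimension O(log n/α^2 + log k)) estimates the original k-center value within a further factor α, per the result above.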
We design new algorithms for k-clustering in high-dimensional Euclidean spaces. These algorithms run in the Massively Parallel Computation (MPC) model, and are fully scalable, meaning that the local memory in each machine is $n^{\sigma}$ for an arbitrarily small fixed $\sigma>0$. Importantly, the local memory may be substantially smaller than k. Our algorithms take O(1) rounds and achieve O(1)-bicriteria approximation for k-Median and for k-Means, namely, they compute (1+ε)k clusters of cost within O(1/ε^2)-factor of the optimum. Previous work achieves only poly(log n)-bicriteria approximation [Bhaskara et al., ICML'18], or handles a special case [Cohen-Addad et al., ICML'22].
Our results rely on an MPC algorithm for O(1)-approximation of facility location in O(1) rounds. A primary technical tool that we develop, and may be of independent interest, is a new MPC primitive for geometric aggregation, namely, computing certain statistics on an approximate neighborhood of every data point, which includes range counting and nearest-neighbor search. Our implementation of this primitive works in high dimension, and is based on consistent hashing (aka sparse partition), a technique that was recently used for streaming algorithms [Czumaj et al., FOCS'22].
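To convey the flavor of this primitive, here is a minimal sequential sketch (ours; the paper's contribution is the MPC implementation and the stronger hashing guarantees): bucket points by a randomly shifted grid whose cells have bounded diameter, then aggregate a statistic per bucket.

```python
import numpy as np
from collections import Counter

def shifted_grid_aggregate(P, rho, rng=np.random.default_rng(0)):
    """Map each point of P (n x d) to a grid cell of side rho/sqrt(d), so every
    bucket has diameter at most rho; the random shift makes the map oblivious.
    Returns per-bucket counts, a basic range-counting-style statistic."""
    n, d = P.shape
    side = rho / np.sqrt(d)
    shift = rng.uniform(0, side, size=d)
    cells = [tuple(c) for c in np.floor((P + shift) / side).astype(int)]
    return cells, Counter(cells)  # each point reads the count of its bucket
```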
In cut sparsification, all cuts of a hypergraph H=(V,E,w) are approximated within 1±ϵ factor by a small hypergraph H′. This widely applied method was generalized recently to a setting where the cost of cutting each hyperedge e∈E is provided by a splitting function, $g_e : 2^e \to \mathbb{R}_+$. This generalization is called a submodular hypergraph when the functions $\{g_e\}_{e\in E}$ are submodular, and it arises in machine learning, combinatorial optimization, and algorithmic game theory. Previous work focused on the setting where H′ is a reweighted sub-hypergraph of H, and measured size by the number of hyperedges in H′. We study such sparsification, and also a more general notion of representing H succinctly, where size is measured in bits.
In the sparsification setting, where size is the number of hyperedges, we present three results: (i) all submodular hypergraphs admit sparsifiers of size polynomial in n=|V|; (ii) monotone-submodular hypergraphs admit sparsifiers of size $O(\epsilon^{-2}n^3)$; and (iii) we propose a new parameter, called spread, to obtain even smaller sparsifiers in some cases.
In the succinct-representation setting, we show that a natural family of splitting functions admits a succinct representation of much smaller size than via reweighted subgraphs (almost by factor n). This large gap is surprising because for graphs, the most succinct representation is attained by reweighted subgraphs. Along the way, we introduce the notion of deformation, where g_e is decomposed into a sum of functions of small description, and we provide upper and lower bounds for deformation of common splitting functions.
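For concreteness, a tiny evaluator (our own toy example) of a hypergraph cut under splitting functions, instantiated with the standard all-or-nothing function, which is submodular:

```python
def splitting_cut_cost(hyperedges, weights, S):
    """Cost of vertex set S: sum over hyperedges e of w_e * g_e(S ∩ e), here
    with the all-or-nothing splitting function g_e(A) = 1 iff e is cut by S."""
    total = 0.0
    for e, w in zip(hyperedges, weights):
        inside = len(S & e)
        if 0 < inside < len(e):       # some of e on each side: e is cut
            total += w
    return total

H = [frozenset({1, 2, 3}), frozenset({2, 4})]
print(splitting_cut_cost(H, [1.0, 2.0], {2, 3}))  # -> 3.0 (both edges cut)
```

A sparsifier H′ must reproduce such costs within 1±ϵ simultaneously for every S⊆V.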
Many streaming algorithms provide only a high-probability relative approximation. These two relaxations, of allowing approximation and randomization, seem necessary -- for many streaming problems, both relaxations must be employed simultaneously, to avoid an exponentially larger (and often trivial) space complexity. A common drawback of these randomized approximate algorithms is that independent executions on the same input have different outputs, that depend on their random coins. Pseudo-deterministic algorithms combat this issue, and for every input, they output with high probability the same ``canonical'' solution.
We consider perhaps the most basic problem in data streams, of counting the number of items in a stream of length at most $n$. Morris's counter [CACM, 1978] is a randomized approximation algorithm for this problem that uses O(loglog n) bits of space, for every fixed approximation factor (greater than 1). Goldwasser, Grossman, Mohanty and Woodruff [ITCS 2020] asked whether pseudo-deterministic approximation algorithms can match this space complexity. Our main result answers their question negatively, and shows that such algorithms must use $\Omega(\sqrt{\log n / \log\log n})$ bits of space.
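For reference, here is a minimal sketch (ours) of Morris's counter: it stores only an exponent X, so the space is O(loglog n) bits.

```python
import random

class MorrisCounter:
    """Approximate counter storing only X ~ log2(count), i.e. loglog(n) bits."""
    def __init__(self):
        self.X = 0
    def increment(self):
        if random.random() < 2.0 ** (-self.X):  # bump X w.p. 2^{-X}
            self.X += 1
    def estimate(self):
        return 2 ** self.X - 1                  # E[2^X - 1] = #increments

c = MorrisCounter()
for _ in range(100_000):
    c.increment()
print(c.estimate())  # unbiased; averaging copies sharpens the approximation
```

The output of this algorithm clearly varies across executions, which is exactly the behavior that pseudo-determinism rules out.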
Our approach is based on a problem that we call Shift Finding, and may be of independent interest. In this problem, one has query access to a shifted version of a known string $F\in\{0,1\}^{3n}$, which is guaranteed to start with $n$ zeros and end with $n$ ones, and the goal is to find the unknown shift using a small number of queries. We provide for this problem an algorithm that uses $O(\sqrt{n})$ queries. It remains open whether $\poly(\log n)$ queries suffice; if true, then our techniques immediately imply a nearly-tight $\Omega(\log n/\log\log n)$ space bound for pseudo-deterministic approximate counting.
Max-Cut is a fundamental problem that has been studied extensively in various settings. We study Euclidean Max-Cut, where the input is a set of points in R^d, in the model of dynamic geometric streams, that is, the input is $X\subseteq [\Delta]^d$ presented as a sequence of point insertions and deletions. Previous results by Frahling and Sohler [STOC'05] only address the low-dimensional regime, as their $(1+\epsilon)$-approximation algorithm uses space $\exp(d)$. We design the first streaming algorithms that use space $\poly(d)$, and are thus suitable for a high dimension d.
We tackle this challenge of high dimension using two well-known approaches. The first one is via dimension reduction, where we show that target dimension $\poly(\epsilon^{-1})$ suffices for the Johnson-Lindenstrauss transform to preserve Max-Cut within factor $(1\pm\epsilon)$. This result makes the prior $\exp(d)$-space algorithm applicable also in high dimension. The second approach is data reduction, based on importance sampling. We implement this scheme in streaming by employing a randomly-shifted quadtree. While this is a well-known method to construct a tree embedding, a key feature of our algorithm is that the distortion $O(d\log\Delta)$ affects only the space requirement $\poly(\epsilon^{-1} d\log\Delta)$, and not the approximation ratio $1+\epsilon$. These results are in line with the growing interest and recent results on streaming (and other) algorithms for high dimension.
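A minimal sketch (ours) of the randomly-shifted quadtree used in the second approach: draw one random shift, and at level ℓ assign each point the id of its grid cell of side 2^ℓ.

```python
import numpy as np

def quadtree_cell_ids(P, Delta, rng=np.random.default_rng(0)):
    """For points P (n x d) in {0..Delta}^d, return per-level cell ids of a
    quadtree whose grids are all shifted by one random vector in [0,Delta)^d."""
    d = P.shape[1]
    shift = rng.uniform(0, Delta, size=d)
    num_levels = int(np.ceil(np.log2(2 * Delta))) + 1
    return [[tuple(c) for c in ((P + shift) // 2 ** lvl).astype(int)]
            for lvl in range(num_levels)]  # same tuple = same cell at level lvl
```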
We study the classical metric k-median clustering problem over a set of input rankings (i.e., permutations), which has myriad applications, from social-choice theory to web search and databases. A folklore algorithm provides a 2-approximate solution in polynomial time for all $k=O(1)$, and works irrespective of the underlying distance measure, so long as it is a metric; however, going below the 2-factor is a notorious challenge. We consider the Ulam distance, a variant of the well-known edit-distance metric, where strings are restricted to be permutations. For this metric, Chakraborty, Das, and Krauthgamer [SODA, 2021] provided a $(2-\delta)$-approximation algorithm for $k=1$, where $\delta\approx 2^{-40}$.
Our primary contribution is a new algorithmic framework for clustering a set of permutations. Our first result is a 1.999-approximation algorithm for the metric k-median problem under the Ulam metric, which runs in time $(k \log (nd))^{O(k)}n d^3$ for an input consisting of n permutations over [d]. In fact, our framework is powerful enough to extend this result to the streaming model (where the n input permutations arrive one by one) using only polylogarithmic (in n) space. Additionally, we show that similar results can be obtained even in the presence of outliers, which is presumably a more difficult problem.
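For context, the Ulam distance itself is computable in O(d log d) time: under the convention that one move (a deletion plus an insertion) costs 1, it equals d minus the length of the longest common subsequence, which for permutations reduces to a longest increasing subsequence. A sketch of this standard fact (our code):

```python
from bisect import bisect_left

def ulam(x, y):
    """Ulam distance between two permutations of the same d symbols:
    d - LCS(x, y), where LCS of permutations reduces to an LIS computation."""
    pos = {v: i for i, v in enumerate(x)}
    tails = []                       # patience sorting computes the LIS
    for s in (pos[v] for v in y):    # y rewritten in x's coordinates
        i = bisect_left(tails, s)
        if i == len(tails):
            tails.append(s)
        else:
            tails[i] = s
    return len(x) - len(tails)

print(ulam([1, 2, 3, 4], [2, 3, 4, 1]))  # -> 1 (one symbol move suffices)
```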
The edit distance between strings classically assigns unit cost to every character insertion, deletion, and substitution, whereas the Hamming distance only allows substitutions.
In many real-life scenarios, insertions and deletions (abbreviated indels) occur much less frequently than substitutions. To model this, we consider substitutions of unit cost and indels of cost $a\ge 1$, resulting in the weighted edit distance ED_a.
We first present a simple deterministic exact algorithm for ED_a and further prove that it is near-optimal assuming the Orthogonal Vectors Conjecture.
Our main result is a randomized algorithm computing a $(1+\epsilon)$-approximation of $ED_a(X,Y)$, given strings X,Y of total length n and a bound $k\ge ED_a(X,Y)$.
For simplicity, let us focus on $k\geq 1$ and a constant $\epsilon>0$;
then, our algorithm takes $\tilde{O}(\frac{n}{a} + ak^3)$ time.
Unless $a=\tilde{O}(1)$ (in which case ED_a resembles the standard edit distance), this running time is sublinear in n for the most interesting regime of small enough k.
We also consider a very natural version that asks to find a (k_I, k_S)-alignment, i.e., an alignment with at most k_I indels and k_S substitutions.
In this setting, we give an exact algorithm and, more importantly, an $\tilde{O}(\frac{n k_I}{k_S}+k_S k_I^3)$-time $(1,1+\epsilon)$-bicriteria approximation algorithm.
The latter solution is based on the techniques we develop for ED_a for $a=\Theta(\frac{k_S}{k_I})$, and its running time is again sublinear in n whenever $k_I \ll k_S$ and the overall distance is small enough.
These bounds are in stark contrast to unit-cost edit distance, where state-of-the-art algorithms are far from achieving $(1+\epsilon)$-approximation in sublinear time, even for a favorable choice of k.
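As a baseline for comparison, here is the textbook quadratic-time dynamic program for ED_a (the bound that the algorithms above improve upon; code is ours):

```python
def ed_a(X, Y, a):
    """Weighted edit distance: substitutions cost 1, indels cost a >= 1."""
    n, m = len(X), len(Y)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * a                            # delete a prefix of X
    for j in range(1, m + 1):
        dp[0][j] = j * a                            # insert a prefix of Y
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j] + a,        # delete X[i-1]
                           dp[i][j - 1] + a,        # insert Y[j-1]
                           dp[i - 1][j - 1] + (X[i - 1] != Y[j - 1]))
    return dp[n][m]
```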
We study distributed schemes for high-dimensional sparse linear regression, based on orthogonal matching pursuit (OMP). Such schemes are particularly suited for settings where a central fusion center is connected to end machines that have both computation and communication limitations. We prove that under suitable assumptions, distributed-OMP schemes recover the support of the regression vector with communication per machine linear in its sparsity and logarithmic in the dimension. Remarkably, this holds even at low signal-to-noise-ratios, where individual machines are unable to detect the support. Our simulations show that distributed-OMP schemes are competitive with more computationally intensive methods, and in some cases even outperform them.
Given a large edge-capacitated network G and a subset of k vertices called terminals, an (exact) flow sparsifier is a small network G′ that preserves (exactly) all multicommodity flows that can be routed between the terminals. Flow sparsifiers were introduced by Leighton and Moitra [STOC 2010], and have been studied and used in many algorithmic contexts.
A fundamental question that remained open for over a decade asks whether every k-terminal network admits an exact flow sparsifier whose size is bounded by some function f(k) (regardless of the size of G or its capacities). We resolve this question in the negative by proving that there exist 6-terminal networks G whose flow sparsifiers G′ must have arbitrarily large size. This unboundedness is perhaps surprising, since the analogous sparsification that preserves all terminal cuts (called exact cut sparsifier or mimicking network) admits sparsifiers of size $f_0(k)\leq 2^{2^k}$ [Hagerup, Katajainen, Nishimura, and Ragde, JCSS 1998].
We prove our results by analyzing the set of all feasible demands in the network, known as the demand polytope. We identify an invariant of this polytope, essentially the slope of certain facets, that can be made arbitrarily large even for k=6, and implies an explicit lower bound on the size of the network. We further use this technique to answer, again in the negative, an open question of Seymour [JCTB 2015] regarding flow-sparsification that uses only contractions and preserves the infeasibility of one demand vector.
In vertex-cut sparsification, given a graph G=(V,E) with a terminal set T⊆V, we wish to construct a graph G′=(V′,E′) with T⊆V′, such that for every two sets of terminals A,B⊆T, the size of a minimum (A,B)-vertex-cut in G′ is the same as in G. In the most basic setting, G is unweighted and undirected, and we wish to bound the size of G′ by a function of k=|T|. Kratsch and Wahlström [JACM 2020] proved that every graph G (possibly directed), admits a vertex-cut sparsifier G′ with O(k^3) vertices, which can in fact be constructed in randomized polynomial time.
We study (possibly directed) graphs G that are quasi-bipartite, i.e., every edge has at least one endpoint in T, and prove that they admit a vertex-cut sparsifier with O(k^2) edges and vertices, which can in fact be constructed in deterministic polynomial time. This bound naturally extends to all graphs with a small separator into bounded-size sets. Finally, we prove information-theoretically a nearly-matching lower bound, i.e., that $\tilde\Omega(k^2)$ edges are required to sparsify quasi-bipartite undirected graphs.
Motivated by practical generalizations of the classic k-median and k-means objectives,
such as clustering with size constraints, fair clustering, and Wasserstein barycenter,
we introduce a meta-theorem for designing coresets for constrained-clustering problems.
The meta-theorem reduces the task of coreset construction to one on a bounded number of ring instances with a much-relaxed additive error.
This reduction enables us to construct coresets using uniform sampling,
in contrast to the widely-used importance sampling,
and consequently we can easily handle constrained objectives.
Notably and perhaps surprisingly, this simpler sampling scheme can yield
coresets whose size is independent of $n$, the number of input points.
Our technique yields smaller coresets, and sometimes the first coresets,
for a large number of constrained clustering problems,
including capacitated clustering, fair clustering,
Euclidean Wasserstein barycenter, clustering in minor-excluded graphs,
and polygon clustering under the Fréchet and Hausdorff distances.
Finally, our technique yields also smaller coresets
for 1-median in low-dimensional Euclidean spaces,
specifically of size $\tilde{O}(\varepsilon^{-1.5})$ in $\mathbb{R}^2$
and $\tilde{O}(\varepsilon^{-1.6})$ in $\mathbb{R}^3$.
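To illustrate how simple the resulting sampling step is (a toy sketch, ours; the meta-theorem supplies the reduction to ring instances and the additive-error analysis):

```python
import numpy as np

def uniform_coreset(P, s, rng=np.random.default_rng(0)):
    """Uniformly sample s points and weight each by n/s; for any fixed centers,
    the weighted cost is an unbiased estimator of the true clustering cost."""
    idx = rng.choice(len(P), size=s, replace=True)
    return P[idx], np.full(s, len(P) / s)

def weighted_kmedian_cost(P, w, C):
    d = np.linalg.norm(P[:, None, :] - C[None, :, :], axis=2).min(axis=1)
    return float(w @ d)
```

Because the samples are uniform, no per-point importance scores are needed, which is what makes constrained objectives easy to handle.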
In Euclidean Uniform Facility Location,
the input is a set of clients in R^d
and the goal is to place facilities to serve them,
so as to minimize the total cost of opening facilities plus connecting the clients.
We study the classical setting of dynamic geometric streams,
where the clients are presented as a sequence of insertions and deletions
of points in the grid {1,...,\Delta}^d,
and we focus on the high-dimensional regime,
where the algorithm's space complexity
must be polynomial (and certainly not exponential) in $d\log\Delta$.
We present a new algorithmic framework,
based on importance sampling from the stream,
for O(1)-approximation of the optimal cost
using only \poly(d\log\Delta) space.
This framework is easy to implement in two passes,
one for sampling points and the other for estimating their contribution.
Over random-order streams, we can extend this to a one-pass algorithm
by using the two halves of the stream separately.
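The estimator underlying this framework is plain importance sampling (a generic sketch, ours): sample each index with probability proportional to a coarse estimate of its contribution, then average the inverse-probability-weighted contributions.

```python
import numpy as np

def importance_estimate(contrib, p, s, rng=np.random.default_rng(0)):
    """Unbiased estimate of sum(contrib): draw s indices i ~ p (a probability
    vector) and average contrib[i]/p[i]; the variance is low when p is
    roughly proportional to contrib."""
    contrib, p = np.asarray(contrib), np.asarray(p)
    idx = rng.choice(len(contrib), size=s, p=p)
    return float(np.mean(contrib[idx] / p[idx]))
```

In the two-pass implementation, the first pass produces the samples and the second pass measures their exact contributions.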
Our main result, for arbitrary-order streams,
computes an O(d/log d)-approximation in one pass
by using the new framework but combining the two passes differently.
This improves upon previous algorithms that either need space exp(d)
or only guarantee O(d \log^2 \Delta)-approximation,
and therefore our algorithms for high dimension
are the first to avoid the O(\log\Delta)-factor in approximation
that is inherent to the widely-used quadtree decomposition.
Our improvement is achieved by employing a geometric hashing scheme
that maps points in R^d into buckets of bounded diameter,
with the key property that every point set of small-enough diameter
is hashed into few buckets.
By applying an alternative bound for this hashing,
we also obtain an $O(1/\epsilon)$-approximation in one pass,
using larger but still sublinear space $O(n^{\epsilon})$,
where n is the number of clients.
Finally, we complement our results with a proof that
every streaming $1.085$-approximation algorithm
requires space exponential in \poly(d\log\Delta),
even for insertion-only streams.
Matrix sparsification is a well-known approach in the design of efficient algorithms, where one approximates a matrix A with a sparse matrix A′. Achlioptas and McSherry [2007] initiated a long line of work on spectral-norm sparsification, which aims to guarantee that $\|A'-A\|\leq \epsilon \|A\|$ for error parameter $\epsilon>0$. Various forms of matrix approximation motivate considering this problem with a guarantee according to the Schatten p-norm for general p, which includes the spectral norm as the special case $p=\infty$. We investigate the relation between fixed but different $p \neq q$, that is, whether sparsification in Schatten p-norm implies (existentially and/or algorithmically) sparsification in Schatten q-norm with similar sparsity. An affirmative answer could be tremendously useful, as it would identify which value of p to focus on. Our main finding is a surprising contrast between this question and the analogous case of $\ell_p$-norm sparsification for vectors: For vectors, the answer is affirmative for p < q and negative for p > q, but for matrices we answer negatively for almost all $p\neq q$.
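For reference, the Schatten p-norm is the $\ell_p$-norm of the singular values; a short reference implementation (ours) of the quantity being preserved:

```python
import numpy as np

def schatten(A, p):
    """Schatten p-norm: l_p norm of singular values (p = inf: spectral norm)."""
    s = np.linalg.svd(A, compute_uv=False)
    return float(s.max()) if np.isinf(p) else float((s ** p).sum() ** (1 / p))

# Sparsification in Schatten p-norm asks for a sparse A1 with
#   schatten(A1 - A, p) <= eps * schatten(A, p).
```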
In 1961, Gomory and Hu showed that the max-flow values of all ${n\choose 2}$ pairs of vertices in an undirected graph can be computed using only n-1 calls to any max-flow algorithm, and succinctly represented them in a tree (later called the Gomory-Hu tree). Even assuming a linear-time max-flow algorithm, this yields a running time of O(mn) for this problem; with current max-flow algorithms, the running time is $\tilde O(mn+n^{5/2})$. We break this 60-year-old barrier by giving an $\tilde O(n^2)$-time algorithm for the Gomory-Hu tree problem, already with current max-flow algorithms. For unweighted graphs, our techniques show equivalence (up to poly-logarithmic factors in running time) between Gomory-Hu tree (i.e., all-pairs max-flow values) and a single-pair max-flow.
We study the problem of approximating edit distance in sublinear time. This is formalized as a promise problem (k,k^c)-Gap Edit Distance, where the input is a pair of strings X,Y and parameters k,c > 1, and the goal is to return YES if ED(X,Y) <= k and NO if ED(X,Y) > k^c. Recent years have witnessed significant interest in designing sublinear-time algorithms for Gap Edit Distance.
We resolve the non-adaptive query complexity of Gap Edit Distance, improving over several previous results. Specifically, we design a non-adaptive algorithm with query complexity $\tilde O(n/k^{c-0.5})$, and further prove that this bound is optimal up to polylogarithmic factors.
Our algorithm also achieves optimal time complexity $\tilde O(n/k^{c-0.5})$ whenever $c \ge 1.5$. For $1 < c < 1.5$, the running time of our algorithm is $\tilde O(n/k^{2c-1})$. For the restricted case of $k^c=\Omega(n)$, this matches a known result [Batu, Ergün, Kilian, Magen, Raskhodnikova, Rubinfeld, and Sami, STOC 2003], and in all other (nontrivial) cases, our running time is strictly better than all previous algorithms, including the adaptive ones.
We devise new cut sparsifiers that are related to
the classical sparsification of Nagamochi and Ibaraki [Algorithmica, 1992],
which is an algorithm that, given an unweighted graph G on n nodes and a parameter k,
computes a subgraph with O(nk) edges that preserves all cuts of value up to k.
We put forward the notion of a friendly cut sparsifier,
which is a minor of $G$ that preserves all friendly cuts of value up to k,
where a cut in $G$ is called friendly if every node has more edges
connecting it to its own side of the cut than to the other side.
We present an algorithm that, given a simple graph G,
computes in almost-linear time
a friendly cut sparsifier with $\tilde{O}(n \sqrt{k})$ edges.
Using similar techniques,
we also show how, given in addition a terminal set T,
one can compute in almost-linear time a \emph{terminal sparsifier},
which preserves the minimum $st$-cut between every pair of terminals,
with $\tilde{O}(n \sqrt{k} + |T| k)$ edges.
Plugging these sparsifiers into the recent $n^{2+o(1)}$-time algorithms for constructing a Gomory-Hu tree of simple graphs,
along with a relatively simple procedure for handling the unfriendly minimum cuts,
we improve the running time for moderately dense graphs
(e.g., with $m=n^{1.75}$ edges).
In particular, assuming a linear-time Max-Flow algorithm,
the new state-of-the-art for Gomory-Hu tree is
the minimum between our $(m+n^{1.75})^{1+o(1)}$ and the known $m n^{1/2+o(1)}$.
We further investigate the limits of this approach and the possibility of better sparsification.
Under the hypothesis that an $\tilde{O}(n)$-edge sparsifier that preserves all friendly minimum $st$-cuts can be computed efficiently, our upper bound improves to $\tilde{O}(m+n^{1.5})$, which is the best possible without breaking the cubic barrier for constructing Gomory-Hu trees in non-simple graphs.
We devise the first coreset for kernel k-Means, and use it to obtain new, more efficient, algorithms. Kernel k-Means has superior clustering capability compared to classical k-Means particularly when clusters are separable non-linearly, but it also introduces significant computational challenges. We address this computational issue by constructing a coreset, which is a reduced dataset that accurately preserves the clustering costs.
Our main result is the first coreset for kernel k-Means, whose size is independent of the number of input points n, and moreover is constructed in time near-linear in n. This result immediately implies new algorithms for kernel k-Means, such as a (1+epsilon)-approximation in time near-linear in n, and a streaming algorithm using space and update time $poly(k\epsilon^{-1}\log n)$.
We validate our coreset on various datasets with different kernels. Our coreset performs consistently well, achieving small errors while using very few points. We show that our coresets can speed up kernel k-Means++ (the kernelized version of the widely used k-Means++ algorithm), and we further use this faster kernel k-Means++ for spectral clustering. In both applications, we achieve up to 1000x speedup while the error is comparable to baselines that do not use coresets.
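For reference, the kernel k-Means cost can be evaluated from the kernel matrix alone, via the standard identity used throughout such experiments (code is ours):

```python
import numpy as np

def kernel_kmeans_cost(K, labels):
    """Clustering cost in feature space from kernel matrix K: for each cluster
    C it equals  sum_{i in C} K[i,i] - (1/|C|) * sum_{i,j in C} K[i,j]."""
    cost = 0.0
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        cost += K[idx, idx].sum() - K[np.ix_(idx, idx)].sum() / len(idx)
    return cost
```

A coreset must preserve this cost up to a 1±ε factor, simultaneously for every set of k centers.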
We consider an approximate version of the trace reconstruction problem,
where the goal is to recover an unknown string $s\in\{0,1\}^n$
from $m$ traces
(each trace is generated independently by passing $s$ through a probabilistic
insertion-deletion channel with rate $p$).
We present a deterministic algorithm for the average-case model,
where $s$ is random, that uses only \emph{three} traces.
It runs in near-linear time $\tilde O(n)$
and with high probability reports a string
within edit distance $O(\epsilon p n)$ from $s$ for $\epsilon=\tilde O(p)$,
which significantly improves over the straightforward bound of $O(pn)$.
Technically, our algorithm computes a $(1+\epsilon)$-approximate median
of the three input traces.
To prove its correctness, our probabilistic analysis shows
that an approximate median is indeed close to the unknown $s$.
To achieve a near-linear time bound, we have to bypass the well-known
dynamic programming algorithm that computes an optimal median in time $O(n^3)$.
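A minimal simulation (ours) of the channel assumed above, under one common convention where each position independently suffers insertions and a deletion at rate p:

```python
import random

def trace(s, p, rng):
    """One trace of bit-string s: before each position, insert uniformly
    random bits w.p. p (repeatedly), then delete the original bit w.p. p."""
    out = []
    for bit in s:
        while rng.random() < p:
            out.append(rng.randint(0, 1))  # insertion of a random bit
        if rng.random() >= p:
            out.append(bit)                # bit survives (deleted w.p. p)
    return out

s = [random.Random(42).randint(0, 1) for _ in range(1000)]     # random s
traces = [trace(s, 0.05, random.Random(i)) for i in range(3)]  # three traces
```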
We provide the first coreset for clustering points in R^d that have multiple missing values (coordinates).
Previous coreset constructions only allow one missing coordinate.
The challenge in this setting is that objective functions, like k-Means,
are evaluated only on the set of available (non-missing) coordinates,
which varies across points.
Recall that an \epsilon-coreset of a large dataset
is a small proxy, usually a reweighted subset of points,
that (1+\epsilon)-approximates the clustering objective
for every possible center set.
Our coresets for k-Means and k-Median clustering have size
$(jk)^{O(\min(j,k))} (\epsilon^{-1} d \log n)^2$,
where $n$ is the number of data points, $d$ is the dimension and $j$ is the maximum number of missing coordinates for each data point.
We further design an algorithm to construct these coresets in near-linear time,
and consequently improve a recent quadratic-time PTAS
for \kMeans with missing values [Eiben et al., SODA 2021]
to near-linear time.
We validate our coreset construction,
which is based on importance sampling and is easy to implement,
on various real data sets.
Our coreset exhibits a flexible tradeoff between coreset size and accuracy,
and generally outperforms the uniform-sampling baseline.
Furthermore, it significantly speeds up a Lloyd's-style heuristic for k-Means with missing values.
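To make the objective concrete, here is a NaN-aware evaluation of the k-Means cost with missing coordinates (a sketch, ours): each point is charged to its nearest center, measured only on its available coordinates.

```python
import numpy as np

def kmeans_cost_missing(P, C):
    """k-Means cost where P (n x d) may contain NaN (missing) entries: each
    point's squared distance to a center sums only over available coordinates."""
    cost = 0.0
    for x in P:
        avail = ~np.isnan(x)                       # coordinates present in x
        d2 = ((C[:, avail] - x[avail]) ** 2).sum(axis=1)
        cost += float(d2.min())                    # nearest center wins
    return cost
```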
We design an $n^{2+o(1)}$-time algorithm that constructs a cut-equivalent (Gomory-Hu) tree of a simple graph on n nodes.
This bound is almost-optimal in terms of n, and it improves on the recent $\tilde{O}(n^{2.5})$ bound by the authors (STOC 2021), which was the first to break the cubic barrier.
Consequently, the All-Pairs Maximum-Flow (APMF) problem has time complexity $n^{2+o(1)}$, and for the first time in history, this problem can be solved faster than All-Pairs Shortest Paths (APSP).
We further observe that an almost-linear time algorithm (in terms of the number of edges m) is not possible without first obtaining a subcubic algorithm for multigraphs.
Finally, we derandomize our algorithm, obtaining the first subcubic deterministic algorithm for Gomory-Hu Tree in simple graphs, showing that randomness is not necessary for beating the 1961 bound of n-1 max-flow computations.
The upper bound is $\tilde{O}(n^{2\frac{2}{3}})$ and it would improve to $n^{2+o(1)}$ if there is a deterministic single-pair maximum-flow algorithm that runs in almost-linear time.
The key novelty is in using a ``dynamic pivot'' technique instead of the randomized pivot selection that was central in recent works.
Graph sparsification has been studied extensively over the past two decades,
culminating in spectral sparsifiers of optimal size (up to constant factors).
Spectral hypergraph sparsification is a natural analogue of this problem,
for which optimal bounds on the sparsifier size are not known,
mainly because the hypergraph Laplacian is non-linear,
and thus lacks the linear-algebraic structure and tools
that have been so effective for graphs.
Our main contribution is the first algorithm for constructing \epsilon-spectral sparsifiers for hypergraphs with $O^*(n)$ hyperedges,
where $O^*$ suppresses $(\epsilon^{-1} \log n)^{O(1)}$ factors.
This bound is independent of the rank r (maximum cardinality of a hyperedge), and is essentially best possible due to a recent bit complexity lower bound of \Omega(nr) for hypergraph sparsification.
This result is obtained by introducing two new tools.
First, we give a new proof of spectral concentration bounds for sparsifiers of graphs;
it avoids linear-algebraic methods,
such as the usual application of the matrix Bernstein inequality,
and therefore applies to the (non-linear) hypergraph setting. To achieve the result, we design a new sequence of hypergraph-dependent \epsilon-nets on the unit sphere in R^n.
Second, we extend the weight-assignment technique of Chen, Khanna and Nagda [FOCS'20] to the spectral sparsification setting. Surprisingly, the number of spanning trees after the weight assignment can serve as a potential function guiding the reweighting process in the spectral setting.
Large matrices are often accessed as a row-order stream. We consider the setting where rows are time-sensitive (i.e., they expire), which can be described by the sliding-window row-order model, and provide the first $(1+\epsilon)$-approximation of Schatten p-norms in this setting. Our main technical contribution is a proof that Schatten p-norms in row-order streams are smooth, and thus fit the smooth-histograms technique of Braverman and Ostrovsky (FOCS 2007) for sliding-window streams.
We consider the problem of sparse normal means estimation in a distributed setting with communication constraints. We assume there are M machines, each holding a d-dimensional observation of a K-sparse vector \mu corrupted by additive Gaussian noise. A central fusion machine is connected to the M machines in a star topology, and its goal is to estimate the vector \mu with a low communication budget. Previous works have shown that to achieve the centralized minimax rate for the l2 risk, the total communication must be high - at least linear in the dimension d. This phenomenon occurs, however, at very weak signals. We show that once the signal-to-noise ratio (SNR) is slightly higher, the support of \mu can be correctly recovered with much less communication. Specifically, we present two algorithms for the distributed sparse normal means problem, and prove that above a certain SNR threshold, with high probability, they recover the correct support with total communication that is sublinear in the dimension d. Furthermore, the communication decreases exponentially as a function of signal strength. If in addition KM<
Every undirected graph G has a (weighted) cut-equivalent tree, known as a Gomory-Hu tree: a tree on the same nodes in which the minimum-cut value (equivalently, max-flow value) between every pair of nodes equals that in G.
We give the first subcubic-time algorithm that constructs such a tree
for an unweighted graph G.
Its time complexity is $\tilde{O}(n^{2.75})$, for n=|V(G)|;
previously, only $\tilde{O}(n^3)$ was known,
except for restricted cases like sparse graphs.
Consequently, we obtain the first algorithm
for All-Pairs Max-Flow in unweighted graphs that breaks the cubic-time barrier.
Under the hypothesis that single-pair Max-Flow can be computed in almost-linear time,
our time bound improves further to $n^{2.5+o(1)}$.
Even under this assumption, no previously known algorithm is subcubic.
Cut and spectral sparsification of graphs have numerous applications,
including e.g. speeding up algorithms for cuts and Laplacian solvers.
These powerful notions have recently been extended to hypergraphs,
which are much richer and may offer new applications. However, the current bounds on the size of hypergraph sparsifiers are not as tight as the corresponding bounds for graphs.
Our first result is a polynomial-time algorithm that,
given a hypergraph on $n$ vertices with maximum hyperedge size r,
outputs an $\epsilon$-spectral sparsifier
with $O^*(nr)$ hyperedges, where $O^*$ suppresses $(\epsilon^{-1} \log n)^{O(1)}$ factors.
This size bound improves the two previous bounds:
$O^*(n^3)$ [Soma and Yoshida, SODA'19]
and $O^*(nr^3)$ [Bansal, Svensson and Trevisan, FOCS'19]. Our main technical tool is a new method for proving concentration of the nonlinear analogue of the quadratic form of the Laplacians for hypergraph expanders.
We complement this with lower bounds on the bit complexity of any compression scheme that $(1+\epsilon)$-approximates all the cuts in a given hypergraph,
and hence also on the bit complexity of every $\epsilon$-cut/spectral sparsifier.
These lower bounds are based on Ruzsa-Szemeredi graphs,
and a particular instantiation yields an $\Omega(nr)$ lower bound on the bit complexity even for fixed constant $\epsilon$. This is tight up to polylogarithmic factors in $n$, due to recent hypergraph cut sparsifiers of [Chen, Khanna and Nagda, FOCS'20].
Finally, for directed hypergraphs,
we present an algorithm that computes an $\epsilon$-spectral sparsifier
with $O^*(n^2r^3)$ hyperarcs,
where $r$ is the maximum size of a hyperarc.
For small $r$, this improves over $O^*(n^3)$
known from [Soma and Yoshida, SODA'19],
and is getting close to the trivial lower bound of $\Omega(n^2)$ hyperarcs.
We consider a natural generalization of the Steiner tree problem, the Steiner forest problem, in the Euclidean plane: the input is a multiset $X \subseteq R^2$, partitioned into k color classes $C_1, C_2, \ldots, C_k \subseteq X$. The goal is to find a minimum-cost Euclidean graph G such that every color class C_i is connected in G.
We study this Steiner forest problem in the streaming setting, where the stream consists of insertions and deletions of points to $X$.
Each input point $x\in X$ arrives with its color $color(x) \in [k]$,
and as usual for dynamic geometric streams,
the input points are restricted to the discrete grid $\{0,\ldots,\Delta\}^2$.
We design a single-pass streaming algorithm that uses $poly(k \log\Delta)$ space and time,
and estimates the cost of an optimal Steiner forest solution
within ratio arbitrarily close to the famous Euclidean Steiner ratio $\alpha_2$
(currently $1.1547 \leq \alpha_2 \leq 1.214$).
Our approach relies on a novel combination of streaming techniques, like sampling and linear sketching, with the classical dynamic-programming framework for geometric optimization problems, which usually requires large memory and has so far not been applied in the streaming setting.
We complement our streaming algorithm for the Steiner forest problem
with simple arguments showing that any finite approximation
requires $\Omega(k)$ bits of space.
In addition, our approximation ratio is currently the best
even for streaming Steiner tree, i.e., k=1.
Many real-world data sets are sparse or almost sparse.
One method to measure this for a matrix $A\in R^{n\times n}$
is the numerical sparsity, denoted $\ns(A)$,
defined as the minimum $k\geq 1$ such that
$||a||_1/||a||_2 \leq \sqrt{k}$
for every row and every column $a$ of $A$.
This measure is smooth, and it is clearly at most
the number of non-zeros in the row/column $a$.
The seminal work of Achlioptas and McSherry [2007]
has put forward the question of approximating an input matrix $A$ by entrywise sampling.
More precisely, the goal is to quickly compute a sparse matrix $\tilde{A}$
satisfying $||A - \tilde{A}||_2 \leq \epsilon ||A||_2$
(i.e., additive spectral approximation) given an error parameter $\epsilon>0$.
The known schemes sample and rescale a small fraction of entries from $A$.
We propose a scheme that sparsifies an almost-sparse matrix $A$ ---
it produces a matrix $\tilde{A}$ with $O(\epsilon^{-2}\ns(A) n\ln n)$ non-zero entries with high probability.
We also prove that this upper bound on $\nnz(\tilde{A})$
is tight up to logarithmic factors.
Moreover, our upper bound improves when the spectrum of $A$ decays quickly
(roughly replacing $n$ with the stable rank of $A$).
Our scheme can be implemented in time $O(\nnz(A))$ when $\|A\|_2$ is given.
Previously, a similar upper bound was obtained by Achlioptas et al. [2013], but only for a restricted class of inputs that does not even include symmetric or covariance matrices.
Finally, we demonstrate two applications of these sampling techniques,
to faster approximate matrix multiplication,
and to ridge regression by using sparse preconditioners.
We study approximation algorithms for variants of the median string problem,
which asks for a string that minimizes
the sum of edit distances from a given set of m strings of length n.
Only the straightforward 2-approximation is known for this NP-hard problem.
This problem is motivated e.g. by computational biology,
and belongs to the class of median problems (over different metric spaces),
which are fundamental tasks in data analysis.
Our main result is for the Ulam metric,
where all strings are permutations over [n]
and each edit operation moves a symbol (deletion plus insertion). We devise for this problem an algorithm
that breaks the 2-approximation barrier,
i.e., computes a (2-\delta)-approximate median permutation
for some constant \delta>0, in time $\tilde{O}(nm^2+n^3)$. We further use these techniques to achieve a $(2-\delta)$-approximation
for the median string problem in the special case
where the median is restricted to length n
and the optimal objective is large, namely $\Omega(mn)$.
We also design an approximation algorithm for the following
probabilistic model of the Ulam median:
the input consists of m perturbations of an (unknown) permutation x,
each generated by moving every symbol to a random position
with probability (a parameter) \epsilon>0.
Our algorithm computes with high probability
a (1+o(1/\epsilon))-approximate median permutation in time O(mn^2+n^3).
Min-Cut queries are fundamental: Preprocess an undirected edge-weighted graph,
to quickly report a minimum-weight cut that separates a query pair of nodes s,t.
The best data structure known for this problem simply builds a cut-equivalent tree, discovered 60 years ago by Gomory and Hu,
who also showed how to construct it using n-1 minimum st-cut computations.
Using state-of-the-art algorithms for minimum $st$-cut (Lee and Sidford, FOCS 2014),
one can construct the tree in time $\tilde{O}(mn^{3/2})$,
which is also the preprocessing time of the data structure.
(Throughout, we focus on polynomially-bounded edge weights,
noting that faster algorithms are known for small/unit edge weights,
and use n and m for the number of nodes and edges in the graph.)
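For reference, here is Gusfield's simplification of the classical Gomory-Hu construction (n-1 min-cut calls and no contractions; the code is ours, parameterized by an arbitrary minimum st-cut oracle):

```python
def gomory_hu_tree(n, min_cut):
    """Build a cut-equivalent tree on nodes 0..n-1 using n-1 oracle calls.
    min_cut(s, t) must return (value, S), where S is the set of nodes on the
    s-side of some minimum st-cut. Returns edges (i, parent[i], weight[i])."""
    parent = [0] * n
    weight = [0.0] * n
    for i in range(1, n):
        value, S = min_cut(i, parent[i])
        weight[i] = value
        for j in range(i + 1, n):       # re-hang nodes that i cut off
            if j in S and parent[j] == parent[i]:
                parent[j] = i
    return [(i, parent[i], weight[i]) for i in range(1, n)]
```

A Min-Cut query (s,t) is then answered by the minimum edge weight on the tree path between s and t.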
Our main result shows the following equivalence:
Cut-equivalent trees can be constructed in near-linear time if and only if
there is a data structure for Min-Cut queries with near-linear preprocessing time and polylogarithmic (amortized) query time, and even if the queries are restricted to a fixed source.
That is, cut-equivalent trees are an essentially optimal solution for Min-Cut queries.
This equivalence holds even for every minor-closed family of graphs,
such as bounded-treewidth graphs, for which a two-decade old data structure
(Arikati, Chaudhuri, and Zaroliagis, J. Algorithms 1998)
implies the first near-linear time construction of cut-equivalent trees.
Moreover, unlike all previous techniques for constructing cut-equivalent trees, ours can rely on approximation algorithms.
In particular, using the almost-linear time algorithm
for $(1+\epsilon)$-approximate minimum $st$-cut
(Kelner, Lee, Orecchia, and Sidford, SODA 2014),
we can construct a $(1+\epsilon)$-approximate flow-equivalent tree
(which is a slightly weaker notion) in time $n^{2+o(1)}$.
This leads to the first $(1+\epsilon)$-approximation for All-Pairs Max-Flow
that runs in time $n^{2+o(1)}$, and matches the output size almost-optimally.
Coresets are modern data-reduction tools
that are widely used in data analysis to improve efficiency
in terms of running time, space and communication complexity.
Our main result is a fast algorithm to construct a small coreset
for \kMedian in (the shortest-path metric of) an excluded-minor graph.
Specifically, we give the first coreset of size that depends only on $k$, $\epsilon$ and the excluded-minor size,
and our running time is quasi-linear (in the size of the input graph).
The main innovation in our new algorithm is that it is iterative;
it first reduces the $n$ input points to roughly $O(\log n)$ reweighted points,
then to $O(\log\log n)$, and so forth until the size is independent of $n$.
Each step in this iterative size reduction is based on the
importance sampling framework of Feldman and Langberg (STOC 2011),
with a crucial adaptation that reduces the number of \emph{distinct points},
by employing a terminal embedding
(where low distortion is guaranteed only for the distance
from every terminal to all other points).
Our terminal embedding is technically involved
and relies on shortest-path separators,
a standard tool in planar and excluded-minor graphs.
Furthermore, our new algorithm is applicable also in Euclidean metrics,
by simply using a recent terminal embedding result of Narayanan and Nelson (STOC 2019), which extends the Johnson-Lindenstrauss Lemma.
We thus obtain an efficient coreset construction in high-dimensional Euclidean spaces, thereby matching and simplifying state-of-the-art results
(Sohler and Woodruff, FOCS 2018; Huang and Vishnoi, STOC 2020).
In addition, we employ terminal embeddings with additive distortion
to obtain small coresets in graphs with bounded highway dimension,
and we apply our coresets to obtain improved approximation schemes, e.g., an improved PTAS for planar \kMedian via a new centroid set.
We consider the rooted orienteering problem in Euclidean space:
Given $n$ points P in R^d, a root point $s\in P$ and a budget B>0,
find a path that starts from s, has total length at most B,
and visits as many points of P as possible.
This problem is known to be NP-hard,
hence we study (1-\delta)-approximation algorithms.
The previous Polynomial-Time Approximation Scheme (PTAS) for this problem,
due to Chen and Har-Peled (2008), runs in time $n^{O(d\sqrt{d}/\delta)}(\log n)^{(d/\delta)^{O(d)}}$,
and improving on this time bound was left as an open problem.
Our main contribution is a PTAS with a significantly improved time complexity
of $n^{O(1/\delta)}(\log n)^{(d/\delta)^{O(d)}}$.
A known technique for approximating the orienteering problem is to reduce it
to solving 1/\delta correlated instances of rooted k-TSP
(a k-TSP tour is one that visits at least $k$ points).
However, the k-TSP tours in this reduction must achieve a certain excess guarantee (namely, their length can surpass the optimum length only
in proportion to a parameter of the optimum called excess)
that is stronger than the usual (1+\delta)-approximation.
Our main technical contribution is to improve the running time of these
k-TSP variants, particularly their dependence on the dimension d.
Indeed, our running time is polynomial even for a moderately large dimension,
roughly up to d=O(\log\log n) instead of d=O(1).
The edit distance is a way of quantifying how similar two strings are to one another by counting the minimum number of character insertions, deletions, and substitutions required to transform one string into the other.
A simple dynamic program computes the edit distance between two strings of length n in O(n^2) time,
and a more sophisticated algorithm runs in time O(n+t^2)
when the edit distance is t [Landau, Myers and Schmidt, SICOMP 1998].
In pursuit of obtaining faster running time, the last couple of decades have seen a flurry of research on approximating edit distance,
including polylogarithmic approximation in near-linear time
[Andoni, Krauthgamer and Onak, FOCS 2010],
and a constant-factor approximation in subquadratic time
[Chakrabarty, Das, Goldenberg, Kouck\'y and Saks, FOCS 2018].
We study sublinear-time algorithms for small edit distance,
which was investigated extensively because of its numerous applications.
Our main result is an algorithm for distinguishing
whether the edit distance is at most t or at least t^2
(the quadratic gap problem) in time $\tilde{O}(\frac{n}{t}+t^3)$.
This time bound is sublinear roughly for all t in [\omega(1), o(n^{1/3})],
which was not known before.
The best previous algorithms solve this problem in sublinear time
only for t=\omega(n^{1/3}) [Andoni and Onak, STOC 2009].
Our algorithm is based on a new approach that adaptively switches
between uniform sampling and reading contiguous blocks of the input strings.
In contrast, all previous algorithms choose which coordinates to query non-adaptively.
Moreover, it can be extended to solve the t vs t^{2-\epsilon} gap problem
in time $\tilde{O}(\frac{n}{t^{1-\epsilon}}+t^3)$.
We investigate for which metric spaces the performance
of distance labeling and of $\ell_\infty$-embeddings differ,
and how significant this difference can be.
Recall that a distance labeling is
a distributed representation of distances in a metric space $(X,d)$,
where each point $x\in X$ is assigned a succinct label,
such that the distance between any two points $x,y \in X$
can be approximated given only their labels.
A highly structured special case is an embedding into $\ell_\infty$,
where each point $x\in X$ is assigned a vector $f(x)$
such that $\|f(x)-f(y)\|_\infty$ is approximately $d(x,y)$.
The performance of a distance labeling or an $\ell_\infty$-embedding
is measured via its distortion and its label-size/dimension.
We also study the analogous question
for the prioritized versions of these two measures.
Here, a priority order $\pi=(x_1,\dots,x_n)$ of the point set $X$ is given,
and higher-priority points should have shorter labels.
Formally, a distance labeling has prioritized label-size $\alpha(.)$
if every $x_j$ has label size at most $\alpha(j)$.
Similarly, an embedding $f: X \to \ell_\infty$
has prioritized dimension $\alpha(.)$
if $f(x_j)$ is non-zero only in the first $\alpha(j)$ coordinates.
In addition, we compare these prioritized measures
to their classical (worst-case) versions.
We answer these questions in several scenarios,
uncovering a surprisingly diverse range of behaviors.
First, in some cases labelings and embeddings have very similar worst-case performance, but in other cases there is a huge disparity.
However in the prioritized setting, we most often find
a strict separation between the performance of labelings and embeddings.
And finally, when comparing the classical and prioritized settings,
we find that the worst-case bound for label size often ``translates''
to a prioritized one, but we also find a surprising exception to this rule.
The spectrum of a matrix contains important structural information
about the underlying data, and hence there is considerable interest
in computing various functions of the matrix spectrum.
A fundamental example for such functions is the l_p-norm of the spectrum,
called the Schatten p-norm of the matrix.
Large matrices representing real-world data are often
sparse (most entries are zeros) or doubly sparse,
i.e., sparse in both rows and columns.
These large matrices are usually accessed as a stream of updates,
typically organized in row-order.
In this setting, where space (memory) is the limiting resource,
computing spectral functions is an expensive task and known algorithms require space that is polynomial
in the dimension of the matrix, even for sparse matrices.
Thus, it is highly desirable to design algorithms requiring
significantly smaller space.
We answer this challenge by providing the first algorithm
that uses space independent of the matrix dimension
to compute the Schatten p-norm of a doubly-sparse matrix
presented in row order.
Instead, our algorithm uses space polynomial in the sparsity parameter k and makes O(p) passes over the data stream. We further prove that multiple passes are unavoidable in this setting and show several extensions of our primary technique,
including stronger upper bounds for special matrix families,
algorithms for the more difficult turnstile model,
and a trade-off between space requirements and number of passes.
We initiate the study of coresets for clustering in graph metrics,
i.e., the shortest-path metric of edge-weighted graphs.
Such clustering problems are essential to data analysis
and used for example in road networks and data visualization.
A coreset is a compact summary of the data that approximately preserves
the clustering objective for every possible center set,
and it offers significant efficiency improvements
in terms of running time, storage, and communication,
including in streaming and distributed settings.
Our main result is a near-linear time construction
of a coreset for k-Median in a general graph G,
with size $O_{\epsilon, k}(tw(G))$
where $tw(G)$ is the treewidth of G,
and we complement the construction with a nearly-tight size lower bound.
The construction is based on the framework of Feldman and Langberg [STOC 2011],
and our main technical contribution, as required by this framework,
is a uniform bound of $O(tw(G))$ on the shattering dimension
under any point weights.
We validate our coreset on real-world road networks,
and our scalable algorithm constructs tiny coresets with high accuracy,
which translates to a massive speedup of
existing approximation algorithms such as local search for graph k-Median.
Orthogonal Matching Pursuit (OMP) is a popular algorithm to estimate an unknown sparse vector from multiple linear measurements of it. Assuming exact sparsity and that the measurements are corrupted by additive Gaussian noise, the success of OMP is often formulated as exactly recovering the support of the sparse vector. Several authors derived a sufficient condition for exact support recovery by OMP with high probability, in terms of the signal-to-noise ratio, defined as the magnitude of the smallest non-zero coefficient of the vector divided by the noise level. We make two contributions. First, we derive a slightly sharper sufficient condition for two variants of OMP, in which either the sparsity level or the noise level is known. Next, we show that this sharper sufficient condition is tight, in the following sense: for a wide range of problem parameters, there exists a dictionary of linear measurements and a sparse vector with a signal-to-noise ratio slightly below that of the sufficient condition, for which with high probability OMP fails to recover its support. Finally, we present simulations that illustrate that our condition is tight for a much broader range of dictionaries.
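For reference, a compact OMP (our sketch) showing the two variants analyzed above, which differ only in their stopping rule: known sparsity versus known noise level.

```python
import numpy as np

def omp(A, y, K=None, threshold=None):
    """Orthogonal Matching Pursuit: repeatedly add the column of A most
    correlated with the residual, then refit by least squares on the support.
    Stop after K iterations (known sparsity) or once every residual
    correlation is at most `threshold` (known noise level)."""
    assert K is not None or threshold is not None
    support, x_s, r = [], np.zeros(0), y.astype(float)
    while True:
        corr = np.abs(A.T @ r)
        if threshold is not None and corr.max() <= threshold:
            break
        support.append(int(np.argmax(corr)))
        x_s, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        r = y - A[:, support] @ x_s
        if K is not None and len(support) >= K:
            break
    x = np.zeros(A.shape[1])
    x[support] = x_s
    return x, support
```

Exact support recovery means that `support` equals the true support of the unknown sparse vector.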
We study algorithms for the sliding-window model, an important variant
of the data-stream model, in which the goal is to compute some function
of a fixed-length suffix of the stream. We extend the smooth-histogram
framework of Braverman and Ostrovsky (FOCS 2007) to almost-smooth
functions, a class that includes all subadditive functions. Specifically,
we show that if a subadditive function can be $(1+\epsilon)$-approximated
in the insertion-only streaming model, then it can be $(2+\epsilon)$-approximated
also in the sliding-window model with space complexity larger
by factor $O(\epsilon^{-1}\log w)$, where $w$ is the window size.
We demonstrate how our framework yields new approximation algorithms
with relatively little effort for a variety of problems that do not
admit the smooth-histogram technique. For example, in the frequency-vector
model, a symmetric norm is subadditive and thus we obtain a sliding-window
$(2+\epsilon)$-approximation algorithm for it. Another example
is for streaming matrices, where we derive a new sliding-window $(\sqrt{2}+\epsilon)$-approximation
algorithm for Schatten $4$-norm. We then consider graph streams and
show that many graph problems are subadditive, including maximum submodular
matching, minimum vertex-cover, and maximum $k$-cover, thereby deriving
sliding-window $O(1)$-approximation algorithms for them almost for
free (using known insertion-only algorithms). Finally, we design for
every $d\in (1,2]$ an artificial function, based on the
maximum-matching size, whose almost-smoothness parameter is exactly
$d$.
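To convey the mechanics of the histogram maintenance, here is an illustrative version (ours) that recomputes f on raw stored suffixes; a real algorithm would instead keep a small streaming sketch per suffix.

```python
def sliding_window_estimates(stream, w, f, beta):
    """Maintain start positions of suffixes of the stream; a suffix wedged
    between two suffixes with f-values within factor (1 - beta) is redundant
    and gets pruned, so few suffixes survive. After each arrival, yield the
    f-value of the oldest surviving suffix inside the window (the estimate)."""
    starts, items = [], []
    for t, x in enumerate(stream):
        items.append(x)
        starts.append(t)
        i = 0
        while i + 2 < len(starts):
            if f(items[starts[i + 2]:]) >= (1 - beta) * f(items[starts[i]:]):
                del starts[i + 1]          # middle suffix is redundant
            else:
                i += 1
        while len(starts) >= 2 and starts[1] <= t - w + 1:
            del starts[0]                  # expired and superseded
        s = next(s for s in starts if s >= t - w + 1)  # first start in window
        yield f(items[s:])
```

For example, f = sum over the window fits this scheme, and smoothness of f bounds the error of the yielded estimates.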
We design coresets for Ordered k-Median, a generalization of classical
clustering problems such as k-Median and k-Center, that offers a more flexible
data analysis, like easily combining multiple objectives (e.g., to increase
fairness or for Pareto optimization). Its objective function is defined via the
Ordered Weighted Averaging (OWA) paradigm of Yager (1988), where data points
are weighted according to a predefined weight vector, but in order of their contribution to the objective (distance from the centers).
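Concretely, the objective reads as follows (a reference implementation, ours), with a predefined non-increasing weight vector v holding one weight per point:

```python
import numpy as np

def ordered_kmedian_cost(P, C, v):
    """OWA objective: sort the points' distances to their nearest center in
    non-increasing order, then take the inner product with the weights v."""
    d = np.linalg.norm(P[:, None, :] - C[None, :, :], axis=2).min(axis=1)
    return float(np.sort(d)[::-1] @ np.asarray(v))

# v = (1,0,...,0) recovers k-Center, and v = (1,1,...,1) recovers k-Median.
```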
A powerful data-reduction technique, called a coreset, is to summarize a
point set X in R^d into a small (weighted) point set X', such that for every set of k potential centers, the objective value of the coreset X' approximates that of X within factor $1\pm \epsilon$. When there are
multiple objectives (weights), the above standard coreset might have limited
usefulness, whereas in a simultaneous coreset, which was introduced
recently by Bachem, Lucic, and Lattanzi (2018), the above approximation holds
for all weights (in addition to all centers). Our main result is a construction
of a simultaneous coreset of size $O_{\epsilon, d}(k^2 \log^2 |X|)$ for Ordered
k-Median.
To validate the efficacy of our coreset construction we ran experiments on a
real geographical data set. We find that our algorithm produces a small
coreset, which translates to a massive speedup of clustering computations,
while maintaining high accuracy for a range of weights.
We investigate the time-complexity of the All-Pairs-Max-Flow problem:
Given a graph with $n$ nodes and $m$ edges,
compute for all pairs of nodes the maximum-flow value between them.
If Max-Flow (the version with a given source-sink pair $s,t$) can be solved in time $T(m)$, then $O(n^2) \cdot T(m)$ is a trivial upper bound.
But can we do better?
For directed graphs, recent results in fine-grained complexity suggest that this time bound is essentially optimal.
In contrast, for undirected graphs with edge capacities,
a seminal algorithm of Gomory and Hu (1961) runs in much faster time
$O(n)\cdot T(m)$.
Under the plausible assumption that Max-Flow can be solved in near-linear time
$m^{1+o(1)}$, this half-century old algorithm yields an $nm^{1+o(1)}$ bound.
Several other algorithms have been designed through the years,
including $\tilde{O}(mn)$ time for unit-capacity edges (unconditionally),
but none of them break the $O(mn)$ barrier.
Meanwhile, no super-linear lower bound was shown for undirected graphs.
We design the first hardness reductions for All-Pairs-Max-Flow in undirected graphs,
giving an essentially optimal lower bound for the node-capacities
setting.
For edge capacities, our efforts to prove similar lower bounds have failed,
but we have discovered a surprising new algorithm
that breaks the $O(mn)$ barrier for graphs with unit-capacity edges!
Assuming $T(m)=m^{1+o(1)}$, our algorithm runs in time $m^{3/2 +o(1)}$ and outputs a cut-equivalent tree (similarly to the Gomory-Hu algorithm).
Even with current Max-Flow algorithms we improve state-of-the-art
as long as $m=O(n^{5/3-\varepsilon})$.
Finally, we explain the lack of lower bounds by proving a non-reducibility result.
This result is based on a new quasi-linear time $\tilde{O}(m)$ non-deterministic algorithm for constructing a cut-equivalent tree and may be of independent interest.
Most known algorithms in the streaming model of computation aim to
approximate a single function such as an $\ell_p$ norm.
In 2009, Nelson [https://sublinear.info, Open Problem 30] asked
if it is possible to design universal algorithms, that
simultaneously approximate multiple functions of the stream. In this
paper we answer the question of Nelson for the class of
subset-$\ell_0$ norms in the insertion-only frequency-vector model.
Given a family of subsets, $\calS\subset 2^{[n]}$, we provide a single
streaming algorithm that can $(1\pm \epsilon)$-approximate the
subset-$\ell_0$ norm for every $S\in\calS$. Here, the subset-$\ell_p$
norm of $v\in \R^n$ with respect to the set $S\subseteq [n]$ is the
$\ell_p$ norm of $v_{|S}$ (the vector $v$ restricted to $S$ by zeroing
all other coordinates).
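In code, the definition reads as follows (a small illustrative helper of ours):

    import numpy as np

    def subset_lp_norm(v, S, p):
        # l_p norm of v restricted to the coordinate set S (all other
        # coordinates are zeroed out); p = 0 counts the nonzero entries in S
        r = np.zeros(len(v))
        idx = list(S)
        r[idx] = np.asarray(v, dtype=float)[idx]
        if p == 0:
            return np.count_nonzero(r)
        return np.sum(np.abs(r) ** p) ** (1.0 / p)

    print(subset_lp_norm([3, -1, 2, 0], {0, 2}, 1))  # |3| + |2| = 5.0
    print(subset_lp_norm([3, -1, 2, 0], {0, 3}, 0))  # only coordinate 0 is nonzero: 1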
Our main result is a nearly tight characterization of the space
complexity of the subset-$\ell_0$ norm for every family $\calS\subset
2^{[n]}$ in insertion-only streams, expressed in terms of the
"heavy-hitter dimension" of $\calS$, a new combinatorial quantity related to
the VC-dimension of $\calS$. We also show that the more general
turnstile and sliding-window models require a much larger space usage.
All these results easily extend to the $\ell_1$ norm.
In addition, we design algorithms for two other subset-$\ell_p$
variants. These can be compared to the famous Priority Sampling
algorithm of Duffield, Lund and Thorup [JACM 2007], which achieves
additive approximation $\epsilon\|v\|_1$ for all possible subsets
($\calS=2^{[n]}$) in the entrywise update model. One of our algorithms
extends their algorithm to handle turnstile updates, and another one
achieves multiplicative approximation, given a family $\calS$.
We study sublinear algorithms that solve linear systems locally.
In the classical version of this problem the input is a
matrix $S\in \R^{n\times n}$ and a vector $b\in\R^n$ in the range of
$S$, and the goal is to output $x\in \R^n$ satisfying $Sx=b$.
For the case when the matrix $S$ is symmetric diagonally dominant (SDD),
the breakthrough algorithm of Spielman and Teng [STOC 2004] approximately
solves this problem in near-linear time (in the input size which is
the number of non-zeros in $S$), and subsequent papers have further
simplified, improved, and generalized the algorithms for this setting.
Here we focus on computing one (or a few) coordinates of $x$,
which potentially allows for sublinear algorithms. Formally, given an
index $u\in [n]$ together with $S$ and $b$ as above, the goal is to
output an approximation $\hat{x}_u$ for $x^*_u$, where $x^*$ is a
fixed solution to $Sx=b$.
Our results show that there is a qualitative gap between SDD matrices
and the more general class of positive semidefinite (PSD) matrices.
For SDD matrices, we develop an algorithm that
approximates a single coordinate $x_{u}$ in time that is
polylogarithmic in $n$, provided that $S$ is sparse and has a small
condition number (e.g., Laplacian of an expander graph). The
approximation guarantee is additive $| \hat{x}_u-x^*_u | \le \epsilon
\| x^* \|_\infty$ for accuracy parameter $\epsilon>0$. We further
prove that the condition-number assumption is necessary and tight.
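For intuition only, here is a toy local solver of ours (not the paper's algorithm): a truncated Jacobi iteration evaluated lazily, so that estimating $x_u$ after $T$ rounds touches only the $T$-hop neighborhood of $u$; for a sparse, strictly diagonally dominant and well-conditioned $S$, a small $T$ already gives a good additive approximation.

    from functools import lru_cache

    def local_jacobi_estimate(S, b, u, T):
        # S is a sparse matrix given as {i: {j: S_ij}} with nonzero diagonal.
        # Jacobi iteration: x_i^{(t+1)} = (b_i - sum_{j!=i} S_ij x_j^{(t)}) / S_ii,
        # started from x^{(0)} = 0 and unrolled only along the coordinates
        # that coordinate u actually depends on (its T-hop neighborhood).
        @lru_cache(maxsize=None)
        def x(i, t):
            if t == 0:
                return 0.0
            off = sum(S_ij * x(j, t - 1) for j, S_ij in S[i].items() if j != i)
            return (b[i] - off) / S[i][i]
        return x(u, T)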
In contrast to the SDD matrices, we prove that for certain PSD
matrices $S$, the running time must be at least polynomial in $n$.
This holds even when one wants to obtain the same additive approximation,
and $S$ has bounded sparsity and condition number.
We reprove three known (algorithmic) bounds for terminal-clustering problems,
using a single framework that leads to simpler proofs.
The input in terminal-clustering problems is a metric space (X,d)
(possibly arising from a graph) and a subset $K\subset X$ of terminals,
and the goal is to partition the points of X,
such that each part, called cluster, contains exactly one terminal
(possibly with connectivity requirements), so as to minimize some objective.
The three bounds we reprove are for
Steiner Point Removal on trees [Gupta, SODA 2001],
Metric 0-Extension for bounded doubling dimension [Lee and Naor, unpublished 2003],
and Connected Metric 0-Extension [Englert et al., SICOMP 2014].
A natural approach is to cluster each point with its closest terminal,
which would partition X into so-called Voronoi cells,
but this approach can fail miserably due to its stringent cluster boundaries.
A now-standard fix is to enlarge each Voronoi cell
computed in some order to obtain disjoint clusters,
which defines the Noisy-Voronoi algorithm.
This method, first proposed by Calinescu, Karloff and Rabani [SICOMP 2004],
was employed successfully to provide state-of-the-art results
for terminal-clustering problems on general metrics.
However, for restricted families of metrics, e.g., trees and doubling metrics,
only more complicated, ad-hoc algorithms are known.
Our main contribution is to demonstrate that the Noisy-Voronoi algorithm
is applicable also to these restricted families of metrics,
and actually leads to relatively simple algorithms and analyses.
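To illustrate (this is our paraphrase, not the paper's pseudocode), the Noisy-Voronoi scheme can be sketched as follows; the choice of the processing order and of the distribution of the enlargement factor $R$ is exactly where the three analyses differ:

    import random

    def noisy_voronoi(points, terminals, dist, delta=0.5):
        # each terminal, in turn, claims every still-unassigned point whose
        # distance to it is within a random enlargement factor R of that
        # point's distance to its own nearest terminal
        nearest = {p: min(dist(p, t) for t in terminals) for p in points}
        cluster = {}
        for t in terminals:
            R = 1 + delta * random.random()
            for p in points:
                if p not in cluster and dist(p, t) <= R * nearest[p]:
                    cluster[p] = t
        # every point is claimed by the time its own nearest terminal is
        # processed, since then dist(p, t) = nearest[p] <= R * nearest[p]
        return cluster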
The relationship between the
sparsest cut and the maximum concurrent multi-flow
in graphs has been studied extensively.
For general graphs, the worst-case gap between
these two quantities is now settled: When there are $k$ terminal pairs,
the flow-cut gap is O(log k), and this is tight.
But when topological restrictions are placed on the flow
network, the situation is far less clear.
In particular, it has been conjectured that
the flow-cut gap in planar networks is O(1), while
the known bounds place the gap somewhere between 2 (Lee and Raghavendra, 2003) and $O(\sqrt{\log k})$ (Rao, 1999).
A seminal result of Okamura and Seymour (1981) shows that when all the terminals
of a planar network lie on a single face, the flow-cut gap is exactly $1$.
This setting can be generalized by considering planar networks where
the terminals lie on one of $\gamma > 1$ faces in some fixed planar drawing.
Lee and Sidiropoulos (2009) proved that the flow-cut gap is bounded
by a function of $\gamma$, and Chekuri, Shepherd, and Weibel (2013) showed
that the gap is at most $3 \gamma$.
We significantly improve
these asymptotics by establishing that the flow-cut gap is $O(\log \gamma)$.
This is achieved by showing that the edge-weighted shortest-path metric induced
on the terminals admits a stochastic embedding into trees with distortion $O(\log \gamma)$.
The latter result is tight, e.g., for a square planar lattice on $\Theta(\gamma)$ vertices.
The preceding results refer to the setting of edge-capacitated networks.
For vertex-capacitated networks, it can be significantly more challenging
to control flow-cut gaps.
While there is no exact vertex-capacitated version of the Okamura-Seymour Theorem,
an approximate version holds;
Lee, Mendel, and Moharrami (2015) showed that
the vertex-capacitated flow-cut gap is O(1) on planar networks whose
terminals lie on a single face.
We prove that the flow-cut gap is $O(\gamma)$ for vertex-capacitated
instances when the
terminals lie on at most $\gamma$ faces.
In fact, this result holds in the more
general setting of submodular vertex capacities.
We introduce a batch version of sparse recovery,
where the goal is to report a sequence of vectors $A_1',\ldots,A_m' \in R^n$
that estimate unknown signals $A_1,\ldots,A_m \in R^n$
using a few linear measurements, each involving exactly one signal vector,
under an assumption of average sparsity.
More precisely, we want to have
$$
\sum_{j \in [m]}{||A_j- A_j'||_p^p}
\le
C \cdot \min \{ \sum_{j \in [m]}{||A_j - A_j^*||_p^p} \}
$$
for predetermined constants $C \ge 1$ and $p$,
where the minimum is over all $A_1^*,\ldots,A_m^*\in R^n$
that are $k$-sparse on average.
We assume $k$ is given as input, and ask for the minimal number
of measurements required to satisfy the inequality above.
The special case $m=1$ is known as stable sparse recovery and has been studied extensively.
We resolve the question for $p=1$ up to polylogarithmic factors,
by presenting a randomized adaptive scheme
that performs $\tilde{O}(km)$ measurements
and with high probability has output satisfying the above inequality, for arbitrarily small $C > 1$.
Finally, we show that adaptivity is necessary for every non-trivial scheme.
Given a directed graph, the vertex connectivity from $u$ to $v$ is the maximum number of internally vertex-disjoint paths from $u$ to $v$.
We design faster algorithms that, given as input a directed graph $G$ with unit node capacities and a threshold $k$, report for all vertex pairs $(s,t)$ the size of a minimum $st$-vertex cut (equivalently, maximum $st$-flow or vertex connectivity)
if it is less than $k$, or report that it is at least $k$ otherwise.
We abbreviate this problem kAPMVC, and the more general version with unit edge capacities as kAPMC.
First, we present a randomized algorithm for kAPMVC that runs in time $O((nk)^{\omega})$, where $\omega$ is the fast matrix multiplication exponent.
This result stems from an application of the network coding method by Cheung, Lau, and Leung [SICOMP 2013] to vertex-capacitated digraphs.
Second, we present two deterministic algorithms for DAGs for the harder kAPMC and where we also want to compute min-cut witnesses.
The first algorithm is combinatorial (it does not involve matrix multiplication)
and runs in time $O(2^{O(k^2)} \cdot mn)$.
The second algorithm is faster on dense DAGs and runs in time $O((k\log n)^{4^k+o(k)}\cdot n^{\omega})$.
Notice that a solution even to kAPMVC, for any $k\geq 1$,
implies a solution to triangle finding and to transitive closure;
thus, our bounds for $k=o(\sqrt{\log n})$ and for $k=o(\log\log n)$
are tight up to subpolynomial factors in $n$,
where the former applies to combinatorial algorithms [Abboud and Williams, FOCS 2014].
To obtain those results, we exploit new insights on the structure of $k$-cuts and we apply
suitable algebraic tools.
Our final set of results rules out the possibility that kAPMVC can be solved as efficiently as a transitive closure computation for all $k$.
We design a novel reduction showing a lower bound of $n^{\omega-1-o(1)} k^2$ for kAPMVC assuming that $4$-Clique requires $n^{\omega+1-o(1)}$ time.
For combinatorial algorithms, our reduction implies an $n^{2-o(1)} k^2$ conditional lower bound.
These lower bounds are significantly higher than previously known ones (under SETH) even for the general case of $k=n$ and for the All-Pairs Max-Flow problem.
We study the problem of a decision-maker having to select one of many competing alternatives (e.g., choosing between projects, designs, or suppliers) whose future revenues are a priori unknown and modeled as random variables of known probability distributions. The decision-maker can pay to test each alternative to reveal its specific revenue realization (e.g., by conducting market research), and her goal is to maximize the expected revenue of the selected alternative minus the testing costs. This model captures an interesting trade-off between gaining revenue of a high-yield alternative and spending resources to reduce the uncertainty in selecting it. The combinatorial nature of the problem leads to a dynamic programming (DP) formulation with high-dimensional state space that is computationally intractable. By characterizing the structure of the optimal policy, we derive efficient optimal and near-optimal policies that are simple and easy-to-compute. In fact, these policies are also myopic -- they only consider a limited horizon of one test. Moreover, our policies can be described using intuitive `testing intervals' around the expected revenue of each alternative, and in many cases, the dynamics of an optimal policy can be explained by the interaction between the testing intervals of various alternatives.
In the Set Cover problem, the input is a ground set of n elements and a collection of m sets, and the goal is to find the smallest sub-collection of sets whose union is the entire ground set.
The fastest algorithm known runs in time $O(mn 2^n)$ [Fomin et al., WG 2004], and the Set Cover Conjecture (SeCoCo) [Cygan et al., TALG 2016]
asserts that for every fixed $\epsilon>0$,
no algorithm can solve Set Cover in time $2^{(1-\epsilon)n} \poly(m)$,
even if set sizes are bounded by $\Delta=\Delta(\epsilon)$.
We show strong connections between this problem and kTree, a special case of Subgraph Isomorphism where
the input is an n-node graph G and a k-node tree $T$,
and the goal is to determine whether G has a subgraph isomorphic to T.
First, we propose a weaker conjecture Log-SeCoCo,
that allows input sets of size $\Delta=O(1/\epsilon \cdot\log n)$,
and show that an algorithm breaking Log-SeCoCo would imply a faster algorithm
than the currently known $2^n \poly(n)$-time algorithm [Koutis and Williams, TALG 2016] for Directed nTree, which is kTree with k=n and arbitrary directions to the edges of G and T. This would also improve the running time for Directed Hamiltonicity, for which no algorithm significantly faster than $2^n \poly(n)$ is known despite extensive research.
Second, we prove that
if p-Partial Cover, a parameterized version of Set Cover that requires covering at least p elements, cannot be solved significantly faster than $2^{n} \poly(m)$
(an assumption even weaker than Log-SeCoCo)
then kTree cannot be computed significantly faster than $2^{k} \poly(n)$,
the running time of Koutis and Williams' algorithm.
We study the following version of cut sparsification.
Given a large edge-weighted network $G$ with $k$ terminal vertices,
compress it into a small network $H$ with the same terminals,
such that the minimum cut in $H$ between every bipartition of the terminals
approximates the corresponding one in $G$ within factor $q\geq 1$,
called the quality.
We provide two new insights about the structure of cut sparsifiers,
and then apply them to obtain improved cut sparsifiers
(and data structures) for planar graphs.
Our first main contribution identifies a subset of the minimum terminal cuts
that generates all the other ones.
Restricting attention to these cuts, which we call elementary,
is effective in reducing the number of requirements from the sparsifier $H$.
Our applications of this result are new cut sparsifiers of quality $q=1$
(also called mimicking networks):
(1) for planar networks $G$, we design a cut sparsifier of size $O(k 2^{2k})$,
slightly improving the known bound from [Krauthgamer and Rika, SODA 2013];
and
(2) for planar networks $G$ whose terminals are incident to at most
$\gamma=\gamma(G)$ faces, we design a cut sparsifier of size
$O(\gamma^5 2^{2\gamma} k^4)$, which was previously unknown.
Our second main contribution is a duality between minor cut sparsification
and minor distance sparsification
(i.e., when the sparsifier $H$ is required to be a minor of $G$),
applicable to planar networks $G$ with all their terminals on the same face.
This duality connects problems that were previously studied separately,
implying new results, new proofs of known results,
and equivalences between open gaps.
All our cut sparsifiers are minors of the input network $G$.
We provide evidence that computing the maximum flow value between
every pair of nodes in a directed graph on n nodes, m edges,
and capacities in the range [1..n], which we call the All-Pairs Max-Flow problem,
cannot be solved in time that is significantly faster
(i.e., by a polynomial factor) than O(n^3) even for sparse graphs, namely m=O(n); thus for general m, it cannot be solved significantly faster than O(n^2 m).
Since a single maximum st-flow can be solved
in time $\tilde O(m\sqrt{n})$ [Lee and Sidford, FOCS 2014],
we conclude that the all-pairs version might require time equivalent to $\tilde\Omega(n^{3/2})$ computations of maximum st-flow,
which strongly separates the directed case from the undirected one.
Moreover, if maximum st-flow can be solved in time $\tilde O(m)$,
then the runtime of $\tilde\Omega(n^2)$ computations is needed. This is in contrast to a conjecture of Lacki, Nussbaum, Sankowski, and Wulff-Nilsen [FOCS 2012] that All-Pairs Max-Flow in general graphs can be solved faster than the time of O(n^2) computations of maximum st-flow.
Specifically, we show that in sparse graphs G=(V,E,w), if one can compute the maximum st-flow from every s in an input set of sources $S\subseteq V$
to every t in an input set of sinks $T\subseteq V$ in time $O((|S| |T| m)^{1-\varepsilon})$,
for some $|S|$, $|T|$ and a constant $\varepsilon>0$,
then MAX-CNF-SAT (maximum satisfiability of conjunctive normal form formulas)
with n' variables and m' clauses can be solved in time ${m'}^{O(1)}2^{(1-\delta)n'}$ for a constant $\delta(\varepsilon)>0$,
a problem for which not even $2^{n'}/\poly(n')$ algorithms are known. Such running time for MAX-CNF-SAT would in particular refute the Strong Exponential Time Hypothesis (SETH). Hence, we improve the lower bound of Abboud, Vassilevska-Williams, and Yu [STOC 2015], who showed that for every fixed $\varepsilon>0$ and $|S|=|T|=O(\sqrt{n})$, if the above problem can be solved in time $O(n^{3/2-\varepsilon})$, then some incomparable (and intuitively weaker) conjecture is false. Furthermore, a larger lower bound than ours implies strictly super-linear time
for maximum st-flow problem, which would be an amazing breakthrough.
In addition, we show that All-Pairs Max-Flow in uncapacitated networks with any edge density $m=m(n)$
cannot be computed in time significantly faster than $O(mn)$, even for acyclic networks.
The gap to the fastest known algorithm by Cheung, Lau, and Leung [FOCS 2011]
is a factor of $O(m^{\omega-1}/n)$, and for acyclic networks it is $O(n^{\omega-1})$,
where $\omega$ is the matrix multiplication exponent.
Finally, we extend our lower bounds to the version that asks only for
the maximum-flow values below a given threshold
(over all source-sink pairs).
A central problem in data streams is to characterize which functions of an underlying frequency vector can be approximated efficiently. Recently there has been considerable effort in extending this problem to that of estimating functions of a matrix that is presented as a data-stream. This setting generalizes classical problems to the analogous ones for matrices. For example, instead of estimating frequent-item counts, we now wish to estimate "frequent-direction" counts. A related example is to estimate norms, which now correspond to estimating a vector norm on the singular values of the matrix. Despite recent efforts, the current understanding for such matrix problems is considerably weaker than that for vector problems.
We study a number of aspects of estimating matrix norms in a stream that have not previously been considered: (1) multi-pass algorithms, (2) algorithms that see the underlying matrix one row at a time, and (3) time-efficient algorithms. Our multi-pass and row-order algorithms use less memory than what is provably required in the single-pass and entrywise-update models, and thus give separations between these models (in terms of memory). Moreover, all of our algorithms are considerably faster than previous ones. We also prove a number of lower bounds, and obtain for instance, a near-complete characterization of the memory required of row-order algorithms for estimating Schatten p-norms of sparse matrices.
In the snippets problem we are interested in preprocessing a text T
so that given two pattern queries P1 and P2,
one can quickly locate the occurrences of the patterns in T that
are the closest to each other.
A closely related problem is that of constructing a color-distance oracle,
where the goal is to preprocess a set of points from some metric space,
in which every point is associated with a set of colors,
so that given two colors one can quickly locate two points associated with
those colors, that are as close as possible to each other.
We introduce efficient data structures for both color-distance oracles
and the snippets problem.
Moreover, we prove conditional lower bounds for these problems from
both the 3SUM conjecture and the Combinatorial Boolean Matrix Multiplication
conjecture.
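For intuition, the per-query baseline that these data structures compete with is a simple two-pointer merge over the (sorted) occurrence lists, which costs time linear in the lists' lengths; illustrative code of ours:

    def closest_occurrences(pos1, pos2):
        # given sorted occurrence positions of two patterns, return a pair
        # (one from each list) minimizing their distance, in O(|pos1|+|pos2|)
        i = j = 0
        best = None
        while i < len(pos1) and j < len(pos2):
            a, b = pos1[i], pos2[j]
            if best is None or abs(a - b) < abs(best[0] - best[1]):
                best = (a, b)
            if a < b:
                i += 1
            else:
                j += 1
        return best

    print(closest_occurrences([2, 19, 40], [11, 25, 90]))  # (19, 25)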
We characterize the streaming space complexity of every symmetric norm l
(a norm on R^n invariant under sign-flips and coordinate-permutations),
by relating this space complexity to the measure-concentration characteristics of l.
Specifically, we provide matching upper and lower bounds (up to polylog(n)
factors) on the space complexity of approximating the norm of the stream,
where both bounds depend on the median of l(x),
when x is drawn uniformly from the l2 unit sphere.
The same quantity governs many phenomena in high-dimensional spaces,
such as large-deviation bounds and the critical dimension in Dvoretzky's Theorem.
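The governing quantity is easy to estimate empirically; a minimal Monte-Carlo sketch of ours for the median of $l(x)$ over the unit sphere:

    import numpy as np

    def median_on_sphere(norm, n, trials=2000, seed=0):
        # sample uniformly from the l2 unit sphere in R^n by normalizing
        # standard Gaussians, and return the empirical median of norm(x)
        rng = np.random.default_rng(seed)
        g = rng.standard_normal((trials, n))
        x = g / np.linalg.norm(g, axis=1, keepdims=True)
        return float(np.median([norm(row) for row in x]))

    n = 1000
    print(median_on_sphere(lambda v: np.sum(np.abs(v)), n))  # l_1: grows like sqrt(n)
    print(median_on_sphere(lambda v: np.max(np.abs(v)), n))  # l_inf: ~ sqrt(2 ln n / n)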
The family of symmetric norms contains several well-studied norms,
such as all l_p norms, and indeed we provide a new explanation
for the disparity in space complexity between p <= 2 and p>2.
In addition, we apply our general results to easily derive bounds
for several norms that were not studied before in the streaming model,
including for example the top-$k$ norm and the $k$-support norm,
which was recently shown to be effective for machine learning tasks.
Overall, these results make progress on two outstanding problems
in the area of sublinear algorithms (Problems 5 and 30 on sublinear.info).
By a classical result of Gomory and Hu (1961),
in every edge-weighted graph G=(V,E,w),
the minimum st-cut values, when ranging over all $s,t\in V$,
take at most |V|-1 distinct values.
That is, these $\binom{|V|}{2}$ instances exhibit
redundancy factor $\Omega(|V|)$.
They further showed how to construct from G
a tree (V,E',w') that stores all minimum st-cut values.
Motivated by this result, we obtain tight bounds for the redundancy factor of several generalizations of the minimum st-cut problem.
Update: The latest version (on arXiv) contains additional references to previous work (which have some overlap with our results).
A valued constraint satisfaction problem (VCSP) instance $(V,\Pi,w)$
is a set of variables $V$ with a set of constraints $\Pi$ weighted by $w$.
Given a VCSP instance, we are interested in a re-weighted sub-instance
$(V,\Pi'\subset \Pi,w')$ that preserves the value of the given instance
(under every assignment to the variables) within factor $1\pm\epsilon$.
A well-studied special case is cut sparsification in graphs,
which has found various applications.
We show that a VCSP instance consisting of a single boolean predicate $P(x,y)$ (e.g., for cut, $P=XOR$) can be sparsified into $O(|V|/\epsilon^2)$ constraints if and only if the number of inputs that satisfy $P$ is anything but one (i.e., $|P^{-1}(1)| \neq 1$).
Furthermore, this sparsity bound is tight unless $P$ is a relatively trivial predicate.
We conclude that also systems of 2SAT (or 2LIN) constraints can be sparsified.
We study the problem of reconstructing a low-rank matrix,
where the input is an $n\times m$ matrix $M$ over a field F
and the goal is to reconstruct a (near-optimal) matrix $M'$
that is low-rank and close to $M$ under some distance function $\Delta$.
Furthermore, the reconstruction must be local, i.e., provides access to
any desired entry of $M'$ by reading only a few entries of the input $M$
(ideally, independent of the matrix dimensions $n$ and $m$).
Our formulation of this problem is inspired by the local reconstruction framework of Saks and Seshadhri (SICOMP, 2010).
Our main result is a local reconstruction algorithm for the case
where $\Delta$ is the normalized Hamming distance (between matrices).
Given $M$ that is $\epsilon$-close to a matrix of rank $d<1/\epsilon$
(together with $d$ and $\epsilon$),
this algorithm computes with high probability a rank-$d$ matrix $M'$
that is $O(\sqrt{d\epsilon})$-close to $M$.
This is a local algorithm that proceeds in two phases.
The preprocessing phase
reads only $\tilde O(\sqrt{d/\epsilon^3})$ random entries of $M$,
and stores a small data structure.
The query phase deterministically outputs a desired entry $M'_{i,j}$
by reading only the data structure and $2d$ additional entries of $M$.
We also consider local reconstruction in an easier setting,
where the algorithm can read an entire matrix column in a single operation.
When $\Delta$ is the normalized Hamming distance between vectors,
we derive an algorithm that runs in polynomial time
by applying our main result for matrix reconstruction.
For comparison, when $\Delta$ is the truncated Euclidean distance and F=R,
we analyze sampling algorithms by using statistical learning tools.
We study resistance sparsification of graphs, in which the goal is to
find a sparse subgraph (with reweighted edges) that approximately
preserves the effective resistances between every pair of nodes.
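To fix notation (our illustration): effective resistances can be read off the pseudoinverse of the graph Laplacian, and a $(1+\epsilon)$-resistance sparsifier must preserve every entry of this matrix within factor $1\pm\epsilon$:

    import numpy as np

    def effective_resistances(L):
        # R_eff(u,v) = (e_u - e_v)^T L^+ (e_u - e_v), where L is the graph
        # Laplacian and L^+ its Moore-Penrose pseudoinverse
        Lp = np.linalg.pinv(L)
        d = np.diag(Lp)
        return d[:, None] + d[None, :] - 2 * Lp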
We show that every dense regular expander admits a
$(1+\epsilon)$-resistance sparsifier of size $\tilde O(n/\epsilon)$,
and conjecture this bound holds for all graphs on $n$ nodes.
In comparison, spectral sparsification is a strictly stronger notion and
requires $\Omega(n/\epsilon^2)$ edges even on the complete graph.
Our approach leads to the following structural question on graphs:
Does every dense regular expander contain a sparse regular expander as
a subgraph? Our main technical contribution, which may be of independent
interest, is a positive answer to this question in a certain setting
of parameters.
Combining this with a recent result of von Luxburg,
Radl, and Hein~(JMLR, 2014) leads to the aforementioned resistance
sparsifiers.
We investigate the problem of approximate Nearest-Neighbor Search (NNS)
in graphical metrics:
The task is to preprocess an edge-weighted graph $G=(V,E)$ on $m$ vertices
and a small ``dataset'' $D\subset V$ of size $n \ll m$,
so that given a query point $q\in V$, one can quickly approximate
$dist(q,D)$ (the distance from $q$ to its closest vertex in $D$)
and find a vertex $a\in D$ within this approximated distance.
We assume the query algorithm has access to a distance oracle,
that quickly evaluates the exact distance between any pair of vertices.
For planar graphs $G$ with maximum degree $\Delta$,
we show how to efficiently construct a compact data structure
-- of size $\tilde{O}(n(\Delta+1/\epsilon))$ --
that answers $(1+\epsilon)$-NNS queries in time $\tilde{O}(\Delta+1/\epsilon)$.
Thus, as far as NNS applications are concerned,
metrics derived from bounded-degree planar graphs
behave as low-dimensional metrics,
even though planar metrics do not necessarily have a low doubling dimension,
nor can they be embedded with low distortion into $\ell_2$.
We complement our algorithmic result by lower bounds showing that
the access to an exact distance oracle (rather than an approximate one)
and the dependency on $\Delta$ (in query time) are both essential.
A prominent tool in many problems involving metric spaces is
a notion of randomized low-diameter decomposition.
Loosely speaking, $\beta$-decomposition refers to a probability distribution
over partitions of the metric into sets of low diameter,
such that nearby points (parameterized by $\beta>0$)
are likely to be ``clustered'' together.
Applying this notion to the shortest-path metric in edge-weighted graphs,
it is known that $n$-vertex graphs admit
an $O(\ln n)$-padded decomposition \cite{Bartal96},
and that excluded-minor graphs admit $O(1)$-padded decomposition
[KPR93,FT03,AGGNT14].
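For concreteness, one standard scheme achieving an $O(\ln n)$ separating guarantee is random-radius, random-order ball carving; the following is an illustrative sketch of ours, not the paper's construction:

    import random

    def random_decomposition(points, dist, Delta):
        # pick a random radius r in [Delta/4, Delta/2] and a random order of
        # the points; each point joins the cluster of the first point in the
        # order within distance r of it, so clusters have diameter <= Delta,
        # and two points x,y end up separated with probability
        # O(log n) * dist(x,y) / Delta
        r = random.uniform(Delta / 4, Delta / 2)
        order = list(points)
        random.shuffle(order)
        cluster = {}
        for c in order:
            for p in points:
                if p not in cluster and dist(p, c) <= r:
                    cluster[p] = c
        return cluster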
We design decompositions for the family of $p$-path-separable graphs,
which was defined by Abraham and Gavoille [AG06]
and refers to graphs that admit vertex-separators consisting
of at most $p$ shortest paths in the graph.
Our main result is that every $p$-path-separable $n$-vertex graph
admits an $O(\ln (p \ln n))$-decomposition,
which refines the $O(\ln n)$ bound for general graphs,
and provides new bounds for families like bounded-treewidth graphs.
Technically, our clustering process differs from previous ones
by working in (the shortest-path metric of) carefully chosen subgraphs.
An outstanding open question [sublinear.info, Question #5] asks
to characterize metric spaces in which distances can be estimated using efficient sketches.
Specifically, we say that a sketching algorithm is efficient
if it achieves constant approximation using constant sketch size.
A well-known result of Indyk (J. ACM, 2006) implies that a metric
that admits a constant-distortion embedding into $\ell_p$ for $p\in(0,2]$
also admits an efficient sketching scheme.
But is the converse true, i.e., is embedding into $\ell_p$
the only way to achieve efficient sketching?
We address these questions for the important special case of normed spaces,
by providing an almost complete characterization of sketching
in terms of embeddings.
In particular, we prove that a finite-dimensional
normed space allows efficient sketches
if and only if it embeds (linearly) into $\ell_{1-\eps}$ with constant
distortion.
We further prove that for norms that are closed under sum-product,
efficient sketching is equivalent to constant-distortion embedding into $\ell_1$.
Examples of such norms include the Earth Mover's Distance
(specifically its norm variant, called Kantorovich-Rubinstein norm),
and the trace norm (a.k.a. Schatten 1-norm or the nuclear norm).
Using known non-embeddability theorems for these norms by
Naor and Schechtman (SICOMP, 2007) and by Pisier (Compositio. Math., 1978),
we then conclude that these spaces do not admit efficient sketches either,
making progress towards answering another open question [sublinear.info, Question #7].
Finally, we observe that resolving whether
``sketching is equivalent to embedding into $\ell_1$ for general norms''
(i.e., without the above restriction)
is equivalent to resolving a well-known open problem in Functional Analysis posed by Kwapien in 1969.
Sketching and streaming algorithms are in the forefront of current
research directions for cut problems in graphs.
In the streaming model, we show that $(1-\epsilon)$-approximation
for Max-Cut must use $n^{1-O(\epsilon)}$ space; moreover,
beating $4/5$-approximation requires polynomial space.
For the sketching model, we show that $r$-uniform hypergraphs
admit a $(1+\epsilon)$-cut-sparsifier (i.e., a weighted subhypergraph
that approximately preserves all the cuts)
with $O(\epsilon^{-2}n(r+\log n))$ edges.
We also make first steps towards sketching general CSPs
(Constraint Satisfaction Problems).
We introduce the st-cut version of the Sparsest-Cut problem,
where the goal is to find a cut of minimum sparsity in a graph $G(V,E)$
among those separating two distinguished vertices $s,t\in V$.
Clearly, this problem is at least as hard as the usual (non-st) version.
Our main result is a polynomial-time algorithm for the product-demands setting,
that produces a cut of sparsity $O(\sqrt{OPT})$,
where OPT denotes the optimum,
and the total edge capacity and the total demand are assumed (by normalization)
to be 1.
Our result generalizes the recent work of Trevisan [arXiv, 2013] for the
non-st version of the same problem (Sparsest-Cut with product demands),
which in turn generalizes the bound achieved by the discrete Cheeger inequality,
a cornerstone of Spectral Graph Theory that has numerous applications.
Indeed, Cheeger's inequality handles graph conductance, the special case
of product demands that are proportional to the vertex (capacitated) degrees.
Along the way, we obtain an $O(\log |V|)$-approximation
for the general-demands setting of Sparsest st-Cut.
We study spectral algorithms for the high-dimensional Nearest Neighbor
Search problem (NNS). In particular, we consider a semi-random
setting where a dataset is chosen arbitrarily from an unknown subspace
of low dimension, and then perturbed by full-dimensional Gaussian
noise. We design spectral NNS algorithms whose query time depends
polynomially on the dimension and logarithmically on the size of the
point set. These spectral algorithms use a repeated computation of
the top PCA vector/subspace, and are effective even when the
random-noise magnitude is much larger than the interpoint
distances. Our motivation is that in practice, a number of spectral
NNS algorithms outperform the random-projection methods that seem
otherwise theoretically optimal on worst-case datasets. In this paper
we aim to provide theoretical justification for this disparity.
We undertake a systematic study of sketching a quadratic form: given
an $n \times n$ matrix $A$, create a succinct sketch $sk(A)$ which
can produce (without further access to $A$) a multiplicative
$(1+\epsilon)$-approximation to $x^T A x$ for any desired query $x \in
R^n$.
While a general matrix does not admit non-trivial sketches,
positive semi-definite (PSD) matrices admit sketches of size
$\Theta(\epsilon^{-2} n)$, via the Johnson-Lindenstrauss lemma,
achieving the ``for each'' guarantee, namely, for each query $x$, with a constant
probability the sketch succeeds. (For the stronger ``for all''
guarantee, where the sketch succeeds for all $x$'s simultaneously,
again there are no non-trivial sketches.)
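A minimal numpy rendition of that JL-based ``for each'' sketch for PSD matrices (ours, for illustration):

    import numpy as np

    def psd_quadratic_sketch(A, eps, seed=0):
        # write x^T A x = ||A^{1/2} x||^2 and store Pi @ A^{1/2} for a random
        # k x n Gaussian matrix Pi with k = O(1/eps^2) rows -- in total
        # O(n / eps^2) numbers; for each fixed query x, with constant
        # probability ||Pi A^{1/2} x||^2 = (1 +- eps) x^T A x
        w, V = np.linalg.eigh(A)  # PSD square root via eigendecomposition
        root = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T
        k = int(np.ceil(8 / eps ** 2))
        Pi = np.random.default_rng(seed).standard_normal((k, A.shape[0])) / np.sqrt(k)
        return Pi @ root

    def estimate_quadratic_form(sketch, x):
        return float(np.linalg.norm(sketch @ x) ** 2)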
We design significantly better sketches for the important subclass of graph Laplacian matrices,
which we also extend to symmetric diagonally dominant matrices.
A sequence of work culminating in that of Batson, Spielman, and Srivastava (SIAM Review, 2014)
shows that by choosing and reweighting $O(\epsilon^{-2} n)$ edges in a graph,
one achieves the ``for all'' guarantee. Our main results advance this front.
Our results provide the first separation between the sketch size needed
for the ``for all'' and ``for each'' guarantees for Laplacian
matrices.
In edge orientations, the goal is usually to orient (direct) the edges of
an undirected $n$-vertex graph $G$ such that all out-degrees are bounded.
When the graph $G$ is fully dynamic, i.e., admits edge insertions and deletions,
we wish to maintain such an orientation while keeping a tab on
the update time.
Low out-degree orientations turned out to be a surprisingly useful tool,
with several algorithmic applications involving static or dynamic graphs.
Brodal and Fagerberg (1999) initiated the study of the edge orientation problem
in terms of the graph's arboricity, which is very natural in this context.
They provided a solution
with constant out-degree and amortized logarithmic update time
for all graphs with constant arboricity,
which include all planar and excluded-minor graphs.
However, it remained an open question (first proposed by Brodal and Fagerberg, later by others)
to obtain similar bounds with worst-case update time.
We resolve this 15-year-old question in the affirmative,
by providing a simple algorithm with worst-case bounds
that nearly match the previous amortized bounds.
Our algorithm is based on a new approach of a combinatorial invariant,
and achieves a logarithmic out-degree with logarithmic worst-case update times.
This result has applications in various dynamic graph problems
such as maintaining a maximal matching, where we obtain $O(\log n)$ worst-case update time compared to the
$O(\frac{\log n}{\log\log n})$ amortized update time of Neiman and Solomon (2013).
The problem of partitioning an edge-capacitated graph on n vertices
into k balanced parts has been amply researched.
Motivated by applications such as load balancing in distributed
systems and market segmentation in social networks, we propose a new
variant of the problem, called Multiply Balanced k-Partitioning,
where the vertex-partition must be balanced under d vertex-weight functions
simultaneously.
We design bicriteria approximation algorithms for this problem, i.e.,
they partition the vertices into up to $k$ parts that are nearly balanced
simultaneously for all weight functions, and their approximation factor
for the capacity of cut edges matches the bounds known for a single
weight function, times a factor of $d$.
For the case where d=2, for vertex weights that are integers bounded
by a polynomial in n and any fixed $\epsilon>0$, we obtain a
$(2+\epsilon, O(\sqrt{\log n \log k}))$-bicriteria approximation,
namely, we partition the graph into parts whose weight
is at most $2+\epsilon$ times that of a perfectly balanced part
(simultaneously for both weight functions), and whose cut capacity is
$O(\sqrt{\log n \log k})\cdot OPT$. For unbounded (exponential)
vertex weights, we achieve approximation
$(3, O(\log n))$.
Our algorithm generalizes to d weight functions as follows:
For vertex weights that are integers bounded by a polynomial in $n$
and any fixed $\epsilon>0$, we obtain a $(2d +\epsilon,
O(d\sqrt{\log n \log k}))$-bicriteria approximation. For unbounded
(exponential) vertex weights, we achieve approximation
$(2d+ 1, O(d\log n))$.
A useful approach to "compress" a large network G is to represent it with a flow-sparsifier, i.e., a small network H that supports the same flows as G, up to a factor $q \geq 1$ called the quality of the sparsifier. Specifically, we assume the network G contains a set of k terminals T, shared with the network H, i.e., $T\subseteq V(G)\cap V(H)$, and we want H to preserve all multicommodity flows that can be routed between the terminals T. The challenge is to construct H that is small.
These questions have received a lot of attention in recent years, leading to some known tradeoffs between the sparsifier's quality $q$ and its size |V(H)|. Nevertheless, it remains an outstanding question whether every $G$ admits a flow-sparsifier H with quality q=1+ε, or even q=O(1), and size $|V(H)|\leq f(k,ε)$ (in particular, independent of |V(G)| and the edge capacities).
Making a first step in this direction, we present new constructions for several scenarios.
We consider the problem of Non-Uniform Graph Partitioning,
where the input is an edge-weighted undirected graph G=(V,E)
and k capacities n_1,...,n_k, and the goal is to find
a partition $\{S_1,S_2,\ldots,S_k\}$ of $V$ satisfying $|S_j| \leq n_j$ for all $1\leq j\leq k$, that minimizes the total weight
of edges crossing between different parts.
This natural graph partitioning problem arises in practical scenarios,
and generalizes well-studied balanced partitioning problems such as
Minimum Bisection, Minimum Balanced Cut, and Minimum k-Partitioning.
Unlike these problems, Non-Uniform Graph Partitioning seems to be resistant to many of the known partitioning techniques, such as spreading metrics, recursive partitioning, and Raecke's tree decomposition,
because k can be a function of n and the capacities could be of different magnitudes.
We present a bicriteria approximation algorithm for Non-Uniform Graph Partitioning
that approximates the objective within O(log n) factor
while deviating from the required capacities by at most a constant factor.
Our approach is to apply stopping-time based concentration results to a simple
randomized rounding of a configuration LP.
These concentration bounds are needed as the commonly used techniques of
bounded differences and bounded conditioned variances do not suffice.
Estimating the leading principal components
of data assuming they are sparse,
is a central task in modern high-dimensional statistics.
Many algorithms were suggested for this sparse PCA problem, from
simple diagonal thresholding to sophisticated
semidefinite programming (SDP) methods.
A key theoretical question asks under what conditions
can such algorithms recover the sparse principal components.
We study this question for a single-spike model,
with a spike that is $\ell_0$-sparse,
and dimension $p$ and sample size $n$ that tend to infinity.
Amini and Wainwright (2009) proved that for sparsity levels $k\geq\Omega(n/\log p)$,
no algorithm, efficient or not, can reliably recover the sparse eigenvector.
In contrast, for sparsity levels $k\leq O(\sqrt{n/\log p})$,
diagonal thresholding is asymptotically consistent.
It was further conjectured that the SDP approach may close this gap
between computational and information limits.
We prove that when $k \geq \Omega(\sqrt{n})$ the SDP approach, at least in its standard usage,
cannot recover the sparse spike.
In fact, we conjecture that in the single-spike model,
no computationally-efficient algorithm can recover a spike
of $\ell_0$-sparsity $k\geq \Omega(\sqrt{n})$.
Finally, we present empirical results suggesting that up to sparsity levels
$k=O(\sqrt{n})$, recovery is possible by
a simple covariance thresholding algorithm.
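A minimal sketch of ours for the covariance-thresholding estimator referenced above, assuming the rows of $X$ are $n$ centered samples in dimension $p$:

    import numpy as np

    def covariance_thresholding(X, tau):
        # form the sample covariance, zero out entries below the threshold
        # tau (typically of order sqrt(log(p)/n)), and return the top
        # eigenvector of the thresholded matrix as the spike estimate
        n = X.shape[0]
        Sigma = X.T @ X / n
        Sigma[np.abs(Sigma) < tau] = 0.0
        _, V = np.linalg.eigh(Sigma)
        return V[:, -1]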
Our main result is that the Steiner Point Removal (SPR) problem
can always be solved with polylogarithmic distortion,
which answers in the affirmative
a question posed by Chan, Xia, Konjevod, and Richa (2006).
Specifically, we prove that for every edge-weighted graph $G = (V,E,w)$
and a subset of terminals $T \subseteq V$,
there is a graph $G'=(T,E',w')$ that is isomorphic to a minor of $G$,
such that for every two terminals $u,v\in T$,
the shortest-path distances between them in $G$ and in $G'$ satisfy
$d_{G,w}(u,v) \le d_{G',w'}(u,v) \le O(\log^5|T|) \cdot d_{G,w}(u,v)$.
Our existence proof actually gives a randomized polynomial-time algorithm.
Our proof features a new variant of metric decomposition.
It is well-known that every finite metric space $(X,d)$ admits
a $\beta$-separating decomposition for $\beta=O(\log |X|)$,
which means that for every $\Delta>0$ there is a randomized
partitioning of $X$ into clusters of diameter at most $\Delta$,
satisfying the following separation property:
for every $x,y \in X$, the probability they lie in different clusters
of the partition is at most $\beta\,d(x,y)/\Delta$.
We introduce an additional requirement in the form of a tail bound:
for every shortest-path $P$ of length $d(P) \leq \Delta/\beta$,
the number of clusters of the partition that meet the path $P$, denoted $Z_P$,
satisfies $\Pr[Z_P > t] \le 2e^{-\Omega(t)}$ for all $t>0$.
We initiate the study of dimensionality reduction in general metric spaces
in the context of supervised learning.
Our statistical contribution consists of tight Rademacher bounds
for Lipschitz functions in metric spaces that are doubling, or nearly doubling.
As a by-product, we obtain a new theoretical explanation for the
empirically reported improvements gained by
pre-processing Euclidean data by PCA (Principal Components Analysis)
prior to constructing a linear classifier.
On the algorithmic front, we describe an analogue of PCA for metric spaces,
namely an efficient procedure that approximates the data's intrinsic dimension,
which is often much lower than the ambient dimension.
Thus, our approach can exploit the dual benefits of low dimensionality:
(1) more efficient proximity search algorithms, and
(2) more optimistic generalization bounds.
Given a large edge-weighted network G with k terminal vertices,
we wish to compress it and store, using little memory,
the value of the minimum cut (or equivalently, maximum flow)
between every bipartition of terminals.
One appealing methodology to implement a compression of G
is to construct a mimicking network:
a small network G' with the same k terminals,
in which the minimum cut value between every bipartition of terminals
is the same as in G.
This notion was introduced by Hagerup, Katajainen, Nishimura, and Ragde [JCSS '98],
who proved that such G' of size at most $2^{2^k}$ always exists.
Obviously, by having access to the smaller network G',
certain computations involving cuts can be carried out much more efficiently.
We provide several new bounds, which together narrow
the previously known gap from doubly-exponential to only singly-exponential,
both for planar and for general graphs.
Our first and main result is that every $k$-terminal planar network
admits a mimicking network G' of size $O(k^2\cdot 2^{2k})$,
which is moreover a minor of G.
On the other hand, some planar networks $G$ require $|E(G')| \ge \Omega(k^2)$.
For general networks, we show that certain bipartite graphs
only admit mimicking networks of size $|V(G')| \geq 2^{\Omega(k)}$,
and moreover, every data structure that stores the minimum cut value between
all bipartitions of the terminals must use $2^{\Omega(k)}$ machine words.
We examine the efficiency of clustering a set of points,
when the encompassing metric space may be preprocessed in advance.
In computational problems of this genre,
there is a first stage of preprocessing, whose input is a collection of points $M$;
the next stage receives as input a query set $Q\subset M$,
and should report a clustering of $Q$ according to some objective,
such as 1-median, in which case the answer is
a point $a\in M$ minimizing $\sum_{q\in Q} d_M(a,q)$.
We design fast algorithms that approximately solve such problems
under standard clustering objectives like p-center and p-median,
when the metric $M$ has low doubling dimension.
By leveraging the preprocessing stage, our algorithms achieve
query time that is near-linear in the query size $n=|Q|$,
and is (almost) independent of the total number of points $m=|M|$.
The significant progress in constructing graph spanners that are
sparse (small number of edges) or light (low total weight) has skipped
spanners that are everywhere-sparse (small maximum degree). This
disparity is in line with other network design problems, where the
maximum-degree objective has been a notorious technical challenge. Our
main result is for the Lowest Degree 2-Spanner (LD2S) problem, where
the goal is to compute a 2-spanner of an input graph so as to minimize
the maximum degree. We design a polynomial-time algorithm achieving
approximation factor $\tilde O(\Delta^{3-2\sqrt{2}}) \approx \tilde
O(\Delta^{0.172})$, where $\Delta$ is the maximum degree of the input
graph. The previous $\tilde O(\Delta^{1/4})$-approximation was proved
nearly two decades ago by Kortsarz and Peleg [SODA 1994, SICOMP 1998].
Our main conceptual contribution is to establish a formal connection
between LD2S and a variant of the Densest k-Subgraph (DkS)
problem. Specifically, we design for both problems strong relaxations
based on the Sherali-Adams linear programming (LP) hierarchy, and show
that "faithful" randomized rounding of the DkS-variant can be used to
round LD2S solutions. Our notion of faithfulness intuitively means
that all vertices and edges are chosen with probability proportional
to their LP value, but the precise formulation is more subtle.
Unfortunately, the best algorithms known for DkS use the
Lov\'asz-Schrijver LP hierarchy in a non-faithful way [Bhaskara,
Charikar, Chlamtac, Feige, and Vijayaraghavan, STOC 2010]. Our main
technical contribution is to overcome this shortcoming, while still
matching the gap that arises in random graphs by planting a subgraph
with the same log-density.
We introduce the following notion of compressing an undirected graph G
with edge-lengths and terminal vertices $R\subseteq V(G)$.
A distance-preserving minor is a minor G' (of G)
with possibly different edge-lengths, such that $R\subseteq V(G')$ and
the shortest-path distance between every pair of terminals
is exactly the same in G and in G'.
What is the smallest $f^*(k)$ such that every graph G with k=|R| terminals
admits a distance-preserving minor G' with at most $f^*(k)$ vertices?
Simple analysis shows that $f^*(k)\le O(k^4)$.
Our main result proves that $f^*(k)\ge \Omega(k^2)$,
significantly improving over the trivial $f^*(k)\ge k$.
Our lower bound holds even for planar graphs G,
in contrast to graphs G of constant treewidth,
for which we prove that O(k) vertices suffice.
The Traveling Salesman Problem (TSP) is among the most famous NP-hard optimization problems.
We design for this problem a randomized polynomial-time algorithm that computes
a $(1+\epsilon)$-approximation to the optimal tour, for any fixed $\epsilon>0$,
in TSP instances that form an arbitrary metric space
with bounded intrinsic dimension.
The celebrated results of Arora [JACM 1998] and Mitchell [SICOMP 1999]
prove that the above result holds in the special case of
TSP in a fixed-dimensional Euclidean space.
Thus, our algorithm demonstrates that the algorithmic tractability of metric TSP
depends on the dimensionality of the space and not on its specific geometry.
This result resolves a problem that has been open
since the quasi-polynomial time algorithm of Talwar [STOC 2004].
We present a framework for performing efficient regression in general metric
spaces. Roughly speaking, our regressor predicts the value at a new point by
computing a Lipschitz extension --- the smoothest function consistent with the
observed data --- after performing structural risk minimization to avoid
overfitting. We obtain finite-sample risk bounds with minimal structural and
noise assumptions, and a natural speed-precision tradeoff. The offline
(learning) and online (prediction) stages can be solved by convex programming,
but this naive approach has runtime complexity $O(n^3)$, which is prohibitive
for large datasets. We design instead a regression algorithm whose speed and
generalization performance depend on the intrinsic dimension of the data, to
which the algorithm adapts. While our main innovation is algorithmic, the
statistical results may also be of independent interest.
We study graph partitioning problems from a min-max perspective, in
which an input graph on $n$ vertices should be partitioned into $k$ parts,
and the objective is to minimize the maximum number of edges leaving a single part.
The
two main versions we consider are where the $k$ parts need to be of equal-size,
and where they must separate a set of $k$ given terminals.
We consider a common generalization of these two problems,
and design for it an $O(\sqrt{\log n\log k})$-approximation algorithm.
This improves over an $O(\log^2 n)$ approximation for the second version due
to Svitkina and Tardos (2004),
and roughly $O(k\log n)$ approximation for the first version that follows from
other previous work.
We also give an improved $O(1)$-approximation algorithm
for graphs that exclude any fixed minor.
Along the way, we study the Unbalanced-Cut problem,
whose goal is to find, in an input graph $G=(V,E)$, a set $S\subseteq V$
of size $|S|=\rho n$ that minimizes the number of edges leaving $S$.
We provide a bicriteria approximation of $O(\sqrt{\log{n}\log{(1/\rho)}})$;
when the input graph excludes a fixed-minor we improve this
guarantee to $O(1)$.
Note that the special case $\rho = 1/2$ is the well-known Minimum-Bisection
problem, and indeed our bounds generalize those of Arora, Rao and Vazirani (2008)
and of Klein, Plotkin, and Rao (1993).
Our algorithms work also for the closely related Small-Set Expansion (SSE)
problem, which asks for a set $S\subseteq V$ of size $0<|S| \leq \rho n$
with minimum edge-expansion, and
was suggested recently by Raghavendra and Steurer (2010).
In fact, our algorithm handles more general, weighted,
versions of both problems.
Previously, an $O(\log n)$ true approximation for both Unbalanced-Cut and Small-Set Expansion
follows from Raecke (2008).
In STOC 2005, Indyk and Woodruff introduced a new technique
that yielded a near-optimal space algorithm for the $F_k$ moment
estimation problem. Since then, the technique
has inspired a number of advances in streaming algorithmics.
We show that at least several of these results follow easily from
the application of a single probabilistic technique,
called Precision Sampling.
Using it, we obtain simple streaming algorithms that maintain
a randomized sketch of an input vector $x=(x_1,\ldots,x_n)$,
which is useful for several applications.
Precision Sampling itself addresses the problem of estimating
a sum $\sum_{i=1}^n a_i$ from weak estimates of each real $a_i\in[0,1]$.
More precisely, one chooses in advance ``precisions'' $w_i\ge 1$ for each $i\in[n]$
and obtains an estimate of $a_i$ within additive $1/w_i$. The core
question is: what is the
best trade-off between the approximation to $\sum a_i$ and
the total precision, $\sum_i w_i$?
In previous work, we showed [Andoni, Krauthgamer, and Onak, FOCS 2010]
that, as long as $\sum a_i=\Omega(1)$, one can achieve good
multiplicative approximation using total precision of only $O(n\log n)$.
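A toy simulation of ours may help see why such weak, precision-$w_i$ estimates suffice; the actual Precision Sampling Lemma additionally caps the $w_i$'s to obtain the total precision $O(n\log n)$ stated above:

    import random

    def precision_sampling_estimate(a, t=2.0, reps=200):
        # draw u_i uniform in (0,1] and set the precision w_i = 1/u_i; an
        # estimate a_hat_i within additive 1/w_i = u_i of a_i suffices to
        # (approximately) test the event {a_i > t * u_i}, which has
        # probability a_i / t when 0 <= a_i <= 1 <= t, so
        # t * #{i : a_hat_i > t * u_i} estimates sum_i a_i;
        # averaging over repetitions reduces the variance
        total = 0.0
        for _ in range(reps):
            cnt = 0
            for a_i in a:
                u = random.random() or 1e-12
                a_hat = a_i + random.uniform(-u / 3, u / 3)  # weak estimate
                if a_hat > t * u:
                    cnt += 1
            total += t * cnt
        return total / reps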
A natural requirement of many distributed structures is fault-tolerance:
after some failures, whatever remains from the structure should still be
effective for whatever remains from the network.
In this paper we examine spanners of general graphs that are tolerant to
vertex failures, and significantly improve their dependence on the number
of faults $r$, for all stretch bounds.
For stretch $k \geq 3$ we design a simple transformation that converts
every k-spanner construction with at most $f(n)$ edges into
an $r$-fault-tolerant $k$-spanner construction with at most $O(r^3
\log n) \cdot f(2n/r)$ edges. Applying this to standard greedy
spanner constructions gives $r$-fault-tolerant $k$-spanners with
$\tilde O(r^{2} n^{1+\frac{2}{k+1}})$ edges. The previous
construction by Chechik, Langberg, Peleg, and Roddity [STOC 2009]
depends similarly on n but exponentially on r
(approximately like $k^r$).
For the case $k=2$ and unit-length edges, an $O(r \log n)$-approximation
algorithm is known from recent work of Dinitz and Krauthgamer [arXiv 2010],
where several spanner results are obtained using a common approach of
rounding a natural flow-based linear programming relaxation.
Here we use a different (stronger) LP relaxation and improve the
approximation ratio to $O(\log n)$, which is, notably, independent
of the number of faults r.
We further strengthen this bound in terms of the maximum degree by using
the Lovasz Local Lemma.
Finally, we show that most of our constructions are inherently local
by designing equivalent distributed algorithms in the
LOCAL model of distributed computation.
We examine directed spanners through flow-based linear programming relaxations.
We design an $\tilde{O}(n^{2/3})$-approximation algorithm
for the directed $k$-spanner problem that works for all $k\geq 1$,
which is the first sublinear approximation for arbitrary edge-lengths.
Even in the more restricted setting of unit edge-lengths,
our algorithm improves over the previous $\tilde{O}(n^{1-1/k})$ approximation
of Bhattacharyya, Grigorescu, Jung, Raskhodnikova, and Woodruff [SODA 2009]
when $k\geq 4$.
For the special case of $k=3$
we design a different algorithm achieving an $\tilde{O}(\sqrt{n})$-approximation,
improving the previous $\tilde{O}(n^{2/3})$
of Elkin and Peleg [TCS 2005] and Bhattacharyya et al. [SODA 2009].
Both of our algorithms easily extend to the fault-tolerant setting,
which has recently attracted attention but not from an approximation viewpoint.
We also prove a nearly matching integrality gap of
$\tilde{\Omega}(n^{1/3 - \epsilon})$ for any constant $\epsilon > 0$.
A virtue of all our algorithms is that they are relatively simple.
Technically, we introduce a new yet natural flow-based relaxation,
and show how to approximately solve it even when its size is not polynomial.
The main challenge is to design a rounding scheme
that ``coordinates'' the choices of flow-paths between the many demand pairs
while using few edges overall.
We achieve this, roughly speaking,
by randomization at the level of vertices.
We give the first constant-factor approximation algorithm for
Sparsest-Cut with general demands in bounded-treewidth graphs.
In contrast to previous algorithms, which rely on the flow-cut gap
and/or metric embeddings, our approach exploits the Sherali-Adams
hierarchy of linear programming relaxations.
Given a capacitated graph G = (V,E) and a set of terminals $K \subseteq
V$, how should we produce a graph H only on the terminals K so
that every (multicommodity) flow between the terminals in G could
be supported in H with low congestion, and vice versa? (Such a
graph H is called a flow-sparsifier for G.) What if we
want H to be a ``simple'' graph? What if we allow H to be a
convex combination of simple graphs?
Improving on results of Moitra [FOCS 2009]
and Leighton and Moitra [STOC 2010], we give efficient algorithms for
constructing: (a) a flow-sparsifier $H$ that maintains congestion
up to a factor of $O(\log k/\log\log k)$, where $k = |K|$.
(b) a convex combination
of trees over the terminals K that maintains congestion up to a
factor of O(log k). (c) for a planar graph G, a convex combination of
planar graphs that maintains congestion
up to a constant factor. This requires us to give a new algorithm
for the 0-extension problem, the first one in which the preimages of each
terminal are connected in G. Moreover, this result extends to
minor-closed families of graphs.
Our bounds immediately imply improved approximation
guarantees for several terminal-based cut and ordering problems.
We introduce a new problem in the study of doubling spaces:
Given a point set $S$ and a target dimension $d^*$,
remove from $S$ the fewest points so that
the remaining set has doubling dimension at most $d^*$. We present a
bicriteria approximation for this problem, and extend this algorithm to
solve a group of proximity problems.
Recent advances in large-margin classification of data residing in general metric
spaces (rather than Hilbert spaces) enable classification under
various natural metrics, such as edit and earthmover distance.
The general framework developed for this purpose by von Luxburg and Bousquet [JMLR, 2004]
left open the question of computational efficiency
and providing direct bounds on classification error.
We design a new algorithm for classification in general metric spaces,
whose runtime and accuracy depend on the doubling dimension of the data points.
It thus achieves superior classification performance in many common scenarios.
The algorithmic core of our approach is an approximate (rather than exact) solution
to the classical problems of Lipschitz extension and of Nearest Neighbor Search.
The algorithm's generalization performance is established
via the fat-shattering dimension of Lipschitz classifiers.
We present a near-linear time algorithm that approximates the edit distance
between two strings within a polylogarithmic factor;
specifically, for strings of length n and every fixed $\epsilon>0$,
it can compute a $(\log n)^{O(1/\epsilon)}$-approximation in $n^{1+\epsilon}$ time.
This is an {\em exponential} improvement over the previously known factor,
$2^{\tilde O(\sqrt{\log n})}$, with a comparable running time [OR05,AO09].
Previously, no efficient polylogarithmic approximation algorithm was known
for any computational task involving edit distance
(e.g., nearest neighbor search or sketching).
This result arises naturally in the study of
a new asymmetric query model.
In this model, the input consists of two strings x and y,
and an algorithm can access y in an
unrestricted manner, while being charged for querying every symbol of x.
Indeed, we obtain our main result by designing an algorithm that
makes a small number of queries in this model.
We then provide also a nearly-matching lower bound on the
number of queries.
Our lower bound is the first to expose hardness of edit distance
stemming from the input strings being ``repetitive'',
which means that many of their substrings are approximately identical.
Consequently, our lower bound provides the first rigorous separation
between edit distance and Ulam distance,
which is edit distance on non-repetitive strings, i.e., permutations.
Our main result is that the snowflake metric $d^{1/2}$ of a doubling set
$S\subset \ell_2$ can be embedded with arbitrarily low distortion
into $\ell_2^D$, for dimension $D$ that depends solely on
the doubling constant of the metric. In fact, the target
dimension is polylogarithmic in the doubling constant. Our techniques are
robust and extend to the more difficult spaces $\ell_1$ and $\ell_\infty$.
The quest for a Polynomial Time Approximation Scheme (PTAS) for
Nash equilibrium in a two-player game,
which emerged as a major open question in Algorithmic Game Theory,
seeks to circumvent the PPAD-completeness of finding an (exact) Nash equilibrium
by finding an approximate equilibrium.
The closely related problem of finding an equilibrium maximizing
a certain objective, such as the social welfare,
was shown to be NP-hard [Gilboa and Zemel, Games and Economic Behavior, 1989].
However, this NP-hardness is unlikely to extend to approximate equilibria,
since finding an approximate equilibrium admits a quasi-polynomial time algorithm
[Lipton, Markakis and Mehta, In Proc. of EC, 2003].
We show that this optimization problem, namely, finding in a two-player game
an approximate equilibrium achieving a large social welfare,
is unlikely to have a polynomial time algorithm.
One interpretation of our results is that a PTAS for Nash equilibrium
(if it exists) should not extend to a PTAS for finding the best Nash
equilibrium.
Technically, our result is a reduction from the
notoriously difficult problem in modern Combinatorics, of finding a
planted (but hidden) clique in a random graph G(n,1/2).
Our reduction starts from an instance with planted clique size O(log n).
For comparison, the currently known algorithms
are effective only for a much larger clique size Omega(sqrt n).
We consider the k-balanced partitioning problem, where the
goal is to partition the vertices of an input graph G into k
equally sized components, while minimizing the total weight of the
edges connecting different components. We allow k to be part of
the input and denote the cardinality of the vertex set by n.
This problem is a natural and important generalization of
well-known graph partitioning problems, including minimum
bisection and minimum balanced cut.
We present a (bi-criteria) approximation algorithm achieving
an approximation of O(sqrt{log n / log k}), which
matches or improves over previous algorithms for all relevant values of
k. Our algorithm uses a semidefinite relaxation which combines
l_2^2 metrics with spreading metrics.
We consider the problem of approximate nearest neighbors in high
dimensions, when the queries are lines. In this problem, given n
points in R^d, we want to construct a data structure to support
efficiently the following queries: given a line L, report the point p
closest to L. This problem generalizes the more familiar nearest
neighbor problem. From a practical perspective, lines, and
low-dimensional flats in general, may model data under linear
variation, such as physical objects under different lighting.
For approximation 1+epsilon, we achieve a query time of
O(n^{0.5+t}), for arbitrary small t>0, with a space of
n^{O(1/epsilon^2+1/t^2)}. To the best of our knowledge, this is the
first algorithm for this problem with polynomial space and sub-linear
query time.
A common approach for solving computational problems over a difficult
metric space is to embed the ``hard'' metric into L_1, which admits
efficient algorithms and is thus considered an ``easy'' metric. This
approach has proved successful or partially successful for important
spaces such as the edit distance, but it also has inherent
limitations: it is provably impossible to go below certain
approximation for some metrics.
We propose a new approach, of embedding the difficult space into
richer host spaces, namely iterated products of standard spaces like
L_1 and L_\infty. We show that this class is rich,
since it contains useful metric spaces with only constant
distortion, and, at the same time, it is tractable and admits
efficient algorithms. Using this approach, we obtain for example the
first nearest neighbor data structure with O(log log d)
approximation for edit distance in non-repetitive strings (the Ulam
metric). This approximation is exponentially better than the
lower bound for embedding into L_1. Furthermore, we give
constant-factor approximations for two other computational problems.
Along the way, we answer positively a question posed in
[Ajtai, Jayram, Kumar, and Sivakumar, STOC 2002].
One of our algorithms has already found applications for smoothed edit
distance over 0-1 strings [Andoni and Krauthgamer, ICALP 2008].
We initiate the study of the smoothed complexity of sequence alignment,
by proposing a semi-random model of edit distance between two input strings,
generated as follows.
First, an adversary
chooses two binary strings of length d and a longest common subsequence
A of them.
Then, every character is perturbed independently with probability p,
except that A is perturbed in exactly the same way inside the two strings.
We design two efficient algorithms that compute the edit
distance on smoothed instances up to a constant factor approximation.
The first algorithm runs in near-linear time, namely d^{1+epsilon} for
any fixed epsilon>0. The second one runs in time sublinear in d,
assuming the edit distance is not too small. These approximation and
runtime guarantees are significantly better than the bounds known for
worst-case inputs, e.g., the near-linear time algorithm achieving
approximation roughly d^{1/3}, due to Batu, Ergun, and Sahinalp
[SODA 2006].
Our technical contribution is twofold. First, we rely on finding
matches between substrings in the two strings, where two substrings
are considered a match if their edit distance is relatively small, a
prevailing technique in commonly used heuristics, such as PatternHunter
of Ma, Tromp and Li [Bioinformatics, 2002].
Second, we effectively reduce the smoothed edit distance to a
simpler variant of (worst-case) edit distance, namely, edit distance
on permutations (a.k.a. Ulam's metric).
We are thus able to build on algorithms developed for the Ulam metric,
whose much better algorithmic guarantees usually do not carry over to
general edit distance.
A common technique for processing conjunctive queries is to first match each
predicate separately using an index lookup, and then compute the intersection
of the resulting row-id lists, via an AND-tree.
The performance of this technique depends crucially on the order of lists in
this tree: it is important to compute early the intersections that will produce
small results. But this optimization is hard to do when the data or predicates
have correlation.
We present a new algorithm for ordering the lists in an AND-tree by sampling
the intermediate intersection sizes. We prove that our algorithm is
near-optimal and validate its effectiveness experimentally on datasets with a
variety of distributions.
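To make this concrete, here is a minimal sketch (in Python; the greedy left-to-right ordering and all names are ours, for illustration only, and carry none of the paper's guarantees) of ordering the lists by sampled intermediate intersection sizes:

```python
import random

def order_and_tree(lists, sample_size=200):
    # Greedily build a left-deep AND-tree: at each step, append the list
    # estimated to shrink the running intersection the most, where
    # |current ∩ L| is estimated from a uniform sample of `current`.
    remaining = sorted(lists, key=len)
    order = [remaining.pop(0)]               # start with the shortest list
    current = set(order[0])
    while remaining and current:
        sample = random.sample(sorted(current),
                               min(sample_size, len(current)))
        def est_size(L):                     # sampled estimate of |current ∩ L|
            hits = sum(1 for rid in sample if rid in L)
            return hits / len(sample) * len(current)
        i = min(range(len(remaining)),
                key=lambda j: est_size(set(remaining[j])))
        nxt = remaining.pop(i)
        order.append(nxt)
        current &= set(nxt)
    return order, sorted(current)
```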
How should a seller price his goods in a market where each buyer
prefers a single good among his desired goods, and will buy the
cheapest such good, as long as it is within his budget? We provide
efficient algorithms that compute near-optimal prices for this
problem, focusing on a commodity market, where the range of buyer
budgets is small. We also show that our technique (which is based on
LP-rounding) easily extends to a different scenario, in which the buyers
want to buy all the desired goods, as long as they are within budget.
We design approximation algorithms for a number of fundamental
optimization problems in metric spaces, namely
computing separating and padded decompositions, sparse covers,
and metric triangulations.
Our work is the first to emphasize relative guarantees
that compare the produced solution to the optimal one for the
input at hand.
By contrast, the extensive previous work on these topics has sought
absolute bounds that hold for every possible metric space
(or for a family of metrics).
While absolute bounds typically translate to relative ones,
our algorithms provide significantly better relative guarantees,
using a rather different algorithm.
Our technical approach is to cast a number of metric clustering
problems that have been well studied---but almost always as disparate
problems---into a common modeling and algorithmic framework, which we
call the consistent labeling problem. Having identified the
common features of all of these problems, we provide a family of
linear programming relaxations and simple randomized rounding
procedures that achieve provably good approximation guarantees.
We prove the first non-trivial communication complexity lower bound
for the problem of estimating the edit distance (aka Levenshtein distance)
between two strings.
A major feature of our result is that it provides the first setting
in which the complexity of computing the edit distance is provably larger than
that of Hamming distance.
Our lower bound exhibits a trade-off between approximation and communication,
asserting, for example, that protocols with O(1) bits of communication
can only obtain approximation Omega(log d/loglog d),
where d is the length of the input strings.
This case of O(1) communication is of particular importance,
since it captures constant-size sketches as well as
embeddings into spaces like L_1 and squared-L_2,
two prevailing algorithmic approaches for dealing with edit distance.
Furthermore, the bound holds not only for strings over alphabet
Sigma={0,1},
but also for strings that are permutations (called the Ulam metric).
Besides being applicable to a much richer class of algorithms than
all previous results, our bounds are near-tight in at least one case,
namely of embedding permutations into L_1.
The proof uses a new technique, that relies on Fourier analysis
in a rather elementary way.
The Earth Mover Distance (EMD) between two equal-size sets
of points in R^d is defined to be the minimum cost of a
bipartite matching between the two pointsets. It is a natural metric
for comparing sets of features, and as such, it has received
significant interest in computer vision. Motivated by recent
developments in that area, we address computational
problems involving EMD over high-dimensional pointsets.
A natural approach is to embed the EMD metric into l_1, and use
the algorithms designed for the latter space. However, Khot and
Naor [FOCS 2005] show that any embedding of EMD over the d-dimensional
Hamming cube into l_1 must incur a distortion Omega(d), thus
practically losing all distance information. We circumvent this
roadblock by focusing on sets with cardinalities upper-bounded by a
parameter s, and achieve a distortion of only O(log s log
d). Since in applications the feature sets have bounded size, the
resulting distortion is much smaller than the Omega(d) lower
bound. Our approach is quite general and easily extends to EMD over
R^d.
We then provide a strong lower bound on the multi-round communication
complexity of estimating EMD, which in particular strengthens the known
non-embeddability result of Khot and Naor.
Our bound exhibits a smooth tradeoff between
approximation and communication, and for example implies that every algorithm
that estimates EMD using constant size sketches can only achieve
Omega(log s) approximation.
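For concreteness, EMD as defined above can be computed exactly by min-cost bipartite matching; a minimal sketch using SciPy's assignment solver (cubic time, which is precisely what the embeddings above let one avoid in high dimension):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def emd(A, B):
    # Exact EMD between equal-size point sets A, B in R^d: the minimum
    # cost of a perfect bipartite matching between them.
    A, B = np.asarray(A, float), np.asarray(B, float)
    cost = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)   # Hungarian method
    return cost[rows, cols].sum()

# emd([[0, 0], [1, 0]], [[0, 1], [1, 1]])  ->  2.0
```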
Network triangulation is a method for estimating distances between
nodes in the network, by letting every node measure its distance
to a few beacon nodes, and deducing the distance between every two nodes
x,y by using their measurements to their common beacons
and applying the triangle inequality.
Kleinberg, Slivkins and Wexler [FOCS 2004]
initiated a theoretical study of triangulation in metric spaces,
and Slivkins [PODC 2005] subsequently showed that metrics of bounded
doubling dimension admit a triangulation that
approximates arbitrarily well all pairwise distances
using only O(log n) beacons per point,
where n is the number of points in the network.
He then asked whether this O(log n) term is necessary (for doubling metrics).
We provide the first lower bounds on the number of beacons required for a
triangulation in some specific simple networks.
In particular, these bounds (i) answer Slivkins' question positively,
even for one-dimensional metrics, and
(ii) prove that, up to constants,
Slivkins' triangulation achieves an optimal number of beacons
(as a function of the approximation guarantee and the doubling dimension).
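The estimate itself is elementary; a minimal sketch (the input format is ours): given each node's measured distances to the common beacons, the triangle inequality confines the unknown distance to an interval.

```python
def triangulated_estimate(dx, dy):
    # dx[b], dy[b]: measured distances from nodes x and y to each common
    # beacon b. The true distance d(x, y) always lies in [lo, hi]:
    lo = max(abs(dx[b] - dy[b]) for b in dx)   # d(x,y) >= |d(x,b) - d(y,b)|
    hi = min(dx[b] + dy[b] for b in dx)        # d(x,y) <= d(x,b) + d(y,b)
    return lo, hi
```

The quality of a triangulation is then how close hi/lo is to 1 over all pairs, and the lower bounds above show that too few beacons cannot make this ratio uniformly small, even in simple networks.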
The distance to monotonicity of a sequence is the minimum number
of edit operations required to transform the sequence into increasing order;
this measure is complementary to the length of the longest
increasing subsequence (LIS).
We address the question of estimating these quantities in
the one-pass data stream model and present the first sub-linear space
algorithms for both problems.
We first present O(sqrt{n})-space
deterministic algorithms that approximate
the distance to monotonicity and the LIS to within a factor that is
arbitrarily close to $1$. We also show a lower bound of
Omega(n) on the space
required by any randomized algorithm to compute the LIS
(or alternatively the distance from monotonicity) exactly,
demonstrating that approximation is necessary for sub-linear space computation;
this bound improves upon the existing lower bound of
Omega(sqrt{n})
[Liben-Nowell, Vee, and Zhu, J. Combinatorial Optimization, 2006].
Our main result is a randomized
algorithm that uses only O(log^2 n) space and
approximates the distance
to monotonicity to within a factor that is arbitrarily close to 4.
In contrast, we believe that any significant
reduction in the space complexity of approximating the length of the LIS
is considerably harder to achieve. We conjecture that any deterministic 1 + epsilon
approximation algorithm for LIS requires
Omega(sqrt{n}) space, and as a step towards
this conjecture, prove a space lower bound of $\Omega(\sqrt n)$ for a restricted
yet natural class of deterministic algorithms.
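For reference, both quantities are easy offline (the point above is computing them in one pass over a stream, in sublinear space): the distance to monotonicity equals n minus the length of the LIS, which patience sorting finds in O(n log n) time.

```python
from bisect import bisect_left

def lis_length(seq):
    # Patience sorting: length of the longest strictly increasing
    # subsequence in O(n log n) time.
    tails = []
    for v in seq:
        i = bisect_left(tails, v)
        if i == len(tails):
            tails.append(v)
        else:
            tails[i] = v
    return len(tails)

def distance_to_monotonicity(seq):
    # Fewest elements to edit (equivalently, delete) so that the
    # remaining sequence is increasing: the complement of the LIS.
    return len(seq) - lis_length(seq)
```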
We initiate the study of approximate algorithms on
negatively curved spaces. These spaces have recently become of
interest in various domains of computer science including networking
and vision. The classical example of such a space is the
real-hyperbolic space H^d for d>=2, but our approach
applies to a more general family of spaces characterized by Gromov's
(combinatorial) hyperbolic condition. We give efficient algorithms
and data structures for problems like approximate nearest-neighbor
search and compact, low-stretch routing on subsets of negatively
curved spaces of fixed dimension (including H^d as a
special case).
In a different direction, we show that there is a PTAS for the
Traveling Salesman Problem when the set of cities lie, for example,
in H^d. This generalizes Arora's results
for R^d.
Most of
our algorithms use the intrinsic distance geometry of the data
set, and only need the existence of an embedding into some
negatively curved space in order to function properly. In other
words, our algorithms regard the inter-point distance function as a
black box, and are independent of the representation of the
input points.
Edit distance is a fundamental measure of distance between strings,
the extensive study of which has recently focused on computational
problems such as nearest neighbor search, sketching and fast
approximation. A very powerful paradigm is to map the metric space
induced by the edit distance
into a normed space (e.g., ℓ1) with small distortion,
and then use the rich algorithmic toolkit known for normed spaces.
Although the minimum distortion required to embed edit distance
into ℓ1 has received a lot of attention lately,
there is a large gap between known upper and lower bounds.
We make progress on this question by considering
large, well-structured, submetrics of the edit distance metric space.
Our main technical result is that the Ulam metric,
namely, the edit distance on permutations of length at most n,
embeds into ℓ1 with distortion O(log n).
This immediately leads to sketching algorithms with constant size sketches,
and to efficient approximate nearest neighbor search algorithms,
with approximation factor O(log n).
The embedding and its algorithmic consequences present a big improvement
over those previously known for the Ulam metric, and they are significantly
better than the state of the art for edit distance in general.
Further, we extend these results for the Ulam metric
to edit distance on strings that are (locally) non-repetitive,
i.e., strings where (close by) substrings are distinct.
We simplify and improve upon recent lower bounds on the minimum
distortion of embedding certain finite metric spaces into L_1.
In particular, we show that for infinitely many values of n
there are n-point metric spaces of negative type that require
a distortion of Omega(log log n) for such an embedding, implying
the same lower bound on the integrality gap of a well-known SDP
relaxation for Sparsest-Cut. This result builds upon and
improves the recent lower bound of (log log n)^{1/6-o(1)} due to
Khot and Vishnoi [STOC 2005]. We also show that embedding the edit
distance on {0,1}^n into L_1 requires a distortion of
Omega(log n). This result simplifies and improves a very recent
lower bound due to Khot and Naor [FOCS 2005].
We show that the Multicut, Sparsest-Cut, and Min-2CNF≡ Deletion
problems are NP-hard to approximate within every constant factor,
assuming the Unique Games Conjecture of Khot [STOC, 2002].
A quantitatively stronger version of
the conjecture implies an inapproximability factor of $\Omega(\sqrt{\log\log n})$.
We address the problems of pattern matching and approximate
pattern matching in the sketching model. We show that it is impossible to
compress the text into a small sketch and use only the sketch to
decide whether a given pattern occurs in the text.
We also prove a sketch size lower bound
for approximate pattern matching,
and show it is tight up to a logarithmic factor.
Edit distance has been extensively studied for the past several years.
Nevertheless, no linear-time algorithm is known to compute the
edit distance between two strings, or
even to approximate it to within a modest factor. Furthermore,
for various natural algorithmic problems such as
low-distortion embeddings into normed spaces,
approximate nearest-neighbor schemes, and sketching algorithms,
known results for the edit distance are rather weak.
We develop algorithms that solve gap versions of the edit distance problem:
given two strings of length $n$ with the promise that their edit distance
is either at most $k$ or greater than $l$, decide which of the two holds.
We present two sketching algorithms for gap versions of edit distance.
Our first algorithm solves the $k$ vs. $(k n)^{2/3}$ gap problem, using a
constant size sketch.
A more involved algorithm solves the stronger $k$ vs. $l$ gap problem,
where $l$ can be as small as $O(k^2)$---still with a constant-size sketch---but
works only for strings that are mildly ``non-repetitive''.
Finally, we develop an $n^{3/7}$-approximation quasi-linear time algorithm for
edit distance, improving the previous best factor of $n^{3/4}$ due to
Cole and Hariharan [SIAM J. on Computing, 2002];
if the input strings are assumed to be non-repetitive, then
the approximation factor can be strengthened to $n^{1/3}$.
We devise a new embedding technique, which we call measured
descent, based on decomposing a metric space locally, at varying
speeds, according to the density of some probability measure. This
provides a refined and unified framework for the two primary
methods of constructing Fr\'echet embeddings for finite metrics,
due to [Bourgain, 1985] and [Rao, 1999]. We prove that any
$n$-point metric space $(X,d)$ embeds in Hilbert space with
distortion $O(\sqrt{\alpha_X \cdot \log n})$, where $\alpha_X$ is
a geometric estimate on the decomposability of $X$. As an
immediate corollary, we obtain an $O(\sqrt{\log \lambda_X \log
n})$ distortion embedding, where $\lambda_X$ is the doubling
constant of $X$. Since $\lambda_X\le n$, this result recovers
Bourgain's theorem, but when the metric $X$ is, in a sense,
``low-dimensional,'' improved bounds are achieved.
Our embeddings are volume-respecting for subsets of arbitrary
size. One consequence is the existence of $(k, \log n)$
volume-respecting embeddings for all $1 \leq k \leq n$, which is
the best possible, and answers positively a question posed in
[Feige, 1998]. Our techniques are also used to answer positively a
question of Y. Rabinovich, showing that any weighted $n$-point
planar graph embeds in $\ell_\infty^{O(\log n)}$ with $O(1)$
distortion. The $O(\log n)$ bound on the dimension is optimal, and
improves upon the previously known bound of $O(\log n)^2$.
We define a natural notion of efficiency for approximate
nearest-neighbor (ANN) search in general $n$-point metric spaces,
namely the existence of a randomized algorithm which answers
(1+ε)-approximate nearest neighbor queries in
polylog(n) time using only polynomial space. We then study
which families of metric spaces admit efficient ANN schemes in the
black-box model, where only oracle access to the distance function
is given, and any query consistent with the triangle inequality
may be asked.
For ε < 2/5, we offer a complete answer to
this problem. Using the notion of metric dimension defined in
[GKL03] (à la [Assouad, 1983]), we show that a metric
space $X$ admits an efficient (1+ε)-ANN scheme for any
ε < 2/5 if and only if $\dim(X) = O(\log
\log n)$. For coarser approximations, clearly the upper bound
continues to hold, but there is a threshold at which our lower
bound breaks down---this is precisely when points in the ``ambient
space'' may begin to affect the complexity of ``hard'' subspaces
$S \subseteq X$. Indeed, we give examples which show that
$\dim(X)$ does not characterize the black-box complexity of ANN
above the threshold.
Our scheme for ANN in low-dimensional metric spaces
is the first to yield efficient algorithms
without relying on any heuristic assumptions
on the input. In previous approaches
(e.g., [Clarkson, 1999], [Karger and Ruhl, 2002], [Krauthgamer and Lee, 2004])
even spaces with $\dim(X) = O(1)$ sometimes required $\Omega(n)$ query times.
Aggregate statistical data about the web is very useful in numerous scenarios, such as market research, intelligence, and social studies. In many of these applications, one is interested not in generic data about the whole web but rather in highly focused information pertinent to a specific domain or topic. Furthermore, timely acquisition of this information provides a competitive advantage.
Focused statistical data can be gathered by a brute force crawl of the whole web, or by a "focused crawl", which collects mainly pages that are relevant to the topic of interest. Crawling, however, is an expensive enterprise, requiring substantial resources. For the purpose of gathering statistical data, random sampling of web pages is a much faster, cheaper, and even more reliable approach. We develop the first efficient method for generating a random sample of web pages relevant to a given user-specified topic. Previously, techniques for getting only an unfocused sample of pages from the whole web were known. Our method is based on a new random walk on (a modified version of) a subgraph of the web graph.
We present a simple deterministic data structure for maintaining a
set $S$ of points in a general metric space, while supporting
proximity search (nearest neighbor and range queries) and updates
to $S$ (insertions and deletions). Our data structure consists of
a sequence of progressively finer $\epsilon$-nets of $S$, with
pointers that allow us to navigate easily from one scale to the
next.
We analyze the worst-case complexity of this data structure in
terms of the ``abstract dimensionality'' of the metric $S$. Our
data structure is extremely efficient for metrics of bounded
dimension and is essentially optimal in a certain model of
distance computation.
Finally, as a special case, our
approach improves over one recently devised by Karger and Ruhl [STOC, 2002].
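A toy rendition of the idea (our simplification: interpoint distances are assumed to exceed 1, nets are built greedily offline, and each level is scanned directly rather than reached via the cross-scale pointers that make the actual data structure efficient and dynamic):

```python
import math

def build_nets(points, dist):
    # Progressively finer greedy r-nets at scales r = 2^i: net points are
    # pairwise more than r apart, and every point lies within r of some
    # net point.
    dmax = max(dist(p, q) for p in points for q in points if p != q)
    top = 2.0 ** math.ceil(math.log2(dmax))
    nets, r = {}, top
    while r >= 1.0:           # assumes all interpoint distances exceed 1
        net = []
        for p in points:
            if all(dist(p, c) > r for c in net):
                net.append(p)
        nets[r] = net
        r /= 2.0
    return nets, top

def nearest(nets, top, dist, q):
    # Navigate from the coarsest scale down. Invariant: at scale r, the
    # net point covering the true nearest neighbor lies within
    # dist(best, q) + 2r of q, so everything farther can be pruned.
    best, r = min(nets[top], key=lambda c: dist(c, q)), top / 2.0
    while r in nets:
        bound = dist(best, q) + 2.0 * r
        survivors = [c for c in nets[r] if dist(c, q) <= bound]
        best = min(survivors + [best], key=lambda c: dist(c, q))
        r /= 2.0
    return best
```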
Given a metric space $(X,d)$, a natural distance measure on
probability distributions over $X$ is the {\em earthmover metric}.
We use randomized rounding of earthmover metrics to devise new
approximation algorithms for two well-known classification problems,
namely, metric labeling and 0-extension.
Our first result is for the 0-extension problem. We show that if
the terminal metric is decomposable with parameter $\alpha$
(e.g., planar metrics are decomposable with $\alpha=O(1)$),
then the earthmover based linear program (for $0$-extension)
can be rounded to within an $O(\alpha)$ factor.
Our second result is an $O(\log n)$-approximation for metric
labeling, using probabilistic tree embeddings in a way very
different from the $O(\log k)$-approximation of Kleinberg and Tardos
[JACM, 2002].
(Here, $n$ is the number of nodes, and $k$ is the number of labels.) The
key element is rounding the earthmover based linear program
(for metric labeling) without increasing the solution's cost,
when the input graph is a tree. This rounding method also
provides an alternate proof to a result stated in Chekuri
et al. [SODA, 2001], that the earthmover based linear program
is integral when the input graph is a tree.
Our simple and constructive rounding techniques contribute to the
understanding of earthmover metrics and may be of independent
interest.
In the Asymmetric $k$-Center problem, the input is an integer $k$ and a complete
digraph over $n$ points together with a distance function obeying
the directed triangle inequality. The goal is to choose a set of $k$
points to serve as centers and to assign all the points to the
centers, so that the maximum distance of any point from its center
is as small as possible.
We show that the Asymmetric $k$-Center problem is hard to approximate up to a factor
of $\log^* n - O(1)$ unless $NP \subseteq DTIME(n^{\log \log
n})$. Since an $O(\log^* n)$-approximation algorithm is known for
this problem, this resolves the asymptotic approximability of Asymmetric $k$-Center.
This is the first natural problem whose approximability threshold
does not polynomially relate to the known approximation classes. We
also resolve the approximability threshold of the metric (symmetric)
$k$-Center problem with costs.
The doubling constant of a metric space $(X,d)$ is the smallest
value $\lambda$ such that every ball in $X$ can be covered by $\lambda$
balls of half the radius.
The doubling dimension of $X$ is then
defined as $dim(X) = \log_2 \lambda$.
A metric (or sequence of metrics)
is called doubling precisely
when its doubling dimension is bounded.
This is a robust class of metric spaces
which contains many
families of metrics that occur in applied settings.
We give tight bounds for embedding doubling metrics
into (low-dimensional) normed spaces.
We consider both general doubling metrics,
as well as more restricted families
such as those arising from trees,
from graphs excluding a fixed minor,
and from snowflaked metrics.
Our techniques include decomposition theorems
for doubling metrics, and an analysis
of a fractal in the plane due to Laakso [Bull. London Math. Soc., 2002].
Finally, we discuss some applications and point out
a central open question regarding dimensionality reduction in $L_2$.
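The definition can be checked directly on tiny examples; a brute-force sketch (exponential time, purely illustrative):

```python
import itertools, math

def doubling_constant(points, dist):
    # Smallest lambda such that every ball B(x, r) in the finite metric
    # space (points, dist) is covered by lambda balls of radius r/2
    # centered at points of the space.
    lam = 1
    radii = {dist(p, q) for p in points for q in points if p != q}
    for x in points:
        for r in radii:
            ball = [p for p in points if dist(x, p) <= r]
            for k in range(1, len(ball) + 1):  # smallest cover of `ball`
                if any(all(any(dist(c, p) <= r / 2 for c in centers)
                           for p in ball)
                       for centers in itertools.combinations(points, k)):
                    lam = max(lam, k)
                    break
    return lam, math.log2(lam)  # (doubling constant, doubling dimension)
```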
Motivation:
Comparing two protein databases is a fundamental task in biosequence
annotation. Given two databases, one must find all pairs of proteins
that align with high score under a biologically meaningful
substitution score matrix, such as a BLOSUM matrix (Henikoff and Henikoff, 1992).
Distance-based approaches to this problem map each peptide in the
database to a point in a metric space, such that peptides aligning
with higher scores are mapped to closer points. Many techniques exist
to discover close pairs of points in a metric space efficiently, but
the challenge in applying this work to proteomic comparison is to find
a distance mapping that accurately encodes all the distinctions among
residue pairs made by a proteomic score matrix. Buhler (2002)
proposed one such mapping but found that it led to a relatively
inefficient algorithm for protein-protein comparison.
Results:
This work proposes a new distance mapping for peptides under the
BLOSUM matrices that permits more efficient similarity search. We
first propose a new distance function on peptides derived from a given
score matrix. We then show how to map peptides to bit vectors such
that the distance between any two peptides is closely approximated by
the Hamming distance (i.e., number of mismatches) between
their corresponding bit vectors. We combine these two results with
the LSH-ALL-PAIRS-SIM algorithm of Buhler (2002) to produce an
improved distance-based algorithm for proteomic comparison. An
initial implementation of the improved algorithm exhibits sensitivity
within 5% of that of the original LSH-ALL-PAIRS-SIM, while running up
to eight times faster.
We devise the first constant factor approximation algorithm for
minimum quotient vertex-cuts in planar graphs.
Our algorithm achieves approximation ratio $3.5 (1+\epsilon)^2$
with running time $O(W n^{3+2/\epsilon})$.
The approximation ratio improves to $\frac{4}{3} (1+\epsilon)$
if there is an optimal quotient vertex-cut $(A^*,B^*,C^*)$ where
$C^*$ has relatively small weight compared to $A^*$ and $B^*$;
this holds, for example, when the input graph has uniform
(or close to uniform) weight and cost.
The approximation ratio further improves to $1+\epsilon$ if, in addition,
$\min( w(A^*),w(B^*) ) \le \frac{1}{3} W$.
We use our quotient cut algorithm to find the first constant-factor
pseudo-approximation for vertex separators in planar graphs.
Our technical contribution is two-fold.
First, we prove a structural theorem for vertex-cuts in planar graphs,
showing the existence of a near-optimal vertex-cut whose
high-level structure is that of a bounded-depth tree.
Second, we develop an algorithm for optimizing over such complex structures,
whose running time depends (exponentially) not on the size of the structure,
but rather only on its depth.
These techniques may be applicable in other problems.
We resolve the following conjecture raised by Levin together with
Linial, London, and Rabinovich [Combinatorica, 1995].
Let $\dim(G)$ be the smallest $d$ such that $G$ occurs as a
(not necessarily induced) subgraph of the infinite graph $Z_\infty^d$,
whose vertex set is $Z^d$
and which has an edge $(u,v)$ whenever $||u-v||_\infty = 1$.
The growth rate of $G$, denoted $\rho_G$, is the minimum $\rho$
such that every ball of radius $r>1$ in $G$ contains
at most $r^\rho$ vertices.
By simple volume arguments, $\dim(G) = \Omega(\rho_G)$.
Levin conjectured that this lower bound is tight,
i.e., that $\dim(G) = O(\rho_G)$ for every graph $G$.
We show that a weaker form of Levin's conjecture,
namely, $\dim(G) = O(\rho_G \log \rho_G)$, holds for every graph $G$.
We disprove, however, the specific bound of the conjecture
and show that our upper bound is tight
by exhibiting graphs for which $\dim(G) =\Omega(\rho_G \log \rho_G)$.
For families of graphs which exclude a fixed minor,
we salvage the strong form, showing that $\dim(G) = O(\rho_G)$.
This holds also for graphs without long induced simple cycles.
Our results extend to a variant of the conjecture for finite-dimensional
Euclidean spaces due to Linial [Haifa Workshop, 2002].
We provide the first hardness result for a polylogarithmic approximation
ratio (for a natural NP-hard optimization problem).
We show that for every fixed $\epsilon>0$,
the Group Steiner Tree problem admits no efficient $\log^{2-\epsilon} k$--approximation,
where $k$ denotes the number of groups,
unless NP has quasi-polynomial Las-Vegas algorithms.
This result holds even for input graphs which are
Hierarchically Well-Separated Trees, introduced by Bartal [FOCS, 1996],
in which case our bound is nearly tight with the $O(\log^2 k)$--approximation
currently known.
Our results imply that for every fixed $\epsilon>0$,
the Directed Steiner Tree problem admits no $\log^{2-\epsilon} n$--approximation,
where $n$ is the number of vertices in the graph,
under the same complexity assumption.
We present an $\Omega(\log^2 k)$ lower bound on the integrality
ratio of the flow-based relaxation for the Group Steiner Tree problem,
where $k$ denotes the number of groups; this holds even
for input graphs that are Hierarchically Well-Separated Trees,
introduced by Bartal [\textit{Symp.\ Foundations of Computer Science},
pp.\ 184--193, 1996], in which case this lower bound is tight.
This relaxation appears to be the only one that has been studied for the
problem, as well as for
its generalization, the Directed Steiner Tree problem. For the latter
problem, our results imply an $\Omega(\frac{\log^2 n}{(\log \log n)^2})$
integrality ratio, where $n$ is the number of vertices in the graph.
For both problems, this is the first known lower bound on the integrality ratio
that is superlogarithmic in the input size.
We also show algorithmically that the
integrality ratio for Group Steiner Tree is much better for certain
families of instances, which helps pinpoint the types of instances
that appear to be most difficult to approximate.
Data dimensionality is a crucial issue in a variety of settings,
where it is desirable to determine whether a data set
given in a high-dimensional space adheres to a low-dimensional structure.
We study this problem in the framework of property testing:
Given a query access to a data set $S$,
we wish to determine whether $S$ is low-dimensional,
or whether it should be modified significantly in order to have the property.
Allowing a constant probability of error,
we aim at algorithms whose complexity does not depend on the size of $S$.
We present algorithms for testing the low-dimensionality of a set of vectors
and for testing whether a matrix is of low rank.
We then address low-dimensionality in metric spaces.
For vectors in the metric space $l_1$,
we show that low-dimensionality is not testable.
For $l_2$, we show that a data set can be tested
for having a low-dimensional structure,
but that the property of approximately having such a structure
is not testable.
In the survivable network design problem (SNDP), the goal is to find
a minimum-cost subgraph satisfying certain connectivity requirements.
We study the vertex-connectivity variant of SNDP
in which the input specifies, for each pair of vertices,
a required number of vertex-disjoint paths connecting them.
We give the first lower bound on the approximability of SNDP,
showing that the problem admits no efficient
$2^{\log^{1-\epsilon} n}$ ratio approximation for any fixed $\epsilon>0$
unless $NP\subseteq DTIME(n^{\mathrm{polylog}(n)})$.
We also show hardness of approximation results for several
important special cases of SNDP, including constant factor hardness
for the $k$-vertex connected spanning subgraph problem ($k$-VCSS)
and for the vertex-connectivity augmentation problem,
even when the edge costs are severely restricted.
The extensive study of metric spaces and their embeddings
has so far focused on embeddings that preserve pairwise distances.
A very intriguing concept introduced by Feige [JCSS, 2000]
allows us to quantify the extent to which larger
structures are preserved by a given embedding.
We investigate this concept, focusing on several major graph families
such as paths, trees, cubes and expanders.
We find some similarities to the regular (pairwise) distortion,
as well as some striking differences.
Lovasz and Schrijver (SIAM J. Optim., 1991) devised a lift-and-project method that produces
a sequence of convex relaxations for
the problem of finding in a graph an independent set (or a clique) of maximum size.
Each relaxation in the sequence is tighter than the one before it,
while the first relaxation is already at least as strong as the
Lovasz theta function (IEEE Trans. Inform. Theory, 1979).
We show that on a random graph $G_{n,1/2}$,
the value of the $r$th relaxation in the sequence is roughly $\sqrt{n/2^r}$,
almost surely.
It follows that for those relaxations known to be efficiently computable,
namely for $r=O(1)$,
the value of the relaxation is comparable to the theta function.
Furthermore, a perfectly tight relaxation is almost surely obtained
only at the $r=\Theta(\log n)$ relaxation in the sequence.
The notion of private approximation was introduced recently by
Feigenbaum, Fong, Strauss and Wright. Informally, a private approximation
of a function $f$ is another function $F$ that approximates $f$ in
the usual sense, but does not
yield any information on $x$ other than what can be deduced from $f(x)$.
As such, $F(x)$ is useful for private computation of $f(x)$ (assuming
that $F$ can be computed more efficiently than $f$).
In this work we examine the properties and limitations of this new notion.
Specifically, we show that for many NP-hard problems, the privacy
requirement
precludes non-trivial approximation. This is the case even for problems
that otherwise admit very good approximation (e.g., problems with PTAS).
On the other hand, we show that slightly relaxing the privacy requirement,
by means of leaking ``just a few bits of information'' about $x$, again
permits good approximation.
A web content hosting service provider needs to dynamically allocate
servers in a server farm to its customers' web sites.
Ideally, the allocation to a site should always suffice to
handle its load.
However, due to a limited number of servers and the overhead
incurred in changing the allocation of a server from one site to
another, the system may become overloaded.
The problem faced by the web hosting service provider is how to allocate
the available servers in the most profitable way. Adding to the
complexity of this problem is the fact that future loads of the
sites are either unknown or known only for the very near future.
In this paper we model this server allocation problem,
and consider both its offline
and online versions. We give a polynomial time algorithm for
computing the optimal offline allocation. In the online setting, we
show almost optimal algorithms (both deterministic and randomized)
for any positive lookahead. The
quality of the solution improves as the lookahead increases. We also
consider several special cases of practical interest. Finally, we
present some experimental results using actual trace data that show
that one of our online algorithms performs very close to optimal.
Interestingly,
the online server allocation problem can be cast as a more general
benefit task system that we define.
Our results extend to this task system, which captures also the
benefit maximization variants of the $k$-server problem and the metrical task
system problem. It follows
that the benefit maximization variants of these problems are more tractable
than their cost minimization variants.
The achromatic number problem is to legally color the vertices
of an input graph with the maximum number of colors, denoted $\psi^*$,
so that every two color classes share at least one edge.
This problem is known to be NP-hard.
For general graphs we give an algorithm that approximates the achromatic number
within ratio of $O(n \log\log n/\log n)$.
This improves over the previously known approximation ratio
of $O(n/\sqrt{\log n})$,
due to Chaudhary and Vishwanathan (SODA, 1997).
For graphs of girth at least 5 we give
an algorithm with approximation ratio $O(\min\{n^{1/3},\sqrt{\psi^*}\})$.
This improves over an approximation ratio $O(\sqrt{\psi^*})=O(n^{3/8})$
for the more restricted case of graphs with girth at least 6,
due to Krysta and Loryś (ESA 1999).
We also give the first hardness result for approximating the achromatic number.
We show that for every fixed $\epsilon>0$
there is no $2-\epsilon$ approximation algorithm, unless $P=NP$.
A bisection of a graph with $n$ vertices is a partition of its vertices
into two sets, each of size $n/2$. The bisection cost is the number of edges
connecting the two sets. Finding the bisection of minimum cost is NP-hard.
We present an algorithm that finds a bisection whose cost is within
ratio of $O(\log^2 n)$ from the optimal.
For graphs excluding any fixed graph as a minor (e.g. planar graphs)
we obtain an improved approximation ratio of $O(\log n)$.
The previously known approximation ratio for bisection was roughly $\sqrt{n}$.
A bisection of a graph with $n$ vertices is a partition of its vertices
into two sets, each of size $n/2$. The bisection size is the number of edges
connecting the two sets. Finding the bisection of minimum size is
NP-hard. We present an algorithm that finds a bisection that is within
$O(\sqrt{n}\log n)$ of optimal.
No sublinear approximation ratio for bisection was previously known.
We consider the problem of finding in an undirected graph
a minimum cut that separates exactly a given number $k$ of vertices.
For general $k$ (i.e. $k$ is part of the input and may depend on $n$)
this problem is NP-hard.
We present for this problem a randomized approximation algorithm,
which is useful when $k$ is relatively small.
In particular, for $k = O(\log n)$
we obtain a PTAS (polynomial time approximation scheme),
and for $k=\Omega(\log n)$ we obtain an approximation ratio $O(k/\log n)$.
The motivation for our work is the observation that
Web pages on a particular topic are often linked to other pages on the
same topic. We model and analyze the problem of how to improve the
classification of Web pages (that is, determining the topic of the
page) by using link information. In our setting, an initial
classifier examines the text of a Web page and assigns to it some
classification, possibly mistaken. We investigate how to reduce the
error probability using the observation above, thus building an
improved classifier. We present a theoretical framework for this
problem based on a random graph model and suggest two linear time
algorithms, based on similar methods that have been proven effective in
the setting of error-correcting codes. We provide simulation results
to verify our analysis and to compare the performance of our suggested
algorithms.
Alon, Krivelevich and Sudakov (Random Structures and Algorithms, 1998)
designed an algorithm based on spectral techniques that almost surely
finds a clique of size $\Omega(\sqrt{n})$ hidden in an otherwise random graph.
We show that a different algorithm, based on the Lovasz theta function,
almost surely both finds the hidden clique and certifies its optimality.
Our algorithm has an additional advantage of being more robust: it also
works in a semi-random hidden clique model, in which an adversary can remove
edges from the random portion of the graph.
Hot-potato routing is a form of synchronous routing
which makes no use of buffers at intermediate nodes.
Packets must move at every time step, until they reach their destination.
If contention prevents a packet from taking its preferred outgoing edge,
it is deflected on a different edge.
Two simple design principles for hot potato routing algorithms
are minimum advance, that advances at least one packet towards
its destination from every nonempty node (and possibly deflects all
other packets), and maximum advance, that advances the maximum possible
number of packets.
Livelock is a situation in which packets keep moving indefinitely
in the network without any packet ever reaching its destination.
It is known that even maximum advance algorithms might
livelock on some networks.
We show that minimum advance algorithms never livelock on tree
networks, and that maximum advance algorithms never livelock on
triangulated networks.
A stereoscopic family of permutations maps
an $m$-dimensional mesh into several one-dimensional lines,
in a way that jointly preserves distance information.
Specifically, consider any two points and denote their
distance on the $m$-dimensional mesh by $d$. Then the
distance between their images, on the line on which these
images are closest together, is $O(d^m)$.
We initiate a systematic study of stereoscopic families of
permutations. We show a construction of these families that
involves the use of $m+1$ images.
We also show that under some additional restrictions (namely,
adjacent points on the image lines originate at points which
are not too far away on the mesh), three images are necessary
in order to construct such a family for the two-dimensional mesh.
We present two applications for stereoscopic families of permutations.
One application is an algorithm for routing on the mesh that
guarantees delivery of each packet within a number of steps
that depends upon the distance between this packet's source and
destination, but is independent of the size of the mesh.
Our algorithm is exceptionally simple, involves no queues, and
can be used in dynamic settings in which packets are continuously
generated.
Another application is an extension of the construction of
non-expansive hash functions of Linial and Sasson (STOC 1996)
from the case of one dimensional metrics to arbitrary dimensions.
Recovery Guarantees for Distributed-OMP
Chen Amiraz,
Robert Krauthgamer, and
Boaz Nadler.
Exact Flow Sparsification Requires Unbounded Size
Robert Krauthgamer
and
Ron Mosenzon.
Optimal Vertex-Cut Sparsification of Quasi-Bipartite Graphs
Itai Boneh
and
Robert Krauthgamer.
The Power of Uniform Sampling for Coresets
Vladimir Braverman,
Vincent Cohen-Addad,
Shaofeng H.-C. Jiang,
Robert Krauthgamer,
Chris Schwiegelshohn,
Mads Bech Toftrup,
and
Xuan Wu.
Streaming Facility Location in High Dimension via Geometric Hashing
Artur Czumaj,
Arnold Filtser,
Shaofeng H.-C. Jiang,
Robert Krauthgamer,
Pavel Vesely,
and
Mingwei Yang.
Comparison of Matrix Norm Sparsification
Robert Krauthgamer and
Shay Sapir.
Gomory-Hu Tree in Subcubic Time
Amir Abboud,
Robert Krauthgamer,
Jason Li,
Debmalya Panigrahi,
Thatchaphol Saranurak,
and Ohad Trabelsi.
Gap Edit Distance via Non-Adaptive Queries: Simple and Optimal
Elazar Goldenberg,
Tomasz Kociumaka,
Robert Krauthgamer, and
Barna Saha.
Friendly Cut Sparsifiers and Faster Gomory-Hu Trees
Amir Abboud,
Robert Krauthgamer,
and Ohad Trabelsi.
Coresets for Kernel Clustering
Shaofeng H.-C. Jiang,
Robert Krauthgamer,
Jianing Lou,
and
Yubo Zhang.
Approximate Trace Reconstruction via Median String (in Average-Case)
Diptarka Chakraborty,
Debarati Das,
and
Robert Krauthgamer.
Coresets for Clustering with Missing Values
Vladimir Braverman,
Shaofeng H.-C. Jiang,
Robert Krauthgamer,
and
Xuan Wu.
APMF < APSP? Gomory-Hu Tree for Unweighted Graphs in Almost-Quadratic Time
Amir Abboud,
Robert Krauthgamer,
and Ohad Trabelsi.
Spectral Hypergraph Sparsifiers of Nearly Linear Size
Michael Kapralov,
Robert Krauthgamer,
Jakab Tardos,
and
Yuichi Yoshida.
Smoothness of Schatten Norms and Sliding-Window Matrix Streams
Robert Krauthgamer and
Shay Sapir.
Sparse Normal Means Estimation with Sublinear Communication
Chen Amiraz,
Robert Krauthgamer, and
Boaz Nadler.
Subcubic Algorithms for Gomory-Hu Tree in Unweighted Graphs
Amir Abboud,
Robert Krauthgamer,
and Ohad Trabelsi.
Towards Tight Bounds for Spectral Sparsification of Hypergraphs
Michael Kapralov,
Robert Krauthgamer,
Jakab Tardos,
and
Yuichi Yoshida.
Streaming Algorithms for Geometric Steiner Forest
Artur Czumaj,
Shaofeng H.-C. Jiang,
Robert Krauthgamer,
and
Pavel Vesely.
Near-Optimal Entrywise Sampling of Numerically Sparse Matrices
Vladimir Braverman,
Robert Krauthgamer,
Aditya Krishnan, and
Shay Sapir.
Approximating the Median under the Ulam Metric
Diptarka Chakraborty,
Debarati Das,
and
Robert Krauthgamer.
Cut-Equivalent Trees are Optimal for Min-Cut Queries
Amir Abboud,
Robert Krauthgamer,
and Ohad Trabelsi.
Coresets for Clustering in Excluded-minor Graphs and Beyond
Vladimir Braverman,
Shaofeng H.-C. Jiang,
Robert Krauthgamer,
and
Xuan Wu.
Faster Algorithms for Orienteering and k-TSP
Lee-Ad Gottlieb,
Robert Krauthgamer, and
Havana Rika.
Sublinear Algorithms for Gap Edit Distance
Elazar Goldenberg,
Robert Krauthgamer, and
Barna Saha.
Labelings vs. Embeddings: On Distributed Representations of Distances
Arnold Filtser,
Lee-Ad Gottlieb, and
Robert Krauthgamer.
Schatten Norms in Matrix Streams: Hello Sparsity, Goodbye Dimension
Vladimir Braverman,
Robert Krauthgamer,
Aditya Krishnan, and
Roi Sinoff.
Coresets for Clustering in Graphs of Bounded Treewidth
Daniel Baker,
Vladimir Braverman,
Lingxiao Huang,
Shaofeng H.-C. Jiang,
Robert Krauthgamer,
and
Xuan Wu.
Tight Recovery Guarantees for Orthogonal Matching Pursuit Under Gaussian Noise
Chen Amiraz,
Robert Krauthgamer, and
Boaz Nadler.
Almost-Smooth Histograms and Sliding-Window Graph Algorithms
Robert Krauthgamer
and
David Reitblat.
Coresets for Ordered Weighted Clustering
Vladimir Braverman,
Shaofeng H.-C. Jiang,
Robert Krauthgamer,
and
Xuan Wu.
New Algorithms and Lower Bounds for All-Pairs Max-Flow in Undirected Graphs
Amir Abboud,
Robert Krauthgamer,
and Ohad Trabelsi.
Universal Streaming of Subset Norms
Vladimir Braverman,
Robert Krauthgamer,
and
Lin F. Yang.
On Solving Linear Systems in Sublinear Time
Alexandr Andoni,
Robert Krauthgamer, and
Yosef Pogrow.
Noisy Voronoi: a Simple Framework for Terminal-Clustering Problems
Arnold Filtser,
Robert Krauthgamer,
and Ohad Trabelsi.
Flow-Cut Gaps and Face Covers in Planar Graphs
Robert Krauthgamer,
James R. Lee, and
Inbal Rika.
Batch Sparse Recovery, or How to Leverage the Average Sparsity
Alexandr Andoni,
Lior Kamma,
Robert Krauthgamer, and
Eric Price.
Faster Algorithms for All-Pairs Bounded Min-Cuts
Amir Abboud,
Loukas Georgiadis,
Daniel Graf,
Giuseppe F. Italiano,
Robert Krauthgamer,
Nikos Parotsidis,
Ohad Trabelsi, and
Przemysław Uznański.
Stochastic Selection Problems with Testing
Chen Attias,
Robert Krauthgamer,
Retsef Levi and
Yaron Shaposhnik.
The Set Cover Conjecture and Subgraph Isomorphism with a Tree Pattern
Robert Krauthgamer
and Ohad Trabelsi.
Refined Vertex Sparsifiers of Planar Graphs
Robert Krauthgamer and
Havana (Inbal) Rika
Conditional Lower Bounds for All-Pairs Max-Flow
Robert Krauthgamer
and Ohad Trabelsi.
Matrix Norms in Data Streams: Faster, Multi-Pass and Row-Order
Vladimir Braverman,
Stephen R. Chestnut,
Robert Krauthgamer,
Yi Li,
David P. Woodruff,
and
Lin F. Yang.
Color-Distance Oracles and Snippets
Tsvi Kopelowitz,
and Robert Krauthgamer
Streaming Symmetric Norms via Measure Concentration
Jaroslaw Blasiok,
Vladimir Braverman,
Stephen R. Chestnut,
Robert Krauthgamer, and
Lin F. Yang
Tight Bounds for Gomory-Hu-like Cut Counting
Rajesh Chitnis,
Lior Kamma,
and Robert Krauthgamer
A natural application of these bounds is to construct small data structures that store all relevant cut values, à la the Gomory-Hu tree.
We initiate this direction by giving some upper and lower bounds.
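For context, the standard fact motivating such data structures: a Gomory-Hu tree stores only n-1 cut values, yet the u-v min-cut in the original graph equals the smallest edge weight on the unique u-v tree path. A minimal query sketch (the tree, in an adjacency-dict format of our choosing, is assumed precomputed):

```python
def min_cut_query(tree, u, v):
    # tree: {node: {neighbor: cut_value}}, a precomputed Gomory-Hu tree.
    # Returns the u-v min-cut value: the minimum weight on the u-v path.
    def dfs(x, seen, cur):
        if x == v:
            return cur
        seen.add(x)
        for y, w in tree[x].items():
            if y not in seen:
                ans = dfs(y, seen, min(cur, w))
                if ans is not None:
                    return ans
        return None
    return dfs(u, set(), float('inf'))
```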
Sparsification of Two-Variable Valued CSPs
Arnold Filtser
and Robert Krauthgamer
Local Reconstruction of Low-Rank Matrices and Subspaces
Roee David,
Elazar Goldenberg, and
Robert Krauthgamer
Towards Resistance Sparsifiers
Michael Dinitz,
Robert Krauthgamer,
and Tal Wagner
Approximate Nearest Neighbor Search in Metrics of Planar Graphs
Ittai Abraham,
Shiri Chechik,
Robert Krauthgamer,
and Udi Wieder
Metric Decompositions of Path-Separable Graphs
Lior Kamma,
and Robert Krauthgamer
Sketching and Embedding are Equivalent for Norms
Alexandr Andoni,
Robert Krauthgamer
and Ilya Razenshteyn
Sketching Cuts in Graphs and Hypergraphs
Dmitry Kogan,
and Robert Krauthgamer
Cheeger-Type Approximation for Sparsest st-Cut
Robert Krauthgamer
and Tal Wagner
Spectral Approaches to Nearest Neighbor Search
Amirali Abdullah,
Alexandr Andoni,
Ravindran Kannan,
and Robert Krauthgamer
On Sketching Quadratic Forms
Alexandr Andoni,
Jiecao Chen,
Robert Krauthgamer,
Bo Qin,
David P. Woodruff,
and Qin Zhang.
Orienting Fully Dynamic Graphs with Worst-Case Time Bounds
Tsvi Kopelowitz,
Robert Krauthgamer,
Ely Porat,
and Shay Solomon
Multiply Balanced k-Partitioning
Amihood Amir,
Jessica Ficler,
Robert Krauthgamer,
Liam Roditty,
and Oren Sar Shalom
Towards (1+ε)-Approximate Flow Sparsifiers
Alexandr Andoni,
Anupam Gupta,
and
Robert Krauthgamer
Non-Uniform Graph Partitioning
Robert Krauthgamer,
Joseph (Seffi) Naor,
Roy Schwartz,
and
Kunal Talwar.
Do Semidefinite Relaxations Really Solve Sparse PCA?
Robert Krauthgamer,
Boaz Nadler, and
Dan Vilenchik
Cutting corners cheaply, or how to remove Steiner points
Lior Kamma,
Robert Krauthgamer, and
Huy L. Nguyen
Adaptive Metric Dimensionality Reduction
Lee-Ad Gottlieb,
Aryeh Kontorovich
and Robert Krauthgamer.
Mimicking Networks and Succinct Representations of Terminal Cuts
Robert Krauthgamer and
Inbal Rika.
Faster Clustering via Preprocessing
Tsvi Kopelowitz and
Robert Krauthgamer.
Everywhere-Sparse Spanners via Dense Subgraphs
Eden Chlamtac,
Michael Dinitz,
and Robert Krauthgamer.
Preserving Terminal Distances using Minors
Robert Krauthgamer and
Tamar Zondiner.
The Traveling Salesman Problem: Low-Dimensionality Implies a Polynomial Time Approximation Scheme
Yair Bartal,
Lee-Ad Gottlieb,
and Robert Krauthgamer.
Efficient Regression in Metric Spaces via Approximate Lipschitz Extension
Lee-Ad Gottlieb,
Aryeh Kontorovich
and Robert Krauthgamer.
Min-Max Graph Partitioning and Small-Set Expansion
Nikhil Bansal,
Uriel Feige
Robert Krauthgamer,
Konstantin Makarychev,
Viswanath Nagarajan,
Joseph (Seffi) Naor,
and Roy Schwartz.
Streaming Algorithms via Precision Sampling
Alexandr Andoni,
Robert Krauthgamer, and
Krzysztof Onak
For all these applications the algorithm
is essentially the same:
pre-multiply the vector $x$ entry-wise by a well-chosen random vector, and run a
heavy hitter estimation algorithm on the resulting vector.
Our sketch is a linear function of $x$, thereby allowing general
updates to the vector $x$.
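A toy, offline rendition of this pipeline for the special case of l_p sampling (our simplification; the precision-sampling framework is more general, and a streaming implementation would replace the exact argmax below with a sketch-based heavy-hitter algorithm):

```python
import random

def sample_heavy_coordinate(x, p=1, trials=1000):
    # Multiply x entrywise by a random vector and report the heavy hitter:
    # with y_i = x_i / E_i^(1/p) for i.i.d. exponential E_i, the index of
    # the largest |y_i| is distributed as |x_i|^p / ||x||_p^p.
    freq = [0] * len(x)
    for _ in range(trials):
        y = [xi / random.expovariate(1.0) ** (1.0 / p) for xi in x]
        freq[max(range(len(x)), key=lambda i: abs(y[i]))] += 1
    return [f / trials for f in freq]   # ~ |x_i|^p / sum_j |x_j|^p
```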
Fault-Tolerant Spanners: Better and Simpler
Michael Dinitz,
and Robert Krauthgamer
Directed Spanners via Flow-Based Linear Programs
Michael Dinitz,
and Robert Krauthgamer
Approximating sparsest cut in graphs of bounded treewidth
Eden Chlamtac,
Robert Krauthgamer,
and Prasad Raghavendra
Vertex Sparsifiers: New Results from Old Techniques
Matthias Englert,
Anupam Gupta,
Robert Krauthgamer,
Harald Raecke,
Inbal Talgam-Cohen
and
Kunal Talwar.
Proximity algorithms for nearly-doubling spaces
Lee-Ad Gottlieb,
and Robert Krauthgamer.
Efficient classification for metric data
Lee-Ad Gottlieb,
Aryeh (Leonid) Kontorovich
and Robert Krauthgamer.
Polylogarithmic approximation for edit distance and the asymmetric query complexity
Alexandr Andoni,
Robert Krauthgamer, and
Krzysztof Onak
A nonlinear approach to dimension reduction
Lee-Ad Gottlieb
and Robert Krauthgamer
How hard is it to approximate the best Nash equilibrium?
Elad Hazan
and Robert Krauthgamer
Partitioning graphs into balanced components
Robert Krauthgamer,
Joseph (Seffi) Naor,
and Roy Schwartz
Approximate Line Nearest Neighbor in High Dimensions
Alexandr Andoni,
Piotr Indyk,
Robert Krauthgamer, and
Huy L. Nguyen
Overcoming the l_1 non-embeddability barrier: algorithms for product metrics
Alexandr Andoni,
Piotr Indyk and
Robert Krauthgamer
The Smoothed Complexity of Edit Distance
Alexandr Andoni and
Robert Krauthgamer
Greedy List Intersection
Robert Krauthgamer,
Aranyak Mehta,
Vijayshankar Raman, and
Atri Rudra
Pricing commodities, or how to sell when buyers have restricted valuations
Robert Krauthgamer,
Aranyak Mehta, and
Atri Rudra
Metric Clustering via Consistent Labeling
Robert Krauthgamer and
Tim Roughgarden
The Computational Hardness of Estimating Edit Distance
Alexandr Andoni and
Robert Krauthgamer
Earth Mover Distance over High-Dimensional Spaces
Alexandr Andoni,
Piotr Indyk and
Robert Krauthgamer
On triangulation of simple networks
Robert Krauthgamer
Estimating the sortedness of a data stream
Parikshit Gopalan,
T. S. Jayram,
Robert Krauthgamer and
Ravi Kumar
Algorithms on negatively curved spaces
Robert Krauthgamer and
James R. Lee
Embedding the Ulam metric into l_1
Moses Charikar
and Robert Krauthgamer
Improved lower bounds for embeddings into L_1
Robert Krauthgamer and
Yuval Rabani
On the hardness of approximating multicut and sparsest-cut
Shuchi Chawla,
Robert Krauthgamer,
Ravi Kumar,
Yuval Rabani,
and D. Sivakumar
The sketching complexity of pattern matching
Ziv Bar-Yossef,
T. S. Jayram,
Robert Krauthgamer
and Ravi Kumar
Approximating edit distance efficiently
Ziv Bar-Yossef,
T. S. Jayram,
Robert Krauthgamer
and Ravi Kumar
Measured descent: A new embedding method for finite metrics
Robert Krauthgamer,
James R. Lee,
Manor Mendel
and Assaf Naor
The black-box complexity of nearest neighbor search
Robert Krauthgamer
and James R. Lee
Focused Sampling: Computing Topical Web Statistics
Ziv Bar-Yossef,
Robert Krauthgamer,
and Tapas Kanungo
Navigating nets: Simple algorithms for proximity search
Robert Krauthgamer
and James R. Lee
Approximate classification via earthmover metrics
Aaron Archer,
Jittat Fakcharoenphol,
Chris Harrelson,
Robert Krauthgamer,
Kunal Talwar,
and
Eva Tardos
Asymmetric k-center is log^* n-hard to approximate
Julia Chuzhoy,
Sudipto Guha,
Eran Halperin,
Sanjeev Khanna,
Guy Kortsarz,
Robert Krauthgamer,
and Seffi Naor
Bounded geometries, fractals, and low-distortion embeddings
Anupam Gupta,
Robert Krauthgamer
and James R. Lee
Detecting protein sequence conservation via metric embeddings
Eran Halperin,
Jeremy Buhler,
Richard Karp,
Robert Krauthgamer
and Ben Westover
Constant factor approximation of vertex-cuts in planar graphs
Eyal Amir,
Robert Krauthgamer
and Satish Rao
The intrinsic dimensionality of graphs
Robert Krauthgamer
and James R. Lee
Polylogarithmic inapproximability
Eran Halperin,
and Robert Krauthgamer
Integrality ratio for Group Steiner Trees and Directed Steiner Trees
Eran Halperin,
Guy Kortsarz,
Robert Krauthgamer,
Aravind Srinivasan
and Nan Wang
Property testing of data dimensionality
Robert Krauthgamer and
Ori Sasson
Hardness of approximation for vertex-connectivity network-design problems
Guy Kortsarz,
Robert Krauthgamer
and James R. Lee
Metric embeddings - beyond one-dimensional distortion
Robert Krauthgamer,
Nati Linial
and Avner Magen
The probable value of the Lovasz-Schrijver relaxations for maximum independent set
Uriel Feige
and Robert Krauthgamer
Private approximation of NP-hard functions
Shai Halevi,
Robert Krauthgamer,
Eyal Kushilevitz and
Kobbi Nissim
Online server allocation in a server farm via benefit task systems
T.S. Jayram,
Tracy Kimbrel,
Robert Krauthgamer,
Baruch Schieber and
Maxim Sviridenko
On approximating the achromatic number
Guy Kortsarz
and
Robert Krauthgamer
A polylogarithmic approximation of the minimum bisection
Uriel Feige
and Robert Krauthgamer
Approximating the minimum bisection size
Uriel Feige,
Robert Krauthgamer
and Kobbi Nissim
On cutting a few vertices from a graph
Uriel Feige,
Robert Krauthgamer
and Kobbi Nissim
Improved classification via connectivity information
Andrei Broder,
Robert Krauthgamer,
and Michael Mitzenmacher
Finding and certifying a large hidden clique in a semi-random graph
Uriel Feige
and Robert Krauthgamer
Networks on which hot-potato routing does not livelock
Uriel Feige
and Robert Krauthgamer
Stereoscopic families of permutations, and their applications
Uriel Feige
and Robert Krauthgamer