The Traveling Salesman Problem (TSP) is among the most famous NP-hard optimization problems. We design for this problem a randomized polynomial-time algorithm that computes a (1+epsilon)-approximation to the optimal tour, for any fixed epsilon>0, in TSP instances that form an arbitrary metric space with bounded intrinsic dimension.
The celebrated results of Arora [JACM 1998] and Mitchell [SICOMP 1999] prove that the above result holds in the special case of TSP in a fixed-dimensional Euclidean space. Thus, our algorithm demonstrates that the algorithmic tractability of metric TSP depends on the dimensionality of the space and not on its specific geometry. This result resolves a problem that has been open since the quasi-polynomial time algorithm of Talwar [STOC 2004].
We present a framework for performing efficient regression in general metric spaces. Roughly speaking, our regressor predicts the value at a new point by computing a Lipschitz extension -- the smoothest function consistent with the observed data -- while performing an optimized structural risk minimization to avoid overfitting. The offline (learning) and online (inference) stages can be solved by convex programming, but this naive approach has runtime complexity O(n^3), which is prohibitive for large datasets. We design instead an algorithm that is fast when the the doubling dimension, which measures the "intrinsic" dimensionality of the metric space, is low.
We use the doubling dimension multiple times; first, on the statistical front, to bound fat-shattering dimension of the class of Lipschitz functions (and obtain risk bounds); and second, on the computational front, to quickly compute a hypothesis function and a prediction based on Lipschitz extension. Our resulting regressor is both asymptotically strongly consistent and comes with finite-sample risk bounds, while making minimal structural and noise assumptions.
We study graph partitioning problems from a min-max perspective, in which an input graph on $n$ vertices should be partitioned into $k$ parts, and the objective is to minimize the maximum number of edges leaving a single part. The two main versions we consider are where the $k$ parts need to be of equal-size, and where they must separate a set of $k$ given terminals. We consider a common generalization of these two problems, and design for it an $O(\sqrt{\log n\log k})$-approximation algorithm. This improves over an $O(\log^2 n)$ approximation for the second version due to Svitkina and Tardos (2004), and roughly $O(k\log n)$ approximation for the first version that follows from other previous work. We also give an improved $O(1)$-approximation algorithm for graphs that exclude any fixed minor.
Along the way, we study the Unbalanced-Cut problem, whose goal is to find, in an input graph $G=(V,E)$, a set $S\subseteq V$ of size $|S|=\rho n$ that minimizes the number of edges leaving $S$. We provide a bicriteria approximation of $O(\sqrt{\log{n}\log{(1/\rho)}})$; when the input graph excludes a fixed-minor we improve this guarantee to $O(1)$. Note that the special case $\rho = 1/2$ is the well-known Minimum-Bisection problem, and indeed our bounds generalize those of Arora, Rao and Vazirani (2008) and of Klein, Plotkin, and Rao (1993). Our algorithms work also for the closely related Small-Set Expansion (SSE) problem, which asks for a set $S\subseteq V$ of size $0<|S| \leq \rho n$ with minimum edge-expansion, and was suggested recently by Raghavendra and Steurer (2010). In fact, our algorithm handles more general, weighted, versions of both problems. Previously, an $O(\log n)$ true approximation for both \UC and \SSE follows from Raecke (2008).
In STOC 2005, Indyk and Woodruff introduced a new technique that yielded a near-optimal space algorithm for $F_k$ moment estimation problem. Since then, the technique has inspired a number of advances in streaming algorithmics. We show that at least several of these results follow easily from the application of a single probabilistic technique, called Precision Sampling. Using it, we obtain simple streaming algorithms that maintain a randomized sketch of an input vector $x=(x_1,\ldots x_n)$, useful for the following applications:
Precision Sampling itself addresses the problem of estimating a sum $\sum_{i=1}^n a_i$ from weak estimates of each real $a_i\in[0,1]$. More precisely, one chooses in advance ``precisions'' $w_i\ge 1$ for each $i\in[n]$ and obtains an estimate of $a_i$ within additive $1/w_i$. The core question is: what is the best trade-off between the approximation to $\sum a_i$ and the total precision, $\sum_i w_i$ ? In previous work, we showed [Andoni, Krauthgamer, and Onak, FOCS 2010] that, as long as $\sum a_i=\Omega(1)$, one can achieve good multiplicative approximation using total precision of only $O(n\log n)$.
A natural requirement of many distributed structures is fault-tolerance: after some failures, whatever remains from the structure should still be effective for whatever remains from the network. In this paper we examine spanners of general graphs that are tolerant to vertex failures, and significantly improve their dependence on the number of faults $r$, for all stretch bounds.
For stretch $k \geq 3$ we design a simple transformation that converts every k-spanner construction with at most $f(n)$ edges into an $r$-fault-tolerant $k$-spanner construction with at most $O(r^3 \log n) \cdot f(2n/r)$ edges. Applying this to standard greedy spanner constructions gives r-fault tolerant k-spanners with $\tilde O(r^{2} n^{1+\frac{2}{k+1}})$ edges. The previous construction by Chechik, Langberg, Peleg, and Roddity [STOC 2009] depends similarly on n but exponentially on r (approximately like $k^r$).
For the case $k=2$ and unit-length edges, an $O(r \log n)$-approximation algorithm is known from recent work of Dinitz and Krauthgamer [arXiv 2010], where several spanner results are obtained using a common approach of rounding a natural flow-based linear programming relaxation. Here we use a different (stronger) LP relaxation and improve the approximation ratio to $O(\log n)$, which is, notably, independent of the number of faults r. We further strengthen this bound in terms of the maximum degree by using the Lovasz Local Lemma.
Finally, we show that most of our constructions are inherently local by designing equivalent distributed algorithms in the LOCAL model of distributed computation.
We examine directed spanners through flow-based linear programming relaxations. We design an $\tilde{O}(n^{2/3})$-approximation algorithm for the directed $k$-spanner problem that works for all $k\geq 1$, which is the first sublinear approximation for arbitrary edge-lengths. Even in the more restricted setting of unit edge-lengths, our algorithm improves over the previous $\tilde{O}(n^{1-1/k})$ approximation of Bhattacharyya, Grigorescu, Jung, Raskhodnikova, and Woodruff [SODA 2009] when $k\geq 4$. For the special case of $k=3$ we design a different algorithm achieving an $\tilde{O}(\sqrt{n})$-approximation, improving the previous $\tilde{O}(n^{2/3})$ of Elkin and Peleg [TCS 2005] and Bhattacharyya et al. [SODA 2009]. Both of our algorithms easily extend to the \emph{fault-tolerant} setting, which has recently attracted attention but not from an approximation viewpoint. We also prove a nearly matching integrality gap of $\tilde{\Omega}(n^{1/3 - \epsilon})$ for any constant $\epsilon > 0$.
A virtue of all our algorithms is that they are relatively simple. Technically, we introduce a new yet natural flow-based relaxation, and show how to approximately solve it even when its size is not polynomial. The main challenge is to design a rounding scheme that ``coordinates'' the choices of flow-paths between the many demand pairs while using few edges overall. We achieve this, roughly speaking, by randomization at the level of vertices.
We give the first constant-factor approximation algorithm for Sparsest-Cut with general demands in bounded tr eewidth graphs. In contrast to previous algorithms, which rely on the flow-cut gap and/or metric embeddings, our approach exploits the Sherali-Adams hierarchy of linear programming relaxations.
Given a capacitated graph G = (V,E) and a set of terminals $K \subseteq V$, how should we produce a graph H only on the terminals K so that every (multicommodity) flow between the terminals in G could be supported in H with low congestion, and vice versa? (Such a graph H is called a flow-sparsifier for G.) What if we want H to be a ``simple'' graph? What if we allow H to be a convex combination of simple graphs?
Improving on results of Moitra [FOCS 2009] and Leighton and Moitra [STOC 2010], we give efficient algorithms for constructing: (a) a flow-sparsifier $H$ that maintains congestion up to a factor of $\alpha_{FHRT}$, where k = |K|. (b) a convex combination of trees over the terminals K that maintains congestion up to a factor of O(log k). (c) for a planar graph G, a convex combination of planar graphs that maintains congestion up to a constant factor. This requires us to give a new algorithm for the 0-extension problem, the first one in which the preimages of each terminal are connected in G. Moreover, this result extends to minor-closed families of graphs.
Our bounds immediately imply improved approximation guarantees for several terminal-based cut and ordering problems.
We introduce a new problem in the study of doubling spaces: Given a point set S and a target dimension d*, remove from S the fewest number of points so that the remaining set has doubling dimension at most d^*. We present a bicriteria approximation for this problem, and extend this algorithm to solve a group of proximity problems.
Recent advances in large-margin classification of data residing in general metric spaces (rather than Hilbert spaces) enable classification under various natural metrics, such as edit and earthmover distance. The general framework developed for this purpose by von Luxburg and Bousquet [JMLR, 2004] left open the question of computational efficiency and providing direct bounds on classification error.
We design a new algorithm for classification in general metric spaces, whose runtime and accuracy depend on the doubling dimension of the data points. It thus achieves superior classification performance in many common scenarios. The algorithmic core of our approach is an approximate (rather than exact) solution to the classical problems of Lipschitz extension and of Nearest Neighbor Search. The algorithm's generalization performance is established via the fat-shattering dimension of Lipschitz classifiers.
We present a near-linear time algorithm that approximates the edit distance between two strings within a polylogarithmic factor; specifically, for strings of length n and every fixed $\eps>0$, it can compute a $(\log n)^{O(1/\eps)}$-approximation in $n^{1+\eps}$ time. This is an {\em exponential} improvement over the previously known factor, $2^{\tilde O(\sqrt{\log n})}$, with a comparable running time [OR05,AO09].
Previously, no efficient polylogarithmic approximation algorithm was known for any computational task involving edit distance (e.g., nearest neighbor search or sketching).
This result arises naturally in the study of a new asymmetric query model. In this model, the input consists of two strings x and y, and an algorithm can access y in an unrestricted manner, while being charged for querying every symbol of x. Indeed, we obtain our main result by designing an algorithm that makes a small number of queries in this model. We then provide also a nearly-matching lower bound on the number of queries. Our lower bound is the first to expose hardness of edit distance stemming from the input strings being ``repetitive'', which means that many of their substrings are approximately identical. Consequently, our lower bound provides the first rigorous separation between edit distance and Ulam distance, which is edit distance on non-repetitive strings, i.e., permutations.
The
The snowflake metric $d^{1/2}$ of a doubling set
$S\subset
In fact, the target
dimension is polylogarithmic in the doubling constant. Our techniques are
robust and extend to the more difficult spaces
The quest for a Polynomial Time Approximation Scheme (PTAS) for Nash equilibrium in a two-player game, which emerged as a major open question in Algorithmic Game Theory, seeks to circumvent the PPAD-completeness of finding an (exact) Nash equilibrium by finding an approximate equilibrium. The closely related problem of finding an equilibrium maximizing a certain objective, such as the social welfare, was shown to be NP-hard [Gilboa and Zemel, Games and Economic Behavior, 1989]. However, this NP-hardness is unlikely to extend to approximate equilibria, since the latter admits a quasi-polynomial time algorithm [Lipton, Markakis and Mehta, In Proc. of EC, 2003].
We show that this optimization problem, namely, finding in a two-player game an approximate equilibrium achieving a large social welfare, is unlikely to have a polynomial time algorithm. One interpretation of our results is that a PTAS for Nash equilibrium (if it exists) should not extend to a PTAS for finding the best Nash equilibrium.
Technically, our result is a reduction from the notoriously difficult problem in modern Combinatorics, of finding a planted (but hidden) clique in a random graph G(n,1/2). Our reduction starts from an instance with planted clique size O(log n) For comparison, the currently known algorithms are effective only for a much larger clique size Omega(sqrt n).
We consider the k-balanced partitioning problem, where the goal is to partition the vertices of an input graph G into k equally sized components, while minimizing the total weight of the edges connecting different components. We allow k to be part of the input and denote the cardinality of the vertex set by n. This problem is a natural and important generalization of well-known graph partitioning problems, including minimum bisection and minimum balanced cut.
We present a (bi-criteria) approximation algorithm achieving
an approximation of O(sqrt{log n / log k}), which
matches or improves over previous algorithms for all relevant values of
k. Our algorithm uses a semidefinite relaxation which combines
We consider the problem of approximate nearest neighbors in high dimensions, when the queries are lines. In this problem, given n points in R^d, we want to construct a data structure to support efficiently the following queries: given a line L, report the point p closest to L. This problem generalizes the more familiar nearest neighbor problem. From a practical perspective, lines, and low-dimensional flats in general, may model data under linear variation, such as physical objects under different lighting.
For approximation 1+epsilon, we achieve a query time of O(n^{0.5+t}), for arbitrary small t>0, with a space of n^{O(1/epsilon^2+1/t^2)}. To the best of our knowledge, this is the first algorithm for this problem with polynomial space and sub-linear query time.
A common approach for solving computational problems over a difficult metric space is to embed the ``hard'' metric into L_1, which admits efficient algorithms and is thus considered an ``easy'' metric. This approach has proved successful or partially successful for important spaces such as the edit distance, but it also has inherent limitations: it is provably impossible to go below certain approximation for some metrics.
We propose a new approach, of embedding the difficult space into richer host spaces, namely iterated products of standard spaces like L_1 and L_\infty. We show that this class is rich since it contains useful metric spaces with only a constant distortion, and, at the same time, it is tractable and admits efficient algorithms. Using this approach, we obtain for example the first nearest neighbor data structure with O(log log d) approximation for edit distance in non-repetitive strings (the Ulam metric). This approximation is exponentially better than the lower bound for embedding into L_1. Furthermore, we give constant factor approximation for two other computational problems. Along the way, we answer positively a question posed in [Ajtai, Jayram, Kumar, and Sivakumar, STOC 2002]. One of our algorithms has already found applications for smoothed edit distance over 0-1 strings [Andoni and Krauthgamer, ICALP 2008].
We initiate the study of the smoothed complexity of sequence alignment, by proposing a semi-random model of edit distance between two input strings, generated as follows. First, an adversary chooses two binary strings of length d and a longest common subsequence A of them. Then, every character is perturbed independently with probability p, except that A is perturbed in exactly the same way inside the two strings.
We design two efficient algorithms that compute the edit distance on smoothed instances up to a constant factor approximation. The first algorithm runs in near-linear time, namely d^{1+epsilon} for any fixed epsilon>0. The second one runs in time sublinear in d, assuming the edit distance is not too small. These approximation and runtime guarantees are significantly better then the bounds known for worst-case inputs, e.g. near-linear time algorithm achieving approximation roughly d^{1/3}, due to Batu, Ergun, and Sahinalp [SODA 2006].
Our technical contribution is twofold. First, we rely on finding matches between substrings in the two strings, where two substrings are considered a match if their edit distance is relatively small, a prevailing technique in commonly used heuristics, such as PatternHunter of Ma, Tromp and Li [Bioinformatics, 2002]. Second, we effectively reduce the smoothed edit distance to a simpler variant of (worst-case) edit distance, namely, edit distance on permutations (a.k.a. Ulam's metric). We are thus able to build on algorithms developed for the Ulam metric, whose much better algorithmic guarantees usually do not carry over to general edit distance.
A common technique for processing conjunctive queries is to first match each predicate separately using an index lookup, and then compute the intersection of the resulting row-id lists, via an AND-tree. The performance of this technique depends crucially on the order of lists in this tree: it is important to compute early the intersections that will produce small results. But this optimization is hard to do when the data or predicates have correlation.
We present a new algorithm for ordering the lists in an AND-tree by sampling the intermediate intersection sizes. We prove that our algorithm is near-optimal and validate its effectiveness experimentally on datasets with a variety of distributions.
How should a seller price his goods in a market where each buyer prefers a single good among his desired goods, and will buy the cheapest such good, as long as it is within his budget? We provide efficient algorithms that compute near-optimal prices for this problem, focusing on a commodity market, where the range of buyer budgets is small. We also show that our technique (which is based on LP-rounding) easily extends to a different scenario, in which the buyers want to buy all the desired goods, as long as they are within budget.
We design approximation algorithms for a number of fundamental optimization problems in metric spaces, namely computing separating and padded decompositions, sparse covers, and metric triangulations. Our work is the first to emphasize relative guarantees that compare the produced solution to the optimal one for the input at hand. By contrast, the extensive previous work on these topics has sought absolute bounds that hold for every possible metric space (or for a family of metrics). While absolute bounds typically translate to relative ones, our algorithms provide significantly better relative guarantees, using a rather different algorithm.
Our technical approach is to cast a number of metric clustering problems that have been well studied---but almost always as disparate problems---into a common modeling and algorithmic framework, which we call the consistent labeling problem. Having identified the common features of all of these problems, we provide a family of linear programming relaxations and simple randomized rounding procedures that achieve provably good approximation guarantees.
We prove the first non-trivial communication complexity lower bound for the problem of estimating the edit distance (aka Levenshtein distance) between two strings. A major feature of our result is that it provides the first setting in which the complexity of computing the edit distance is provably larger than that of Hamming distance.
Our lower bound exhibits a trade-off between approximation and communication, asserting, for example, that protocols with O(1) bits of communication can only obtain approximation Omega(log d/loglog d), where d is the length of the input strings. This case of O(1) communication is of particular importance, since it captures constant-size sketches as well as embeddings into spaces like L_1 and squared-L_2, two prevailing algorithmic approaches for dealing with edit distance. Furthermore, the bound holds not only for strings over alphabet Sigma={0,1}, but also for strings that are permutations (called the Ulam metric).
Besides being applicable to a much richer class of algorithms than all previous results, our bounds are near-tight in at least one case, namely of embedding permutations into L_1. The proof uses a new technique, that relies on Fourier analysis in a rather elementary way.
The Earth Mover Distance (EMD) between two equal-size sets of points in R^d is defined to be the minimum cost of a bipartite matching between the two pointsets. It is a natural metric for comparing sets of features, and as such, it has received significant interest in computer vision. Motivated by recent developments in that area, we address computational problems involving EMD over high-dimensional pointsets.
A natural approach is to embed the EMD metric into l_1, and use the algorithms designed for the latter space. However, Khot and Naor [FOCS 2005] show that any embedding of EMD over the d-dimensional Hamming cube into l_1 must incur a distortion Omega(d), thus practically losing all distance information. We circumvent this roadblock by focusing on sets with cardinalities upper-bounded by a parameter s, and achieve a distortion of only O(log s log d). Since in applications the feature sets have bounded size, the resulting distortion is much smaller than the Omega(d) lower bound. Our approach is quite general and easily extends to EMD over R^d.
We then provide a strong lower bound on the multi-round communication complexity of estimating EMD, which in particular strengthens the known non-embeddability result of Khot and Naor. Our bound exhibits a smooth tradeoff between approximation and communication, and for example implies that every algorithm that estimates EMD using constant size sketches can only achieve Omega(log s) approximation.
Network triangulation is a method for estimating distances between nodes in the network, by letting every node measure its distance to a few beacon nodes, and deducing the distance between every two nodes x,y by using their measurements to their common beacons and applying the triangle inequality. Kleinberg, Slivkins and Wexler [FOCS 2004] initiated a theoretical study of triangulation in metric spaces, and Slivkins [PODC 2005] subsequently showed that metrics of bounded doubling dimension admit a triangulation that approximates arbitrarily well all pairwise distances using only O(log n) beacons per point, where n is the number of points in the network. He then asked whether this term is necessary (for doubling metrics).
We provide the first lower bounds on the number of beacons required for a triangulation in some specific simple networks. In particular, these bounds (i) answer Slivkins' question positively, even for one-dimensional metrics, and (ii) prove that, up to constants, Slivkins' triangulation achieves an optimal number of beacons (as a function of the approximation guarantee and the doubling dimension).
The distance to monotonicity of a sequence is the minimum number of edit operations required to transform the sequence into an increasing order; this measure is complementary to the length of the longest increasing subsequence (LIS). We address the question of estimating these quantities in the one-pass data stream model and present the first sub-linear space algorithms for both problems.
We first present O(sqrt{n})-space deterministic algorithms that approximate the distance to monotonicity and the LIS to within a factor that is arbitrarily close to $1$. We also show a lower bound of Omega(n) on the space required by any randomized algorithm to compute the LIS (or alternatively the distance from monotonicity) exactly, demonstrating that approximation is necessary for sub-linear space computation; this bound improves upon the existing lower bound of Omega(sqrt{n}) [Liben-Nowell, Vee, and Zhu, J. Combinatorial Optimization, 2006].
Our main result is a randomized algorithm that uses only O(log^2 n) space and approximates the distance to monotonicity to within a factor that is arbitrarily close to 4. In contrast, we believe that any significant reduction in the space complexity for approximating the length of the LIS is considerably hard. We conjecture that any deterministic 1 + epsilon approximation algorithm for LIS requires Omega(sqrt{n}) space, and as a step towards this conjecture, prove a space lower bound of $\Omega(\sqrt n)$ for a restricted yet natural class of deterministic algorithms.
We initiate the study of approximate algorithms on negatively curved spaces. These spaces have recently become of interest in various domains of computer science including networking and vision. The classical example of such a space is the real-hyperbolic space H^d for d>=2, but our approach applies to a more general family of spaces characterized by Gromov's (combinatorial) hyperbolic condition. We give efficient algorithms and data structures for problems like approximate nearest-neighbor search and compact, low-stretch routing on subsets of negatively curved spaces of fixed dimension (including H^d as a special case). In a different direction, we show that there is a PTAS for the Traveling Salesman Problem when the set of cities lie, for example, in H^d. This generalizes Arora's results for R^d.
Most of our algorithms use the intrinsic distance geometry of the data set, and only need the existence of an embedding into some negatively curved space in order to function properly. In other words, our algorithms regard the inter-point distance function as a black box, and are independent of the representation of the input points.
Edit distance is a fundamental measure of distance between strings, the extensive study of which has recently focused on computational problems such as nearest neighbor search, sketching and fast approximation. A very powerful paradigm is to map the metric space induced by the edit distance into a normed space (e.g., ℓ1) with small distortion, ℓ1 and then use the rich algorithmic toolkit known for normed spaces. Although the minimum distortion required to embed edit distance into ℓ1 has received a lot of attention lately, there is a large gap between known upper and lower bounds. We make progress on this question by considering large, well-structured, submetrics of the edit distance metric space.
Our main technical result is that the Ulam metric, namely, the edit distance on permutations of length at most n, embeds into ℓ1 with distortion O(log n). This immediately leads to sketching algorithms with constant size sketches, and to efficient approximate nearest neighbor search algorithms, with approximation factor O(log n). The embedding and its algorithmic consequences present a big improvement over those previously known for the Ulam metric, and they are significantly better than the state of the art for edit distance in general. Further, we extend these results for the Ulam metric to edit distance on strings that are (locally) non-repetitive, i.e., strings where (close by) substrings are distinct.
We simplify and improve upon recent lower bounds on the minimum distortion of embedding certain finite metric spaces into L_1. In particular, we show that for infinitely many values of n there are n-point metric spaces of negative type that require a distortion of Omega(log log n) for such an embedding, implying the same lower bound on the integrality gap of a well-known SDP relaxation for Sparsest-Cut. This result builds upon and improves the recent lower bound of (log log n)^{1/6-o(1)} due to Khot and Vishnoi [STOC 2005]. We also show that embedding the edit distance on {0,1}^n into L_1 requires a distortion of Omega(log n). This result simplifies and improves a very recent lower bound due to Khot and Naor [FOCS 2005].
We show that the Multicut, Sparsest-Cut, and Min-2CNF= Deletion problems are NP-hard to approximate within every constant factor, assuming the Unique Games Conjecture of Khot [STOC, 2002]. A quantitatively stronger version of the conjecture implies an inapproximability factor of $\Omega(\sqrt{\log\log n})$.
We address the problems of pattern matching and approximate pattern matching in the sketching model. We show that it is impossible to compress the text into a small sketch and use only the sketch to decide whether a given pattern occurs in the text. We also prove a sketch size lower bound for approximate pattern matching, and show it is tight up to a logarithmic factor.
Edit distance has been extensively studied for the past several years. Nevertheless, no linear-time algorithm is known to compute the edit distance between two strings, or even to approximate it to within a modest factor. Furthermore, for various natural algorithmic problems such as low-distortion embeddings into normed spaces, approximate nearest-neighbor schemes, and sketching algorithms, known results for the edit distance are rather weak. We develop algorithms that solve gap versions of the edit distance problem: given two strings of length $n$ with the promise that their edit distance is either at most $k$ or greater than $l$, decide which of the two holds. We present two sketching algorithms for gap versions of edit distance. Our first algorithm solves the $k$ vs. $(k n)^{2/3}$ gap problem, using a constant size sketch. A more involved algorithm solves the stronger $k$ vs. $l$ gap problem, where $l$ can be as small as $O(k^2)$---still with a constant sketch---but works only for strings that are mildly ``non-repetitive''. Finally, we develop an $n^{3/7}$-approximation quasi-linear time algorithm for edit distance, improving the previous best factor of $n^{3/4}$ due to Cole and Hariharan [SIAM J. on Computing, 2002]; if the input strings are assumed to be non-repetitive, then the approximation factor can be strengthened to $n^{1/3}$.
We devise a new embedding technique, which we call measured descent, based on decomposing a metric space locally, at varying speeds, according to the density of some probability measure. This provides a refined and unified framework for the two primary methods of constructing Fr\'echet embeddings for finite metrics, due to [Bourgain, 1985] and [Rao, 1999]. We prove that any $n$-point metric space $(X,d)$ embeds in Hilbert space with distortion $O(\sqrt{\alpha_X \cdot \log n})$, where $\alpha_X$ is a geometric estimate on the decomposability of $X$. An an immediate corollary, we obtain an $O(\sqrt{\log \lambda_X \log n})$ distortion embedding, where $\lambda_X$ is the doubling constant of $X$. Since $\lambda_X\le n$, this result recovers Bourgain's theorem, but when the metric $X$ is, in a sense, ``low-dimensional,'' improved bounds are achieved.
Our embeddings are volume-respecting for subsets of arbitrary size. One consequence is the existence of $(k, \log n)$ volume-respecting embeddings for all $1 \leq k \leq n$, which is the best possible, and answers positively a question posed in [Feige, 1998]. Our techniques are also used to answer positively a question of Y. Rabinovich, showing that any weighted $n$-point planar graph embeds in $\ell_\infty^{O(\log n)}$ with $O(1)$ distortion. The $O(\log n)$ bound on the dimension is optimal, and improves upon the previously known bound of $O(\log n)^2$.
We define a natural notion of efficiency for approximate nearest-neighbor (ANN) search in general $n$-point metric spaces, namely the existence of a randomized algorithm which answers (1+ε)-approximate nearest neighbor queries in polylog(n) time using only polynomial space. We then study which families of metric spaces admit efficient ANN schemes in the black-box model, where only oracle access to the distance function is given, and any query consistent with the triangle inequality may be asked.
For ε < 2/5, we offer a complete answer to this problem. Using the notion of metric dimension defined in \cite{GKL03} (\`a la [Assouad, 1983]), we show that a metric space $X$ admits an efficient (1+ε)-ANN scheme for any ε < 2/5 if and only if $\dim(X) = O(\log \log n)$. For coarser approximations, clearly the upper bound continues to hold, but there is a threshold at which our lower bound breaks down---this is precisely when points in the ``ambient space'' may begin to affect the complexity of ``hard'' subspaces $S \subseteq X$. Indeed, we give examples which show that $\dim(X)$ does not characterize the black-box complexity of ANN above the threshold.
Our scheme for ANN in low-dimensional metric spaces is the first to yield efficient algorithms without relying on any heuristic assumptions on the input. In previous approaches (e.g., [Clarkson, 1999], [Karger and Ruhl, 2002], [Krauthgamer and Lee, 2004]) even spaces with $\dim(X) = O(1)$ sometimes required $\Omega(n)$ query times.
Aggregate statistical data about the web is very useful in numerous scenarios, such as market research, intelligence, and social studies. In many of these applications, one is interested not in generic data about the whole web but rather in highly focused information pertinent to a specific domain or topic. Furthermore, timely acquisition of this information provides a competitive advantage.
Focused statistical data can be gathered by a brute force crawl of the whole web, or by a "focused crawl", which collects mainly pages that are relevant to the topic of interest. Crawling, however, is an expensive enterprise, requiring substantial resources. For the purpose of gathering statistical data, random sampling of web pages is a much faster, cheaper, and even more reliable approach. We develop the first efficient method for generating a random sample of web pages relevant to a given user-specified topic. Previously, techniques for getting only an unfocused sample of pages from the whole web were known. Our method is based on a new random walk on (a modified version of) a subgraph of the web graph.
We present a simple deterministic data structure for maintaining a set $S$ of points in a general metric space, while supporting proximity search (nearest neighbor and range queries) and updates to $S$ (insertions and deletions). Our data structure consists of a sequence of progressively finer $\epsilon$-nets of $S$, with pointers that allow us to navigate easily from one scale to the next. We analyze the worst-case complexity of this data structure in terms of the ``abstract dimensionality'' of the metric $S$. Our data structure is extremely efficient for metrics of bounded dimension and is essentially optimal in a certain model of distance computation. Finally, as a special case, our approach improves over one recently devised by Karger and Ruhl [STOC, 2002].
Given a metric space $(X,d)$, a natural distance measure on probability distributions over $X$ is the {\em earthmover metric}. We use randomized rounding of earthmover metrics to devise new approximation algorithms for two well-known classification problems, namely, metric labeling and 0-extension.
Our first result is for the 0-extension problem. We show that if the terminal metric is decomposable with parameter $\alpha$ (e.g., planar metrics are decomposable with $\alpha=O(1)$), then the earthmover based linear program (for $0$-extension) can be rounded to within an $O(\alpha)$ factor.
Our second result is an $O(\log n)$-approximation for metric labeling, using probabilistic tree embeddings in a way very different from the $O(\log k)$-approximation of Kleinberg and Tardos [JACM, 2002]. (Here, $n$ is the number of nodes, and $k$ is the number of labels.) The key element is rounding the earthmover based linear program (for metric labeling) without increasing the solution's cost, when the input graph is a tree. This rounding method also provides an alternate proof to a result stated in Chekuri et al. [SODA, 2001], that the earthmover based linear program is integral when the input graph is a tree.
Our simple and constructive rounding techniques contribute to the understanding of earthmover metrics and may be of independent interest.
In the Asymmetric $k$-Center problem, the input is an integer $k$ and a complete digraph over $n$ points together with a distance function obeying the directed triangle inequality. The goal is to choose a set of $k$ points to serve as centers and to assign all the points to the centers, so that the maximum distance of any point from its center is as small as possible.
We show that the Asymmetric $k$-Center problem is hard to approximate up to a factor of $\log^* n - O(1)$ unless $NP \subseteq DTIME(n^{\log \log n})$. Since an $O(\log^* n)$-approximation algorithm is known for this problem, this resolves the asymptotic approximability of Asymmetric $k$-Center. This is the first natural problem whose approximability threshold does not polynomially relate to the known approximation classes. We also resolve the approximability threshold of the metric (symmetric) $k$-Center problem with costs.
The doubling constant of a metric space $(X,d)$ is the smallest value $\lambda$ such that every ball in $X$ can be covered by $\lambda$ balls of half the radius. The doubling dimension of $X$ is then defined as $dim(X) = \log_2 \lambda$. A metric (or sequence of metrics) is called doubling precisely when its doubling dimension is bounded. This is a robust class of metric spaces which contains many families of metrics that occur in applied settings.
We give tight bounds for embedding doubling metrics into (low-dimensional) normed spaces. We consider both general doubling metrics, as well as more restricted families such as those arising from trees, from graphs excluding a fixed minor, and from snowflaked metrics. Our techniques include decomposition theorems for doubling metrics, and an analysis of a fractal in the plane due to Laakso [Bull. London Math. Soc., 2002]. Finally, we discuss some applications and point out a central open question regarding dimensionality reduction in $L_2$.
Motivation: Comparing two protein databases is a fundamental task in biosequence annotation. Given two databases, one must find all pairs of proteins that align with high score under a biologically meaningful substitution score matrix, such as a BLOSUM matrix (Henikoff and Henikoff, 1992). Distance-based approaches to this problem map each peptide in the database to a point in a metric space, such that peptides aligning with higher scores are mapped to closer points. Many techniques exist to discover close pairs of points in a metric space efficiently, but the challenge in applying this work to proteomic comparison is to find a distance mapping that accurately encodes all the distinctions among residue pairs made by a proteomic score matrix. Buhler (2002) proposed one such mapping but found that it led to a relatively inefficient algorithm for protein-protein comparison.
Results: This work proposes a new distance mapping for peptides under the BLOSUM matrices that permits more efficient similarity search. We first propose a new distance function on peptides derived from a given score matrix. We then show how to map peptides to bit vectors such that the distance between any two peptides is closely approximated by the Hamming distance (i.e., number of mismatches) between their corresponding bit vectors. We combine these two results with the LSH-ALL-PAIRS-SIM algorithm of Buhler (2002) to produce an improved distance-based algorithm for proteomic comparison. An initial implementation of the improved algorithm exhibits sensitivity within 5% of that of the original LSH-ALL-PAIRS-SIM, while running up to eight times faster.
We devise the first constant factor approximation algorithm for minimum quotient vertex-cuts in planar graphs. Our algorithm achieves approximation ratio $3.5 (1+\epsilon)^2$ with running time $O(W n^{3+2/\epsilon})$. The approximation ratio improves to $\frac{4}{3} (1+\epsilon)$ if there is an optimal quotient vertex-cut $(A^*,B^*,C^*)$ where $C^*$ has relatively small weight compared to $A^*$ and $B^*$; this holds, for example, when the input graph has uniform (or close to uniform) weight and cost. The approximation ratio further improves to $1+\epsilon$ if, in addition, $\min( w(A^*),w(B^*) ) \le \frac{1}{3} W$.
We use our quotient cut algorithm to find the first constant-factor pseudo-approximation for vertex separators in planar graphs.
Our technical contribution is two-fold. First, we prove a structural theorem for vertex-cuts in planar graphs, showing the existence of a near-optimal vertex-cut whose high-level structure is that of a bounded-depth tree. Second, we develop an algorithm for optimizing over such complex structures, whose running time depends (exponentially) not on the size of the structure, but rather only on its depth. These techniques may be applicable in other problems.
We resolve the following conjecture raised by Levin together with Linial, London, and Rabinovich [Combinatorica, 1995]. Let $\dim(G)$ be the smallest $d$ such that $G$ occurs as a (not necessarily induced) subgraph of the infinite graph $Z_\infty^d$, whose vertex set is $Z^d$ and which has an edge $(u,v)$ whenever $||u-v||_\infty = 1$. The growth rate of $G$, denoted $\rho_G$, is the minimum $\rho$ such that every ball of radius $r>1$ in $G$ contains at most $r^\rho$ vertices. By simple volume arguments, $\dim(G) = \Omega(\rho_G)$. Levin conjectured that this lower bound is tight, i.e., that $\dim(G) = O(\rho_G)$ for every graph $G$.
We show that a weaker form of Levin's conjecture, namely, $\dim(G) = O(\rho_G \log \rho_G)$, holds for every graph $G$. We disprove, however, the specific bound of the conjecture and show that our upper bound is tight by exhibiting graphs for which $\dim(G) =\Omega(\rho_G \log \rho_G)$. For families of graphs which exclude a fixed minor, we salvage the strong form, showing that $\dim(G) = O(\rho_G)$. This holds also for graphs without long induced simple cycles. Our results extend to a variant of the conjecture for finite-dimensional Euclidean spaces due to Linial [Haifa Workshop, 2002].
We provide the first hardness result for a polylogarithmic approximation ratio (for a natural NP-hard optimization problem). We show that for every fixed $\epsilon>0$, the Group Steiner Tree problem admits no efficient $\log^{2-\epsilon} k$--approximation, where $k$ denotes the number of groups, unless NP has quasi-polynomial Las-Vegas algorithms. This result holds even for input graphs which are Hierarchically Well-Separated Trees, introduced by Bartal [FOCS, 1996], in which case our bound is nearly tight with the $O(\log^2 k)$--approximation currently known. Our results imply that for every fixed $\epsilon>0$, the Directed Steiner Tree problem admits no $\log^{2-\epsilon} n$--approximation, where $n$ is the number of vertices in the graph, under the same complexity assumption.
We present an $\Omega(\log^2 k)$ lower bound on the integrality ratio of the flow-based relaxation for the Group Steiner Tree problem, where $k$ denotes the number of groups; this holds even for input graphs that are Hierarchically Well-Separated Trees, introduced by Bartal [\textit{Symp.\ Foundations of Computer Science}, pp.\ 184--193, 1996], in which case this lower bound is tight. This relaxation appears to be the only one that have been studied for the problem, as well as for its generalization, the Directed Steiner Tree problem. For the latter problem, our results imply an $\Omega(\frac{\log^2 n}{(\log \log n)^2})$ integrality ratio, where $n$ is the number of vertices in the graph. For both problems, this is the first known lower bound on the integrality ratio that is superlogarithmic in the input size. We also show algorithmically that the integrality ratio for Group Steiner Tree is much better for certain families of instances, which helps pinpoint the types of instances that appear to be most difficult to approximate.
Data dimensionality is a crucial issue in a variety of settings, where it is desirable to determine whether a data set given in a high-dimensional space adheres to a low-dimensional structure. We study this problem in the framework of property testing: Given a query access to a data set $S$, we wish to determine whether $S$ is low-dimensional, or whether it should be modified significantly in order to have the property. Allowing a constant probability of error, we aim at algorithms whose complexity does not depend on the size of $S$.
We present algorithms for testing the low-dimensionality of a set of vectors and for testing whether a matrix is of low rank. We then address low-dimensionality in metric spaces. For vectors in the metric space $l_1$, we show that low-dimensionality is not testable. For $l_2$, we show that a data set can be tested for having a low-dimensional structure, but that the property of approximately having such a structure is not testable.
In the survivable network design problem (SNDP), the goal is to find a minimum-cost subgraph satisfying certain connectivity requirements. We study the vertex-connectivity variant of SNDP in which the input specifies, for each pair of vertices, a required number of vertex-disjoint paths connecting them.
We give the first lower bound on the approximability of SNDP, showing that the problem admits no efficient $2^{\log^{1-\epsilon} n}$ ratio approximation for any fixed $\epsilon>0$ unless $NP\subseteq DTIME(n^{\polylog(n)})$. We also show hardness of approximation results for several important special cases of SNDP, including constant factor hardness for the $k$-vertex connected spanning subgraph problem ($k$-VCSS) and for the vertex-connectivity augmentation problem, even when the edge costs are severely restricted.
The extensive study of metric spaces and their embeddings has so far focused on embeddings that preserve pairwise distances. A very intriguing concept introduced by Feige [JCSS, 2000] allows us to quantify the extent to which larger structures are preserved by a given embedding. We investigate this concept, focusing on several major graph families such as paths, trees, cubes and expanders. We find some similarities to the regular (pairwise) distortion, as well as some striking differences.
Lovasz and Schrijver (SIAM J. Optim., 1991) devised a lift-and-project method that produces a sequence of convex relaxations for the problem of finding in a graph an independent set (or a clique) of maximum size. Each relaxation in the sequence is tighter than the one before it, while the first relaxation is already at least as strong as the Lovasz theta function (IEEE Trans. Inform. Theory, 1979}. We show that on a random graph $G_{n,1/2}$, the value of the $r$th relaxation in the sequence is roughly $\sqrt{n/2^r}$, almost surely. It follows that for those relaxations known to be efficiently computable, namely for $r=O(1)$, the value of the relaxation is comparable to the theta function. Furthermore, a perfectly tight relaxation is almost surely obtained only at the $r=\Theta(\log n)$ relaxation in the sequence.
The notion of private approximation was introduced recently by Feigenbaum, Fong, Strauss and Wright. Informally, a private approximation of a function $f$ is another function $F$ that approximates $f$ in the usual sense, but does not yield any information on $x$ other than what can be deduced from $f(x)$. As such, $F(x)$ is useful for private computation of $f(x)$ (assuming that $F$ can be computed more efficiently than $f$).
In this work we examine the properties and limitations of this new notion. Specifically, we show that for many NP-hard problems, the privacy requirement precludes non-trivial approximation. This is the case even for problems that otherwise admit very good approximation (e.g., problems with PTAS). On the other hand, we show that slightly relaxing the privacy requirement, by means of leaking ``just a few bits of information'' about $x$, again permits good approximation.
A web content hosting service provider needs to dynamically allocate servers in a server farm to its customers' web sites. Ideally, the allocation to a site should always suffice to handle its load. However, due to a limited number of servers and the overhead incurred in changing the allocation of a server from one site to another, the system may become overloaded. The problem faced by the web hosting service provider is how to allocate the available servers in the most profitable way. Adding to the complexity of this problem is the fact that future loads of the sites are either unknown or known only for the very near future.
In this paper we model this server allocation problem, and consider both its offline and online versions. We give a polynomial time algorithm for computing the optimal offline allocation. In the online setting, we show almost optimal algorithms (both deterministic and randomized) for any positive lookahead. The quality of the solution improves as the lookahead increases. We also consider several special cases of practical interest. Finally, we present some experimental results using actual trace data that show that one of our online algorithm performs very close to optimal.
Interestingly, the online server allocation problem can be cast as a more general benefit task system that we define. Our results extend to this task system, which captures also the benefit maximization variants of the $k$-server problem and the metrical task system problem. It follows that the benefit maximization variants of these problems are more tractable than their cost minimization variants.
The achromatic number problem is to legally color the vertices of an input graph with the maximum number of colors, denoted $\psi^*$, so that every two color classes share at least one edge. This problem is known to be NP-hard.
For general graphs we give an algorithm that approximates the achromatic number within ratio of $O(n \log\log n/\log n)$. This improves over the previously known approximation ratio of $O(n/\sqrt{\log n})$, due to Chaudhary and Vishwanathan (SODA, 1997).
For graphs of girth at least 5 we give an algorithm with approximation ratio $O(\min\{n^{1/3},\sqrt{\psi^*}\})$. This improves over an approximation ratio $O(\sqrt{\psi^*})=O(n^{3/8})$ for the more restricted case of graphs with girth at least 6, due to Krista and Lorys (ESA 1999).
We also give the first hardness result for approximating the achromatic number. We show that for every fixed $\epsilon>0$ there is no $2-\epsilon$ approximation algorithm, unless $P=NP$.
A bisection of a graph with $n$ vertices is a partition of its vertices into two sets, each of size $n/2$. The bisection cost is the number of edges connecting the two sets. Finding the bisection of minimum cost is NP-hard. We present an algorithm that finds a bisection whose cost is within ratio of $O(\log^2 n)$ from the optimal. For graphs excluding any fixed graph as a minor (e.g. planar graphs) we obtain an improved approximation ratio of $O(\log n)$. The previously known approximation ratio for bisection was roughly $\sqrt{n}$.
A bisection of a graph with $n$ vertices is a partition of its vertices into two sets, each of size $n/2$. The bisection size is the number of edges connecting the two sets. Finding the bisection of minimum size is NP-hard. We present an algorithm that finds a bisection that is within $O(\sqrt{n}\log n)$ of optimal. No sublinear approximation ratio for bisection was previously known.
We consider the problem of finding in an undirected graph a minimum cut that separates exactly a given number $k$ of vertices. For general $k$ (i.e. $k$ is part of the input and may depend on $n$) this problem is NP-hard.
We present for this problem a randomized approximation algorithm, which is useful when $k$ is relatively small. In particular, for $k = O(\log n)$ we obtain a PTAS (polynomial time approximation scheme), and for $k=\Omega(\log n)$ we obtain an approximation ratio $O(k/\log n)$.
The motivation for our work is the observation that Web pages on a particular topic are often linked to other pages on the same topic. We model and analyze the problem of how to improve the classification of Web pages (that is, determining the topic of the page) by using link information. In our setting, an initial classifier examines the text of a Web page and assigns to it some classification, possibly mistaken. We investigate how to reduce the error probability using the observation above, thus building an improved classifier. We present a theoretical framework for this problem based on a random graph model and suggest two linear time algorithms, based on similar methods that have been proven effective in the setting of error-correcting codes. We provide simulation results to verify our analysis and to compare the performance of our suggested algorithms.
Alon, Krivelevich and Sudakov (Random Structures and Algorithms, 1998) designed an algorithm based on spectral techniques that almost surely finds a clique of size $\Omega(\sqrt{n})$ hidden in an otherwise random graph. We show that a different algorithm, based on the Lovasz theta function, almost surely both finds the hidden clique and certifies its optimality. Our algorithm has an additional advantage of being more robust: it also works in a semi-random hidden clique model, in which an adversary can remove edges from the random portion of the graph.
Hot-potato routing is a form of synchronous routing which makes no use of buffers at intermediate nodes. Packets must move at every time step, until they reach their destination. If contention prevents a packet from taking its preferred outgoing edge, it is deflected on a different edge. Two simple design principles for hot potato routing algorithms are minimum advance, that advances at least one packet towards its destination from every nonempty node (and possibly deflects all other packets), and maximum advance, that advances the maximum possible number of packets.
Livelock is a situation in which packets keep moving indefinitely in the network without any packet ever reaching its destination. It is known that even maximum advance algorithms might livelock on some networks. We show that minimum advance algorithms never livelock on tree networks, and that maximum advance algorithms never livelock on triangulated networks.
A stereoscopic family of permutations maps an $m$-dimensional mesh into several one-dimensional lines, in a way that jointly preserves distance information. Specifically, consider any two points and denote their distance on the $m$-dimensional mesh by $d$. Then the distance between their images, on the line on which these images are closest together, is $O(d^m)$.
We initiate a systematic study of stereoscopic families of permutations. We show a construction of these families that involves the use of $m+1$ images. We also show that under some additional restrictions (namely, adjacent points on the image lines originate at points which are not too far away on the mesh), three images are necessary in order to construct such a family for the two-dimensional mesh.
We present two applications for stereoscopic families of permutations. One application is an algorithm for routing on the mesh that guarantees delivery of each packet within a number of steps that depends upon the distance between this packet's source and destination, but is independent of the size of the mesh. Our algorithm is exceptionally simple, involves no queues, and can be used in dynamic settings in which packets are continuously generated. Another application is an extension of the construction of non-expansive hash functions of Linial and Sasson (STOC 1996) from the case of one dimensional metrics to arbitrary dimensions.