We study the problem of a decision-maker having to select one of many competing alternatives (e.g., choosing between projects, designs, or suppliers) whose future revenues are a priori unknown and modeled as random variables of known probability distributions. The decision-maker can pay to test each alternative to reveal its specific revenue realization (e.g., by conducting market research), and her goal is to maximize the expected revenue of the selected alternative minus the testing costs. This model captures an interesting trade-off between gaining revenue of a high-yield alternative and spending resources to reduce the uncertainty in selecting it. The combinatorial nature of the problem leads to a dynamic programming (DP) formulation with high-dimensional state space that is computationally intractable. By characterizing the structure of the optimal policy, we derive efficient optimal and near-optimal policies that are simple and easy-to-compute. In fact, these policies are also myopic -- they only consider a limited horizon of one test. Moreover, our policies can be described using intuitive `testing intervals' around the expected revenue of each alternative, and in many cases, the dynamics of an optimal policy can be explained by the interaction between the testing intervals of various alternatives.
In the Set Cover problem, the input is a ground set of n elements and a collection of m sets, and the goal is to find the smallest sub-collection of sets whose union is the entire ground set. In spite of extensive effort, the fastest algorithm known for the general case runs in time $O(mn 2^n)$ [Fomin et al., WG 2004]. In 2012, as progress seemed to halt, Cygan et al. [TALG 2016] have put forth the Set Cover Conjecture (SeCoCo), which asserts that for every fixed \epsilon>0, no algorithm with runtime $2^{(1-\epsilon)n} poly(m)$ can solve Set Cover, even if the input sets are of arbitrary large constant size. We propose a weaker conjecture, which we call Log-SeCoCo, that is similar to SeCoCo but allows input sets of size O(log n).
To support Log-SeCoCo, we show that its failure implies an algorithm that is faster than currently known for the famous Directed Hamiltonicity problem. Even though Directed Hamiltonicity has been studied extensively for over half a century, no algorithm significantly faster than $2^n poly(n)$ is known for it. In fact, we show a fine-grained reduction to Log-SeCoCo from a generalization of Directed Hamiltonicity, known as the nTree problem, which too can be solved in time $2^n poly(n)$ [Koutis and Williams, TALG 2016]. We further show an equivalence between solving the parameterized versions of Set Cover and of nTree significantly faster than their current known runtime. Finally, we show that even moderate runtime improvements for Set Cover with bounded-size sets would imply new algorithms for nTree and for Directed Hamiltonicity.
Our technical contribution is to reinforce Log-SeCoCo (and arguably SeCoCo) by reductions from other famous problems with known algorithmic barriers, and hope it will lead to more results in this vein, particularly reinforcing the Strong Exponential-Time Hypothesis (SETH) by reductions from other well-known problems.
The kTree problem is a special case of Subgraph Isomorphism where the pattern graph is a tree, that is, the input is an n-node graph G and a k-node tree T, and the goal is to determine whether G has a subgraph isomorphic to T. We provide evidence that this problem cannot be computed significantly faster (i.e., by an exponential factor) than $O(2^k) poly(n)$, which matches the fastest algorithm known for this problem [Koutis and Williams, ICALP 2009 and TALG 2016]. Specifically, we show that if kTree can be solved in time $O(2^{(1-\epsilon)k} poly(n))$ for some constant \epsilon>0, then Set Cover with n' elements and m' sets can be solved in time $2^{(1-\delta)n'} poly(m')$ for a constant $\delta(\epsilon)>0$, refuting the Set Cover Conjecture [Cygan et al., CCC 2012 and TALG 2016].
We study the following version of cut sparsification. Given a large edge-weighted network $G$ with $k$ terminal vertices, compress it into a small network $H$ with the same terminals, such that the minimum cut in $H$ between every bipartition of the terminals approximates the corresponding one in $G$ within factor $q\geq 1$, called the quality.
We provide two new insights about the structure of cut sparsifiers, and then apply them to obtain improved cut sparsifiers (and data structures) for planar graphs. Our first main contribution identifies a subset of the minimum terminal cuts that generates all the other ones. Restricting attention to these cuts, which we call elementary, is effective in reducing the number of requirements from the sparsifier $H$. Our applications of this result are new cut sparsifiers of quality $q=1$ (also called mimicking networks): (1) for planar networks $G$, we design a cut sparsifier of size $O(k 2^{2k})$, slightly improving the known bound from [Krauthgamer and Rika, SODA 2013]; and (2) for planar networks $G$ whose terminals are incident to at most $\gamma=\gamma(G)$ faces, we design a cut sparsifier of size $O(\gamma^5 2^{2\gamma} k^4)$, which was previously unknown.
Our second main contribution is a duality between minor cut sparsification and minor distance sparsification (i.e., when the sparsifier $H$ is required to be a minor of $G$), applicable to planar networks $G$ with all their terminals on the same face. This duality connects problems that were previously studied separately, implying new results, new proofs of known results, and equivalences between open gaps. All our cut sparsifiers are minors of the input network $G$.
We provide evidence that computing the maximum flow value between every pair of nodes in a directed graph on $n$ nodes, $m$ edges, and capacities in the range $[1..n]$, which we call the All-Pairs Max-Flow problem, cannot be solved in time that is faster significantly (i.e., by a polynomial factor) than $O(n^2m)$. Since a single maximum $st$-flow in such graphs can be solved in time $\tilde O(m\sqrt{n})$ [Lee and Sidford, FOCS 2014], we conclude that the all-pairs version might require time equivalent to $\tilde\Omega(n^{3/2})$ computations of maximum $st$-flow, which strongly separates the directed case from the undirected one. Moreover, if maximum $st$-flow can be solved in time $\tilde O(m)$, then the runtime of $\tilde\Omega(n^2)$ computations is needed. The latter settles a conjecture of Lacki, Nussbaum, Sankowski, and Wulf-Nilsen [FOCS 2012] negatively.
Specifically, we show that in sparse graphs $G=(V,E,w)$, if one can compute the maximum $st$-flow from every $s$ in an input set of sources $S\subseteq V$ to every $t$ in an input set of sinks $T\subseteq V$ in time $O((|S| |T| m)^{1-\epsilon})$, for some $|S|$, $|T|$, and a constant $\epsilon>0$, then MAX-CNF-SAT (maximum satisfiability of conjunctive normal form formulas) with $n'$ variables and $m'$ clauses can be solved in time ${m'}^{O(1)}2^{(1-\delta)n'}$ for a constant $\delta(\epsilon)>0$, a problem for which not even $2^{n'}/\poly(n')$ algorithms are known. Such runtime for MAX-CNF-SAT would in particular refute the Strong Exponential Time Hypothesis (SETH). Hence, we improve the lower bound of Abboud, Vassilevska-Williams, and Yu [STOC 2015], who showed that for every fixed $\epsilon>0$ and $|S|=|T|=O(\sqrt{n})$, if the above problem can be solved in time $O(n^{3/2-\epsilon})$, then some incomparable (and intuitively weaker) conjecture is false. Furthermore, a larger lower bound than ours implies strictly super-linear time for maximum $st$-flow problem, which would be an amazing breakthrough.
In addition, we show that All-Pairs Max-Flow in uncapacitated networks with every edge-density $m=m(n)$, cannot be computed in time significantly faster than $O(mn)$, even for acyclic networks. The gap to the fastest known algorithm by Cheung, Lau, and Leung [FOCS 2011] is a factor of $O(m^{\omega-1}/n)$, and for acyclic networks it is $O(n^{\omega-1})$, where $\omega$ is the matrix multiplication exponent.
A central problem in data streams is to characterize which functions of an underlying frequency vector can be approximated efficiently. Recently there has been considerable effort in extending this problem to that of estimating functions of a matrix that is presented as a data-stream. This setting generalizes classical problems to the analogous ones for matrices. For example, instead of estimating frequent-item counts, we now wish to estimate "frequent-direction" counts. A related example is to estimate norms, which now correspond to estimating a vector norm on the singular values of the matrix. Despite recent efforts, the current understanding for such matrix problems is considerably weaker than that for vector problems.
We study a number of aspects of estimating matrix norms in a stream that have not previously been considered: (1) multi-pass algorithms, (2) algorithms that see the underlying matrix one row at a time, and (3) time-efficient algorithms. Our multi-pass and row-order algorithms use less memory than what is provably required in the single-pass and entrywise-update models, and thus give separations between these models (in terms of memory). Moreover, all of our algorithms are considerably faster than previous ones. We also prove a number of lower bounds, and obtain for instance, a near-complete characterization of the memory required of row-order algorithms for estimating Schatten p-norms of sparse matrices.
In the snippets problem we are interested in preprocessing a text T so that given two pattern queries P1 and P2, one can quickly locate the occurrences of the patterns in T that are the closest to each other. A closely related problem is that of constructing a color-distance oracle, where the goal is to preprocess a set of points from some metric space, in which every point is associated with a set of colors, so that given two colors one can quickly locate two points associated with those colors, that are as close as possible to each other.
We introduce efficient data structures for both color-distance oracles and the snippets problem. Moreover, we prove conditional lower bounds for these problems from both the 3SUM conjecture and the Combinatorial Boolean Matrix Multiplication conjecture.
We characterize the streaming space complexity of every symmetric norm l (a norm on R^n invariant under sign-flips and coordinate-permutations), by relating this space complexity to the measure-concentration characteristics of l. Specifically, we provide matching upper and lower bounds (up to polylog(n) factors) on the space complexity of approximating the norm of the stream, where both bounds depend on the median of l(x), when x is drawn uniformly from the l2 unit sphere. The same quantity governs many phenomena in high-dimensional spaces, such as large-deviation bounds and the critical dimension in Dvoretzky's Theorem.
The family of symmetric norms contains several well-studied norms, such as all l_p norms, and indeed we provide a new explanation for the disparity in space complexity between p <= 2 and p>2. In addition, we apply our general results to easily derive bounds for several norms were not studied before in the streaming model, including for example the top-k norm and the k-support norm, which was recently shown to be effective for machine learning tasks.
Overall, these results make progress on two outstanding problems in the area of sublinear algorithms (Problems 5 and 30 in this http URL).
By a classical result of Gomory and Hu (1961), in every edge-weighted graph G=(V,E,w), the minimum st-cut values, when ranging over all $s,t\in V$, take at most |V|-1 distinct values. That is, these $\binom{|V|}{2}$ instances exhibit redundancy factor $\Omega(|V|)$. They further showed how to construct from G a tree (V,E',w') that stores all minimum st-cut values. Motivated by this result, we obtain tight bounds for the redundancy factor of several generalizations of the minimum st-cut problem.
Update: The latest version (on arXiv) contains additional references to previous work (which have some overlap with our results).
A valued constraint satisfaction problem (VCSP) instance $(V,\Pi,w)$ is a set of variables $V$ with a set of constraints $\Pi$ weighted by $w$. Given a VCSP instance, we are interested in a re-weighted sub-instance $(V,\Pi'\subset \Pi,w')$ such that preserves the value of the given instance (under every assignment to the variables) within factor $1\pm\epsilon$. A well-studied special case is cut sparsification in graphs, which has found various applications.
We show that a VCSP instance consisting of a single boolean predicate $P(x,y)$ (e.g., for cut, $P=XOR$) can be sparsified into $O(|V|/\epsilon^2)$ constraints if and only if the number of inputs that satisfy $P$ is anything but one (i.e., $|P^{-1}(1)| \neq 1$). Furthermore, this sparsity bound is tight unless $P$ is a relatively trivial predicate. We conclude that also systems of 2SAT (or 2LIN) constraints can be sparsified.
We study the problem of reconstructing a low-rank matrix, where the input is an $n\times m$ matrix $M$ over a field F and the goal is to reconstruct a (near-optimal) matrix $M'$ that is low-rank and close to $M$ under some distance function $\Delta$. Furthermore, the reconstruction must be local, i.e., provides access to any desired entry of $M'$ by reading only a few entries of the input $M$ (ideally, independent of the matrix dimensions $n$ and $m$). Our formulation of this problem is inspired by the local reconstruction framework of Saks and Seshadhri (SICOMP, 2010).
Our main result is a local reconstruction algorithm for the case where $\Delta$ is the normalized Hamming distance (between matrices). Given $M$ that is $\epsilon$-close to a matrix of rank $d<1/\epsilon$ (together with $d$ and $\epsilon$), this algorithm computes with high probability a rank-$d$ matrix $M'$ that is $O(\sqrt{d\epsilon})$-close to $M$. This is a local algorithm that proceeds in two phases. The preprocessing phase reads only $\tilde O(\sqrt{d/\epsilon^3})$ random entries of $M$, and stores a small data structure. The query phase deterministically outputs a desired entry $M'_{i,j}$ by reading only the data structure and $2d$ additional entries of $M$.
We also consider local reconstruction in an easier setting, where the algorithm can read an entire matrix column in a single operation. When $\Delta$ is the normalized Hamming distance between vectors, we derive an algorithm that runs in polynomial time by applying our main result for matrix reconstruction. For comparison, when $\Delta$ is the truncated Euclidean distance and F=R, we analyze sampling algorithms by using statistical learning tools.
We study resistance sparsification of graphs, in which the goal is to find a sparse subgraph (with reweighted edges) that approximately preserves the effective resistances between every pair of nodes. We show that every dense regular expander admits a $(1+\epsilon)$-resistance sparsifier of size $\tilde O(n/\epsilon)$, and conjecture this bound holds for all graphs on $n$ nodes. In comparison, spectral sparsification is a strictly stronger notion and requires $\Omega(n/\epsilon^2)$ edges even on the complete graph.
Our approach leads to the following structural question on graphs: Does every dense regular expander contain a sparse regular expander as a subgraph? Our main technical contribution, which may of independent interest, is a positive answer to this question in a certain setting of parameters. Combining this with a recent result of von Luxburg, Radl, and Hein~(JMLR, 2014) leads to the aforementioned resistance sparsifiers.
We investigate the problem of approximate Nearest-Neighbor Search (NNS) in graphical metrics: The task is to preprocess an edge-weighted graph $G=(V,E)$ on $m$ vertices and a small ``dataset'' $D\subset V$ of size $n \ll m$, so that given a query point $q\in V$, one can quickly approximate $dist(q,D)$ (the distance from $q$ to its closest vertex in $D$) and find a vertex $a\in D$ within this approximated distance. We assume the query algorithm has access to a distance oracle, that quickly evaluates the exact distance between any pair of vertices.
For planar graphs $G$ with maximum degree $\Delta$, we show how to efficiently construct a compact data structure -- of size $\tilde{O}(n(\Delta+1/\epsilon))$ -- that answers $(1+\epsilon)$-NNS queries in time $\tilde{O}(\Delta+1/\epsilon)$. Thus, as far as NNS applications are concerned, metrics derived from bounded-degree planar graphs behave as low-dimensional metrics, even though planar metrics do not necessarily have a low doubling dimension, nor can they be embedded with low distortion into $\ell_2$. We complement our algorithmic result by lower bounds showing that the access to an exact distance oracle (rather than an approximate one) and the dependency on $\Delta$ (in query time) are both essential.
A prominent tool in many problems involving metric spaces is a notion of randomized low-diameter decomposition. Loosely speaking, $\beta$-decomposition refers to a probability distribution over partitions of the metric into sets of low diameter, such that nearby points (parameterized by $\beta>0$) are likely to be ``clustered'' together. Applying this notion to the shortest-path metric in edge-weighted graphs, it is known that $n$-vertex graphs admit an $O(\ln n)$-padded decomposition \cite{Bartal96}, and that excluded-minor graphs admit $O(1)$-padded decomposition [KPR93,FT03,AGGNT14].
We design decompositions to the family of $p$-path-separable graphs, which was defined by Abraham and Gavoille [AG06] and refers to graphs that admit vertex-separators consisting of at most $p$ shortest paths in the graph.
Our main result is that every $p$-path-separable $n$-vertex graph admits an $O(\ln (p \ln n))$-decomposition, which refines the $O(\ln n)$ bound for general graphs, and provides new bounds for families like bounded-treewidth graphs. Technically, our clustering process differs from previous ones by working in (the shortest-path metric of) carefully chosen subgraphs.
An outstanding open question [sublinear.info, Question #5] asks to characterize metric spaces in which distances can be estimated using efficient sketches. Specifically, we say that a sketching algorithm is efficient if it achieves constant approximation using constant sketch size. A well-known result of Indyk (J. ACM, 2006) implies that a metric that admits a constant-distortion embedding into $\ell_p$ for $p\in(0,2]$ also admits an efficient sketching scheme. But is the converse true, i.e., is embedding into $\ell_p$ the only way to achieve efficient sketching?
We address these questions for the important special case of normed spaces, by providing an almost complete characterization of sketching in terms of embeddings. In particular, we prove that a finite-dimensional normed space allows efficient sketches if and only if it embeds (linearly) into $\ell_{1-\eps}$ with constant distortion. We further prove that for norms that are closed under sum-product, efficient sketching is equivalent to constant-distortion embedding into $\ell_1$. Examples of such norms include the Earth Mover's Distance (specifically its norm variant, called Kantorovich-Rubinstein norm), and the trace norm (a.k.a. Schatten 1-norm or the nuclear norm). Using known non-embeddability theorems for these norms by Naor and Schechtman (SICOMP, 2007) and by Pisier (Compositio. Math., 1978), we then conclude that these spaces do not admit efficient sketches either, making progress towards answering another open question [sublinear.info, Question #7].
Finally, we observe that resolving whether ``sketching is equivalent to embedding into $\ell_1$ for general norms'' (i.e., without the above restriction) is equivalent to resolving a well-known open problem in Functional Analysis posed by Kwapien in 1969.
Sketching and streaming algorithms are in the forefront of current research directions for cut problems in graphs. In the streaming model, we show that (1-\epsilon)-approximation for Max-Cut must use $n^{1-O(\epsilon)}$ space; moreover, beating 4/5-approximation requires polynomial space. For the sketching model, we show that r-uniform hypergraphs admit a (1+\epsilon)-cut-sparsifier (i.e., a weighted subhypergraph that approximately preserves all the cuts) with $O(\epsilon^{-2}n(r+\log n))$ edges. We also make first steps towards sketching general CSPs (Constraint Satisfaction Problems).
We introduce the st-cut version of the Sparsest-Cut problem, where the goal is to find a cut of minimum sparsity in a graph $G(V,E)$ among those separating two distinguished vertices $s,t\in V$. Clearly, this problem is at least as hard as the usual (non-st) version. Our main result is a polynomial-time algorithm for the product-demands setting, that produces a cut of sparsity $O(\sqrt{OPT})$, where OPT denotes the optimum, and the total edge capacity and the total demand are assumed (by normalization) to be 1.
Our result generalizes the recent work of Trevisan [arXiv, 2013] for the non-st version of the same problem (Sparsest-Cut with product demands), which in turn generalizes the bound achieved by the discrete Cheeger inequality, a cornerstone of Spectral Graph Theory that has numerous applications. Indeed, Cheeger's inequality handles graph conductance, the special case of product demands that are proportional to the vertex (capacitated) degrees. Along the way, we obtain an $O(\log |V|)$-approximation for the general-demands setting of Sparsest st-Cut.
We study spectral algorithms for the high-dimensional Nearest Neighbor Search problem (NNS). In particular, we consider a semi-random setting where a dataset is chosen arbitrarily from an unknown subspace of low dimension, and then perturbed by full-dimensional Gaussian noise. We design spectral NNS algorithms whose query time depends polynomially on the dimension and logarithmically on the size of the point set. These spectral algorithms use a repeated computation of the top PCA vector/subspace, and are effective even when the random-noise magnitude is much larger than the interpoint distances. Our motivation is that in practice, a number of spectral NNS algorithms outperform the random-projection methods that seem otherwise theoretically optimal on worst-case datasets. In this paper we aim to provide theoretical justification for this disparity.
We undertake a systematic study of sketching a quadratic form: given an $n \times n$ matrix $A$, create a succinct sketch $sk{A}$ which can produce (without further access to $A$) a multiplicative $(1+\epsilon)$-approximation to $x^T A x$ for any desired query $x \in R^n$. While a general matrix does not admit non-trivial sketches, positive semi-definite (PSD) matrices admit sketches of size $\Theta(\epsilon^{-2} n)$, via the Johnson-Lindenstrauss lemma, achieving the ``for each'' guarantee, namely, for each query $x$, with a constant probability the sketch succeeds. (For the stronger ``for all'' guarantee, where the sketch succeeds for all $x$'s simultaneously, again there are no non-trivial sketches.)
We design significantly better sketches for the important subclass of graph Laplacian matr ices, which we also extend to symmetric diagonally dominant matrices. A sequence of work c ulminating in that of Batson, Spielman, and Srivastava (SIAM Review, 2014), shows that by choosing and reweighting $O(\epsilon^{-2} n)$ edges in a graph, one achieves the ``for all'' guarantee. Our main results advance this front.
Our results provide the first separation between the sketch size needed for the ``for all'' and ``for each'' guarantees for Laplacian matrices.
In edge orientations, the goal is usually to orient (direct) the edges of an undirected $n$-vertex graph $G$ such that all out-degrees are bounded. When the graph $G$ is fully dynamic, i.e., admits edge insertions and deletions, we wish to maintain such an orientation while keeping a tab on the update time. Low out-degree orientations turned out to be a surprisingly useful tool, with several algorithmic applications involving static or dynamic graphs.
Brodal and Fagerberg (1999) initiated the study of the edge orientation problem in terms of the graph's arboricity, which is very natural in this context. They provided a solution with constant out-degree and amortized logarithmic update time for all graphs with constant arboricity, which include all planar and excluded-minor graphs. However, it remained an open question (first proposed by Brodal and Fagerberg, later by others) to obtain similar bounds with worst-case update time.
We resolve this 15 year old question in the affirmative, by providing a simple algorithm with worst-case bounds that nearly match the previous amortized bounds. Our algorithm is based on a new approach of a combinatorial invariant, and achieves a logarithmic out-degree with logarithmic worst-case update times. This result has applications in various dynamic graph problems such as maintaining a maximal matching, where we obtain $O(\log n)$ worst-case update time compared to the $O(\frac{\log n}{\log\log n})$ amortized update time of Neiman and Solomon (2013).
The problem of partitioning an edge-capacitated graph on n vertices into k balanced parts has been amply researched. Motivated by applications such as load balancing in distributed systems and market segmentation in social networks, we propose a new variant of the problem, called Multiply Balanced k-Partitioning, where the vertex-partition must be balanced under d vertex-weight functions simultaneously.
We design bicriteria approximation algorithms for this problem, i.e., they partition the vertices into up to $k$ parts that are nearly balanced simultaneously for all weight functions, and their approximation factor for the capacity of cut edges matches the bounds known for a single weight function times d. For the case where d=2, for vertex weights that are integers bounded by a polynomial in n and any fixed $\epsilon>0$, we obtain a $(2+\epsilon, O(\sqrt{\log n \log k}))$-bicriteria approximation, namely, we partition the graph into parts whose weight is at most $2+\epsilon$ times that of a perfectly balanced part (simultaneously for both weight functions), and whose cut capacity is $O(\sqrt{\log n \log k})\cdot OPT$. For unbounded (exponential) vertex weights, we achieve approximation $(3, O(\log n))$.
Our algorithm generalizes to d weight functions as follows: For vertex weights that are integers bounded by a polynomial in $n$ and any fixed $\epsilon>0$, we obtain a $(2d +\epsilon, O(d\sqrt{\log n \log k}))$-bicriteria approximation. For unbounded (exponential) vertex weights, we achieve approximation $(2d+ 1, O(d\log n))$.
A useful approach to "compress" a large network G is to represent it with a flow-sparsifier, i.e., a small network H that supports the same flows as G, up to a factor $q \geq 1$ called the quality of sparsifier. Specifically, we assume the network G contains a set of k terminals T, shared with the network H, i.e., $T\subseteq V(G)\cap V(H)$, and we want H to preserve all multicommodity flows that can be routed between the terminals T. The challenge is to construct H that is small.
These questions have received a lot of attention in recent years, leading to some known tradeoffs between the sparsifier's quality $q$ and its size |V(H)|. Nevertheless, it remains an outstanding question whether every $G$ admits a flow-sparsifier H with quality q=1+ε, or even q=O(1), and size $|V(H)|\leq f(k,ε)$ (in particular, independent of |V(G)| and the edge capacities).
Making a first step in this direction, we present new constructions for several scenarios:
We consider the problem of Non-Uniform Graph Partitioning, where the input is an edge-weighted undirected graph G=(V,E) and k capacities n_1,...,n_k, and the goal is to find a partition {S_1,S_2,...,S_k} of V satisfying |S_j| \leq n_j for all 1\leq j\leq k, that minimizes the total weight of edges crossing between different parts. This natural graph partitioning problem arises in practical scenarios, and generalizes well-studied balanced partitioning problems such as Minimum Bisection, Minimum Balanced Cut, and Minimum k-Partitioning. Unlike these problems, Non-Uniform Graph Partitioning seems to be resistant to many of the known partitioning techniques, such as spreading metrics, recursive partitioning, and Raecke's tree decomposition, because k can be a function of n and the capacities could be of different magnitudes.
We present a bicriteria approximation algorithm for Non-Uniform Graph Partitioning that approximates the objective within O(log n) factor while deviating from the required capacities by at most a constant factor. Our approach is to apply stopping-time based concentration results to a simple randomized rounding of a configuration LP. These concentration bounds are needed as the commonly used techniques of bounded differences and bounded conditioned variances do not suffice.
Estimating the leading principal components of data assuming they are sparse, is a central task in modern high-dimensional statistics. Many algorithms were suggested for this sparse PCA problem, from simple diagonal thresholding to sophisticated semidefinite programming (SDP) methods. A key theoretical question asks under what conditions can such algorithms recover the sparse principal components. We study this question for a single-spike model, with a spike that is $\ell_0$-sparse, and dimension $p$ and sample size $n$ that tend to infinity. Amini and Wainwright (2009) proved that for sparsity levels $k\geq\Omega(n/\log p)$, no algorithm, efficient or not, can reliably recover the sparse eigenvector. In contrast, for sparsity levels $k\leq O(\sqrt{n/\log p})$, diagonal thresholding is asymptotically consistent.
It was further conjectured that the SDP approach may close this gap between computational and information limits. We prove that when $k \geq \Omega(\sqrt{n})$ the SDP approach, at least in its standard usage, cannot recover the sparse spike. In fact, we conjecture that in the single-spike model, no computationally-efficient algorithm can recover a spike of $\ell_0$-sparsity $k\geq \Omega(\sqrt{n})$. Finally, we present empirical results suggesting that up to sparsity levels $k=O(\sqrt{n})$, recovery is possible by a simple covariance thresholding algorithm.
Our main result is that the Steiner Point Removal (SPR) problem can always be solved with polylogarithmic distortion, which answers in the affirmative a question posed by Chan, Xia, Konjevod, and Richa (2006). Specifically, we prove that for every edge-weighted graph $G = (V,E,w)$ and a subset of terminals $T \subseteq V$, there is a graph $G'=(T,E',w')$ that is isomorphic to a minor of $G$, such that for every two terminals $u,v\in T$, the shortest-path distances between them in $G$ and in $G'$ satisfy $d_{G,w}(u,v) \le d_{G',w'}(u,v) \le O(\log^5|T|) \cdot d_{G,w}(u,v)$. Our existence proof actually gives a randomized polynomial-time algorithm.
Our proof features a new variant of metric decomposition. It is well-known that every finite metric space $(X,d)$ admits a $\beta$-separating decomposition for $\beta=O(\log |X|)$, which means that for every $\Delta>0$ there is a randomized partitioning of $X$ into clusters of diameter at most $\Delta$, satisfying the following separation property: for every $x,y \in X$, the probability they lie in different clusters of the partition is at most $\beta\,d(x,y)/\Delta$. We introduce an additional requirement in the form of a tail bound: for every shortest-path $P$ of length $d(P) \leq \Delta/\beta$, the number of clusters of the partition that meet the path $P$, denoted $Z_P$, satisfies $\Pr[Z_P > t] \le 2e^{-\Omega(t)}$ for all $t>0$.
We initiate the study of dimensionality reduction in general metric spaces in the context of supervised learning. Our statistical contribution consists of tight Rademacher bounds for Lipschitz functions in metric spaces that are doubling, or nearly doubling. As a by-product, we obtain a new theoretical explanation for the empirically reported improvements gained by pre-processing Euclidean data by PCA (Principal Components Analysis) prior to constructing a linear classifier. On the algorithmic front, we describe an analogue of PCA for metric spaces, namely an efficient procedure that approximates the data's intrinsic dimension, which is often much lower than the ambient dimension. Thus, our approach can exploit the dual benefits of low dimensionality: (1) more efficient proximity search algorithms, and (2) more optimistic generalization bounds.
Given a large edge-weighted network G with k terminal vertices, we wish to compress it and store, using little memory, the value of the minimum cut (or equivalently, maximum flow) between every bipartition of terminals. One appealing methodology to implement a compression of G is to construct a mimicking network: a small network G' with the same k terminals, in which the minimum cut value between every bipartition of terminals is the same as in G. This notion was introduced by Hagerup, Katajainen, Nishimura, and Ragde [JCSS '98], who proved that such G' of size at most $2^{2^k}$ always exists. Obviously, by having access to the smaller network G', certain computations involving cuts can be carried out much more efficiently.
We provide several new bounds, which together narrow the previously known gap from doubly-exponential to only singly-exponential, both for planar and for general graphs. Our first and main result is that every $k$-terminal planar network admits a mimicking network G' of size $O(k^2\cdot 2^{2k})$, which is moreover a minor of G. On the other hand, some planar networks $G$ require $|E(G')| \ge \Omega(k^2)$. For general networks, we show that certain bipartite graphs only admit mimicking networks of size $|V(G')| \geq 2^{\Omega(k)}$, and moreover, every data structure that stores the minimum cut value between all bipartitions of the terminals must use $2^{\Omega(k)}$ machine words.
We examine the efficiency of clustering a set of points, when the encompassing metric space may be preprocessed in advance. In computational problems of this genre, there is a first stage of preprocessing, whose input is a collection of points $M$; the next stage receives as input a query set $Q\subset M$, and should report a clustering of $Q$ according to some objective, such as 1-median, in which case the answer is a point $a\in M$ minimizing $\sum_{q\in Q} d_M(a,q)$.
We design fast algorithms that approximately solve such problems under standard clustering objectives like p-center and p-median, when the metric $M$ has low doubling dimension. By leveraging the preprocessing stage, our algorithms achieve query time that is near-linear in the query size $n=|Q|$, and is (almost) independent of the total number of points $m=|M|$.
The significant progress in constructing graph spanners that are sparse (small number of edges) or light (low total weight) has skipped spanners that are everywhere-sparse (small maximum degree). This disparity is in line with other network design problems, where the maximum-degree objective has been a notorious technical challenge. Our main result is for the Lowest Degree 2-Spanner (LD2S) problem, where the goal is to compute a 2-spanner of an input graph so as to minimize the maximum degree. We design a polynomial-time algorithm achieving approximation factor $\tilde O(\Delta^{3-2\sqrt{2}}) \approx \tilde O(\Delta^{0.172})$, where $\Delta$ is the maximum degree of the input graph. The previous $\tilde O(\Delta^{1/4})$ -approximation was proved nearly two decades ago by Kortsarz and Peleg [SODA 1994, SICOMP 1998].
Our main conceptual contribution is to establish a formal connection between LD2S and a variant of the Densest k-Subgraph (DkS) problem. Specifically, we design for both problems strong relaxations based on the Sherali-Adams linear programming (LP) hierarchy, and show that "faithful" randomized rounding of the DkS-variant can be used to round LD2S solutions. Our notion of faithfulness intuitively means that all vertices and edges are chosen with probability proportional to their LP value, but the precise formulation is more subtle.
Unfortunately, the best algorithms known for DkS use the Lov\'asz-Schrijver LP hierarchy in a non-faithful way [Bhaskara, Charikar, Chlamtac, Feige, and Vijayaraghavan, STOC 2010]. Our main technical contribution is to overcome this shortcoming, while still matching the gap that arises in random graphs by planting a subgraph with same log-density.
We introduce the following notion of compressing an undirected graph G with edge-lengths and terminal vertices $R\subseteq V(G)$. A distance-preserving minor is a minor G' (of G) with possibly different edge-lengths, such that $R\subseteq V(G')$ and the shortest-path distance between every pair of terminals is exactly the same in G and in G'. What is the smallest f*(k) such that every graph G with k=|R| terminals admits a distance-preserving minor G' with at most f*(k) vertices?
Simple analysis shows that $f*(k)\le O(k^4)$. Our main result proves that $f*(k)\ge \Omega(k^2)$, significantly improving over the trivial $f*(k)\ge k$. Our lower bound holds even for planar graphs G, in contrast to graphs G of constant treewidth, for which we prove that O(k) vertices suffice.
The Traveling Salesman Problem (TSP) is among the most famous NP-hard optimization problems. We design for this problem a randomized polynomial-time algorithm that computes a (1+epsilon)-approximation to the optimal tour, for any fixed epsilon>0, in TSP instances that form an arbitrary metric space with bounded intrinsic dimension.
The celebrated results of Arora [JACM 1998] and Mitchell [SICOMP 1999] prove that the above result holds in the special case of TSP in a fixed-dimensional Euclidean space. Thus, our algorithm demonstrates that the algorithmic tractability of metric TSP depends on the dimensionality of the space and not on its specific geometry. This result resolves a problem that has been open since the quasi-polynomial time algorithm of Talwar [STOC 2004].
We present a framework for performing efficient regression in general metric spaces. Roughly speaking, our regressor predicts the value at a new point by computing a Lipschitz extension --- the smoothest function consistent with the observed data --- after performing structural risk minimization to avoid overfitting. We obtain finite-sample risk bounds with minimal structural and noise assumptions, and a natural speed-precision tradeoff. The offline (learning) and online (prediction) stages can be solved by convex programming, but this naive approach has runtime complexity $O(n^3)$, which is prohibitive for large datasets. We design instead a regression algorithm whose speed and generalization performance depend on the intrinsic dimension of the data, to which the algorithm adapts. While our main innovation is algorithmic, the statistical results may also be of independent interest.
We study graph partitioning problems from a min-max perspective, in which an input graph on $n$ vertices should be partitioned into $k$ parts, and the objective is to minimize the maximum number of edges leaving a single part. The two main versions we consider are where the $k$ parts need to be of equal-size, and where they must separate a set of $k$ given terminals. We consider a common generalization of these two problems, and design for it an $O(\sqrt{\log n\log k})$-approximation algorithm. This improves over an $O(\log^2 n)$ approximation for the second version due to Svitkina and Tardos (2004), and roughly $O(k\log n)$ approximation for the first version that follows from other previous work. We also give an improved $O(1)$-approximation algorithm for graphs that exclude any fixed minor.
Along the way, we study the Unbalanced-Cut problem, whose goal is to find, in an input graph $G=(V,E)$, a set $S\subseteq V$ of size $|S|=\rho n$ that minimizes the number of edges leaving $S$. We provide a bicriteria approximation of $O(\sqrt{\log{n}\log{(1/\rho)}})$; when the input graph excludes a fixed-minor we improve this guarantee to $O(1)$. Note that the special case $\rho = 1/2$ is the well-known Minimum-Bisection problem, and indeed our bounds generalize those of Arora, Rao and Vazirani (2008) and of Klein, Plotkin, and Rao (1993). Our algorithms work also for the closely related Small-Set Expansion (SSE) problem, which asks for a set $S\subseteq V$ of size $0<|S| \leq \rho n$ with minimum edge-expansion, and was suggested recently by Raghavendra and Steurer (2010). In fact, our algorithm handles more general, weighted, versions of both problems. Previously, an $O(\log n)$ true approximation for both \UC and \SSE follows from Raecke (2008).
In STOC 2005, Indyk and Woodruff introduced a new technique that yielded a near-optimal space algorithm for $F_k$ moment estimation problem. Since then, the technique has inspired a number of advances in streaming algorithmics. We show that at least several of these results follow easily from the application of a single probabilistic technique, called Precision Sampling. Using it, we obtain simple streaming algorithms that maintain a randomized sketch of an input vector $x=(x_1,\ldots x_n)$, useful for the following applications:
Precision Sampling itself addresses the problem of estimating a sum $\sum_{i=1}^n a_i$ from weak estimates of each real $a_i\in[0,1]$. More precisely, one chooses in advance ``precisions'' $w_i\ge 1$ for each $i\in[n]$ and obtains an estimate of $a_i$ within additive $1/w_i$. The core question is: what is the best trade-off between the approximation to $\sum a_i$ and the total precision, $\sum_i w_i$ ? In previous work, we showed [Andoni, Krauthgamer, and Onak, FOCS 2010] that, as long as $\sum a_i=\Omega(1)$, one can achieve good multiplicative approximation using total precision of only $O(n\log n)$.
A natural requirement of many distributed structures is fault-tolerance: after some failures, whatever remains from the structure should still be effective for whatever remains from the network. In this paper we examine spanners of general graphs that are tolerant to vertex failures, and significantly improve their dependence on the number of faults $r$, for all stretch bounds.
For stretch $k \geq 3$ we design a simple transformation that converts every k-spanner construction with at most $f(n)$ edges into an $r$-fault-tolerant $k$-spanner construction with at most $O(r^3 \log n) \cdot f(2n/r)$ edges. Applying this to standard greedy spanner constructions gives r-fault tolerant k-spanners with $\tilde O(r^{2} n^{1+\frac{2}{k+1}})$ edges. The previous construction by Chechik, Langberg, Peleg, and Roddity [STOC 2009] depends similarly on n but exponentially on r (approximately like $k^r$).
For the case $k=2$ and unit-length edges, an $O(r \log n)$-approximation algorithm is known from recent work of Dinitz and Krauthgamer [arXiv 2010], where several spanner results are obtained using a common approach of rounding a natural flow-based linear programming relaxation. Here we use a different (stronger) LP relaxation and improve the approximation ratio to $O(\log n)$, which is, notably, independent of the number of faults r. We further strengthen this bound in terms of the maximum degree by using the Lovasz Local Lemma.
Finally, we show that most of our constructions are inherently local by designing equivalent distributed algorithms in the LOCAL model of distributed computation.
We examine directed spanners through flow-based linear programming relaxations. We design an $\tilde{O}(n^{2/3})$-approximation algorithm for the directed $k$-spanner problem that works for all $k\geq 1$, which is the first sublinear approximation for arbitrary edge-lengths. Even in the more restricted setting of unit edge-lengths, our algorithm improves over the previous $\tilde{O}(n^{1-1/k})$ approximation of Bhattacharyya, Grigorescu, Jung, Raskhodnikova, and Woodruff [SODA 2009] when $k\geq 4$. For the special case of $k=3$ we design a different algorithm achieving an $\tilde{O}(\sqrt{n})$-approximation, improving the previous $\tilde{O}(n^{2/3})$ of Elkin and Peleg [TCS 2005] and Bhattacharyya et al. [SODA 2009]. Both of our algorithms easily extend to the fault-tolerant setting, which has recently attracted attention but not from an approximation viewpoint. We also prove a nearly matching integrality gap of $\tilde{\Omega}(n^{1/3 - \epsilon})$ for any constant $\epsilon > 0$.
A virtue of all our algorithms is that they are relatively simple. Technically, we introduce a new yet natural flow-based relaxation, and show how to approximately solve it even when its size is not polynomial. The main challenge is to design a rounding scheme that ``coordinates'' the choices of flow-paths between the many demand pairs while using few edges overall. We achieve this, roughly speaking, by randomization at the level of vertices.
We give the first constant-factor approximation algorithm for Sparsest-Cut with general demands in bounded tr eewidth graphs. In contrast to previous algorithms, which rely on the flow-cut gap and/or metric embeddings, our approach exploits the Sherali-Adams hierarchy of linear programming relaxations.
Given a capacitated graph G = (V,E) and a set of terminals $K \subseteq V$, how should we produce a graph H only on the terminals K so that every (multicommodity) flow between the terminals in G could be supported in H with low congestion, and vice versa? (Such a graph H is called a flow-sparsifier for G.) What if we want H to be a ``simple'' graph? What if we allow H to be a convex combination of simple graphs?
Improving on results of Moitra [FOCS 2009] and Leighton and Moitra [STOC 2010], we give efficient algorithms for constructing: (a) a flow-sparsifier $H$ that maintains congestion up to a factor of $\alpha_{FHRT}$, where k = |K|. (b) a convex combination of trees over the terminals K that maintains congestion up to a factor of O(log k). (c) for a planar graph G, a convex combination of planar graphs that maintains congestion up to a constant factor. This requires us to give a new algorithm for the 0-extension problem, the first one in which the preimages of each terminal are connected in G. Moreover, this result extends to minor-closed families of graphs.
Our bounds immediately imply improved approximation guarantees for several terminal-based cut and ordering problems.
We introduce a new problem in the study of doubling spaces: Given a point set S and a target dimension d*, remove from S the fewest number of points so that the remaining set has doubling dimension at most d^*. We present a bicriteria approximation for this problem, and extend this algorithm to solve a group of proximity problems.
Recent advances in large-margin classification of data residing in general metric spaces (rather than Hilbert spaces) enable classification under various natural metrics, such as edit and earthmover distance. The general framework developed for this purpose by von Luxburg and Bousquet [JMLR, 2004] left open the question of computational efficiency and providing direct bounds on classification error.
We design a new algorithm for classification in general metric spaces, whose runtime and accuracy depend on the doubling dimension of the data points. It thus achieves superior classification performance in many common scenarios. The algorithmic core of our approach is an approximate (rather than exact) solution to the classical problems of Lipschitz extension and of Nearest Neighbor Search. The algorithm's generalization performance is established via the fat-shattering dimension of Lipschitz classifiers.
We present a near-linear time algorithm that approximates the edit distance between two strings within a polylogarithmic factor; specifically, for strings of length n and every fixed $\eps>0$, it can compute a $(\log n)^{O(1/\eps)}$-approximation in $n^{1+\eps}$ time. This is an {\em exponential} improvement over the previously known factor, $2^{\tilde O(\sqrt{\log n})}$, with a comparable running time [OR05,AO09].
Previously, no efficient polylogarithmic approximation algorithm was known for any computational task involving edit distance (e.g., nearest neighbor search or sketching).
This result arises naturally in the study of a new asymmetric query model. In this model, the input consists of two strings x and y, and an algorithm can access y in an unrestricted manner, while being charged for querying every symbol of x. Indeed, we obtain our main result by designing an algorithm that makes a small number of queries in this model. We then provide also a nearly-matching lower bound on the number of queries. Our lower bound is the first to expose hardness of edit distance stemming from the input strings being ``repetitive'', which means that many of their substrings are approximately identical. Consequently, our lower bound provides the first rigorous separation between edit distance and Ulam distance, which is edit distance on non-repetitive strings, i.e., permutations.
The
The snowflake metric $d^{1/2}$ of a doubling set
$S\subset
In fact, the target
dimension is polylogarithmic in the doubling constant. Our techniques are
robust and extend to the more difficult spaces
The quest for a Polynomial Time Approximation Scheme (PTAS) for Nash equilibrium in a two-player game, which emerged as a major open question in Algorithmic Game Theory, seeks to circumvent the PPAD-completeness of finding an (exact) Nash equilibrium by finding an approximate equilibrium. The closely related problem of finding an equilibrium maximizing a certain objective, such as the social welfare, was shown to be NP-hard [Gilboa and Zemel, Games and Economic Behavior, 1989]. However, this NP-hardness is unlikely to extend to approximate equilibria, since the latter admits a quasi-polynomial time algorithm [Lipton, Markakis and Mehta, In Proc. of EC, 2003].
We show that this optimization problem, namely, finding in a two-player game an approximate equilibrium achieving a large social welfare, is unlikely to have a polynomial time algorithm. One interpretation of our results is that a PTAS for Nash equilibrium (if it exists) should not extend to a PTAS for finding the best Nash equilibrium.
Technically, our result is a reduction from the notoriously difficult problem in modern Combinatorics, of finding a planted (but hidden) clique in a random graph G(n,1/2). Our reduction starts from an instance with planted clique size O(log n) For comparison, the currently known algorithms are effective only for a much larger clique size Omega(sqrt n).
We consider the k-balanced partitioning problem, where the goal is to partition the vertices of an input graph G into k equally sized components, while minimizing the total weight of the edges connecting different components. We allow k to be part of the input and denote the cardinality of the vertex set by n. This problem is a natural and important generalization of well-known graph partitioning problems, including minimum bisection and minimum balanced cut.
We present a (bi-criteria) approximation algorithm achieving
an approximation of O(sqrt{log n / log k}), which
matches or improves over previous algorithms for all relevant values of
k. Our algorithm uses a semidefinite relaxation which combines
We consider the problem of approximate nearest neighbors in high dimensions, when the queries are lines. In this problem, given n points in R^d, we want to construct a data structure to support efficiently the following queries: given a line L, report the point p closest to L. This problem generalizes the more familiar nearest neighbor problem. From a practical perspective, lines, and low-dimensional flats in general, may model data under linear variation, such as physical objects under different lighting.
For approximation 1+epsilon, we achieve a query time of O(n^{0.5+t}), for arbitrary small t>0, with a space of n^{O(1/epsilon^2+1/t^2)}. To the best of our knowledge, this is the first algorithm for this problem with polynomial space and sub-linear query time.
A common approach for solving computational problems over a difficult metric space is to embed the ``hard'' metric into L_1, which admits efficient algorithms and is thus considered an ``easy'' metric. This approach has proved successful or partially successful for important spaces such as the edit distance, but it also has inherent limitations: it is provably impossible to go below certain approximation for some metrics.
We propose a new approach, of embedding the difficult space into richer host spaces, namely iterated products of standard spaces like L_1 and L_\infty. We show that this class is rich since it contains useful metric spaces with only a constant distortion, and, at the same time, it is tractable and admits efficient algorithms. Using this approach, we obtain for example the first nearest neighbor data structure with O(log log d) approximation for edit distance in non-repetitive strings (the Ulam metric). This approximation is exponentially better than the lower bound for embedding into L_1. Furthermore, we give constant factor approximation for two other computational problems. Along the way, we answer positively a question posed in [Ajtai, Jayram, Kumar, and Sivakumar, STOC 2002]. One of our algorithms has already found applications for smoothed edit distance over 0-1 strings [Andoni and Krauthgamer, ICALP 2008].
We initiate the study of the smoothed complexity of sequence alignment, by proposing a semi-random model of edit distance between two input strings, generated as follows. First, an adversary chooses two binary strings of length d and a longest common subsequence A of them. Then, every character is perturbed independently with probability p, except that A is perturbed in exactly the same way inside the two strings.
We design two efficient algorithms that compute the edit distance on smoothed instances up to a constant factor approximation. The first algorithm runs in near-linear time, namely d^{1+epsilon} for any fixed epsilon>0. The second one runs in time sublinear in d, assuming the edit distance is not too small. These approximation and runtime guarantees are significantly better then the bounds known for worst-case inputs, e.g. near-linear time algorithm achieving approximation roughly d^{1/3}, due to Batu, Ergun, and Sahinalp [SODA 2006].
Our technical contribution is twofold. First, we rely on finding matches between substrings in the two strings, where two substrings are considered a match if their edit distance is relatively small, a prevailing technique in commonly used heuristics, such as PatternHunter of Ma, Tromp and Li [Bioinformatics, 2002]. Second, we effectively reduce the smoothed edit distance to a simpler variant of (worst-case) edit distance, namely, edit distance on permutations (a.k.a. Ulam's metric). We are thus able to build on algorithms developed for the Ulam metric, whose much better algorithmic guarantees usually do not carry over to general edit distance.
A common technique for processing conjunctive queries is to first match each predicate separately using an index lookup, and then compute the intersection of the resulting row-id lists, via an AND-tree. The performance of this technique depends crucially on the order of lists in this tree: it is important to compute early the intersections that will produce small results. But this optimization is hard to do when the data or predicates have correlation.
We present a new algorithm for ordering the lists in an AND-tree by sampling the intermediate intersection sizes. We prove that our algorithm is near-optimal and validate its effectiveness experimentally on datasets with a variety of distributions.
How should a seller price his goods in a market where each buyer prefers a single good among his desired goods, and will buy the cheapest such good, as long as it is within his budget? We provide efficient algorithms that compute near-optimal prices for this problem, focusing on a commodity market, where the range of buyer budgets is small. We also show that our technique (which is based on LP-rounding) easily extends to a different scenario, in which the buyers want to buy all the desired goods, as long as they are within budget.
We design approximation algorithms for a number of fundamental optimization problems in metric spaces, namely computing separating and padded decompositions, sparse covers, and metric triangulations. Our work is the first to emphasize relative guarantees that compare the produced solution to the optimal one for the input at hand. By contrast, the extensive previous work on these topics has sought absolute bounds that hold for every possible metric space (or for a family of metrics). While absolute bounds typically translate to relative ones, our algorithms provide significantly better relative guarantees, using a rather different algorithm.
Our technical approach is to cast a number of metric clustering problems that have been well studied---but almost always as disparate problems---into a common modeling and algorithmic framework, which we call the consistent labeling problem. Having identified the common features of all of these problems, we provide a family of linear programming relaxations and simple randomized rounding procedures that achieve provably good approximation guarantees.
We prove the first non-trivial communication complexity lower bound for the problem of estimating the edit distance (aka Levenshtein distance) between two strings. A major feature of our result is that it provides the first setting in which the complexity of computing the edit distance is provably larger than that of Hamming distance.
Our lower bound exhibits a trade-off between approximation and communication, asserting, for example, that protocols with O(1) bits of communication can only obtain approximation Omega(log d/loglog d), where d is the length of the input strings. This case of O(1) communication is of particular importance, since it captures constant-size sketches as well as embeddings into spaces like L_1 and squared-L_2, two prevailing algorithmic approaches for dealing with edit distance. Furthermore, the bound holds not only for strings over alphabet Sigma={0,1}, but also for strings that are permutations (called the Ulam metric).
Besides being applicable to a much richer class of algorithms than all previous results, our bounds are near-tight in at least one case, namely of embedding permutations into L_1. The proof uses a new technique, that relies on Fourier analysis in a rather elementary way.
The Earth Mover Distance (EMD) between two equal-size sets of points in R^d is defined to be the minimum cost of a bipartite matching between the two pointsets. It is a natural metric for comparing sets of features, and as such, it has received significant interest in computer vision. Motivated by recent developments in that area, we address computational problems involving EMD over high-dimensional pointsets.
A natural approach is to embed the EMD metric into l_1, and use the algorithms designed for the latter space. However, Khot and Naor [FOCS 2005] show that any embedding of EMD over the d-dimensional Hamming cube into l_1 must incur a distortion Omega(d), thus practically losing all distance information. We circumvent this roadblock by focusing on sets with cardinalities upper-bounded by a parameter s, and achieve a distortion of only O(log s log d). Since in applications the feature sets have bounded size, the resulting distortion is much smaller than the Omega(d) lower bound. Our approach is quite general and easily extends to EMD over R^d.
We then provide a strong lower bound on the multi-round communication complexity of estimating EMD, which in particular strengthens the known non-embeddability result of Khot and Naor. Our bound exhibits a smooth tradeoff between approximation and communication, and for example implies that every algorithm that estimates EMD using constant size sketches can only achieve Omega(log s) approximation.
Network triangulation is a method for estimating distances between nodes in the network, by letting every node measure its distance to a few beacon nodes, and deducing the distance between every two nodes x,y by using their measurements to their common beacons and applying the triangle inequality. Kleinberg, Slivkins and Wexler [FOCS 2004] initiated a theoretical study of triangulation in metric spaces, and Slivkins [PODC 2005] subsequently showed that metrics of bounded doubling dimension admit a triangulation that approximates arbitrarily well all pairwise distances using only O(log n) beacons per point, where n is the number of points in the network. He then asked whether this term is necessary (for doubling metrics).
We provide the first lower bounds on the number of beacons required for a triangulation in some specific simple networks. In particular, these bounds (i) answer Slivkins' question positively, even for one-dimensional metrics, and (ii) prove that, up to constants, Slivkins' triangulation achieves an optimal number of beacons (as a function of the approximation guarantee and the doubling dimension).
The distance to monotonicity of a sequence is the minimum number of edit operations required to transform the sequence into an increasing order; this measure is complementary to the length of the longest increasing subsequence (LIS). We address the question of estimating these quantities in the one-pass data stream model and present the first sub-linear space algorithms for both problems.
We first present O(sqrt{n})-space deterministic algorithms that approximate the distance to monotonicity and the LIS to within a factor that is arbitrarily close to $1$. We also show a lower bound of Omega(n) on the space required by any randomized algorithm to compute the LIS (or alternatively the distance from monotonicity) exactly, demonstrating that approximation is necessary for sub-linear space computation; this bound improves upon the existing lower bound of Omega(sqrt{n}) [Liben-Nowell, Vee, and Zhu, J. Combinatorial Optimization, 2006].
Our main result is a randomized algorithm that uses only O(log^2 n) space and approximates the distance to monotonicity to within a factor that is arbitrarily close to 4. In contrast, we believe that any significant reduction in the space complexity for approximating the length of the LIS is considerably hard. We conjecture that any deterministic 1 + epsilon approximation algorithm for LIS requires Omega(sqrt{n}) space, and as a step towards this conjecture, prove a space lower bound of $\Omega(\sqrt n)$ for a restricted yet natural class of deterministic algorithms.
We initiate the study of approximate algorithms on negatively curved spaces. These spaces have recently become of interest in various domains of computer science including networking and vision. The classical example of such a space is the real-hyperbolic space H^d for d>=2, but our approach applies to a more general family of spaces characterized by Gromov's (combinatorial) hyperbolic condition. We give efficient algorithms and data structures for problems like approximate nearest-neighbor search and compact, low-stretch routing on subsets of negatively curved spaces of fixed dimension (including H^d as a special case). In a different direction, we show that there is a PTAS for the Traveling Salesman Problem when the set of cities lie, for example, in H^d. This generalizes Arora's results for R^d.
Most of our algorithms use the intrinsic distance geometry of the data set, and only need the existence of an embedding into some negatively curved space in order to function properly. In other words, our algorithms regard the inter-point distance function as a black box, and are independent of the representation of the input points.
Edit distance is a fundamental measure of distance between strings, the extensive study of which has recently focused on computational problems such as nearest neighbor search, sketching and fast approximation. A very powerful paradigm is to map the metric space induced by the edit distance into a normed space (e.g., ℓ_{1}) with small distortion, ℓ_{1} and then use the rich algorithmic toolkit known for normed spaces. Although the minimum distortion required to embed edit distance into ℓ_{1} has received a lot of attention lately, there is a large gap between known upper and lower bounds. We make progress on this question by considering large, well-structured, submetrics of the edit distance metric space.
Our main technical result is that the Ulam metric, namely, the edit distance on permutations of length at most n, embeds into ℓ_{1} with distortion O(log n). This immediately leads to sketching algorithms with constant size sketches, and to efficient approximate nearest neighbor search algorithms, with approximation factor O(log n). The embedding and its algorithmic consequences present a big improvement over those previously known for the Ulam metric, and they are significantly better than the state of the art for edit distance in general. Further, we extend these results for the Ulam metric to edit distance on strings that are (locally) non-repetitive, i.e., strings where (close by) substrings are distinct.
We simplify and improve upon recent lower bounds on the minimum distortion of embedding certain finite metric spaces into L_1. In particular, we show that for infinitely many values of n there are n-point metric spaces of negative type that require a distortion of Omega(log log n) for such an embedding, implying the same lower bound on the integrality gap of a well-known SDP relaxation for Sparsest-Cut. This result builds upon and improves the recent lower bound of (log log n)^{1/6-o(1)} due to Khot and Vishnoi [STOC 2005]. We also show that embedding the edit distance on {0,1}^n into L_1 requires a distortion of Omega(log n). This result simplifies and improves a very recent lower bound due to Khot and Naor [FOCS 2005].
We show that the Multicut, Sparsest-Cut, and Min-2CNF= Deletion problems are NP-hard to approximate within every constant factor, assuming the Unique Games Conjecture of Khot [STOC, 2002]. A quantitatively stronger version of the conjecture implies an inapproximability factor of $\Omega(\sqrt{\log\log n})$.
We address the problems of pattern matching and approximate pattern matching in the sketching model. We show that it is impossible to compress the text into a small sketch and use only the sketch to decide whether a given pattern occurs in the text. We also prove a sketch size lower bound for approximate pattern matching, and show it is tight up to a logarithmic factor.
Edit distance has been extensively studied for the past several years. Nevertheless, no linear-time algorithm is known to compute the edit distance between two strings, or even to approximate it to within a modest factor. Furthermore, for various natural algorithmic problems such as low-distortion embeddings into normed spaces, approximate nearest-neighbor schemes, and sketching algorithms, known results for the edit distance are rather weak. We develop algorithms that solve gap versions of the edit distance problem: given two strings of length $n$ with the promise that their edit distance is either at most $k$ or greater than $l$, decide which of the two holds. We present two sketching algorithms for gap versions of edit distance. Our first algorithm solves the $k$ vs. $(k n)^{2/3}$ gap problem, using a constant size sketch. A more involved algorithm solves the stronger $k$ vs. $l$ gap problem, where $l$ can be as small as $O(k^2)$---still with a constant sketch---but works only for strings that are mildly ``non-repetitive''. Finally, we develop an $n^{3/7}$-approximation quasi-linear time algorithm for edit distance, improving the previous best factor of $n^{3/4}$ due to Cole and Hariharan [SIAM J. on Computing, 2002]; if the input strings are assumed to be non-repetitive, then the approximation factor can be strengthened to $n^{1/3}$.
We devise a new embedding technique, which we call measured descent, based on decomposing a metric space locally, at varying speeds, according to the density of some probability measure. This provides a refined and unified framework for the two primary methods of constructing Fr\'echet embeddings for finite metrics, due to [Bourgain, 1985] and [Rao, 1999]. We prove that any $n$-point metric space $(X,d)$ embeds in Hilbert space with distortion $O(\sqrt{\alpha_X \cdot \log n})$, where $\alpha_X$ is a geometric estimate on the decomposability of $X$. An an immediate corollary, we obtain an $O(\sqrt{\log \lambda_X \log n})$ distortion embedding, where $\lambda_X$ is the doubling constant of $X$. Since $\lambda_X\le n$, this result recovers Bourgain's theorem, but when the metric $X$ is, in a sense, ``low-dimensional,'' improved bounds are achieved.
Our embeddings are volume-respecting for subsets of arbitrary size. One consequence is the existence of $(k, \log n)$ volume-respecting embeddings for all $1 \leq k \leq n$, which is the best possible, and answers positively a question posed in [Feige, 1998]. Our techniques are also used to answer positively a question of Y. Rabinovich, showing that any weighted $n$-point planar graph embeds in $\ell_\infty^{O(\log n)}$ with $O(1)$ distortion. The $O(\log n)$ bound on the dimension is optimal, and improves upon the previously known bound of $O(\log n)^2$.
We define a natural notion of efficiency for approximate nearest-neighbor (ANN) search in general $n$-point metric spaces, namely the existence of a randomized algorithm which answers (1+ε)-approximate nearest neighbor queries in polylog(n) time using only polynomial space. We then study which families of metric spaces admit efficient ANN schemes in the black-box model, where only oracle access to the distance function is given, and any query consistent with the triangle inequality may be asked.
For ε < 2/5, we offer a complete answer to this problem. Using the notion of metric dimension defined in \cite{GKL03} (\`a la [Assouad, 1983]), we show that a metric space $X$ admits an efficient (1+ε)-ANN scheme for any ε < 2/5 if and only if $\dim(X) = O(\log \log n)$. For coarser approximations, clearly the upper bound continues to hold, but there is a threshold at which our lower bound breaks down---this is precisely when points in the ``ambient space'' may begin to affect the complexity of ``hard'' subspaces $S \subseteq X$. Indeed, we give examples which show that $\dim(X)$ does not characterize the black-box complexity of ANN above the threshold.
Our scheme for ANN in low-dimensional metric spaces is the first to yield efficient algorithms without relying on any heuristic assumptions on the input. In previous approaches (e.g., [Clarkson, 1999], [Karger and Ruhl, 2002], [Krauthgamer and Lee, 2004]) even spaces with $\dim(X) = O(1)$ sometimes required $\Omega(n)$ query times.
Aggregate statistical data about the web is very useful in numerous scenarios, such as market research, intelligence, and social studies. In many of these applications, one is interested not in generic data about the whole web but rather in highly focused information pertinent to a specific domain or topic. Furthermore, timely acquisition of this information provides a competitive advantage.
Focused statistical data can be gathered by a brute force crawl of the whole web, or by a "focused crawl", which collects mainly pages that are relevant to the topic of interest. Crawling, however, is an expensive enterprise, requiring substantial resources. For the purpose of gathering statistical data, random sampling of web pages is a much faster, cheaper, and even more reliable approach. We develop the first efficient method for generating a random sample of web pages relevant to a given user-specified topic. Previously, techniques for getting only an unfocused sample of pages from the whole web were known. Our method is based on a new random walk on (a modified version of) a subgraph of the web graph.
We present a simple deterministic data structure for maintaining a set $S$ of points in a general metric space, while supporting proximity search (nearest neighbor and range queries) and updates to $S$ (insertions and deletions). Our data structure consists of a sequence of progressively finer $\epsilon$-nets of $S$, with pointers that allow us to navigate easily from one scale to the next. We analyze the worst-case complexity of this data structure in terms of the ``abstract dimensionality'' of the metric $S$. Our data structure is extremely efficient for metrics of bounded dimension and is essentially optimal in a certain model of distance computation. Finally, as a special case, our approach improves over one recently devised by Karger and Ruhl [STOC, 2002].
Given a metric space $(X,d)$, a natural distance measure on probability distributions over $X$ is the {\em earthmover metric}. We use randomized rounding of earthmover metrics to devise new approximation algorithms for two well-known classification problems, namely, metric labeling and 0-extension.
Our first result is for the 0-extension problem. We show that if the terminal metric is decomposable with parameter $\alpha$ (e.g., planar metrics are decomposable with $\alpha=O(1)$), then the earthmover based linear program (for $0$-extension) can be rounded to within an $O(\alpha)$ factor.
Our second result is an $O(\log n)$-approximation for metric labeling, using probabilistic tree embeddings in a way very different from the $O(\log k)$-approximation of Kleinberg and Tardos [JACM, 2002]. (Here, $n$ is the number of nodes, and $k$ is the number of labels.) The key element is rounding the earthmover based linear program (for metric labeling) without increasing the solution's cost, when the input graph is a tree. This rounding method also provides an alternate proof to a result stated in Chekuri et al. [SODA, 2001], that the earthmover based linear program is integral when the input graph is a tree.
Our simple and constructive rounding techniques contribute to the understanding of earthmover metrics and may be of independent interest.
In the Asymmetric $k$-Center problem, the input is an integer $k$ and a complete digraph over $n$ points together with a distance function obeying the directed triangle inequality. The goal is to choose a set of $k$ points to serve as centers and to assign all the points to the centers, so that the maximum distance of any point from its center is as small as possible.
We show that the Asymmetric $k$-Center problem is hard to approximate up to a factor of $\log^* n - O(1)$ unless $NP \subseteq DTIME(n^{\log \log n})$. Since an $O(\log^* n)$-approximation algorithm is known for this problem, this resolves the asymptotic approximability of Asymmetric $k$-Center. This is the first natural problem whose approximability threshold does not polynomially relate to the known approximation classes. We also resolve the approximability threshold of the metric (symmetric) $k$-Center problem with costs.
The doubling constant of a metric space $(X,d)$ is the smallest value $\lambda$ such that every ball in $X$ can be covered by $\lambda$ balls of half the radius. The doubling dimension of $X$ is then defined as $dim(X) = \log_2 \lambda$. A metric (or sequence of metrics) is called doubling precisely when its doubling dimension is bounded. This is a robust class of metric spaces which contains many families of metrics that occur in applied settings.
We give tight bounds for embedding doubling metrics into (low-dimensional) normed spaces. We consider both general doubling metrics, as well as more restricted families such as those arising from trees, from graphs excluding a fixed minor, and from snowflaked metrics. Our techniques include decomposition theorems for doubling metrics, and an analysis of a fractal in the plane due to Laakso [Bull. London Math. Soc., 2002]. Finally, we discuss some applications and point out a central open question regarding dimensionality reduction in $L_2$.
Motivation: Comparing two protein databases is a fundamental task in biosequence annotation. Given two databases, one must find all pairs of proteins that align with high score under a biologically meaningful substitution score matrix, such as a BLOSUM matrix (Henikoff and Henikoff, 1992). Distance-based approaches to this problem map each peptide in the database to a point in a metric space, such that peptides aligning with higher scores are mapped to closer points. Many techniques exist to discover close pairs of points in a metric space efficiently, but the challenge in applying this work to proteomic comparison is to find a distance mapping that accurately encodes all the distinctions among residue pairs made by a proteomic score matrix. Buhler (2002) proposed one such mapping but found that it led to a relatively inefficient algorithm for protein-protein comparison.
Results: This work proposes a new distance mapping for peptides under the BLOSUM matrices that permits more efficient similarity search. We first propose a new distance function on peptides derived from a given score matrix. We then show how to map peptides to bit vectors such that the distance between any two peptides is closely approximated by the Hamming distance (i.e., number of mismatches) between their corresponding bit vectors. We combine these two results with the LSH-ALL-PAIRS-SIM algorithm of Buhler (2002) to produce an improved distance-based algorithm for proteomic comparison. An initial implementation of the improved algorithm exhibits sensitivity within 5% of that of the original LSH-ALL-PAIRS-SIM, while running up to eight times faster.
We devise the first constant factor approximation algorithm for minimum quotient vertex-cuts in planar graphs. Our algorithm achieves approximation ratio $3.5 (1+\epsilon)^2$ with running time $O(W n^{3+2/\epsilon})$. The approximation ratio improves to $\frac{4}{3} (1+\epsilon)$ if there is an optimal quotient vertex-cut $(A^*,B^*,C^*)$ where $C^*$ has relatively small weight compared to $A^*$ and $B^*$; this holds, for example, when the input graph has uniform (or close to uniform) weight and cost. The approximation ratio further improves to $1+\epsilon$ if, in addition, $\min( w(A^*),w(B^*) ) \le \frac{1}{3} W$.
We use our quotient cut algorithm to find the first constant-factor pseudo-approximation for vertex separators in planar graphs.
Our technical contribution is two-fold. First, we prove a structural theorem for vertex-cuts in planar graphs, showing the existence of a near-optimal vertex-cut whose high-level structure is that of a bounded-depth tree. Second, we develop an algorithm for optimizing over such complex structures, whose running time depends (exponentially) not on the size of the structure, but rather only on its depth. These techniques may be applicable in other problems.
We resolve the following conjecture raised by Levin together with Linial, London, and Rabinovich [Combinatorica, 1995]. Let $\dim(G)$ be the smallest $d$ such that $G$ occurs as a (not necessarily induced) subgraph of the infinite graph $Z_\infty^d$, whose vertex set is $Z^d$ and which has an edge $(u,v)$ whenever $||u-v||_\infty = 1$. The growth rate of $G$, denoted $\rho_G$, is the minimum $\rho$ such that every ball of radius $r>1$ in $G$ contains at most $r^\rho$ vertices. By simple volume arguments, $\dim(G) = \Omega(\rho_G)$. Levin conjectured that this lower bound is tight, i.e., that $\dim(G) = O(\rho_G)$ for every graph $G$.
We show that a weaker form of Levin's conjecture, namely, $\dim(G) = O(\rho_G \log \rho_G)$, holds for every graph $G$. We disprove, however, the specific bound of the conjecture and show that our upper bound is tight by exhibiting graphs for which $\dim(G) =\Omega(\rho_G \log \rho_G)$. For families of graphs which exclude a fixed minor, we salvage the strong form, showing that $\dim(G) = O(\rho_G)$. This holds also for graphs without long induced simple cycles. Our results extend to a variant of the conjecture for finite-dimensional Euclidean spaces due to Linial [Haifa Workshop, 2002].
We provide the first hardness result for a polylogarithmic approximation ratio (for a natural NP-hard optimization problem). We show that for every fixed $\epsilon>0$, the Group Steiner Tree problem admits no efficient $\log^{2-\epsilon} k$--approximation, where $k$ denotes the number of groups, unless NP has quasi-polynomial Las-Vegas algorithms. This result holds even for input graphs which are Hierarchically Well-Separated Trees, introduced by Bartal [FOCS, 1996], in which case our bound is nearly tight with the $O(\log^2 k)$--approximation currently known. Our results imply that for every fixed $\epsilon>0$, the Directed Steiner Tree problem admits no $\log^{2-\epsilon} n$--approximation, where $n$ is the number of vertices in the graph, under the same complexity assumption.
We present an $\Omega(\log^2 k)$ lower bound on the integrality ratio of the flow-based relaxation for the Group Steiner Tree problem, where $k$ denotes the number of groups; this holds even for input graphs that are Hierarchically Well-Separated Trees, introduced by Bartal [\textit{Symp.\ Foundations of Computer Science}, pp.\ 184--193, 1996], in which case this lower bound is tight. This relaxation appears to be the only one that have been studied for the problem, as well as for its generalization, the Directed Steiner Tree problem. For the latter problem, our results imply an $\Omega(\frac{\log^2 n}{(\log \log n)^2})$ integrality ratio, where $n$ is the number of vertices in the graph. For both problems, this is the first known lower bound on the integrality ratio that is superlogarithmic in the input size. We also show algorithmically that the integrality ratio for Group Steiner Tree is much better for certain families of instances, which helps pinpoint the types of instances that appear to be most difficult to approximate.
Data dimensionality is a crucial issue in a variety of settings, where it is desirable to determine whether a data set given in a high-dimensional space adheres to a low-dimensional structure. We study this problem in the framework of property testing: Given a query access to a data set $S$, we wish to determine whether $S$ is low-dimensional, or whether it should be modified significantly in order to have the property. Allowing a constant probability of error, we aim at algorithms whose complexity does not depend on the size of $S$.
We present algorithms for testing the low-dimensionality of a set of vectors and for testing whether a matrix is of low rank. We then address low-dimensionality in metric spaces. For vectors in the metric space $l_1$, we show that low-dimensionality is not testable. For $l_2$, we show that a data set can be tested for having a low-dimensional structure, but that the property of approximately having such a structure is not testable.
In the survivable network design problem (SNDP), the goal is to find a minimum-cost subgraph satisfying certain connectivity requirements. We study the vertex-connectivity variant of SNDP in which the input specifies, for each pair of vertices, a required number of vertex-disjoint paths connecting them.
We give the first lower bound on the approximability of SNDP, showing that the problem admits no efficient $2^{\log^{1-\epsilon} n}$ ratio approximation for any fixed $\epsilon>0$ unless $NP\subseteq DTIME(n^{\polylog(n)})$. We also show hardness of approximation results for several important special cases of SNDP, including constant factor hardness for the $k$-vertex connected spanning subgraph problem ($k$-VCSS) and for the vertex-connectivity augmentation problem, even when the edge costs are severely restricted.
The extensive study of metric spaces and their embeddings has so far focused on embeddings that preserve pairwise distances. A very intriguing concept introduced by Feige [JCSS, 2000] allows us to quantify the extent to which larger structures are preserved by a given embedding. We investigate this concept, focusing on several major graph families such as paths, trees, cubes and expanders. We find some similarities to the regular (pairwise) distortion, as well as some striking differences.
Lovasz and Schrijver (SIAM J. Optim., 1991) devised a lift-and-project method that produces a sequence of convex relaxations for the problem of finding in a graph an independent set (or a clique) of maximum size. Each relaxation in the sequence is tighter than the one before it, while the first relaxation is already at least as strong as the Lovasz theta function (IEEE Trans. Inform. Theory, 1979}. We show that on a random graph $G_{n,1/2}$, the value of the $r$th relaxation in the sequence is roughly $\sqrt{n/2^r}$, almost surely. It follows that for those relaxations known to be efficiently computable, namely for $r=O(1)$, the value of the relaxation is comparable to the theta function. Furthermore, a perfectly tight relaxation is almost surely obtained only at the $r=\Theta(\log n)$ relaxation in the sequence.
The notion of private approximation was introduced recently by Feigenbaum, Fong, Strauss and Wright. Informally, a private approximation of a function $f$ is another function $F$ that approximates $f$ in the usual sense, but does not yield any information on $x$ other than what can be deduced from $f(x)$. As such, $F(x)$ is useful for private computation of $f(x)$ (assuming that $F$ can be computed more efficiently than $f$).
In this work we examine the properties and limitations of this new notion. Specifically, we show that for many NP-hard problems, the privacy requirement precludes non-trivial approximation. This is the case even for problems that otherwise admit very good approximation (e.g., problems with PTAS). On the other hand, we show that slightly relaxing the privacy requirement, by means of leaking ``just a few bits of information'' about $x$, again permits good approximation.
A web content hosting service provider needs to dynamically allocate servers in a server farm to its customers' web sites. Ideally, the allocation to a site should always suffice to handle its load. However, due to a limited number of servers and the overhead incurred in changing the allocation of a server from one site to another, the system may become overloaded. The problem faced by the web hosting service provider is how to allocate the available servers in the most profitable way. Adding to the complexity of this problem is the fact that future loads of the sites are either unknown or known only for the very near future.
In this paper we model this server allocation problem, and consider both its offline and online versions. We give a polynomial time algorithm for computing the optimal offline allocation. In the online setting, we show almost optimal algorithms (both deterministic and randomized) for any positive lookahead. The quality of the solution improves as the lookahead increases. We also consider several special cases of practical interest. Finally, we present some experimental results using actual trace data that show that one of our online algorithm performs very close to optimal.
Interestingly, the online server allocation problem can be cast as a more general benefit task system that we define. Our results extend to this task system, which captures also the benefit maximization variants of the $k$-server problem and the metrical task system problem. It follows that the benefit maximization variants of these problems are more tractable than their cost minimization variants.
The achromatic number problem is to legally color the vertices of an input graph with the maximum number of colors, denoted $\psi^*$, so that every two color classes share at least one edge. This problem is known to be NP-hard.
For general graphs we give an algorithm that approximates the achromatic number within ratio of $O(n \log\log n/\log n)$. This improves over the previously known approximation ratio of $O(n/\sqrt{\log n})$, due to Chaudhary and Vishwanathan (SODA, 1997).
For graphs of girth at least 5 we give an algorithm with approximation ratio $O(\min\{n^{1/3},\sqrt{\psi^*}\})$. This improves over an approximation ratio $O(\sqrt{\psi^*})=O(n^{3/8})$ for the more restricted case of graphs with girth at least 6, due to Krista and Lorys (ESA 1999).
We also give the first hardness result for approximating the achromatic number. We show that for every fixed $\epsilon>0$ there is no $2-\epsilon$ approximation algorithm, unless $P=NP$.
A bisection of a graph with $n$ vertices is a partition of its vertices into two sets, each of size $n/2$. The bisection cost is the number of edges connecting the two sets. Finding the bisection of minimum cost is NP-hard. We present an algorithm that finds a bisection whose cost is within ratio of $O(\log^2 n)$ from the optimal. For graphs excluding any fixed graph as a minor (e.g. planar graphs) we obtain an improved approximation ratio of $O(\log n)$. The previously known approximation ratio for bisection was roughly $\sqrt{n}$.
A bisection of a graph with $n$ vertices is a partition of its vertices into two sets, each of size $n/2$. The bisection size is the number of edges connecting the two sets. Finding the bisection of minimum size is NP-hard. We present an algorithm that finds a bisection that is within $O(\sqrt{n}\log n)$ of optimal. No sublinear approximation ratio for bisection was previously known.
We consider the problem of finding in an undirected graph a minimum cut that separates exactly a given number $k$ of vertices. For general $k$ (i.e. $k$ is part of the input and may depend on $n$) this problem is NP-hard.
We present for this problem a randomized approximation algorithm, which is useful when $k$ is relatively small. In particular, for $k = O(\log n)$ we obtain a PTAS (polynomial time approximation scheme), and for $k=\Omega(\log n)$ we obtain an approximation ratio $O(k/\log n)$.
The motivation for our work is the observation that Web pages on a particular topic are often linked to other pages on the same topic. We model and analyze the problem of how to improve the classification of Web pages (that is, determining the topic of the page) by using link information. In our setting, an initial classifier examines the text of a Web page and assigns to it some classification, possibly mistaken. We investigate how to reduce the error probability using the observation above, thus building an improved classifier. We present a theoretical framework for this problem based on a random graph model and suggest two linear time algorithms, based on similar methods that have been proven effective in the setting of error-correcting codes. We provide simulation results to verify our analysis and to compare the performance of our suggested algorithms.
Alon, Krivelevich and Sudakov (Random Structures and Algorithms, 1998) designed an algorithm based on spectral techniques that almost surely finds a clique of size $\Omega(\sqrt{n})$ hidden in an otherwise random graph. We show that a different algorithm, based on the Lovasz theta function, almost surely both finds the hidden clique and certifies its optimality. Our algorithm has an additional advantage of being more robust: it also works in a semi-random hidden clique model, in which an adversary can remove edges from the random portion of the graph.
Hot-potato routing is a form of synchronous routing which makes no use of buffers at intermediate nodes. Packets must move at every time step, until they reach their destination. If contention prevents a packet from taking its preferred outgoing edge, it is deflected on a different edge. Two simple design principles for hot potato routing algorithms are minimum advance, that advances at least one packet towards its destination from every nonempty node (and possibly deflects all other packets), and maximum advance, that advances the maximum possible number of packets.
Livelock is a situation in which packets keep moving indefinitely in the network without any packet ever reaching its destination. It is known that even maximum advance algorithms might livelock on some networks. We show that minimum advance algorithms never livelock on tree networks, and that maximum advance algorithms never livelock on triangulated networks.
A stereoscopic family of permutations maps an $m$-dimensional mesh into several one-dimensional lines, in a way that jointly preserves distance information. Specifically, consider any two points and denote their distance on the $m$-dimensional mesh by $d$. Then the distance between their images, on the line on which these images are closest together, is $O(d^m)$.
We initiate a systematic study of stereoscopic families of permutations. We show a construction of these families that involves the use of $m+1$ images. We also show that under some additional restrictions (namely, adjacent points on the image lines originate at points which are not too far away on the mesh), three images are necessary in order to construct such a family for the two-dimensional mesh.
We present two applications for stereoscopic families of permutations. One application is an algorithm for routing on the mesh that guarantees delivery of each packet within a number of steps that depends upon the distance between this packet's source and destination, but is independent of the size of the mesh. Our algorithm is exceptionally simple, involves no queues, and can be used in dynamic settings in which packets are continuously generated. Another application is an extension of the construction of non-expansive hash functions of Linial and Sasson (STOC 1996) from the case of one dimensional metrics to arbitrary dimensions.