Approximate Clustering without the Approximation

by Maria-Florina (Nina) Balcan

My own view is that this direction of research has much broader applicability than realized so far. For any search problem $R$, consider the problem of finding a solution $y$ to a given instance $x$ (i.e., given $x$, find $y$ s.t. $(x,y)\in R$). Suppose that, with respect to an objective function $\Phi$ (modeling an auxiliary objective believed to be minimized by valid solutions), the following $(c,\eps)$-property holds: for every $x,y$, if $\Phi(x,y) \leq c\cdot OPT(x) \eqdef\min_z\{\Phi(x,z)\}$, then $y$ is $\eps$-close to a solution to $x$ wrt $R$ (i.e., there exists $z$ s.t. $\Delta(y,z) < \eps |y|$ and $(x,z)\in R$). In such cases (i.e., when the solution space defined by $R$ satisfies such a property wrt an auxiliary objective $\Phi$), one typically tries to $c$-approximate the objective $\Phi$ in order to find strings that are $\eps$-close to being a valid solution; indeed, this is often suggested as the justification for seeking a $c$-approximation to $\Phi$. However, the point is that there may be a direct way of finding a string that is $\eps$-close to being a solution, rather than attempting to $c$-approximate the objective $\Phi$ (which may be infeasible to approximate...). Furthermore, a proof of the $(c,\eps)$-property (of some $\Phi$) may be useful towards this goal, in which case we may view $\Phi$ as a pivot of the analysis (rather than as a goal of the algorithm).

Later comment by Avrim and Nina: Our model is actually a bit more like an instance-based version of the above condition. That is, say that the instance $x$ satisfies the $(c,\eps)$-property if: for every $y$, if $\Phi(x,y) \leq c\cdot OPT(x)$ then $y$ is $\eps$-close to a solution to $x$ wrt $R$. Then the goal is to have an algorithm that finds a good $y$ (one that is $O(\eps)$-close to a solution) on those instances $x$ satisfying the property. This is the kind of guarantee we get for clustering.
Oded: I still consider it good to present the global definition first, and the instance-based version as a ramification.
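To make the instance-based version concrete, here is a minimal sketch that checks, by brute force, whether a toy instance satisfies the $(c,\eps)$-property for the k-median objective. The point set, the target clustering, and the exhaustive enumeration are all illustrative choices of mine (not taken from the paper); the sketch only illustrates the definition — enumerate every clustering $y$, and verify that each one with $\Phi(x,y) \leq c\cdot OPT(x)$ is $\eps$-close to the target.

```python
from itertools import product, permutations

# Toy 1-D instance (illustrative): two well-separated groups.
# The "target" clustering is the natural split into those groups.
points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
target = [0, 0, 0, 1, 1, 1]
k = 2

def kmedian_cost(labels):
    """k-median cost: for each cluster, the best center is chosen
    among the cluster's own members (discrete k-median)."""
    cost = 0.0
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if members:
            cost += min(sum(abs(p - m) for p in members) for m in members)
    return cost

def dist_to_target(labels):
    """Fraction of misclustered points, minimized over relabelings
    (cluster names carry no meaning)."""
    best = len(points)
    for perm in permutations(range(k)):
        best = min(best, sum(perm[l] != t for l, t in zip(labels, target)))
    return best / len(points)

# Enumerate all k^n clusterings (feasible only for tiny instances).
all_clusterings = list(product(range(k), repeat=len(points)))
opt = min(kmedian_cost(l) for l in all_clusterings)

def satisfies_property(c, eps):
    """Instance-based (c,eps)-property: every c-approximation of the
    objective on this instance is eps-close to the target."""
    return all(dist_to_target(l) <= eps
               for l in all_clusterings
               if kmedian_cost(l) <= c * opt)
```

On this instance every near-optimal clustering coincides with the target, so the property holds for moderate $c$ even with $\eps = 0$; for a very loose $c$, clusterings that misplace a point become $c$-approximations, and the property fails for small $\eps$.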

The original abstract

There has been substantial work on approximation algorithms for clustering data under distance-based objective functions such as k-median, k-means, and min-sum objectives. This work is fueled in part by the hope that approximating these objectives well will indeed yield more accurate solutions. That is, for problems such as clustering proteins by function, or clustering images by subject, there is some unknown correct “target” clustering and the implicit assumption is that clusterings that are approximately optimal in terms of these distance-based measures are also approximately correct in terms of error with respect to the target. In this work we show that if we make this implicit assumption explicit — that is, if we assume that any c-approximation to the given clustering objective Phi is epsilon-close to the target — then we can produce clusterings that are O(epsilon)-close to the target, even for values c for which obtaining a c-approximation is NP-hard. In particular, for the k-median, k-means, and min-sum objectives, we show that we can achieve this guarantee for any constant c > 1. Our results show how by explicitly considering the alignment between the objective function used and the true underlying clustering goals, one can bypass computational barriers and perform as if these objectives were computationally substantially easier.
