## Approximate Clustering without the Approximation

by Maria-Florina (Nina) Balcan

#### Oded's comments

My own view is that this direction of research has a much more broad
applicability than realized so far.
For any search problem $R$, consider the problem of finding
a solution $y$ to a given instance $x$
(i.e., given $x$ find $y$ s.t. $(x,y)\in R$).
Suppose that with respect to an objective function $\Phi$
(modeling an auxiliary objective function believes to be minimized
by valid solutions) the following **$(c,\eps)$-property** holds:
for every $x,y$
if $\Phi(x,y) \leq c\cdot OPT(x) \eqdef\min_z\{\Phi(x,z)\}$,
then $y$ is $\eps$-close to a solution to $x$ wrt $R$
(i.e., exists $z$ s.t. $Delta(y,z) < \eps |y|$ and $(x,z)\in R$).
In such cases (i.e., if the solution space (defined by $R$)
satisfies such a property wrt an auxiliary objective $\Phi$),
one typically tries to $c$-approximate the objective $\Phi$
in order to find strings that are $\eps$-close to being a valid solution.
Often, this is suggested as the justification for searching for
a $c$-approximation to the objective $\Phi$.
However, the point is that there may be a direct way of finding
a string that is $\eps$-close to being a solution,
instead of attempting to $c$-approximate the objective $\Phi$
(which may be infeasible to approximate...).
Furthermore, a proof of the $(c,\eps)$-property (of some $\Phi$)
may be useful towards this goal, in which case we may view $\Phi$
as a pivot of the analysis (rather than as a goal of the algorithm).
Later comment by Avrim and Nina: Our model is actually a bit more
like an instance-based version of the above condition. That is, say
that the instance $x$ satisfies the **$(c,\eps)$-property** if:
for every $y$, if $\Phi(x,y) \leq c\cdot OPT(x)$ then $y$ is
$\eps$-close to a solution to $x$ wrt $R$. Then the goal is to have
an algorithm that finds a good $y$ (one that is $O(\eps)$-close to a
solution) on those instances $x$ satisfying the property. This is
the kind of guarantee we get for clustering.

Oded: I still consider it good to present the global definition first,
and the instance-based version as a ramification.

#### The original abstract

There has been substantial work on approximation algorithms for clustering
data under distance-based objective functions such as k-median, k-means,
and min-sum objectives. This work is fueled in part by the hope that
approximating these objectives well will indeed yield more accurate
solutions. That is, for problems such as clustering proteins by function,
or clustering images by subject, there is some unknown correct “target”
clustering and the implicit assumption is that clusterings that are
approximately optimal in terms of these distance-based measures are also
approximately correct in terms of error with respect to the target. In
this work we show that if we make this implicit assumption explicit —
that is, if we assume that any c-approximation to the given clustering
objective Phi is epsilon-close to the target — then we can produce
clusterings that are O(epsilon)-close to the target, even for values c for
which obtaining a c-approximation is NP-hard. In particular, for the
k-median, k-means, and min-sum objectives, we show that we can achieve
this guarantee for any constant c > 1.
Our results show how by explicitly considering the alignment between the
objective function used and the true underlying clustering goals, one can
bypass computational barriers and perform as if these objectives were
computationally substantially easier.

Back to
list of Oded's choices.