Abstract:
Hierarchical clustering is a popular technique for organizing data as a rooted tree structure that simultaneously clusters data at multiple levels of granularity. A well-studied recent objective function views the input as a weighted graph with edges indicating similarity between the data points, and focuses on finding a tree that minimizes the cost of hierarchical partitioning.
The resulting optimization problem is NP-hard, and previous algorithms for approximating this objective require at least linear time/space. In this talk, we will consider algorithms for hierarchical clustering that use sublinear resources (space, time, and communication).
Specifically, we will present sublinear algorithms for hierarchical clustering in the streaming model (space), in the query model (time), and in the MPC model (communication). At the core of our algorithmic results is a connection between hierarchical clustering and a suitably relaxed notion of cut sparsifiers of graphs that lends itself to efficient sublinear algorithms. We complement our algorithmic results by establishing nearly matching lower bounds that rule out algorithms with better performance guarantees in each of these models.
This is joint work with Arpit Agarwal, Huan Li, and Prathamesh Patil.