Clustering

Recap: Supervised vs. Unsupervised Learning

  • Supervised Learning
    • Data: both the features, x, and a target, y, for each item in the dataset
    • Goal: ‘learn’ how to predict the target from the features, y = f(x)
    • Example: Regression and Classification

Recap: Supervised vs. Unsupervised Learning

  • Unsupervised Learning
    • Data: Only the features, x, for each item in the dataset
    • Goal: discover ‘interesting’ things about the dataset
    • Example: Clustering, Dimensionality reduction, Principal Component Analysis (PCA)

Clustering

  • Clustering is the task of discovering unknown subgroups in data, or clusters
  • The goal is to partition the dataset into clusters where ‘similar’ items are in the same cluster and ‘dissimilar’ items are in different clusters
  • Example:
    • Social Network Analysis: Clustering can be used to find communities
    • Handwritten digits where the digits are unknown

Clustering

  • Partition data into groups (clusters)
  • Points within a cluster should be “similar”
  • Points in different cluster should be “different”

Goal of Clustering

  • Data Exploration
    • Are there coherent groups ?
    • How many groups are there ?
  • Data Partitioning
    • Divide data by group before further processing

Formulation of K-means Clustering Method

  • Data: A collection of points \(x_i\), for i = 1,…,n
  • Find k centers, assign each point x to a cluster as to minimize the total intra-cluster distance
  • The total intra-cluster distance is the total squared Euclidean distance from each point to the center of its cluster
  • It’s a measure of the variance or internal coherence of the cluster

K-Means Algorithm

  • Pick number of clusters k
  • Pick k random points as “cluster center” (or centroid)
  • While cluster centers change:
    • Assign each data point to its closest cluster center
    • Recompute cluster centers as the mean of the assigned points

K-Means Algorithm

K-Means Algorithm

The K-means clustering algorithm on Airbnb rentals in NYC.

from Clustering With K-Means

Limitations of K-Means

  • Restriction to cluster shapes
  • Restriction to distance of points to centers

Overview of clustering methods

from sklearn clustering documentation

Dimensionality Reduction

  • It is often beneficial to reduce number of dimensions before running K-means
  • Simplification of the data – it makes it easier to find clusters

Curse of Dimensionality

As the number of dimensions (features) in a dataset increases, the chances of two data points being close to each other become increasingly rare