Clustering Images

Data

Download the digit images and set up your working environment.

Opening images in Python

We will be using the Image module from PIL (provided by the Pillow package):

from PIL import Image

If you need to install PIL, you can run:

python3 -m pip install --upgrade Pillow

Then you can open an image:

image = Image.open("data/image.png")

Using images in Python (converting to an array requires NumPy):

import numpy as np

image.show()
image_array = np.array(image)
image.close()
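The open/convert/close workflow above can be exercised without a file on disk. Here is a minimal sketch, assuming Pillow and NumPy are installed, that builds a small grayscale image in memory instead of opening one from "data/":

```python
import numpy as np
from PIL import Image

# create a 4x4 grayscale image in memory (a stand-in for Image.open on a file)
image = Image.new("L", (4, 4), color=128)
image_array = np.array(image)

print(image_array.shape)   # (4, 4)
print(image_array.dtype)   # uint8
image.close()
```

Grayscale images become 2-D arrays of 8-bit values; RGB images would produce a third dimension for the color channels.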

Dealing with multiple files

We will use the os module:

import os

Then we can iterate over the list of files in a folder:

for filename in os.listdir("data"):
  if filename.endswith(".png"):
    print(filename)

How to transform images into data?

We can get the values of individual pixels from each image.

How do we combine all images into one .csv file?

How to transform images into data?

image = Image.open("data/image.png")
image_array = np.array(image)

pixels = []
for line in image_array:
  for pixel in line:
    pixels.append(pixel)

Function to deal with individual image

def get_pixels(image_array):
    pixels = []
    for line in image_array:
        for pixel in line:
            pixels.append(pixel)
    return pixels
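As a sanity check, the nested loop in get_pixels is equivalent to flattening the array with NumPy. A small sketch (the 2×3 array here is just illustrative):

```python
import numpy as np

def get_pixels(image_array):
    pixels = []
    for line in image_array:
        for pixel in line:
            pixels.append(pixel)
    return pixels

arr = np.arange(6).reshape(2, 3)

# both walk the array row by row and produce the same values
print([int(p) for p in get_pixels(arr)])   # [0, 1, 2, 3, 4, 5]
print([int(p) for p in arr.flatten()])     # [0, 1, 2, 3, 4, 5]
```

arr.flatten() is the idiomatic shortcut, but the explicit loop makes the row-by-row order visible.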

Creating the data frame

First we create a data frame containing only an index column (there are 123 pixels in each image):

data = pd.DataFrame({"index" : list(range(123))})

Then for each image, we create an individual data frame, calling the function we wrote previously:

these_data = pd.DataFrame({filename : get_pixels(image_array)})

Then we concatenate the columns:

data = pd.concat([data, these_data], axis=1)

Creating the data frame

When we have gone over all images and our data is complete, we transpose the data frame so that each image's column becomes a row.

data_final = data.transpose()
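Putting the pieces together, the assembly loop can be sketched end to end. The two small arrays below are made-up stand-ins for np.array(Image.open(...)) on real image files:

```python
import numpy as np
import pandas as pd

def get_pixels(image_array):
    return [int(p) for p in image_array.flatten()]

# stand-ins for two 2x2 images read from disk
images = {
    "a.png": np.array([[0, 1], [2, 3]]),
    "b.png": np.array([[4, 5], [6, 7]]),
}

# start with an index column (these toy images have 4 pixels each)
data = pd.DataFrame({"index": list(range(4))})

for filename, image_array in images.items():
    these_data = pd.DataFrame({filename: get_pixels(image_array)})
    data = pd.concat([data, these_data], axis=1)

# transpose so each image's column becomes a row
data_final = data.transpose()
print(data_final)
```

With the real images, the loop body would open each file, convert it to an array, and pass that array to get_pixels, exactly as in the snippets above.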

Modeling

Now we can cluster our data into 10 clusters.

import pandas as pd
from sklearn.cluster import KMeans

# read data in
data = pd.read_csv("all_pixels.csv")

# run KMeans with 10 clusters
model = KMeans(n_clusters=10)
data["cluster"] = model.fit_predict(data)

# save results
data.to_csv("digits.csv", index=False)

Inspect results

import pandas as pd

from PIL import Image

def main():
    data = pd.read_csv("digits.csv")

    print(data["cluster"].value_counts())

    for i in range(5):
        Image.open("data/" + data.iloc[i]["filename"]).show()
        print(data.iloc[i]["cluster"])

main()

Performance metrics

How do we measure performance without the ground truth?

Silhouette Score

  • Score ranges from -1 to 1 (higher values indicate better clustering)
  • A measure of how similar data points are to their own cluster compared to other clusters

Silhouette Score

  • Combination of two factors:
    • Cohesion: How close a point is to other points in its cluster
    • Separation: How far a point is from points in other clusters

Silhouette Score

For each data point:

  • Calculate average distance to all other points in the same cluster (Cohesion)
  • Calculate minimum average distance to points in any other cluster (Separation)
  • The silhouette value is (Separation - Cohesion) / max(Cohesion, Separation)
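The three steps above can be checked by hand on a tiny example. A sketch with two one-dimensional clusters, comparing the hand computation for one point against sklearn's silhouette_samples:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# two tight 1-D clusters: {0, 1} and {10, 11}
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

# by hand, for the first point (x = 0):
cohesion = 1.0                  # average distance to the other point in its cluster
separation = (10.0 + 11.0) / 2  # average distance to the points of the other cluster
s = (separation - cohesion) / max(cohesion, separation)
print(round(s, 4))              # 0.9048

# sklearn's per-point value for the same point
print(round(float(silhouette_samples(X, labels)[0]), 4))  # 0.9048
```

silhouette_samples returns one value per point; averaging them gives the overall Silhouette Score that silhouette_score reports.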

The overall Silhouette Score is the mean of the silhouette values over all data points.

Silhouette Score – Limitations

  • Computationally intensive (especially so for large datasets)
  • Favors models with fewer clusters – Curse of Dimensionality

Silhouette Score

Here’s how to use silhouette_score from sklearn.metrics:

from sklearn.metrics import silhouette_score

# calculate silhouette score from the feature matrix X and the cluster labels
print(silhouette_score(X, data["cluster"]))

A score close to zero means points are on or very close to the decision boundary between clusters.
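Because we rarely know the right number of clusters up front, a common use of the score is to compare candidate values of n_clusters. A sketch on synthetic blobs (the data here is made up; with our digit pixels we would loop over n_clusters the same way):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# three well-separated 2-D blobs centered near (0,0), (5,5), and (10,10)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (0, 5, 10)])

scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
```

The score should peak at k = 3 here, matching the number of blobs we generated.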