Once we have gone over all images and our data is complete, we transpose the data frame so that each column becomes a row:
data_final = data.transpose()
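To make the effect of the transpose concrete, here is a minimal sketch on a toy data frame. The file names and pixel values are made up for illustration, and it assumes each image was stored as one column while the pixels were collected.

```python
import pandas as pd

# Hypothetical toy data: each column holds the pixel values of one image.
data = pd.DataFrame({
    "img_0.png": [0, 255, 128],
    "img_1.png": [34, 0, 90],
})

# Transpose so that each image becomes a row and each pixel a column.
data_final = data.transpose()

print(data.shape)        # 3 pixels x 2 images
print(data_final.shape)  # 2 images x 3 pixels
print(data_final)
```

After the transpose, the former column labels (here the image names) become the row index, which is the one-sample-per-row layout that clustering expects.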
Modeling
Now we can cluster our data into 10 clusters.
import pandas as pd
from sklearn.cluster import KMeans

# read data in
data = pd.read_csv("all_pixels.csv")

# run KMeans with 10 clusters (fit_predict expects numeric columns only)
model = KMeans(n_clusters=10)
data["cluster"] = model.fit_predict(data)

# save results
data.to_csv("digits.csv", index=False)
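The same step can be tried without the CSV file. The sketch below uses randomly generated pixel values as a stand-in for all_pixels.csv; the column names, the data size, and the presence of a non-numeric "filename" column are all assumptions made for illustration. It also fixes random_state and n_init so that repeated runs give the same labels.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for all_pixels.csv: 60 "images" with 16 pixel values each,
# plus an assumed non-numeric "filename" column.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(60, 16))
data = pd.DataFrame(pixels, columns=[f"px_{j}" for j in range(16)])
data.insert(0, "filename", [f"img_{i}.png" for i in range(60)])

# Cluster on the numeric pixel columns only; KMeans cannot handle strings.
X = data.drop(columns=["filename"])

# Fixing random_state and n_init makes the result reproducible.
model = KMeans(n_clusters=10, n_init=10, random_state=0)
data["cluster"] = model.fit_predict(X)

# Number of images assigned to each cluster.
print(data["cluster"].value_counts())
```

Keeping the features in a separate X also means the cluster labels never leak back into the feature matrix if the model is refit later.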
Inspect results
import pandas as pd
from PIL import Image

def main():
    data = pd.read_csv("digits.csv")
    # number of images per cluster
    print(data["cluster"].value_counts())
    # show the first five images along with their assigned cluster
    for i in range(5):
        Image.open("data/" + data.iloc[i]["filename"]).show()
        print(data.iloc[i]["cluster"])

main()
Performance metrics
How do we measure performance without the ground truth?
Silhouette Score
Measure of how similar data points are to their own cluster compared to other clusters
Score ranges from -1 to 1 (a higher value indicates better clustering)
Combination of two factors:
Cohesion: How close a point is to other points in its cluster
Separation: How far a point is from points in other clusters
Silhouette Score
For each data point:
Calculate average distance to all other points in the same cluster (Cohesion)
Calculate the average distance to the points of each other cluster and take the minimum, i.e. the average distance to the nearest neighboring cluster (Separation)
The silhouette value is (Separation - Cohesion) / max(Cohesion, Separation)
The overall Silhouette Score is the mean of the silhouette values over all data points.
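The steps above can be sketched by hand and checked against scikit-learn. The toy points, the labels, and the helper name manual_silhouette are made up for illustration. The sketch assumes Euclidean distance and assigns a value of 0 to points in single-member clusters, which is the convention scikit-learn follows.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_score

# Hypothetical toy data: two well-separated groups of 2-D points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

def manual_silhouette(X, labels):
    """Illustrative helper: mean silhouette value using Euclidean distance."""
    dist = pairwise_distances(X)
    values = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            values.append(0.0)  # convention for single-member clusters
            continue
        # Cohesion: average distance to the other points in the same cluster.
        cohesion = dist[i, same].mean()
        # Separation: smallest average distance to any other cluster.
        separation = min(dist[i, labels == k].mean()
                         for k in set(labels) if k != labels[i])
        values.append((separation - cohesion) / max(cohesion, separation))
    return float(np.mean(values))

print(manual_silhouette(X, labels))
print(silhouette_score(X, labels))
```

Both lines should print the same number. The double loop over points and clusters also shows why the computation gets expensive as the dataset grows.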
Silhouette Score – Limitations
Computationally intensive (especially so for large datasets)
Tends to favor models with fewer clusters, and in high-dimensional data distances become less informative (Curse of Dimensionality)
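One way to reduce the computational cost is to estimate the score on a random subset of the points, which silhouette_score supports through its sample_size and random_state parameters. The sketch below uses synthetic data from make_blobs as a stand-in for the pixel data; the data size, the number of clusters, and the sample size are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical synthetic data: 2000 points in 8 dimensions around 4 centers.
X, _ = make_blobs(n_samples=2000, centers=4, n_features=8, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Exact score: needs the distances between all pairs of points.
full_score = silhouette_score(X, labels)

# Approximate score computed on a random sample of 500 points.
sampled_score = silhouette_score(X, labels, sample_size=500, random_state=0)

print(full_score, sampled_score)
```

The sampled value is an estimate, so it can differ slightly from the exact score and changes with random_state.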
Silhouette Score
Here’s how to use silhouette_score from sklearn.metrics:
from sklearn.metrics import silhouette_score

# calculate silhouette score based on features and cluster labels
# (X is assumed to hold the feature columns that were used for clustering)
print(silhouette_score(X, data["cluster"]))
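Close to 1: points are well matched to their own cluster and far from neighboring clusters
Close to zero: points are on or very close to the decision boundary between clusters
Close to -1: points may have been assigned to the wrong cluster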
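To see which individual points sit near a boundary, the per-point values can be inspected with silhouette_samples from sklearn.metrics. The one-dimensional toy data below is made up for illustration: the point at 5.0 is labeled as part of cluster 0 but lies exactly as far, on average, from cluster 1.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Hypothetical toy data: two groups on a line, plus one point (5.0) in between.
X = np.array([[0.0], [1.0], [2.0], [5.0], [8.0], [9.0], [10.0]])
labels = np.array([0, 0, 0, 0, 1, 1, 1])

# One silhouette value per data point.
values = silhouette_samples(X, labels)

for point, label, value in zip(X.ravel(), labels, values):
    print(f"x={point:4.1f}  cluster={label}  silhouette={value:.3f}")

# Index of the point whose value is closest to zero.
print(np.argmin(np.abs(values)))
```

For the in-between point, cohesion and separation are both 4, so its silhouette value is 0, while the remaining points all score well above 0.5.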