Case Study

Data

Download the data

Set up your coding environment, clean the data

Data Cleaning – .apply()

To keep things organized, we can write functions to deal with individual values.

def live_alone(string):
    """Return 0 if the marital status indicates a partner at home, else 1."""
    if string in ["Married", "Together"]:
        return 0
    return 1

Then you can use .apply() with the function we wrote:

data["Live_Alone"] = data["Marital_Status"].apply(live_alone)
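To see what the new column looks like, here is a minimal sketch on a tiny made-up table. The four sample rows are hypothetical and only stand in for the downloaded data; the `isin()` line is an optional vectorized alternative that is not part of the original slides.

```python
import pandas as pd

def live_alone(status):
    """Return 0 if the marital status indicates a partner at home, else 1."""
    if status in ["Married", "Together"]:
        return 0
    return 1

# Hypothetical sample rows (assumption: real values come from the case-study data)
sample = pd.DataFrame({"Marital_Status": ["Married", "Single", "Together", "Divorced"]})

# .apply() calls live_alone once per value in the column
sample["Live_Alone"] = sample["Marital_Status"].apply(live_alone)

# Vectorized alternative that builds the same column without a Python-level loop
sample["Live_Alone_isin"] = (~sample["Marital_Status"].isin(["Married", "Together"])).astype(int)

print(sample)
```

Writing a named function keeps the rule readable and easy to extend; the `isin()` version is usually faster on large tables.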

Data Cleaning – np.where()

# 1 if the household has at least one kid or teen at home, else 0
data["Parent"] = np.where(data["Kidhome"] + data["Teenhome"] > 0, 1, 0)
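A small worked example of the same `np.where()` call, assuming a made-up table with the two columns used above. The sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical households (assumption: counts of kids and teens at home)
sample = pd.DataFrame({"Kidhome": [0, 1, 0, 2], "Teenhome": [0, 0, 1, 1]})

# np.where(condition, value_if_true, value_if_false) works on the whole column at once
sample["Parent"] = np.where(sample["Kidhome"] + sample["Teenhome"] > 0, 1, 0)

print(sample)
```

Only the first household has no children at home, so it is the only row with `Parent` equal to 0.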

Exploratory Visualization

sns.scatterplot(x = "Income", y = "Recency", hue = "Parent", data = data)

Modeling

Let’s start with KMeans

from sklearn.cluster import KMeans

We can set the seed with np.random.seed() before creating and fitting the model, so the cluster assignments are reproducible from run to run

kmeans = KMeans(n_clusters = 6)
X["Cluster"] = kmeans.fit_predict(X)
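As an alternative to seeding NumPy globally, `KMeans` also accepts a `random_state` argument. The sketch below uses synthetic points and an arbitrary seed of 42 (both assumptions for illustration) to check that two runs with the same `random_state` give identical labels.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points standing in for the real feature matrix (assumption)
rng = np.random.default_rng(0)
points = rng.normal(size=(60, 2))

# Same random_state means the same centroid initialization, so the same labels
first = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(points)
second = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(points)

same = np.array_equal(first, second)
print("Identical labels across runs:", same)
```

Passing `random_state` keeps the seed local to the model, which may be easier to manage than a global seed when several random steps share a notebook.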

Visualization

Import both seaborn and matplotlib.pyplot (conventionally as sns and plt, the names used in the code below)

Visualize clusters

sns.scatterplot(x = "Income", y = "Age", data = X, hue = "Cluster")
plt.show()

Preprocessing

  • Scaling
  • Dimensionality Reduction
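The slides do not show a scaling step, so the following is only a sketch of one common option, `StandardScaler` from sklearn. KMeans relies on Euclidean distance, so a feature measured in large units (such as income) would otherwise dominate one measured in small units (such as age). The sample values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values on very different scales (assumption)
sample = pd.DataFrame({
    "Income": [30000.0, 52000.0, 75000.0, 98000.0],
    "Age": [25.0, 38.0, 47.0, 61.0],
})

# Rescale each column to mean 0 and standard deviation 1
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(sample), columns=sample.columns)

print(scaled.round(2))
```

After scaling, both columns contribute on a comparable footing to the distances that KMeans computes.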

Dimensionality Reduction – PCA

Principal Component Analysis (PCA):

  • A dimensionality reduction technique widely used in statistics, machine learning, and data science
  • Transforms a dataset with potentially correlated variables into a set of linearly uncorrelated variables called principal components
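The "linearly uncorrelated" claim can be checked directly. This sketch builds two strongly correlated synthetic variables (an assumption for illustration), runs PCA, and compares the correlation before and after.

```python
import numpy as np
from sklearn.decomposition import PCA

# Two synthetic variables where y depends strongly on x (assumption)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)
correlated = np.column_stack([x, y])

# Keep both components so nothing is discarded, only rotated
components = PCA(n_components=2).fit_transform(correlated)

corr_before = np.corrcoef(correlated, rowvar=False)[0, 1]
corr_after = np.corrcoef(components, rowvar=False)[0, 1]

print("Correlation between original variables:", round(corr_before, 3))
print("Correlation between principal components:", round(corr_after, 3))
```

The original variables are highly correlated, while the correlation between the two principal components is zero up to floating-point error.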

Principal Component Analysis (PCA)

  • Reduces the number of features in a dataset
  • Retains most of the information (variance) in the original features
  • Identifies patterns and relationships between variables
  • Can reduce noise and redundancy in the data
  • Often used to preprocess data before passing it to machine learning algorithms

Principal Component Analysis (PCA) in sklearn

Here’s the import statement:

from sklearn.decomposition import PCA

Then we can run it:

pca = PCA(n_components = 2)
# fit PCA to the data (all columns must be numeric)
pca.fit(data)
# project the data onto the first two principal components
reduced_data = pd.DataFrame(pca.transform(data))
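To check how much information the two components retain, a fitted `PCA` object exposes `explained_variance_ratio_`. The sketch below uses random numeric data as a stand-in for the cleaned dataset (an assumption) and `fit_transform()`, which combines the fit and transform steps.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Random numeric features standing in for the cleaned, numeric data (assumption)
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(100, 5)))

pca = PCA(n_components=2)
reduced = pd.DataFrame(pca.fit_transform(features))

# Fraction of the total variance captured by each kept component
ratios = pca.explained_variance_ratio_
retained = ratios.sum()

print("Variance explained per component:", ratios.round(3))
print("Total variance retained:", round(retained, 3))
```

If the retained share is low, two components may not summarize the data well, and increasing `n_components` is worth considering.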

Visualize results

kmeans = KMeans(n_clusters = 6)
reduced_data["Cluster"] = kmeans.fit_predict(reduced_data)

# columns 0 and 1 hold the first and second principal components
sns.scatterplot(x = 0, y = 1, data = reduced_data, hue = "Cluster")
plt.show()