Data Cleaning – .apply()
To keep things organized, we can write functions to deal with individual values.
def live_alone(status):
    # 0 = lives with a partner, 1 = lives alone
    if status in ["Married", "Together"]:
        return 0
    return 1
Then you can use .apply() with the function we wrote:
data["Live_Alone"] = data["Marital_Status"].apply(live_alone)
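Putting the pieces together, here is a minimal runnable sketch; the sample Marital_Status values are made up for illustration, since the original dataset isn't shown:

```python
import pandas as pd

# Hypothetical sample mimicking the Marital_Status column
data = pd.DataFrame(
    {"Marital_Status": ["Married", "Single", "Together", "Divorced"]}
)

def live_alone(status):
    # 0 = lives with a partner, 1 = lives alone
    if status in ["Married", "Together"]:
        return 0
    return 1

# .apply() calls the function once per value in the Series
data["Live_Alone"] = data["Marital_Status"].apply(live_alone)
```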
NumPy – np.where()
data["Parent"] = np.where(data["Kidhome"] + data["Teenhome"] > 0, 1, 0)
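The same line works end to end on a small made-up example (the Kidhome/Teenhome counts below are assumed, not from the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical household counts
data = pd.DataFrame({"Kidhome": [0, 1, 0], "Teenhome": [0, 0, 2]})

# np.where(condition, value_if_true, value_if_false), applied elementwise
data["Parent"] = np.where(data["Kidhome"] + data["Teenhome"] > 0, 1, 0)
```

Because the condition is vectorized, this is typically much faster than .apply() with a row-wise function.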
Exploratory Visualization
sns.scatterplot(x = "Income", y = "Recency", hue = "Parent", data = data)
Modeling
Let’s start with KMeans
from sklearn.cluster import KMeans
We can make the clustering reproducible by setting the random_state parameter (np.random.seed() also works, since scikit-learn falls back to NumPy's global RNG when random_state is not given):
kmeans = KMeans(n_clusters = 6, random_state = 42)
X["Cluster"] = kmeans.fit_predict(X)
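As a self-contained sketch, we can run the same steps on synthetic 2-D data with two well-separated blobs (a stand-in for the unspecified feature matrix X):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic blobs, 50 points each
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.5, (50, 2)),
    rng.normal(5, 0.5, (50, 2)),
])

# random_state fixes the initialization; n_init restarts and keeps the best fit
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = kmeans.fit_predict(X)
```

With such clean separation, each blob ends up in its own cluster (which label each blob gets is arbitrary).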
Visualization
Import both seaborn and matplotlib.pyplot:
import seaborn as sns
import matplotlib.pyplot as plt
Visualize clusters
sns.scatterplot(x = "Income", y = "Age", data = X, hue = "Cluster")
plt.show()
Preprocessing
- Scaling
- Dimensionality Reduction
Dimensionality Reduction – PCA
Principal Component Analysis (PCA):
- a dimensionality reduction technique widely used in statistics, machine learning, and data science
- PCA transforms a dataset with potentially correlated variables into a set of linearly uncorrelated variables called principal components
Principal Component Analysis (PCA)
- Reduces the number of features in a dataset
- Retains most of the information
- Identifies patterns and relationships between variables
- Removes noise and redundancy from data
- Used to preprocess data before using it for machine learning algorithms
Principal Component Analysis (PCA) in sklearn
Here’s the import statement:
from sklearn.decomposition import PCA
Then we can run it:
pca = PCA(n_components = 2)
# fit the data
pca.fit(data)
# get transformed data
reduced_data = pd.DataFrame(pca.transform(data))
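To see PCA doing its job, here is a runnable sketch on made-up data where two of three columns are strongly correlated; fit_transform combines the fit and transform steps from above:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
a = rng.normal(size=100)
# x2 is mostly a noisy copy of x1, so the data is effectively ~2-dimensional
data = pd.DataFrame({
    "x1": a,
    "x2": a + rng.normal(scale=0.1, size=100),
    "x3": rng.normal(size=100),
})

pca = PCA(n_components=2)
reduced_data = pd.DataFrame(pca.fit_transform(data))
```

pca.explained_variance_ratio_ reports how much of the original variance each component keeps; here the first component dominates because it captures the shared x1/x2 direction.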
Visualize results
kmeans = KMeans(n_clusters = 6)
reduced_data["Cluster"] = kmeans.fit_predict(reduced_data)
sns.scatterplot(x = 0, y = 1, data = reduced_data, hue = "Cluster")
plt.show()
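The full preprocess-then-cluster pipeline can be sketched end to end; the 4-D synthetic groups below are an assumption standing in for the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Two synthetic groups in a 4-D feature space
rng = np.random.default_rng(0)
data = pd.DataFrame(np.vstack([
    rng.normal(0, 0.5, (40, 4)),
    rng.normal(4, 0.5, (40, 4)),
]))

# Step 1: reduce to two principal components
pca = PCA(n_components=2)
reduced_data = pd.DataFrame(pca.fit_transform(data))

# Step 2: cluster in the reduced space
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
reduced_data["Cluster"] = kmeans.fit_predict(reduced_data[[0, 1]])
```

The resulting reduced_data frame has integer column names 0 and 1 plus "Cluster", which is exactly what the scatterplot call above expects.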