Data Cleaning – .apply()
To keep things organized, we can write functions to deal with individual values.
def live_alone(status):
    # 0 = lives with a partner, 1 = lives alone
    if status in ["Married", "Together"]:
        return 0
    return 1
Then you can use .apply() with the function we wrote:
data["Live_Alone"] = data["Marital_Status"].apply(live_alone)
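Putting the pieces together, here is a minimal runnable sketch; the sample Marital_Status values are made up for illustration, since the original dataset isn't shown:

```python
import pandas as pd

# Hypothetical sample mimicking the Marital_Status column
data = pd.DataFrame(
    {"Marital_Status": ["Married", "Single", "Together", "Divorced"]}
)

def live_alone(status):
    # 0 = lives with a partner, 1 = lives alone
    if status in ["Married", "Together"]:
        return 0
    return 1

# .apply() calls the function once per value in the Series
data["Live_Alone"] = data["Marital_Status"].apply(live_alone)
```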
NumPy – np.where()
data["Parent"] = np.where(data["Kidhome"] + data["Teenhome"] > 0, 1, 0)
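The same line works end to end on a small made-up example (the Kidhome/Teenhome counts below are assumed, not from the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical household counts
data = pd.DataFrame({"Kidhome": [0, 1, 0], "Teenhome": [0, 0, 2]})

# np.where(condition, value_if_true, value_if_false), applied elementwise
data["Parent"] = np.where(data["Kidhome"] + data["Teenhome"] > 0, 1, 0)
```

Because the condition is vectorized, this is typically much faster than .apply() with a row-wise function.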
Exploratory Visualization
sns.scatterplot(x = "Income", y = "Recency", hue = "Parent", data = data)
Modeling
Let’s start with KMeans
from sklearn.cluster import KMeans
We can make the clustering reproducible by setting the random_state parameter (np.random.seed() also works, since scikit-learn falls back to NumPy's global RNG when random_state is not given):
kmeans = KMeans(n_clusters = 6, random_state = 42)
X["Cluster"] = kmeans.fit_predict(X)
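As a self-contained sketch, we can run the same steps on synthetic 2-D data with two well-separated blobs (a stand-in for the unspecified feature matrix X):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic blobs, 50 points each
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.5, (50, 2)),
    rng.normal(5, 0.5, (50, 2)),
])

# random_state fixes the initialization; n_init restarts and keeps the best fit
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = kmeans.fit_predict(X)
```

With such clean separation, each blob ends up in its own cluster (which label each blob gets is arbitrary).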
Visualization
Import both seaborn and matplotlib.pyplot:
import seaborn as sns
import matplotlib.pyplot as plt
Visualize clusters
sns.scatterplot(x = "Income", y = "Age", data = X, hue = "Cluster")
plt.show()
Preprocessing
- Scaling
- Dimensionality Reduction
Dimensionality Reduction – PCA
Principal Component Analysis (PCA):
- a dimensionality reduction technique widely used in statistics, machine learning, and data science
- PCA transforms a dataset with potentially correlated variables into a set of linearly uncorrelated variables called principal components
Principal Component Analysis (PCA)
- Reduces the number of features in a dataset
- Retains most of the information
- Identifies patterns and relationships between variables
- Removes noise and redundancy from data
- Used to preprocess data before using it for machine learning algorithms
Principal Component Analysis (PCA) in sklearn
Here’s the import statement:
from sklearn.decomposition import PCA
Then we can run it:
pca = PCA(n_components = 2)
# fit the data
pca.fit(data)
# get transformed data
reduced_data = pd.DataFrame(pca.transform(data))
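To see PCA doing its job, here is a runnable sketch on made-up data where two of three columns are strongly correlated; fit_transform combines the fit and transform steps from above:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
a = rng.normal(size=100)
# x2 is mostly a noisy copy of x1, so the data is effectively ~2-dimensional
data = pd.DataFrame({
    "x1": a,
    "x2": a + rng.normal(scale=0.1, size=100),
    "x3": rng.normal(size=100),
})

pca = PCA(n_components=2)
reduced_data = pd.DataFrame(pca.fit_transform(data))
```

pca.explained_variance_ratio_ reports how much of the original variance each component keeps; here the first component dominates because it captures the shared x1/x2 direction.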
Visualize results
kmeans = KMeans(n_clusters = 6)
reduced_data["Cluster"] = kmeans.fit_predict(reduced_data)
sns.scatterplot(x = 0, y = 1, data = reduced_data, hue = "Cluster")
plt.show()
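The full preprocess-then-cluster pipeline can be sketched end to end; the 4-D synthetic groups below are an assumption standing in for the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Two synthetic groups in a 4-D feature space
rng = np.random.default_rng(0)
data = pd.DataFrame(np.vstack([
    rng.normal(0, 0.5, (40, 4)),
    rng.normal(4, 0.5, (40, 4)),
]))

# Step 1: reduce to two principal components
pca = PCA(n_components=2)
reduced_data = pd.DataFrame(pca.fit_transform(data))

# Step 2: cluster in the reduced space
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
reduced_data["Cluster"] = kmeans.fit_predict(reduced_data[[0, 1]])
```

The resulting reduced_data frame has integer column names 0 and 1 plus "Cluster", which is exactly what the scatterplot call above expects.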