Case Studies

Cupcake vs. Muffin

Download the cupcake vs. muffin data.

Question: Are cupcakes and muffins the same thing?

Challenge: Build a classifier from the recipes so that, given a new recipe, it determines whether it's a cupcake or a muffin.

Submit your my_predictions.csv file to Gradescope – make sure it has a column named prediction.
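
A minimal sketch for writing the submission file (assuming y_pred holds the predicted labels from a model like the ones built below; the filename and column name come from the instructions above):

import pandas as pd

# y_pred: array of predicted labels from your trained model
pd.DataFrame({"prediction": y_pred}).to_csv("my_predictions.csv", index=False)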

Decision Trees

  • supervised learning
  • a non-parametric algorithm – no specific functional form/pattern to fit, no distribution assumptions
  • for both classification and regression
  • divide-and-conquer strategy – recursively split the data into smaller subsets
  • hierarchical tree structure – root node, branches, internal nodes, leaf nodes

Documentation for Decision Trees

Example of an import statement:

from sklearn.tree import DecisionTreeClassifier

And then:

model = DecisionTreeClassifier()
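
A minimal end-to-end sketch, assuming a feature matrix X and a label vector y have already been built from the recipe data:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# hold out a test set so the tree is evaluated on unseen recipes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)     # learn the splits from the training data
y_pred = model.predict(X_test)  # predict labels for the held-out recipes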

Random Forests

  • Collection of decision trees
  • Ensemble (voting) algorithm
  • Trees work together to make more accurate and stable predictions

Example of an import statement:

from sklearn.ensemble import RandomForestClassifier
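
A minimal usage sketch (n_estimators sets the number of trees; 100 here is an illustrative choice, not a tuned value):

# each tree votes and the forest predicts the majority class
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)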

Confusion Matrix

import pandas as pd
from sklearn.metrics import confusion_matrix

# rows are actual classes, columns are predicted classes
print(pd.DataFrame(confusion_matrix(y_test, y_pred)))
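
To make the table easier to read, label the rows and columns with the class names (model.classes_ gives the label order sklearn uses):

print(pd.DataFrame(confusion_matrix(y_test, y_pred, labels=model.classes_),
                   index=model.classes_, columns=model.classes_))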

Classification Metrics

  • Accuracy – proportion of correctly classified instances among all instances
  • Precision – proportion of correctly classified positive instances among all instances predicted as positive (including false positives)
  • Recall – proportion of correctly classified positive instances among all actual positive instances (including false negatives)
  • F1-score – harmonic mean of precision and recall
  • ROC AUC – trade-off between correctly identified positive cases (true positives) and incorrectly identified negative cases (false positives)
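
A minimal sketch computing the first four metrics with sklearn (y_test and y_pred as above; treating "cupcake" as the positive class is an assumption):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# pos_label names the class treated as "positive"; "cupcake" is illustrative
print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred, pos_label="cupcake"))
print(recall_score(y_test, y_pred, pos_label="cupcake"))
print(f1_score(y_test, y_pred, pos_label="cupcake"))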

ROC AUC

  • ROC AUC (Receiver Operating Characteristic – Area Under the Curve)
  • comprehensive way to measure a model’s performance across different classification thresholds
  • AUC (Area Under the Curve) – the probability that a model will correctly rank a randomly chosen positive example higher than a randomly chosen negative example
  • scores range from 0 to 1 – a higher AUC means better model performance
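
A tiny concrete example of the ranking interpretation (the labels and scores are invented for illustration):

from sklearn.metrics import roc_auc_score

# two negatives (0) and two positives (1) with made-up scores
y_true  = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 positive/negative pairs are ranked correctly, so AUC = 0.75
print(roc_auc_score(y_true, y_score))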

Curve/plot:

  • define multiple classification thresholds, and then calculate:
    • True Positive Rate (TPR)
    • False Positive Rate (FPR)
  • Plot points on the ROC curve
  • Calculate the area under this curve

Import statement:

from sklearn.metrics import roc_auc_score, roc_curve

Use:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# ROC needs scores, not hard labels – use the positive-class probability
y_prob = model.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_prob))

# pos_label must match the class behind column 1 of predict_proba
fpr, tpr, thresholds = roc_curve(y_test, y_prob, pos_label=model.classes_[1])

auc_df = pd.DataFrame({"fpr": fpr, "tpr": tpr})
sns.lineplot(data=auc_df, x="fpr", y="tpr")
plt.show()

Pumpkin Seeds

Download and clean this pumpkin seeds data set.
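
A minimal loading-and-cleaning sketch (the filename and the specific cleaning steps are assumptions – adapt them to what the downloaded file actually needs):

import pandas as pd

# "pumpkin_seeds.csv" is a placeholder name for the downloaded file
seeds = pd.read_csv("pumpkin_seeds.csv")

seeds.columns = seeds.columns.str.strip()  # tidy header whitespace
seeds = seeds.drop_duplicates()            # remove exact duplicate rows
seeds = seeds.dropna()                     # drop rows with missing values
print(seeds.describe())                    # sanity-check the numeric columns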