Case Studies

Cupcake vs. Muffin

Download the cupcake vs. muffin data.

Question: Are cupcakes and muffins the same thing?

Challenge: Build a classifier from the recipes so that, given a new recipe, it determines whether it's a cupcake or a muffin.

Submit your my_predictions.csv file to Gradescope – make sure it has a column named prediction.
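
A minimal sketch for writing the submission file (assuming y_pred holds the predicted labels from a model like the ones built below; the filename and column name come from the instructions above):

import pandas as pd

# y_pred: array of predicted labels from your trained model
pd.DataFrame({"prediction": y_pred}).to_csv("my_predictions.csv", index=False)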

Decision Trees

  • supervised learning
  • a non-parametric algorithm – no specific functional form/pattern to fit, no distribution assumptions
  • for both classification and regression
  • divide-and-conquer strategy – recursively split the data into smaller subsets
  • hierarchical tree structure – root node, branches, internal nodes, leaf nodes

Documentation for Decision Trees

Example of an import statement:

from sklearn.tree import DecisionTreeClassifier

And then:

model = DecisionTreeClassifier()
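
A minimal end-to-end sketch, assuming a feature matrix X and a label vector y have already been built from the recipe data:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# hold out a test set so the tree is evaluated on unseen recipes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)     # learn the splits from the training data
y_pred = model.predict(X_test)  # predict labels for the held-out recipes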

Random Forests

  • Collection of decision trees
  • Ensemble (voting) algorithm
  • Trees work together to make more accurate and stable predictions

Example of an import statement:

from sklearn.ensemble import RandomForestClassifier
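
A minimal usage sketch (n_estimators sets the number of trees; 100 here is an illustrative choice, not a tuned value):

# each tree votes and the forest predicts the majority class
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)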

Confusion Matrix

import pandas as pd
from sklearn.metrics import confusion_matrix

# rows are actual classes, columns are predicted classes
print(pd.DataFrame(confusion_matrix(y_test, y_pred)))
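
To make the table easier to read, label the rows and columns with the class names (model.classes_ gives the label order sklearn uses):

print(pd.DataFrame(confusion_matrix(y_test, y_pred, labels=model.classes_),
                   index=model.classes_, columns=model.classes_))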

Classification Metrics

  • Accuracy – proportion of correctly classified instances among all instances
  • Precision – proportion of correctly classified positive instances among all instances predicted as positive (including false positives)
  • Recall – proportion of correctly classified positive instances among all actual positive instances (including false negatives)
  • F1-score – harmonic mean of precision and recall
  • ROC AUC – trade-off between correctly identified positive cases (true positives) and incorrectly identified negative cases (false positives)
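
A minimal sketch computing the first four metrics with sklearn (y_test and y_pred as above; treating "cupcake" as the positive class is an assumption):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# pos_label names the class treated as "positive"; "cupcake" is illustrative
print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred, pos_label="cupcake"))
print(recall_score(y_test, y_pred, pos_label="cupcake"))
print(f1_score(y_test, y_pred, pos_label="cupcake"))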

ROC AUC

  • ROC AUC (Receiver Operating Characteristic – Area Under the Curve)
  • comprehensive way to measure a model’s performance across different classification thresholds
  • AUC (Area Under the Curve) – the probability that a model will correctly rank a randomly chosen positive example higher than a randomly chosen negative example
  • scores range from 0 to 1 – a higher AUC means better model performance
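
A tiny concrete example of the ranking interpretation (the labels and scores are invented for illustration):

from sklearn.metrics import roc_auc_score

# two negatives (0) and two positives (1) with made-up scores
y_true  = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 positive/negative pairs are ranked correctly, so AUC = 0.75
print(roc_auc_score(y_true, y_score))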

Curve/plot:

  • define multiple classification thresholds, and then calculate:
    • True Positive Rate (TPR)
    • False Positive Rate (FPR)
  • Plot points on the ROC curve
  • Calculate the area under this curve

Import statement:

from sklearn.metrics import roc_auc_score, roc_curve

Use:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# ROC needs scores, not hard labels – use the positive-class probability
y_prob = model.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_prob))

# pos_label must match the class behind column 1 of predict_proba
fpr, tpr, thresholds = roc_curve(y_test, y_prob, pos_label=model.classes_[1])

auc_df = pd.DataFrame({"fpr": fpr, "tpr": tpr})
sns.lineplot(data=auc_df, x="fpr", y="tpr")
plt.show()

Pumpkin Seeds

Download and clean this pumpkin seeds data set.
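
A minimal loading-and-cleaning sketch (the filename and the specific cleaning steps are assumptions – adapt them to what the downloaded file actually needs):

import pandas as pd

# "pumpkin_seeds.csv" is a placeholder name for the downloaded file
seeds = pd.read_csv("pumpkin_seeds.csv")

seeds.columns = seeds.columns.str.strip()  # tidy header whitespace
seeds = seeds.drop_duplicates()            # remove exact duplicate rows
seeds = seeds.dropna()                     # drop rows with missing values
print(seeds.describe())                    # sanity-check the numeric columns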