Machine Learning (Prediction)

What’s machine learning?

  • modeling to make predictions or find patterns in data
  • often without explicit hypotheses
  • a flexible pattern-matcher looking for useful relationships in data
  • models learn patterns and relationships within the data
  • usually needs large amounts of data to train effective models

What’s machine learning?

  • less concerned with whether the data represents a random sample from a population
  • focus on predictive accuracy and generalization performance
  • interested in how well the model performs on new data

Scikit-learn

  • often shortened to sklearn
  • open-source machine learning library

Scikit-learn

  • This package includes:
    • Supervised Learning Algorithms: Linear Regression, Random Forests, Support Vector Machines (SVM), and Neural Networks
    • Unsupervised Learning Algorithms: K-means (clustering), PCA (dimensionality reduction)
    • Model Selection Tools: cross-validation, parameter tuning, and metrics evaluation
    • Data Preprocessing Tools: scalers, encoders, and feature selection
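
As a rough illustration of how these pieces fit together, here is a minimal sketch that chains a preprocessing tool (a scaler), a supervised model (linear regression), and a model selection tool (cross-validation). The synthetic dataset and all parameter values are assumptions chosen only for demonstration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption: stands in for a real dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Data preprocessing + supervised learning chained into one estimator
model = make_pipeline(StandardScaler(), LinearRegression())

# Model selection: 5-fold cross-validation, scored with R-squared
scores = cross_val_score(model, X, y, cv=5, scoring="r2")

print("R-squared per fold:", scores)
print("Mean R-squared:", scores.mean())
```

Putting the scaler inside the pipeline means it is re-fit on the training portion of each fold, so no information from the held-out fold leaks into preprocessing.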

Train-Test

Since the focus of machine learning is on model performance, we worry more about overfitting: a model that fits the training data well but performs poorly on new data. To guard against it, we (validate and) test on data that was not used to build (train) the model.

Testing (and validation) helps to assess how well the model generalizes to new, unseen data.

Train-Test

  • Training Set:
    • used to train the machine learning model
    • most of the data (70% to 80%, usually)
  • Testing Set:
    • used to evaluate the model’s performance
    • smaller portion of the data that the model has not seen during training
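
The split described above can be sketched with scikit-learn's `train_test_split`. The toy arrays and the 80/20 ratio are assumptions for illustration; `random_state` is fixed only so the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption): 50 samples with 2 features each, plus a target
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# Hold out 20% of the rows for testing; the remaining 80% are for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training set:", X_train.shape)
print("Testing set:", X_test.shape)
```

The model should be fit only on `X_train` and `y_train`; `X_test` and `y_test` are set aside until it is time to evaluate.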

Measuring Performance

Regression and classification use different types of performance measures.

Let’s start with regression:

  • Mean Squared Error (MSE): average of the squared differences between predicted and actual values

  • Root Mean Squared Error (RMSE): the square root of MSE, which makes the error easier to interpret because it is in the same units as the target variable

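  • R-squared (R2 or \(R^2\)): the proportion of variance in the dependent variable explained by the model. A value of 1 indicates a perfect fit and 0 means the model does no better than always predicting the mean; on test data it can be negative when the model does worse than that.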
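
The three regression measures above can be computed as in this minimal sketch. The small arrays are made-up values (an assumption for illustration), and RMSE is taken as the square root of MSE with NumPy.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values (assumption, for illustration only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

mse = mean_squared_error(y_true, y_pred)  # mean of squared errors
rmse = np.sqrt(mse)                       # same units as the target
r2 = r2_score(y_true, y_pred)             # share of variance explained

print("MSE:", mse)
print("RMSE:", rmse)
print("R-squared:", r2)

# Predictions that are worse than always guessing the mean give R-squared < 0
y_bad = np.array([9.0, 7.0, 5.0, 3.0])
r2_bad = r2_score(y_true, y_bad)
print("R-squared (bad model):", r2_bad)
```

Here the squared errors are 0.25, 0, 1, and 0.25, so MSE is 1.5 / 4 = 0.375. The total sum of squares around the mean (6) is 20, so R-squared is 1 - 1.5 / 20 = 0.925, while the reversed predictions give 1 - 80 / 20 = -3.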