Machine Learning (Prediction)

What’s machine learning?

modeling to make predictions or find patterns in data
often without explicit hypotheses
a flexible pattern-matcher looking for useful relationships in data
models learn patterns and relationships within the data
usually needs large amounts of data to train effective models

What’s machine learning?

less concerned with whether the data represents a random sample from a population
focus on predictive accuracy and generalization performance
interested in how well the model performs on new data

Scikit-learn

often shortened to sklearn
open-source machine learning library

Scikit-learn

This package includes:
- Supervised Learning Algorithms: Linear Regression, Random Forests, Support Vector Machines (SVM), and Neural Networks
- Unsupervised Learning Algorithms: K-means (clustering), PCA (dimensionality reduction)
- Model Selection Tools: cross-validation, parameter tuning, and metrics evaluation
- Data Preprocessing Tools: scalers, encoders, and feature selection
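
As a rough illustration of how these pieces fit together, the sketch below chains a scaler (preprocessing), a linear regression (supervised learning), and 5-fold cross-validation (model selection). The toy data, coefficients, and variable names are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 60 rows, 2 features, a nearly linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=60)

# Preprocessing + model bundled into one estimator.
model = make_pipeline(StandardScaler(), LinearRegression())

# 5-fold cross-validation; for regressors the default score is R^2.
scores = cross_val_score(model, X, y, cv=5)
print(scores)
```

Bundling the scaler and the model in one pipeline means the scaler is refit on each training fold, so no information from the held-out fold leaks into preprocessing.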

Train-Test

Since the focus of machine learning is on predictive performance, we worry more about overfitting, so we validate and test on data that was not used to build (train) the models.

Testing (and validation) helps to assess how well the model generalizes to new, unseen data.

Train-Test

Training Set:
- used to train the machine learning model
- most of the data (70% to 80%, usually)
Testing Set:
- used to evaluate the model’s performance
- smaller portion of the data that the model has not seen during training.
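
A minimal sketch of an 80/20 split using scikit-learn's train_test_split. The toy data and the random_state value are assumptions chosen only to make the example repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 3 features, one numeric target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Hold out 20% of the rows for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```

The model would then be fit on X_train and y_train only, with X_test and y_test kept aside until evaluation.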

Measuring Performance

Regression vs. Classification – different types of measures

Let’s start with regression:

Mean Squared Error (MSE): average of the squared differences between predicted and actual values
Root Mean Squared Error (RMSE): the square root of MSE, which is easier to interpret because it is in the same units as the target variable
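R-squared (R2 or \(R^2\)): the proportion of variance in the dependent variable explained by the model. It usually falls between 0 and 1, where 1 indicates a perfect fit, but on new data it can drop below 0 when the model predicts worse than simply using the mean.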
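
These three measures can be computed as in the sketch below, which uses scikit-learn's mean_squared_error and r2_score on four made-up values small enough to check by hand.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# Errors are 1, 0, -1, 0, so the squared errors average to 0.5.
mse = mean_squared_error(y_true, y_pred)

# RMSE is the square root of MSE (about 0.707), in the units of y.
rmse = np.sqrt(mse)

# R^2 = 1 - SS_res / SS_tot = 1 - 2 / 20 = 0.9 (the mean of y_true is 6).
r2 = r2_score(y_true, y_pred)

print(mse, rmse, r2)
```

In practice y_true would be the test-set targets and y_pred the model's predictions on the test-set features.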