Machine Learning (Prediction)
What’s machine learning?
- modeling to make predictions or find patterns in data
- often without explicit hypotheses
- a flexible pattern-matcher looking for useful relationships in data
- models learn patterns and relationships within the data
- usually needs large amounts of data to train effective models
- less concerned than classical statistical inference with whether the data are a random sample from a population
- focus on predictive accuracy and generalization performance
- interested in how well the model performs on new data
Scikit-learn
- often shortened to sklearn
- open-source machine learning library
- This package includes:
  - Supervised Learning Algorithms: Linear Regression, Random Forests, Support Vector Machines (SVM), and Neural Networks
  - Unsupervised Learning Algorithms: K-means (clustering), PCA (dimensionality reduction)
  - Model Selection Tools: cross-validation, hyperparameter tuning, and evaluation metrics
  - Data Preprocessing Tools: scalers, encoders, and feature selection
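As a rough illustration of how these pieces fit together, the sketch below chains a preprocessing step (a scaler) with a supervised model and scores it with cross-validation. The iris dataset, the choice of logistic regression, and the 5 folds are assumptions made for the example, not part of the original slides.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset, chosen only for illustration.
X, y = load_iris(return_X_y=True)

# Preprocessing tool (scaler) and supervised model chained into one estimator.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Model selection tool: 5-fold cross-validation.
# For classifiers the default score is accuracy.
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = scores.mean()

print("accuracy per fold:", scores.round(3))
print("mean accuracy:", round(mean_accuracy, 3))
```

Putting the scaler inside the pipeline means it is refit on the training folds only, so no information from the held-out fold leaks into preprocessing.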
Train-Test
Since the focus of machine learning is on predictive performance, overfitting is a central concern, so we (validate and) test the model on data that was not used to build (train) it.
Testing (and validation) helps to assess how well the model generalizes to new, unseen data.
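A minimal sketch of why held-out data matters: an unconstrained decision tree can memorize its training data, so its training accuracy says little about new data. The synthetic dataset, the 10% label noise (`flip_y=0.1`), and the choice of a decision tree are assumptions picked to make the effect visible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (illustrative assumption).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)

# Hold out 20% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# No depth limit, so the tree is free to memorize the training set.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_accuracy = tree.score(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)

print("train accuracy:", train_accuracy)
print("test accuracy:", test_accuracy)
```

The gap between the two numbers is the symptom of overfitting that testing on unseen data is meant to expose.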
- Training Set:
  - used to train the machine learning model
  - most of the data (70% to 80%, usually)
- Testing Set:
  - used to evaluate the model’s performance
  - smaller portion of the data (the remaining 20% to 30%) that the model has not seen during training
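The split described above can be sketched with scikit-learn's `train_test_split`. The toy arrays, the 80/20 ratio, the `random_state` value, and the optional validation split carved out of the training set are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 2 features, balanced binary labels.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# 80% training, 20% testing; stratify keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Optional validation set: 25% of the training rows (20% of the full data).
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42, stratify=y_train
)

print("train:", X_train.shape, "test:", X_test.shape)
print("fit:", X_fit.shape, "validation:", X_val.shape)
```

Fixing `random_state` makes the split reproducible; the test set is then left untouched until the final evaluation.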