random_state parameter: an integer seed for the random number generator; setting it ensures reproducible splits
The choice of random_state can impact the performance of your model, especially if the dataset is small or if the data points are not uniformly distributed.
Different splits can lead to different training and testing sets, which in turn can affect the model’s performance metrics.
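As a quick sketch (the toy X and y below are assumptions for illustration), passing the same random_state to train_test_split reproduces the identical split on every run, while a different seed generally yields a different one:

```python
from sklearn.model_selection import train_test_split

# Toy data standing in for a real dataset (assumption for illustration)
X = list(range(10))
y = [0, 1] * 5

# Same seed -> identical split every run
X_train_a, X_test_a, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
X_train_b, X_test_b, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_test_a == X_test_b)  # the two test sets are identical

# A different seed generally produces a different split
X_train_c, X_test_c, _, _ = train_test_split(X, y, test_size=0.3, random_state=7)
```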
a mathematical function that measures the difference between the model's predicted outputs and the actual target values
the goal during training is to minimize this cost in order to improve accuracy
Loss function: measure of error for a single training example
Cost function: average of the loss over the entire training set
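A minimal sketch of the loss/cost distinction, using squared error and hypothetical predictions and targets (all values are assumptions for illustration):

```python
import numpy as np

# Hypothetical targets and model predictions
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.0, 1.0])

# Loss: squared error for each single training example
losses = (y_pred - y_true) ** 2  # one value per example

# Cost: average of the loss over the entire training set (here, MSE)
cost = losses.mean()
print(losses)
print(cost)
```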
Gradient descent
used to minimize a function’s cost/loss
finds a local minimum of a differentiable function
start at a random point
find the direction that descends (the negative gradient)
step size: can be larger at first, smaller as it descends
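The steps above can be sketched on an assumed toy function f(x) = (x - 3)^2, whose derivative f'(x) = 2(x - 3) points uphill, so each update steps the opposite way:

```python
# Minimal gradient descent sketch on f(x) = (x - 3)^2 (assumed toy function)
def gradient_descent(lr=0.1, steps=100):
    x = 10.0  # start at an arbitrary point
    for _ in range(steps):
        grad = 2 * (x - 3)   # slope of f at the current point
        x -= lr * grad       # step downhill, scaled by the learning rate (step size)
    return x

print(gradient_descent())  # converges near the minimum at x = 3
```

The learning rate plays the role of the step size: too large and the updates overshoot, too small and convergence is slow.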
Feature scaling
Puts all features on a comparable level, so the model can properly assess their relative importance
Ensures fair comparison between features, especially when features have very different scales (like age [0-100] vs. income [$0-$1,000,000]); otherwise the model may incorrectly prioritize features with larger values
Helps avoid numerical overflow or underflow issues in computations caused by very large or very small values
Common scaling methods include:
Min-Max Scaling: Scales features to a fixed range, usually [0,1]
Standard Scaling: Transforms features to have zero mean and unit variance
Robust Scaling: Uses statistics that are robust to outliers (like median and quartiles)
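A side-by-side sketch of the three methods on a single toy column with one large outlier (the values are assumptions for illustration), showing how each scaler transforms it:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Toy feature column with one large outlier (values assumed for illustration)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

mm = MinMaxScaler().fit_transform(X).ravel()      # squeezed into [0, 1]
st = StandardScaler().fit_transform(X).ravel()    # zero mean, unit variance
rb = RobustScaler().fit_transform(X).ravel()      # centered on median, scaled by IQR

print(mm)
print(st)
print(rb)
```

Note how the outlier compresses the min-max-scaled values toward 0, while robust scaling, based on the median and quartiles, keeps the bulk of the data well spread out.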
Scaling features
Let’s start with StandardScaler
from sklearn.preprocessing import StandardScaler
We then scale our features for both training and testing data sets.
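A sketch of that step (X_train and X_test below are placeholder arrays, not the document's actual data): the scaler is fit on the training set only, and the same fitted transform is then applied to the test set, so no test-set information leaks into training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder training/testing features (assumed for illustration)
X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test = np.array([[1.5, 300.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training-set statistics
```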
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
We make predictions based on the test features
y_pred = my_model.predict(X_test_scaled)
Then we call the appropriate functions on our observed and predicted targets:
# accuracy -- proportion of all messages classified correctly
print(accuracy_score(y_test, y_pred))
# precision -- of the messages flagged as spam, how many actually were spam
# (True Positives) / (True Positives + False Positives)
print(precision_score(y_test, y_pred))
# recall -- of the actual spam messages, how many were caught
# (True Positives) / (True Positives + False Negatives)
print(recall_score(y_test, y_pred))
# f1 -- harmonic mean of precision and recall
print(f1_score(y_test, y_pred))