Introduction to Scikit-learn

Train-Test with sklearn

We first load the method for splitting the data:

from sklearn.model_selection import train_test_split

Then, after reading the data in, we call train_test_split():

train_test_split(X, y, test_size=0.3)

Documentation for train_test_split

Train-Test with sklearn

  • random_state parameter: an integer seed for the random number generator; fixing it makes the split reproducible

The choice of random_state can impact the performance of your model, especially if the dataset is small or if the data points are not uniformly distributed.

Different splits can lead to different training and testing sets, which in turn can affect the model’s performance metrics.
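A minimal sketch of the reproducibility point, using made-up toy arrays rather than the course data (the names X_toy, y_toy and the _a/_b suffixes are illustrative only). Two calls with the same random_state should return the same split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up for illustration): 10 samples, 1 feature
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.arange(10) % 2

# Two calls with the same seed
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123
)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123
)

# The test sets contain the same rows in the same order
same_split = np.array_equal(X_test_a, X_test_b)
print(same_split)

# test_size=0.3 on 10 samples: 7 training rows and 3 test rows
print(X_train_a.shape, X_test_a.shape)
```

Changing random_state to another integer will generally give a different split, which is why metrics can shift between runs.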

Example

import random
# No seed: the result can differ from run to run
print(random.randint(1, 10))
1
import random
# Fixed seed: the same result on every run
random.seed(123)
print(random.randint(1, 10))
1

Train-Test with sklearn

Let’s use the clean email data for our first model (we will be running regression):

import pandas as pd
data = pd.read_csv("data/clean_email.csv")
X = data[["to_multiple"]]
y = data["spam"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

Linear Regression

We will be using LinearRegression from sklearn.linear_model

From the implementation point of view, this is just plain Ordinary Least Squares (OLS)

from sklearn.linear_model import LinearRegression

Then we create and fit the model to our data

model = LinearRegression()
model.fit(X_train, y_train)
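After fitting, the estimated parameters can be inspected through the coef_ and intercept_ attributes. A small sketch on made-up points that lie exactly on y = 2x + 1 (the names X_toy, y_toy and toy_model are illustrative, not part of the email example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (made up): points that lie exactly on y = 2x + 1
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)

slope = toy_model.coef_[0]        # one coefficient per feature
intercept = toy_model.intercept_  # fitted constant term
print(slope, intercept)
```

Because the toy data has no noise, the fit should recover a slope close to 2 and an intercept close to 1.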

Assessing model – linear regression

We make predictions based on the test features

y_pred = model.predict(X_test)

Then we evaluate the model. Let’s calculate the mean squared error (MSE) and the coefficient of determination (R²).

from sklearn.metrics import mean_squared_error, r2_score

Then we call the appropriate functions on our observed and predicted targets:

print(mean_squared_error(y_test, y_pred))
print(r2_score(y_test, y_pred))
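To see what these two numbers mean, the sketch below recomputes them by hand on a few made-up values (y_obs and y_hat are illustrative, not the email data) and prints them next to the sklearn results. MSE is the residual sum of squares divided by the number of observations; R² is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up observed and predicted values
y_obs = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_obs - y_hat) ** 2)         # residual sum of squares
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)  # total sum of squares

mse_manual = ss_res / len(y_obs)
r2_manual = 1.0 - ss_res / ss_tot

# Each line prints the manual value followed by the sklearn value
print(mse_manual, mean_squared_error(y_obs, y_hat))
print(r2_manual, r2_score(y_obs, y_hat))
```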

Cost/Loss Function

  • a mathematical function that measures the difference between predicted outputs by the model and the actual target values
  • the goal is minimizing cost to improve accuracy during training

Loss function: measure of error for a single training example

Cost function: average of the loss over the entire training set
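The loss/cost distinction can be sketched with log loss (binary cross-entropy). Log loss is chosen here only as an illustration, since the slides do not name a specific loss; the labels and probabilities are made up.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up true labels and predicted probabilities of the positive class
y_true = np.array([1, 0, 1, 0])
p_pos = np.array([0.9, 0.2, 0.6, 0.4])

# Loss: one value per training example
per_example_loss = -(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos))

# Cost: the average of the per-example losses
cost = per_example_loss.mean()

print(per_example_loss)
print(cost, log_loss(y_true, p_pos))  # manual cost next to the sklearn value
```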

Gradient descent

  • used to minimize a cost/loss function
  • finds a local minimum of a differentiable function
  • start at a random point
  • find direction that descends
  • step size: can be larger at first, smaller as it descends
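The steps above can be sketched for a one-feature linear model with a mean-squared-error cost. This is a bare-bones illustration on made-up data, not how sklearn fits LinearRegression; the starting point is fixed at zero and the step size is kept constant to keep the sketch short.

```python
import numpy as np

# Toy data (made up): y = 2x + 1 exactly
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

def cost(w, b):
    """Mean squared error of the line w * x + b on the toy data."""
    return np.mean((w * x + b - y) ** 2)

w, b = 0.0, 0.0        # starting point (often chosen at random)
learning_rate = 0.05   # step size, constant in this sketch

for _ in range(2000):
    error = w * x + b - y
    grad_w = 2.0 * np.mean(error * x)  # partial derivative of the cost w.r.t. w
    grad_b = 2.0 * np.mean(error)      # partial derivative of the cost w.r.t. b
    # Step against the gradient, i.e. in the direction that descends
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b, cost(w, b))
```

With these settings the parameters should approach w = 2 and b = 1, where the cost is zero. A step size that is too large can make the updates diverge instead.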

Feature scaling

  • Puts all features on a comparable level, so the model can properly assess their relative importance
  • Ensures fair comparison between features, especially when they have very different scales (like age [0-100] vs. income [$0-$1,000,000]); otherwise the model may incorrectly prioritize features with larger values
  • Helps avoid numerical overflow or underflow in computations, which very large or very small numbers can cause

Feature scaling

Common scaling methods include:

  • Min-Max Scaling: Scales features to a fixed range, usually [0,1]
  • Standard Scaling: Transforms features to have zero mean and unit variance
  • Robust Scaling: Uses statistics that are robust to outliers (like median and quartiles)
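A small sketch comparing the three methods on a made-up column that contains one outlier (X_demo is illustrative). The sklearn classes used are MinMaxScaler, StandardScaler and RobustScaler, all from sklearn.preprocessing.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Toy feature (made up): a single column with one outlier
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

X_minmax = MinMaxScaler().fit_transform(X_demo)      # rescaled to [0, 1]
X_standard = StandardScaler().fit_transform(X_demo)  # zero mean, unit variance
X_robust = RobustScaler().fit_transform(X_demo)      # centered on the median, scaled by the IQR

print(X_minmax.ravel())
print(X_standard.ravel())
print(X_robust.ravel())
```

With min-max scaling the outlier squeezes the other four values close to 0, while robust scaling keeps them evenly spread around 0.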

Scaling features

Let’s start with StandardScaler

from sklearn.preprocessing import StandardScaler

We then scale our features for both the training and testing data sets. The scaler is fit on the training data only and then applied to the test data.

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Logistic Regression

We will be using LogisticRegression from sklearn.linear_model

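From the implementation point of view, this is not Ordinary Least Squares (OLS): the model is fit by minimizing a regularized log loss with an iterative solver (by default an L2 penalty and the lbfgs solver)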

from sklearn.linear_model import LogisticRegression

Then we create and fit the model to our data

my_model = LogisticRegression()
my_model.fit(X_train_scaled, y_train)
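A fitted classifier offers both predict(), which returns class labels, and predict_proba(), which returns one probability per class. A minimal sketch on made-up, clearly separated data (X_toy, y_toy and toy_clf are illustrative names, not the email data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (made up): small x values are class 0, large x values are class 1
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_clf = LogisticRegression()
toy_clf.fit(X_toy, y_toy)

X_new = np.array([[0.0], [5.0]])
labels = toy_clf.predict(X_new)       # predicted class labels
proba = toy_clf.predict_proba(X_new)  # columns: P(class 0), P(class 1)

print(labels)
print(proba)
```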

Assessing Classification

  • Accuracy – correct predictions / total predictions
  • Precision – (True Positives) / (True Positives + False Positives)
  • Recall – (True Positives) / (True Positives + False Negatives)
  • F1 Score – 2 * (Precision * Recall) / (Precision + Recall)
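The four formulas can be checked by hand. The counts below are made up for illustration (40 true positives, 10 false positives, 20 false negatives, 130 true negatives):

```python
# Made-up counts for a spam classifier
tp, fp, fn, tn = 40, 10, 20, 130

accuracy = (tp + tn) / (tp + fp + fn + tn)           # 170 / 200
precision = tp / (tp + fp)                           # 40 / 50
recall = tp / (tp + fn)                              # 40 / 60
f1 = 2 * (precision * recall) / (precision + recall)

print(accuracy, precision, recall, f1)
```

Here accuracy looks high (0.85) even though a third of the spam is missed (recall of about 0.67), which is why accuracy alone can be misleading when classes are imbalanced.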

Assessing Classification

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

We make predictions based on the test features

y_pred = my_model.predict(X_test_scaled)

Then we call the appropriate functions on our observed and predicted targets:

print(accuracy_score(y_test, y_pred))

# precision -- of the messages flagged as spam, what fraction really is spam
# (True Positives) / (True Positives + False Positives)
print(precision_score(y_test, y_pred))

# recall -- of the actual spam messages, what fraction was caught
# (True Positives) / (True Positives + False Negatives)
print(recall_score(y_test, y_pred))

# harmonic mean of precision and recall
print(f1_score(y_test, y_pred))

Confusion Matrix

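# pair each observed label with its prediction
table = pd.DataFrame({"truth": y_test, "prediction": y_pred})
table.to_csv("truth-predictions.csv", index=False)
# count each (truth, prediction) combination: the cells of the confusion matrix
print(table.value_counts().reset_index())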
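As an alternative to the pandas table, sklearn provides confusion_matrix in sklearn.metrics. A sketch on made-up labels (y_true and y_hat are illustrative); for binary 0/1 labels, rows are the true classes and columns the predicted classes:

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions (1 = spam, 0 = not spam)
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_hat = [0, 1, 0, 1, 0, 1, 0, 1]

cm = confusion_matrix(y_true, y_hat)

# For binary labels the flattened order is TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)
```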

Cross-Validation

from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(my_model, X_train_scaled, y_train, cv=10)
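cross_val_score returns one score per fold (accuracy by default for a classifier), so the results are usually summarized by their mean and spread. The sketch below uses synthetic data from make_classification instead of the email data, and, as a variation on the code above, puts the scaler inside a pipeline so that it is re-fit on the training part of every fold. The names X_demo, y_demo and pipeline are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (not the email data): 200 samples, 4 features
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=123)

# Scaling and model in one estimator: the scaler is fit on each training fold only
pipeline = make_pipeline(StandardScaler(), LogisticRegression())

cv_scores = cross_val_score(pipeline, X_demo, y_demo, cv=10)

print(cv_scores)                         # one accuracy value per fold
print(cv_scores.mean(), cv_scores.std())
```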