Student Retention Cost Prediction using Stepwise Regression in ML

Universities invest heavily in retention programs (tutoring, counselling, and engagement activities) to reduce dropout rates and improve student outcomes. Accurately forecasting the cost of retaining at-risk students enables more efficient budget allocation and program design.

In this project, we predict the per-student retention cost from demographic, academic, and socio-economic factors (e.g., GPA, financial aid status, first-generation status, engagement metrics).

Using stepwise regression, we isolate the statistically significant drivers of retention cost and build an interpretable linear model that balances simplicity with predictive accuracy.

Libraries Required

import pandas as pd               # Data loading & manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # OLS regression  
from sklearn.model_selection import train_test_split   # Train/test split  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation metrics  
import matplotlib.pyplot as plt   # Visualization

Dataset

Higher Education Predictors of Student Retention

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We import a retention dataset containing student academic and socio‑economic variables, along with the Retention_Cost target. Initial inspection (.info(), .describe()) verifies data types and basic statistics.
We assume the CSV contains columns such as: Student_ID, GPA, Credits_Completed, Financial_Aid (yes/no), First_Generation (yes/no), Engagement_Score, and the target Retention_Cost (USD per student).
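
Because these column names are an assumption, it can help to confirm them before going further. The following is a minimal sketch: EXPECTED_COLUMNS and missing_columns are hypothetical helpers introduced here, and the demo frame is made-up illustration data, not part of the dataset.

```python
import pandas as pd

# Hypothetical list of column names, taken from the assumption stated above
EXPECTED_COLUMNS = [
    "Student_ID", "GPA", "Credits_Completed", "Financial_Aid",
    "First_Generation", "Engagement_Score", "Retention_Cost",
]

def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df."""
    return [col for col in expected if col not in df.columns]

# Tiny made-up frame for illustration only (the target column is left out on purpose)
demo = pd.DataFrame({
    "Student_ID": [1, 2],
    "GPA": [3.1, 2.4],
    "Credits_Completed": [60, 45],
    "Financial_Aid": ["yes", "no"],
    "First_Generation": ["no", "yes"],
    "Engagement_Score": [72.0, 55.5],
})

print(missing_columns(demo))  # ['Retention_Cost']
```

Running the same check on the loaded DataFrame would show early whether any column used in later blocks needs renaming.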

# Block 1: Load dataset
# (Download from Kaggle, then provide path)
df = pd.read_csv("higher_education_student_retention.csv")

# Inspect structure
print(df.head())
df.info()  # info() prints its own report and returns None, so no print() is needed
print(df.describe())

Data Preprocessing

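Preprocessing has three parts. Rows with missing values are dropped. The binary categorical columns (Financial_Aid, First_Generation) are one-hot encoded with the first level dropped, so each becomes a single indicator column. Finally, the predictors (X) are separated from the response (y), and an 80/20 train-test split holds out data for evaluating the model on students it has not seen.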

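# Block 2: Clean data & encode categorical features
# Drop rows with missing values first; get_dummies would otherwise
# silently encode a missing category as all zeros
df_clean = df.dropna()

df_enc = pd.get_dummies(
    df_clean,
    columns=["Financial_Aid", "First_Generation"],
    drop_first=True,
    dtype=int  # numeric 0/1 columns; boolean dummies can make statsmodels reject X as object dtype
)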

# Define predictors and target
X = df_enc.drop(["Student_ID", "Retention_Cost"], axis=1)
y = df_enc["Retention_Cost"]

# Split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
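
To make the effect of drop_first=True concrete, here is a small sketch on a made-up three-row frame (the values are illustrative only). The dtype=int argument is an assumption added here so the indicators come out as numeric 0/1 rather than booleans.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the assumed schema
toy = pd.DataFrame({
    "GPA": [3.2, 2.8, 3.9],
    "Financial_Aid": ["yes", "no", "yes"],
    "First_Generation": ["no", "no", "yes"],
})

toy_enc = pd.get_dummies(
    toy,
    columns=["Financial_Aid", "First_Generation"],
    drop_first=True,  # drops the first level ("no"), keeping one indicator per column
    dtype=int,
)

print(toy_enc.columns.tolist())
# ['GPA', 'Financial_Aid_yes', 'First_Generation_yes']
print(toy_enc["Financial_Aid_yes"].tolist())
# [1, 0, 1]
```

Each yes/no column collapses to a single "_yes" indicator, with "no" as the reference level absorbed by the intercept.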

Stepwise Regression Function

The stepwise_selection function alternates two steps. The forward step adds the excluded predictor with the smallest p-value, provided it is below 0.01. The backward step removes the included predictor with the largest p-value, provided it exceeds 0.05. The loop stops when neither step changes the model, yielding a parsimonious set of cost drivers.

# Block 3: Forward–backward stepwise selection
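def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Start from a copy of the given features (None means start empty)
    included = list(initial_list) if initial_list is not None else []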
    while True:
        changed = False

        # Forward step: consider adding each excluded predictor
        # Keep the original column order so ties are broken deterministically
        excluded = [col for col in X.columns if col not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:30} p-value {best_pval:.6f}")

        # Backward step: consider removing each included predictor
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals.max()
        if worst_pval > threshold_out:
            worst_var = pvals.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:30} p-value {worst_pval:.6f}")

        if not changed:
            break

    return included

Model Building & Evaluation

We fit an Ordinary Least Squares regression using statsmodels on the selected features. The .summary() output reports coefficient estimates, p‑values, R², and diagnostic statistics, clarifying the direction and significance of each predictor.

Predictions on the held‑out test set produce R² (explained variance) and RMSE (prediction error), quantifying how well the model generalises to new students.

# Block 4: Feature selection
selected_features = stepwise_selection(X_train, y_train)

# Fit the final OLS model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

# Predict on test set
X_test_sel = sm.add_constant(X_test[selected_features])
y_pred = model.predict(X_test_sel)

# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
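
For readers who want to see what the two metrics measure, the following sketch computes them by hand with NumPy on four made-up values (the numbers are illustrative, not results from the dataset). R² is one minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared residual.

```python
import numpy as np

# Made-up actual and predicted costs (USD), for illustration only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 390.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares: 4 * 10^2 = 400
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean: 50000

r2 = 1 - ss_res / ss_tot                        # 1 - 400 / 50000
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))  # sqrt(400 / 4)

print(round(r2, 3))    # 0.992
print(round(rmse, 3))  # 10.0
```

RMSE is in the same units as the target, so here it reads as a typical error of about 10 USD per student.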

Residual Diagnostics

Plotting residuals versus predicted costs checks for non‑random patterns, heteroscedasticity, or outliers—validating core OLS assumptions.

# Block 5: Plot residuals to check assumptions
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Retention Cost (USD)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Retention Cost")
plt.show()

Summary

By applying stepwise regression to student retention data, we identify which factors (candidates include GPA, engagement score, and financial aid status) are most strongly associated with per-student retention cost. The resulting linear model is easy to interpret, and its test-set R² and RMSE indicate how far its forecasts can be trusted. Together, these help university administrators plan retention budgets and target interventions where they are likely to deliver the greatest return.