Student Retention Cost Prediction using Stepwise Regression in ML
Universities invest heavily in retention programs, such as tutoring, counselling, and engagement activities, to reduce dropout rates and improve student outcomes. Accurately forecasting the cost of retaining at‐risk students enables more efficient budget allocation and program design.
In this student retention cost prediction ML project, we will predict the per‑student retention cost based on demographic, academic, and socio‑economic factors (e.g., GPA, financial aid status, first‐generation status, engagement metrics).
By employing stepwise regression, we’ll isolate the most significant drivers of retention cost and build an interpretable linear model that balances simplicity with predictive accuracy.
Libraries Required
import pandas as pd                                          # Data loading & manipulation
import numpy as np                                           # Numerical operations
import statsmodels.api as sm                                 # OLS regression
from sklearn.model_selection import train_test_split         # Train/test split
from sklearn.metrics import r2_score, mean_squared_error     # Evaluation metrics
import matplotlib.pyplot as plt                              # Visualization
Dataset
Higher Education Predictors of Student Retention
Step-by-Step Code Implementation
Data Loading & Initial Inspection
- We import a retention dataset containing student academic and socio‑economic variables, along with the Retention_Cost target. Initial inspection (.info(), .describe()) verifies data types and basic statistics.
- We assume the CSV contains columns such as: Student_ID, GPA, Credits_Completed, Financial_Aid (yes/no), First_Generation (yes/no), Engagement_Score, and the target Retention_Cost (USD per student).
# Block 1: Load dataset
# (Download from Kaggle, then provide path)
df = pd.read_csv("higher_education_student_retention.csv")
# Inspect structure
print(df.head())
print(df.info())
print(df.describe())
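Because the rest of the walkthrough depends on the assumed column names, it can help to confirm they are present before going further. The following is a minimal sketch: the helper `missing_columns` and the tiny `demo` frame (with made-up values) are hypothetical and only illustrate the check.

```python
import pandas as pd

# Column names assumed by this tutorial; adjust them to match your CSV.
EXPECTED_COLUMNS = [
    "Student_ID", "GPA", "Credits_Completed", "Financial_Aid",
    "First_Generation", "Engagement_Score", "Retention_Cost",
]

def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df (hypothetical helper)."""
    return [col for col in expected if col not in df.columns]

# Tiny illustrative frame with made-up values; it lacks two of the columns.
demo = pd.DataFrame({
    "Student_ID": [1, 2],
    "GPA": [3.1, 2.4],
    "Credits_Completed": [60, 45],
    "Financial_Aid": ["yes", "no"],
    "Retention_Cost": [1200.0, 1850.0],
})

print(missing_columns(demo))
```

Calling `missing_columns(df)` on the loaded data should return an empty list if the file matches the assumed layout.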
Data Preprocessing
This step one‑hot encodes the binary categorical columns (Financial_Aid, First_Generation), drops any records with missing values, and separates the predictors (X) from the response (y). An 80/20 train–test split then holds out data for an unbiased evaluation.
# Block 2: Encode categorical features & clean data
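# Drop rows with missing values first, so that a missing category
# is not silently encoded as the baseline level
df_clean = df.dropna()
# dtype=float keeps the dummy columns numeric for statsmodels
df_enc = pd.get_dummies(
    df_clean,
    columns=["Financial_Aid", "First_Generation"],
    drop_first=True,
    dtype=float
)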
# Define predictors and target
X = df_enc.drop(["Student_ID", "Retention_Cost"], axis=1)
y = df_enc["Retention_Cost"]
# Split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
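To see what the encoding step produces, here is a small sketch on a made-up three-row frame (the values are illustrative only). With `drop_first=True`, each yes/no column collapses into a single indicator for "yes". Passing `dtype=float` is a choice made in this sketch: pandas 2.0 and later return boolean dummies by default, and numeric columns are easier to hand to statsmodels.

```python
import pandas as pd

# Made-up rows, used only to illustrate the encoding
demo = pd.DataFrame({
    "GPA": [3.2, 2.1, 3.8],
    "Financial_Aid": ["yes", "no", "yes"],
    "First_Generation": ["no", "no", "yes"],
})

encoded = pd.get_dummies(
    demo,
    columns=["Financial_Aid", "First_Generation"],
    drop_first=True,   # drop the first level ("no"), keep an indicator for "yes"
    dtype=float,       # numeric 0.0/1.0 instead of booleans
)

print(encoded)
```

The dropped level ("no") becomes the baseline, so each dummy coefficient in the later regression reads as the cost difference relative to students without that attribute.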
Stepwise Regression Function
The stepwise_selection function alternates a forward step (adding the excluded predictor with the lowest p‑value, provided it is below 0.01) and a backward step (removing the included predictor with the highest p‑value, provided it is above 0.05) until neither step changes the model, yielding a parsimonious set of cost drivers.
# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Keep threshold_in below threshold_out so a variable cannot be added and dropped endlessly
    included = list(initial_list) if initial_list else []
    while True:
        changed = False
        # Forward step: consider adding each excluded predictor
        excluded = [col for col in X.columns if col not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add {best_var:30} p-value {best_pval:.6f}")
        # Backward step: consider removing each included predictor
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals.max()
        if worst_pval > threshold_out:
            worst_var = pvals.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:30} p-value {worst_pval:.6f}")
        if not changed:
            break
    return included
Model Building & Evaluation
We fit an Ordinary Least Squares regression using statsmodels on the selected features. The .summary() output reports coefficient estimates, p‑values, R², and diagnostic statistics, clarifying the direction and significance of each predictor.
Predictions on the held‑out test set produce R² (explained variance) and RMSE (prediction error), quantifying how well the model generalises to new students.
# Block 4: Feature selection
selected_features = stepwise_selection(X_train, y_train)
# Fit the final OLS model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())
# Predict on test set
X_test_sel = sm.add_constant(X_test[selected_features])
y_pred = model.predict(X_test_sel)
# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
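For readers who want to see what the two metrics compute, this short sketch reproduces them by hand on four made-up values and compares the results with scikit-learn. R² is one minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared residual.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Made-up actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# R² = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

# RMSE = sqrt(mean of squared residuals), in the same units as the target
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

print("Manual R²:", r2_manual, "| sklearn R²:", r2_score(y_true, y_hat))
print("Manual RMSE:", rmse_manual,
      "| sklearn RMSE:", np.sqrt(mean_squared_error(y_true, y_hat)))
```

Because RMSE is expressed in the target's units, a value here would read directly as a typical prediction error in USD per student.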
Residual Diagnostics
Plotting residuals against predicted costs helps reveal non‑random patterns, heteroscedasticity, or outliers, any of which would call the OLS assumptions of linearity and constant error variance into question.
# Block 5: Plot residuals to check assumptions
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Retention Cost (USD)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Retention Cost")
plt.show()
Summary
By applying stepwise regression to student retention data, we isolate the factors (potentially GPA, engagement score, and financial aid status) that are most strongly associated with per‑student retention cost. The resulting linear model is highly interpretable, and its test‑set R² and RMSE show how reliably it predicts costs for new students. With that evidence, university administrators can forecast retention budgets and target interventions where they are likely to deliver the greatest return.
