Hydro Energy Cost Prediction using Stepwise Regression in ML
Accurately predicting the operation & maintenance (O&M) cost of hydropower plants is important for budgeting, tariff setting, and lifecycle planning. O&M costs depend on plant characteristics (installed capacity, head, turbine type), reservoir attributes (area, volume), and operational metrics (capacity factor, unit age). In this project, we’ll predict the annual O&M cost per MW of capacity using a stepwise linear regression approach to isolate the most significant cost drivers and build a transparent, parsimonious model that supports asset managers in planning and benchmarking.
Libraries Required
import pandas as pd                                          # Data loading & manipulation
import numpy as np                                           # Numerical operations
import statsmodels.api as sm                                 # Ordinary Least Squares regression
from sklearn.model_selection import train_test_split         # Train/test splitting
from sklearn.metrics import r2_score, mean_squared_error     # Evaluation metrics
import matplotlib.pyplot as plt                              # Visualization
Dataset
Step-by-Step Code Implementation
Data Loading & Initial Inspection
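The dataset combines GloHydroRes plant/reservoir attributes (capacity, head, turbine type, reservoir metrics) with FERC Form 1 O&M costs. The code assumes this merge has already been done and saved as hydro_plant_om_costs.csv. Calls to .head(), .info() and .describe() give a first check of completeness and value ranges.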
# Block 1: Load hydropower plant attributes and cost data
# Assumes the following sources have already been merged into one CSV:
# - GloHydroRes plant/reservoir attributes (plant_id, capacity_MW, head_m, reservoir_area_km2, reservoir_volume_m3, turbine_type)
# - FERC Form 1 O&M costs per plant and year (form1_cost_usd)
df = pd.read_csv("hydro_plant_om_costs.csv")
print(df.head())
df.info()  # prints its report directly and returns None, so no print() is needed
print(df.describe())
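The merged CSV loaded above is assumed to exist already. As a minimal sketch of how such a file might be assembled, the snippet below joins a plant-attribute table to a cost table on plant_id. The two in-memory tables, their values, and the report_year column are hypothetical stand-ins, not actual GloHydroRes or FERC Form 1 records, and real sources will likely need their own key matching first.

```python
import pandas as pd

# Hypothetical plant attributes (stand-in for a GloHydroRes extract)
plants = pd.DataFrame({
    "plant_id": [1, 2, 3],
    "capacity_MW": [50.0, 120.0, 15.0],
    "head_m": [30.0, 85.0, 12.0],
    "turbine_type": ["Kaplan", "Francis", "Kaplan"],
})

# Hypothetical annual O&M costs (stand-in for a FERC Form 1 extract)
costs = pd.DataFrame({
    "plant_id": [1, 2, 4],
    "report_year": [2020, 2020, 2020],
    "form1_cost_usd": [1_500_000.0, 2_400_000.0, 900_000.0],
})

# Inner join keeps only plants present in both sources;
# validate= raises if plant_id is duplicated in the attribute table
merged = plants.merge(costs, on="plant_id", how="inner", validate="one_to_many")
print(merged)
```

Plants 3 and 4 drop out here because each appears in only one table. An inner join makes that loss silent, so it is worth comparing row counts before and after the merge.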
Data Preprocessing
We drop records missing any key feature or cost. We normalise cost by installed capacity to derive cost_per_MW. Categorical turbine types are one‑hot encoded. We split into training and test sets (80/20) for unbiased evaluation.
# Block 2: Clean & encode
# Drop rows with missing critical fields
df = df.dropna(subset=[
'capacity_MW','head_m','reservoir_area_km2','reservoir_volume_m3',
'turbine_type','capacity_factor','unit_age_years','form1_cost_usd'
])
# Derive cost per MW for normalization
df['cost_per_MW'] = df['form1_cost_usd'] / df['capacity_MW']
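# One‑hot encode turbine type (e.g., Francis, Kaplan, Pelton)
# dtype=float keeps the dummy columns numeric; recent pandas versions default to bool,
# which makes statsmodels see an object-typed design matrix and raise an error
df_enc = pd.get_dummies(df, columns=['turbine_type'], drop_first=True, dtype=float)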
# Define predictors and target
X = df_enc[[
'capacity_MW','head_m','reservoir_area_km2','reservoir_volume_m3',
'capacity_factor','unit_age_years'
] + [col for col in df_enc.columns if col.startswith('turbine_type_')]]
y = df_enc['cost_per_MW']
# Split into training/test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
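To make the normalisation and encoding steps concrete, here is a small worked example on a hypothetical three-plant table (the values are invented for illustration). It shows the cost-per-MW arithmetic and which turbine level drop_first=True removes.

```python
import pandas as pd

# Hypothetical three-plant table, for illustration only
toy = pd.DataFrame({
    "capacity_MW": [50.0, 120.0, 15.0],
    "form1_cost_usd": [1_500_000.0, 2_400_000.0, 600_000.0],
    "turbine_type": ["Kaplan", "Francis", "Pelton"],
})

# Cost per MW: 30000, 20000 and 40000 USD/MW respectively
toy["cost_per_MW"] = toy["form1_cost_usd"] / toy["capacity_MW"]

# drop_first=True drops the alphabetically first level (Francis),
# which becomes the baseline absorbed by the regression intercept
toy_enc = pd.get_dummies(toy, columns=["turbine_type"], drop_first=True, dtype=float)
print(toy_enc[["cost_per_MW", "turbine_type_Kaplan", "turbine_type_Pelton"]])
```

Coefficients on the remaining dummy columns are then read as cost differences relative to the dropped baseline type.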
Stepwise Regression Function
Our stepwise_selection alternates between forward inclusion (adding excluded predictors with p < 0.01) and backward elimination (removing included predictors with p > 0.05) until convergence, yielding a parsimonious feature set.
# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Keep threshold_in < threshold_out so a variable that was just added
    # is not dropped again in the same pass
    included = list(initial_list) if initial_list is not None else []
    while True:
        changed = False
        # Forward step: fit one model per excluded predictor and record its p-value
        excluded = [col for col in X.columns if col not in included]
        if excluded:
            new_pvals = pd.Series(dtype=float, index=excluded)
            for col in excluded:
                model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
                new_pvals[col] = model.pvalues[col]
            best_pval = new_pvals.min()
            if best_pval < threshold_in:
                best_var = new_pvals.idxmin()
                included.append(best_var)
                changed = True
                if verbose:
                    print(f"Add {best_var:25} p-value {best_pval:.4f}")
        # Backward step: refit on the current set and drop the weakest predictor
        if included:
            model = sm.OLS(y, sm.add_constant(X[included])).fit()
            pvals = model.pvalues.iloc[1:]  # exclude intercept
            worst_pval = pvals.max()
            if worst_pval > threshold_out:
                worst_var = pvals.idxmax()
                included.remove(worst_var)
                changed = True
                if verbose:
                    print(f"Drop {worst_var:25} p-value {worst_pval:.4f}")
        if not changed:
            break
    return included
Model Building & Evaluation
We fit an OLS regression on the selected predictors using statsmodels. The .summary() output reports coefficients, p-values, R², and diagnostic statistics (AIC, F‑statistic), highlighting each variable’s significance.
We predict on the held‑out test set and compute R² (variance explained) and RMSE (root‑mean‑square error) to quantify predictive accuracy on unseen plants.
# Block 4: Select features
selected = stepwise_selection(X_train, y_train)
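# Fit final OLS model (has_constant='add' always inserts the intercept column,
# even if a selected predictor happens to be constant within one split)
X_train_sel = sm.add_constant(X_train[selected], has_constant='add')
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())
# Predict on test set
X_test_sel = sm.add_constant(X_test[selected], has_constant='add')
y_pred = model.predict(X_test_sel)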
# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
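For readers who want to see what the two metrics measure, the following sketch computes them by hand on four invented values: R² as one minus the ratio of residual to total sum of squares, and RMSE as the square root of the mean squared error. The numbers are illustrative only and are not results from the hydropower data.

```python
import numpy as np

# Invented actual and predicted costs per MW, for illustration only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 390.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares: 400
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares: 50000
r2 = 1.0 - ss_res / ss_tot                      # 0.992
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))  # 10.0

print(r2, rmse)
```

RMSE is expressed in the target's own units (here, USD per MW), so it reads directly as a typical prediction error, while R² is unitless.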
Residual Diagnostics
A residual vs. predicted plot checks for heteroscedasticity or patterns, validating linear model assumptions and identifying potential outliers.
# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Cost per MW")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost per MW")
plt.show()
Summary
By applying stepwise regression to a combined hydropower dataset, we can isolate the drivers of annual O&M cost per MW that are statistically significant in the data (candidates include capacity factor, head, unit age, and turbine type) while pruning less informative features. The resulting linear model is easy to interpret, and its test‑set R² and RMSE indicate how well it generalises to unseen plants, giving asset managers a transparent tool to benchmark costs and plan maintenance budgets.
