Hydro Energy Cost Prediction using Stepwise Regression in ML

Accurately predicting the operation & maintenance (O&M) cost of hydropower plants is important for budgeting, tariff setting, and lifecycle planning. O&M costs depend on plant characteristics (installed capacity, head, turbine type), reservoir attributes (area, volume), and operational metrics (capacity factor, unit age). In this project, we’ll predict the annual O&M cost per MW of capacity using a stepwise linear regression approach to isolate the most significant cost drivers and build a transparent, parsimonious model that supports asset managers in planning and benchmarking.

Libraries Required

import pandas as pd               # Data loading & manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # Ordinary Least Squares regression  
from sklearn.model_selection import train_test_split   # Train/test splitting  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation metrics  
import matplotlib.pyplot as plt   # Visualization

Dataset

US Hydropower Dataset

Global Power Plant Database

Step-by-Step Code Implementation

Data Loading & Initial Inspection

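The input file is assumed to be a pre-merged table that joins GloHydroRes plant/reservoir attributes (capacity, head, turbine type, reservoir metrics) with FERC Form 1 O&M costs. An initial look at .head(), .info(), and .describe() shows column types, missing-value counts, and value ranges before any cleaning.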

# Block 1: Load hydropower plant attributes and cost data
# Assume you have merged:
#  - GloHydroRes plant/reservoir attributes (plant_id, capacity_MW, head_m, reservoir_area_km2, reservoir_volume_m3, turbine_type)
#  - FERC Form 1 O&M costs per plant and year (form1_cost_usd)
df = pd.read_csv("hydro_plant_om_costs.csv")

print(df.head())
df.info()  # prints its report directly and returns None, so no print() wrapper
print(df.describe())
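
The merge itself is not shown in this walkthrough. As a minimal sketch, assuming both sources share a plant_id key, it could look like the following. The two small tables and their values are made up for illustration; with several cost years per plant, validate="one_to_many" would be the appropriate check instead.

```python
import pandas as pd

# Hypothetical stand-ins for the two sources (values are illustrative only)
plants = pd.DataFrame({
    "plant_id": [1, 2, 3],
    "capacity_MW": [120.0, 45.0, 300.0],
    "head_m": [85.0, 22.0, 140.0],
})
costs = pd.DataFrame({
    "plant_id": [1, 2, 4],
    "form1_cost_usd": [2.4e6, 1.1e6, 5.0e6],
})

# Inner join keeps only plants present in both tables;
# validate raises an error if plant_id is duplicated on either side
merged = plants.merge(costs, on="plant_id", how="inner", validate="one_to_one")
print(merged)
```

Plants 3 and 4 drop out because each appears in only one table, which is worth checking against the row counts reported by .info().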

Data Preprocessing

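We drop records missing any key feature or cost, then normalize cost by installed capacity to derive cost_per_MW. Categorical turbine types are one-hot encoded, with the first level dropped so that it serves as the baseline category. Finally, we split the data into training and test sets (80/20) so that evaluation uses plants the model has not seen.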

# Block 2: Clean & encode
# Drop rows with missing critical fields
df = df.dropna(subset=[
    'capacity_MW','head_m','reservoir_area_km2','reservoir_volume_m3',
    'turbine_type','capacity_factor','unit_age_years','form1_cost_usd'
])

# Derive cost per MW for normalization
df['cost_per_MW'] = df['form1_cost_usd'] / df['capacity_MW']

# One‑hot encode turbine type (e.g., Francis, Kaplan, Pelton)
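# dtype=float keeps the dummies numeric; recent pandas versions default to bool,
# which statsmodels rejects when mixed with float columns
df_enc = pd.get_dummies(df, columns=['turbine_type'], drop_first=True, dtype=float)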

# Define predictors and target
X = df_enc[[
    'capacity_MW','head_m','reservoir_area_km2','reservoir_volume_m3',
    'capacity_factor','unit_age_years'
] + [col for col in df_enc.columns if col.startswith('turbine_type_')]]
y = df_enc['cost_per_MW']

# Split into training/test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
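
To make the encoding step concrete, here is a small sketch on a made-up turbine_type column (the four rows are illustrative, not taken from the dataset). It shows which columns drop_first=True leaves behind.

```python
import pandas as pd

# Toy column with three turbine types (illustrative values)
toy = pd.DataFrame({"turbine_type": ["Francis", "Kaplan", "Pelton", "Francis"]})

# drop_first=True removes the first level (alphabetically "Francis"),
# which then acts as the baseline; dtype=float keeps the dummies numeric
encoded = pd.get_dummies(toy, columns=["turbine_type"], drop_first=True, dtype=float)
print(encoded)
```

A Francis row is encoded as zeros in both remaining columns, so the Kaplan and Pelton coefficients in the regression are read as differences relative to Francis units.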

Stepwise Regression Function

Our stepwise_selection alternates between forward inclusion (adding excluded predictors with p < 0.01) and backward elimination (removing included predictors with p > 0.05) until convergence, yielding a parsimonious feature set.

# Block 3: Forward–backward stepwise selection
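def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    """Return the list of column names chosen by forward-backward selection."""
    # None instead of [] avoids sharing a mutable default between calls
    included = list(initial_list) if initial_list is not None else []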
    while True:
        changed = False
        # Forward step
        # Preserve column order so results are reproducible across runs
        excluded = [col for col in X.columns if col not in included]
        new_pvals = pd.Series(dtype=float, index=excluded)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:25} p-value {best_pval:.4f}")

        # Backward step
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals.max()
        if worst_pval > threshold_out:
            worst_var = pvals.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:25} p-value {worst_pval:.4f}")

        if not changed:
            break
    return included

Model Building & Evaluation

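We fit an OLS regression on the selected predictors using statsmodels. The .summary() output reports coefficients, p-values, R², and diagnostic statistics (AIC, F-statistic). Because the same training data were used to choose the predictors, these p-values tend to look more favorable than they would for a pre-specified model, so the held-out test metrics are the more reliable guide.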

We predict on the held‑out test set and compute R² (variance explained) and RMSE (root‑mean‑square error) to quantify predictive accuracy on unseen plants.

# Block 4: Select features
selected = stepwise_selection(X_train, y_train)

# Fit final OLS model
X_train_sel = sm.add_constant(X_train[selected])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

# Predict on test set
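# has_constant='add' forces the intercept column even if a test-set feature
# happens to be constant (e.g. a turbine dummy that is all zeros in this split)
X_test_sel = sm.add_constant(X_test[selected], has_constant='add')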
y_pred = model.predict(X_test_sel)

# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
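
As a worked example of what the two metrics measure, the sketch below computes them by hand with NumPy on four made-up values (not results from the hydropower data). R² is 1 minus the ratio of residual to total sum of squares, and RMSE is the square root of the mean squared residual.

```python
import numpy as np

# Illustrative actual and predicted values
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_hat = np.array([11.0, 11.0, 15.0, 15.0])

ss_res = np.sum((y_true - y_hat) ** 2)           # 1 + 1 + 1 + 1 = 4
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # mean is 13: 9 + 1 + 1 + 9 = 20

r2 = 1.0 - ss_res / ss_tot               # 1 - 4/20 = 0.8
rmse = np.sqrt(ss_res / len(y_true))     # sqrt(4/4) = 1.0
print(r2, rmse)
```

RMSE is in the same units as the target, so for this project it reads directly as a typical error in USD per MW.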

Residual Diagnostics

A residual vs. predicted plot checks for heteroscedasticity or patterns, validating linear model assumptions and identifying potential outliers.

# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Cost per MW")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost per MW")
plt.show()

Summary

By applying stepwise regression to a combined hydropower dataset, we identify candidate drivers of annual O&M cost per MW (for example capacity factor, head, unit age, and turbine type) while pruning less informative features. The resulting linear model is easy to interpret, and its test-set R² and RMSE show how well it generalizes to unseen plants, giving asset managers a transparent tool to benchmark costs and plan maintenance budgets.