
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

14. Machine Learning#

The basic issue in finance is that we want to know how expected returns move around, but we only observe realized returns

We can compile lots and lots of information/data about different assets

We saw how to run OLS regressions of returns on a large set of characteristics (about 30 of them)

But we didn't even consider interactions among them. For example, the value characteristic might carry different information about returns for small versus big stocks. With about 30 characteristics, considering all pairwise interactions would lead us to estimate roughly 30 × 30 = 900 coefficients. And of course there are potentially many more characteristics, and their lags, that could be informative about expected returns and co-movement

You can see that very quickly you run out of data

Here is where recent advances in machine learning can be super useful

At the end of the day we want to estimate a function F that maps observed characteristics into future returns

\[R_{t+1}=F(X_t)\]

This function can be linear

\[R_{t+1}=BX_t\]

or linear in the interactions

\[R_{t+1}=B X_t+C \, X_t \otimes X_t\]

Or it can have even higher-order or non-linear relationships; for example, instead of including the characteristic itself, we can include dummies according to the rank of the characteristic relative to other stocks in the cross-section

Here is where the tools of machine learning can be useful to us

We will now discuss a few of the most used methods

  • Lasso Regression (L1 regularization)

  • Random Forest Regression

  • Gradient Boosted Regression Trees (GBRT)

  • Elastic Net Regression (combination of L1 and L2 regularization)

  • Neural Network Regression (customizable number of layers)

We will apply these methods to our data set

We will have a training/estimation sample (1972-1991) and a tuning sample (1992-2001)

We will not use it today but I also reserved a test sample (2002-2016) for you to evaluate your favorite model.

url = "../../assets/data/characteristics_raw.csv"

df_X = pd.read_csv(url)
# Parse the date column (month/year format) and set a (date, permno) MultiIndex


df_X['date'] = pd.to_datetime(df_X['date'], format='%m/%Y')
df_X.set_index(['date','permno'],inplace=True)

df_X['1972':'1991'].to_pickle('../../assets/data/characteristics19721991.pkl')
df_X['1992':'2001'].to_pickle('../../assets/data/characteristics19922001.pkl')
df_X['2002':].to_pickle('../../assets/data/characteristics20022016.pkl')
url = "../../assets/data/characteristics19721991.pkl"

df_train = pd.read_pickle(url)
df_train=df_train.drop(columns=['rf','rme'])
display(df_train)
url = "../../assets/data/characteristics19922001.pkl"
df_tuning = pd.read_pickle(url)
df_tuning=df_tuning.drop(columns=['rf','rme'])
display(df_tuning)
[Output: df_train, 204284 rows × 30 columns (excess return re plus 29 characteristics such as size, value, prof, ..., shvol), indexed by (date, permno), July 1972 to January 1991]

[Output: df_tuning, 99615 rows × 30 columns with the same layout, indexed by (date, permno), January 1992 to January 2001]

14.1. 1. Lasso Regression#

Lasso (Least Absolute Shrinkage and Selection Operator) regression is a linear regression model with L1 regularization. It minimizes the following objective:

\[ \min_{\beta} \left( \frac{1}{2n} \sum_{i=1}^n (y_i - X_i^\top \beta)^2 + \alpha \|\beta\|_1 \right) \]
  • Key Characteristics:

    • Shrinks some coefficients to exactly zero, effectively performing feature selection.

    • Useful for sparse models where only a subset of predictors are important.

    • Struggles with multicollinearity, as it tends to arbitrarily select one among correlated predictors.

  1. Feature Extraction: X is the set of characteristics (every column after the excess return); Y is the excess return.

  2. Feature Standardization: Standardizes X cross-sectionally by date (subtract the mean, divide by the standard deviation), which is important for Lasso because it is sensitive to feature scaling.

  3. Lasso Regression: Fits a Lasso regression model with a specified regularization strength (alpha).

  4. Evaluation: Outputs the coefficients and the MSE, MAE, and \( R^2 \) score on the tuning sample.

You can adjust the alpha parameter in Lasso() to tune the regularization strength. A smaller value of alpha reduces regularization, while a larger value increases it.

Note that here we are implicitly using the tuning sample to pick the amount of regularization. So once we pick our favorite alpha, which looks to be around 0.002, we need to look at some other sample to see whether that choice actually worked

# We will start by standardizing our characteristics cross-sectionally (within each date):
# subtract the mean and divide by the standard deviation.


X_train = df_train.iloc[:, 1:]
X_train= X_train.groupby('date').apply(lambda x: (x - x.mean()) / x.std())

X_tuning = df_tuning.iloc[:, 1:]
X_tuning= X_tuning.groupby('date').apply(lambda x: (x - x.mean()) / x.std())
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split



from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Extract Y (excess return) and X (characteristics)

Y_train = df_train.iloc[:, 0]   # Excess return 

Y_tuning = df_tuning.iloc[:, 0]   # Excess return 


# Perform Lasso regression
lasso = Lasso(alpha=0.0025)  # You can adjust the alpha (regularization strength)
lasso.fit(X_train, Y_train)

# Coefficients and intercept
print("Lasso Coefficients:", lasso.coef_)
print("Intercept:", lasso.intercept_)



Y_pred = lasso.predict(X_tuning)

# Compute Mean Squared Error (MSE)
mse = mean_squared_error(Y_tuning, Y_pred)
print("Mean Squared Error (MSE):", mse)

mae= mean_absolute_error(Y_tuning, Y_pred)
print("Mean Absolute Error (MAE):", mae)

r2 = r2_score(Y_tuning, Y_pred)
print("R-squared (R2):", r2)
Lasso Coefficients: [-0.          0.          0.          0.         -0.          0.
 -0.         -0.          0.         -0.          0.         -0.
  0.         -0.          0.          0.          0.          0.
  0.00031046 -0.          0.         -0.         -0.         -0.
 -0.         -0.00223662 -0.          0.         -0.        ]
Intercept: 0.005779088395769684
Mean Squared Error (MSE): 0.01145434617567078
Mean Absolute Error (MAE): 0.07446115851398497
R-squared (R2): -0.00030052087524023996
# Define a range of alpha values
alphas = np.logspace(-5, -2, 50)  # range for alphas
coefficients = []
mses=[]
# Perform Lasso regression for each alpha
for alpha in alphas:
    lasso = Lasso(alpha=alpha, max_iter=10000)  # Ensure convergence with high iterations
    lasso.fit(X_train, Y_train)
    coefficients.append(lasso.coef_)
    Y_pred = lasso.predict(X_tuning)
    mse = mean_squared_error(Y_tuning, Y_pred)
    mses.append(mse)


# Convert coefficients to a NumPy array for plotting
coefficients = np.array(coefficients)

# Plot the coefficients as a function of alpha
plt.figure(figsize=(10, 6))
for i in range(coefficients.shape[1]):
    plt.plot(alphas, coefficients[:, i])

plt.xscale('log')
plt.xlabel('Alpha (log scale)')
plt.ylabel('Coefficients')
plt.title('Lasso Coefficients as a Function of Alpha')
plt.legend(df_train.iloc[:, 1:].columns, bbox_to_anchor=(1.05, 1), loc='upper left')

plt.grid(True)
plt.tight_layout()
plt.show()

plt.figure(figsize=(10, 6))
plt.plot(alphas, mses)
plt.xlabel('Alpha (log scale)')
plt.ylabel('Mean-squared error (tuning sample)')
plt.xscale('log')
plt.show()
[Figure: Lasso coefficient paths as a function of alpha; tuning-sample MSE as a function of alpha]

Note we are using Mean-squared error as way to evaluate our model

To compute the Mean Squared Error (MSE) of the fitted Lasso model, you can use the mean_squared_error function from sklearn.metrics, as in the code above:

Steps to Compute MSE:

  1. Make Predictions:

    • Use lasso.predict(X_tuning) to get predictions on the tuning sample.

  2. Compute MSE:

    • Compare the predicted values (Y_pred) with the actual values (Y_tuning) using mean_squared_error.

MSE Calculation:

  • The formula for MSE is: \[ \text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

  • mean_squared_error automates this calculation.

Scaling:

  • Make sure the tuning/test features are transformed with the same scaler (or standardization procedure) fitted on the training set, so that both samples are on a consistent scale; a short sketch using StandardScaler follows this list.

Output:

  • The MSE measures how well the model predicts excess returns on unseen data. Lower values indicate better performance.
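
The sketch below illustrates the scaling point, assuming a single StandardScaler fitted on the training sample and reused on the tuning sample (the notebook itself standardizes cross-sectionally by date instead, so this is an alternative pattern rather than the method used above):

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

# Fit the scaler on the training features only, then reuse it on the tuning features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_tuning_scaled = scaler.transform(X_tuning)

# Refit the Lasso on the rescaled training data and evaluate on the rescaled tuning data
lasso_demo = Lasso(alpha=0.002, max_iter=10000)
lasso_demo.fit(X_train_scaled, Y_train)
Y_pred_demo = lasso_demo.predict(X_tuning_scaled)
print("Tuning-sample MSE:", mean_squared_error(Y_tuning, Y_pred_demo))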

# Refit the Lasso model at our preferred alpha and record its tuning-sample MSE
lasso = Lasso(alpha=0.002, max_iter=10000)  # Ensure convergence with high iterations
lasso.fit(X_train, Y_train)
Y_pred = lasso.predict(X_tuning)
mean_squared_error(Y_tuning, Y_pred)
0.011444912722223555

14.1.1. Including Interactions#

One possibility here is that the information is in the interactions of the characteristics, i.e., we want to augment the model to

\[R_{i,t+1}=\sum_{j=1}^N b_{j} x_{i,j}+\sum_{l=1}^N\sum_{j=1}^N b_{l,j} x_{i,l} x_{i,j}\]
from sklearn.preprocessing import PolynomialFeatures

# Assuming X_train contains the characteristics
degree = 2  # Degree of interactions (2 means pairwise interactions)
poly = PolynomialFeatures(degree=degree, interaction_only=True, include_bias=False)

# Generate cross-product features
X_train_interactions = poly.fit_transform(X_train)
X_tuning_interactions = poly.transform(X_tuning)  # reuse the transformer fitted on the training data
# Feature names (optional: useful for understanding what each column represents)
feature_names = poly.get_feature_names_out(input_features=df_train.iloc[:, 1:].columns)

# Print the shape of the transformed dataset
print("Original X_train shape:", X_train.shape)
print("Transformed X_train shape:", X_train_interactions.shape)
print("Feature Names:", feature_names)
# Get the number of input features
input_dim = X_train_interactions.shape[1]
Original X_train shape: (204284, 29)
Transformed X_train shape: (204284, 435)
Feature Names: ['size' 'value' 'prof' 'fscore' 'debtiss' 'repurch' 'nissa' 'growth'
 'aturnover' 'gmargins' 'ep' 'sgrowth' 'lev' 'roaa' 'roea' 'sp' 'mom'
 'indmom' 'mom12' 'momrev' 'valuem' 'nissm' 'strev' 'ivol' 'betaarb'
 'indrrev' 'price' 'age' 'shvol' 'size value' 'size prof' 'size fscore'
 'size debtiss' 'size repurch' 'size nissa' 'size growth' 'size aturnover'
 'size gmargins' 'size ep' 'size sgrowth' 'size lev' 'size roaa'
 'size roea' 'size sp' 'size mom' 'size indmom' 'size mom12' 'size momrev'
 'size valuem' 'size nissm' 'size strev' 'size ivol' 'size betaarb'
 'size indrrev' 'size price' 'size age' 'size shvol' 'value prof'
 'value fscore' 'value debtiss' 'value repurch' 'value nissa'
 'value growth' 'value aturnover' 'value gmargins' 'value ep'
 'value sgrowth' 'value lev' 'value roaa' 'value roea' 'value sp'
 'value mom' 'value indmom' 'value mom12' 'value momrev' 'value valuem'
 'value nissm' 'value strev' 'value ivol' 'value betaarb' 'value indrrev'
 'value price' 'value age' 'value shvol' 'prof fscore' 'prof debtiss'
 'prof repurch' 'prof nissa' 'prof growth' 'prof aturnover'
 'prof gmargins' 'prof ep' 'prof sgrowth' 'prof lev' 'prof roaa'
 'prof roea' 'prof sp' 'prof mom' 'prof indmom' 'prof mom12' 'prof momrev'
 'prof valuem' 'prof nissm' 'prof strev' 'prof ivol' 'prof betaarb'
 'prof indrrev' 'prof price' 'prof age' 'prof shvol' 'fscore debtiss'
 'fscore repurch' 'fscore nissa' 'fscore growth' 'fscore aturnover'
 'fscore gmargins' 'fscore ep' 'fscore sgrowth' 'fscore lev' 'fscore roaa'
 'fscore roea' 'fscore sp' 'fscore mom' 'fscore indmom' 'fscore mom12'
 'fscore momrev' 'fscore valuem' 'fscore nissm' 'fscore strev'
 'fscore ivol' 'fscore betaarb' 'fscore indrrev' 'fscore price'
 'fscore age' 'fscore shvol' 'debtiss repurch' 'debtiss nissa'
 'debtiss growth' 'debtiss aturnover' 'debtiss gmargins' 'debtiss ep'
 'debtiss sgrowth' 'debtiss lev' 'debtiss roaa' 'debtiss roea'
 'debtiss sp' 'debtiss mom' 'debtiss indmom' 'debtiss mom12'
 'debtiss momrev' 'debtiss valuem' 'debtiss nissm' 'debtiss strev'
 'debtiss ivol' 'debtiss betaarb' 'debtiss indrrev' 'debtiss price'
 'debtiss age' 'debtiss shvol' 'repurch nissa' 'repurch growth'
 'repurch aturnover' 'repurch gmargins' 'repurch ep' 'repurch sgrowth'
 'repurch lev' 'repurch roaa' 'repurch roea' 'repurch sp' 'repurch mom'
 'repurch indmom' 'repurch mom12' 'repurch momrev' 'repurch valuem'
 'repurch nissm' 'repurch strev' 'repurch ivol' 'repurch betaarb'
 'repurch indrrev' 'repurch price' 'repurch age' 'repurch shvol'
 'nissa growth' 'nissa aturnover' 'nissa gmargins' 'nissa ep'
 'nissa sgrowth' 'nissa lev' 'nissa roaa' 'nissa roea' 'nissa sp'
 'nissa mom' 'nissa indmom' 'nissa mom12' 'nissa momrev' 'nissa valuem'
 'nissa nissm' 'nissa strev' 'nissa ivol' 'nissa betaarb' 'nissa indrrev'
 'nissa price' 'nissa age' 'nissa shvol' 'growth aturnover'
 'growth gmargins' 'growth ep' 'growth sgrowth' 'growth lev' 'growth roaa'
 'growth roea' 'growth sp' 'growth mom' 'growth indmom' 'growth mom12'
 'growth momrev' 'growth valuem' 'growth nissm' 'growth strev'
 'growth ivol' 'growth betaarb' 'growth indrrev' 'growth price'
 'growth age' 'growth shvol' 'aturnover gmargins' 'aturnover ep'
 'aturnover sgrowth' 'aturnover lev' 'aturnover roaa' 'aturnover roea'
 'aturnover sp' 'aturnover mom' 'aturnover indmom' 'aturnover mom12'
 'aturnover momrev' 'aturnover valuem' 'aturnover nissm' 'aturnover strev'
 'aturnover ivol' 'aturnover betaarb' 'aturnover indrrev'
 'aturnover price' 'aturnover age' 'aturnover shvol' 'gmargins ep'
 'gmargins sgrowth' 'gmargins lev' 'gmargins roaa' 'gmargins roea'
 'gmargins sp' 'gmargins mom' 'gmargins indmom' 'gmargins mom12'
 'gmargins momrev' 'gmargins valuem' 'gmargins nissm' 'gmargins strev'
 'gmargins ivol' 'gmargins betaarb' 'gmargins indrrev' 'gmargins price'
 'gmargins age' 'gmargins shvol' 'ep sgrowth' 'ep lev' 'ep roaa' 'ep roea'
 'ep sp' 'ep mom' 'ep indmom' 'ep mom12' 'ep momrev' 'ep valuem'
 'ep nissm' 'ep strev' 'ep ivol' 'ep betaarb' 'ep indrrev' 'ep price'
 'ep age' 'ep shvol' 'sgrowth lev' 'sgrowth roaa' 'sgrowth roea'
 'sgrowth sp' 'sgrowth mom' 'sgrowth indmom' 'sgrowth mom12'
 'sgrowth momrev' 'sgrowth valuem' 'sgrowth nissm' 'sgrowth strev'
 'sgrowth ivol' 'sgrowth betaarb' 'sgrowth indrrev' 'sgrowth price'
 'sgrowth age' 'sgrowth shvol' 'lev roaa' 'lev roea' 'lev sp' 'lev mom'
 'lev indmom' 'lev mom12' 'lev momrev' 'lev valuem' 'lev nissm'
 'lev strev' 'lev ivol' 'lev betaarb' 'lev indrrev' 'lev price' 'lev age'
 'lev shvol' 'roaa roea' 'roaa sp' 'roaa mom' 'roaa indmom' 'roaa mom12'
 'roaa momrev' 'roaa valuem' 'roaa nissm' 'roaa strev' 'roaa ivol'
 'roaa betaarb' 'roaa indrrev' 'roaa price' 'roaa age' 'roaa shvol'
 'roea sp' 'roea mom' 'roea indmom' 'roea mom12' 'roea momrev'
 'roea valuem' 'roea nissm' 'roea strev' 'roea ivol' 'roea betaarb'
 'roea indrrev' 'roea price' 'roea age' 'roea shvol' 'sp mom' 'sp indmom'
 'sp mom12' 'sp momrev' 'sp valuem' 'sp nissm' 'sp strev' 'sp ivol'
 'sp betaarb' 'sp indrrev' 'sp price' 'sp age' 'sp shvol' 'mom indmom'
 'mom mom12' 'mom momrev' 'mom valuem' 'mom nissm' 'mom strev' 'mom ivol'
 'mom betaarb' 'mom indrrev' 'mom price' 'mom age' 'mom shvol'
 'indmom mom12' 'indmom momrev' 'indmom valuem' 'indmom nissm'
 'indmom strev' 'indmom ivol' 'indmom betaarb' 'indmom indrrev'
 'indmom price' 'indmom age' 'indmom shvol' 'mom12 momrev' 'mom12 valuem'
 'mom12 nissm' 'mom12 strev' 'mom12 ivol' 'mom12 betaarb' 'mom12 indrrev'
 'mom12 price' 'mom12 age' 'mom12 shvol' 'momrev valuem' 'momrev nissm'
 'momrev strev' 'momrev ivol' 'momrev betaarb' 'momrev indrrev'
 'momrev price' 'momrev age' 'momrev shvol' 'valuem nissm' 'valuem strev'
 'valuem ivol' 'valuem betaarb' 'valuem indrrev' 'valuem price'
 'valuem age' 'valuem shvol' 'nissm strev' 'nissm ivol' 'nissm betaarb'
 'nissm indrrev' 'nissm price' 'nissm age' 'nissm shvol' 'strev ivol'
 'strev betaarb' 'strev indrrev' 'strev price' 'strev age' 'strev shvol'
 'ivol betaarb' 'ivol indrrev' 'ivol price' 'ivol age' 'ivol shvol'
 'betaarb indrrev' 'betaarb price' 'betaarb age' 'betaarb shvol'
 'indrrev price' 'indrrev age' 'indrrev shvol' 'price age' 'price shvol'
 'age shvol']
# Define a range of alpha values
alphas = [0.002,0.00225,0.0025,0.00275,0.003,0.0035]  # range for alphas
coefficients = []
mses=[]
# Perform Lasso regression for each alpha
for alpha in alphas:
    lasso = Lasso(alpha=alpha, max_iter=10000)  # Ensure convergence with high iterations
    lasso.fit(X_train_interactions, Y_train)
    coefficients.append(lasso.coef_)
    Y_pred = lasso.predict(X_tuning_interactions)
    mse = mean_squared_error(Y_tuning, Y_pred)
    mses.append(mse)


# Convert coefficients to a NumPy array for plotting
coefficients = np.array(coefficients)


alpha_index = alphas.index(0.00275)
surviving_features = np.where(coefficients[alpha_index, :] != 0)[0]

# Plot the coefficients as a function of alpha
plt.figure(figsize=(10, 6))
for i in range(coefficients.shape[1]):
    if i in surviving_features:
        # Plot surviving features with legend
        plt.plot(alphas, coefficients[:, i], label=feature_names[i])
    else:
        # Plot non-surviving features without legend
        plt.plot(alphas, coefficients[:, i], color='gray', alpha=0.5)

plt.xscale('log')
plt.xlabel('Alpha (log scale)')
plt.ylabel('Coefficients')
plt.title('Lasso Coefficients as a Function of Alpha')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', title="Surviving Features")
plt.grid(True)
plt.tight_layout()
plt.show()

# Plot Mean-squared error for tuning sample
plt.figure(figsize=(10, 6))
plt.plot(alphas, mses, marker='o')
plt.xlabel('Alpha (log scale)')
plt.ylabel('Mean-squared error (tuning sample)')
plt.xscale('log')
plt.title('Mean-squared Error as a Function of Alpha')
plt.grid(True)
plt.tight_layout()
plt.show()
[Figure: Lasso coefficient paths (with interaction terms) as a function of alpha; tuning-sample MSE as a function of alpha]

Note that there is virtually no improvement out of sample

14.1.2. Non-Parametric Models#

Instead of assuming that the relationship between the dependent variable \( y \) and the characteristic \( size \) is linear, we consider a more flexible model. In this approach, the relationship is linear in terms of the percentiles of the characteristic.

14.1.2.1. Original Linear Model#

In a linear regression, the model is typically expressed as:

\[ y_{i, t+1} = \beta \cdot size_{i, t} \]

where:

  • \( y_{i, t+1} \): The dependent variable (e.g., return of asset \( i \) at time \( t+1 \)).

  • \( size_{i, t} \): A characteristic of asset \( i \) at time \( t \) (e.g., market capitalization).

  • \( \beta \): A regression coefficient indicating the relationship between \( size_{i, t} \) and \( y_{i, t+1} \).

14.1.2.2. Non-Parametric Percentile-Based Model#

To introduce non-linearity, we instead model \( y_{i, t+1} \) as a function of the percentiles of \( size \) within each time period. The model becomes:

\[ y_{i, t+1} = \sum_p \beta_p \cdot 1_{\{size_{i, t} \in \text{Percentile}(p, \text{size}_t)\}} \]

where:

  • \( p \): The percentile group (e.g., \( p = 1 \) for the 0-20% percentile, \( p = 2 \) for the 20-40% percentile, etc.).

  • \( \beta_p \): The regression coefficient for percentile \( p \).

  • \( 1_{\{size_{i, t} \in \text{Percentile}(p, \text{size}_t)\}} \): An indicator function that equals 1 if \( size_{i, t} \) falls in the \( p \)-th percentile of the \( size \) distribution for time \( t \), and 0 otherwise.

14.1.2.3. Explanation#

  1. Intuition: Instead of assuming a linear relationship between \( y \) and \( size \), the model captures how \( y \) varies across different percentile ranges of \( size \).

  2. Flexibility: The model allows for different effects (\( \beta_p \)) for each percentile range, enabling it to capture non-linear relationships.

  3. Interpretation: For example, \( \beta_1 \) represents the average effect of assets in the lowest 20% of \( size \) on \( y \), while \( \beta_5 \) represents the effect for assets in the highest 20%.

This approach is particularly useful when the relationship between \( y \) and \( size \) is not well-approximated by a straight line but instead varies across different ranges of \( size \).

# Define the number of percentiles
num_percentiles = 5

# Initialize an empty list to store the new columns
new_columns = []
df=df_train.iloc[:, 1:].copy()
# Loop through each characteristic
for characteristic in df.columns:
    # Group by date and calculate percentiles
    grouped = df[characteristic].groupby(level='date')
    
    # Apply percentile binning for each date
    percentile_bins = grouped.apply(
        lambda x: pd.qcut(x, q=num_percentiles, labels=False, duplicates='drop')  # Bins from 0 to 4
    )
    
    # Create binary columns for each percentile
    for percentile in range(num_percentiles):
        col_name = f"{characteristic}_p{percentile+1}"
        df[col_name] = (percentile_bins == percentile).astype(int)
        df=df.copy()
        new_columns.append(col_name)

# Keep the new columns only for verification (if needed)
new_characteristics_df = df[new_columns].copy()

# Output the shape of the new DataFrame
print("Original DataFrame shape:", df_train.iloc[:, 1:].shape)
print("New DataFrame shape after adding percentiles:", new_characteristics_df.shape)
new_characteristics_df
Original DataFrame shape: (204284, 29)
New DataFrame shape after adding percentiles: (204284, 145)
[Output: new_characteristics_df, 204284 rows × 145 columns of 0/1 quintile indicators (size_p1 ... shvol_p5), indexed by (date, permno)]

# Define a range of alpha values
alphas = [0.0001,0.0005,0.001]  # range for alphas
coefficients = []
mses=[]
# Perform Lasso regression for each alpha
for alpha in alphas:
    lasso = Lasso(alpha=alpha, max_iter=10000)  # Ensure convergence with high iterations
    lasso.fit(new_characteristics_df.values, Y_train)
    coefficients.append(lasso.coef_)



# Convert coefficients to a NumPy array for plotting
coefficients = np.array(coefficients)


alpha_index = alphas.index(0.0005)
surviving_features = np.where(coefficients[alpha_index, :] != 0)[0]

# Plot the coefficients as a function of alpha
plt.figure(figsize=(10, 6))
for i in range(coefficients.shape[1]):
    if i in surviving_features:
        # Plot surviving features with legend
        plt.plot(alphas, coefficients[:, i], label=new_characteristics_df.columns[i])
    else:
        # Plot non-surviving features without legend
        plt.plot(alphas, coefficients[:, i], color='gray', alpha=0.5)

plt.xscale('log')
plt.xlabel('Alpha (log scale)')
plt.ylabel('Coefficients')
plt.title('Lasso Coefficients as a Function of Alpha')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', title="Surviving Features")
plt.grid(True)
plt.tight_layout()
plt.show()
[Figure: Lasso coefficient paths for the quintile-dummy features as a function of alpha]

14.1.3. 2. Random Forest Regression#


Random Forest is an ensemble method that combines multiple decision trees to make predictions. Each tree is trained on a bootstrap sample of the data, and predictions are averaged:

\[ \hat{y} = \frac{1}{T} \sum_{t=1}^T h_t(X) \]

Where \( h_t(X) \) is the prediction of the \( t \)-th tree.

  • Key Characteristics:

    • Reduces overfitting by averaging predictions across trees.

    • Handles non-linear relationships and interactions between features well.

    • Relatively robust to noisy data and outliers.

    • Does not extrapolate beyond the range of the training data.

  1. Random Forest Regressor:

    • A RandomForestRegressor is initialized with:

      • n_estimators=250: Builds 250 decision trees.

      • max_depth=10: Limits the depth of each tree (with max_depth=None, trees grow until all leaves are pure or contain fewer than min_samples_split samples).

      • random_state=42: Ensures reproducibility.

      • n_jobs=-1: Utilizes all available CPU cores for faster training.

  2. Feature Importances:

    • The relative importance of each feature is extracted using the feature_importances_ attribute and displayed in a sorted DataFrame.

  3. Adjusting Hyperparameters:

  • You can tune the following hyperparameters to optimize model performance; a small sweep over max_depth on the tuning sample is sketched after this list:

    • n_estimators: Increase or decrease the number of trees.

    • max_depth: Limit the depth of trees to prevent overfitting.

    • min_samples_split: Minimum number of samples required to split an internal node.

    • min_samples_leaf: Minimum number of samples required to be at a leaf node.
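
As a minimal sketch of such tuning (an illustration, not part of the original notebook): loop over a few candidate values of max_depth, keep the other settings fixed, and compare mean-squared errors on the tuning sample, mirroring how alpha was tuned for the Lasso above.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Try a few tree depths and record the tuning-sample MSE for each
for depth in [3, 5, 10, 15]:
    rf = RandomForestRegressor(n_estimators=100, max_depth=depth,
                               random_state=42, n_jobs=-1)
    rf.fit(X_train, Y_train)
    mse_depth = mean_squared_error(Y_tuning, rf.predict(X_tuning))
    print(f"max_depth={depth}: tuning MSE = {mse_depth:.6f}")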

from sklearn.ensemble import RandomForestRegressor




# Build the Random Forest Regressor
random_forest = RandomForestRegressor(
    n_estimators=250,  # Number of trees in the forest
    max_depth=10,    # Maximum depth of the trees
    random_state=42,   # Ensures reproducibility
    n_jobs=-1          # Use all available cores for training
)

# Train the Random Forest model
random_forest.fit(X_train, Y_train)

# Make predictions on the tuning sample
Y_pred = random_forest.predict(X_tuning)

# Evaluate the model on the tuning sample
mse = mean_squared_error(Y_tuning, Y_pred)
print("Mean Squared Error (MSE):", mse)

mae= mean_absolute_error(Y_tuning, Y_pred)
print("Mean Absolute Error (MAE):", mae)

r2 = r2_score(Y_tuning, Y_pred)
print("R-squared (R2):", r2)

# Optional: Feature importance
feature_importances = random_forest.feature_importances_
feature_names = df_train.iloc[:, 1:].columns
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

print("\nFeature Importances:")
print(importance_df)
Mean Squared Error (MSE): 0.01148951126275974
Mean Absolute Error (MAE): 0.07460830016978094
R-squared (R2): -0.003371464811474878

Feature Importances:
      Feature  Importance
27        age    0.186314
26      price    0.086422
25    indrrev    0.065550
16        mom    0.065456
22      strev    0.064356
19     momrev    0.053585
18      mom12    0.052670
10         ep    0.048531
28      shvol    0.045249
17     indmom    0.038484
24    betaarb    0.034132
23       ivol    0.030491
0        size    0.029805
20     valuem    0.028786
8   aturnover    0.017810
21      nissm    0.017426
11    sgrowth    0.016147
2        prof    0.014705
1       value    0.013960
15         sp    0.013442
6       nissa    0.012874
13       roaa    0.012841
7      growth    0.012358
9    gmargins    0.011966
14       roea    0.011521
12        lev    0.010726
3      fscore    0.002920
4     debtiss    0.000890
5     repurch    0.000584

14.1.4. 4. Gradient Boosted Regression Trees (GBRT)#

GBRT is an ensemble technique that builds trees sequentially, where each tree corrects the errors of the previous one. The prediction is updated iteratively:

\[ \hat{y}_t(X) = \hat{y}_{t-1}(X) + \nu \cdot g_t(X) \]

Where:

  • \( g_t(X) \): Gradient of the loss function with respect to predictions.

  • \( \nu\): Learning rate, controlling the contribution of each tree.

  • Key Characteristics:

    • Optimizes a differentiable loss function (e.g., squared error for regression).

    • Can capture complex, non-linear patterns in the data.

    • Requires careful tuning of hyperparameters (e.g., learning rate, number of trees, maximum tree depth).

  1. Gradient Boosted Regression Trees:

    • A GradientBoostingRegressor is initialized with:

      • n_estimators=300: Builds 300 trees (boosting stages).

      • learning_rate=0.2: Controls the contribution of each tree to the final prediction.

      • max_depth=5: Limits the depth of individual trees to prevent overfitting.

      • random_state=42: Ensures reproducibility.

  2. Adjusting Hyperparameters:

  • You can tune the following hyperparameters to optimize the model:

    • n_estimators: Increase for more stages of boosting.

    • learning_rate: Decrease for smaller incremental updates (often requires increasing n_estimators).

    • max_depth: Control tree depth to balance bias and variance.

    • subsample: Use a fraction of samples for each stage (e.g., subsample=0.8 for 80% of the data).

from sklearn.ensemble import GradientBoostingRegressor




# Build the Gradient Boosting Regressor
gbrt = GradientBoostingRegressor(
    n_estimators=300,       # Number of boosting stages to perform
    learning_rate=0.2,      # Shrinks the contribution of each tree
    max_depth=5,            # Maximum depth of each tree
    random_state=42         # Ensures reproducibility
)

# Train the Gradient Boosting model
gbrt.fit(X_train, Y_train)

# Make predictions on the tuning sample
Y_pred = gbrt.predict(X_tuning)

# Evaluate the model on the tuning sample
mse = mean_squared_error(Y_tuning, Y_pred)
print("Mean Squared Error (MSE):", mse)

mae= mean_absolute_error(Y_tuning, Y_pred)
print("Mean Absolute Error (MAE):", mae)

r2 = r2_score(Y_tuning, Y_pred)
print("R-squared (R2):", r2)

# Optional: Feature importance
feature_importances = gbrt.feature_importances_
feature_names = df_train.iloc[:, 1:].columns
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

print("\nFeature Importances:")
print(importance_df)
Mean Squared Error (MSE): 0.012230932833230374
Mean Absolute Error (MAE): 0.07760391337870105
R-squared (R2): -0.06811932311395719

Feature Importances:
      Feature  Importance
27        age    0.294144
16        mom    0.057104
22      strev    0.055204
18      mom12    0.052066
19     momrev    0.050959
28      shvol    0.045834
25    indrrev    0.045597
26      price    0.040428
17     indmom    0.040013
24    betaarb    0.035627
10         ep    0.029968
20     valuem    0.027339
23       ivol    0.027101
0        size    0.024653
21      nissm    0.018743
8   aturnover    0.017848
15         sp    0.016247
11    sgrowth    0.016003
6       nissa    0.015233
7      growth    0.014040
12        lev    0.013535
1       value    0.013454
9    gmargins    0.013224
2        prof    0.013146
14       roea    0.010066
13       roaa    0.008488
3      fscore    0.002793
5     repurch    0.000652
4     debtiss    0.000489

14.1.5. 5. Elastic Net Regression#

Elastic Net combines L1 (Lasso) and L2 (Ridge) regularization to balance feature selection and multicollinearity handling. The objective function is:

\[ \min_{\beta} \left( \frac{1}{2n} \sum_{i=1}^n (y_i - X_i^\top \beta)^2 + \alpha_1 \|\beta\|_1 + \alpha_2 \|\beta\|_2^2 \right) \]

Where:

  • \( \|\beta\|_1 \): Lasso penalty encourages sparsity.

  • \( \|\beta\|_2^2 \): Ridge penalty shrinks coefficients to reduce multicollinearity.

  • Key Characteristics:

    • Balances Lasso’s feature selection and Ridge’s stability with correlated predictors.

    • Controlled by two hyperparameters:

      • \( \alpha \): Overall regularization strength.

      • \( \rho \) (mixing ratio): Balance between L1 and L2 penalties (the mapping to sklearn's alpha and l1_ratio is given right after this list).

  1. Elastic Net Regressor:

    • The ElasticNet regressor is initialized with:

      • alpha=0.005: Controls the overall strength of regularization.

      • l1_ratio=0.5: Specifies the mix of L1 (Lasso) and L2 (Ridge) penalties:

        • l1_ratio=0 : Equivalent to Ridge regression.

        • l1_ratio=1 : Equivalent to Lasso regression.

        • l1_ratio=0.5 : Balances L1 and L2 penalties.

      • random_state=42: Ensures reproducibility.

  2. Adjusting Hyperparameters:

  • alpha:

    • Larger values apply stronger regularization, reducing overfitting but increasing bias.

  • l1_ratio:

    • Adjust to control the balance between L1 and L2 penalties:

      • Increase towards 1 for more sparsity (feature selection).

      • Decrease towards 0 to favor Ridge-like behavior (handles multicollinearity).
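
For reference, and assuming sklearn's documented parameterization of ElasticNet (a single strength alpha and a mixing ratio l1_ratio), the two penalties in the objective above map to the code's parameters roughly as:

\[ \alpha_1 = \alpha \, \rho, \qquad \alpha_2 = \frac{\alpha \, (1-\rho)}{2}, \]

where \( \alpha \) corresponds to alpha and \( \rho \) to l1_ratio in the code below.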

from sklearn.linear_model import ElasticNet

# Initialize and train the Elastic Net regressor
elastic_net = ElasticNet(
    alpha=0.005,       # Regularization strength (higher values = stronger penalty)
    l1_ratio=0.5,    # Balance between L1 (Lasso) and L2 (Ridge) regularization
    random_state=42  # Ensures reproducibility
)

# Train the Elastic Net model
elastic_net.fit(X_train, Y_train)

# Make predictions on the tuning sample
Y_pred = elastic_net.predict(X_tuning)

# Evaluate the model on the tuning sample
mse = mean_squared_error(Y_tuning, Y_pred)
print("Mean Squared Error (MSE):", mse)

mae= mean_absolute_error(Y_tuning, Y_pred)
print("Mean Absolute Error (MAE):", mae)

r2 = r2_score(Y_tuning, Y_pred)
print("R-squared (R2):", r2)

# Print the coefficients
coefficients = pd.DataFrame({
    'Feature': df_train.iloc[:, 1:].columns,
    'Coefficient': elastic_net.coef_
}).sort_values(by='Coefficient', ascending=False)

print("\nCoefficients:")
print(coefficients)
Mean Squared Error (MSE): 0.01144784660786539
Mean Absolute Error (MAE): 0.07444204750342037
R-squared (R2): 0.0002670821080715813

Coefficients:
      Feature  Coefficient
20     valuem     0.002396
17     indmom     0.000461
15         sp     0.000052
0        size    -0.000000
1       value     0.000000
27        age     0.000000
26      price    -0.000000
24    betaarb    -0.000000
23       ivol     0.000000
22      strev     0.000000
21      nissm    -0.000000
18      mom12     0.000000
16        mom    -0.000000
14       roea     0.000000
13       roaa     0.000000
12        lev     0.000000
11    sgrowth     0.000000
10         ep     0.000000
9    gmargins    -0.000000
8   aturnover     0.000000
7      growth    -0.000000
6       nissa    -0.000000
5     repurch     0.000000
4     debtiss    -0.000000
3      fscore     0.000000
2        prof     0.000000
28      shvol    -0.000000
19     momrev    -0.001270
25    indrrev    -0.001636

14.1.6. 6. Neural Network Regression#


A neural network is a flexible, non-linear model that uses layers of neurons to approximate complex relationships between inputs \( X \) and outputs \( Y \). The simplest form of a feedforward neural network can be expressed as:

\[ \hat{y} = f(W^{[L]} \sigma(W^{[L-1]} \dots \sigma(W^{[1]} X + b^{[1]}) + b^{[L-1]}) + b^{[L]}) \]
  • Key Characteristics:

    • Consists of an input layer, hidden layers, and an output layer.

    • Activation functions (\(\sigma\), e.g., ReLU or sigmoid) introduce non-linearity (a quick numerical illustration follows this list).

      • \(\mathrm{ReLU}(x)=\max(0,x)\)

      • \(\mathrm{Sigmoid}(x)=1/(1+e^{-x})\)

    • The number of layers and neurons can be tuned to fit data complexity.

    • Requires careful tuning of hyperparameters (e.g., learning rate, number of layers, epochs).
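
As a quick numerical illustration of the two activation functions above (plain numpy, not part of the original notebook):

import numpy as np

def relu(x):
    # ReLU: zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, x)

def sigmoid(x):
    # Sigmoid: squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("ReLU:   ", relu(x))
print("Sigmoid:", sigmoid(x))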

  1. Building the Neural Network:

    • Function build_and_train_model:

      • Parameters:

        • num_layers: Number of layers in the neural network (free parameter).

        • input_dim: Number of input features (dimensions).

      • Model Architecture:

        • Input Layer:

          • Uses a Dense layer with neurons units (16 in the example below) and relu activation.

          • input_dim specifies the number of input features.

        • Hidden Layers:

          • Adds additional hidden layers based on num_layers.

          • Each additional hidden layer halves the number of neurons (with a floor of 1) and uses relu activation.

        • Output Layer:

          • A single neuron without activation (linear activation) for regression output.

      • Compilation:

        • Uses the adam optimizer and mean_squared_error loss function suitable for regression tasks.

      • Training: In addition to the network structure (layers and neurons), you also have to pick parameters that control the training process

        Epochs: Controls the total number of training cycles. More epochs mean more opportunities for the model to learn, but excessive epochs can lead to overfitting.

        Batch Size: Controls how the dataset is split into smaller subsets for gradient updates. Balances memory usage and convergence speed.

        Validation Data: A held-out sample (here the tuning sample, passed as validation_data) used to monitor generalization during training and guide callbacks like early stopping.

        By carefully tuning these parameters, you can balance training efficiency, model generalization, and computational resources.

Adjusting the Number of Layers:

  • Free Parameter:

    • You can adjust the num_layers variable to change the depth of the neural network.

    • For example, setting num_layers = 5 will create a network with one input layer and four hidden layers.

Additional Considerations:

  • Hyperparameter Tuning:

    • You might want to experiment with different numbers of neurons, activation functions, epochs, and batch sizes to improve model performance.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
def build_and_train_model(num_layers, input_dim, X_train, Y_train,neurons,validation_data,epochs):
    """
    Builds and trains a neural network model.

    Parameters:
    - num_layers: int, number of layers in the neural network
    - input_dim: int, number of input features
    - X_train: training features
    - Y_train: training target
    - neurons: int, number of neurons in the first layer (halved in each subsequent hidden layer)
    - validation_data: tuple (X_val, Y_val) used to monitor validation loss during training
    - epochs: int, number of training epochs

    Returns:
    - model: Trained Keras model
    """
    model = Sequential()
    # Add the input layer

    model.add(Dense(neurons, activation='relu', input_dim=input_dim))

    # Add hidden layers, halving the number of neurons in each successive layer
    for _ in range(num_layers - 1):
        neurons = max(1, int(neurons * 0.5))
        model.add(Dense(neurons, activation='relu'))

    # Add the output layer
    model.add(Dense(1))  # Single neuron for regression output

    # Compile the model
    model.compile(optimizer='adam', loss='mean_squared_error')

    # Train the model
    model.fit(
        X_train, Y_train,
        epochs=epochs,                # You can adjust the number of epochs
        batch_size=128,            # You can adjust the batch size
        validation_data=validation_data,    
    )
    return model

# Specify the number of layers (free parameter)
num_layers = 2  # Adjust this number as needed


input_dim = X_train.shape[1]

# Build and train the model (2 layers, 16 neurons in the first layer, 10 epochs)
model = build_and_train_model(num_layers, input_dim, X_train, Y_train.values, 16, (X_tuning, Y_tuning.values), epochs=10)
Epoch 1/10
1596/1596 ━━━━━━━━━━━━━━━━━━━━ 3s 1ms/step - loss: 0.0362 - val_loss: 0.0121
Epoch 2/10
1596/1596 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - loss: 0.0098 - val_loss: 0.0117
Epoch 3/10
1596/1596 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - loss: 0.0096 - val_loss: 0.0115
Epoch 4/10
1596/1596 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - loss: 0.0096 - val_loss: 0.0118
Epoch 5/10
1596/1596 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - loss: 0.0095 - val_loss: 0.0116
Epoch 6/10
1596/1596 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - loss: 0.0095 - val_loss: 0.0115
Epoch 7/10
1596/1596 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - loss: 0.0093 - val_loss: 0.0116
Epoch 8/10
1596/1596 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - loss: 0.0094 - val_loss: 0.0116
Epoch 9/10
1596/1596 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - loss: 0.0094 - val_loss: 0.0116
Epoch 10/10
1596/1596 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - loss: 0.0093 - val_loss: 0.0116
from tensorflow.keras.utils import plot_model

# Visualize the model architecture
plot_model(model, to_file='model.png', show_shapes=True, show_layer_names=True)
You must install pydot (`pip install pydot`) for `plot_model` to work.

14.1.7. The Whole Shebang#

You can potentially combine the Non-Parametric Percentile-Based Model with the interactions.

The key issue is that as you make the model richer and richer, the scope for the training procedure to produce garbage increases

You need to do much more validation/model regularization

Examples include the following (a Keras sketch combining several of them appears after the list):

  • L1 regularization: adds a penalty to the loss function based on the absolute values of the weights, encouraging sparsity by driving some weights to zero. This prevents overfitting and simplifies the model, making it useful with high-dimensional data or for feature selection.

  • Early stopping: Early stopping monitors validation loss during training and halts when the loss stops improving, preventing overfitting. It ensures the model generalizes well and avoids unnecessary training.

  • Batch normalization: Batch normalization normalizes layer inputs to have a mean of 0 and variance of 1, speeding up training and reducing sensitivity to initialization. It also acts as a regularizer by introducing noise during training.

  • Ensembles: Ensembles combine predictions from multiple models, reducing variance and improving accuracy. They are effective for generalization but increase computational cost.
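
The sketch below combines three of these devices in a small Keras network: an L1 penalty on the weights, batch normalization, and early stopping on the tuning (validation) sample. It is an illustration under the same setup as the neural-network section above, not the notebook's tuned model; the layer sizes and penalty strength are arbitrary choices.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization
from tensorflow.keras.regularizers import l1
from tensorflow.keras.callbacks import EarlyStopping

# A small network with an L1 weight penalty and batch normalization
reg_model = Sequential([
    Dense(16, activation='relu', input_dim=X_train.shape[1], kernel_regularizer=l1(1e-4)),
    BatchNormalization(),
    Dense(8, activation='relu', kernel_regularizer=l1(1e-4)),
    Dense(1)
])
reg_model.compile(optimizer='adam', loss='mean_squared_error')

# Stop training once the validation loss stops improving, and keep the best weights
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

reg_model.fit(
    X_train, Y_train.values,
    validation_data=(X_tuning, Y_tuning.values),
    epochs=50,
    batch_size=128,
    callbacks=[early_stop]
)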


14.2. Wrap up#

Comparison table across methods:

| Model | Type | Key Strengths | Limitations |
|---|---|---|---|
| Lasso Regression | Linear | Feature selection, interpretable coefficients | Struggles with multicollinearity |
| Neural Network Regression | Non-linear | Flexible, captures complex patterns | Requires significant tuning and data |
| Random Forest Regression | Non-linear | Robust to overfitting, handles feature interactions well | Computationally expensive for large data |
| GBRT | Non-linear | Accurate, optimizes for specific loss functions | Sensitive to hyperparameters, overfitting |
| Elastic Net Regression | Linear | Handles multicollinearity, balances selection & stability | Can be slower than Ridge or Lasso |


For an academic investigation of these methods see "Empirical Asset Pricing via Machine Learning" (https://academic.oup.com/rfs/article/33/5/2223/5758276?login=true). The figures used in this notebook are also from that paper.