
Experience the fusion of finance and artificial intelligence with this cutting-edge project that implements a deep reinforcement learning agent for automated stock trading. Leveraging advanced neural network architectures and a meticulously crafted reward function, the agent learns to make profitable trading decisions in a simulated environment.


AlphaQuantTrader

Welcome to the AlphaQuantTrader project! This powerful trading bot is designed to revolutionize the way you approach algorithmic trading by leveraging advanced machine learning techniques and Bayesian statistics.

1) What Exactly is AlphaQuantTrader?

AlphaQuantTrader is a sophisticated trading bot that utilizes reinforcement learning to automate and optimize stock trading decisions. It operates within a custom-built trading environment that simulates real-world market conditions, allowing the bot to make informed buy, hold, or sell decisions based on historical financial data. The bot is powered by a deep Q-network (DQN) model, which learns to maximize portfolio returns by continuously adapting to changing market conditions.

Additionally, AlphaQuantTrader incorporates Bayesian statistical methods to dynamically adjust risk and enhance decision-making, ensuring robust performance even in volatile markets.

2) Key Features

2.1) Custom Trading Environment: AlphaQuantTrader is built on a custom Gym environment that accurately simulates trading activities, including transaction costs and portfolio management, offering a realistic trading experience.

2.2) Deep Q-Network (DQN): At its core, the bot uses a DQN model to predict and execute the most profitable trading actions, learning from past market data to improve its strategies over time.

2.3) Bayesian Risk Management: The bot integrates Bayesian updating techniques to adjust its risk management strategies dynamically, taking into account market volatility and uncertainty.

2.4) Historical Data Processing: AlphaQuantTrader preprocesses historical market data, utilizing various technical indicators, statistical measures, and volatility analysis to inform its trading decisions.

2.5) Portfolio Optimization: Through reinforcement learning, the bot continuously seeks to optimize its portfolio by balancing risk and reward, aiming to maximize long-term gains.

3) Why Use AlphaQuantTrader?

AlphaQuantTrader is ideal for traders and developers looking to automate and enhance their trading strategies with cutting-edge AI. Whether you're aiming to reduce the manual effort in trading or seeking a robust system that adapts to market changes, AlphaQuantTrader provides the tools and intelligence to help you stay ahead in the financial markets. With its combination of deep learning and Bayesian techniques, AlphaQuantTrader offers a strategic edge that goes beyond traditional trading algorithms.

4) Model Overview

Let's start by importing the necessary libraries:

import random
from collections import namedtuple

import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
from scipy.stats import norm, skew, kurtosis
from sklearn.preprocessing import StandardScaler

import gym
from gym import spaces

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, Input, Conv1D, ReLU, Bidirectional, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2

5) Preprocessing

The preprocessing phase is crucial for preparing raw financial data into a form suitable for training the reinforcement learning model. This step involves acquiring historical market data, cleaning it, and engineering the necessary features to ensure that the model receives meaningful input for effective learning.

5.1) Data Acquisition

Historical price data for the NIFTY 50 index (ticker ^NSEI) of the National Stock Exchange of India is downloaded using the yfinance library. The dataset spans roughly 11.5 years, from January 21, 2013, to July 31, 2024, and is normalized before serving as the foundation for training, validating, and testing the reinforcement learning model.

data = yf.download('^NSEI', start='2013-01-21', end='2024-07-31', interval='1d')

Here's a glimpse of the data we're working with. The first 10 rows of the data are as follows:

| RSI        | MACD_Histogram | VWAP        | Bollinger_Band_Width | 50_day_MA   | 20_day_MA   | 9_day_MA    | Skewness   | Kurtosis   | dynamic_rolling_variances | CDF        | Nor_Adj_Close |
|------------|----------------|-------------|----------------------|-------------|-------------|-------------|------------|------------|---------------------------|------------|---------------|
| 31.955411  | -0.060194      | -1.462427   | -0.492711             | -1.429522   | -1.458934   | -1.492421   | 0.288366   | -0.856547  | -0.389327                 | 0.589216   | -1.470435     |
| 29.391651  | -0.054519      | -1.464739   | -0.483284             | -1.432472   | -1.459385   | -1.495261   | 0.236524   | -0.978247  | 0.158286                  | 0.379025   | -1.496615     |
| 27.721806  | -0.203237      | -1.468015   | -0.409436             | -1.435887   | -1.463075   | -1.499912   | 0.094578   | -0.947947  | 0.635391                  | 0.337802   | -1.530784     |
| 19.556652  | -0.309120      | -1.471731   | -0.339685             | -1.439498   | -1.467749   | -1.504013   | 0.302799   | -0.825508  | 0.386888                  | 0.476451   | -1.538269     |
| 20.469411  | -0.362698      | -1.474321   | -0.293727             | -1.442931   | -1.473391   | -1.508227   | 0.455540   | -0.599875  | 0.116669                  | 0.498186   | -1.541855     |
| 20.091448  | -0.444192      | -1.478158   | -0.283511             | -1.447108   | -1.481328   | -1.513620   | 0.325813   | -0.646141  | -0.025186                 | 0.423713   | -1.558513     |
| 32.756245  | -0.346920      | -1.481539   | -0.338587             | -1.450828   | -1.488085   | -1.516843   | 0.478166   | -0.774842  | 0.514644                  | 0.647224   | -1.536372     |
| 40.468962  | -0.190394      | -1.484694   | -0.411093             | -1.454113   | -1.493723   | -1.520285   | 0.255360   | -0.963981  | 0.432596                  | 0.590046   | -1.524083     |
| 38.301869  | -0.177125      | -1.487965   | -0.431047             | -1.457913   | -1.499406   | -1.527121   | 0.252123   | -0.983150  | 0.548091                  | 0.382168   | -1.546868     |
| 42.957792  | -0.073499      | -1.491163   | -0.541240             | -1.461274   | -1.505404   | -1.534106   | 0.188423   | -0.977782  | 0.497006                  | 0.603417   | -1.532995     |

And the last 10 rows of the data are as follows:

| RSI        | MACD_Histogram | VWAP        | Bollinger_Band_Width | 50_day_MA   | 20_day_MA   | 9_day_MA    | Skewness   | Kurtosis   | dynamic_rolling_variances | CDF        | Nor_Adj_Close |
|------------|----------------|-------------|----------------------|-------------|-------------|-------------|------------|------------|---------------------------|------------|---------------|
| 42.527646  | -0.896734      | 1.950455    | 0.636007              | 2.833609    | 2.575108    | 2.554237    | -0.409604  | -0.511399  | 0.463655                  | 2.13E-124  | 2.312439      |
| 41.891021  | -1.123832      | 1.951993    | 0.729387              | 2.826262    | 2.562206    | 2.527068    | -0.420964  | -0.578321  | 1.003222                  | 1          | 2.366975      |
| 40.672664  | -0.853302      | 1.953345    | 0.728086              | 2.819489    | 2.554111    | 2.505245    | -0.593637  | -0.640503  | 1.253434                  | 1          | 2.431240      |
| 47.306983  | -0.398670      | 1.954657    | 0.615841              | 2.813195    | 2.545945    | 2.488191    | -0.593378  | -0.639957  | 0.982796                  | 1          | 2.472024      |
| 52.190567  | -0.170272      | 1.955863    | 0.621062              | 2.806073    | 2.545545    | 2.474022    | -0.227646  | -0.843989  | 0.754123                  | 6.42E-07   | 2.448055      |
| 47.627936  | 0.164088       | 1.956818    | 0.616688              | 2.798321    | 2.546114    | 2.464745    | -0.285838  | -0.861025  | 0.514884                  | 0.999998   | 2.476777      |
| 43.283158  | 0.657489       | 1.958015    | 0.596113              | 2.790355    | 2.550518    | 2.465206    | -0.505755  | -0.850246  | 0.391843                  | 1          | 2.527952      |
| 41.250978  | 0.937666       | 1.959117    | 0.598335              | 2.781250    | 2.551341    | 2.463853    | -0.351689  | -0.790676  | -0.027929                 | 0.037553   | 2.521112      |
| 41.153581  | 1.084516       | 1.961293    | 0.548161              | 2.772495    | 2.547859    | 2.472356    | -0.266748  | -0.648239  | -0.946615                 | 0.122264   | 2.517752      |
| 49.592960  | 1.411373       | 1.962448    | 0.580896              | 2.765918    | 2.550630    | 2.501117    | -0.593796  | -0.454717  | -0.906079                 | 1          | 2.570008      |

Define the dates used to split the data into training, validation, and test sets:

# Define the split dates
train_end_date = '2021-12-31'
val_end_date = '2022-12-31'
test_start_date = '2023-01-01'

5.2) Data Preprocessing Functions

# Function to calculate daily log returns
def calculate_log_returns(data):
    # Log return formula applied to the 'Adj Close' price
    log_returns = np.log(data['Adj Close'] / data['Adj Close'].shift(1))
    return log_returns

# Function to calculate RSI (Relative Strength Index)
def calculate_rsi(data, window=14):
    delta = data['Adj Close'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=window).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=window).mean()
    rs = gain / loss
    rsi = 100 - (100 / (1 + rs))
    return rsi

# Function to calculate MACD Histogram
def calculate_macd_histogram(data, short_window=12, long_window=26, signal_window=9):
    short_ema = data['Close'].ewm(span=short_window, adjust=False).mean()
    long_ema = data['Close'].ewm(span=long_window, adjust=False).mean()
    macd_line = short_ema - long_ema
    signal_line = macd_line.ewm(span=signal_window, adjust=False).mean()
    macd_histogram = macd_line - signal_line
    return macd_histogram

# Function to calculate VWAP
def calculate_vwap(data):
    typical_price = (data['High'] + data['Low'] + data['Close']) / 3
    vwap = (typical_price * data['Volume']).cumsum() / data['Volume'].cumsum()
    return vwap

# Function to calculate Bollinger Band Width
def calculate_bollinger_band_width(data, window=20, num_std_dev=2):
    middle_band = data['Adj Close'].rolling(window=window).mean()
    std_dev = data['Adj Close'].rolling(window=window).std()
    upper_band = middle_band + (num_std_dev * std_dev)
    lower_band = middle_band - (num_std_dev * std_dev)
    bollinger_band_width = upper_band - lower_band
    return bollinger_band_width

# Calculate features on the entire dataset
data['Log Returns'] = calculate_log_returns(data)
data['RSI'] = calculate_rsi(data)
data['MACD_Histogram'] = calculate_macd_histogram(data)
data['VWAP'] = calculate_vwap(data)
data['Bollinger_Band_Width'] = calculate_bollinger_band_width(data)
data['50_day_MA'] = data['Close'].rolling(window=50).mean()
data['20_day_MA'] = data['Close'].rolling(window=20).mean()
data['9_day_MA'] = data['Close'].rolling(window=9).mean()
data['Skewness'] = data['Log Returns'].rolling(window=20).apply(lambda x: skew(x, bias=False))
data['Kurtosis'] = data['Log Returns'].rolling(window=20).apply(lambda x: kurtosis(x, bias=False))

calculate_log_returns(data): Computes the log returns of the stock based on the adjusted closing prices, which are essential for understanding the day-to-day movement of the stock.

calculate_rsi(data, window=14): Calculates the Relative Strength Index (RSI), a momentum oscillator used to evaluate overbought or oversold conditions in the stock market. This is based on average gains and losses over a rolling window of 14 periods by default.

calculate_macd_histogram(data, short_window=12, long_window=26, signal_window=9): Computes the Moving Average Convergence Divergence (MACD) histogram, which indicates the strength and direction of a stock’s momentum. The MACD is calculated using the difference between short-term and long-term exponential moving averages (EMAs).

calculate_vwap(data): Calculates the Volume Weighted Average Price (VWAP), a trading benchmark that gives the average price a security has traded throughout the day, based on both volume and price.

calculate_bollinger_band_width(data, window=20, num_std_dev=2): Computes the Bollinger Band Width, which measures the volatility of a stock by calculating the spread between the upper and lower Bollinger Bands, based on the moving average and standard deviation of adjusted closing prices over a rolling window.

5.3) Skewness and Kurtosis

data['Skewness']: Computes the rolling skewness of log returns over a 20-day window. Skewness measures the asymmetry of the return distribution, providing insights into whether returns are skewed more to the left (negative skew) or right (positive skew).

data['Kurtosis']: Computes the rolling kurtosis of log returns over a 20-day window. Kurtosis measures the "tailedness" of the return distribution, indicating whether returns have more or fewer extreme values (fat tails or thin tails) compared to a normal distribution.

5.4) Volatility and Dynamic Window Size Calculation

In this phase of the project, the aim is to calculate market volatility and adjust the analysis window size dynamically based on that volatility. Measuring how much prices fluctuate over time and adapting the window size accordingly allows the analysis to capture changing market dynamics more accurately. Below is the code that handles these calculations:

# Function to calculate fixed window rolling volatility (standard deviation)
def calculate_fixed_window_volatility(data, window_size=20):
    return data.rolling(window=window_size).std()

# Function to determine dynamic window size based on volatility
def determine_dynamic_window_size(volatility, min_window=5, max_window=20, epsilon=1e-8):
    # Add epsilon to volatility to avoid division by zero or near-zero values
    inverse_volatility = 1 / (volatility + epsilon)
    
    # Normalize the inverse volatility to scale between min and max window sizes
    normalized_window_size = (inverse_volatility - inverse_volatility.min()) / (inverse_volatility.max() - inverse_volatility.min())
    
    # Scale the normalized window size to the specified window range
    dynamic_window_size = normalized_window_size * (max_window - min_window) + min_window
    
    # Fill any NaN values with the minimum window size and convert to integers
    return dynamic_window_size.fillna(min_window).astype(int)

# Calculate volatility and dynamic window sizes
data.loc[:, 'volatility'] = calculate_fixed_window_volatility(data['Log Returns'])
data.loc[:, 'dynamic_window_sizes'] = determine_dynamic_window_size(data['volatility'])

This code block begins by calculating the rolling volatility of log returns using the calculate_fixed_window_volatility function. The function takes the log returns and computes how volatile the market has been over the last 20 days, storing the results in the volatility column.

Next, the determine_dynamic_window_size function adjusts the window size dynamically based on the calculated volatility. This adjustment ensures that during periods of high volatility, the analysis focuses on more recent data by using a smaller window size. The dynamically adjusted window sizes are then stored in the dynamic_window_sizes column.

The calculated dynamic_window_size will later be used for more accurate volatility calculation in the subsequent steps.

5.5) Dynamic Rolling Variance Calculation

Building on the previous step where we generated dynamic_window_sizes, this section calculates the rolling variance of log returns using an exponential moving average (EMA). The calculate_rolling_variance function applies an EMA whose span equals the dynamically adjusted window size, ensuring that more recent data is emphasized during periods of high volatility.

# Function to calculate rolling variance using exponential moving average (EMA)
def calculate_rolling_variance(data, window_size):
    return data.ewm(span=window_size).var()

# Initialize a list to store dynamic rolling variances, pre-filled with NaN values
dynamic_rolling_variances = [np.nan] * len(data)

# Calculate dynamic rolling variances for each data point
for idx, (_, row) in enumerate(data.iterrows()):
    window_size = int(row['dynamic_window_sizes'])  # Get dynamic window size for this row

    # Check if there are enough data points behind the current data point to create a window
    if idx < window_size - 1:
        continue  # Skip if not enough data points for rolling variance

    # Calculate rolling variance for the previous 'window_size' rows including current index
    start_idx = idx - window_size + 1
    end_idx = idx + 1
    data_window = data['Log Returns'].iloc[start_idx : end_idx]

    # Calculate rolling variance
    dynamic_rolling_variance = calculate_rolling_variance(data_window, window_size).iloc[-1]

    # Store the result in the list at the correct index
    dynamic_rolling_variances[idx] = dynamic_rolling_variance

# Add the dynamic rolling variances as a new column to your DataFrame
data.loc[:, 'dynamic_rolling_variances'] = dynamic_rolling_variances

The code iterates through each data point, using the dynamic window size from the previous step to calculate the variance. This ensures that in more volatile market conditions, the rolling variance is calculated using a smaller, more focused window of recent data.

The dynamically calculated rolling variances are stored in a list and added as a new column, dynamic_rolling_variances, in the dataset. This allows the model to better capture how market variability evolves, adapting its analysis based on changes in volatility over time.

Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF)

In this project, a fixed window size of 20 was chosen for calculating rolling volatility. This decision was informed by analyzing the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots of the volatility data.

*Figure: ACF and PACF plots of the rolling volatility.*

The ACF plot shows the correlation between the volatility and its lagged values, while the PACF plot helps in understanding the direct relationship between volatility and its lags, excluding the influence of intermediate lags.

Upon examining these plots, it was observed that the significant correlations and patterns start to diminish around the 20th lag. This indicates that a 20-day window effectively captures the relevant historical volatility, making it a suitable choice for the fixed window size in this context.

This selection ensures that the model accounts for the most impactful recent volatility trends, without incorporating too much noise from older data.
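
The plotting code for the ACF/PACF analysis is not included in this excerpt; below is a minimal sketch of how such plots could be produced, assuming the statsmodels package is installed and data['volatility'] has already been computed as in section 5.4:

# Sketch: ACF and PACF of the rolling volatility (assumes statsmodels is available)
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

volatility = data['volatility'].dropna()  # drop the initial NaNs from the rolling window

fig, axes = plt.subplots(2, 1, figsize=(10, 8))
plot_acf(volatility, lags=40, ax=axes[0], title='ACF of Rolling Volatility')
plot_pacf(volatility, lags=40, ax=axes[1], title='PACF of Rolling Volatility')
plt.tight_layout()
plt.show()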

5.6) Bayesian Updating with Normal-Inverse-Gamma

In this project, Bayesian updating continues to play a key role in dynamically estimating the mean and variance of market returns. This section introduces the Normal-Inverse-Gamma conjugate prior, which allows us to update our estimates of the mean and variance efficiently. The Normal-Inverse-Gamma distribution is especially useful for modeling uncertainty in both the mean and variance of the returns.

Just as in the previous steps, this iterative Bayesian process ensures that the model adapts to incoming data, making it more robust and flexible in a dynamic market environment. By continuously updating the parameters, we can refine our understanding of market returns and make more informed decisions.

Why Use the Normal-Inverse-Gamma?

The Normal-Inverse-Gamma conjugate prior is chosen because it allows for simultaneous updating of both the mean and variance. This is crucial in financial markets, where both the average returns and volatility evolve over time. This Bayesian approach updates the posterior estimates of the mean (mu), the variance (sigma²), and the parameters governing these distributions, leading to more accurate and adaptive predictions.
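
For reference, the joint Normal-Inverse-Gamma prior over the mean and variance can be written as follows, using the same mu, kappa, alpha, beta parameters that appear in the code below (subscript 0 denoting the prior values):

$$p(\mu, \sigma^2) = \mathcal{N}\!\left(\mu \,\middle|\, \mu_0, \frac{\sigma^2}{\kappa_0}\right)\,\mathrm{Inv\text{-}Gamma}\!\left(\sigma^2 \,\middle|\, \alpha_0, \beta_0\right)$$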

Specifically, the Bayesian method helps in calculating:

Posterior mean (mu_posterior): Updated based on new daily return data.

Posterior variance (sigma²): Calculated using the updated posterior parameters.

The Bayesian Formulas Used

The code starts by defining initial priors:

mu_prior = 0  # Prior mean
kappa_prior = 1  # Prior precision factor
alpha_prior = 3  # Prior alpha for Inverse-Gamma
beta_prior = 2  # Prior beta for Inverse-Gamma

These priors represent our initial assumptions about the market's behavior:

mu_prior is the prior mean, set to 0.

kappa_prior is the prior precision, set to 1, representing initial certainty.

alpha_prior and beta_prior are parameters for the Inverse-Gamma distribution, governing our initial belief about the volatility.

The code then employs Bayesian formulas to update the posterior mean and variance, as well as to adjust the parameters of the Inverse-Gamma distribution, which models the uncertainty in volatility.

Updating the Posterior Mean, Precision, and Variance

The posterior mean (mu_posterior) and variance (sigma²) are updated using the following formulas:

$$\kappa_{\text{posterior}} = \kappa_{\text{prior}} + 1$$

$$\mu_{\text{posterior}} = \frac{\kappa_{\text{prior}}\,\mu_{\text{prior}} + x_i}{\kappa_{\text{posterior}}}$$

$$\alpha_{\text{posterior}} = \alpha_{\text{prior}} + \tfrac{1}{2}$$

$$\beta_{\text{posterior}} = \beta_{\text{prior}} + \frac{\kappa_{\text{prior}}\,(x_i - \mu_{\text{prior}})^2}{2\,\kappa_{\text{posterior}}}$$

Where:

  • x_i represents the new daily return data point.

  • mu_prior and kappa_prior are the prior estimates for the mean and precision.

  • alpha_prior and beta_prior are the prior parameters governing volatility.

These formulas describe how the posterior estimates are updated. The posterior mean (mu_posterior) is a weighted combination of the prior mean and the new data point x_i, where the weights are determined by the precision (kappa). The posterior precision (kappa_posterior) increases by 1 with each new observation, reflecting greater certainty in the estimate of the mean. Alpha increases by 0.5 with each observation, and beta adjusts based on how much the new observation deviates from the prior mean, weighted by the prior precision. These updates ensure that the model incorporates new information while retaining prior knowledge.

def update_posterior(x_i, mu_prior, kappa_prior, alpha_prior, beta_prior):
    """Update posterior parameters using the Normal-Inverse-Gamma conjugate prior."""
    # Update kappa
    kappa_posterior = kappa_prior + 1
    
    # Update mu
    mu_posterior = (kappa_prior * mu_prior + x_i) / kappa_posterior
    
    # Update alpha
    alpha_posterior = alpha_prior + 0.5
    
    # Update beta
    beta_posterior = beta_prior + (kappa_prior * (x_i - mu_prior) ** 2) / (2 * kappa_posterior)
    
    return mu_posterior, kappa_posterior, alpha_posterior, beta_posterior

Posterior Variance Formula

After updating the posterior parameters, the variance of the mean is calculated. The variance reflects the uncertainty in our estimate of the mean and is based on the updated posterior parameters.

$$\mathbb{E}[\sigma^2] = \frac{\beta_{\text{posterior}}}{\alpha_{\text{posterior}} - 1}, \qquad \sigma^2_{\text{posterior}} = \frac{\mathbb{E}[\sigma^2]}{\kappa_{\text{posterior}}}$$

Where:

  • beta_posterior and alpha_posterior are the updated parameters from the previous step.

  • kappa_posterior is the updated precision, which influences the certainty of the mean.

The expected variance (sigma^2_expected) is calculated by dividing the updated posterior beta by the updated posterior alpha minus 1. The posterior variance of the mean (sigma^2_posterior) is then derived by dividing the expected variance by the updated precision (kappa_posterior). A higher precision results in lower variance, indicating that the model has more certainty about its mean estimate as new data is processed.

def calculate_posterior_variance(kappa_posterior, alpha_posterior, beta_posterior):
    """Calculate the posterior variance of mu."""
    # Expected variance (sigma^2)
    expected_variance = beta_posterior / (alpha_posterior - 1)
    
    # Posterior variance of mu
    sigma_posterior_squared = expected_variance / kappa_posterior
    
    return sigma_posterior_squared

Iterating Bayesian Updates and Calculating the CDF

After updating the posterior mean and variance for each data point in the dataset, we now use these updated estimates to calculate the Cumulative Distribution Function (CDF). The CDF is important because it helps us assess the probability that a particular daily return will be below a certain threshold, giving insight into the likelihood of different market outcomes. By utilizing the updated posterior parameters, including the mean and variance, we can calculate the CDF for each daily return, which reflects the probability distribution based on new data.

$$\text{CDF}(x_i) = \Phi\!\left(\frac{x_i - \mu_{\text{posterior}}}{\sigma_{\text{posterior}}}\right)$$

# Dictionaries to store posterior estimates and CDF values, keyed by date
updated_bayes_means = {}
updated_bayes_sds = {}
cdfs = {}

# Loop through the data for Bayesian updates
for i, row in data.iterrows():
    x_i = row['Log Returns']
    
    # Update posterior parameters
    mu_posterior, kappa_posterior, alpha_posterior, beta_posterior = update_posterior(
        x_i, mu_prior, kappa_prior, alpha_prior, beta_prior)
    
    # Calculate posterior variance of mu
    sigma_posterior_squared = calculate_posterior_variance(kappa_posterior, alpha_posterior, beta_posterior)
    
    # Store posterior mean and standard deviation
    updated_bayes_means[i] = mu_posterior
    updated_bayes_sds[i] = np.sqrt(sigma_posterior_squared)
    
    # Calculate CDF
    cdfs[i] = norm.cdf(x_i, mu_posterior, np.sqrt(sigma_posterior_squared))
    
    # Update priors for next iteration
    mu_prior, kappa_prior, alpha_prior, beta_prior = mu_posterior, kappa_posterior, alpha_posterior, beta_posterior

# Store the calculated CDF values in the DataFrame
data.loc[:, 'CDF'] = data.index.map(cdfs)

In this section, the code iterates through each observation in the dataset. For each log return (x_i), the posterior parameters for the mean (mu_posterior), precision (kappa_posterior), and variance-controlling parameters (alpha_posterior, beta_posterior) are updated using Bayesian updating, just like in the previous step. After updating the parameters, the posterior variance (sigma_posterior_squared) of the mean is calculated, giving us an updated sense of the uncertainty around the mean estimate.

Next, the CDF is computed using the updated posterior mean and variance. This step uses the norm.cdf function, which calculates the probability that a given return will be less than or equal to x_i based on the updated distribution. The CDF values represent these probabilities for each data point. After calculating the CDF, the priors are updated with the new posterior values, allowing the model to continuously adapt as more data comes in. This ensures that each new observation refines our estimates further, making the model more adaptive and accurate over time.

Finally, all the calculated CDF values are stored in a new column (CDF) in the data DataFrame, making it easy to access and use them for further analysis or decision-making.

6) Model Training

In this section, we move into the model training phase of our project, where we aim to develop a strategy for trading based on the financial data we have preprocessed. To achieve this, we use reinforcement learning (RL), a powerful machine learning approach that allows our model to learn from its own actions in a simulated environment. One of the most effective tools for developing RL models is the OpenAI Gym library. OpenAI Gym provides a collection of environments — computational representations of problems — that standardize how agents (our models) interact with these environments. This allows us to build, test, and compare various RL strategies effectively.

6.1) What is an Environment in OpenAI Gym?

In the context of reinforcement learning, an environment is essentially a problem that the agent (our trading model) tries to solve. It defines the state space (all possible situations the agent might encounter), the action space (all possible moves the agent can make), and the reward structure (how good or bad each action is, given the current state). The OpenAI Gym library provides a convenient framework for building these environments and includes many pre-built environments for classic RL problems like cart-pole balancing, playing video games, and more. However, since our problem is a custom financial trading scenario, we need to create a custom environment that reflects the dynamics of trading in a financial market.
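
As a bare-bones illustration of these three ingredients (a hypothetical toy environment, not the project's actual TradingEnv, which is built in section 6.3), a custom Gym environment only needs to declare an action space and an observation space and implement reset and step:

# Minimal illustrative Gym environment (hypothetical example)
import gym
import numpy as np
from gym import spaces

class MinimalEnv(gym.Env):
    """Toy sketch: 3 discrete actions, a 4-feature state, 10-step episodes."""
    def __init__(self):
        super().__init__()
        self.action_space = spaces.Discrete(3)  # e.g., Sell, Hold, Buy
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32)
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(4, dtype=np.float32)  # initial state

    def step(self, action):
        self.t += 1
        obs = np.random.randn(4).astype(np.float32)  # next state (random placeholder)
        reward = 1.0 if action == 1 else 0.0          # toy reward structure
        done = self.t >= 10                           # episode ends after 10 steps
        return obs, reward, done, {}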

6.2) Feature Normalization and Data Splitting

In this step, we split the dataset into training and validation sets and normalize specific features to ensure the model performs optimally. We exclude certain features like the RSI and CDF from normalization since they represent ratios and probabilities, which don’t require scaling. The normalization process ensures that all features, aside from RSI and CDF, are on a consistent scale, making it easier for the model to learn from the data.

# List of features to normalize (excluding 'RSI' and 'CDF')
features_to_normalize = [
    'Nor_Adj_Close', 'MACD_Histogram', 'VWAP',
    '50_day_MA', '20_day_MA', '9_day_MA',
    'Skewness', 'Kurtosis', 'dynamic_rolling_variances'
]

# Split the data into training and validation sets
train_end_date = '2021-12-31'
val_end_date = '2022-12-31'

train_data = data[data.index <= train_end_date].copy()
val_data = data[(data.index > train_end_date) & (data.index <= val_end_date)].copy()

# Ensure there's no overlap
assert train_data.index.max() < val_data.index.min(), "Training and validation data overlap!"

# Normalize features using scalers fitted on training data
scaler = StandardScaler()
train_data[features_to_normalize] = scaler.fit_transform(train_data[features_to_normalize])
val_data[features_to_normalize] = scaler.transform(val_data[features_to_normalize])

# Reset indices but keep the date as a column
train_data.reset_index(inplace=True)
val_data.reset_index(inplace=True)
train_data.rename(columns={'index': 'Date'}, inplace=True)
val_data.rename(columns={'index': 'Date'}, inplace=True)

# Ensure 'Date' columns are datetime
train_data['Date'] = pd.to_datetime(train_data['Date'])
val_data['Date'] = pd.to_datetime(val_data['Date'])

The features to be normalized include values such as adjusted close prices, technical indicators like MACD Histogram, and statistical measures like skewness. However, we intentionally avoid normalizing RSI and CDF because RSI is a bounded ratio between 0 and 100, and CDF represents probabilities, both of which are already on a standard scale. The dataset is split into training data (up to the end of 2021) and validation data (2022). We fit a StandardScaler on the training data, ensuring both the training and validation sets are scaled similarly. Lastly, the indices are reset and converted to datetime format, making the data ready for use in the next steps.

6.3) Creating a Custom Trading Environment

To model our trading scenario, we define a custom environment class named TradingEnv by extending the gym.Env class from OpenAI Gym. This custom environment will simulate the process of trading a financial asset, allowing our RL model to interact with it by buying, holding, or selling based on the available data.

Let's walk through the custom environment, breaking it down into separate parts with explanations.

Initialization (`__init__`)

The __init__ function sets up the environment with necessary parameters and variables. This includes the initial balance, transaction costs, lookback window, and the dataset used for training. It also defines the action space (Buy, Hold, Sell) and observation space for the reinforcement learning model.

class TradingEnv(gym.Env):
    def __init__(self, data, initial_balance=1000000, transaction_cost=0.000135, lookback_window=10):
        super(TradingEnv, self).__init__()

        # Dataset and parameters
        self.data = data
        self.initial_balance = initial_balance
        self.transaction_cost = transaction_cost
        self.lookback_window = lookback_window

        # Define min and max percentage for episode length calculation
        self.min_percentage = 0.30
        self.max_percentage = 0.90

        # Environment state
        self.balance = self.initial_balance
        self.stock_owned = 0
        self.current_step = 0
        self.current_position = None
        self.entry_price = None
        self.last_action = None
        self.profit_target_reached = False
        self.trading_history = []

        # Action space: 3 discrete actions (0 = Sell, 1 = Hold, 2 = Buy)
        self.action_space = gym.spaces.Discrete(3)
        
        # Observation space: shape is (lookback_window, n_features), i.e., sequence of timesteps
        self.n_features = len(self._get_observation())
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(self.lookback_window, self.n_features), dtype=np.float32
        )

        # Initialize a tracker to ensure each datapoint is covered
        self.datapoints_covered = np.zeros(len(self.data) - self.lookback_window, dtype=bool)

This function initializes the trading environment. The dataset (data) contains historical market data, and the agent starts with a predefined balance (initial_balance). The agent can choose between three actions: Buy, Hold, and Sell, defined in the action_space. The environment tracks essential variables like stock owned, current balance, and trading history. It also sets up a lookback window so that each observation covers a sequence of past timesteps rather than a single point in time.

The observation_space defines the shape of the data the model receives at each timestep, which includes the lookback window's worth of market and internal state features.

Observation Function (_get_observation)

The _get_observation function generates a single feature vector (observation) for the current timestep. It pulls data such as adjusted close price, RSI, MACD Histogram, and internal state features (balance, stock owned, and last action).

def _get_observation(self, index=None):
    """
    Return a single observation (a feature vector) from the dataset.
    The observation corresponds to the timestep at the provided index,
    including internal state features.
    """
    if index is None:
        index = self.current_step
    if index >= len(self.data):
        raise ValueError("Current step exceeds the length of the data")

    current_data = self.data.iloc[index]

    # Construct the feature vector with the specified features
    observation = np.array([
        current_data['Nor_Adj_Close'],
        current_data['RSI'],
        current_data['MACD_Histogram'],
        current_data['VWAP'],
        current_data['Bollinger_Band_Width'],
        current_data['50_day_MA'],
        current_data['20_day_MA'],
        current_data['9_day_MA'],
        current_data['Skewness'],
        current_data['Kurtosis'],
        current_data['dynamic_rolling_variances'],
        current_data['CDF'],
        # Internal state features
        self.balance / self.initial_balance,  # Normalize balance
        self.stock_owned,
        self.last_action if self.last_action is not None else -1  # Last action taken (-1 if none)
    ], dtype=np.float32)

    return observation

This function pulls relevant features from the dataset and combines them with internal state variables like current balance, stock owned, and the last action. This observation is what the agent receives at each timestep to make trading decisions. It is designed to provide a snapshot of the market's current state, including technical indicators like MACD Histogram, RSI, and other calculated features. Internal features (e.g., normalized balance) help the agent keep track of its portfolio.

Setup Timesteps Function (_setup_timesteps)

The _setup_timesteps function calculates the number of timesteps (or duration) for each episode, based on how much data remains and random intervals.

def _setup_timesteps(self, start_index):
    """
    Setup the number of timesteps for an episode, based on the start index and remaining data.
    Returns 0 if there are not enough sequences to form at least one timestep.
    """
    dataset_size = len(self.data)
    remaining_data = dataset_size - (start_index + 1)
    remaining_sequences = remaining_data - (self.lookback_window - 1)
    
    if remaining_sequences < 1:
        return 0

    # Calculate min and max timesteps based on percentages
    min_timesteps = max(1, int(remaining_sequences * self.min_percentage))
    max_timesteps = int(remaining_sequences * self.max_percentage)
    
    if max_timesteps <= min_timesteps:
        max_timesteps = min_timesteps + 1
    
    num_timesteps = np.random.randint(min_timesteps, max_timesteps)
    num_timesteps = min(num_timesteps, remaining_sequences)
    
    return num_timesteps

The environment uses this function to decide the length of each episode dynamically, based on the amount of data left in the dataset. It ensures that the environment has enough data to process meaningful trading sequences. The random selection of timesteps between the minimum and maximum percentages keeps each episode varied, making the environment more dynamic for the reinforcement learning model.

Reset Function (reset)

The reset function restores the environment to its initial state at the start of each new episode, selecting a valid starting point for the agent to begin trading.

def reset(self):
    """
    Reset the environment to the initial state and return the first sequence.
    Randomly select a starting point from uncovered datapoints.
    If no valid starting point is found, reset the tracker and attempt again.
    """
    # Reset environment variables
    self.balance = self.initial_balance
    self.stock_owned = 0
    self.current_position = None
    self.entry_price = None
    self.trading_history = []

    # Attempt to find a valid starting point
    valid_start_found = False
    max_attempts = len(self.datapoints_covered)  # Prevent infinite loop
    attempts = 0

    while not valid_start_found and attempts < max_attempts:
        uncovered_indices = np.where(self.datapoints_covered == False)[0]
        
        if len(uncovered_indices) == 0:
            self.datapoints_covered[:] = False
            uncovered_indices = np.where(self.datapoints_covered == False)[0]
            if len(uncovered_indices) == 0:
                raise ValueError("Dataset too small to form any sequences.")
        
        selected_index = np.random.choice(uncovered_indices)
        self.current_step = selected_index + (self.lookback_window - 1)
        
        self.num_timesteps = self._setup_timesteps(self.current_step)
        
        if self.num_timesteps > 0:
            valid_start_found = True
            start_cover = self.current_step - self.lookback_window + 1
            end_cover = self.current_step + 1
            self.datapoints_covered[start_cover:end_cover] = True
            return self._get_sequence(self.current_step)
        else:
            self.datapoints_covered[selected_index] = True
            attempts += 1
            print(f"Skipping episode starting at index {selected_index} due to insufficient data.")

    if not valid_start_found:
        self.datapoints_covered[:] = False
        print("Resetting datapoints_covered tracker and attempting to reset again.")
        return self.reset()

This function resets the environment at the beginning of each episode. It restores the initial balance, clears the trading history, and randomly selects an uncovered starting point from the dataset. The function ensures that enough data is available to form valid trading sequences. If not enough data is available, it resets the data tracker and tries again, ensuring fair starting points for every episode.

Reset Evaluation Function (reset_eval)

The reset_eval function resets the environment specifically for evaluation, always starting from the beginning of the dataset.

def reset_eval(self):
    """
    Reset the environment to the initial state for evaluation/testing.
    Start from the beginning of the dataset.
    """
    self.balance = self.initial_balance
    self.stock_owned = 0
    self.current_position = None
    self.entry_price = None
    self.trading_history = []
    self.profit_target_reached = False

    self.current_step = self.lookback_window - 1
    return self._get_sequence(self.current_step)

This function is a simplified reset used during the evaluation phase. It resets all internal states like balance, stock ownership, and trading history, but it always starts from the beginning of the dataset, unlike reset, which picks a random starting point for each training episode.

Get Sequence Function (_get_sequence)

The _get_sequence function returns a sequence of timesteps based on the current step, using the lookback window to gather past observations and the present one.

def _get_sequence(self, current_step):
    """
    Return a sequence of timesteps looking back from the current step.
    The sequence includes the current step and the previous lookback_window - 1 steps.
    """
    if current_step < self.lookback_window - 1:
        raise ValueError("Not enough data points to create the lookback window")

    sequence_start = current_step - self.lookback_window + 1
    sequence_end = current_step + 1  # Include the current step

    # Collect the sequence of observations for each timestep in the window
    sequence = []
    for i in range(sequence_start, sequence_end):
        observation = self._get_observation(i)  # Get the observation for each timestep
        sequence.append(observation)

    return np.array(sequence).astype(np.float32)

This function provides the reinforcement learning model with a sequence of observations. It uses the current step and a lookback window to collect data from both past and present timesteps, forming the "input history" for the model. This ensures that the agent can consider historical data when making decisions. The function checks that there is enough data to form a sequence and raises an error if the dataset is too small.

Step Function (step)

The step function processes the agent’s action and returns the next sequence, reward, and a flag indicating whether the episode is finished. It also updates the agent’s portfolio, calculates transaction costs, and rewards based on the chosen action.

def step(self, action):
        """
        Take an action, calculate the reward, and return the next sequence.
        """
        # Record the current price
        current_price = self.data.iloc[self.current_step]['Adj Close']
        
        # Initialize variables
        reward = 0
        transaction_cost_value = 0
    
       
        # Scaling factors
        immediate_reward_scale = 1.0  # Scale for immediate rewards (e.g., price difference rewards)
        sharpe_ratio_scale = 100
        
        # Handle the actions: 0 = Sell, 1 = Hold, 2 = Buy
        if action == 2:  # Buy
            if self.stock_owned == 0:
                # Buy with 10% of the available balance
                max_shares = (self.balance * 0.10) / current_price
                if max_shares >= 1:
                    shares_to_buy = np.floor(max_shares)
                    total_purchase = shares_to_buy * current_price
                    transaction_cost_value = total_purchase * self.transaction_cost
                    total_cost = total_purchase + transaction_cost_value
                    self.stock_owned += shares_to_buy
                    self.balance -= total_cost
                    self.entry_price = current_price  # Track the purchase price
                    
                    # No reward or penalty for buying; just tracking the position
                    reward = 0 
                else:
                    # Penalty when the balance cannot cover even one share
                    reward -= 10 * immediate_reward_scale
            else:
                # Penalty for buying while a position is already held
                reward -= 10 * immediate_reward_scale
    
        elif action == 1:  # Hold
            if self.stock_owned > 0:
                # Calculate the difference between the current price and the entry price
                price_difference = current_price - self.entry_price
                reward = price_difference * self.stock_owned  # Reward is profit or loss on held stock
    
                # Scale the immediate reward
                reward *= immediate_reward_scale
    
            else:
                # Penalty for holding without any stock held
                reward -= 10 * immediate_reward_scale
    
        elif action == 0:  # Sell
            if self.stock_owned > 0:
                # Calculate profit or loss based on sell price and buy price
                price_difference = current_price - self.entry_price
                reward = price_difference * self.stock_owned  # Reward is profit or loss on sold stock
    
                # Sell all shares
                gross_sell_value = self.stock_owned * current_price
                transaction_cost_value = gross_sell_value * self.transaction_cost
                net_sell_value = gross_sell_value - transaction_cost_value
                self.balance += net_sell_value
                self.stock_owned = 0
                self.entry_price = None
    
                # Scale the immediate reward
                reward *= immediate_reward_scale
    
            else:
                # Penalty for attempting to sell without holding any stock
                reward -= 10 * immediate_reward_scale  # Penalty remains the same
    
        # Update last action
        self.last_action = action
    
        # Move the step forward
        self.current_step += 1
    
        # Update net worth
        current_net_worth = self.balance + self.stock_owned * current_price
    
        # Set reward to zero for the buy action (no immediate feedback for buy)
        if action == 2:
            reward = 0  # No reward for buying
    
        # Check if we're done
        done = self.current_step >= len(self.data) - 1
    
        # If done, liquidate any remaining stock holdings
        if done and self.stock_owned > 0:
            # Sell all shares at current price
            gross_sell_value = self.stock_owned * current_price
            transaction_cost_value = gross_sell_value * self.transaction_cost
            net_sell_value = gross_sell_value - transaction_cost_value
            self.balance += net_sell_value
            self.stock_owned = 0
            self.entry_price = None
            # Update net worth
            current_net_worth = self.balance
    
        # If done (end of the episode), calculate Sharpe Ratio and add it as a scaled reward
        if done:
            historical_net_worth = [entry['portfolio_value'] for entry in self.trading_history]
            returns = np.diff(historical_net_worth)
            sharpe_ratio = self.calculate_sharpe_ratio(returns)
    
            # Scale the Sharpe Ratio to make it comparable to immediate rewards
            scaled_sharpe_ratio = sharpe_ratio * sharpe_ratio_scale  # Example scaling factor
    
            # Add the scaled Sharpe Ratio as a final bonus reward
            reward += scaled_sharpe_ratio
    
        # Task completion reward (10% total return)
        if not self.profit_target_reached and current_net_worth >= self.initial_balance * 1.10:
            # Calculate bonus as 10% of profit
            profit = current_net_worth - self.initial_balance
            bonus_reward = profit * 0.10
            reward += bonus_reward
            self.profit_target_reached = True
    
        # Log the trading history
        self.trading_history.append({
            'step': self.current_step,
            'current_price': current_price,
            'balance': self.balance,
            'portfolio_value': current_net_worth,
            'reward': reward,
            'action': action
        })
    
        # Get the next sequence after the action
        if not done:
            next_sequence = self._get_sequence(self.current_step)
        else:
            # Ensure next_sequence is a NumPy array of the correct shape filled with zeros
            next_sequence = np.zeros_like(self._get_sequence(self.current_step - 1))
    
        return next_sequence, reward, done, {"portfolio_value": current_net_worth}

The step function processes the agent’s chosen action and returns the next observation sequence, reward, and done flag. The agent can choose to buy, hold, or sell stocks. Buying involves purchasing stock with a portion of the available balance, holding maintains the stock position, and selling involves liquidating the holdings. Rewards are calculated based on the change in stock price relative to the entry price, with penalties for invalid actions (e.g., selling without holding stock). At the end of the episode, the Sharpe ratio is calculated and added to the reward, and any remaining stock is liquidated.
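
With reset and step in place, the environment can be sanity-checked before any learning happens. The following hypothetical smoke test (not part of the original training pipeline) simply rolls the environment forward with random actions, assuming train_data from section 6.2 and the TradingEnv class above:

# Hypothetical smoke test: run one episode with random actions
env = TradingEnv(train_data)
state = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random Sell / Hold / Buy
    state, reward, done, info = env.step(action)
print("Final portfolio value:", info["portfolio_value"])
env.render()  # visualize the episode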

Calculate Sharpe Ratio Function (calculate_sharpe_ratio)

The calculate_sharpe_ratio function computes the Sharpe ratio for the agent's trading performance, based on the returns and a given risk-free rate.

def calculate_sharpe_ratio(self, returns, risk_free_rate=0.02):
    """
    Calculate the Sharpe ratio based on the returns.
    """
    excess_returns = returns - risk_free_rate
    sharpe_ratio = np.mean(excess_returns) / (np.std(excess_returns) + 1e-9)
    return sharpe_ratio

This function calculates the Sharpe ratio, a popular metric for evaluating the risk-adjusted return of an investment. It measures the excess return per unit of risk taken, where the risk is represented by the standard deviation of returns. The higher the Sharpe ratio, the better the risk-adjusted performance. In this case, it is used as a final bonus reward at the end of each episode.
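
For intuition, here is a quick check of the formula on a handful of made-up step-to-step changes in net worth (hypothetical numbers, not agent output):

# Toy illustration of the Sharpe ratio calculation used above (hypothetical returns)
import numpy as np

returns = np.array([1200.0, -300.0, 450.0, 800.0, -150.0])  # changes in net worth per step
risk_free_rate = 0.02
excess_returns = returns - risk_free_rate
sharpe_ratio = np.mean(excess_returns) / (np.std(excess_returns) + 1e-9)
print(round(sharpe_ratio, 3))  # approximately 0.708 for these numbers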

Render Function (render)

The render function plots the portfolio value, rewards, and actions over the course of an episode, providing a visual representation of the agent's performance.

def render(self, mode='human'):
    """
    Render plots showing the portfolio value, rewards, and actions over the episode.
    """
    if len(self.trading_history) == 0:
        print("No trading history available for this episode.")
        return

    steps = [entry['step'] for entry in self.trading_history]
    portfolio_values = [entry['portfolio_value'] for entry in self.trading_history]
    rewards = [entry['reward'] for entry in self.trading_history]
    prices = [entry['current_price'] for entry in self.trading_history]
    
    actions = [entry.get('action', None) for entry in self.trading_history]

    fig, axs = plt.subplots(3, 1, figsize=(12, 15))

    axs[0].plot(steps, portfolio_values, label='Portfolio Value', color='blue')
    axs[0].set_title('Portfolio Value Over Time')
    axs[0].set_xlabel('Timestep')
    axs[0].set_ylabel('Portfolio Value (INR)')
    axs[0].legend()
    axs[0].grid(True)

    cumulative_rewards = np.cumsum(rewards)
    axs[1].plot(steps, rewards, label='Reward', color='orange')
    axs[1].plot(steps, cumulative_rewards, label='Cumulative Reward', color='green', linestyle='--')
    axs[1].set_title('Rewards Over Time')
    axs[1].set_xlabel('Timestep')
    axs[1].set_ylabel('Reward')
    axs[1].legend()
    axs[1].grid(True)

    axs[2].plot(steps, prices, label='Price', color='black')
    
    buy_steps = [step for step, action in zip(steps, actions) if action == 2]
    buy_prices = [price for price, action in zip(prices, actions) if action == 2]
    sell_steps = [step for step, action in zip(steps, actions) if action == 0]
    sell_prices = [price for price, action in zip(prices, actions) if action == 0]
    hold_steps = [step for step, action in zip(steps, actions) if action == 1]
    hold_prices = [price for price, action in zip(prices, actions) if action == 1]

    axs[2].scatter(buy_steps, buy_prices, marker='^', color='green', label='Buy', zorder=3)
    axs[2].scatter(sell_steps, sell_prices, marker='v', color='red', label='Sell', zorder=3)
    axs[2].scatter(hold_steps, hold_prices, marker='o', color='gray', s=10, label='Hold', zorder=2)
    axs[2].set_title('Price and Actions Over Time')
    axs[2].set_xlabel('Timestep')
    axs[2].set_ylabel('Price (INR)')
    axs[2].legend()
    axs[2].grid(True)

    plt.tight_layout()
    plt.show()

6.4) SumTree Class for Prioritized Experience Replay

The SumTree is an essential data structure used in Prioritized Experience Replay, which helps in efficiently sampling experiences based on their priority. The SumTree stores experiences along with their associated priorities and ensures that experiences with higher priority are sampled more frequently during training.

class SumTree:
    def __init__(self, capacity):
        self.capacity = capacity  # Number of leaf nodes (experiences)
        self.tree = np.zeros(2 * capacity - 1)  # Binary tree array
        self.data = np.zeros(capacity, dtype=object)  # Experience storage
        self.size = 0
        self.data_pointer = 0

    def add(self, priority, data):
        idx = self.data_pointer + self.capacity - 1
        self.data[self.data_pointer] = data
        self.update(idx, priority)

        self.data_pointer += 1
        if self.data_pointer >= self.capacity:  # Replace when full
            self.data_pointer = 0

        self.size = min(self.size + 1, self.capacity)

    def update(self, idx, priority):
        change = priority - self.tree[idx]
        self.tree[idx] = priority

        while idx != 0:
            idx = (idx - 1) // 2
            self.tree[idx] += change

    def get_leaf(self, value):
        parent_idx = 0
        while True:
            left_child_idx = 2 * parent_idx + 1
            right_child_idx = left_child_idx + 1

            if left_child_idx >= len(self.tree):
                leaf_idx = parent_idx
                break
            else:
                if value <= self.tree[left_child_idx]:
                    parent_idx = left_child_idx
                else:
                    value -= self.tree[left_child_idx]
                    parent_idx = right_child_idx

        data_idx = leaf_idx - self.capacity + 1
        return leaf_idx, self.tree[leaf_idx], self.data[data_idx]

    def total_priority(self):
        return self.tree[0]  # Root node

The SumTree operates with a fixed capacity, storing experiences and their associated priorities in a binary tree structure (self.tree), while the actual experiences are stored in a separate array (self.data). The add function inserts new experiences with a priority, and as the buffer fills, older experiences are cyclically replaced. The update function adjusts the priority of a specific experience, propagating the change through the tree to maintain accurate total priority values. The get_leaf function performs a binary search to sample an experience based on its priority, ensuring that experiences with higher priorities are more likely to be selected. Finally, the total_priority function returns the total sum of all priorities, which is stored at the root of the tree and used for probability-based sampling. This structure enables the system to efficiently prioritize important experiences during training.
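
To make the sampling behaviour concrete, here is a small hypothetical usage sketch: after adding a few experiences with different priorities, values drawn uniformly from [0, total_priority) land in the higher-priority leaves proportionally more often:

# Hypothetical demonstration of priority-proportional sampling with the SumTree above
import numpy as np

tree = SumTree(capacity=4)
tree.add(priority=1.0, data="exp_A")
tree.add(priority=5.0, data="exp_B")  # highest priority, sampled most often
tree.add(priority=2.0, data="exp_C")

counts = {"exp_A": 0, "exp_B": 0, "exp_C": 0}
for _ in range(10000):
    s = np.random.uniform(0, tree.total_priority())  # total_priority() is 8.0 here
    _, _, experience = tree.get_leaf(s)
    counts[experience] += 1

print(counts)  # counts are roughly in the ratio 1 : 5 : 2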

6.5) Setting Up the Environment and Prioritized Experience Replay Buffer

At this stage of the project, we are defining the trading environments and setting up the replay buffer using a SumTree structure. The replay buffer will store experiences that the agent learns from, and the prioritization ensures that the most important experiences are sampled more frequently during training. This section also initializes the beta parameter, which helps control the importance-sampling weights during training, ensuring that the agent learns from both high-priority and random experiences over time.

# Define the named tuple for experience
Experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])

# Setting up the environments
env = TradingEnv(train_data)
val_env = TradingEnv(val_data)
num_episodes = 550  # Adjust the number of episodes as needed

# Create the replay buffer using the SumTree class
MEMORY_SIZE = 50000
memory_buffer = SumTree(capacity=MEMORY_SIZE)  # Set replay buffer size to 50,000
alpha = 0.6  # Prioritization exponent

# Initialize beta parameters for importance-sampling weights
beta_start = 0.4  # Starting value for beta
beta_end = 1.0  # Final value for beta
beta_increment_per_episode = (beta_end - beta_start) / num_episodes

# Initialize beta
beta = beta_start

In this code, we begin by defining Experience to store each step of the agent’s interaction with the environment. We then set up the training and validation environments using the TradingEnv class. The replay buffer is created with a SumTree structure, which allows us to efficiently store and prioritize up to 50,000 experiences. The parameter alpha controls how much prioritization is applied, and beta is initialized to adjust the importance-sampling weights, increasing its value gradually over 550 episodes to ensure that the agent learns from both highly prioritized and randomly selected experiences.
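
The per-episode beta update itself appears later in the training loop (not shown in this excerpt); a minimal sketch of how it would typically be advanced each episode, under that assumption:

# Sketch: anneal beta toward 1.0 once per episode (assumed placement inside the training loop)
for episode in range(num_episodes):
    # ... collect experiences, sample from the buffer, train the network ...
    beta = min(beta_end, beta + beta_increment_per_episode)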

6.6) Storing and Sampling Experiences in Prioritized Experience Replay

As the agent interacts with the environment, it gathers experiences that are crucial for learning. These two functions handle the storage of experiences in the replay buffer and the sampling of experiences with prioritization, making sure that more significant experiences have a higher probability of being replayed during training. This is essential for improving the efficiency of learning, especially in reinforcement learning setups where important experiences should be learned more frequently.

Storing Experiences

def store_experience(memory_buffer, exp):
    """
    Stores an experience tuple in the replay buffer.
    """
    # Get the maximum current priority from the leaf nodes
    max_priority = np.max(memory_buffer.tree[-memory_buffer.capacity:]) if memory_buffer.size > 0 else 1.0

    # Add the new experience to the memory buffer with the max priority
    memory_buffer.add(max_priority, exp)

This function stores an experience tuple (state, action, reward, next_state, done) in the memory_buffer. The experience is added with the current maximum priority found among the leaf nodes, which makes new experiences likely to be replayed soon after they are stored. If the buffer is still empty, a default priority of 1.0 is used for the first experience.
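
A typical call site looks like the following sketch, which assumes the env, Experience tuple, and memory_buffer set up in section 6.5; the random action is purely illustrative.

# Sketch: one environment step stored into the prioritized replay buffer
state = env.reset()
action = random.choice([0, 1, 2])                  # illustrative random action
next_state, reward, done, info = env.step(action)
exp = Experience(state, action, reward, next_state, done)
store_experience(memory_buffer, exp)               # stored with the current max priority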

Sampling Experiences

def sample_experiences(memory_buffer, batch_size, beta):
    current_size = memory_buffer.size
    actual_batch_size = min(batch_size, current_size)

    experiences = []
    indexes = []
    priorities = []

    total_priority = memory_buffer.total_priority()

    if total_priority == 0:
        print("Total priority is zero, cannot sample experiences.")
        return None, None, None

    segment = total_priority / actual_batch_size  # Divide total priority into equal segments

    for i in range(actual_batch_size):
        a = segment * i
        b = segment * (i + 1)

        # Sample a value within the segment
        s = np.random.uniform(a, b)
        idx, priority, experience = memory_buffer.get_leaf(s)

        experiences.append(experience)
        indexes.append(idx)
        priorities.append(priority)

    # Extract components from experiences
    states = np.array([e.state for e in experiences], dtype=np.float32)
    actions = np.array([e.action for e in experiences], dtype=np.int32)
    rewards = np.array([e.reward for e in experiences], dtype=np.float32)
    next_states = np.array([e.next_state for e in experiences], dtype=np.float32)
    dones = np.array([e.done for e in experiences], dtype=np.float32)

    # Calculate Importance-Sampling (IS) weights
    sampling_probabilities = np.array(priorities) / total_priority
    is_weights = np.power(current_size * sampling_probabilities, -beta)
    is_weights /= is_weights.max()  # Normalize IS weights

    # Convert to tensors
    states = tf.convert_to_tensor(states, dtype=tf.float32)
    actions = tf.convert_to_tensor(actions, dtype=tf.int32)
    rewards = tf.convert_to_tensor(rewards, dtype=tf.float32)
    next_states = tf.convert_to_tensor(next_states, dtype=tf.float32)
    dones = tf.convert_to_tensor(dones, dtype=tf.float32)
    is_weights = tf.convert_to_tensor(is_weights, dtype=tf.float32)

    return (states, actions, rewards, next_states, dones), indexes, is_weights

This function is responsible for sampling experiences from the memory_buffer with prioritization. It first calculates the total priority and divides it into segments to ensure that experiences are sampled proportionally to their priority. A random value is selected within each segment, and the corresponding experience is retrieved from the SumTree.

The function extracts the state, action, reward, next_state, and done from the sampled experiences, and it computes importance-sampling weights to adjust for any bias introduced by prioritization. These weights ensure that experiences with lower probabilities are given more importance during training. Finally, the function converts all components into tensors to be used for training and returns them along with the indexes and importance-sampling weights.
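
To make the correction concrete, here is a small numeric sketch with made-up priorities (not taken from an actual run):

# Sketch: IS weights for three sampled priorities from a buffer of 1,000 experiences
priorities = np.array([8.0, 2.0, 0.5])      # made-up leaf priorities
total_priority = 100.0
current_size = 1000
beta = 0.4

sampling_probabilities = priorities / total_priority                 # [0.08, 0.02, 0.005]
is_weights = np.power(current_size * sampling_probabilities, -beta)
is_weights /= is_weights.max()                                       # normalize to at most 1.0
print(np.round(is_weights, 3))  # ~[0.33, 0.574, 1.0]: high-priority samples get the smallest weights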

6.7) Action Selection and Model Evaluation Functions

At this stage in our reinforcement learning model, we handle action selection, TD error calculation, priority updating, loss computation, learning, and model evaluation. Each of these components plays a crucial role in enabling the agent to make informed decisions, update its learning parameters, and optimize its trading strategies over time.

Action Selection

def get_action(state, epsilon, model):
    """
    Chooses an action based on the epsilon-greedy strategy, using the provided model.
    """
    if np.random.rand() <= epsilon:
        return random.choice([0, 1, 2])
    else:
        state = np.expand_dims(state, axis=0)  # Shape: (1, lookback_window, n_features)
        q_values = model(state)
        return np.argmax(q_values.numpy()[0])

The get_action function is responsible for selecting an action using an epsilon-greedy strategy. If a random number is less than epsilon, the function explores by choosing a random action (0, 1, or 2). Otherwise, the model is used to exploit by selecting the action with the highest predicted Q-value.
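
For example (a sketch that assumes the env from section 6.5 and the q_network built later in section 6.8):

# Sketch: epsilon-greedy action selection at two stages of training
state = env.reset()
early_action = get_action(state, epsilon=1.0, model=q_network)  # always explores (random action)
late_action = get_action(state, epsilon=0.1, model=q_network)   # exploits the Q-network 90% of the time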

Computing Temporal Difference (TD) Error

def compute_td_error(q_network, target_q_network, experiences, gamma):
    """Computes the TD error for a batch of experiences."""
    states, actions, rewards, next_states, dones = experiences

    # Compute target Q-values
    q_next = target_q_network(next_states)
    max_q_next = tf.reduce_max(q_next, axis=1)
    y_targets = rewards + gamma * max_q_next * (1 - dones)

    # Compute current Q-values
    q_values = q_network(states)
    indices = tf.stack([tf.range(q_values.shape[0]), actions], axis=1)
    q_values_taken = tf.gather_nd(q_values, indices)

    # TD Errors
    td_errors = y_targets - q_values_taken
    return td_errors.numpy()

The compute_td_error function calculates the TD error, which is the difference between the predicted Q-values and the target Q-values. These errors are used to update priorities in the experience replay buffer.
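
As a single-transition sketch with illustrative numbers (gamma = 0.99, as set later in the hyperparameters):

# Sketch: TD error for one transition; the numbers are illustrative only
reward = 1.5
gamma = 0.99
max_q_next = 4.0    # best Q-value the target network assigns to the next state
done = 0.0          # the episode did not terminate at this step

y_target = reward + gamma * max_q_next * (1 - done)   # 1.5 + 0.99 * 4.0 = 5.46
q_taken = 5.0       # current network's estimate for the action actually taken
td_error = y_target - q_taken                          # 0.46, used to update the priority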

Updating Priorities in the Replay Buffer

def update_priorities(memory_buffer, indexes, td_errors, alpha=alpha):
    """
    Updates the priorities of sampled experiences in the replay buffer.
    """
    for idx, td_error in zip(indexes, td_errors):
        priority = (np.abs(td_error) + 1e-5) ** alpha
        memory_buffer.update(idx, priority)

The update_priorities function updates the priorities of experiences in the memory buffer based on the TD errors. The priority is computed using the absolute TD error and raised to the power of alpha for prioritization purposes.
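
Concretely, with alpha = 0.6 a large TD error of 2.0 maps to a priority of roughly 1.52 while a small error of 0.1 maps to roughly 0.25, so the gap between "important" and "unimportant" experiences is compressed rather than taken at face value. A quick sketch:

# Sketch: how alpha compresses the gap between large and small TD errors
alpha = 0.6
for td_error in (2.0, 0.1):
    priority = (np.abs(td_error) + 1e-5) ** alpha
    print(td_error, round(float(priority), 3))   # 2.0 -> ~1.516, 0.1 -> ~0.251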

Computing the Loss

def compute_loss(experiences, gamma, q_network, target_q_network, is_weights):
    # Unpack the mini-batch of experience tuples
    states, actions, rewards, next_states, done_vals = experiences

    # Compute max Q^(s,a)
    max_qsa = tf.reduce_max(target_q_network(next_states), axis=-1)

    # Set y = R if episode terminates, otherwise set y = R + γ max Q^(s,a).
    y_targets = rewards + (gamma * max_qsa * (1 - done_vals))

    # Get the q_values
    q_values = q_network(states)
    q_values = tf.gather_nd(q_values, tf.stack([tf.range(q_values.shape[0]),
                                                tf.cast(actions, tf.int32)], axis=1))

    # Calculate the per-sample squared error so the importance-sampling
    # weights below can be applied element-wise
    loss = tf.square(y_targets - q_values)

    # Apply importance-sampling weights
    weighted_loss = loss * is_weights

    # Return the mean weighted loss
    return tf.reduce_mean(weighted_loss)

The compute_loss function computes the loss by comparing the Q-values predicted by the q_network with the target Q-values. The per-sample squared errors are weighted by the importance-sampling weights to correct for the bias introduced by prioritized replay.

Agent Learning and Updating Networks

@tf.function
def agent_learn(experiences, gamma, is_weights):
    """
    Updates the weights of the Q networks.
    """
    # Calculate the loss
    with tf.GradientTape() as tape:
        loss = compute_loss(experiences, gamma, q_network, target_q_network, is_weights)

    # Get the gradients of the loss with respect to the weights.
    gradients = tape.gradient(loss, q_network.trainable_variables)

    # Update the weights of the q_network.
    optimizer.apply_gradients(zip(gradients, q_network.trainable_variables))

    # Update the weights of target q_network
    update_target_network(q_network, target_q_network)

The agent_learn function performs backpropagation to update the q_network weights using the computed gradients. After each learning step it also applies a soft update to the target_q_network, keeping the training targets stable.

Updating the Target Network

def update_target_network(q_network, target_q_network):
    for target_weights, q_net_weights in zip(target_q_network.weights, q_network.weights):
        target_weights.assign(TAU * q_net_weights + (1.0 - TAU) * target_weights)

The update_target_network function updates the target network by slowly adjusting its weights toward the Q-network’s weights, using a parameter TAU. This soft update stabilizes learning by making gradual updates.
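
A one-weight sketch of the soft update with TAU = 0.01 (the value set in the hyperparameters below):

# Sketch: a single scalar weight under the soft (Polyak) update
TAU = 1e-2
target_w, q_w = 0.50, 1.00
target_w = TAU * q_w + (1.0 - TAU) * target_w   # 0.505: the target weight drifts slowly toward 1.00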

Checking Update Conditions

def check_update_conditions(t, num_steps_upd, memory_buffer):
    if (t + 1) % num_steps_upd == 0 and memory_buffer.size >= batch_size:
        return True
    else:
        return False

The check_update_conditions function checks whether the conditions for updating the Q-network are met. It ensures that updates only occur after a certain number of steps and when the replay buffer has enough experiences.

Evaluating the Model

def evaluate_model(env, model):
    """
    Evaluates the model on the entire environment and returns the total reward and final portfolio value.
    """
    state = env.reset_eval()
    total_reward_val = 0
    done = False
    final_portfolio_value = env.initial_balance  # Initialize with the initial balance

    while not done:
        action = get_action(state, epsilon=0, model=model)
        next_state, reward, done, info = env.step(action)
        total_reward_val += reward
        final_portfolio_value = info['portfolio_value']
        state = next_state

    return total_reward_val, final_portfolio_value

The evaluate_model function evaluates the model by running it through the entire environment without exploration (i.e., always using the best action). It returns the total reward and the final portfolio value. This is typically used to assess how well the model performs on unseen data.

Each of these functions plays a crucial role in reinforcing the learning loop, updating the model's weights, prioritizing important experiences, and evaluating the performance of the reinforcement learning agent.

6.8) Network Architecture and Training Hyperparameters

At this point, we are defining the neural network architecture that the agent will use to predict Q-values for actions. This deep Q-network (DQN) is responsible for approximating the Q-values in reinforcement learning, helping the agent make decisions. The architecture is built using convolutional layers and LSTM layers, making it suitable for handling time-series data in stock trading.

# Access the parameters from the environment
lookback_window = env.lookback_window
n_features = env.n_features
num_actions = env.action_space.n

# Set the random seed for TensorFlow
SEED = 0 
tf.random.set_seed(SEED)

# Define regularization and dropout parameters
from tensorflow.keras.regularizers import l2  # l2 regularizer used below
regularization_term = l2(0.003)
dropout_rate = 0.1

def create_model():
    model = Sequential([
        Input(shape=(lookback_window, n_features)),
        Conv1D(filters=32, kernel_size=3, activation='relu', padding='same'),
        BatchNormalization(),
        Dropout(dropout_rate),
        Bidirectional(LSTM(32, return_sequences=True, kernel_regularizer=regularization_term)),
        BatchNormalization(),
        Dropout(dropout_rate),
        Bidirectional(LSTM(32, return_sequences=True, kernel_regularizer=regularization_term)),
        BatchNormalization(),
        Dropout(dropout_rate),
        Bidirectional(LSTM(16, return_sequences=False, kernel_regularizer=regularization_term)),
        BatchNormalization(),
        Dropout(dropout_rate),
        Dense(64, activation='relu', kernel_regularizer=regularization_term),
        BatchNormalization(),
        Dropout(dropout_rate),
        Dense(32, activation='relu', kernel_regularizer=regularization_term),
        Dropout(dropout_rate),
        Dense(num_actions, activation='linear')
    ])
    return model

# Create the Q-network and target Q-network
q_network = create_model()
target_q_network = create_model()
target_q_network.set_weights(q_network.get_weights())

# Optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001, clipnorm=1.0)

This code defines the architecture of the Q-network and target Q-network. The network uses a combination of convolutional and LSTM layers, which are suitable for processing sequential data like stock prices. Convolutional layers help extract local patterns, while LSTM layers capture the long-term dependencies in the time-series data.

BatchNormalization is used after most layers to normalize the activations and stabilize training, while Dropout and the L2 regularization (controlled by regularization_term) reduce overfitting by randomly disabling units and penalizing large weights, respectively. The network ends with fully connected layers, and the final layer uses a linear activation to output Q-values for all possible actions.

Two networks are created: the q_network is used to make predictions, and the target_q_network is updated periodically to provide stable targets during training. The optimizer chosen is Adam, with a learning rate of 0.0001 and gradient clipping to improve training stability.
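
To sanity-check the architecture, the following sketch prints the layer summary and queries Q-values for a single dummy window; it assumes the lookback_window and n_features read from the environment above.

# Sketch: inspect the network and query Q-values for a dummy input window
q_network.summary()

dummy_batch = np.zeros((1, lookback_window, n_features), dtype=np.float32)
q_values = q_network(dummy_batch)
print(q_values.shape)   # (1, num_actions): one Q-value per action (buy, hold, sell)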

6.9) Updated Hyperparameters

In this section, we define the key hyperparameters used to control the reinforcement learning process. These hyperparameters influence various aspects of how the agent explores the environment, learns from experiences, and improves over time. Adjusting these values is crucial for optimizing the agent's performance and ensuring that it learns effectively.

# Updated Hyperparameters
batch_size = 256
gamma = 0.99
epsilon = 1.0
epsilon_decay = 0.995
epsilon_min = 0.1
NUM_STEPS_FOR_UPDATE = 4
TAU = 1e-2
acceptable_drawdown = 0.15
render_interval = 100
evaluation_interval = 10

# Initialize dynamic early stopping parameters
best_val_reward = -np.inf
episodes_without_improvement = 0
dynamic_patience = int(0.2 * num_episodes)

Here, we update several key hyperparameters:

Batch Size: how many experiences are sampled and used in each training step.

Gamma (γ): the discount factor, which controls how strongly future rewards are weighted; a value of 0.99 means the agent is highly future-focused.

Epsilon (ε): the exploration rate in the epsilon-greedy strategy. The agent starts by exploring (ε = 1.0) and gradually shifts toward exploitation as ε decays over time.

Epsilon Decay: how quickly the exploration rate decreases; as training progresses, the agent explores less and relies more on learned experience (see the quick check after this list).

TAU: the soft update parameter for the target network, controlling how quickly the target Q-network's weights track the Q-network.

Acceptable Drawdown: the maximum allowable percentage drop in portfolio value before the episode terminates early, used to manage risk.

Render Interval: how often the environment is rendered during training for monitoring purposes.

Evaluation Interval: how frequently the agent is evaluated on validation data during training.

Additionally, the dynamic early stopping parameters best_val_reward, episodes_without_improvement, and dynamic_patience allow training to stop early if the model's performance stagnates, which improves efficiency and helps avoid overfitting.
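
One useful consequence of these values: with a per-episode decay of 0.995, epsilon does not reach its floor of 0.1 until roughly episode 460, so the agent keeps exploring for most of the 550-episode run. A quick check (a sketch using the values defined above):

# Quick check: episodes needed for epsilon to decay from 1.0 down to epsilon_min
episodes_to_floor = np.log(epsilon_min) / np.log(epsilon_decay)
print(int(np.ceil(episodes_to_floor)))   # ~460 of the 550 episodes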

6.10) Training Loop

The training loop is the heart of the reinforcement learning process, where the agent learns from interacting with the environment over multiple episodes. In each episode, the agent takes actions, collects rewards, stores experiences, and updates its Q-network to improve future decision-making. This process is critical for building an effective trading agent that can generalize and adapt to various market conditions. Here’s how it all fits together.

for episode in range(num_episodes):

    try:
        # Reset the environment and get the initial sequence of timesteps (the state)
        state = env.reset()
    except ValueError as e:
        print(f"Episode {episode + 1}: {e}")
        print("Terminating training due to insufficient data to form any sequences.")
        break  # Exit the training loop

    total_reward = 0
    
    portfolio_value = env.initial_balance
    highest_portfolio_value = portfolio_value
    max_drawdown = 0  # Reset max_drawdown for each episode

    max_num_timesteps = env.num_timesteps

    # Validate that max_num_timesteps is positive
    if max_num_timesteps <= 0:
        print(f"Episode {episode + 1}: max_num_timesteps = {max_num_timesteps}. Skipping episode.")
        continue  # Skip to the next episode

    for t in range(max_num_timesteps):

        # Get the action using epsilon-greedy strategy
        action = get_action(state, epsilon, q_network)

        # Step the environment forward by one timestep and get the next sequence of timesteps
        next_state, reward, done, info = env.step(action)

        # Update portfolio value using the info dictionary
        portfolio_value = info.get('portfolio_value', env.initial_balance)

        # Update highest portfolio value and calculate drawdown
        highest_portfolio_value = max(highest_portfolio_value, portfolio_value)
        drawdown = (highest_portfolio_value - portfolio_value) / highest_portfolio_value
        max_drawdown = max(max_drawdown, drawdown)  # Update max drawdown for this episode

        # Store experience in replay buffer
        exp = Experience(state, action, reward, next_state, done)
        store_experience(memory_buffer, exp)

        # If max drawdown exceeds acceptable level, terminate the episode early
        if max_drawdown > acceptable_drawdown:
            print(f"Episode {episode + 1} terminated early due to excessive max drawdown ({max_drawdown:.2%}).")
            break

        # If the environment signals done, terminate the episode
        if done:
            print(f"Episode {episode + 1} terminated by environment signal.")
            break

        # Update the Q-network every NUM_STEPS_FOR_UPDATE steps
        if check_update_conditions(t, NUM_STEPS_FOR_UPDATE, memory_buffer):
            experiences, indexes, is_weights = sample_experiences(memory_buffer, batch_size, beta)
            if experiences is not None:
                # Compute TD errors
                td_errors = compute_td_error(q_network, target_q_network, experiences, gamma)
                # Update priorities in the replay buffer
                update_priorities(memory_buffer, indexes, td_errors)
                # Train the agent
                agent_learn(experiences, gamma, is_weights)

        # Move to the next state and accumulate rewards
        state = next_state
        total_reward += reward

    # Decay epsilon (exploration rate)
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay
        epsilon = max(epsilon, epsilon_min)  # Ensure epsilon does not go below epsilon_min

    # Increment beta towards beta_end
    beta = min(beta + beta_increment_per_episode, beta_end)

    # Evaluate model on validation data every `evaluation_interval` episodes
    if (episode + 1) % evaluation_interval == 0:
        val_reward, final_portfolio_value = evaluate_model(val_env, q_network)
        print(f"Episode {episode + 1}: Validation Reward = {val_reward}, Training Reward = {total_reward}")

        # Calculate percentage return on validation data
        percentage_return = ((final_portfolio_value - val_env.initial_balance) / val_env.initial_balance) * 100
        print(f"Episode {episode + 1}: Validation Percentage Return = {percentage_return:.2f}%")

        # Update validation rewards list and early stopping criteria
        if val_reward > best_val_reward:
            best_val_reward = val_reward
            episodes_without_improvement = 0  # Reset counter
            
        else:
            episodes_without_improvement += 1  # Increment counter if no improvement

        # Check for dynamic early stopping
        if episodes_without_improvement >= dynamic_patience:
            print(f"Early stopping triggered after {episode + 1} episodes with no improvement.")
            break

    # Render the environment every `render_interval` episodes
    if (episode + 1) % render_interval == 0:
        env.render()

# -------------------------
# Save the model after training completes or stops
# -------------------------
import os  # needed below to create the output directory

print("Training completed or stopped. Saving final model...")
os.makedirs('models', exist_ok=True)
q_network.save('models/q_network_final_model.keras')
print("Final model saved successfully.")

In this training loop, each episode starts by resetting the environment, after which the agent interacts with it step by step. The agent selects actions with the epsilon-greedy strategy, balancing exploration and exploitation, and stores its experiences in the replay buffer. If the maximum allowable drawdown is exceeded, the episode terminates early to prevent large losses. The Q-network is updated periodically from a batch of sampled experiences, and the agent continuously improves by minimizing the TD error. Throughout training, epsilon (the exploration rate) decays while beta (the importance-sampling exponent) increases toward 1.0. The model is evaluated at regular intervals, and early stopping is triggered if no improvement is observed for a set number of evaluations. After training concludes, the final model is saved for future use.

7) Model Testing

7.1) Testing the Saved Model on Test Data

After training the Q-network, the next logical step is to test its performance on unseen test data. This section outlines how the model is loaded, evaluated, and its performance visualized. We preprocess the test dataset similarly to the training and validation data, ensuring no overlap and proper scaling. Then, we load the saved model and run it through the test environment to evaluate how well it generalizes to new data.

# Load and preprocess the test data
test_data = data[data.index > val_end_date].copy()

# Ensure there's no overlap with validation data
assert val_data['Date'].max() < test_data.index.min(), "Validation and test data overlap!"

# Apply the same scaling to the test data
test_data[features_to_normalize] = scaler.transform(test_data[features_to_normalize])

# Reset index but keep the date as a column
test_data.reset_index(inplace=True)
test_data.rename(columns={'index': 'Date'}, inplace=True)

# Ensure 'Date' column is datetime
test_data['Date'] = pd.to_datetime(test_data['Date'])

# Now ensure there's no overlap with validation data using 'Date' columns
assert val_data['Date'].max() < test_data['Date'].min(), "Validation and test data overlap!"

# Instantiate the test environment
test_env = TradingEnv(test_data)

# Load the best saved model
trained_q_network = tf.keras.models.load_model('models/q_network_final_model.keras')

# Evaluate the model on the test environment using the existing evaluate_model function
test_reward, test_final_portfolio_value = evaluate_model(test_env, trained_q_network)

# Calculate percentage return on test data
test_percentage_return = ((test_final_portfolio_value - test_env.initial_balance) / test_env.initial_balance) * 100

print(f"Test Reward = {test_reward}")
print(f"Test Final Portfolio Value = {test_final_portfolio_value}")
print(f"Test Percentage Return = {test_percentage_return:.2f}%")

# Visualize the agent's performance on the test data
test_env.render()

In this section, the test dataset is prepared similarly to the validation and training datasets, ensuring consistency in scaling. The trained model is then loaded using TensorFlow's load_model function. The agent is evaluated on the test environment using the evaluate_model function, which simulates trading on the test dataset and provides performance metrics such as the total reward, final portfolio value, and percentage return. Finally, the agent's performance is visualized using the render function, providing insights into its actions and portfolio evolution during testing.

(Plots: test portfolio value, test reward, and test actions over the test period.)

7.2) Sharpe Ratio and Maximum Drawdown

The final step in evaluating the agent's performance involves calculating additional performance metrics that provide more insights into the quality of the model's predictions and its robustness. This section introduces the calculation of key financial metrics like the Sharpe Ratio and Maximum Drawdown. These metrics are useful for evaluating the risk-adjusted return and the worst loss encountered during the trading period, respectively.

# Calculate additional performance metrics
def calculate_performance_metrics(env):
    """
    Calculate performance metrics such as Sharpe ratio and maximum drawdown.
    """
    portfolio_values = [entry['portfolio_value'] for entry in env.trading_history]
    returns = np.diff(portfolio_values)
    
    # Sharpe Ratio
    sharpe_ratio = np.mean(returns) / (np.std(returns) + 1e-9)
    
    # Maximum Drawdown
    peak = portfolio_values[0]
    max_drawdown = 0
    for value in portfolio_values:
        if value > peak:
            peak = value
        drawdown = (peak - value) / peak
        if drawdown > max_drawdown:
            max_drawdown = drawdown
    
    return sharpe_ratio, max_drawdown

# Calculate and print performance metrics
sharpe_ratio, max_drawdown = calculate_performance_metrics(test_env)
print(f"Test Sharpe Ratio = {sharpe_ratio:.4f}")
print(f"Test Maximum Drawdown = {max_drawdown:.2%}")

In this part of the project, we calculate the Sharpe Ratio, which measures average return per unit of volatility (here computed on raw day-to-day portfolio value differences with an implicit zero risk-free rate), and the Maximum Drawdown, which measures the largest drop from peak to trough during the evaluation period. These metrics show how well the agent performed on a risk-adjusted basis and how severe its worst losses were, further validating the performance of the trained reinforcement learning model in a trading environment.
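
For comparison, a more conventional risk-adjusted figure would use percentage returns and annualize the result. The sketch below assumes that env.trading_history holds one entry per trading day (roughly 252 per year) and keeps the zero risk-free rate assumption:

# Sketch: annualized Sharpe ratio on daily percentage returns (zero risk-free rate)
portfolio_values = np.array([entry['portfolio_value'] for entry in test_env.trading_history])
pct_returns = np.diff(portfolio_values) / portfolio_values[:-1]
annualized_sharpe = np.sqrt(252) * np.mean(pct_returns) / (np.std(pct_returns) + 1e-9)
print(f"Annualized Sharpe Ratio = {annualized_sharpe:.4f}")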

---------------------------------------------------
|               Performance Results               |
---------------------------------------------------
| Test Reward                |    788,750.77      |
| Test Final Portfolio Value |  1,034,368.61      |
| Test Percentage Return     |     3.44%          |
| Test Sharpe Ratio          |     0.1159         |
| Test Maximum Drawdown      |     0.67%          |
---------------------------------------------------

8) Conclusion and Future Improvements

In this project, we built an algorithmic trading model using reinforcement learning with Prioritized Experience Replay, convolutional neural networks, and bidirectional LSTMs. We trained and tested the model on stock market data, aiming to maximize portfolio returns while minimizing risks such as drawdown. The model achieved a modest percentage return of 3.44% with a Sharpe Ratio of 0.1159, which suggests that the current performance is not yet sufficient for real-world applications. The Maximum Drawdown of 0.67% shows good risk control, but there is significant room for improvement in profitability and in the overall robustness of the trading strategy.

Key Challenges and Areas for Improvement:

Computational Limitations: One of the main challenges we faced was limited compute. Training the model for a larger number of episodes would give the agent more opportunities to learn long-term patterns in the stock market, but hardware constraints forced us to cap the number of episodes, which likely affected the model's ability to converge to a truly optimal policy.

Reward System: The reward system in the current implementation could be further refined. The immediate rewards are based on price differences, but more sophisticated reward structures could be explored, such as using profit targets, risk-adjusted returns, or penalizing excessive drawdowns. Additionally, incorporating slippage and market impact more comprehensively would make the model more realistic and effective.

Longer Training Duration: Increasing the number of training episodes or using more advanced techniques like distributed training could help the model explore a larger portion of the state-action space. This would allow the agent to learn more complex patterns and potentially improve both profitability and risk management.

Advanced Models and Techniques: The model could benefit from exploring more advanced architectures, such as transformer-based models, which have recently shown success in time-series analysis and finance. Techniques like Proximal Policy Optimization (PPO) or A3C could also be explored to enhance the reinforcement learning framework.

Hyperparameter Tuning: Fine-tuning hyperparameters like batch size, learning rate, and exploration-exploitation strategies (epsilon decay, gamma, etc.) could further improve the model's performance. Automated methods like Bayesian optimization or genetic algorithms might provide better results than manual tuning.

In conclusion, this project provided a strong foundation for building a trading agent using reinforcement learning, but further improvements in training, reward design, and model architecture are essential for achieving a performance level suitable for real-world trading scenarios. By addressing the mentioned challenges and exploring advanced methods, the model can be fine-tuned to perform better in complex, dynamic financial environments.
