Backend

This public repo houses most of the code from my horse race predictions app. Missing are the complete scripts for feature engineering and model training, as well as some private external-service API calls that I do not want to publicize. There is currently no live demo, as I have not open-sourced my data or my model. The backend and frontend code is still here to review, and a high-level summary of the model is provided in this readme.

Backend

The backend of this project is a Flask web server. The data comes from the Daily Racing Form (DRF), without which it is impossible to run a useful local version (TODO: add sample data files and a sample prediction model so that the backend can run as a demo).

Domain Issues

The domain of horseracing data is fraught with issues. The data is sourced from companies like the aforementioned DRF, all of which have very poor tooling for software bettors; their primary business is tailored to readers and button-clickers. This leads to issues like trying to match a horse that appears in a previous race as Dan's Legacy with the same horse entered in the current race as dans legacy. Simple things like this must be addressed by any app that wants to avoid garbage in, garbage out. To that end, I use a robust ETL layer and store the data in my own Postgres database, which I can then query for a machine learning model or an API and trust the output.
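The name-matching problem above can be handled by comparing horses on a canonical key rather than on the raw name. The following is a minimal sketch of that idea, not the ETL code from this repo; `normalize_horse_name` is a hypothetical helper, and the exact cleanup rules (dropping apostrophes and periods, collapsing whitespace) are assumptions.

```python
import re
import unicodedata


def normalize_horse_name(name: str) -> str:
    """Return a canonical matching key for a horse name (hypothetical helper)."""
    # Decompose accented characters, then drop anything that is not plain ASCII
    # (combining accents, curly apostrophes, and similar).
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Remove apostrophes and periods outright so "Dan's" becomes "dans", not "dan s".
    no_marks = re.sub(r"['.]", "", ascii_only.lower())
    # Any other run of non-alphanumeric characters becomes a single space.
    spaced = re.sub(r"[^a-z0-9]+", " ", no_marks)
    return spaced.strip()


past_name = "Dan's Legacy"
current_name = "dans legacy"
same_horse = normalize_horse_name(past_name) == normalize_horse_name(current_name)
print(normalize_horse_name(past_name), same_horse)
```

A key like this is only a first pass: two different horses can share a name across years or countries, so in practice the key would likely be combined with other fields before treating two records as the same horse.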

There is another interesting issue with horseracing: it has a temporal element. The live odds at race time are an input to my model, and to many horseracing prediction models, but the true live odds are only available moments before the race begins. A horse may be entered into a race, then drop out as a "scratch" before the race begins. Or a race may be scheduled for a certain surface, then be moved due to rain. These variables are up in the air until the last minute and are not adequately covered by the available data sources. This means the app must support user input, so that the inputs to the model can be updated quickly and ergonomically on the fly.

ETL

Pre-race info comes from DRF. I make use of an undocumented API to export files and upload them to S3. From S3, the files are downloaded and sent to processing via this script, which ultimately parses the files and stores the records in the DB.

On any given day, once the data has been processed into the DB, the app can render it to inform the user about the day's upcoming races.

API

Diagram of the db models:

This app uses Flask and SQLAlchemy to serve a JSON API from the database. The routes are listed in main.py.

Model Summary

Feature Glossary

Feature Name	Feature Description	Technical Details
Race Id	Identifier for a specific race
Result Id	Identifier for an entrant in a race (that completed the race)
Field Size	Count of entries in a race
Track Tier	My own subjective ranking of track quality	One hot encoded as ['High Level', 'Decent Level', 'Other']
Race Surface	What surface a race will be run over	One hot encoded as ['Dirt', 'Turf', 'Other']
Quarter	Calendar quarter of the year	One hot encoded as 1-4
Sprint	Is the race shorter than 7f?
Route	Is the race 9f or longer?
Horse Sex	The sex of an individual entrant	One hot encoded as ['Filly', 'Mare', 'Colt', 'Horse', 'Other'] (yes, 'horse' is an official sex designation in horseracing, for a male horse older than 3 that has not been gelded)
Horse Age	Horse age, in years - horses are aged based solely on the year in which they were born	One hot encoded as ['2', '3', 'Other']
Lifetime EPS	Lifetime earnings per previous start	The natural log (LN) of earnings is taken, then the EPS is normalized within the particular race. Across the population of all races, this normalized variable has a standard normal distribution, with a mean of 0 and a standard deviation of 1
Turf EPS	Lifetime earnings per previous start, only in turf races	Same processing as Lifetime EPS
Past Performance Count	How many races has this horse previously run?
Average Performance	Performance is a proprietary combination of elements in a given previous race	Performance is calculated for all previous races, and the weighted average is taken. Days since the previous race is the weights vector. This value is normalized within the particular race.
Average Performance at Similar	Performance, but only from races at a similar distance, on the same surface
Average Speed	Weighted average of speeds from previous races	Days since the race is the weights vector, and values are normalized within the race
Average Speed at Similar	Speed, but only from similar previous races
Average First-Call Position	Weighted average of the position early in previous races	Weighted and normalized in the same way as the other variables
Average First-Call Position at Similar	Average early position, but only from similar previous races
DSLR	Days since last race
Jockey Win %	Winning percentage of the horse's jockey	YTD
Trainer Win %	Winning percentage of the horse's trainer	YTD
Count120	Number of races in which the horse ran in the past 120 days
Implied Probability	The probability that the final odds of this horse imply	1/odds - 1 * adj - where odds is the decimal odds, and adj is an adjustment for track takeout in the win pool
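
Several features in the glossary are described as normalized within the particular race. The real feature engineering scripts are not part of this repo, so the following is only a sketch of one plausible reading: a plain z-score computed over the entrants of each race. The function name, the `log_eps` field, and the sample values are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_within_race(rows, feature):
    """Z-score `feature` across the entrants of each race.

    `rows` is a list of dicts that each carry a 'race_id' key.
    Returns new dicts; the input rows are left untouched.
    """
    by_race = defaultdict(list)
    for row in rows:
        by_race[row["race_id"]].append(row[feature])

    normalized = []
    for row in rows:
        values = by_race[row["race_id"]]
        mu, sigma = mean(values), pstdev(values)
        # If every entrant has the same value, the feature carries no signal.
        z = 0.0 if sigma == 0 else (row[feature] - mu) / sigma
        normalized.append({**row, feature: z})
    return normalized


entrants = [
    {"race_id": 1, "log_eps": 9.0},
    {"race_id": 1, "log_eps": 10.0},
    {"race_id": 1, "log_eps": 11.0},
    {"race_id": 2, "log_eps": 8.0},
    {"race_id": 2, "log_eps": 8.0},
]
result = normalize_within_race(entrants, "log_eps")
for row in result:
    print(row)
```

Normalizing within the race means a feature describes how a horse compares with the rivals it actually faces, rather than with the whole population of horses.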

Simplified Code

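import pandas as pd
import tensorflow as tf
from sklearn import model_selection

# load raw data - preprocess as much as possible in the DB
# (`engine` is assumed to be an existing SQLAlchemy engine for the Postgres DB)
with open('query.sql') as f:
    statement = f.read()
df = pd.read_sql(statement, engine)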

# Prep the dependent variable "winner" to be a 0-indexed representation of the post position of the winning horse
race_winners = df[df.final_position == 1][['race_id',
                                           'post_position',
                                           'track_tier_TIER1',
                                           'track_tier_TIER2',
                                           'race_surface_d',
                                           'race_surface_t',
                                           'quarter_1.0',
                                           'quarter_2.0',
                                           'quarter_3.0',
                                           'quarter_4.0',
                                           'field_size',
                                           'sprint',
                                           'route'
                                           ]]
race_winners = race_winners.rename(
    {'post_position': 'winner'}, axis='columns')
race_winners['winner'] = race_winners['winner'] - 1

# Select horse-specific attributes to pull into the model
model_data = df[[
    'race_id',
    'post_position',
    'horse_sex_f',
    'horse_sex_m',
    'horse_sex_c',
    'horse_sex_h',
    'age_2.0',
    'age_3.0',
    'lifetime_eps',
    'turf_eps',
    'pp_count',
    'total_perf',
    'total_beyer',
    'total_first_call',
    'similar_perf',
    'similar_beyer',
    'similar_first_pos',
    'dslr',
    'jock',
    'trainer',
    'count120',
    'implied_proba'
]]

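# Pivot to the race level, and rejoin the race_winners df - already at the race level
model_data = model_data.pivot(
    index='race_id', columns='post_position', values=list(model_data.columns[2:]))

# Flatten the (feature, post_position) column MultiIndex so it can be joined
# with the single-level race_winners columns
model_data.columns = [f'{feature}_{post}' for feature, post in model_data.columns]

model_data = race_winners.join(
    model_data, on='race_id', how='inner')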

# Prepare to train the model 
X = model_data[model_data.columns[2:]]
y = model_data['winner']

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, train_size=0.93, test_size=0.07, random_state=1)

Neural Network Architecture

Previous implementations of this model have taught me to be very careful of overfitting the training data. After some testing, I settled on 3 hidden layers, with regularization on each, and dropout applied between them. The final layer outputs a vector of 20 values, which softmax turns into a vector of win probabilities for the horse at each post position. 20 is the maximum number of horses we can allow in a race with this model.

Using SparseCategoricalCrossentropy allows us to compare the output of softmax to the true winner value. Let's say that in a 5-horse race the 2nd horse wins, and the model gave that horse a 50% chance of winning. The true winner label would be 1 (0-indexed) and the predicted vector would be something like [0.1, 0.5, 0.1, 0.2, 0.1].
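
To make the 5-horse example concrete, here is a small sketch of the per-race quantity that sparse categorical cross-entropy computes: the negative log of the probability assigned to the true winner. It ignores the numerical clipping and batch averaging that Keras applies, and the function name is just for illustration.

```python
import math


def sparse_categorical_crossentropy(true_index, predicted_probs):
    """Loss for one race: negative log of the probability given to the true winner."""
    return -math.log(predicted_probs[true_index])


predicted = [0.1, 0.5, 0.1, 0.2, 0.1]  # softmax output for a 5-horse race
winner = 1                              # 0-indexed: the 2nd horse won

loss = sparse_categorical_crossentropy(winner, predicted)
print(round(loss, 4))  # 0.6931

# For comparison, a model that spreads its belief evenly over the 5 horses:
uniform_loss = sparse_categorical_crossentropy(winner, [0.2] * 5)
print(round(uniform_loss, 4))  # 1.6094
```

The more probability the model puts on the horse that actually won, the closer the loss gets to 0, which is why the label only needs to be the winner's index rather than a full one-hot vector.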

We train the model to minimize this loss function for a maximum of 500 epochs, or until 30 consecutive epochs provide no improvement on a validation set.

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(120, activation='tanh',
                          activity_regularizer=tf.keras.regularizers.L2(0.001)),
    tf.keras.layers.Dropout(0.20),
    tf.keras.layers.Dense(
        80, activity_regularizer=tf.keras.regularizers.L2(0.001)),
    tf.keras.layers.Dropout(0.20),
    tf.keras.layers.Dense(
        20, activity_regularizer=tf.keras.regularizers.L2(0.001)),
    tf.keras.layers.Softmax()
])

model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              optimizer='adam', metrics=['accuracy'])

model.fit(x=X_train, y=y_train, validation_split=0.12, epochs=500, callbacks=[
          tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=30)])

Results:

model.evaluate(X_train, y_train)
3524/3524 [==============================] - 4s 1ms/step - loss: 1.9756 - accuracy: 0.3527
[1.9755834341049194, 0.3526991307735443]


model.evaluate(X_test, y_test)
266/266 [==============================] - 0s 1ms/step - loss: 2.0323 - accuracy: 0.3366
[2.0323143005371094, 0.33659282326698303]


# example strategies using predictions on the test set
test_data[(test_data.pred_proba > 1.15 * test_data.implied_proba) & (test_data.pred_proba > 0.2)].groupby('win')['win_pay'].agg('sum')
win
False   -305000.0
True     312920.0

test_data[(test_data.pred_proba > 1.15 * test_data.implied_proba) & (test_data.pred_proba > 0.25)].groupby('win')['win_pay'].agg('sum')
win
False   -173200.0
True     200530.0
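
The filter in the two strategies above reads as a simple rule: bet only when the model's win probability beats the odds-implied probability by a margin, and only on horses the model rates above some floor. A minimal sketch of that rule as a standalone function follows; `should_bet`, `edge`, and `min_proba` are names introduced here for illustration, with defaults taken from the first strategy.

```python
def should_bet(pred_proba, implied_proba, edge=1.15, min_proba=0.2):
    """Return True when a horse passes both filters from the example strategies."""
    beats_market = pred_proba > edge * implied_proba
    confident_enough = pred_proba > min_proba
    return beats_market and confident_enough


# The model is well above the market and above the floor: bet.
clear_value = should_bet(pred_proba=0.30, implied_proba=0.25)
# The model is above the market, but not by the required 15% margin: pass.
thin_margin = should_bet(pred_proba=0.30, implied_proba=0.28)
# The model beats the market, but rates the horse below the 0.2 floor: pass.
longshot = should_bet(pred_proba=0.15, implied_proba=0.10)
# The second strategy raises the floor to 0.25.
stricter = should_bet(pred_proba=0.22, implied_proba=0.15, min_proba=0.25)

print(clear_value, thin_margin, longshot, stricter)
```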

Summary

So on the test set, these strategies produce positive returns. The caveat is that "Implied Probability" is a post-hoc variable, only available to us after a race. We can approximate it with live odds close to race time, but odds can change pretty quickly, and the slight edge of ~5% we might have erodes in realistic scenarios.
