Feature Glossary

Feature Name	Feature Description	Technical Details
Race Id	Identifier for a specific race
Result Id	Identifier for an entrant in a race (that completed the race)
Field Size	Count of entries in a race
Track Tier	My own subjective ranking of track quality	One hot encoded as ['High Level', 'Decent Level', 'Other']
Race Surface	What surface a race will be run ovr	One hot encoded as ['Dirt', 'Turf', 'Other']
Quarter	Calendar quarter of the year	One hot encoded as 1-4
Sprint	Is the race shorter than 7f?
Route	Is the race as long as 9f or longer?
Horse Sex	The sex of an individual entrant	One hot encoded as ['Filly', 'Mare', 'Colt', 'Horse', 'Other'] (yes, 'horse' is an official sex designation in horseacing, for a male horse older than 3 that has not been gelded)
Horse Age	Horse age, in years - horses are aged based soley on the year in which they were born	One hot encoded as ['2', '3', 'Other']
Lifetime EPS	Lifetime earnings per previous start	LN() is taken on earnings, then the EPS is normalized to the particular race. Across the population of all races, this normalized variable has a standard normal dist. with a mean of 0 and StDev of 1
Turf EPS	Lifetime earnings per previous start, only in turf races	Same procesing as Lifetime EPS
Past Performance Count	How many races has this horse previously run?
Average Performance	Performance is a proprietary combination of elements in a given previous race	Performance is calculated for all previous races, and the weighted average is taken. Days since the previous race is the weights vector. This value is normalized within the particular race.
Average Performance at Similar	Performrance, but only from races at a similar distance, on the same surface
Average Speed	Weighted average of speeds from previous races	Days since the race is the weights vector, and values are normalized within the race
Avereage Speed at Similar	Speed, but only from similar previous races
Average First-Call Position	Weighted average of the position early in previous races	Similarly wieghted and normalized as the othr variables
Average First-Call Position at Similar	Average early position, but only from similar previous races
DSLR	Days since last race
Jockey Win %	Winning percentage of the horse's jockey	YTD
Trainer Win %	Winning percentage of the horse's trainer	YTD
Count120	Number of races in which the horse ran in the past 120 days
Implied Probability	The probability that the final odds of this horse imply	1/odds - 1 * adj - where odds is the decimal odds, and adj is an adjustment for track takeout in the win pool

Simplified Code

# load raw data - preprocess as much as possible in the DB
statement = open('query.sql').read()
df = pd.read_sql(statement, engine)

# Prep the dependent variable "winner" to be a 0-indexed representation of the post position of the winning horse
race_winners = df[df.final_position == 1][['race_id',
                                           'post_position',
                                           'track_tier_TIER1',
                                           'track_tier_TIER2',
                                           'race_surface_d',
                                           'race_surface_t',
                                           'quarter_1.0',
                                           'quarter_2.0',
                                           'quarter_3.0',
                                           'quarter_4.0',
                                           'field_size',
                                           'sprint',
                                           'route'
                                           ]]
race_winners = race_winners.rename(
    {'post_position': 'winner'}, axis='columns')
race_winners['winner'] = race_winners['winner'] - 1

# Select horse-specific attributes to pull into the model
model_data = df[[
    'race_id',
    'post_position',
    'horse_sex_f',
    'horse_sex_m',
    'horse_sex_c',
    'horse_sex_h',
    'age_2.0',
    'age_3.0',
    'lifetime_eps',
    'turf_eps',
    'pp_count',
    'total_perf',
    'total_beyer',
    'total_first_call',
    'similar_perf',
    'similar_beyer',
    'similar_first_pos',
    'dslr',
    'jock',
    'trainer',
    'count120',
    'implied_proba'
]]

# Pivot to the race level, and rejoin the race_winners df - already at the race level
model_data = model_data.pivot(
    index='race_id', columns='post_position', values=model_data.columns[2:])

model_data = race_winners.join(
    model_data, on='race_id', how='inner')

# Prepare to train the model 
X = model_data[model_data.columns[2:]]
y = model_data['winner']

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, train_size=0.93, test_size=0.07, random_state=1)

Neural Network Architecture

Previous implementations of this model have taught me to be very careful of overfitting the training data. After some testing, I settled on 3 hidden layers, with regularization on each, and dropout applied between them. The final layer ouputs a vector of 20 values which gets processed by softmax into a vector of win probabilites for the horse at each post position. 20 is the maximum number of horses we can allow in a race with this model.

Using the SparseCategoricalCrossentropy allows us to compare the output of softmax to the true winner value. Let's say in a 5-horse race, the 2nd horse wins, and the model predicted this with 50% accuracy. The true winner vector would be 1 (0-indexed) and the predicted vector would be something like [0.1, 0.5, 0.1, 0.2, 0.1].

We train the model to minimize loss of this fuction in a maximum of 500 epochs, or until 30 epochs provives no improvement on a validation set.

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(120, activation='tanh',
                          activity_regularizer=tf.keras.regularizers.L2(0.001)),
    tf.keras.layers.Dropout(0.20),
    tf.keras.layers.Dense(
        80, activity_regularizer=tf.keras.regularizers.L2(0.001)),
    tf.keras.layers.Dropout(0.20),
    tf.keras.layers.Dense(
        20, activity_regularizer=tf.keras.regularizers.L2(0.001)),
    tf.keras.layers.Softmax()
])

model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              optimizer='adam', metrics=['accuracy'])

model.fit(x=X_train, y=y_train, validation_split=0.12, epochs=500, callbacks=[
          tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=30)])

Results:

model.evaluate(X_train, y_train)
3524/3524 [==============================] - 4s 1ms/step - loss: 1.9756 - accuracy: 0.3527
[1.9755834341049194, 0.3526991307735443]


model.evaluate(X_test, y_test)                                                                                    266/266 [==============================] - 0s 1ms/step - loss: 2.0323 - accuracy: 0.3366
[2.0323143005371094, 0.33659282326698303]


# example strategies using predictions on the test set
test_data[(test_data.pred_proba > 1.15 * test_data.implied_proba) & (test_data.pred_proba > 0.2)].groupby('win')['win_pay'].agg('sum')
win
False   -305000.0
True     312920.0

test_data[(test_data.pred_proba > 1.15 * test_data.implied_proba) & (test_data.pred_proba > 0.25)].groupby('win')['win_pay'].agg('sum')
win
False   -173200.0
True     200530.0

Summary

So in the long run, the model produces positive returns. The caveat is that "Implied Probability" is a post-hoc variable only available to us after a race. We can approximate this variable by using live odds close to race time as a substitute, but odds can change pretty quickly, and the slight edge of ~5% we might have erodes in realistic scenarios.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

model_summary.md

model_summary.md

Feature Glossary

Simplified Code

Neural Network Architecture

Summary

Files

model_summary.md

Latest commit

History

model_summary.md

File metadata and controls

Feature Glossary

Simplified Code

Neural Network Architecture

Summary