# Replay Analysis tools for HSReplay.net

Tools for doing large-scale analysis on hsreplay.net data.
Replay analysis jobs are written using Yelp's MRJob library to process replays at scale via MapReduce on EMR. Data scientists can develop jobs locally, then submit a request to a member of the HearthSim team to have them run at scale on a production MapReduce cluster.
Check out `chess_brawl.py` for an example of how to write a job that uses a `hearthstone.hslog.export.EntityTreeExporter` subclass to do an analysis against the replay XML files.
Jobs that follow this template will have several things in common:

- They use the `mapred.protocols.BaseS3Protocol` base class to abstract away the raw storage details, and implement directly against the `hsreplay.document.HSReplayDocument` class.
- They implement a subclass of `EntityTreeExporter` and use the exposed hooks to capture whatever event data the job is focused on analyzing.
- They usually emit their final output as aggregates in a CSV-like format, so that final chart generation and analysis can be done in interactive visual tools like Excel.
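To make the template concrete, here is a minimal sketch of such a job. It is illustrative only: `TurnCountExporter`, `TurnCountJob`, and the turn-counting logic are hypothetical, and it assumes `BaseS3Protocol` hands the mapper each replay as a parsed `HSReplayDocument` (see `chess_brawl.py` for the repo's real conventions):

```python
# turn_count_job.py -- a hypothetical example job, not part of this repo.
from mrjob.job import MRJob
from hearthstone.enums import GameTag
from hearthstone.hslog.export import EntityTreeExporter
from mapred.protocols import BaseS3Protocol


class TurnCountExporter(EntityTreeExporter):
    """Records the last TURN tag seen, i.e. how long the game lasted."""

    def __init__(self, packet_tree):
        super(TurnCountExporter, self).__init__(packet_tree)
        self.turns = 0

    def handle_tag_change(self, packet):
        super(TurnCountExporter, self).handle_tag_change(packet)
        if packet.tag == GameTag.TURN:
            self.turns = packet.value


class TurnCountJob(MRJob):
    # Assumption: the protocol yields (key, HSReplayDocument) pairs.
    INPUT_PROTOCOL = BaseS3Protocol

    def mapper(self, key, replay):
        # One HSReplayDocument can contain several games; take the first.
        packet_tree = replay.to_packet_tree()[0]
        exporter = TurnCountExporter(packet_tree)
        exporter.export()
        yield "turns", exporter.turns

    def reducer(self, key, turn_counts):
        counts = list(turn_counts)
        # Emit an Excel-friendly aggregate: (average turns, games seen).
        yield key, (sum(counts) / float(len(counts)), len(counts))


if __name__ == "__main__":
    TurnCountJob.run()
```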
To run a job you must first make sure you have the libraries listed in `requirements.txt` installed. Then the command to invoke a job is:

```
$ python <JOB_NAME>.py <INPUT_FILE.TXT>
```
`<INPUT_FILE.TXT>` must be the path to a text file on the local file system that contains newline-delimited lines, where each line follows the format `<STORAGE_LOCATION>:<FILE_PATH>`.

If `<STORAGE_LOCATION>` is the string `local`, then the job will look for the file on the local file system. If it is any other value, like `hsreplaynet-replays`, then it assumes that the file is stored in an S3 bucket with that name.
Let's assume that your job script is named `my_job.py` and your input file is named `inputs.txt` and looks as follows:

```
local:uploads/2016/09/ex1_replay.xml
local:uploads/2016/09/ex2_replay.xml
local:uploads/2016/09/ex3_replay.xml
```
`BaseS3Protocol` will then look for those files in the `./uploads` directory, which it expects to be relative to the directory from which you invoked the script.
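Conceptually, a protocol like `BaseS3Protocol` implements mrjob's protocol interface: a `read(line)` method that turns one raw `<STORAGE_LOCATION>:<FILE_PATH>` line into the `(key, value)` pair handed to the mapper. A hedged sketch of that idea, assuming boto3 for the S3 case (the real implementation in `mapred.protocols` may differ):

```python
# Illustrative only -- see mapred.protocols for the real BaseS3Protocol.
import boto3

from hsreplay.document import HSReplayDocument


class SketchS3Protocol(object):
    """An mrjob input protocol: resolves a storage location line
    into a parsed HSReplayDocument for the mapper."""

    def read(self, line):
        location, _, path = line.decode("utf-8").strip().partition(":")
        if location == "local":
            # Paths resolve relative to the invocation directory.
            fp = open(path, "rb")
        else:
            # Any other location is treated as an S3 bucket name.
            fp = boto3.resource("s3").Object(location, path).get()["Body"]
        return path, HSReplayDocument.from_xml_file(fp)
```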
Once the test data is prepared, the job can be run by invoking:

```
$ python my_job.py inputs.txt
```
This will run the job entirely in a single process, which makes it easy to attach a debugger or employ any other traditional development practice. In addition, one of the benefits of using MapReduce is that the isolated nature of `map()` and `reduce()` functions makes them easy to unit test.
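For instance, a mapper can be exercised directly against one of the local fixture replays, with no cluster involved. This sketch assumes the hypothetical `TurnCountJob` from the earlier example:

```python
# test_my_job.py -- illustrative unit test for the hypothetical job above.
from hsreplay.document import HSReplayDocument

from my_job import TurnCountJob


def test_mapper_emits_turn_count():
    with open("uploads/2016/09/ex1_replay.xml", "rb") as f:
        replay = HSReplayDocument.from_xml_file(f)

    # args=[] keeps MRJob from parsing the test runner's command line.
    job = TurnCountJob(args=[])
    # mapper() is a plain generator, so it can be called directly.
    results = list(job.mapper(None, replay))

    assert results[0][0] == "turns"
    assert results[0][1] > 0
```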
When your job is ready, have a member of the HearthSim team run it on the production data set. There are a few small changes necessary to make the job run on EMR.
- You must replace the `<STORAGE_LOCATION>` in `inputs.txt` with the name of the raw log data bucket, usually `hsreplaynet-replays`, so that it looks like:

  ```
  hsreplaynet-replays:uploads/2016/09/ex1_replay.xml
  hsreplaynet-replays:uploads/2016/09/ex2_replay.xml
  hsreplaynet-replays:uploads/2016/09/ex3_replay.xml
  ```

  Since you likely want to run the job over a larger set of inputs, you can ask a member of the HearthSim team to help you generate a larger input file by telling them the type of replays you'd like to run the job over.
- You must run `./package_libraries.sh` to generate a zip of the libraries in this repo so that they get shipped up to the MapReduce cluster.
- When the HearthSim team member invokes the job, they will do so from a machine in the data center that is configured with the correct AWS credentials in the environment. They will also use the `-r emr` option to tell MRJob to use EMR. E.g.:

  ```
  $ python my_job.py -r emr inputs.txt
  ```
And that's it! MRJob will automatically provision an Elastic MapReduce cluster, whose size can be tuned by a HearthSim member by editing `mrjob.conf` prior to launching the job. When the job is done, MRJob will either stream the results back to the console or save them in S3, and then tear down the EMR cluster.
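The exact contents of `mrjob.conf` are up to whoever operates the cluster, but the sizing knobs live under the `emr` runner. A hedged sketch (option names vary between mrjob versions, and these values are purely illustrative):

```yaml
# mrjob.conf -- illustrative values, not this repo's actual settings.
runners:
  emr:
    region: us-east-1
    instance_type: m4.large     # hardware for each node
    num_core_instances: 8       # raise this to scale the job out
```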
Happy Questing, Adventurer!
When working on the data processing infrastructure, it is possible to pay the cost of bootstrapping the cluster only once, by first running this command:

```
$ mrjob create-cluster --conf-path mrjob.conf
```

This creates a cluster that remains active until it has been idle for a full hour, at which point it shuts itself down. The command returns a cluster ID token that looks like `j-1CSVCLY28T3EY`.
Then, when invoking subsequent jobs, the additional `--cluster-id <ID>` option can be used to have the job run on the already provisioned cluster. E.g.:

```
$ python my_job.py -r emr --conf-path mrjob.conf --cluster-id j-1CSVCLY28T3EY inputs.txt
```
Copyright © HearthSim. All Rights Reserved, with the exception of the job scripts under `/contrib`, which are licensed under the MIT license. The full license text is available in the `/contrib/LICENSE` file.
This is a HearthSim project. All development happens on our IRC channel `#hearthsim` on Freenode.