java -Dspark.master=local[*] -cp metorikku-standalone.jar com.yotpo.metorikku.Metorikku -c examples/movies.yaml
Let's analyze a small MovieLens dataset!
We have our movies.csv in our file_inputs folder:
+-------+-------------------------------------------------------------------------------+-------------------------------------------+
|movieId|title |genres |
+-------+-------------------------------------------------------------------------------+-------------------------------------------+
|1 |Toy Story (1995) |Adventure|Animation|Children|Comedy|Fantasy|
|2 |Jumanji (1995) |Adventure|Children|Fantasy |
|3 |Grumpier Old Men (1995) |Comedy|Romance |
|4 |Waiting to Exhale (1995) |Comedy|Drama|Romance |
|5 |Father of the Bride Part II (1995) |Comedy |
|6 |Heat (1995) |Action|Crime|Thriller |
|7 |Sabrina (1995) |Comedy|Romance |
|8 |Tom and Huck (1995) |Adventure|Children |
|9 |Sudden Death (1995) |Action |
|10 |GoldenEye (1995) |Action|Adventure|Thriller |
|11 |American President, The (1995) |Comedy|Drama|Romance |
|12 |Dracula: Dead and Loving It (1995) |Comedy|Horror |
|13 |Balto (1995) |Adventure|Animation|Children |
|14 |Nixon (1995) |Drama |
|15 |Cutthroat Island (1995) |Action|Adventure|Romance |
We also have our ratings.csv:
+------+-------+------+----------+
|userId|movieId|rating|timestamp |
+------+-------+------+----------+
|1     |31     |2.5   |1260759144|
|1     |1029   |3.0   |1260759179|
|1     |1061   |3.0   |1260759182|
|1     |1129   |2.0   |1260759185|
|1     |1172   |4.0   |1260759205|
|1     |1263   |2.0   |1260759151|
|1     |1287   |2.0   |1260759187|
|1     |1293   |2.0   |1260759148|
|1     |1339   |3.5   |1260759125|
|1     |1343   |2.0   |1260759131|
|1     |1371   |2.5   |1260759135|
|1     |1405   |1.0   |1260759203|
|1     |1953   |4.0   |1260759191|
|1     |2105   |4.0   |1260759139|
|1     |2150   |3.0   |1260759194|
|1     |2193   |2.0   |1260759198|
We register our data sources, our variables, and our output configurations. Here's our example configuration:
# The MQL file path
metrics:
  - examples/movies_metric.yaml
inputs:
  movies: examples/file_inputs/movies.csv
  ratings: examples/file_inputs/ratings.csv
# Set custom variables that would be accessible from the SQL
variables:
  myFavoriteMovie: 'Princess Bride, The (1987)'
output:
  file:
    dir: examples/output
# If set to true, triggers Explain before saving
explain: true
# Set Log Level : ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
logLevel: WARN
# Set Application Name to have app name prefix in spark instrumentation counters
appName: moviesApp
# Shows a Preview of the output
showPreviewLines: 100
# Location to store dataframe checkpoints
checkpointStorageLocation: /tmp/
Our metric file is as follows:
steps:
- dataFrameName: moviesWithRatings
  checkpoint: true
  sql:
    SELECT userid,
           movies.movieid,
           rating,
           timestamp,
           title,
           genres
    FROM ratings
    JOIN movies ON ratings.movieid = movies.movieid
- dataFrameName: fantasyMoviesWithRatings
  sql:
    SELECT movieId,
           cast(rating AS float) AS rating,
           timestamp,
           title,
           genres
    FROM moviesWithRatings
    WHERE genres LIKE '%Fantasy%'
- dataFrameName: topFantasyMovies
  sql:
    SELECT movieId,
           title,
           avg(rating) AS averageRating
    FROM fantasyMoviesWithRatings
    GROUP BY movieId,
             title
    ORDER BY averageRating DESC
    LIMIT 100
- dataFrameName: myFavoriteMovieRated
  sql:
    SELECT *
    FROM topFantasyMovies
    WHERE title = ${myFavoriteMovie}
output:
- dataFrameName: topFantasyMovies
  outputType: Parquet
  outputOptions:
    saveMode: Overwrite
    path: topFantasyMovies.parquet
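
The way each step builds on the previous one can be tried outside Spark. The following is a minimal Python sketch, not Metorikku's implementation: it uses the standard-library sqlite3 module as a stand-in for Spark SQL, registers every step as a view named after its dataFrameName, and assumes that ${myFavoriteMovie} is replaced by a quoted string literal before the query runs. The three sample rows are copied from the tables in this walkthrough; the run_steps helper and the choice of 'Willow (1988)' as the favorite movie are hypothetical.

```python
import sqlite3
from string import Template

# Sample rows copied from the movies/ratings tables in this walkthrough.
MOVIES = [
    (1339, "Dracula (Bram Stoker's Dracula) (1992)", "Fantasy|Horror|Romance|Thriller"),
    (2193, "Willow (1988)", "Action|Adventure|Fantasy"),
    (31, "Dangerous Minds (1995)", "Drama"),
]
RATINGS = [
    (1, 1339, 3.5, 1260759125),
    (1, 2193, 2.0, 1260759198),
    (1, 31, 2.5, 1260759144),
]

# The same four steps as the metric file, as (dataFrameName, sql) pairs.
STEPS = [
    ("moviesWithRatings",
     "SELECT userId, movies.movieId AS movieId, rating, timestamp, title, genres "
     "FROM ratings JOIN movies ON ratings.movieId = movies.movieId"),
    ("fantasyMoviesWithRatings",
     "SELECT movieId, CAST(rating AS float) AS rating, timestamp, title, genres "
     "FROM moviesWithRatings WHERE genres LIKE '%Fantasy%'"),
    ("topFantasyMovies",
     "SELECT movieId, title, avg(rating) AS averageRating "
     "FROM fantasyMoviesWithRatings GROUP BY movieId, title "
     "ORDER BY averageRating DESC LIMIT 100"),
    ("myFavoriteMovieRated",
     "SELECT * FROM topFantasyMovies WHERE title = ${myFavoriteMovie}"),
]


def run_steps(conn, steps, variables):
    """Hypothetical helper: register each step as a view, in order."""
    for name, sql in steps:
        # Assumption: variables are substituted as text before execution.
        resolved = Template(sql).substitute(variables)
        conn.execute(f"CREATE VIEW {name} AS {resolved}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (movieId INTEGER, title TEXT, genres TEXT)")
conn.execute(
    "CREATE TABLE ratings (userId INTEGER, movieId INTEGER, rating REAL, timestamp INTEGER)"
)
conn.executemany("INSERT INTO movies VALUES (?, ?, ?)", MOVIES)
conn.executemany("INSERT INTO ratings VALUES (?, ?, ?, ?)", RATINGS)

# The value carries its own quotes so it lands in the SQL as a string literal.
run_steps(conn, STEPS, {"myFavoriteMovie": "'Willow (1988)'"})

top = conn.execute("SELECT * FROM topFantasyMovies").fetchall()
favorite = conn.execute("SELECT * FROM myFavoriteMovieRated").fetchall()
print(top)
print(favorite)
```

Because every step is registered under its dataFrameName, a later step can select from an earlier one exactly as it would from an input table, which is why the order of the steps matters.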
We are running each step sequentially and here are the results:
+------+-------+------+----------+----------------------------------------------------------------------------+-----------------------------------------------+
|userId|movieId|rating|timestamp |title |genres |
+------+-------+------+----------+----------------------------------------------------------------------------+-----------------------------------------------+
|1 |31 |2.5 |1260759144|Dangerous Minds (1995) |Drama |
|1 |1029 |3.0 |1260759179|Dumbo (1941) |Animation|Children|Drama|Musical |
|1 |1061 |3.0 |1260759182|Sleepers (1996) |Thriller |
|1 |1129 |2.0 |1260759185|Escape from New York (1981) |Action|Adventure|Sci-Fi|Thriller |
|1 |1172 |4.0 |1260759205|Cinema Paradiso (Nuovo cinema Paradiso) (1989) |Drama |
|1 |1263 |2.0 |1260759151|Deer Hunter, The (1978) |Drama|War |
|1 |1287 |2.0 |1260759187|Ben-Hur (1959) |Action|Adventure|Drama |
|1 |1293 |2.0 |1260759148|Gandhi (1982) |Drama |
|1 |1339 |3.5 |1260759125|Dracula (Bram Stoker's Dracula) (1992) |Fantasy|Horror|Romance|Thriller |
|1 |1343 |2.0 |1260759131|Cape Fear (1991) |Thriller |
|1 |1371 |2.5 |1260759135|Star Trek: The Motion Picture (1979) |Adventure|Sci-Fi |
|1 |1405 |1.0 |1260759203|Beavis and Butt-Head Do America (1996) |Adventure|Animation|Comedy|Crime
+-------+------+----------+----------------------------------------------------------------------------------------------+---------------------------------------------------------+
|movieId|rating|timestamp |title |genres |
+-------+------+----------+----------------------------------------------------------------------------------------------+---------------------------------------------------------+
|1339 |3.5 |1260759125|Dracula (Bram Stoker's Dracula) (1992) |Fantasy|Horror|Romance|Thriller |
|2193 |2.0 |1260759198|Willow (1988) |Action|Adventure|Fantasy |
|2294 |2.0 |1260759108|Antz (1998) |Adventure|Animation|Children|Comedy|Fantasy |
|2968 |1.0 |1260759200|Time Bandits (1981) |Adventure|Comedy|Fantasy|Sci-Fi |
|265 |5.0 |835355697 |Like Water for Chocolate (Como agua para chocolate) (1992) |Drama|Fantasy|Romance |
|314 |4.0 |835356044 |Secret of Roan Inish, The (1994) |Children|Drama|Fantasy|Mystery |
|317 |2.0 |835355551 |Santa Clause, The (1994) |Comedy|Drama|Fantasy |
|367 |3.0 |835355619 |Mask, The (1994) |Action|Comedy|Crime|Fantasy |
|405 |2.0 |835356246 |Highlander III: The Sorcerer (a.k.a. Highlander: The Final Dimension) (1994) |Action|Fantasy |
|410 |3.0 |835355532 |Addams Family Values (1993) |Children|Comedy|Fantasy |
|485 |3.0 |835355918 |Last Action Hero (1993) |Action|Adventure|Comedy|Fantasy |
|551 |5.0 |835355767 |Nightmare Before Christmas, The (1993) |Animation|Children|Fantasy|Musical |
|587 |3.0 |835355779 |Ghost (1990) |Comedy|Drama|Fantasy|Romance|Thriller |
|661 |4.0 |835356141 |James and the Giant Peach (1996) |Adventure|Animation|Children|Fantasy|Musical |
+-------+------------------------------------------------------------------------------------------------------------------+------------------+
|movieId|title |averageRating |
+-------+------------------------------------------------------------------------------------------------------------------+------------------+
|59392 |Stargate: The Ark of Truth (2008) |5.0 |
|3216 |Vampyros Lesbos (Vampiras, Las) (1971) |5.0 |
|140747 |16 Wishes (2010) |5.0 |
|3837 |Phantasm II (1988) |5.0 |
|27792 |Saddest Music in the World, The (2003) |5.0 |
|118468 |Mei and the Kittenbus (2002) |5.0 |
|2086 |One Magic Christmas (1985) |5.0 |
|8254 |Arizona Dream (1993) |5.0 |
|4789 |Phantom of the Paradise (1974) |5.0 |
|74089 |Peter Pan (1960) |5.0 |
|96832 |Holy Motors (2012) |5.0 |
|106471 |One Piece Film: Strong World (2009) |5.0 |
|101962 |Wolf Children (Okami kodomo no ame to yuki) (2012) |5.0 |
|3612 |The Slipper and the Rose: The Story of Cinderella (1976) |5.0 |
|4591 |Erik the Viking (1989) |5.0 |
|7302 |Thief of Bagdad, The (1924) |5.0 |
|26749 |Prospero's Books (1991) |5.0 |
|99764 |It's Such a Beautiful Day (2012) |5.0 |
|53887 |O Lucky Man! (1973) |5.0 |
|95113 |Eaux d'artifice (1953) |5.0 |
|72356 |Partly Cloudy (2009) |4.75 |
|50641 |House (Hausu) (1977) |4.75 |
|114552 |Boxtrolls, The (2014) |4.5 |
|62764 |Black Moon (1975) |4.5 |
|136016 |The Good Dinosaur (2015) |4.5 |
|80748 |Alice in Wonderland (1933) |4.5 |
|110645 |Witching and Bitching (Brujas de Zugarramurdi, Las) (2014) |4.5 |
|27156 |Neon Genesis Evangelion: The End of Evangelion (Shin seiki Evangelion Gekijô-ban: Air/Magokoro wo, kimi ni) (1997)|4.5 |
|50011 |Bothersome Man, The (Brysomme mannen, Den) (2006) |4.5 |
|68835 |Were the World Mine (2008) |4.5 |
|4927 |Last Wave, The (1977) |4.5 |
+-------+--------------------------+-----------------+
|movieId|title                     |averageRating    |
+-------+--------------------------+-----------------+
|1197 |Princess Bride, The (1987)|4.208588957055214|
+-------+--------------------------+-----------------+
java -Dspark.master=local[*] -cp metorikku-standalone.jar com.yotpo.metorikku.MetorikkuTester -t examples/movies_test.yaml
Metorikku also supports testing your logic, using MetorikkuTester.
MetorikkuTester expects a test settings YAML file:
metric: movies_metric.yaml
mocks:
- name: movies
  path: mocks/movies.jsonl
- name: ratings
  path: mocks/ratings.jsonl
params:
  variables:
    myFavoriteMovie: Lord of the Rings, The (1978)
tests:
  myFavoriteMovieRated:
  - movieId: 1
    title: Lord of the Rings, The (1978)
    averageRating: 2.5
A test settings file consists of the following:
- Our metric file, which holds our business logic
- A set of mocks in JSONL format
- A set of variables to use inside our SQL queries, if needed
- A set of expected results
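
To make the mock format concrete, here is a minimal Python sketch of writing JSONL mocks (one JSON object per line) and checking them against an expected result by hand. It does not reproduce how MetorikkuTester compares results, which is not described here. The mock rows, the genres value, and the write_jsonl and read_jsonl helpers are hypothetical; only the field names and the expected row come from the example above.

```python
import json
import os
import tempfile

# Hypothetical mock rows; field names follow the movies and ratings tables.
movies = [
    {"movieId": 1, "title": "Lord of the Rings, The (1978)",
     "genres": "Adventure|Animation|Children|Fantasy"},
]
ratings = [
    {"userId": 1, "movieId": 1, "rating": 2.0, "timestamp": 1260759144},
    {"userId": 2, "movieId": 1, "rating": 3.0, "timestamp": 1260759179},
]


def write_jsonl(path, rows):
    """Write one JSON object per line, which is all JSONL requires."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# A temporary directory stands in for the mocks/ folder.
mock_dir = tempfile.mkdtemp()
movies_path = os.path.join(mock_dir, "movies.jsonl")
ratings_path = os.path.join(mock_dir, "ratings.jsonl")
write_jsonl(movies_path, movies)
write_jsonl(ratings_path, ratings)

# Hand-computed version of what myFavoriteMovieRated should yield for these mocks.
loaded_movie = read_jsonl(movies_path)[0]
loaded_ratings = read_jsonl(ratings_path)
average = sum(r["rating"] for r in loaded_ratings) / len(loaded_ratings)
actual = {
    "movieId": loaded_movie["movieId"],
    "title": loaded_movie["title"],
    "averageRating": average,
}

# The expected row from the tests section of the settings file.
expected = {"movieId": 1, "title": "Lord of the Rings, The (1978)", "averageRating": 2.5}
matches = actual == expected
print(matches)
```

Keeping the mocks this small makes the expected averageRating easy to verify by hand before it goes into the tests section.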
You can run MetorikkuTester as a standalone application or as a Spark application.