My own IMDb dataset importer, which loads the datasets into a Marten DB document store.
It was extracted from a private project and open-sourced to help you get started with the IMDb datasets.
- Contains the core models representing rows in each dataset
- Provides parsers that turn each dataset into models (with built-in batching, so as not to run out of memory)
- Repository interfaces to abstract the infrastructure layer
- Implements the repository interfaces with Marten DB document stores
- Marten uses Postgres underneath
- The batch sync uses Postgres' `COPY` command internally, which gives faster bulk inserts than establishing a connection per operation; inserting rows one by one would cost 2 connection lifetimes per row. The default batch size (rows loaded into memory and then sent to a bulk insert) is 50k
- Provides an extension method, `StoreOptions.SetupIMDbIdentities()`, that adds the Marten identities, to reduce duplicated code
- For example, in an ASP.NET Core project you would run:
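    services.AddMarten(options =>
    {
        // Let Marten create and update schema objects as needed
        options.AutoCreateSchemaObjects = AutoCreate.All;
        options.Connection(Configuration.GetConnectionString("Marten"));
        // Register the identities for the IMDb document types
        options.SetupIMDbIdentities();
    });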
- Upon startup it runs a full sync of the IMDb datasets found here
- There is no scheduling for now; to rerun the sync, restart the container or console app
- To run it every day, week, etc., look at a project like Quartz.NET
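
The batching described above can be illustrated with a minimal sketch. It is not the importer's actual code: `BatchingSketch`, `InBatches`, `Import`, and the `bulkInsert` callback are hypothetical names, and the callback stands in for whatever bulk insert the Marten-backed repositories perform.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of this project's API.
public static class BatchingSketch
{
    // Lazily groups rows into lists of at most batchSize items,
    // so the whole dataset is never materialised at once.
    public static IEnumerable<IReadOnlyList<T>> InBatches<T>(IEnumerable<T> rows, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        var batch = new List<T>(batchSize);
        foreach (var row in rows)
        {
            batch.Add(row);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize); // start a fresh batch
            }
        }

        // Emit the final, possibly smaller, batch.
        if (batch.Count > 0) yield return batch;
    }

    // Sends each batch to the supplied bulk-insert callback (an assumption:
    // the real importer would pass batches to its repository) and returns
    // the number of batches sent.
    public static int Import<T>(IEnumerable<T> rows, int batchSize, Action<IReadOnlyList<T>> bulkInsert)
    {
        var batches = 0;
        foreach (var batch in InBatches(rows, batchSize))
        {
            bulkInsert(batch);
            batches++;
        }
        return batches;
    }
}
```

With the default batch size of 50k, a dataset of 120,001 rows would be sent as three bulk inserts under this sketch: two full batches and one of 20,001 rows.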
This project includes Docker and docker-compose support out of the box.
To orchestrate the Postgres database and the IMDb importer, simply run `docker-compose up -d --build` (`-d` makes it run in the background, `--build` builds the Docker images).
Feel free to edit the environment variables found in the `docker-compose.yml` file to suit your needs:
- `POSTGRES_USER`, `POSTGRES_PASSWORD`, and `POSTGRES_DB` - set up the Postgres database with these default settings; they are only applied on the first run.
- `MARTEN_CONNECTION_STRING` - the connection string to the `postgres` service. This uses the Postgres configuration from above; feel free to change it if the database is not in the docker-compose services.
- `BATCH_SIZE` - the number of rows to load at a time; this can be lowered if RAM is an issue.
- `IMDB_BASE_PATH` - the directory to save the raw IMDb dataset files to. Remove this entirely to save to `current working directory/imdb`. By default in the docker-compose project this is set to `/imdb`, as this folder is volume mounted.
- `DATABASE_NAME` - the name of the database, used for some direct queries after the import for statistics.
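
As a rough illustration of how `BATCH_SIZE` and `IMDB_BASE_PATH` could be resolved with the defaults described above, here is a minimal sketch. `ImporterSettingsSketch` and its methods are hypothetical and not taken from this project; falling back to 50k when `BATCH_SIZE` is missing or invalid is an assumption about sensible behaviour, not documented behaviour.

```csharp
#nullable enable
using System;
using System.IO;

// Hypothetical helper, not part of this project's API.
public static class ImporterSettingsSketch
{
    // The default batch size stated in this README.
    public const int DefaultBatchSize = 50_000;

    // Assumption: a missing, non-numeric, or non-positive value falls back to the default.
    public static int ResolveBatchSize(string? raw) =>
        int.TryParse(raw, out var parsed) && parsed > 0 ? parsed : DefaultBatchSize;

    // When the variable is unset, save under "<current working directory>/imdb".
    public static string ResolveBasePath(string? raw) =>
        string.IsNullOrWhiteSpace(raw)
            ? Path.Combine(Directory.GetCurrentDirectory(), "imdb")
            : raw;

    // Reads both settings from the process environment.
    public static (int BatchSize, string BasePath) FromEnvironment() => (
        ResolveBatchSize(Environment.GetEnvironmentVariable("BATCH_SIZE")),
        ResolveBasePath(Environment.GetEnvironmentVariable("IMDB_BASE_PATH")));
}
```

Keeping the parsing in small pure functions makes the fallbacks easy to check without touching the real environment.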