Expat Cinema lists foreign movies screened with English subtitles in cinemas in the Netherlands. It can be found at https://expatcinema.com.
A GitHub Action is used to deploy to AWS. The action is triggered by a push to the main branch.
The .env file in cloud/ is only used when running locally. When deploying via CI/CD, the environment variables are set in GitHub under Secrets and variables > Actions > Repository secrets. The .env file is not checked into git, so it is not available in the CI/CD environment.
It's possible to create a dev stage by locally running e.g.

```shell
pnpm run synth   # synthesize the cdk stack for dev
pnpm run watch   # watch for changes, deploy to dev
pnpm run deploy  # deploy to dev
```

The scrapers run on a daily schedule defined in the cdk stack in cloud/lib/backend-stack.ts.
Run

```shell
cd cloud; pnpm run scrapers:prod
```

to run the scrapers on the prod stage; see output/expatcinema-prod-scrapers.json for the output of the scrapers. Run

```shell
cd cloud; pnpm run scrapers
```

to run the scrapers on the dev stage; see output/expatcinema-dev-scrapers.json for the output of the scrapers.
If you want to run only a few scrapers, you can use the SCRAPERS environment variable in .env to specify which scrapers to run. After making changes, run pnpm run deploy and then pnpm run scrapers. Alternatively, since cdk watch doesn't trigger on .env file changes, when running pnpm run watch you can trigger a deploy by making a change in a .ts file, and afterwards run pnpm run scrapers.
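As a sketch, the SCRAPERS variable in cloud/.env could look like this (the scraper names below are illustrative; use the names from scrapers/index.js):

```shell
# cloud/.env — run only these scrapers (comma-separated; names are examples)
SCRAPERS=kinorotterdam,ketelhuis
```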
pnpm run config:scraper can be used to get the lambda function configuration for the scrapers.
The web is deployed on a daily schedule using GitHub Actions. The schedule is defined in .github/workflows/web.yml. The schedule is needed to have the SSG (static site generator) get the latest data from the scrapers.
GitHub Actions is used: web/ uses JamesIves/github-pages-deploy-action to deploy to the gh-pages branch, and in the GitHub settings, Pages takes its source from the gh-pages branch, which triggers GitHub's built-in pages-build-deployment workflow.
The easiest way is to bump the version in web/package.json and push to master. This triggers a GitHub Action that deploys the web app to GitHub Pages. Note there's only a prod stage for the web app.
Note: currently broken.

```shell
pnpm run scrapers:local
```

This stores the output in cloud/output instead of in S3 buckets and DynamoDB. Use the SCRAPERS environment variable in .env.local to define a comma-separated list of scrapers to run locally and diverge from the default set of scrapers in scrapers/index.js.
To call a single scraper, run e.g. LOG_LEVEL=debug pnpm tsx scrapers/kinorotterdam.ts and have e.g.

```typescript
if (require.main === module) {
  extractFromMoviePage(
    'https://kinorotterdam.nl/films/cameron-on-film-aliens-1986/',
  ).then(console.log)
}
```

in the scraper, with LOG_LEVEL=debug making the scrapers' debug output show up in the console.
The following commands create a backup of the S3 buckets and DynamoDB tables:

```shell
cd backup/
export STAGE=prod
aws s3 sync s3://expatcinema-scrapers-output-$STAGE expatcinema-scrapers-output-$STAGE --profile casper
aws s3 sync s3://expatcinema-public-$STAGE expatcinema-public-$STAGE --profile casper
aws dynamodb scan --table-name expatcinema-scrapers-analytics-$STAGE --profile casper > expatcinema-scrapers-analytics-$STAGE.json
aws dynamodb scan --table-name expatcinema-scrapers-movie-metadata-$STAGE --profile casper > expatcinema-scrapers-movie-metadata-$STAGE.json
```

For the DynamoDB tables, it might be better to use the Export to S3 functionality in the AWS Console, as those exports can be imported using aws dynamodb import-table.
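A hedged sketch of such an import (the bucket name, key prefix, and table-creation parameters file below are placeholders; this assumes an export already created via Export to S3):

```shell
# Placeholder names; requires an existing DynamoDB Export-to-S3 export
aws dynamodb import-table \
  --s3-bucket-source S3Bucket=my-export-bucket,S3KeyPrefix=AWSDynamoDB/ \
  --input-format DYNAMODB_JSON \
  --table-creation-parameters file://table-creation-parameters.json \
  --profile casper
```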
To convert the DynamoDB JSON format to a more readable format, you can use the following commands:

```shell
cd backup/
export STAGE=prod
jq -c '.Items[] |
def dynamodb_to_json:
  if type == "object" then
    if has("S") then .S
    elif has("N") then (.N | tonumber)
    elif has("BOOL") then .BOOL
    elif has("NULL") then null
    elif has("L") then [.L[] | dynamodb_to_json]
    elif has("M") then .M | with_entries(.value |= dynamodb_to_json)
    else .
    end
  else .
  end;
with_entries(.value |= dynamodb_to_json)
' expatcinema-scrapers-analytics-$STAGE.json > expatcinema-scrapers-analytics-$STAGE-converted.json
```
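As a toy illustration of what the filter does (requires jq), a single DynamoDB-JSON item converts like this:

```shell
echo '{"Items":[{"name":{"S":"Kino"},"visits":{"N":"3"}}]}' | jq -c '.Items[] |
def dynamodb_to_json:
  if type == "object" then
    if has("S") then .S
    elif has("N") then (.N | tonumber)
    elif has("BOOL") then .BOOL
    elif has("NULL") then null
    elif has("L") then [.L[] | dynamodb_to_json]
    elif has("M") then .M | with_entries(.value |= dynamodb_to_json)
    else .
    end
  else .
  end;
with_entries(.value |= dynamodb_to_json)'
# prints {"name":"Kino","visits":3}
```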
```shell
jq -c '.Items[] |
def dynamodb_to_json:
  if type == "object" then
    if has("S") then .S
    elif has("N") then (.N | tonumber)
    elif has("BOOL") then .BOOL
    elif has("NULL") then null
    elif has("L") then [.L[] | dynamodb_to_json]
    elif has("M") then .M | with_entries(.value |= dynamodb_to_json)
    else .
    end
  else .
  end;
with_entries(.value |= dynamodb_to_json)
' expatcinema-scrapers-movie-metadata-$STAGE.json > expatcinema-scrapers-movie-metadata-$STAGE-converted.json
```

The S3 buckets can be restored by running the following commands:
```shell
cd backup/
export STAGE=prod
aws s3 sync expatcinema-scrapers-output-$STAGE s3://expatcinema-scrapers-output-$STAGE --profile casper
aws s3 sync expatcinema-public-$STAGE s3://expatcinema-public-$STAGE --profile casper
```

The DynamoDB tables can be restored by running the following commands. Note that this doesn't batch; it puts the items back one by one, which might be slow for large tables.
```shell
cd backup/
export STAGE=prod
jq -c '.Items[]' expatcinema-scrapers-analytics-$STAGE.json | while read -r item; do
  aws dynamodb put-item \
    --table-name expatcinema-scrapers-analytics-$STAGE \
    --item "$item" \
    --profile casper
done
jq -c '.Items[]' expatcinema-scrapers-movie-metadata-$STAGE.json | while read -r item; do
  aws dynamodb put-item \
    --table-name expatcinema-scrapers-movie-metadata-$STAGE \
    --item "$item" \
    --profile casper
done
```

- Use https://t3.gstatic.com/faviconV2?client=SOCIAL&type=FAVICON&fallback_opts=TYPE,SIZE,URL&url=http://www.dokhuis.org&size=128 to get the favicon for the cinema.json file
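For example, to download a favicon for another cinema site (same URL pattern as above; the site value is a placeholder, and this needs network access):

```shell
site="http://www.dokhuis.org"  # replace with the cinema's site
curl -sL "https://t3.gstatic.com/faviconV2?client=SOCIAL&type=FAVICON&fallback_opts=TYPE,SIZE,URL&url=${site}&size=128" -o favicon.png
```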
Some scrapers need to run in a real browser, for which we use puppeteer and a lambda layer with Chromium.
- Find the preferred version of Chromium for the latest version of puppeteer at https://pptr.dev/supported-browsers, e.g. Chrome for Testing 123.0.6312.105 - Puppeteer v22.6.3
- Check if this version of Chromium is available (for running locally) at https://github.com/Sparticuz/chromium (check its package.json)
- Check if this version of Chromium is available (as a lambda layer) at https://github.com/shelfio/chrome-aws-lambda-layer, e.g. Has Chromium v123.0.1 and arn:aws:lambda:us-east-1:764866452798:layer:chrome-aws-lambda:45
```shell
pnpm add puppeteer-core@22.6.3 @sparticuz/chromium@^123.0.1
pnpm add -D puppeteer@22.6.3
```

After installing the new versions of puppeteer and Chromium, update the lambda layer in the cdk stack by doing a search and replace on arn:aws:lambda:eu-west-1:764866452798:layer:chrome-aws-lambda: and changing e.g. 44 to 45.
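The search and replace can be scripted with sed; a sketch on a scratch file (in the repo you would target cloud/lib/backend-stack.ts, and the version numbers are examples):

```shell
# Demo on a scratch file; for the real change, target cloud/lib/backend-stack.ts
printf 'layer:chrome-aws-lambda:44\n' > /tmp/layer-demo.txt
sed -i.bak 's/chrome-aws-lambda:44/chrome-aws-lambda:45/' /tmp/layer-demo.txt
cat /tmp/layer-demo.txt
# prints layer:chrome-aws-lambda:45
```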
Run the following command to install Chromium locally:

```shell
pnpm run install-chromium
```

To see if it's correctly installed, open it with pnpm run open-chromium, or see https://github.com/Sparticuz/chromium#running-locally--headlessheadful-mode for how.
When running a puppeteer-based scraper locally, e.g. AWS_PROFILE=casper pnpm tsx scrapers/ketelhuis.ts, and getting an error like

```
Error: Failed to launch the browser process! spawn /tmp/localChromium/chromium/mac_arm-1205129/chrome-mac/Chromium.app/Contents/MacOS/Chromium ENOENT
```

you need to install Chromium locally: run pnpm run install-chromium, which installs Chromium and updates LOCAL_CHROMIUM_EXECUTABLE_PATH in browser-local-constants.ts to point to the Chromium executable. See https://github.com/Sparticuz/chromium#running-locally--headlessheadful-mode for more information. To see if it's correctly installed, open it with pnpm run open-chromium.