This repository has been archived by the owner on Apr 7, 2020. It is now read-only.

Add Script To Get Data From Prod #108

Open
dajinchu opened this issue Sep 19, 2019 · 3 comments


@dajinchu
Contributor

Right now in dev, if Elasticsearch isn't installed locally, we just hit the prod search API endpoint.
If a developer does want to work on the backend, they have to install Elasticsearch and run yarn scrape and yarn index, which takes about an hour on the first run. What might be preferable is a script that just downloads the already-scraped data from prod and puts it in the right spot for yarn index to slurp it all into a local ES installation.
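A minimal sketch of what such a script could look like. The endpoint URL, the `fetch_term` helper, and the destination directory here are all assumptions for illustration, not the real prod setup:

```shell
#!/bin/sh
# Sketch: pull already-scraped term data from prod so `yarn index`
# can load it into a local Elasticsearch without a full scrape.
set -e

# Fetch one term's JSON into the folder yarn index reads from.
# $1 = base URL of the prod data (assumed), $2 = term id, $3 = destination dir
fetch_term() {
  mkdir -p "$3"
  curl -fsSL "$1/$2.json" -o "$3/$2.json"
}

# Hypothetical usage; the real endpoint and layout would need to match prod:
# fetch_term "https://searchneu.com/data/neu.edu" 201930 "public/data/neu.edu"
```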

Part of this effort might also be cleaning up how the public data folder works. Right now there is both /neu.edu/201930.json, which is meant for public use, and allTerms.json, which exists just to send scraped data from Travis to prod; the two are similar but use different formats.

@ryanhugh
Owner

Sounds like a great idea, but this would only let people work on a small part of the backend without actually running the full scrapers. With this, they could run server.js and the indexing code, but not any of the scraping code (everything in backend/scrapers). The scraping code goes through the cache to backend/request.js, which pulls the data from NEU's site. All of the scraping code runs before allTerms.json is even assembled, and even the updater depends on the scraping code.

What would be an interesting idea is if we could store a full cache folder on prod, and then have an option to download the cache folder from prod instead of having everyone run the full scrapers on their laptop when they want to start developing on the backend.
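One possible shape for that option, assuming the prod cache folder gets published somewhere as a tarball (the URL, the `download_cache` helper, and the paths are hypothetical):

```shell
#!/bin/sh
set -e

# Download a tarball of the prod cache folder and unpack it, so the
# scrapers can run against the cache instead of hitting NEU's site.
# $1 = tarball URL (assumed), $2 = directory to unpack into (e.g. backend/)
download_cache() {
  mkdir -p "$2"
  curl -fsSL "$1" | tar -xz -C "$2"
}

# Hypothetical usage:
# download_cache "https://searchneu.com/cache.tar.gz" "backend"
```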

Note that we would need some way to keep this online cache folder up to date: when the scrapers run on Travis, they run in production mode, which bypasses the cache and doesn't create a cache folder. Also, the cache folder is usually over 1 GB.

@ryanhugh
Owner

Oh, another thing: once we merge Jenning's code for the new version of Banner, the scrapers should take <10 minutes, which means the cache will be smaller and the initial run will be a lot faster too.

@dajinchu
Contributor Author

Hm, that's a good point. We might just want to leave it alone and make developers run a full scrape.
