Google Books word scrapper

This Node.js application retrieves words as pictures from book covers, using the Google Books API and its integrated OCR. It scrapes the data from the results page and extracts the part of the book image that corresponds to the searched term.

Example

Inputting "I am the one who knocks" yields the following result:

Getting started

Dependencies

Words-scrapper uses sharp to resize and crop the images.
As sharp is a native Node.js module, it may require some additional build tools, such as the Visual C++ libraries and Python on Windows.
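To illustrate the cropping step, here is a minimal sketch of how a highlighted word's bounding box could be turned into a crop region that stays inside the image. The `cropRegion` helper, its parameter names, and the padding default are assumptions made for this example and are not taken from scrap.js. The resulting object has the shape expected by sharp's `extract()`.

```javascript
// Hypothetical helper (not part of scrap.js): expand a highlight box by a
// small padding and clamp it to the image bounds.
function cropRegion(box, image, padding = 4) {
  const left = Math.max(0, Math.floor(box.left - padding));
  const top = Math.max(0, Math.floor(box.top - padding));
  const right = Math.min(image.width, Math.ceil(box.left + box.width + padding));
  const bottom = Math.min(image.height, Math.ceil(box.top + box.height + padding));
  // sharp's extract() expects integer left/top/width/height values
  return { left, top, width: right - left, height: bottom - top };
}

// With sharp installed, the region could be applied like this
// (file names are placeholders):
//   const sharp = require('sharp');
//   await sharp('cover.png').extract(region).toFile('word.png');
const region = cropRegion(
  { left: 10, top: 2, width: 50, height: 20 },
  { width: 55, height: 100 }
);
console.log(region);
```

Clamping matters because a padded box near the edge of a cover would otherwise fall outside the image, which `extract()` rejects.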

How to use

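// Load the scraper class from the repository's scrap.js
const WordScrapper = require('./scrap.js');

const wordScrapper = new WordScrapper();
// init() returns a promise; search() takes an array of terms
wordScrapper.init().then(() => {
  wordScrapper.search(['quilting']);
}).catch((err) => {
  console.error(err);
});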

TODO

Use phantom-pool to optimize the browser instances between each HTTP request
Provide a way to input a word to search with a UI
Find a better way to target interesting DOM elements to locate the word on the cover
Prioritize covers over normal pages (use querySelectorAll and iterate over the nodeList)
Proper error handling
Use a Promise based rimraf
Allow FS to create directories if missing
Proper linting/babel
Differentiate exposed/private functions and document them
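Two of the items above (a Promise based rimraf and creating missing directories) could be covered with Node's built-in `fs.promises` API instead of an extra dependency. The following is only a sketch under that assumption: the helper names are hypothetical, and `fs.promises.rm` requires Node 14.14 or later.

```javascript
const fs = require('fs');

// Create the directory and any missing parents; does nothing if it exists.
async function ensureDir(dir) {
  await fs.promises.mkdir(dir, { recursive: true });
  return dir;
}

// Promise-based recursive delete, similar in spirit to rimraf.
// `force: true` avoids an error when the directory is already gone.
async function removeDir(dir) {
  await fs.promises.rm(dir, { recursive: true, force: true });
}

// Empty a working directory before a run by deleting and recreating it.
async function resetDir(dir) {
  await removeDir(dir);
  return ensureDir(dir);
}

// Usage (the directory name is an assumption based on the repository's
// temp folder):
//   resetDir('temp').then(() => { /* start scraping */ });
```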

History

Working with Axiom and JSDom didn't get very far, as some of Google's JavaScript wasn't executed properly and failed with the following stack trace:

Error: Uncaught [TypeError: Cannot read property 'closure_lm_485376' of null]

I had an issue with Google not serving some files. It turned out that the outgoing request has to include the NID cookie. More details here.
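As a rough illustration of attaching the NID cookie to an outgoing request, the sketch below builds an options object that could be passed to Node's `https.get`. The helper names, the example URL, and the cookie value are placeholders chosen for this example; how the real NID value is obtained is not shown here.

```javascript
// Serialize a simple name/value map into a Cookie request header value.
function cookieHeader(cookies) {
  return Object.entries(cookies)
    .map(([name, value]) => `${name}=${value}`)
    .join('; ');
}

// Hypothetical helper: build request options carrying the cookies.
function requestOptions(url, cookies) {
  const { hostname, pathname, search } = new URL(url);
  return {
    hostname,
    path: pathname + search,
    headers: { Cookie: cookieHeader(cookies) },
  };
}

// Placeholder URL and cookie value, for illustration only.
const options = requestOptions('https://example.com/books?id=abc', { NID: 'placeholder' });
console.log(options.headers.Cookie);
// The options could then be used as: require('https').get(options, callback)
```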

Scraping caveats

A full-size cover picture does not always exist, and its low-resolution version is insufficient to extract an image properly.
Google references pages as well as book covers, and applies OCR to these, so results may actually be extracted from inner pages rather than from the cover.
Some images can be cropped too thinly due to improper highlighting, while others can include some other text.
Running concurrent PhantomJS instances does not seem to be an option for improving performance, as I run out of memory doing so.
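One way to work around the memory issue is to cap how many tasks run at the same time rather than starting them all at once. The sketch below is a generic limiter for promise-returning functions; `runLimited` is a hypothetical helper, and it does not manage PhantomJS instances itself.

```javascript
// Run async task functions with at most `limit` in flight at once.
// Results are returned in the same order as the tasks.
async function runLimited(tasks, limit = 1) {
  const results = new Array(tasks.length);
  let next = 0;

  // Each worker keeps pulling the next pending task until none are left.
  async function worker() {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  const count = Math.max(1, Math.min(limit, tasks.length));
  await Promise.all(Array.from({ length: count }, () => worker()));
  return results;
}

// Usage sketch: each task would wrap one search or page load, e.g.
//   runLimited(words.map((w) => () => scrapeWord(w)), 2)
// where scrapeWord is a placeholder for the actual scraping call.
```

With `limit` set to 1 the tasks run strictly one after another, which trades speed for a predictable memory footprint.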
