This repository has been archived by the owner on Mar 9, 2023. It is now read-only.

savreviews.pl


Download all reviews for a book, e.g., for sentiment analysis

From r/goodreads (2018) or the Goodreads Developers forum, Breslin (2018) or Giulia (2018):

I simply need to obtain all (or as many) reviews for two books, namely Woolf's To the Lighthouse and Mrs Dalloway, so that i can then analyse the corpus obtained from them and see if readers define the two novels as "difficult".

Output format

$ cat savreviews-book12345-stars2.txt
2018/12/29 #1234567

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
minim veniam, quis nostrud exercitation ullamco laboris nisi ut
aliquip ex ea <em>commodo consequat</em>. 

Duis aute irure dolor in reprehenderit in voluptate velit esse cillum 
dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non 
proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

-------------------------------------------------------------------------------
2018/10/21 #7654321

Ut enim ad minim veniam, quis nostrud <b>exercitation</b> ullamco laboris nisi 
ut aliquip ex ea commodo consequat: <a href="https://example.com">example.com</a>

-------------------------------------------------------------------------------
2018/04/01 #918273

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
minim veniam, quis nostrud exercitation ullamco laboris nisi

Note:

The generated files (one per star rating) contain only the review text, the date and the review ID. They do not contain any other information, e.g., user names. If you are interested in these details or in other output formats, just contact me or open an issue.
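For analysis, the files usually have to be split back into individual reviews. The following is a minimal Python sketch of that step, not part of the toolbox. It assumes what the sample above suggests: each record starts with a `YYYY/MM/DD #ID` line, and records are separated by a line consisting only of dashes. The names `Review` and `parse_reviews` are made up for illustration.

```python
import re
from typing import Iterator, NamedTuple


class Review(NamedTuple):
    date: str       # "YYYY/MM/DD", exactly as written in the file
    review_id: str  # digits following '#'
    text: str       # review body, may still contain HTML


# Assumption: a record header looks like "2018/12/29 #1234567".
HEADER = re.compile(r"^(\d{4}/\d{2}/\d{2}) #(\d+)$")
# Assumption: records are separated by a line made only of dashes.
SEPARATOR = re.compile(r"^-{20,}[ \t]*$", re.MULTILINE)


def parse_reviews(content: str) -> Iterator[Review]:
    """Yield one Review per record found in the content of an output file."""
    for chunk in SEPARATOR.split(content):
        chunk = chunk.strip()
        if not chunk:
            continue
        header, _, body = chunk.partition("\n")
        match = HEADER.match(header.strip())
        if match is None:
            continue  # skip chunks that do not start with a recognizable header
        yield Review(match.group(1), match.group(2), body.strip())
```

A file can then be read with something like `parse_reviews(open(path, encoding="utf-8").read())`; the UTF-8 encoding is an assumption. A review whose own text contains a long line of dashes would be split incorrectly by this sketch.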

How to generate this on a GNU/Linux operating system

  1. Install the toolbox
  2. At the prompt, enter:
$ ./savreviews.pl --help
$ ./savreviews.pl 59716  # Goodreads Book-ID in URL

Loading reviews for "To the Lighthouse"... 5271 of 5860 [searching]

Number of reviews per year:
2007 ################                           263
2008 #####################                      343
2009 ################                           266
2010 #################                          276
2011 ######################                     357
2012 #############################              473
2013 ##################################         565
2014 ############################               456
2015 ###########################                440
2016 #############################              474
2017 ####################################       599
2018 ########################################   648
2019 ######                                     111

Writing reviews to:
./list-out/savreviews-book59716-stars0.txt
./list-out/savreviews-book59716-stars1.txt
./list-out/savreviews-book59716-stars2.txt
./list-out/savreviews-book59716-stars3.txt
./list-out/savreviews-book59716-stars4.txt
./list-out/savreviews-book59716-stars5.txt

Total time: 36 minutes
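Since the star rating is only encoded in the file name, a later analysis step needs to recover it from there. A small Python sketch, assuming the `savreviews-book<ID>-stars<N>.txt` naming shown in the listing above; the function name `files_by_stars` is hypothetical:

```python
import re
from pathlib import Path

# Assumption: file names follow the pattern from the listing above.
NAME = re.compile(r"^savreviews-book(\d+)-stars([0-5])\.txt$")


def files_by_stars(out_dir, book_id):
    """Map star rating (0-5) to the matching output file for one book."""
    result = {}
    pattern = f"savreviews-book{book_id}-stars*.txt"
    for path in sorted(Path(out_dir).glob(pattern)):
        match = NAME.match(path.name)
        if match and match.group(1) == str(book_id):
            result[int(match.group(2))] = path
    return result
```

For the run above, `files_by_stars("./list-out", 59716)` would return a dictionary with one path per star rating, provided all six files were written.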

Observations and limitations

  • long runtime: Goodreads throttles all requests and we have to load a lot of data
  • there's no way to load all reviews of a book, but the program tries different approaches to get as many full-text reviews as possible; this can take very long (see the --rigor parameter)
  • the output needs data cleansing on your side
  • review text might include user-entered (broken) HTML code and URLs
  • review text can be in any language, e.g., German or Russian
  • review text might include non-Latin characters, e.g., Cyrillic
  • no duplicate reviewers, but the output could theoretically contain duplicate reviews posted by different members (statistically negligible?)
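The cleansing mentioned above could start along the following lines. This is only a sketch using the Python standard library, not something the toolbox provides: it strips HTML tags (tolerating unclosed ones), decodes entities, removes `http(s)` URLs and collapses whitespace. The helper names `clean_review` and `drop_duplicates` are invented for this example, and treating reviews as duplicates when their cleaned, case-folded text is identical is an assumption.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text content and ignore all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities like &amp;
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


URL = re.compile(r"https?://\S+")


def clean_review(text):
    """Return the review as plain text without tags, URLs or extra whitespace."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush text that is still buffered
    plain = URL.sub(" ", "".join(parser.parts))
    return re.sub(r"\s+", " ", plain).strip()


def drop_duplicates(texts):
    """Keep the first occurrence of each review, compared after cleaning."""
    seen = set()
    unique = []
    for text in texts:
        key = clean_review(text).casefold()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

Language detection and handling of non-Latin scripts are left out here; they depend on the analysis you have in mind.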

Feedback

If you like this project, give it a star on GitHub. Report bugs or suggestions via GitHub or see the AUTHORS.md file.

See also