Contributing

Please read the design thoroughly before modifying the code. It lists some gotchas which may be helpful to know ahead of time.

Design

Subcommands

Each command lives in the subcommands module. Each has a docopt string for its command-specific arguments. The .command() function is invoked with the global arguments dict (from the main script) merged with the command-specific arguments dict.

If the module provides the args_schema variable, the command-specific arguments will be passed through the schema validation.
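
The dispatch flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual dispatcher: `run_subcommand`, the `IntCount` stand-in schema, and the inline `example` module are hypothetical, and the assumption that command-specific values win on key collisions is not confirmed by the design.

```python
from types import SimpleNamespace


def run_subcommand(module, global_args, command_args):
    """Validate command-specific args (if a schema exists), merge, dispatch."""
    schema = getattr(module, "args_schema", None)
    if schema is not None:
        # Only the command-specific arguments go through validation.
        command_args = schema.validate(command_args)
    # Assumption: command-specific values win if a key appears in both dicts.
    merged = {**global_args, **command_args}
    return module.command(merged)


class IntCount:
    """Hypothetical stand-in for a schema object exposing .validate()."""

    def validate(self, args):
        return {**args, "--count": int(args["--count"])}


# Hypothetical subcommand module, built inline to keep the sketch standalone.
example = SimpleNamespace(
    args_schema=IntCount(),
    command=lambda args: args,  # a real command would do work here
)

result = run_subcommand(example, {"--verbose": True}, {"--count": "3"})
```

A module without `args_schema` simply skips the validation step and receives the merged dict as-is.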

Scraping - Messages

Scraping is done with Selenium to allow for scraping private sites.

The Yahoo Groups undocumented JSON API is used:

  • https://groups.yahoo.com/api/v1/groups/<group_name>/messages?count=1&sortOrder=desc&direction=-1 to get the total number of messages.
  • https://groups.yahoo.com/api/v1/groups/<group_name>/messages/<message_number> to get the data, with HTML content, for the given message
  • https://groups.yahoo.com/api/v1/groups/<group_name>/messages/<message_number>/raw to get the data, with raw content, for the given message
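
As a small sketch, the three endpoints above could be assembled like this. The helper names are hypothetical, and because the API is undocumented, nothing beyond the URL shapes listed above is assumed.

```python
from urllib.parse import quote

API_ROOT = "https://groups.yahoo.com/api/v1/groups"


def latest_message_url(group_name):
    """URL whose response is used to find the total number of messages."""
    group = quote(group_name, safe="")
    return f"{API_ROOT}/{group}/messages?count=1&sortOrder=desc&direction=-1"


def message_url(group_name, message_number, raw=False):
    """URL for one message; raw=True selects the raw-content variant."""
    group = quote(group_name, safe="")
    url = f"{API_ROOT}/{group}/messages/{int(message_number)}"
    return url + "/raw" if raw else url
```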

All the message data from the API is combined and inserted into a MongoDB database with the same name as the group. Data is stored as returned from the API, except that the message id is stored in the _id field.
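
A minimal sketch of the `_id` handling, assuming the combined message is a plain dict. The `id_field` default of `"msgId"` is a hypothetical key name, and keeping the original key alongside `_id` is also an assumption.

```python
def to_mongo_document(message, id_field="msgId"):
    """Return a copy of the API data with the message id stored under _id."""
    doc = dict(message)  # shallow copy, so the caller's dict is untouched
    # "msgId" is a hypothetical key name; pass the real one via id_field.
    doc["_id"] = doc[id_field]
    return doc

# With pymongo, such a document could then be stored along these lines
# (the collection name "messages" is hypothetical):
#   MongoClient()[group_name]["messages"].insert_one(doc)
```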

Scraping - Files

Files are scraped through the human-consumable interface (i.e. the website) as I couldn't figure out the JSON API calls for it.

They are stored in a GridFS instance with the name <group_name>_gridfs.

Static Site Dumping

All the group data (messages and files) can be dumped into a static site which is viewable without any internet connection, and without needing to run a local web server.

The static site is an AngularJS app. The message index data is stored as a separate .js file and loaded with JSONP. This allows us to essentially load data from the local filesystem.

The messages themselves are stored in batches of n messages (default 500) and loaded on-demand.
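
The batching can be pictured with a short sketch. It assumes batches are formed by position in an ordered list of messages; the function name is hypothetical, and the real dump code may key batches differently.

```python
def split_into_batches(messages, batch_size=500):
    """Split an ordered list of messages into consecutive batches."""
    return [
        messages[start:start + batch_size]
        for start in range(0, len(messages), batch_size)
    ]
```

For example, 1201 messages with the default size would give three batches of 500, 500, and 201 messages, and only the batch containing the requested message needs to be loaded.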

A site is "dumped" by copying everything but the data from the static_site_template directory, and rendering the necessary data into jsonp files.
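
Rendering data into a JSONP file amounts to wrapping JSON in a function call, so the browser can load it with an ordinary script tag. A minimal sketch, in which `render_jsonp` and the callback name are hypothetical:

```python
import json


def render_jsonp(callback, data):
    """Wrap JSON-serializable data in a call to the named callback."""
    return f"{callback}({json.dumps(data)});"


# "loadIndex" is a hypothetical callback name used only for illustration.
index_js = render_jsonp("loadIndex", [{"id": 1}])
```

The resulting text is valid JavaScript, which is what lets the static site read it from the local filesystem.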

The group files are copied into the files directory, and it is left up to the browser to display the contents, as if browsing any other local directory.

Angular Templates

Angular templates are in the modules subdirectory. The tricky part is that there is no way to load these .html files if the page is being served from the filesystem. Modern browsers prevent access to arbitrary files due to security concerns, and they won't load the <script type="text/ng-template" ...> tags.

To work around this without being forced to include all the .html in one file, there's the modules/load-templates.js file. This preloads all the templates into $templateCache. This file is automatically generated by dump_site, which reads the modules/*.html files and inserts their data into the file.

NOTE: Template ids all start with ./, e.g. modules/foo.html will be loaded as the template called ./modules/foo.html. If you include a template without the leading ./, it will not work on the generated site.

Paths are followed recursively.
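
The generation step can be sketched as follows. This is not the dump_site implementation: the function names and the Angular module name `"app"` are hypothetical, and the emitted JavaScript is only one plausible shape for a file that fills `$templateCache`.

```python
import json
from pathlib import Path


def collect_templates(site_root):
    """Map template ids ('./modules/...') to the contents of each .html file."""
    root = Path(site_root)
    templates = {}
    # rglob descends into subdirectories, so nested templates are included.
    for path in sorted((root / "modules").rglob("*.html")):
        template_id = "./" + path.relative_to(root).as_posix()
        templates[template_id] = path.read_text(encoding="utf-8")
    return templates


def render_load_templates(templates, module_name="app"):
    """Render JavaScript that preloads every template into $templateCache."""
    # json.dumps yields correctly escaped JavaScript string literals.
    puts = "\n".join(
        f"  $templateCache.put({json.dumps(tid)}, {json.dumps(html)});"
        for tid, html in templates.items()
    )
    return (
        f"angular.module({json.dumps(module_name)})"  # hypothetical name
        ".run(['$templateCache', function ($templateCache) {\n"
        f"{puts}\n"
        "}]);\n"
    )
```

Every generated id carries the ./ prefix, matching the note above about how templates must be referenced.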