How to save everything online. Tools for scraping, saving, downloading, hoarding, archiving, etc.


all-the-data/awesome-data-hoarding


awesome-data-hoarding

A concise cheat-sheet of commands and tools for scraping, saving, hoarding, archiving, collecting, organising and browsing data.

Inspired by Reddit's /r/DataHoarder

Quick reference

Which archiving tool should you choose for each web service?

  • Amazon Video: Unknown. Check torrents instead.

  • BBC iPlayer: youtube-dl / yt-dlp

  • Discord: DiscordChatExporter (see below for notes)

  • MediaWiki website: Native dump using /wiki/Special:AllPages and /wiki/Special:Export.

  • Netflix: Unknown. Check torrents instead.

  • Reddit: Various tools

  • SoundCloud: youtube-dl / yt-dlp

  • Tumblr: TumblThreeApp (Windows). Viewers: 1, 2.

  • Twitter: ThreadReaderApp

  • Torrents: Use unblockit for a list of torrent sites. Official Twitter / Reddit.

  • Private torrent trackers: Might contain any TV show or movie ever broadcast. It can be difficult to get an invite, and you may need to maintain an upload ratio.

  • Individual web pages:

    • Save as | Web Page, HTML Only
    • Save as | Web Page, Single File
    • Save as | Web Page, Complete
    • Print | Save as PDF
    • Chrome extension SingleFile <-- Recommended!
  • Websites generally: wget, httrack or ArchiveBot.

  • YouTube video/music: youtube-dl (see below for notes) / yt-dlp

  • Radio scrobbling / Music identification: Shazam or AHA Music finder

Scraping tools

  • Radio scrobbling
    • Play radio station with low quality playlist: La Mega, Malaga.
    • Install the Chrome browser extension Shazam or AHA Music finder
    • On Linux, use xdotool to automate clicking the Chrome extension icons that trigger music identification: watch "xdotool mousemove 3442 90 click 1; sleep 20; xdotool mousemove 3476 90 click 1; sleep 20" (adjust the coordinates to match your screen)
    • Does not require speakers to be on
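
The xdotool one-liner above can also be written as a small script, which makes the coordinates easier to adjust. This is only a sketch: the variable names (FIRST_X, SECOND_X, ICON_Y, PAUSE) and the function names are made up for illustration, and the default coordinates are simply the ones from the example above. `xdotool getmouselocation` prints the current pointer position, which helps when finding your own values.

```shell
#!/bin/sh
# Hypothetical settings: screen positions of the two extension icons and the
# pause between clicks. Override them in the environment to suit your screen.
XDOTOOL=${XDOTOOL:-xdotool}
FIRST_X=${FIRST_X:-3442}
SECOND_X=${SECOND_X:-3476}
ICON_Y=${ICON_Y:-90}
PAUSE=${PAUSE:-20}

click_at() {
    # Move the pointer to ($1, $2) and left-click once.
    "$XDOTOOL" mousemove "$1" "$2" click 1
}

identify_once() {
    # Click each extension icon in turn, pausing after each click.
    click_at "$FIRST_X" "$ICON_Y"
    sleep "$PAUSE"
    click_at "$SECOND_X" "$ICON_Y"
    sleep "$PAUSE"
}

# To run continuously:
#   while true; do identify_once; done
```

Keeping the loop outside the function means a single round can be tried first to confirm that the clicks land on the right icons.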

Details of precise sets of commands.

  • wget for websites generally

wget \
    -e 'robots=off' \
    --accept '*.*' \
    --mirror \
    --wait 2 \
    --random-wait \
    --convert-links \
    --user-agent 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.7113.93 Safari/537.36' \
    'http://www.example.com/'
  • StreamRipper for music

    • Example: streamripper ###URL### -u "FreeAmp/2.x" -q -l 86400
  • Chrome DevTools for anything via a web browser

  • MediaWiki for wiki sites

    • For an XML dump containing wikitext...
    • Copy names of pages from /wiki/Special:AllPages...
    • Paste into /wiki/Special:Export
    • (optional) Parse resulting wikitext with mwparserfromhell.
  • youtube-dl / yt-dlp for YouTube and other video/audio

    • Video
yt-dlp \
    --ignore-errors \
    --format 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best' \
    --output "%(playlist_title)s/%(title)s.%(ext)s" \
    --throttled-rate 10K \
    ###URL###
  • Audio
yt-dlp \
    --ignore-errors \
    --extract-audio \
    --audio-quality 0 \
    --audio-format mp3 \
    --prefer-ffmpeg \
    --output "%(playlist_title)s/%(artist)s - %(title)s.%(ext)s" \
    --throttled-rate 10K \
    ###URL###
  • Audio album playlist
yt-dlp \
    ...etc... \
    --output "%(artist)s - %(album)s/%(artist)s - %(album)s - %(playlist_index)02d - %(track)s.%(ext)s" \
    ###URL###
  • Video playlist
yt-dlp \
    ...etc... \
    --output "%(playlist_title)s/%(playlist_index)03d - %(artist)s - %(title)s.%(ext)s" \
    ###URL###
  • Multiple playlists
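# Read one playlist URL per line from the file "list".
# Unlike $(cat list), this does not word-split or glob-expand the URLs.
while IFS= read -r URL
do
    # </dev/null stops the download from consuming the rest of the list
    yt-dlp ...etc... "$URL" </dev/null
done < list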
  • DiscordChatExporter + excellent wiki
    • Example: docker run --rm -v /var/www/zaphod/adhd:/app/out tyrrrz/discordchatexporter:stable export --channel ###ID### --token ###SECRET### --format Json
    • List guilds: docker run --rm tyrrrz/discordchatexporter:stable guilds --token ###SECRET###
    • List channels: docker run --rm tyrrrz/discordchatexporter:stable channels --guild ###ID### --token ###SECRET###
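
The MediaWiki export steps listed earlier (copy titles from Special:AllPages, paste into Special:Export) can probably be scripted with curl. The sketch below rests on several assumptions: the wiki serves pages under /wiki/, it accepts the export form's fields (pages, curonly, action) as a POST, and the names export_url, export_pages, titles.txt and dump.xml are placeholders invented here. Check the field names against the export form of the wiki you are archiving.

```shell
#!/bin/sh
CURL=${CURL:-curl}

export_url() {
    # Build the Special:Export address from a wiki base URL
    # (a trailing slash on the base URL is tolerated).
    printf '%s/wiki/Special:Export\n' "${1%/}"
}

export_pages() {
    # $1 = wiki base URL
    # $2 = file with one page title per line
    # $3 = output XML file
    # "pages@FILE" makes curl read the file and URL-encode its content,
    # newlines included. curonly=1 asks for current revisions only (assumed).
    "$CURL" --silent --show-error --fail \
        --data-urlencode "pages@$2" \
        --data 'curonly=1' \
        --data 'action=submit' \
        --output "$3" \
        "$(export_url "$1")"
}

# Usage (hypothetical wiki):
#   export_pages 'https://wiki.example.org' titles.txt dump.xml
```

The resulting XML contains wikitext, so the optional mwparserfromhell step still applies afterwards.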

Processing tools

Techniques

Combine streamed .ts files and m3u8 playlist/chunklist into an mpeg/mp4 video

  • After extracting the .m3u8 and .ts files from HAR, run something like:
    • ffmpeg -i playlist.m3u8 -c copy -bsf:a aac_adtstoasc output.mp4
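
A missing segment is a likely reason for the ffmpeg step above to fail partway, so it can help to check the playlist first. The sketch below assumes the playlist refers to segments by relative file name and that they sit in the same directory as the playlist; the function names missing_segments and remux are made up for illustration.

```shell
#!/bin/sh
missing_segments() {
    # $1 = path to the .m3u8 playlist
    # Print every segment named in the playlist that is not on disk.
    # Lines starting with '#' are playlist tags, not segment names.
    playlist=$1
    dir=$(dirname -- "$playlist")
    grep -v '^#' "$playlist" | tr -d '\r' | while IFS= read -r seg
    do
        [ -n "$seg" ] || continue
        [ -e "$dir/$seg" ] || printf '%s\n' "$seg"
    done
}

remux() {
    # $1 = playlist, $2 = output file
    # Refuse to run ffmpeg while segments are missing.
    missing=$(missing_segments "$1")
    if [ -n "$missing" ]; then
        printf 'Missing segments:\n%s\n' "$missing" >&2
        return 1
    fi
    ffmpeg -i "$1" -c copy -bsf:a aac_adtstoasc "$2"
}

# Usage:
#   remux playlist.m3u8 output.mp4
```

Playlists that list absolute URLs rather than local file names would report every segment as missing, so this check only suits the local-file case.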

Extract playlist data from YouTube and YT Music

Input: https://music.youtube.com/library/playlists

Goal: Extract a list of playlists suitable for feeding to youtube-dl / yt-dlp

These are alternative ways to attempt the same thing. Not all of them work, and the results differ as noted:

  1. Chrome: Save As | Web Page, HTML Only --> doesn't work, empty page
  2. Chrome: Save As | Web Page, Single File --> works, full HTML, embeds images, uses quoted-printable encoding, i.e. = becomes =3D
  3. Chrome: Save As | Web Page, Complete --> works, full HTML, not encoded, saves album/playlist covers as image files.
  4. Chrome: DevTools | Elements | right-click the element | Copy | Copy element | Paste into text editor --> works, full HTML
  5. Chrome: Extensions | XPath Helper | Ctrl-Shift-X | Hover over element | Shift | Edit XPath to remove e.g. [409] | Append /@href --> works, list of URLs
  6. Chrome: DevTools | Console | $x('//body/div/foo')
  7. Chrome: DevTools | Elements | right-click | Copy | Copy JS path | (paste into console and edit - see snippet below)
  8. Chrome: Extensions | AutoHAR | chrome --auto-open-devtools-for-tabs | ...etc
  9. Chrome: DevTools | Network | Filter | Fetch/XHR | https://music.youtube.com/youtubei/v1/browse/...etc... | (a) Save all as HAR with content, (b) (down-arrow near top-right) Export HAR...
  10. (Idea) Headless chrome + puppeteer or playwright

JavaScript snippet:

// Run in the DevTools console on the playlists page.
// Prints one line per playlist tile: link URL, third subtitle span, link text.
document.querySelectorAll("#items > ytmusic-two-row-item-renderer").forEach((item) => {
    const drill = item.querySelector("div.details.style-scope.ytmusic-two-row-item-renderer");
    if (!drill) { return; }
    const span = drill.querySelector('span > yt-formatted-string > span:nth-child(3)');
    if (!span) { return; }  // skip tiles that lack the third subtitle span
    const link = drill.querySelector('a');
    console.log(
        link.href
        + "    " + span.innerHTML
        + "    " + link.text
    );
});

Shorter snippet:

var output = '';
document.querySelectorAll("h3 > div > div > a").forEach((item) => { output += item.text + "\n"; });
console.log(output);
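console.save(output);  // not built in: requires a console.save() helper to be loaded first

Save data out of the console via the clipboard or by writing a file. Chrome's built-in copy(output) puts a value on the clipboard; console.save() is not built in and is provided by a separate helper snippet.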
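
For option 9 above (saving the network traffic as HAR), the saved file is JSON, so the captured responses can be pulled out with jq. The sketch below assumes jq 1.6 or later (for @base64d) and a HAR saved with content; the function name har_bodies and the file name capture.har are placeholders invented here.

```shell
#!/bin/sh
har_bodies() {
    # $1 = HAR file
    # $2 = substring that the request URL must contain
    # Print the response body of every matching request. Bodies that the
    # HAR marks as base64-encoded are decoded; entries without a body are skipped.
    jq -r --arg match "$2" '
        .log.entries[]
        | select(.request.url | contains($match))
        | .response.content
        | select(.text != null)
        | if .encoding == "base64" then (.text | @base64d) else .text end
    ' "$1"
}

# Usage:
#   har_bodies capture.har '/youtubei/v1/browse' > browse-responses.json
```

The output is the raw response text, which for the browse endpoint is JSON that still needs to be searched for the playlist entries.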

Case studies

  • naive-slack-scraper. Hypothetical code that cannot exist, as it potentially wouldn't follow terms of service. So don't look for it.
  • pokemon-data. jq examples.
  • moar jq examples

Discussion

  • If an archive of data is made, and that data cannot be viewed reasonably easily in a way similar to its original presentation by a person on the street, then it can be considered not to be viewable at all. It may as well not exist for public purposes. A possible retort is to assert "A viewer program could be built". But if that viewer program doesn't yet exist, then the data still can't be viewed. It's a Schroedinger's archive.

Communities

Similar projects
