Atom_Wishlist.md

Discovery

  • [] Articles from ONA selected via keywords/collection/dates
  • [] Sitemap ingestion
  • [] Generally any other API
  • [] Set up mediacloud directory proxy
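
As a rough illustration of the sitemap ingestion item, here is a minimal sketch that pulls URLs and last-modified dates out of a standard sitemap document. The function name `parse_sitemap` and the returned dict shape are assumptions made for this example, not part of any existing code; fetching the XML and handling sitemap index files are left out.

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return a list of {"loc", "lastmod"} dicts from a <urlset> sitemap.

    Hypothetical helper: expects the sitemap XML as a string and ignores
    <sitemapindex> documents.
    """
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:  # skip entries without a usable URL
            entries.append({"loc": loc, "lastmod": lastmod})
    return entries
```

The `lastmod` value, when present, could feed the date-based selection mentioned in the first item of this list.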

Filtering

  • [] Deduplication
  • [] Sentence-level deduplication
  • [] Classifier threshold
  • Metadata subsets for final return
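
The two deduplication items could be approached along these lines. This is only a sketch under simple assumptions: duplicates are exact matches after whitespace and case normalization, sentences are split with a naive regular expression, and the function names (`dedupe_documents`, `dedupe_sentences`) are invented for illustration. Near-duplicate detection would need a different technique.

```python
import hashlib
import re


def _fingerprint(text):
    """Hash text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def dedupe_documents(docs):
    """Keep the first occurrence of each document, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        fp = _fingerprint(doc)
        if fp not in seen:
            seen.add(fp)
            unique.append(doc)
    return unique


def dedupe_sentences(docs):
    """Drop sentences already seen in this or any earlier document."""
    seen = set()
    result = []
    for doc in docs:
        # Naive split on ., ! or ? followed by whitespace (an assumption).
        sentences = re.split(r"(?<=[.!?])\s+", doc.strip())
        kept = []
        for sentence in sentences:
            fp = _fingerprint(sentence)
            if sentence and fp not in seen:
                seen.add(fp)
                kept.append(sentence)
        result.append(" ".join(kept))
    return result
```

Sentence-level deduplication of this kind is mainly useful for stripping repeated wire copy or boilerplate that appears across otherwise distinct articles.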

Data Augmentation

  • Entity Extraction (via API)
  • [] spaCy NER
  • [] N-Grams
  • [] Byline Detection
  • [] Quote extraction and attribution
  • [] Link extraction, network generation
  • [] NYT-based topic/theme detection
  • [] Sentence-level story splitting
  • [] Train a word2vec model
  • [] Media-to-media link count (i.e., table of most linked-to sources)
  • [] Media-to-document link count (i.e., table of most linked-to documents)
  • [] Country-level tagging: what region is this about?
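
The two link-count items might be computed roughly as follows. This sketch assumes links have already been extracted as `(source_url, target_url)` pairs, treats the hostname (minus a leading `www.`) as the "media source", and skips links within the same source; all of these are illustrative choices, and `link_counts` is a hypothetical name.

```python
from collections import Counter
from urllib.parse import urlsplit


def _domain(url):
    """Lowercased hostname with a leading 'www.' removed."""
    host = urlsplit(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def link_counts(links):
    """Count inbound links per target source and per target document.

    `links` is an iterable of (source_url, target_url) pairs.
    Returns (media_to_media, media_to_document) Counters.
    """
    media_to_media = Counter()
    media_to_document = Counter()
    for source_url, target_url in links:
        src, dst = _domain(source_url), _domain(target_url)
        if not src or not dst or src == dst:
            continue  # skip malformed URLs and self-links
        media_to_media[dst] += 1
        media_to_document[target_url] += 1
    return media_to_media, media_to_document
```

Calling `most_common(n)` on either counter yields the "most linked-to" table described in the list items. Target URLs are counted as-is here, so a real pipeline would probably want to canonicalize them first.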

Outputs

  • CSV
  • [] Network Maps
  • [] Kibana Instance Export
  • [] Custom Tooling...?
  • S3 buckets