philshem/zuerich_speaks

Latest commit: eb5a662 by @philshem, Aug 26, 2018


Text mining 100+ years of Kanton Zürich's referenda and initiatives

TWIST2018 project

team

*Peter has some nice papers with previous research

main data sources:

  • https://opendata.swiss/de/dataset/abstimmungsarchiv-des-kantons-zurich

  • Kantonal-level CSV contains URLs to machine-readable PDFs with voting information

  • Gemeinde-level CSV contains historical voting records per Gemeinde

  • CSVs are joined by unique vote ID (STAT_VORLAGE_ID)

  • PDFs are converted to TXT via pdftotext and can be joined to the CSV files by the field ABSTIMMUNGSTAG
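The join described above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: it uses Python 3 (the repository is mostly Python 2.7), the rows are made up, and every column name other than STAT_VORLAGE_ID and ABSTIMMUNGSTAG is a hypothetical placeholder, as is the date format.

```python
import csv
import io

# Tiny in-memory stand-ins for the two CSV files. The rows are invented;
# TITLE, GEMEINDE and JA_PCT are hypothetical column names.
KANTON_CSV = """STAT_VORLAGE_ID,ABSTIMMUNGSTAG,TITLE
101,1999-02-07,Example measure A
102,1999-02-07,Example measure B
"""
GEMEINDE_CSV = """STAT_VORLAGE_ID,GEMEINDE,JA_PCT
101,Winterthur,48.5
101,Uster,55.0
102,Uster,61.2
"""


def join_by_vote_id(kanton_file, gemeinde_file):
    """Attach the Kantonal metadata to every Gemeinde row with the same vote ID."""
    kanton = {row["STAT_VORLAGE_ID"]: row for row in csv.DictReader(kanton_file)}
    joined = []
    for row in csv.DictReader(gemeinde_file):
        meta = kanton.get(row["STAT_VORLAGE_ID"])
        if meta is None:
            continue  # Gemeinde row without a matching Kantonal entry
        merged = dict(meta)
        merged.update(row)
        joined.append(merged)
    return joined


rows = join_by_vote_id(io.StringIO(KANTON_CSV), io.StringIO(GEMEINDE_CSV))
for r in rows:
    print(r["ABSTIMMUNGSTAG"], r["STAT_VORLAGE_ID"], r["GEMEINDE"], r["TITLE"])
```

Because each joined row carries ABSTIMMUNGSTAG, the converted TXT files can be attached afterwards by voting day in the same way.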

using the code and data

(mostly Python 2.7 or bash)

  • get_pdfs.py scrapes the URLs from the Kantonal CSV file and saves the PDFs locally. (In practice we got the PDFs from the organizers on a USB stick, because the scraper was getting IP-blocked.) Note that the Bundesamt.pdf files are not linked by URL in the CSV files.

  • convert_pdf_to_txt.sh loops over the PDFs and converts them to TXT with pdftotext.

  • read_txt.py reads the individual TXT files, cleans up the text a bit, and writes a CSV file with some keys for joining later: full_text.csv (zipped).

  • vote_mapping.py (experimental) reads the combined text from full_text.csv and the metadata from the Kantonal CSV file. It attempts to split each TXT file into multiple elements, one per ballot measure, using some file-specific keywords. The code then maps elements to ballot measures based on their rank in this split array. Output file is full_text_mapped.csv.

  • sentiment.py reads full_text_mapped.csv and calculates the polarity (-1 to 1) and the subjectivity (0 to 1) with textblob_de, as well as the readability. Output file is full_text_mapped_sentiment.csv, with the three scores added as the last 3 columns.
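The split-and-map idea behind vote_mapping.py can be sketched roughly as follows. This is an illustration under assumptions, not the script itself: it uses Python 3, the function names `split_measures` and `map_by_rank` are hypothetical, the keyword pattern `Vorlage \d+` is a made-up stand-in for the file-specific keywords, and returning an empty mapping when the counts disagree is one possible choice rather than the script's confirmed behavior.

```python
import re


def split_measures(full_text, keyword_pattern):
    """Split one voting day's text into chunks, each starting at a keyword match.

    Text before the first match (e.g. a cover page) is dropped.
    """
    starts = [m.start() for m in re.finditer(keyword_pattern, full_text)]
    if not starts:
        return [full_text.strip()]
    ends = starts[1:] + [len(full_text)]
    return [full_text[s:e].strip() for s, e in zip(starts, ends)]


def map_by_rank(chunks, vote_ids):
    """Pair the i-th chunk with the i-th vote ID of that voting day.

    Assumption: if the counts differ, the split is unreliable, so nothing is mapped.
    """
    if len(chunks) != len(vote_ids):
        return {}
    return dict(zip(vote_ids, chunks))


# Made-up text and vote IDs for a single ABSTIMMUNGSTAG.
text = "Intro text. Vorlage 1 Text about A. Vorlage 2 Text about B."
chunks = split_measures(text, r"Vorlage \d+")
mapping = map_by_rank(chunks, ["101", "102"])
for vote_id, chunk in mapping.items():
    print(vote_id, "->", chunk)
```

Mapping by rank only works when the measures appear in the PDF in the same order as in the Kantonal CSV, which is one reason to treat this step as experimental.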

voting
