Commit 07a4be3

Update readme
1 parent 9de42fa commit 07a4be3

1 file changed (+1, -18 lines changed)

pombola/south_africa/data/members-interests/NEW_README.md

Lines changed: 1 addition & 18 deletions
@@ -7,7 +7,7 @@ There are several files in this directory:
 The scraper currently scrapes `.docx` files.
 To prepare the file:
 
-1. Split the `PDF` into seperate files small enough to open in Google Docs. PDF Arranger works well https://github.com/pdfarranger/pdfarranger
+1. Split the `PDF` into seperate files small enough to open in Google Docs. [PDF Arranger](https://github.com/pdfarranger/pdfarranger) works well
 2. Open the files in Google Docs and download each in `.docx` format
 3. Store the these files in `./docx_files/`
 
@@ -20,23 +20,6 @@ Run the script `html_to_json.py` to scrape the HTML and compile into an easy to
 
 The output should be `register.json`
 
-## Raw data
-
-2010.json
-2011.json
-2012.json
-2013.json
-2014.json
-2015.json
-2016.json
-2017.json
-2018.json
-
-These are the JSON files provided to us by Geoff. They are unchanged and are (I
-believe) generated by scraping code that he has from the PDFs mentioned in
-them. For me these PDF urls 404ed so I was not able to look at the original
-source material.
-
 
 ## Conversion script
 