GitHub - cschnaars/FEC-Scraper: Scripts to scrape the FEC website and parse campaign filings

cschnaars / FEC-Scraper Public
Notifications You must be signed in to change notification settings
Fork 7
Star 45
Scripts to scrape the FEC website and parse campaign filings
Notifications
Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.gitignore		.gitignore
AUTHORS		AUTHORS
FECParser.py		FECParser.py
FECScraper.py		FECScraper.py
FECScraper.sql		FECScraper.sql
README		README
Repository files navigation

FEC Scraper/FEC Parser:
-----------------------
By Christopher Schnaars and Anthony DeBarros, USA TODAY
Developed with Python 2.7.2



Contents:
---------
1 - What's in the box?
2 - Requirements
3 - Notes about FEC data
4 - What is FEC Scraper?
5 - What is FEC Parser?
6 - What don't FEC Scraper and FEC Parser do?
7 - How to incorporate a database manager
8 - Directories



1 - What's in the box?
----------------------
FEC Scraper and FEC Parser are separate Python scripts that work together to
download and parse data from the Federal Election Commission (FEC) provided in
ASCII-28 delimited format. Optionally, you can enable integration with a
database manager (SQL Server by default) to help you import the data and
maintain database integrity. The scripts can handle hundreds or even thousands
of files at once. The end result is eight to 11 tab-delimited text files
housing cleaned-up data ready for easy import into a database manager or
spreadsheet program.

On the github site, you'll find a file called FECScraper.sql, which contains
all the scripts you need to create tables and all other necessary database
objects. If you are using a database manager other than SQL Server, you can
use the schema contained in FECScraper.sql to build tables on a different RDMS.



2 - Requirements
----------------
If you want to enable functionality that integrates FEC Scraper and FEC Parser
with a database manager, you'll need a library to connect to a database manager
of your choice, if you don't use the SQLite functionality baked into Python.
Examples:
 * pyodbc (SQL Server, http://code.google.com/p/pyodbc/downloads/list)
 * mySQLdb (mySQL, http://mysql-python.sourceforge.net)
 * psycopg2 (PostgreSQL, http://initd.org/psycopg/download/)

Otherwise, the scripts just use standard libraries included with Python. These
scripts were built in Python 2.7.2.



3 - Notes about FEC data
------------------------
Every candidate or organization that has to file reports with the FEC is
assigned a committee ID number. FEC Scraper will download all Form 3 filings
for each committe you specify, including Forms 3A, 3N, 3PA, 3PN, 3XA and 3XN.
FEC Parser will churn through the data housed in these files and output
standardized, tab-delimited text files for easy import.

The Form 3 data files include monthly, quarterly and annual reports housing
itemized contributions, expenditures and loans, among other types of
information, so they probably contain everything you need. These reports are
required by the FEC for all PACs and other political organizations as well
as candidates for President and the House of Representatives (but not the
Senate, where members still engage in the archaic process of filing their
reports on paper to hide contributors and expenditures).

As of this writing, the FEC offers data files in both comma and ASCII-28
delimited formats. Because of inherent problems with .csv files (namely, commas
in the data itself), FEC Scraper uses the ASCII-28 delimited format. This ASCII
code is obscure and unlikely to show up in your data.

Each data file contains two (and sometimes three) header rows. Among other
information, these header rows specify the header version used by the data
file. FEC Scraper downloads all Form 3 filings regardless of the header
version, but FEC Parser supports only header versions 6.4, 7.0 and 8.0. Data
files that utilize other header versions are moved to the Review directory and
are not processed.

Please note that as of this writing, if a data file contains a third header
row, that row is ignored. A third header row occurs when data is included for
Column C on 30G reports. You can see what that form looks like at:
http://www.fec.gov/pdf/forms/fecfrm3post.pdf

The script also does not yet handle Schedule H records, which sometimes occur
on Form F3X. Support for this data will be added in an upcoming release. For
what it's worth, processing all data for more than 250 PACs and all candidates
for President and the House of Represenatives yielded just under 50 data rows,
so this issue is minimal. All Schedule H data rows presently are saved to
the Review file.



4 - What is FEC Scraper?
------------------------
FEC Scraper is a Python script you can run to find and download campaign
finance filings from the Federal Election Commission website for specified
committees, including candidates for president and the House of Representatives
as well as Super PACs and other organizations.

FEC Scraper is most powerful when you take advantage of its ability to interact
with a database manager. When this functionality is enabled (by setting the
usedatabaseflag variable to 1), the script will retrieve a list of committees
that already exist in the database as well as all files housed in the database,
then go to the FEC website to look for and scrape new filings for those
committees. Reports already housed in the database will not be downloaded.

You also can specify a "Processed" directory. FEC Parser will move files that
have been parsed successfully to this directory, and FEC Scraper will look in
this directory to determine whether a data file previously has been imported.
This helps ensure that even if the script doesn't interact with a database
manager, you can avoid importing the same file more than once as long as you
don't delete the processed files.

(Development note: In a future version, we could have FEC Parser create and
modify a text file housing a list of all Image IDs that have been processed.)

If you don't interact with a database or if you want to download reports for a
committe that does not exist in the database, you must specify in the script
which committees you want to scrape. This is handled in a Try block near line
100 of the code. There are two ways to specify committees:

First, the code will look for a text file called commidappend.txt. You can
specify a list of committees (one per line) in the file. Don't use commas
or any other separator, and make sure the committe IDs begin with a capital C.
The file should look something like this:
C00431445
C00500587
C00431171

Additionally, you can write code in the Finally block to append the commidlist
list directly. Examples:
commidlist.append('C00431445') # Obama for America
commidlist.append('C00500587') # RickPerry.org Inc.
commidlist.append('C00431171') # Romney for President Inc.



5 - What is FEC Parser?
-----------------------
FEC Parser is a Python script that converts ASCII-28 delimited data files from
the FEC into standard tab-delimited text files that easily can be imported into
any database manager or spreadsheet program.

When you download a Form 3 filing from the FEC, the data file contains two or
three header rows at the top of the file, followed by data housed on eight
separate forms (Schedules A, B, C, C1, C2, D and E and a long-text form). This
data is in no particular order, and each form has different numbers of columns,
and different column headings, making import tricky. Additionally, the FEC
routinely adds and removes columns.

If FEC Parser finds a file that is not a Form 3 filing (see the "Notes about
FEC data" section above), the file will be moved to a Review directory you
specify. In some cases, FEC Parser will append to the filename a brief
explanation of why the file was not parsed.

FEC Parser will go through all files with an .fec extension in whatever
directory you specify and output eight tab-delimited text files: one for each
schedule form. FEC Parser can process hundreds or even thousands of files at
once. Each data row will be prepended with the FEC's unique ID for that file
(called the ImageID) to create parent-child relationships between a header and
each data row.

Any tabs contained in the data are converted to spaces during parsing.

If you implement the functionality that allows FEC Parser to interact with a
database manager, it will check to make sure each report has not previously
been imported into the database. If not, it will load the header rows into
parent tables.

If database functionality is disabled (the default behavior), FEC Parser will
create three additional text files for the F3, F3P and F3X headers.

FEC Parser presently supports only header versions 6.4, 7.0 and 8.0. Data files
that utilize other header versions are moved to the Review directory and are
not processed.

Please note that due to the inherent messiness of FEC data (particularly
problems with missing or excess column delimiters, FEC Parser will add
delimiters to data rows that do not have the correct number of columns and
remove excess delimiters when there are too many. If FEC Parser finds data in
any excess columns, that row is instead written to the review file so no data
is lost.

This could cause data to be written to the wrong column if excess or missing
delimiters occur anywhere other than the end of a data row. I've tested the
data extensively, however, and as of this writing, I have never found a problem
other than at the end of a data row.



6 - What don't FEC Scraper and FEC Parser do?
---------------------------------------------
FEC Scraper downloads only Form 3 filings. (For more information about Form 3
and other filings, see the "Notes about FEC data" section above.) All other
filings are ignored.

Additionally, FEC Scraper does not download data for committees you don't
specify. For the script to download filings for a particular committee you must
do one of two things:
 * You must specify the committee directly in the code (commented-out examples
of how to do this are included in the script).
 * You must enable database integration (by setting usedatabaseflag to 1) AND
you must have data for that committee in the database. (When database
integration is enabled, FEC Scraper will get a list of existing committees from
the database and automatically download all new filings for them.)

If you have database integration enabled, FEC Parser will load the header rows
from the data files into your database. Otherwise, the script does not load any
data into your database. You must manually import the data files after the
script runs.

Additionally, FEC Parser handles only Form 3 filings, and those filings must
have header version 6.4, 7.0 or 8.0. All other form types and all Form 3
filings with other header versions are moved to the Review directory and are
not processed.



7 - How to incorporate a database with FEC Scraper
---------------------------------------------------
Near the top of the code for both FEC Scraper and FEC Parser, in the User
Variables section, you can set the usedatabaseflag variable to 1 if you want
the script to interact with a datatbase manager and 0 if you want the script to
run independent of a database.

Both scripts include code to import the pyodbc module
(http://code.google.com/p/pyodbc/downloads/list) to integrate them with a SQL
Server database. If you want to use a different database manager, you will need
to modify the code to use a different library. Here are a few suggestions:
 * mySQL: MySQLdb module (http://mysql-python.sourceforge.net)
 * PostgreSQL: psycopg2 module (http://initd.org/psycopg/download/)

For the default SQL Server code, you will need to specify the server and the
name of the database (if you've named it something other than FEC) in the
connection string. If necessary, you also can specify a username (UID) and
password (PWD).

When FEC Scraper runs with database integration enabled, it calls two stored
procedures that create lists of Committee IDs and Image IDs that already exist
in the database. FEC Scraper uses these lists to determine what committees to
scrape and what files do not need to be downloaded. The script also will not
download any files that already exist in the processed or review directories.

When FEC Parser runs with databse integration enabled, it will check the
database each time it processes a file to make sure that Image ID does not
already exist in the database. The user will be alerted any time it finds a
duplicate file, and that file will be omitted from the text files created
by the script. Additionally, the header rows are loaded into the
database for you.

When database integration is disabled, FEC Parser will output the header rows
to an additional three data files that you can import manually.

The FECScraper.sql file contains all the SQL Server scripts you need to create
the necessary tables, stored procedures and other database objects. You can use
this code to create similar objects for other database managers.



8 - Directories
---------------
You can specify up to four directories for various files:
 * savedir: This is where FEC Scraper saves files. FEC Parser will process all
.fed files housed in this directory.
 * reviewdir: FEC Scraper gets a list of files housed in this directory and
uses this list to avoid downloading previously downloaded files.
   FEC Parser moves all files that cannot be processed to this directory, such
as files that are not Form 3 or use an unsupported header version. For data
rows that are in a valid file but can't be parsed, FEC Parser writes those data
rows to a single timestamped review file and saves that file here.
   No files in this directory are altered by either script, though FEC Parser
may append to the file name a brief description of why the file could not be
parsed.
 * processeddir: FEC Scraper gets a list of files housed in this directory and
uses this list to avoid downloading previously downloaded files.
   FEC Parser moves all valid, processed files from the save directory to this
directory.
 * outputdir: FEC Scraper does not use this directory. FEC Parser saves the
time-stamped tab-delimited data files it creates in this directory.