This Git repository contains a scrape of the database of companies on the Department of Commerce's Safe Harbor site, located at https://safeharbor.export.gov/list.aspx.
For more information about the Safe Harbor Privacy Principles regulating EU-US and CH-US data exchange, please consult Wikipedia. Compliance with those principles is a prerequisite for running a business in the European Union that collects personal data. Any European citizen, regardless of where they live, is supposed to be protected by those principles, as is any resident of the EU or Switzerland.
The FTC database includes filings by Google, Facebook, Yahoo, Amazon, genetic testing company 23andMe, sharing economy accommodation provider Airbnb, sharing economy transportation provider Uber, data broker Acxiom and MOOC provider Coursera.
The information here is mostly self-reported by the companies. It is then spot-checked by the FTC, but the EU is not always pleased with the level of oversight applied by the FTC.
The FTC database rotates its entries by key number: when a certification is updated, it receives a new key number and the previous entry is deleted.
All entries are collected here, even obsolete ones, while this information is simply overwritten on the FTC site. In this way, one can track how the submissions evolve over time, for instance how Uber recently updated its "business activities".
The json folder contains all these entries. The format used is JSON, which should be easy to parse with just about any tool.
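As a minimal sketch of reading these entries, the snippet below loads every file in a folder into a dictionary keyed by the file name without its extension. It assumes that each file holds a single JSON object and that the folder is called json; the function name load_entries is only illustrative and is not part of this repository.

```python
import json
from pathlib import Path


def load_entries(folder="json"):
    """Read every *.json file in `folder` into a dict keyed by file stem.

    Assumption: each file contains exactly one JSON object (one entry).
    """
    entries = {}
    for path in sorted(Path(folder).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            entries[path.stem] = json.load(handle)
    return entries
```

Calling load_entries("json") from the root of a checkout would then return one dictionary per scraped entry, which can be handed to whatever analysis tool is preferred.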
The data folder also contains a file in CSV format grouping the same information. This conforms to the datapackage specification as defined by the Open Knowledge Foundation. The different fields are defined in machine-readable format in the file datapackage.json.
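The CSV file and its descriptor can be read with the Python standard library alone. The sketch below is an illustration under assumptions: the caller supplies both paths, because the exact file names are not fixed here, and the descriptor is assumed to follow the usual resources, schema, fields layout of a tabular data package. The function names read_rows and declared_fields are hypothetical.

```python
import csv
import json


def read_rows(csv_path):
    """Return the CSV rows as a list of dicts keyed by the header line."""
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def declared_fields(datapackage_path):
    """List the field names declared in a data package descriptor.

    Assumption: the descriptor uses resources -> schema -> fields -> name.
    """
    with open(datapackage_path, encoding="utf-8") as handle:
        package = json.load(handle)
    names = []
    for resource in package.get("resources", []):
        for field in resource.get("schema", {}).get("fields", []):
            names.append(field["name"])
    return names
```

Comparing the header of the CSV file against declared_fields is a quick way to check that a local copy of the data still matches its descriptor.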
For compatibility reasons, I have kept the same keys as in the original HTML, and added a few more:
asp_index: A key I have added, which also gives the file its name: {asp_index}.json. The FTC uses an ASP.NET database, hence the name. Several different files and keys can correspond to the same company.
OrgName: The official name of the registrant. Sometimes an "Inc" will change to a "Corp", but both records still represent the same company. Sometimes many closely related companies are registered separately (for instance, all the Disney companies).
latest: Indicates whether this is the latest entry for this organisation. Matches are based on OrgName.
previous: The asp_index of the previous entry for the same organisation, if any. Blank otherwise.
CertificationStatus: Companies can let their certification lapse, but the Safe Harbor principles remain in force for data that was collected during their certification period.
industrysector: A large set of possible values, and an entry can list more than one, which makes this field harder to parse. Note that 'Leasing Services - (LES)' and 'Legal Services - (LES)' share the same shorthand. Of interest: 'Advertising Services - (ADV)'.
PPLoc: The location of the privacy policy, normally on the organisation's website.
privacyprograms: A bit of a mix; indicates whether the organisation declares compliance with additional privacy programs, such as TRUSTe, HIPAA, SOC2, PCI, CJIS, IAPP, ...
recoursemech: The recourse mechanism for arbitration, for instance TRUSTe, ICDR/AAA, BBB, ...
activities: Long-form description of business activities.
address, city, state, zip, Phone, Fax: Contact information for the company.
contactname1, contacttitle1, contactoff1, contactemail1, contactfax1, contactphone1: Contact person at the company.
corpoffname1, corpofftitle1, corpoffemail1, corpofffax1, corpoffphone1: Corporate officer in charge, and contact information.
eucountries: Relevant EU countries (and CH).
euprotection: Companies have the option of delegating some authority to the EU rather than the FTC.
hrdata: Used for HR data.
orgverification: The body in charge of verifying compliance. Note that this could be in-house.
nextcertification: When the next certification is due.
personaldata: Clarification of the personal data collected (self-reported).
ppdate: Date on the privacy policy.
signupdate: Sign-up date to the Safe Harbor program.
statutorybody: Federal Trade Commission, Department of Transportation, etc.
eucountries_parsed: A more computer-friendly version of the eucountries field, formatted as an array of ISO 3166-1 alpha-3 codes, coming from the country-codes datapackage.
industrysectors_parsed: A more computer-friendly version of the industrysectors field, formatted as an array of sectors.
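The previous key makes it possible to walk back through the successive filings of one organisation. The sketch below illustrates one way to do that, under several assumptions: entries is a dictionary keyed by asp_index as a string (for example built from the file names), a blank previous is stored as an empty string or null, and the helper names history and field_versions are hypothetical rather than part of the scraping scripts.

```python
def history(entries, asp_index):
    """Return the entry for `asp_index` followed by its predecessors,
    newest first, by following the `previous` key."""
    chain = []
    seen = set()  # guards against an accidental loop in the data
    key = str(asp_index)
    while key and key in entries and key not in seen:
        seen.add(key)
        chain.append(entries[key])
        previous = entries[key].get("previous")
        key = "" if previous is None else str(previous).strip()
    return chain


def field_versions(chain, field):
    """Successive distinct values of `field`, oldest first."""
    values = []
    for entry in reversed(chain):
        value = entry.get(field)
        if not values or values[-1] != value:
            values.append(value)
    return values
```

For instance, field_versions(history(entries, some_index), "activities") would list how the declared business activities changed from one filing to the next, where some_index stands for the asp_index of an entry of interest.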
The scripts/ folder contains the scripts necessary for scraping, and a little bit of documentation.