SOP for Composition Extractions

larger-allison edited this page Jun 19, 2020 · 4 revisions

1. Create a group.

If a group has not been created for the set of documents, follow the SOP for uploading new documents to Factotum.

2. Download all relevant files.

If you do not have copies of the documents being extracted, you can download them directly from the datagroup page in Factotum by clicking the Document files button. You also need to download a csv that lists the filenames and their associated Factotum datadocument IDs. There are two ways to get it: go to the Extracted tab and click Download extracted text CSV template, which gives you an empty upload template with the filenames and IDs of all unextracted files in that group filled in; or click the Document records button on the right, which gives you a csv with the filenames, IDs, and other metadata for all documents in the group.
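Once the csv is downloaded, an extraction script usually needs to look up the document ID for each filename. The following is a minimal sketch of that lookup. The column names data_document_id and data_document_filename are assumed to match the upload template described in step 3, and the sample rows are made up; check the header of the file you actually downloaded, since the Document records csv may label its columns differently.

```python
import csv
import io


def load_document_ids(csv_file):
    """Map each filename to its Factotum data document ID."""
    reader = csv.DictReader(csv_file)
    return {
        row["data_document_filename"]: row["data_document_id"]
        for row in reader
    }


# Hypothetical two-row stand-in for a downloaded template.
sample = io.StringIO(
    "data_document_id,data_document_filename\n"
    "1001,shampoo_sds.pdf\n"
    "1002,conditioner_sds.pdf\n"
)
doc_ids = load_document_ids(sample)
print(doc_ids["shampoo_sds.pdf"])
```

For a real file, open it with open(path, newline="") and pass the file object to load_document_ids.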

3. Write a script to extract document data

Write a script that generates a csv with the following columns:

  • data_document_id

    • The Factotum document ID associated with this file
    • You can get a list of the document IDs and filenames for all unextracted documents in a group by navigating to the Extracted tab on a datagroup page and clicking Download extracted text CSV template. You can also click the Document records button on the datagroup page to download a list of all document IDs, filenames, and other document metadata for both extracted and unextracted documents.
  • data_document_filename

    • The name of the file being extracted
  • prod_name

    • The name of the product associated with the document you are extracting
  • doc_date

    • The date the document was last updated. This is either the most recent revision date, or the creation date if no revision date is given. This should not be a print date
  • rev_num

    • The revision or version number of the document you are extracting. Not all documents will have a revision number, so this field will often be blank
  • raw_category

    • A short description of what type of product the document is for. For example, "shampoo" could be a raw category
  • raw_cas

    • The CAS number for the ingredient being extracted
  • raw_chem_name

    • The name of the ingredient being extracted
  • report_funcuse

    • The function an ingredient serves within a product (for example: fragrance, cleaning agent, filler, etc.). Most documents do not include functional use data, so this will typically be blank
  • raw_min_comp

    • If concentration is given in a range, this is the lower limit. If a single number is given or no concentration is specified, this field should be blank
  • raw_max_comp

    • If concentration is given in a range, this is the upper limit. If a single number is given or no concentration is specified, this field should be blank
  • unit_type

    • A code for the unit type listed for the concentration. This will either be a number corresponding to a specific unit type, or blank if there is no concentration specified. See the table below for a list of the unit type options and their codes. Unit type codes can also be found in the database in the table dashboard_unittype.
| Code | Unit type | Description |
|------|-----------|-------------|
| 1 | weight fraction | weight fraction as a decimal |
| 2 | unknown | when text was extracted the unit type was not harvested |
| 3 | percent | weight fraction as a percent (ie 3%) |
| 4 | ppm | parts-per-million |
| 5 | psi | pound per square inch |
| 6 | volume | unit is a measure of volume |
| 7 | nCi | Nanocurie |
| 8 | mg per m3 | milligrams per cubic meter |
| 9 | g per m3 | grams per cubic meter |
| 10 | Iron (Fe) by 3 | used exclusively in data document 268481 |
| 11 | MPN/g TS | used exclusively in data document 268481 |
| 12 | gm | |
| 13 | mol percent | percent by moles |
| 14 | percent volume | percent by volume |
| 15 | other (note units in QA or document notes) | If an unusual unit appears on a document that is not in this table and does not appear in other documents, use this code and specify what the unit type is in the document notes. If it appears in several documents, talk to Kathie and she may add it as a unit type |
| 16 | percent ppm | Siri MSDS indicated, for chemicals, 'Fraction by weight: N % ppm'; unclear what this unit type means |
  • ingredient_rank

    • The position of the ingredient in the ingredient list. For example, the first ingredient is 1, the second is 2, etc. This should always be a whole number. This can be left blank if there is not a clear order
  • raw_central_comp

    • This is the concentration if a single number is given. If a range is listed or no concentration is specified, this field should be blank. Alternatively, a concentration range can be extracted here and the raw_min_comp and raw_max_comp could be left blank. In this case, the range would be separated into min and max values when the composition is cleaned
  • component

    • Use this field when a document covers multiple components and each component has its own ingredient section. For example, if there was an MSDS for a shampoo and conditioner combo pack, the ingredient data associated with the shampoo should all have shampoo listed as the component, and the ingredient data associated with the conditioner should all have conditioner listed as the component. This field is typically blank
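The rules for raw_min_comp, raw_max_comp, raw_central_comp, and unit_type in the list above can be sketched as a small parsing helper. This is only an illustration: split_concentration and UNIT_CODES are hypothetical names, the pattern handles just plain numbers and simple ranges followed by % or ppm, and mapping "%" to code 3 assumes the document means percent by weight. Anything else raises an error so it can be reviewed by hand.

```python
import re

# Codes taken from the unit type table; only two are covered here.
# Assumption: "%" on the document means weight percent (code 3).
UNIT_CODES = {"%": 3, "ppm": 4}

# A number, an optional "- number" range part, then a unit.
PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(?:-\s*(\d+(?:\.\d+)?))?\s*(%|ppm)")


def split_concentration(text):
    """Return (raw_min_comp, raw_max_comp, raw_central_comp, unit_type)."""
    text = text.strip()
    if not text:
        # No concentration specified: every field stays blank.
        return "", "", "", ""
    match = PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognized concentration: {text!r}")
    low, high, unit = match.groups()
    unit_type = str(UNIT_CODES[unit])
    if high is not None:
        # Range: fill min and max, leave central blank.
        return low, high, "", unit_type
    # Single number: fill central, leave min and max blank.
    return "", "", low, unit_type


print(split_concentration("1 - 5%"))   # ('1', '5', '', '3')
print(split_concentration("0.5 ppm"))  # ('', '', '0.5', '4')
print(split_concentration(""))         # ('', '', '', '')
```

Values are kept as the strings found in the document, since these are the raw fields; splitting a range here is one of the two options described for raw_central_comp.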

A common strategy for extracting this data is to convert the batch of pdfs to text and then parse the text files. One way to convert the files is to download and use the Xpdf command line tools.
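A batch conversion along these lines could be driven from Python, as in the sketch below. It assumes the Xpdf pdftotext executable is installed and on your PATH; the folder names and the helper names pdftotext_command and convert_folder are hypothetical. The -layout option asks pdftotext to keep the original physical layout of the text, which often helps with ingredient tables, but it is a choice rather than a requirement.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, out_dir):
    """Build the pdftotext call that writes <out_dir>/<name>.txt."""
    pdf_path = Path(pdf_path)
    txt_path = Path(out_dir) / (pdf_path.stem + ".txt")
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def convert_folder(pdf_dir, out_dir):
    """Convert every .pdf file in pdf_dir; needs pdftotext on the PATH."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        # check=True stops the batch if pdftotext reports an error.
        subprocess.run(pdftotext_command(pdf_path, out_dir), check=True)


# Show the command that would run for one hypothetical file.
cmd = pdftotext_command("pdfs/shampoo_sds.pdf", "text")
print(cmd)
```

Calling convert_folder("pdfs", "text") would then produce one text file per pdf, ready for parsing.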

When creating the csv, the values for data_document_id, data_document_filename, prod_name, doc_date, rev_num, and raw_category should be the same for all chemicals in a document, while the values for raw_cas, raw_chem_name, report_funcuse, raw_min_comp, raw_max_comp, unit_type, ingredient_rank, raw_central_comp, and component can be different for each chemical.

If a document contains no chemicals, the fields data_document_id, data_document_filename, prod_name, doc_date, rev_num, and raw_category can be filled out while all other fields should be left blank.
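The two rules above (document-level fields repeated on every chemical row, and a single row with only document-level fields when there are no chemicals) can be sketched as follows. The function name rows_for_document and the sample product are hypothetical, and the column order simply follows the list on this page; whether Factotum requires a particular order is not stated here.

```python
import csv
import io

COLUMNS = [
    "data_document_id", "data_document_filename", "prod_name", "doc_date",
    "rev_num", "raw_category", "raw_cas", "raw_chem_name", "report_funcuse",
    "raw_min_comp", "raw_max_comp", "unit_type", "ingredient_rank",
    "raw_central_comp", "component",
]


def rows_for_document(doc_fields, chemicals):
    """Repeat the document-level fields on one row per chemical."""
    if not chemicals:
        # No chemicals: one row with only the document-level fields.
        chemicals = [{}]
    rows = []
    for chem in chemicals:
        row = dict.fromkeys(COLUMNS, "")  # every column starts blank
        row.update(doc_fields)
        row.update(chem)
        rows.append(row)
    return rows


# Hypothetical document and ingredient data for illustration only.
doc = {"data_document_id": "1001", "data_document_filename": "shampoo_sds.pdf",
       "prod_name": "Example Shampoo", "raw_category": "shampoo"}
chems = [
    {"raw_cas": "7732-18-5", "raw_chem_name": "Water", "ingredient_rank": "1"},
    {"raw_cas": "56-81-5", "raw_chem_name": "Glycerin", "ingredient_rank": "2"},
]
empty_doc = {"data_document_id": "1002",
             "data_document_filename": "blank_sds.pdf"}

rows = rows_for_document(doc, chems) + rows_for_document(empty_doc, [])

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

To write a file instead of printing, open it with open(path, "w", newline="") and pass that to csv.DictWriter.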

Example of an SDS extraction script using Python

Example of a csv used to upload composition data to Factotum

4. Upload script to GitHub

When the script is finished and the csv is ready to be pushed into Factotum, upload a copy of the script to the data_mgmt_scripts repository and send Kathie a link to the script so she can register it.

5. Upload data to Factotum

Once Kathie has registered the script, you can upload the data by navigating to the datagroup page in Factotum, going to the Extracted tab, and completing the fields. For the first box, select your extraction script from the dropdown list. For the second box, select reported. The third box opens a file explorer, where you select the csv containing the data you want to upload. When this is done, press submit and the data should begin uploading to Factotum.
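Before pressing submit, it can help to check the csv locally against the column rules from step 3. The sketch below is one possible check, not Factotum's own validation: find_problems is a hypothetical helper, and the rules it applies (min and max filled together, ingredient_rank a whole number, unit_type present only when a concentration is) are restated from this page.

```python
import csv
import io

EXPECTED_COLUMNS = {
    "data_document_id", "data_document_filename", "prod_name", "doc_date",
    "rev_num", "raw_category", "raw_cas", "raw_chem_name", "report_funcuse",
    "raw_min_comp", "raw_max_comp", "unit_type", "ingredient_rank",
    "raw_central_comp", "component",
}


def find_problems(csv_file):
    """Return a list of human-readable problems found in the csv."""
    reader = csv.DictReader(csv_file)
    missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return ["missing columns: " + ", ".join(sorted(missing))]
    problems = []
    # The header is line 1, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        if bool(row["raw_min_comp"]) != bool(row["raw_max_comp"]):
            problems.append(f"line {line_no}: min and max must be filled together")
        rank = row["ingredient_rank"]
        if rank and not rank.isdigit():
            problems.append(f"line {line_no}: ingredient_rank is not a whole number")
        has_conc = any(row[c] for c in
                       ("raw_min_comp", "raw_max_comp", "raw_central_comp"))
        if has_conc != bool(row["unit_type"]):
            problems.append(f"line {line_no}: unit_type does not match concentration")
    return problems


# Build a hypothetical csv with one good row and one bad row.
header = sorted(EXPECTED_COLUMNS)
good = dict.fromkeys(header, "")
good.update(raw_min_comp="1", raw_max_comp="5", unit_type="3", ingredient_rank="1")
bad = dict.fromkeys(header, "")
bad.update(raw_min_comp="1", ingredient_rank="first")

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=header)
writer.writeheader()
writer.writerows([good, bad])
buffer.seek(0)

problems = find_problems(buffer)
for problem in problems:
    print(problem)
```

An empty list means none of these particular checks failed; it does not guarantee the upload will be accepted.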