SOP for Composition Extractions

1. Create a group.

If a group has not yet been created for the set of documents, follow the SOP for uploading new documents to Factotum.

2. Download all relevant files.

If you do not have copies of the documents being extracted, you can download them from the datagroup page in Factotum by clicking the Document files button. You also need a csv that lists each filename and its Factotum data document ID. There are two ways to get it: go to the Extracted tab and click Download extracted text CSV template, which gives you an empty upload template with the filenames and IDs of all unextracted files in the group filled in; or click the Document records button on the right, which gives you a csv with the filenames, IDs, and other metadata for all documents in the group.
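Once the csv is downloaded, the extraction script needs a way to look up the ID for each file. The sketch below shows one way to do that with Python's standard csv module. The function name load_document_ids is hypothetical, and the column names data_document_id and data_document_filename are an assumption based on the upload columns listed in step 3; check the header row of your download and adjust them if they differ.

```python
import csv
import io


def load_document_ids(csv_file):
    """Return a {filename: data_document_id} mapping from an open csv file.

    Assumes the download has 'data_document_filename' and
    'data_document_id' columns (an assumption, not confirmed by the SOP).
    """
    reader = csv.DictReader(csv_file)
    return {
        row["data_document_filename"]: row["data_document_id"]
        for row in reader
    }


# Small in-memory stand-in for a downloaded file, for illustration only.
sample = io.StringIO(
    "data_document_id,data_document_filename\n"
    "101,a.pdf\n"
    "102,b.pdf\n"
)
document_ids = load_document_ids(sample)
```

With a real download, open the file with open(path, newline="") and pass the file object to the function in the same way.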

3. Write a script to extract document data

Write a script that generates a csv with the following columns:

data_document_id
- The Factotum document ID associated with this file
- To get the document IDs and filenames for all unextracted documents in a group, go to the Extracted tab on the datagroup page and click Download extracted text CSV template. To get the document IDs, filenames, and other metadata for both extracted and unextracted documents, click the Document records button on the datagroup page.
data_document_filename
- The name of the file being extracted
prod_name
- The name of the product associated with the document you are extracting
doc_date
- The date the document was last updated. This is the most recent revision date, or the creation date if no revision date is given. It should not be a print date.
rev_num
- The revision or version number of the document you are extracting. Not all documents will have a revision number, so this field will often be blank
raw_category
- A short description of what type of product the document is for. For example, "shampoo" could be a raw category
raw_cas
- The CAS number for the ingredient being extracted
raw_chem_name
- The name of the ingredient being extracted
report_funcuse
- The function an ingredient serves within a product (for example, fragrance, cleaning agent, or filler). Most documents do not include functional use data, so this will typically be blank.
raw_min_comp
- If the concentration is given as a range, this is the lower limit. If a single number is given or no concentration is specified, leave this field blank.
raw_max_comp
- If the concentration is given as a range, this is the upper limit. If a single number is given or no concentration is specified, leave this field blank.
unit_type
- A code for the unit type listed for the concentration. This is either a number corresponding to a specific unit type, or blank if no concentration is specified. The table below lists the unit type options and their codes; the codes can also be found in the database table dashboard_unittype.

Code	Unit type	Description
1	weight fraction	weight fraction as a decimal
2	unknown	when text was extracted the unit type was not harvested
3	percent	weight fraction as a percent (e.g., 3%)
4	ppm	parts per million
5	psi	pounds per square inch
6	volume	unit is a measure of volume
7	nCi	nanocurie
8	mg per m3	milligrams per cubic meter
9	g per m3	grams per cubic meter
10	Iron (Fe) by 3	used exclusively in data document 268481
11	MPN/g TS	used exclusively in data document 268481
12	gm
13	mol percent	percent by moles
14	percent volume	percent by volume
15	other (note units in QA or document notes)	If an unusual unit appears on a document that is not in this table and does not appear in other documents, use this code and specify what the unit type is in the document notes. If it appears in several documents, talk to Kathie and she may add it as a unit type
16	percent ppm	Siri MSDS indicated, for chemicals, 'Fraction by weight: N % ppm'; unclear what this unit type means

ingredient_rank
- The position of the ingredient in the ingredient list. For example, the first ingredient is 1, the second is 2, and so on. This should always be an integer. It can be left blank if there is no clear order.
raw_central_comp
- The concentration, if a single number is given. If a range is listed or no concentration is specified, leave this field blank. Alternatively, a concentration range can be extracted here as written, with raw_min_comp and raw_max_comp left blank; in that case the range is separated into min and max values when the composition is cleaned.
component
- Use this field when a document covers multiple components, each with its own ingredient section. For example, in an MSDS for a shampoo and conditioner combo pack, every ingredient row from the shampoo section should have shampoo as the component, and every ingredient row from the conditioner section should have conditioner as the component. This field is typically blank.
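The rules for raw_min_comp, raw_max_comp, raw_central_comp, and unit_type can be sketched as a small parsing helper. This is a minimal illustration, not the method used by any existing extraction script. The function name split_concentration is hypothetical, only the percent and ppm codes from the table above are handled, and falling back to code 2 (unknown) when a number has no recognizable unit is an assumption. Strings the patterns do not recognize (for example "<5%") come back with every field blank and would need handling of their own.

```python
import re

# Subset of the unit type codes from the table above.
UNIT_CODES = {"%": 3, "ppm": 4}
# Assumption: use code 2 (unknown) when a number has no recognizable unit.
UNKNOWN_UNIT = 2

NUM = r"\d+(?:\.\d+)?"
UNIT = r"(%|ppm)?"
# Matches ranges such as "1 - 5 %", "10-30%", "1% to 5%".
RANGE = re.compile(
    rf"^\s*({NUM})\s*%?\s*(?:-|–|to)\s*({NUM})\s*{UNIT}\s*$", re.IGNORECASE
)
# Matches single values such as "0.5%" or "500 ppm".
SINGLE = re.compile(rf"^\s*({NUM})\s*{UNIT}\s*$", re.IGNORECASE)


def split_concentration(text):
    """Split a concentration string into the four upload fields.

    A range fills raw_min_comp and raw_max_comp; a single number fills
    raw_central_comp; anything unrecognized leaves every field blank.
    """
    fields = dict.fromkeys(
        ("raw_min_comp", "raw_max_comp", "raw_central_comp", "unit_type"), ""
    )
    match = RANGE.match(text)
    if match:
        fields["raw_min_comp"] = match.group(1)
        fields["raw_max_comp"] = match.group(2)
        unit = match.group(3)
    else:
        match = SINGLE.match(text)
        if not match:
            return fields  # no concentration recognized: all fields blank
        fields["raw_central_comp"] = match.group(1)
        unit = match.group(2)
    fields["unit_type"] = UNIT_CODES.get((unit or "").lower(), UNKNOWN_UNIT)
    return fields
```

For example, split_concentration("1 - 5 %") fills the min and max fields with "1" and "5" and sets unit_type to 3, while split_concentration("0.5%") fills only raw_central_comp.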

A common strategy for extracting this data is to convert the batch of PDFs to text and then parse the text files. One way to convert the files is to download and use the Xpdf command line tools.
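A minimal sketch of that workflow follows, assuming the Xpdf pdftotext executable is installed and on your PATH. The helper names (pdftotext_command, convert_folder, find_cas_numbers, cas_checksum_ok) are hypothetical. The CAS check uses the standard CAS check digit rule and only flags candidates; since raw_cas is a raw field, you would likely still record the number as it appears on the document.

```python
import re
import subprocess
from pathlib import Path

# Candidate CAS numbers: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def pdftotext_command(pdf_path, out_dir):
    """Build the pdftotext call for one PDF; -layout keeps table columns aligned."""
    pdf_path = Path(pdf_path)
    txt_path = Path(out_dir) / (pdf_path.stem + ".txt")
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def convert_folder(pdf_dir, out_dir):
    """Convert every .pdf in pdf_dir to a .txt file in out_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        subprocess.run(pdftotext_command(pdf_path, out_dir), check=True)


def cas_checksum_ok(cas):
    """Return True if the last digit matches the CAS check digit rule."""
    digits = cas.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    # Weight the digits 1, 2, 3, ... starting from the rightmost body digit.
    total = sum(int(d) * i for i, d in enumerate(reversed(body), start=1))
    return total % 10 == check


def find_cas_numbers(text):
    """Return (cas_string, checksum_ok) for each CAS-like pattern in the text."""
    return [
        (m.group(0), cas_checksum_ok(m.group(0)))
        for m in CAS_PATTERN.finditer(text)
    ]
```

Calling convert_folder("pdfs", "txt") would write one text file per PDF; each text file can then be read and passed to find_cas_numbers or to document-specific parsing code.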

When creating the csv, the values for data_document_id, data_document_filename, prod_name, doc_date, rev_num, and raw_category should be the same for all chemicals in a document, while the values for raw_cas, raw_chem_name, report_funcuse, raw_min_comp, raw_max_comp, unit_type, ingredient_rank, raw_central_comp, and component can be different for each chemical.

If a document contains no chemicals, fill out the fields data_document_id, data_document_filename, prod_name, doc_date, rev_num, and raw_category, and leave all other fields blank.
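The two rules above (document-level values repeated on every chemical row, and a single row with blank chemical fields for a document with no chemicals) can be sketched as follows. The helper names build_rows and write_rows are hypothetical, and the column order simply follows the order in which the fields are listed in this SOP, which is an assumption about what Factotum expects.

```python
import csv
import io

# Columns in the order they are listed in this SOP (assumed upload order).
COLUMNS = [
    "data_document_id", "data_document_filename", "prod_name", "doc_date",
    "rev_num", "raw_category", "raw_cas", "raw_chem_name", "report_funcuse",
    "raw_min_comp", "raw_max_comp", "unit_type", "ingredient_rank",
    "raw_central_comp", "component",
]
# The first six columns are the same for every chemical in a document.
DOC_FIELDS = COLUMNS[:6]


def build_rows(doc, chemicals):
    """Return one row per chemical, or one blank-chemical row if there are none."""
    base = {col: "" for col in COLUMNS}
    base.update({field: doc.get(field, "") for field in DOC_FIELDS})
    if not chemicals:
        return [base]
    return [{**base, **chem} for chem in chemicals]


def write_rows(rows, csv_file):
    """Write the header and rows to an open csv file."""
    writer = csv.DictWriter(csv_file, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


# Illustrative values only; they are not taken from a real document.
doc = {"data_document_id": "101", "data_document_filename": "a.pdf",
       "prod_name": "Example Shampoo", "raw_category": "shampoo"}
chemicals = [
    {"raw_cas": "7732-18-5", "raw_chem_name": "Water", "ingredient_rank": 1},
    {"raw_cas": "64-17-5", "raw_chem_name": "Ethanol", "ingredient_rank": 2},
]
buffer = io.StringIO()
write_rows(build_rows(doc, chemicals), buffer)
```

To write a real file, open it with open(path, "w", newline="") and pass the file object to write_rows in place of the in-memory buffer.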

Example of an SDS extraction script using Python

Example of a csv used to upload composition data to Factotum

4. Upload script to GitHub

When the script is finished and the csv is ready to be uploaded to Factotum, upload a copy of the script to the data_mgmt_scripts repository and send Kathie a link so she can register the script.

5. Upload data to Factotum

Once Kathie has registered the script, you can upload the data by navigating to the datagroup page in Factotum, going to the Extracted tab, and completing the fields. In the first box, select your extraction script from the dropdown list. In the second box, select reported. The third box opens a file explorer, where you select the csv with the data you are uploading. When this is done, press submit and the data should begin uploading to Factotum.
