Extract most important information from XML into simplistic format #13

okworx · 2022-09-26T18:02:58Z

Extract the information items from the XML into the data structure from #12.
See the "ILCD Format in a nutshell" guide for details on what to find where.

First, we need to parse all the flows in order to have the information ready for looking it up later when we read the process(es).

Process

default namespace http://lca.jrc.it/ILCD/Process
xmlns:common="http://lca.jrc.it/ILCD/Common"

Metadata

UUID (string)
/processDataSet/processInformation/dataSetInformation/common:UUID
name (string)
This consists of 4 parts:
/processDataSet/processInformation/dataSetInformation/name/baseName
/processDataSet/processInformation/dataSetInformation/name/treatmentStandardsRoutes
/processDataSet/processInformation/dataSetInformation/name/mixAndLocationTypes
/processDataSet/processInformation/dataSetInformation/name/functionalUnitFlowProperties
which we want to concatenate using a semicolon followed by a space ("; ") as the separator.
reference year (number)
/processDataSet/processInformation/time/common:referenceYear
valid until year (number)
/processDataSet/processInformation/time/common:dataSetValidUntil
geographical representativity (location code, string)
/processDataSet/processInformation/geography/locationOfOperationSupplyOrProduction/@location
reference product(s)
The exchange with the reference flow is this one /processDataSet/exchanges/exchange[@dataSetInternalID=/processDataSet/processInformation/quantitativeReference/referenceToReferenceFlow]
- reference product internal id (integer)
  /processDataSet/processInformation/quantitativeReference/referenceToReferenceFlow
  we will need this internally for parsing and processing
- reference product name (string)
  /processDataSet/exchanges/exchange[@dataSetInternalID=/processDataSet/processInformation/quantitativeReference/referenceToReferenceFlow]/referenceToFlowDataSet/common:shortDescription
- reference product amount (number)
  /processDataSet/exchanges/exchange[@dataSetInternalID=/processDataSet/processInformation/quantitativeReference/referenceToReferenceFlow]/resultingAmount
  This will need to be multiplied by the amount from the flow.
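
The metadata paths above can be read with Python's standard library. The following is a minimal sketch, not the importer's actual code: the function name `extract_process_metadata`, the prefixes `p` and `c`, the output keys, and the sample document are all assumptions made for illustration. Skipping missing name parts before joining is also an assumption. Because `xml.etree.ElementTree` cannot compare an attribute against another path inside a predicate, the reference flow ID is read first and then substituted into the query.

```python
import xml.etree.ElementTree as ET

# Prefixes "p" and "c" are local aliases chosen for this sketch.
NS = {
    "p": "http://lca.jrc.it/ILCD/Process",
    "c": "http://lca.jrc.it/ILCD/Common",
}
NAME_PARTS = ("baseName", "treatmentStandardsRoutes",
              "mixAndLocationTypes", "functionalUnitFlowProperties")


def text(element, path):
    """Return the stripped text at path, or None if absent or empty."""
    value = element.findtext(path, namespaces=NS)
    return value.strip() if value and value.strip() else None


def extract_process_metadata(root):
    info = root.find("p:processInformation", NS)
    name = info.find("p:dataSetInformation/p:name", NS)
    parts = [text(name, f"p:{tag}") for tag in NAME_PARTS]
    ref_id = text(info, "p:quantitativeReference/p:referenceToReferenceFlow")
    # Read the reference ID first, then look up the matching exchange.
    ref_exc = root.find(
        f"p:exchanges/p:exchange[@dataSetInternalID='{ref_id}']", NS)
    location = info.find(
        "p:geography/p:locationOfOperationSupplyOrProduction", NS)
    return {
        "uuid": text(info, "p:dataSetInformation/c:UUID"),
        # Assumption: missing name parts are skipped rather than left blank.
        "name": "; ".join(part for part in parts if part),
        "reference_year": int(text(info, "p:time/c:referenceYear")),
        "valid_until": int(text(info, "p:time/c:dataSetValidUntil")),
        "location": location.get("location"),
        "reference_id": int(ref_id),
        "reference_name": text(
            ref_exc, "p:referenceToFlowDataSet/c:shortDescription"),
        "reference_amount": float(text(ref_exc, "p:resultingAmount")),
    }


# Made-up sample that only contains the fields read above.
SAMPLE = """<processDataSet xmlns="http://lca.jrc.it/ILCD/Process"
    xmlns:common="http://lca.jrc.it/ILCD/Common">
  <processInformation>
    <dataSetInformation>
      <common:UUID>1111</common:UUID>
      <name>
        <baseName>Steel</baseName>
        <mixAndLocationTypes>production mix</mixAndLocationTypes>
      </name>
    </dataSetInformation>
    <quantitativeReference>
      <referenceToReferenceFlow>0</referenceToReferenceFlow>
    </quantitativeReference>
    <time>
      <common:referenceYear>2020</common:referenceYear>
      <common:dataSetValidUntil>2025</common:dataSetValidUntil>
    </time>
    <geography>
      <locationOfOperationSupplyOrProduction location="DE"/>
    </geography>
  </processInformation>
  <exchanges>
    <exchange dataSetInternalID="0">
      <referenceToFlowDataSet refObjectId="aaaa">
        <common:shortDescription>Steel sheet</common:shortDescription>
      </referenceToFlowDataSet>
      <resultingAmount>1.0</resultingAmount>
    </exchange>
  </exchanges>
</processDataSet>"""

metadata = extract_process_metadata(ET.fromstring(SAMPLE))
```

The reference amount returned here is still the raw exchange amount; multiplying it by the flow amount requires the parsed flows.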

Inventory

for each exchange:
here is the list of exchanges: /processDataSet/exchanges
Each of them is uniquely identified by its dataSetInternalID attribute.
One (or multiple, we only need to support one for now) of them is the reference product - the one whose @dataSetInternalID attribute matches the "reference product internal id" from above
- internal ID (integer)
  exchange/@dataSetInternalID
  we need that for internal processing
- flow name (string)
  exchange/referenceToFlowDataSet/common:shortDescription
- flow UUID (string)
  exchange/referenceToFlowDataSet/@refObjectId
- exchange direction (string)
  exchange/exchangeDirection
- exchange amount (double)
  exchange/resultingAmount
For each exchange, we'll need to look up the actual flow that is referenced (from the list of flows that we have parsed before) by its UUID and then read the flow's name, compartment, flow amount and unit.
The amount from the exchange and the amount from the flow need to be multiplied and yield the actual resulting amount for this exchange.
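
As a rough sketch of the exchange loop and the multiplication described above: the function name `extract_exchanges`, the shape of `flows_by_uuid` (a dict keyed by flow UUID with an `amount` entry), the `unlinked` flag, and the sample data are assumptions for illustration, not part of the importer.

```python
import xml.etree.ElementTree as ET

NS = {
    "p": "http://lca.jrc.it/ILCD/Process",
    "c": "http://lca.jrc.it/ILCD/Common",
}


def extract_exchanges(root, flows_by_uuid):
    """Return one dict per exchange, scaled by the referenced flow's amount."""
    rows = []
    for exc in root.findall("p:exchanges/p:exchange", NS):
        ref = exc.find("p:referenceToFlowDataSet", NS)
        flow_uuid = ref.get("refObjectId")
        exchange_amount = float(
            exc.findtext("p:resultingAmount", namespaces=NS))
        # None when the referenced flow was not among the parsed flows.
        flow = flows_by_uuid.get(flow_uuid)
        rows.append({
            "internal_id": int(exc.get("dataSetInternalID")),
            "flow_name": ref.findtext("c:shortDescription", namespaces=NS),
            "flow_uuid": flow_uuid,
            "direction": exc.findtext("p:exchangeDirection", namespaces=NS),
            "exchange_amount": exchange_amount,
            # Actual amount = exchange amount * flow amount.
            "amount": (None if flow is None
                       else exchange_amount * flow["amount"]),
            "unlinked": flow is None,
        })
    return rows


# Made-up sample: one exchange with a known flow, one without.
SAMPLE = """<processDataSet xmlns="http://lca.jrc.it/ILCD/Process"
    xmlns:common="http://lca.jrc.it/ILCD/Common">
  <exchanges>
    <exchange dataSetInternalID="0">
      <referenceToFlowDataSet refObjectId="aaaa">
        <common:shortDescription>Steel sheet</common:shortDescription>
      </referenceToFlowDataSet>
      <exchangeDirection>Output</exchangeDirection>
      <resultingAmount>2.0</resultingAmount>
    </exchange>
    <exchange dataSetInternalID="1">
      <referenceToFlowDataSet refObjectId="bbbb">
        <common:shortDescription>Carbon dioxide</common:shortDescription>
      </referenceToFlowDataSet>
      <exchangeDirection>Input</exchangeDirection>
      <resultingAmount>0.5</resultingAmount>
    </exchange>
  </exchanges>
</processDataSet>"""

flows_by_uuid = {"aaaa": {"name": "Steel sheet", "amount": 1.5}}
rows = extract_exchanges(ET.fromstring(SAMPLE), flows_by_uuid)
```

Keeping unmatched exchanges with `amount` set to `None` instead of raising makes it possible to report how many flows could not be resolved.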

Flow

default namespace http://lca.jrc.it/ILCD/Flow
xmlns:common="http://lca.jrc.it/ILCD/Common"

name (string)
/flowDataSet/flowInformation/dataSetInformation/name/baseName
UUID (string)
/flowDataSet/flowInformation/dataSetInformation/common:UUID
compartment (string)
/flowDataSet/flowInformation/dataSetInformation/classificationInformation/common:elementaryFlowCategorization/common:category[@level=2]
type of flow (string)
/flowDataSet/modellingAndValidation/LCIMethod/typeOfDataSet
reference flow property amount (double)
/flowDataSet/flowProperties/flowProperty[@dataSetInternalID=/flowDataSet/flowInformation/quantitativeReference/referenceToReferenceFlowProperty]/meanValue
reference flow property UUID (string)
/flowDataSet/flowProperties/flowProperty[@dataSetInternalID=/flowDataSet/flowInformation/quantitativeReference/referenceToReferenceFlowProperty]/referenceToFlowPropertyDataSet/@refObjectId
With the UUID of the reference flow property, we can use the lookup function @grain11 wrote to look up the unit.
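
A minimal sketch of reading one flow file with the paths listed above, assuming the standard library's `xml.etree.ElementTree`. The function name `extract_flow`, the prefix `f`, the output keys, and the sample document are hypothetical. ElementTree requires the attribute value in a predicate to be quoted, hence `[@level='2']`. The unit lookup itself is not shown, since it relies on the existing lookup function.

```python
import xml.etree.ElementTree as ET

NS = {
    "f": "http://lca.jrc.it/ILCD/Flow",
    "c": "http://lca.jrc.it/ILCD/Common",
}


def extract_flow(root):
    info = root.find("f:flowInformation", NS)
    data_info = info.find("f:dataSetInformation", NS)
    ref_id = info.findtext(
        "f:quantitativeReference/f:referenceToReferenceFlowProperty",
        namespaces=NS).strip()
    # Resolve the reference flow property by its internal ID.
    prop = root.find(
        f"f:flowProperties/f:flowProperty[@dataSetInternalID='{ref_id}']", NS)
    return {
        "uuid": data_info.findtext("c:UUID", namespaces=NS),
        "name": data_info.findtext("f:name/f:baseName", namespaces=NS),
        # None when the flow has no elementary flow categorization.
        "compartment": data_info.findtext(
            "f:classificationInformation/c:elementaryFlowCategorization"
            "/c:category[@level='2']", namespaces=NS),
        "type": root.findtext(
            "f:modellingAndValidation/f:LCIMethod/f:typeOfDataSet",
            namespaces=NS),
        "amount": float(prop.findtext("f:meanValue", namespaces=NS)),
        "property_uuid": prop.find(
            "f:referenceToFlowPropertyDataSet", NS).get("refObjectId"),
    }


# Made-up sample that only contains the fields read above.
SAMPLE = """<flowDataSet xmlns="http://lca.jrc.it/ILCD/Flow"
    xmlns:common="http://lca.jrc.it/ILCD/Common">
  <flowInformation>
    <dataSetInformation>
      <common:UUID>aaaa</common:UUID>
      <name><baseName>Carbon dioxide</baseName></name>
      <classificationInformation>
        <common:elementaryFlowCategorization>
          <common:category level="0">Emissions</common:category>
          <common:category level="1">Emissions to air</common:category>
          <common:category level="2">Emissions to air, unspecified</common:category>
        </common:elementaryFlowCategorization>
      </classificationInformation>
    </dataSetInformation>
    <quantitativeReference>
      <referenceToReferenceFlowProperty>0</referenceToReferenceFlowProperty>
    </quantitativeReference>
  </flowInformation>
  <modellingAndValidation>
    <LCIMethod><typeOfDataSet>Elementary flow</typeOfDataSet></LCIMethod>
  </modellingAndValidation>
  <flowProperties>
    <flowProperty dataSetInternalID="0">
      <referenceToFlowPropertyDataSet refObjectId="pppp"/>
      <meanValue>1.0</meanValue>
    </flowProperty>
  </flowProperties>
</flowDataSet>"""

flow = extract_flow(ET.fromstring(SAMPLE))
# Index by UUID so exchanges can look flows up later.
flows_by_uuid = {flow["uuid"]: flow}
```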


shirubana · 2022-10-13T21:33:06Z

@okworx is the ilcd importer mostly finished or worked out? Where can I test it or where is the working repo or fork for that?

mfastudillo · 2022-10-22T13:43:35Z

@shirubana in this fork: https://github.com/mfastudillo/brightway2-io.

try:

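from pathlib import Path
import bw2io
import bw2calc
import bw2data
from bw2io.importers.ilcd import ILCDImporter
import pandas as pd

# Select the project and install the default biosphere and LCIA data
bw2data.projects.set_current('ilcd_import')
bw2io.bw2setup()

# Example ILCD archive (path relative to the working directory)
path_to_example = Path('bw2io/data/examples/ilcd-example.zip')
so = ILCDImporter(dirpath=path_to_example, dbname='example_ilcd')
so.apply_strategies()

# Link exchanges to biosphere3, then to activities in the imported database
so.match_database('biosphere3', fields=['database', 'code'])
so.match_database(fields=['database', 'code'])
so.statistics()

# Drop exchanges that are still unlinked, then write the database
so.drop_unlinked(True)
so.write_database()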

You can pick an ILCD example from the GLAD website; I tried a few and it works. Quite a number of elementary flows are not matched and need to be dropped.

JosePauloSavioli · 2023-02-21T02:20:31Z

I had to do a similar process in another project, the Lavoisier, an LCI data format converter (https://github.com/JosePauloSavioli/Lavoisier).

The process worked with a reading function and a mapping class. The mapping class would have an output dictionary to populate, and a mapping dictionary with keys as XML fields and values as function calls to modify the data from these fields and populate the output dictionary. It was something like this (the reading function would take out namespaces automatically):

mapping = {
      "/processDataSet/processInformation/dataSetInformation/UUID": lambda x: setattr(self.output_dict, "UUID", self.modify_UUID(x))
}

The mapping dictionary is passed to the reading function. The reading function parses the XML and verifies if the element is in the mapping. If it is, it organizes the XML data in a dictionary (like the xmltodict library does) and calls the function bound to the element within the mapping dict with the data. The reading function then modifies the data for the new format and returns it to the lambda function, which sets it in the output dictionary. This is the basic flow, but for Lavoisier, the output dict would be an abstraction of the output format of the conversion.
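
To illustrate the general pattern (this is a simplified sketch, not Lavoisier's actual code), the example below streams through the XML with `xml.etree.ElementTree.iterparse`, strips namespaces, tracks the current element path, and calls the handler registered for that path. The names `read_with_mapping`, `local_name` and `output`, and the upper-casing stand-in for the "modify" step, are hypothetical, and only leaf text is passed to the handlers here rather than an xmltodict-style dictionary.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def read_with_mapping(source, mapping):
    """Stream through the XML and call the handler mapped to each path."""
    path = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(local_name(elem.tag))
            continue
        # On "end" the element's text is complete.
        handler = mapping.get("/" + "/".join(path))
        if handler is not None:
            handler((elem.text or "").strip())
        path.pop()
        elem.clear()  # free memory so the whole dataset is never held


# Made-up sample document.
SAMPLE = b"""<processDataSet xmlns="http://lca.jrc.it/ILCD/Process"
    xmlns:common="http://lca.jrc.it/ILCD/Common">
  <processInformation>
    <dataSetInformation>
      <common:UUID>abc-123</common:UUID>
    </dataSetInformation>
  </processInformation>
</processDataSet>"""

output = {}
mapping = {
    # The handler both transforms the value and stores it in the output.
    "/processDataSet/processInformation/dataSetInformation/UUID":
        lambda value: output.update(UUID=value.upper()),
}
read_with_mapping(io.BytesIO(SAMPLE), mapping)
```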

The process worked well with LCA inventory data as it could perform the conversion in unique fields or sets of fields (like passing all data from one exchange to a class that could modify data for the new format). Still, it had some minor drawbacks related to parsing (as it is treated as a continuous flow of information, so the dataset is not loaded in memory).

I saw @mfastudillo is ahead on developing an ILCD importer for Brightway. This interests me a lot, since one of the issues I have with converting datasets is that there is no software where I can import both a .spold and an ILCD .zip file and compare the information in them (JosePauloSavioli/Lavoisier#3). I also had to study the ILCD (and EcoSpold 2) formats a lot to make the conversion possible, so I have extensive knowledge of the format and of reading and working with data in it, and I have been through the struggle of mapping elementary flows between formats D:

@mfastudillo If you are interested, I would like to help you develop the importer. I'm open to a meeting or an exchange of emails if you want (my GitHub page has the email). I could fork it, but I still have limited knowledge of Brightway, so I can help better in other ways.

mfastudillo · 2023-02-21T09:55:18Z

Hi @JosePauloSavioli, sure, contributions are more than welcome! I'll try to update the issues. The importer follows an extract-transform-load logic, and one of the trickiest parts is the "extract" step, where we parse different fields of the ILCD zip file into a list of dictionaries. This does not require much Brightway knowledge, but knowledge of the ILCD format is very useful.

JosePauloSavioli · 2023-02-22T04:10:06Z

Hmmm, this is really a tricky part. I saw difficulties in 4 ways:

- ILCD zip files can be single- or multi-process, so the entire database can be in one file or split across several. The directories can also be nested or sit in the zip file's root folder.
- eILCD has an additional layer of information in the life cycle model dataset.
- References use the URI attribute, but sometimes it doesn't match or exist. There were real-world datasets where I had to search by refObjectId alone because the referenced file inside the zip file had a different version.
- Different software adds different 'flavours' to ILCD. Examples: (i) OpenLCA, which adds an entire namespace of information on top of ILCD that can duplicate information and sometimes comes with EcoSpold 2 UUIDs (the user can pick EcoSpold 2 flows to work with inside the software and the UUID is not modified upon export, really fun) and (ii) GaBi, which adds new combinations of FlowProperties and UnitGroups.
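
One possible way to cope with nested directories and unreliable URIs is to index every XML member of the archive by the UUID it declares, and to fall back to that index when the URI does not resolve. This is only a sketch under assumptions: `index_by_uuid` and `resolve_reference` are hypothetical names, the first common:UUID element found in a file is assumed to be the dataset's own, and version handling is reduced to taking the first candidate.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

COMMON_UUID = "{http://lca.jrc.it/ILCD/Common}UUID"


def index_by_uuid(zf):
    """Map each declared dataset UUID to the zip members that carry it."""
    index = {}
    for name in zf.namelist():
        if not name.lower().endswith(".xml"):
            continue
        root = ET.fromstring(zf.read(name))
        # Assumption: the first common:UUID is the dataset's own UUID.
        uuid_el = root.find(f".//{COMMON_UUID}")
        if uuid_el is not None and uuid_el.text:
            index.setdefault(uuid_el.text.strip(), []).append(name)
    return index


def resolve_reference(zf, index, uri, ref_object_id):
    """Prefer the URI; fall back to the UUID index if it does not resolve."""
    if uri:
        tail = uri.lstrip("./")  # "../flows/x.xml" -> "flows/x.xml"
        for name in zf.namelist():
            # Matches both root-level and nested dataset folders.
            if name == tail or name.endswith("/" + tail):
                return name
    candidates = index.get(ref_object_id, [])
    return candidates[0] if candidates else None


# Made-up archive: a nested folder and a versioned file name.
FLOW_XML = """<flowDataSet xmlns="http://lca.jrc.it/ILCD/Flow"
    xmlns:common="http://lca.jrc.it/ILCD/Common">
  <flowInformation><dataSetInformation>
    <common:UUID>abc</common:UUID>
  </dataSetInformation></flowInformation>
</flowDataSet>"""

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("export/ILCD/flows/abc_02.00.000.xml", FLOW_XML)
    zf.writestr("export/readme.txt", "not a dataset")

archive = zipfile.ZipFile(buffer)
index = index_by_uuid(archive)
# The URI points at a file name that is not in the archive.
found = resolve_reference(archive, index, "../flows/abc.xml", "abc")
```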

I think this can become pretty specific. @mfastudillo, would you mind sharing more about the difficulties that you are having in extracting? Do you prefer me to discuss this in the Issues of your fork?

mfastudillo · 2023-02-23T11:51:15Z

Hi @JosePauloSavioli , yes I think the issues in the forked repository are a better place to discuss the main issues

cmutel · 2023-09-04T08:46:27Z

Issue closed during cleanup for Brightcon 2023

okworx changed the title ~~extract most important information from XML into simplistic format~~ Extract most important information from XML into simplistic format Sep 26, 2022

cmutel added the brightcon2023 (Hackathons occurring at a Brightcon conference) and ilcd labels Sep 27, 2022

cmutel closed this as not planned Sep 4, 2023
