Extract most important information from XML into simplistic format #13

Closed
21 tasks
okworx opened this issue Sep 26, 2022 · 7 comments
Labels
brightcon2023 Hackathons occurring at a Brightcon conference ilcd

Comments

@okworx
Contributor

okworx commented Sep 26, 2022

Extract the information items from the XML into the data structure from #12.
See the "ILCD Format in a nutshell" guide for details on what to find where.

First, we need to parse all the flows in order to have the information ready for looking it up later when we read the process(es).

Process

default namespace http://lca.jrc.it/ILCD/Process
xmlns:common="http://lca.jrc.it/ILCD/Common"

Metadata

  • UUID (string)
    /processDataSet/processInformation/dataSetInformation/common:UUID

  • name (string)
    This consists of 4 parts:
    /processDataSet/processInformation/dataSetInformation/name/baseName
    /processDataSet/processInformation/dataSetInformation/name/treatmentStandardsRoutes
    /processDataSet/processInformation/dataSetInformation/name/mixAndLocationTypes
    /processDataSet/processInformation/dataSetInformation/name/functionalUnitFlowProperties
    which we want to concatenate using a semicolon followed by a space ("; ") as the separator.

  • reference year (number)
    /processDataSet/processInformation/time/common:referenceYear

  • valid until year (number)
    /processDataSet/processInformation/time/common:dataSetValidUntil

  • geographical representativity (location code, string)
    /processDataSet/processInformation/geography/locationOfOperationSupplyOrProduction/@location

  • reference product(s)
    The exchange with the reference flow is this one /processDataSet/exchanges/exchange[@dataSetInternalID=/processDataSet/processInformation/quantitativeReference/referenceToReferenceFlow]

    • reference product internal id (integer)
      /processDataSet/processInformation/quantitativeReference/referenceToReferenceFlow
      we will need this internally for parsing and processing

    • reference product name (string)
      /processDataSet/exchanges/exchange[@dataSetInternalID=/processDataSet/processInformation/quantitativeReference/referenceToReferenceFlow]/referenceToFlowDataSet/common:shortDescription

    • reference product amount (number)
      /processDataSet/exchanges/exchange[@dataSetInternalID=/processDataSet/processInformation/quantitativeReference/referenceToReferenceFlow]/resultingAmount
      This will need to be multiplied by the amount from the referenced flow.
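
A minimal sketch of how the process metadata listed above could be read with Python's standard library. The function name `extract_process_metadata` and the keys of the returned dictionary are made up for illustration and are not the data structure from #12. Skipping missing name parts is also an assumption. ElementTree cannot evaluate the nested XPath predicate used above, so the reference exchange is matched in Python instead.

```python
import xml.etree.ElementTree as ET

NS = {
    "p": "http://lca.jrc.it/ILCD/Process",
    "common": "http://lca.jrc.it/ILCD/Common",
}
NAME_PARTS = ("baseName", "treatmentStandardsRoutes",
              "mixAndLocationTypes", "functionalUnitFlowProperties")


def _text(node, path):
    """Stripped text of the first element matching path, or None."""
    value = node.findtext(path, namespaces=NS)
    return value.strip() if value else None


def extract_process_metadata(xml_string):
    root = ET.fromstring(xml_string)  # root element is <processDataSet>
    info = "p:processInformation"
    parts = [_text(root, f"{info}/p:dataSetInformation/p:name/p:{tag}")
             for tag in NAME_PARTS]
    ref_id = _text(root, f"{info}/p:quantitativeReference/p:referenceToReferenceFlow")
    # Find the exchange whose dataSetInternalID equals the reference flow id.
    ref_exchange = next(
        (e for e in root.findall("p:exchanges/p:exchange", NS)
         if ref_id is not None and e.get("dataSetInternalID") == ref_id),
        None,
    )
    ref_name = ref_amount = None
    if ref_exchange is not None:
        ref_name = _text(ref_exchange, "p:referenceToFlowDataSet/common:shortDescription")
        raw_amount = _text(ref_exchange, "p:resultingAmount")
        ref_amount = float(raw_amount) if raw_amount else None
    year = _text(root, f"{info}/p:time/common:referenceYear")
    valid_until = _text(root, f"{info}/p:time/common:dataSetValidUntil")
    location = root.find(f"{info}/p:geography/p:locationOfOperationSupplyOrProduction", NS)
    return {
        "uuid": _text(root, f"{info}/p:dataSetInformation/common:UUID"),
        # Assumption: name parts that are absent or empty are skipped.
        "name": "; ".join(p for p in parts if p),
        "reference_year": int(year) if year else None,
        "valid_until_year": int(valid_until) if valid_until else None,
        "location": location.get("location") if location is not None else None,
        "reference_flow_internal_id": int(ref_id) if ref_id else None,
        "reference_flow_name": ref_name,
        # Exchange amount only; not yet multiplied by the flow amount.
        "reference_flow_amount": ref_amount,
    }
```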

Inventory

  • for each exchange:
    here is the list of exchanges: /processDataSet/exchanges
    Each of them is uniquely identified by its dataSetInternalID attribute.
    One (or multiple, we only need to support one for now) of them is the reference product - the one whose @dataSetInternalID attribute matches the "reference product internal id" from above

    • internal ID (integer)
      exchange/@dataSetInternalID
      we need that for internal processing

    • flow name (string)
      exchange/referenceToFlowDataSet/common:shortDescription

    • flow UUID (string)
      exchange/referenceToFlowDataSet/@refObjectId

    • exchange direction (string)
      exchange/exchangeDirection

    • exchange amount (double)
      exchange/resultingAmount

    For each exchange, we'll need to look up the referenced flow by its UUID (in the list of flows that we parsed before) and then read the flow's name, compartment, flow amount and unit.
    The amount from the exchange multiplied by the amount from the flow yields the actual resulting amount for this exchange.
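
The lookup and multiplication described above could look roughly like this. It is a sketch under stated assumptions: `extract_exchanges` is a made-up name, and `flows_by_uuid` is assumed to be a dictionary built beforehand from the parsed flows, mapping each flow UUID to a dictionary with `name`, `compartment`, `amount` and `unit` keys. Exchanges whose flow cannot be found are kept with an amount of `None` rather than dropped.

```python
import xml.etree.ElementTree as ET

NS = {
    "p": "http://lca.jrc.it/ILCD/Process",
    "common": "http://lca.jrc.it/ILCD/Common",
}


def extract_exchanges(process_xml, flows_by_uuid):
    """Return one dict per exchange, enriched with data from the looked-up flow."""
    root = ET.fromstring(process_xml)
    rows = []
    for exc in root.findall("p:exchanges/p:exchange", NS):
        ref = exc.find("p:referenceToFlowDataSet", NS)
        flow_uuid = ref.get("refObjectId") if ref is not None else None
        flow = flows_by_uuid.get(flow_uuid)  # None if the flow was not parsed

        raw = exc.findtext("p:resultingAmount", namespaces=NS)
        exchange_amount = float(raw) if raw else None

        # Actual amount = exchange amount * reference flow property amount.
        amount = None
        if flow is not None and exchange_amount is not None:
            amount = exchange_amount * flow["amount"]

        short_name = None
        if ref is not None:
            short_name = ref.findtext("common:shortDescription", namespaces=NS)

        rows.append({
            # Assumes every exchange carries the dataSetInternalID attribute.
            "internal_id": int(exc.get("dataSetInternalID")),
            "flow_uuid": flow_uuid,
            # Prefer the name from the flow data set, fall back to the exchange.
            "flow_name": flow["name"] if flow is not None else short_name,
            "compartment": flow["compartment"] if flow is not None else None,
            "unit": flow["unit"] if flow is not None else None,
            "direction": exc.findtext("p:exchangeDirection", namespaces=NS),
            "exchange_amount": exchange_amount,
            "amount": amount,
        })
    return rows
```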

Flow

default namespace http://lca.jrc.it/ILCD/Flow
xmlns:common="http://lca.jrc.it/ILCD/Common"

  • name (string)
    /flowDataSet/flowInformation/dataSetInformation/name/baseName

  • UUID (string)
    /flowDataSet/flowInformation/dataSetInformation/common:UUID

  • compartment (string)
    /flowDataSet/flowInformation/dataSetInformation/classificationInformation/common:elementaryFlowCategorization/common:category[@level=2]

  • type of flow (string)
    /flowDataSet/modellingAndValidation/LCIMethod/typeOfDataSet

  • reference flow property amount (double)
    /flowDataSet/flowProperties/flowProperty[@dataSetInternalID=/flowDataSet/flowInformation/quantitativeReference/referenceToReferenceFlowProperty]/meanValue

  • reference flow property UUID (string)
    /flowDataSet/flowProperties/flowProperty[@dataSetInternalID=/flowDataSet/flowInformation/quantitativeReference/referenceToReferenceFlowProperty]/referenceToFlowPropertyDataSet/@refObjectId
    With the UUID of the reference flow property, we can use the lookup function @grain11 wrote to look up the unit.
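
A minimal sketch of reading these flow fields with the standard library. `extract_flow` and the dictionary keys are hypothetical names. The unit lookup by flow property UUID is left out because the signature of that lookup function is not shown in this issue.

```python
import xml.etree.ElementTree as ET

NS = {
    "f": "http://lca.jrc.it/ILCD/Flow",
    "common": "http://lca.jrc.it/ILCD/Common",
}


def _text(node, path):
    """Stripped text of the first element matching path, or None."""
    value = node.findtext(path, namespaces=NS)
    return value.strip() if value else None


def extract_flow(xml_string):
    root = ET.fromstring(xml_string)  # root element is <flowDataSet>
    info = "f:flowInformation/f:dataSetInformation"
    ref_id = _text(
        root, "f:flowInformation/f:quantitativeReference/f:referenceToReferenceFlowProperty"
    )
    # Match the reference flow property in Python; ElementTree does not
    # support a path expression as the value of an attribute predicate.
    ref_prop = next(
        (p for p in root.findall("f:flowProperties/f:flowProperty", NS)
         if ref_id is not None and p.get("dataSetInternalID") == ref_id),
        None,
    )
    amount = prop_uuid = None
    if ref_prop is not None:
        mean = _text(ref_prop, "f:meanValue")
        amount = float(mean) if mean else None
        target = ref_prop.find("f:referenceToFlowPropertyDataSet", NS)
        prop_uuid = target.get("refObjectId") if target is not None else None
    return {
        "uuid": _text(root, f"{info}/common:UUID"),
        "name": _text(root, f"{info}/f:name/f:baseName"),
        "compartment": _text(
            root,
            f"{info}/f:classificationInformation/"
            "common:elementaryFlowCategorization/common:category[@level='2']",
        ),
        "type": _text(root, "f:modellingAndValidation/f:LCIMethod/f:typeOfDataSet"),
        "amount": amount,
        "flow_property_uuid": prop_uuid,
    }
```

Collecting the results as `{flow["uuid"]: flow for flow in parsed_flows}` gives the lookup table needed when reading the process exchanges.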

@okworx okworx changed the title extract most important information from XML into simplistic format Extract most important information from XML into simplistic format Sep 26, 2022
@cmutel cmutel added brightcon2023 Hackathons occurring at a Brightcon conference ilcd labels Sep 27, 2022
@shirubana

@okworx is the ILCD importer mostly finished or worked out? Where can I test it, or where is the working repo or fork for it?

@mfastudillo

@shirubana in this fork: https://github.com/mfastudillo/brightway2-io.

try:

from pathlib import Path
import bw2io
import bw2calc
import bw2data
from bw2io.importers.ilcd import ILCDImporter
import pandas as pd

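# Use a dedicated project and run the default Brightway setup
bw2data.projects.set_current('ilcd_import')
bw2io.bw2setup()

# Extract the example ILCD archive and apply the importer's strategies
path_to_example = Path('bw2io/data/examples/ilcd-example.zip')
so = ILCDImporter(dirpath=path_to_example, dbname='example_ilcd')
so.apply_strategies()

# Link exchanges against biosphere3 first, then against the imported data itself
so.match_database('biosphere3', fields=['database', 'code'])
so.match_database(fields=['database', 'code'])
so.statistics()

# Drop the exchanges that are still unlinked, then write the database
so.drop_unlinked(True)
so.write_database()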

You can pick an ILCD example from the GLAD website; I tried a few and it works. Quite a number of elementary flows are not matched and need to be dropped.

@JosePauloSavioli

I had to do a similar process in another project, Lavoisier, an LCI data format converter (https://github.com/JosePauloSavioli/Lavoisier).

The process worked with a reading function and a mapping class. The mapping class would have an output dictionary to populate, and a mapping dictionary with keys as XML fields and values as function calls to modify the data from these fields and populate the output dictionary. It was something like this (the reading function would take out namespaces automatically):

mapping = {
      "/processDataSet/processInformation/dataSetInformation/UUID": lambda x: setattr(self.output_dict, "UUID", self.modify_UUID(x))
}

The mapping dictionary is passed to the reading function. The reading function parses the XML and verifies if the element is in the mapping. If it is, it organizes the XML data in a dictionary (like the xmltodict library does) and calls the function bound to the element within the mapping dict with the data. The reading function then modifies the data for the new format and returns it to the lambda function, which sets it in the output dictionary. This is the basic flow, but for Lavoisier, the output dict would be an abstraction of the output format of the conversion.
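
To make the idea concrete, here is a small self-contained sketch of a mapping-driven reader. It is not Lavoisier's actual code: `read_with_mapping`, `ProcessMapping` and `modify_uuid` are hypothetical names, the handler receives only the element text instead of an xmltodict-style dictionary, and the whole tree is loaded in memory rather than processed as a stream.

```python
import xml.etree.ElementTree as ET


def strip_ns(tag):
    """Turn '{namespace}tag' into 'tag'."""
    return tag.rsplit("}", 1)[-1]


def read_with_mapping(xml_string, mapping):
    """Walk the XML tree and call the handler registered for each mapped path."""
    root = ET.fromstring(xml_string)

    def walk(elem, parent_path):
        path = f"{parent_path}/{strip_ns(elem.tag)}"
        handler = mapping.get(path)
        if handler is not None:
            handler((elem.text or "").strip())
        for child in elem:
            walk(child, path)

    walk(root, "")


class ProcessMapping:
    """Owns the output dictionary and the path-to-handler mapping."""

    def __init__(self):
        self.output = {}
        self.mapping = {
            "/processDataSet/processInformation/dataSetInformation/UUID":
                lambda x: self.output.__setitem__("UUID", self.modify_uuid(x)),
        }

    def modify_uuid(self, value):
        # Placeholder transformation; a real converter would adapt the
        # value to the target format here.
        return value.lower()
```

Usage is `m = ProcessMapping()` followed by `read_with_mapping(xml_string, m.mapping)`, after which `m.output` holds the converted fields. Adding a field means adding one entry to the mapping, without touching the reader.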

The process worked well with LCA inventory data, as it could convert single fields or sets of fields (for example, passing all data from one exchange to a class that modifies it for the new format). Still, it had some minor drawbacks related to parsing: the XML is treated as a continuous stream of information, so the full dataset is never loaded in memory.

I saw @mfastudillo is ahead on developing an ILCD importer for Brightway. This interests me a lot, since one of the issues I have with converting datasets is that there is no software where I can import both a .spold and an ILCD .zip file to compare their information (JosePauloSavioli/Lavoisier#3). I also had to study the ILCD (and EcoSpold 2) formats extensively to make the conversion possible, so I have extensive knowledge of the format and of reading and working with data in it, and I have been through the struggle of mapping elementary flows between formats D:

@mfastudillo If you are interested, I would like to help you develop the importer. I'm open to a meeting or an exchange of emails if you want (my GitHub page has the email). I could fork it, but I still have limited knowledge of Brightway, so I can help better in other ways.

@mfastudillo

Hi @JosePauloSavioli, sure, contributions are more than welcome! I'll try to update the issues. The importer follows an extract - transform - load logic, and one of the trickiest things is the "extract" part, where we parse different fields of the ILCD zip file into a list of dictionaries. This does not require much Brightway knowledge, but knowledge of the ILCD format is very useful.

@JosePauloSavioli

Hmmm, this is really a tricky part. I saw difficulties in 4 ways:

  • ILCD zip files can be single or multi process, so basically one can have the entire database in one file or separated between several of them. The directories also can be nested or in the zip file root folder
  • eILCD has an additional layer of information in the life cycle model dataset
  • References use the URI attribute, but sometimes it doesn't match or exist. There were cases of real-world datasets where I had to search by refObjectId alone because the referenced file had a different version inside the zip file
  • Different software will add different 'flavours' to the ILCD. Examples: (i) OpenLCA, that adds an entire namespace of information on top of ILCD which can duplicate information and sometimes comes with EcoSpold 2 UUIDs (as the user can pick EcoSpold 2 flows to work with inside the software and the UUID is not modified upon export, really fun) and (ii) GaBi which adds new combinations of FlowProperties and UnitGroups
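
One possible way to cope with the first and third points is to ignore directory layout and URIs altogether and index every XML file in the archive by the UUID it declares. The sketch below assumes that the first `common:UUID` element in document order is the data set's own UUID, and it deliberately ignores versions. `index_zip_by_uuid` is a hypothetical helper, not part of any importer.

```python
import xml.etree.ElementTree as ET
import zipfile

UUID_TAG = "{http://lca.jrc.it/ILCD/Common}UUID"


def index_zip_by_uuid(zip_source):
    """Map each data set UUID to the archive members that declare it.

    zip_source can be a path or a binary file-like object. Nested folders
    are handled because every member ending in .xml is inspected.
    """
    index = {}
    with zipfile.ZipFile(zip_source) as archive:
        for member in archive.namelist():
            if not member.lower().endswith(".xml"):
                continue
            with archive.open(member) as handle:
                try:
                    root = ET.parse(handle).getroot()
                except ET.ParseError:
                    continue  # skip files that are not well-formed XML
            # Assumption: the first common:UUID is the data set's own UUID.
            node = root.find(f".//{UUID_TAG}")
            if node is not None and node.text:
                index.setdefault(node.text.strip(), []).append(member)
    return index
```

A reference can then be resolved with `index.get(ref_object_id, [])`. Getting more than one member back signals that several versions of the same data set are present and a choice has to be made.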

I think this can become pretty specific. @mfastudillo, would you mind sharing more about the difficulties that you are having in extracting? Do you prefer me to discuss this in the Issues of your fork?

@mfastudillo

Hi @JosePauloSavioli, yes, I think the issues in the forked repository are a better place to discuss the main problems.

@cmutel
Member

cmutel commented Sep 4, 2023

Issue closed during cleanup for Brightcon 2023

@cmutel cmutel closed this as not planned Sep 4, 2023

5 participants