JGCRI/io_flows

IO FLOWS DATA VISUALIZATION TOOL: README

Disclaimer

This work was supported by the U.S. Environmental Protection Agency Contract No. EP-C-16-021. The views expressed in this article are those of the authors and do not necessarily represent the views or policies of the U.S. Environmental Protection Agency.

Motivation

The IO Flows visualization tool is designed to provide insight into sectoral data from the GCAM model. The tool uses a network or graph structure to visualize the relationships between sectors in the GCAM framework. This structure allows users to gain a better understanding of how sectors compete for resources in the global economy.

Terminology

For this exercise, we will use the language of graph theory:

  • A graph is a data structure comprised of nodes and edges.
  • Nodes form the basis of a graph. Each node represents a single entity with associated information. In this case, nodes are sectors, subsectors, and technologies represented by GCAM.
  • Edges represent relationships between nodes. For this application, the exchanges of products between sectors are represented as edges. The networks described here use directed edges, meaning that not all relationships are symmetric. Products may move from sector A to sector B, but this does not imply that products are also moving from sector B to sector A.
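As a small illustration of these terms, the sketch below stores a directed graph as two R data frames, one for nodes and one for edges. The node names and column names are hypothetical and are not taken from GCAM output.

```r
# Hypothetical toy network: names are illustrative, not taken from GCAM output.
nodes <- data.frame(
  id    = 1:3,
  label = c("sector A", "sector B", "sector C"),
  stringsAsFactors = FALSE
)

# Each row is one directed edge: products move from `from` to `to`.
edges <- data.frame(
  from = c(1, 2),
  to   = c(2, 3)
)

# TRUE if a directed edge exists from node `a` to node `b`.
has_edge <- function(edges, a, b) {
  any(edges$from == a & edges$to == b)
}

has_edge(edges, 1, 2)  # A -> B exists
has_edge(edges, 2, 1)  # B -> A does not, because edges are directed
```

Because direction is stored explicitly in the `from` and `to` columns, an edge from A to B says nothing about an edge from B to A.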

Data information

The tool runs on data files output by the GCAM model, which can be queried from GCAM's embedded database. Specifically, to construct the network, the tool reads in two files: inputs_tech and inputs_resources, which can be generated by querying a GCAM output database. These large data files are not included in the repository by default, but the queries to create them are available and located in queries/Flow_viz_queries.xml. These queries have been tested on GCAM versions 5, 6, and 7. More detail on the generation of these input data files can be found in a later section. These files provide the structure of the network by detailing transfers from one sector, subsector, or technology to another.

The R script calc_nodes_and_edges.R performs the necessary data transformations and produces two output files: nodes_detailed.csv and edges_detailed.csv, which will be automatically read by the visualization tool. The following three sets of sample nodes and edges files are included in this repository and allow the user to launch the IO flows tool and begin exploring the GCAM network without downloading the model or running any queries. Instructions for selecting these datasets using the config file can be found in a later section.

  • gcam_v70/nodes_detailed and gcam_v70/edges_detailed: These files have been generated from GCAM 7.0, released in 2023.
  • gcam_v60/nodes_detailed and gcam_v60/edges_detailed: These files have been generated from GCAM 6.0, released in 2022.
  • gcam_vT/nodes_detailed and gcam_vT/edges_detailed: These files have been generated from GCAM-T, which is a modified version of the core GCAM model with an emphasis on transportation energy technologies, including biofuels. The data files included correspond with the version of GCAM-T used in the Model Comparison Document, published by the U.S. EPA in the July 2023 RFS Set rule (88 FR 44468).

These files can be used by the tool to skip the querying and data processing steps.

Additionally, the tool is set up to display logit values as part of the network for sectors and subsectors. If a local version of GCAM exists, the tool will attempt to search through the relevant files in this directory to find and process logit values. The R script process_logits.R is set up to read these data files and output a single table with all of the given logit values that can be joined with the table of nodes. Once this script is run, the table of logits can be found in io_flows/processed_data/full_logits_table.csv. If the tool cannot find GCAM data to search for logits, the data processing script will skip this step and all logits will be set to NA. It is important, especially if working on a computer with OneDrive or similar file-sharing functionality, to save the local version of GCAM directly to the hard drive, separately from OneDrive. This will ensure that all relevant logit files get read by R.
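The join between the logits table and the nodes table can be pictured with a minimal base R sketch. The column names (`name`, `logit`) and the logit values are made up for illustration; the actual tables produced by process_logits.R may be structured differently.

```r
# Hypothetical column names and made-up values; the real tables may differ.
nodes <- data.frame(
  name = c("regional oil", "crude oil", "beef"),
  stringsAsFactors = FALSE
)
logits <- data.frame(
  name  = c("regional oil", "crude oil"),
  logit = c(-3, -6),
  stringsAsFactors = FALSE
)

# Left join: keep every node and attach a logit where one is available.
# Nodes without a match (here, "beef") get NA.
nodes_with_logits <- merge(nodes, logits, by = "name", all.x = TRUE, sort = FALSE)

# With no logits table at all, every node simply gets NA.
nodes_no_logits <- transform(nodes, logit = NA_real_)
```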

An important caveat with the logits data is that these files are inconsistent both in formatting and naming conventions. The process_logits.R script is capable of handling some inconsistencies, including differing number of comment lines and certain arrangements of column names. Additionally, much of the script is dedicated to manually correcting naming mismatches between the sectors, subsectors, and technologies used in the network and the sector names given in the logit files. Much of this is done automatically. The code can also be supplemented to add additional logits manually if the user has insight into these values. The file resources/logits_manual.csv can be used to manually add logits values, though the code is set to ignore this by default.
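As a rough sketch of how differing numbers of comment lines and differing column names can be tolerated, the example below reads a small made-up file in base R. It assumes comment lines start with "#" and uses invented column names; it is not the logic in process_logits.R.

```r
# Illustrative file contents; the "#" comment prefix and column names are assumptions.
csv_text <- c(
  "# File: example_logits.csv",
  "# Description: made-up values for illustration",
  "Supplysector,Logit.Exponent",
  "regional oil,-3",
  "beef,-1.5"
)
path <- tempfile(fileext = ".csv")
writeLines(csv_text, path)

# comment.char = "#" makes read.csv skip comment lines, however many there are.
raw <- read.csv(path, comment.char = "#", stringsAsFactors = FALSE)

# Normalize column names so differently formatted headers line up.
names(raw) <- tolower(gsub("[^A-Za-z0-9]+", "_", names(raw)))
```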

Set-up instructions

Installing R and RStudio

The visualization tool is written as an RShiny app and requires R to be installed to run locally. We recommend running the R programming language in the RStudio IDE. Users can download and install the latest version of R from https://cran.rstudio.com/. The tool was developed using R version 4.1.1 and we recommend using this version or later. After installing the R language, RStudio can be downloaded and installed from https://www.rstudio.com/products/rstudio/download/#download. RStudio will indicate which version of R it is running each time it starts up. If you have multiple versions of R installed, you might need to switch between versions. This can be done by selecting Tools > Global Options and selecting the proper version as indicated in the image below.

[Image: Selecting the proper version of R]

Setting up the repository

Clone this repository to your local machine to run the tool. There are several ways to accomplish this. The easiest is to download GitHub Desktop. Once this software has been downloaded and installed, you can either:

  • In the desktop app, go to File > Clone repository and enter the URL given in this GitHub repository (red box in the image below)
  • In your web browser, open the GitHub page for this repository, select the <> Code button, and select Open with GitHub Desktop (blue box in the image below)

Either way, you will need to select a location on your local machine to clone the repository to. All code and data will then be copied to this location.

[Image: cloning the repository]

Setting up the virtual environment and installing required libraries

We use an R virtual environment (via the package renv) to manage package installation and dependencies. To run the IO Flows visualization tool, the following R packages are required:

  • rgcam
  • dplyr
  • tidyr
  • configr
  • tcltk
  • visNetwork
  • shiny
  • shinyWidgets

Rather than installing packages via the standard process, loading an renv environment satisfies these requirements by replicating the developer's environment on your local R instance. renv operates via a .lock file, which is included in the GitHub repository. If there are issues, check that your working directory is set correctly: it should point to the base io_flows folder.

If renv is not already installed, install it with the command below. This should be the only installation command you need to run. If trouble is encountered, more information on the renv package can be found at this site.

install.packages("renv")

Once the renv package is installed and loaded, to initialize renv, run

renv::init(bare = TRUE, force = TRUE)

then

renv::restore()

If there are issues with the lock file being out of sync, these may be resolved by explicitly adding the system path to the lock file on your computer as an argument to this function. For example:

renv::restore(lockfile = "C:/i-o_flows/gcam-mcs/io_flows/renv.lock")

Other issues with the renv package may be encountered on Mac operating systems, specifically when attempting to install the rgcam package (see next section). If any issues are encountered, users can attempt to resolve this by manually deleting the renv folder in the directory, then re-initializing the renv package in the current directory.

You should see this initialize the process of installing packages and dependencies to match the environment under which the tool was developed. If prompted by the installer, enter "Y" in the console to confirm the installation options.
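Since several of the issues above come down to the working directory or the lock file location, a small pre-flight check can help. The helper below is a hypothetical convenience function, not part of the tool; it only confirms that renv.lock is present before renv::restore() is called.

```r
# Hypothetical helper: confirm the given directory looks like the io_flows
# base folder (it contains renv.lock) before calling renv::restore().
find_lockfile <- function(dir = getwd()) {
  lockfile <- file.path(dir, "renv.lock")
  if (!file.exists(lockfile)) {
    stop("No renv.lock found in ", dir,
         "; set the working directory to the base io_flows folder.")
  }
  normalizePath(lockfile)
}

# Usage (commented out because it installs packages):
# renv::restore(lockfile = find_lockfile())
```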

Setting up the rgcam package

Once all of the standard package installations are handled by renv, there is one final installation step. The rgcam package needs to be installed to perform queries on a GCAM database. rgcam is a package developed by JGCRI at PNNL, and the package can be found here. There are instructions for setup in the README file in that repository, but the necessary installation steps are included here as well.

  1. To start, make sure the package devtools is installed and loaded. renv should have installed/updated the correct version of this package.

  2. Create a github personal access token (PAT). Run

usethis::create_github_token()

That command should open github in your browser and walk you through the process of creating this token. If you get prompted by R or github to enter a password, enter the new token in the password field, not your github password. Then, to link to the new token, run

gitcreds::gitcreds_set()

  3. Now run the install command from the JGCRI installation instructions:

install_github("JGCRI/rgcam", build_vignettes=TRUE)

The installation may prompt the user to install some dependencies if the renv package encountered any issues. If this doesn't install properly, try again with

build_vignettes=FALSE

You will also need to have Java installed on your machine to be able to run GCAM queries in R. The rgcam instructions suggest Java version 7 or later. This step is only necessary if the rgcam package is going to be used to query a local GCAM database. If the default Java linked above cannot be installed (due to licensing or permission issues or otherwise), an open-source Java version can be used. For Windows, download the linked zip file and unzip it somewhere on your local machine. You will need to manually edit your system environment path to be able to call Java on the command line. See the image below for an example.

[Image: Adding the Java executable to your path]

To ensure that this worked, open a command line in the jdk-21.0.1/bin folder you just unzipped (your version may be different). Execute the command java -version and, if the path is properly set up, Java will display your current version. (Note: after updating your path on Windows, you must close and re-open your command line program, e.g. Git Bash.)
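The same check can be made from within R, which is the environment that ultimately needs to find Java. This is a minimal sketch using base R only; the function name is hypothetical, and an already-open R or RStudio session may need to be restarted before it sees an updated path.

```r
# Returns TRUE if a `java` executable is visible on the PATH of this R session.
java_on_path <- function() {
  nzchar(Sys.which("java")[[1]])
}

if (java_on_path()) {
  # Prints the installed Java version to the console.
  system2("java", "-version")
} else {
  message("Java was not found on the PATH; check your environment variables.")
}
```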

Now that the installation and set up are complete, the next step is to initiate a config file.

Setting up your config file

The visualization tool is operated via a config file. The user will be guided through the creation of this config file by prompts with detailed instructions and therefore should not need to interact with the R code or the raw config file at all, except to run the RShiny app. The config file allows the user to select the data series to be used and then points the tool to the location of this data. The visualization tool will automatically look for a config file named io_flows.cfg in the base directory of the repository. No config file is included in the repository by default; the user will need to generate one using the process described below as part of the first run. Each of the required fields in the config file is described below, in the order in which the prompts will request them. The contents of an empty config look like this:

{
  "series": "series_X",
  "nodes_and_edges_directory":,
  "inputs_tech_path":,
  "inputs_resources_path":,
  "gcam_path":,
  "gcam_queries": "Flow_viz_queries.xml"
}

When the RShiny app is launched, the code will look for a config file titled io_flows.cfg in the working directory. If one is found, the following steps will be skipped and the tool will launch based on the settings in that config file. To start from scratch, just delete the config file. If no config file is found, the app will guide the user through the creation of a new one. The following prompt will appear: [Image: config creation welcome prompt]

Each subsequent prompt will include details to properly choose a relevant file or folder. More details on each field are included here.

  • series: This field is used exclusively for labeling. It will be applied to the names of data files generated and also included in the visualization tool to keep runs of different GCAM versions distinct
  • nodes_and_edges_directory: This should be a path to a location in the io_flows directory. When the tool is initializing, it will look in this folder first. It will be looking for two files titled nodes_detailed.csv and edges_detailed.csv. If it finds these files, it will open the tool and construct the network from these nodes and edges. The rest of the config fields will not be used. If the path points to an empty directory, the tool will attempt to generate these csvs and save them here for future runs. If the field is left blank (select cancel in the pop-up window), the path will default to ["./processed_data"] and it will save nodes and edges files here.
  • inputs_tech_path: If the nodes and edges files are not found in the directory indicated by the previous field, the visualization tool will need to construct those files. To do this, it needs the two csv files mentioned in the previous section: inputs_tech.csv and inputs_resources.csv. This config field should point to the former and the next config field will point to the latter. Note that the tool is expecting the full name of the document (including ".csv") in these fields, not just a path to a directory. If these data files are found, the tool will run the data processing script to generate the nodes and edges files described above. If the previous field is specified, these files will be stored there and the tool will run. If the previous field is blank, these files will be stored at the default location and the config file will be updated so that this step only needs to occur once. If these fields are left blank (select cancel in the pop-up window), the tool will move on and attempt to build these files from scratch; they will be named automatically and saved to the default location ./raw_data.
  • inputs_resources_path: See above.
  • gcam_path: If the three previous fields are left blank or if the files indicated cannot be found, the tool will search for a GCAM output database. It will query this database, construct the nodes and edges files, and then run the tool. This will take a few minutes. This field should point to the GCAM base directory. Within this directory, there should be an output folder that contains the gcam database. The user will need to have the GCAM model downloaded and complete a run to generate this database. See the following section for more details on accessing, using, and querying GCAM. The tool will apply the query specified in the following config field to this database, which will produce the two raw data csvs required to construct the network. These csvs will be saved to the ./raw_data folder and the config file will be updated automatically so that this step only needs to be completed once. Additionally, the data processing script will look within this folder for csvs containing logit values even if it doesn't need to query the GCAM database to generate the input files.
  • gcam_queries: This field contains the name of the xml file containing the two queries needed to construct the input csvs. This field is filled by default and the xml file is included in the repository at ./io_flows/queries/. Copy and paste this xml file into the queries folder in your local GCAM setup: gcam_path/output/queries/Flow_viz_queries.xml. The tool will execute this query on the gcam database, generate the input files, build the network, then run the tool.
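The order in which these fields are consulted can be summarized in a short sketch. The function below is hypothetical and only mirrors the fallback order described in the list above; it takes the config as a named R list and is not the tool's actual implementation.

```r
# Hypothetical helper mirroring the order in which the config fields are consulted.
# `cfg` is a named list with the fields described above; "" or NULL means blank.
choose_data_source <- function(cfg) {
  is_set <- function(x) !is.null(x) && nzchar(x)

  # 1. Existing nodes and edges files take priority.
  dir <- if (is_set(cfg$nodes_and_edges_directory)) {
    cfg$nodes_and_edges_directory
  } else {
    "./processed_data"
  }
  network_files <- file.path(dir, c("nodes_detailed.csv", "edges_detailed.csv"))
  if (all(file.exists(network_files))) return("use_existing_nodes_and_edges")

  # 2. Otherwise, process the two raw input csvs if both are present.
  have_inputs <- is_set(cfg$inputs_tech_path) && is_set(cfg$inputs_resources_path) &&
    file.exists(cfg$inputs_tech_path) && file.exists(cfg$inputs_resources_path)
  if (have_inputs) return("process_input_csvs")

  # 3. Otherwise, fall back to querying a local GCAM database.
  if (is_set(cfg$gcam_path) && dir.exists(cfg$gcam_path)) return("query_gcam_database")

  "no_data_found"
}
```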

An example of a filled-out config file is included below.

{
  "series": "series_70",
  "nodes_and_edges_directory": ".\\processed_data",
  "inputs_tech_path": ".\\raw_data\\series_70_inputs_tech.csv",
  "inputs_resources_path": ".\\raw_data\\series_70_inputs_resources.csv",
  "gcam_path": "C:\\Users\\sweisberg\\OneDrive - Research Triangle Institute\\Documents\\gcam-mcs\\s70",
  "gcam_queries": "Flow_viz_queries.xml"
}

The tool includes built-in error checking functionality for the generated config file, but not all errors will be caught. If invalid information is passed to the config file, it will be automatically renamed to io_flows_ERROR.cfg and the tool will not be able to read it in. The user should attempt to regenerate the config file by following the instructions above.
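The checks the tool actually performs are not listed here, so the following is only a sketch of what such validation might look like: it tests for the six field names shown in the examples above and sets a bad file aside under the io_flows_ERROR.cfg name. Both function names are hypothetical.

```r
required_fields <- c("series", "nodes_and_edges_directory", "inputs_tech_path",
                     "inputs_resources_path", "gcam_path", "gcam_queries")

# Hypothetical check: returns the names of required fields missing from `cfg`,
# where `cfg` is the config parsed into a named list.
missing_fields <- function(cfg) {
  setdiff(required_fields, names(cfg))
}

# Hypothetical handler: set an invalid config file aside under the ERROR name
# so that a fresh config can be generated on the next run.
quarantine_config <- function(path) {
  error_path <- file.path(dirname(path), "io_flows_ERROR.cfg")
  file.rename(path, error_path)
  error_path
}
```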

GCAM

Default GCAM data files are included in this repository, but it is useful to be able to query and examine additional versions of the model. To do this, the user will need to download the desired version of GCAM, set it up, run a reference case, and then update the IO tool config file to point to this latest version. The IO tool will be able to execute queries on a GCAM database using the rgcam package. Running and querying GCAM requires a working version of 64-bit Java. See the rgcam section above for recommendations for setting up rgcam and Java.

The latest release of the GCAM model can be found here. Set-up and run instructions can be found here. Download and unzip the preferred version of the GCAM release. If your computer is set up to sync with OneDrive or a similar networked drive, make sure to unzip GCAM onto your local machine separate from OneDrive. Otherwise, R might not be able to recognize all of the input files and will miss data. The IO tool has been tested on GCAM versions 5, 6, and 7. Follow the quickstart instructions to run a reference case. Pay attention to the common failure reasons pertaining to Java. Solving a GCAM reference case will take significant computational resources. According to the instructions: "A GCAM model simulation will utilize over 8 GB of system RAM and storing the full results of the simulation will take around 3 GB of disk space per scenario." The run will take several minutes, but only needs to be done once.

We provide a query to run on the GCAM database, which can be found in the io_flows/queries/ directory. This xml file must be copied into the output/queries/ directory within the GCAM instance that was just downloaded and run. The IO tool will know to look for this file and use it to query the results. It will then automatically process this data and launch the visualization tool.
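The copy step can also be done from R. The helper below is a hypothetical convenience function built on base R file operations; the two directory arguments are whatever paths apply on your machine.

```r
# Hypothetical helper: copy the query file from the io_flows repository into
# the output/queries folder of a local GCAM installation.
install_query_file <- function(io_flows_dir, gcam_path,
                               query_file = "Flow_viz_queries.xml") {
  source_file <- file.path(io_flows_dir, "queries", query_file)
  target_dir  <- file.path(gcam_path, "output", "queries")
  if (!file.exists(source_file)) stop("Query file not found: ", source_file)
  # Create output/queries if it does not exist yet.
  dir.create(target_dir, recursive = TRUE, showWarnings = FALSE)
  # Returns TRUE when the copy succeeds.
  file.copy(source_file, file.path(target_dir, query_file), overwrite = TRUE)
}
```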

Using the tool

Once the config file has been set up, the visualization tool will automatically launch. In the R script, App.R, which is located in the base io_flows directory, click the button that says Run App to launch the visualization tool. If a valid config file already exists, the tool will skip the above steps. In the dropdown beneath the Run App button, the user can select where the tool will be launched (a separate window, in the console, etc). We recommend checking the options "Run in Window" and "In Background Job":

[Image: Choosing how R will launch the visualization tool]

Upon starting the tool, you'll see the following display. It can be divided into three parts as illustrated on the diagram below.

[Image: The visualization tool]

  • I. Interactive Network. The focus of the tool is the network diagram itself, where nodes are sectors, subsectors, and technologies, and edges represent transfers between these in the direction indicated by the arrows. The nodes can be divided into three general groups: upstream nodes, downstream nodes, and the singular focus node. The focus node is set to 'regional oil' by default but can be set by the user and determines which subset of nodes and edges surrounding this focus node will be shown in the tool. Upstream nodes are defined as nodes that have transfers to the focus node or to another upstream node. Downstream nodes are defined as nodes that receive transfers from the focus node or from another downstream node. The number of upstream and downstream nodes shown in the display can be controlled by the user. We will return to this in section III. The network will automatically update when any of the controls described in section III are changed. The network supports the following interactions.
    • Users can zoom in and out on portions of the network.
    • Users can click and drag on white space on the network diagram to move the view around.
    • Users can click and drag on individual nodes to reposition them within the limits of the positioning algorithm.
    • Users can hover over a node or edge to highlight its immediate neighbors. Other, more distant, nodes and edges will be grayed out. This is useful for identifying areas of focus in the network when displaying a large number of nodes and/or edges.
    • When not viewing logits on the screen (more on this in section III), hovering over a node will also display a tooltip with the logit value.
  • II. Legend. The legend is shown in section II and describes the color scheme chosen to represent relationships within the network. Upstream nodes are shown in red, downstream nodes in blue, and the focus sector is yellow. Within these categories, a gradient is used to distinguish between sectors, subsectors, and technologies. The graph shows only sectors by default, but subsectors and technologies can be added by the user. Subsectors will always be shown directly upstream of the expanded node; specifically, nodes will be inserted between the expanding sector and all of its upstream neighbors. When a subsector is expanded, technologies will then be inserted directly between the expanding subsector and its upstream neighbors. For example, in the diagram below, we start with the default network with upstream and downstream distances set to 1. Then in the second panel, we elect to expand the focus sector 'regional oil.' This causes two subsectors--'crude oil' and 'unconventional oil'-- to appear directly upstream of the focus sector. Note that there are no longer edges connecting 'regional oil' to its upstream sectors. All edges pass through one or both of the subsectors as defined by the relationships in the GCAM data. In the third panel, we elect to expand the subsector 'crude oil.' This causes a single technology, also labeled 'crude oil' to appear directly upstream of the expanded subsector. Note that there are no longer edges connecting the 'crude oil' subsector to its upstream sectors. All edges pass through the technology node. Note that it is possible to expand any sector or subsector currently displayed as part of the network, not just the focus sector.

[Images: network at sector level; network at subsector level; network at technology level]

  • III. Controls. Beneath the network are several tools that allow the user to control what is displayed in the network. The radio buttons in the first column allow the user to toggle the display of subsectors and technologies and to toggle the display of logit values as part of the label text in nodes. In the second column, the top dropdown allows the user to select the focus node from all available sectors in the network. The user can scroll through this list or begin typing to go directly to a sector of particular interest. The network will immediately update each time a new sector of focus is selected. Once this is complete, the user can also select sectors and subsectors to expand using the dropdowns below this one. These dropdowns are updated dynamically to only show the nodes currently on the screen. If a sector is selected in the middle dropdown, the bottom dropdown will dynamically update to include newly available subsectors to expand even if the network shown on the screen has not been updated yet. Lastly, the sliders on the right control the maximum distance between nodes and the focus sector to be displayed on the screen. Note that these distances only pertain to top-level sectors. For example, in the third diagram shown above, 'seawater' is still only a distance of 1 away from the focus node because these distances do not count the technology 'crude oil' or the subsector 'crude oil.' Not shown in the images above is a button that allows the user to export the image of the network displayed (sections I and II) to PDF, which will be downloaded to the user's default location automatically and saved as io_flows_network.pdf.
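The upstream and downstream definitions from section I, combined with the distance sliders from section III, amount to a bounded breadth-first search over the sector-level edges. The sketch below is one way to express that in base R; the node names and the function are hypothetical, and the tool's own code may work differently.

```r
# Toy sector-level edge list (hypothetical names; direction is from -> to).
edges <- data.frame(
  from = c("A", "B", "focus", "C"),
  to   = c("B", "focus", "C", "D"),
  stringsAsFactors = FALSE
)

# Breadth-first search that follows edges forward (downstream) or backward
# (upstream) and returns every node within `max_dist` steps of `focus`.
nodes_within <- function(edges, focus, max_dist,
                         direction = c("downstream", "upstream")) {
  direction <- match.arg(direction)
  frontier <- focus
  found <- character(0)
  for (step in seq_len(max_dist)) {
    neighbors <- if (direction == "downstream") {
      edges$to[edges$from %in% frontier]
    } else {
      edges$from[edges$to %in% frontier]
    }
    # Keep only nodes not seen before, so cycles do not loop forever.
    frontier <- setdiff(unique(neighbors), c(found, focus))
    if (length(frontier) == 0) break
    found <- c(found, frontier)
  }
  found
}

nodes_within(edges, "focus", 1, "upstream")    # "B"
nodes_within(edges, "focus", 2, "upstream")    # "B" "A"
nodes_within(edges, "focus", 1, "downstream")  # "C"
```

Setting a distance of 0 returns no nodes in that direction, which corresponds to showing only the focus node on that side.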

Technical details

This section contains details on some of the technical considerations and design choices that went into constructing the network visualization tool.

  • Network Layout. As with all network visualization problems, displaying the network in a clear and intuitive way without overloading the user with too much information was a core problem in this task. We partially pass the resolution of this problem onto the user by allowing control of the max upstream and downstream distance as well as which nodes get expanded. This allows the user to focus on the portion of the network that is of interest. By using the package's built-in hover functionality, the user can also use the mouse to readily identify direct neighbors if the graph does get too cluttered. The coloring scheme chosen also leads to clear separation of the upstream and downstream portions of the network, thus elucidating trends in the macro structure of the network. Algorithmically, we use the package's force directed setting to position nodes on the screen. This is a gravitational simulation where the center of the network attracts all nodes, but the nodes themselves are given a repulsive force on one another. This works to provide a clear layout in many cases, but we also allow users the ability to drag and drop nodes into better positions, though the algorithm doesn't always respond well to this. In general, we are held back here and in a few other places by the limitations of the visNetwork package in R.

  • Positioning of subsectors and technologies. The default state of the network is to have only sector-level details displayed. A core feature of this tool was to be able to select any sector on the screen to expand and display subsectors within that sector. An important question, then, was how exactly to show the relationship between these sectors and subsectors. The subsectors could replace the sector on the screen or be shown as upstream or downstream neighbors. In the end, as described previously, we elected to show subsectors as upstream neighbors of their corresponding sectors (and the same for technologies). A look at the sample data shown below can help explain this choice. We'll focus on the example of beef here. The sector beef has several upstream sectors and two downstream sectors, as shown in the network screenshot. If the user elected to expand the focus sector, beef, the tool would need to display the two subsectors--'Mixed' and 'Pastoral'--in one of the three ways described above. By looking at the data, it becomes clear that there is a relationship between inputs (which are upstream sectors of the focus node) and subsectors. The relationship between subsectors and downstream sectors is unclear. We have no way of knowing which edges to draw between the two new subsector nodes and the existing downstream nodes. Therefore, inserting these subsectors downstream of the focus node or replacing it altogether are both poor options. We elected to insert the new nodes upstream of the focus node as shown in the second network diagram below.

[Images: data_snippet; network_beef; network_beef_subsectors]

  • Underlying data structures. Data representation is another important consideration in network analysis, with adjacency matrices and adjacency lists being among the most common data structures utilized. Generally, sparse networks like the one we are examining here are better represented with adjacency lists. The visNetwork R package expects a modified version of an adjacency list. We pass it two data frames: one with node names and IDs, and one with edge IDs. This representation is slightly memory inefficient but is well-tailored to R, which is limited with respect to data structures. An advantage, though, is that including supplemental information for nodes and edges is as simple as adding a column to one or both of these data frames. This allows us to label nodes as either level 1, level 2, or level 3 nodes (and the same for edges). Level 1 nodes are sectors and are shown on the screen by default. Level 2 nodes are subsectors and are only displayed if the corresponding sector is expanded by the user. Level 3 nodes are technologies and are only displayed if the corresponding subsector is expanded by the user. Again, edges are labeled and displayed in the same way. These data structures also contain information on node position, which is relevant for coloring, and also include logit values for nodes, when available.

  • Mechanics of sector expansion. The main goal in expanding the functionality of this tool was the inclusion of subsectors and technologies, which were not available in the first iteration of this tool. The original objective was to be able to double-click on a sector to immediately expand it and display the relevant subsectors. This would have been much cleaner and more intuitive than the current solution, which includes a system of dropdowns and toggles. The choice to move away from a double-click mechanism was due to limitations of the visNetwork package, which would have required passing a function call via JavaScript and HTML rather than through R and RShiny. To keep the process simple and expedient, we elected to work within the package's limitations. Given more time, a more robust web visualization tool could be built in d3.js, a JavaScript library widely used for building customized network visualizations. Check out the gallery online. Instead, we allow the users to select sectors and subsectors to expand using the dropdowns and then initiate expansion (or turn off expansion) using the 'Show Subsectors/Techs' toggle. This allows for most of the same functionality, if in a slightly less intuitive way.

  • Regional naming. The last thing to note is that many of the sectors, subsectors, and technologies have regional names in the GCAM data. For example, Africa_Eastern traded beef vs. Central Asia traded beef vs. Taiwan traded beef. For the purpose of cutting down the number of nodes and edges in the network and because we felt this regionalized naming scheme does not add useful information, all region names have been removed, replaced with an 'X,' and consolidated into a single node. For the traded beef example, the visualization tool would display a single node labeled 'X_traded beef.' Certain nodes, mainly technologies, have their entire name replaced with an X, so context is important for identifying the good represented by these nodes.
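To make the level labels and the upstream insertion of subsectors more concrete, here is a minimal base R sketch using the beef example. The subsector names 'Mixed' and 'Pastoral' come from the discussion above; the input names, the downstream sector, the column names, and the function itself are assumptions made for illustration and do not reproduce the tool's code.

```r
# Level 1 = sector-to-sector edges shown by default (illustrative names).
edges <- data.frame(
  from  = c("feed", "pasture", "beef"),
  to    = c("beef", "beef", "meat demand"),
  level = 1,
  stringsAsFactors = FALSE
)

# Level 2 detail: which inputs feed which subsector of which sector.
subsector_inputs <- data.frame(
  input     = c("feed", "pasture", "pasture"),
  subsector = c("Mixed", "Mixed", "Pastoral"),
  sector    = "beef",
  stringsAsFactors = FALSE
)

# Insert the subsectors of `sector` directly upstream of it.
expand_sector <- function(edges, subsector_inputs, sector) {
  detail <- subsector_inputs[subsector_inputs$sector == sector, ]
  if (nrow(detail) == 0) return(edges)
  # Drop the direct input -> sector edges; downstream edges are left untouched.
  kept <- edges[edges$to != sector, ]
  # Route each input through its subsector, then connect subsectors to the sector.
  into_sub <- data.frame(from = detail$input, to = detail$subsector, level = 2,
                         stringsAsFactors = FALSE)
  out_of_sub <- data.frame(from = unique(detail$subsector), to = sector, level = 2,
                           stringsAsFactors = FALSE)
  rbind(kept, into_sub, out_of_sub)
}

expanded <- expand_sector(edges, subsector_inputs, "beef")
```

After expansion, no input connects directly to beef: every upstream edge passes through 'Mixed' or 'Pastoral', while the edge from beef to its downstream sector is unchanged. Because the level is just another column, filtering what to draw is a simple subset of the data frame.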

About

Input-Output visualization tool for GCAM
