Idea for restructuring input files #285

tsmbland · 2024-04-29T10:06:04Z

tsmbland
Apr 29, 2024
Maintainer

I've been thinking about ways that we could restructure the model inputs, and I think we could massively improve things by using a database structure, whilst still maintaining a set of human-readable csv files. See details below.

Database schema

This is one possible way that the model inputs could be represented as a database:

Each box is a table, with fields and data types indicated. Arrows represent relationships between tables. I've coloured the tables as follows:

Orange: Properties of regions
Purple: Properties of commodities (e.g. wind)
Yellow: Properties of technologies (e.g. wind turbine)
Green: Properties of agents
Grey: System state

All parameters are related to existing parameters in the current input files, so I won't include full explanations here (although see further down for a basic description of each table). There are a few parameters I'm not sure about (either I don't understand what they're for, or not sure which table they should belong in), which I've coloured in red.

One point to note is the distinction between a technology (e.g. wind turbines) and a technology installation, which is an installation of a technology in a specific region (e.g. wind turbines in the UK). The Technology table holds core properties of the technology (which won't differ between regions), and the TechnologyInstallation table holds properties related to a specific installation of a technology in a region (which may differ between regions). Some parameters are allowed to vary between timeslices, which can be indicated in the TechnologyInstallationTimeslice table. This schema no longer contains distinct tables for different sectors. Instead, Sector is added as a field in the Technology table.

A few of the tables would contain time series data (black border) for parameters that can change over the course of the simulation. However, in reality, basic users may want to keep most parameters fixed over the course of the simulation, and just define them for the base year (which would be mandatory). Therefore, an alternative way of representing the data could be as follows:

In this case, the main tables only store data for the base year. If any parameters change over time, these changes can be stored in various 'trend' tables (blue). For many simple simulations, in which most parameters are held constant over time, these tables will be all/mostly empty. It may seem like an unnecessary duplication of tables, but I think it could simplify things for most users, and also make clearer what's mandatory to define (base year) and what's optional (future years).

Input file schema

Many users won't be comfortable with using databases directly, so we'll want to keep using a system of csv files that users can manually edit with Excel or a text editor. This can be done without too many deviations from the above schema. I've grouped tables into folders according to the colouring scheme above (although I haven't included a table for regions, as these are already defined as a list in the config file).

(See example files at the bottom, which is probably easier than reading through the schema)

Agent/

Agent.csv

Defines the quantity of each agent, and links each agent to the appropriate agent shares (new and retrofit)
I'm not 100% sure I understand the distinction between agents and agent shares, so this may need changing (e.g. does each agent always have one new and one retrofit share?)
Derived from original file technodata/Agents.csv

Name: str
AgentShareNew: str
AgentShareRetrofit: str
Quantity: float

AgentShare.csv

Defines properties of the new and retrofit agent shares for each agent
Must be two rows for each agent, one with New type and one with Retrofit type (is this appropriate?)
I don't 100% understand all the fields. Perhaps some of these would be better in the table above
Derived from original file technodata/Agents.csv

Name: str
Type: str["New", "Retrofit"]
MaturityThreshold: float
Budget: float
SearchRule: str["same_enduse", "all", "similar", "fueltype", "existing", "currently_referenced_tech", "maturity", "spend_limit"]
DecisionMethod: str["mean", "weighted_sum", "lexo", "retro_lexo", "epsilon_con", "retro_epsilon", "single_objective"]

Objective.csv

Defines the objectives of the agent shares defined above
Each agent share can have any number of objectives (unlike before which was limited to three max)
Derived from original file technodata/Agents.csv

AgentShare: str
Type: str["comfort", "efficiency", "fixed_costs", "capital_costs", "emission", "fuel_consumption_cost", "LCOE", "NPV", "EAC"]
ObjData: float
ObjSort: bool

Commodity/

Commodity.csv

Core properties of each commodity
Derived from original file input/GlobalCommodities.csv. I've excluded CommodityEmissionFactor_CO2 and HeatRate as these aren't used by the model, but can add back if helpful

Name: str
Type: str["energy", "service", "material", "environmental"]
Unit: str

CommodityPrice.csv

Lists the price of each commodity in each region (base year)
Prices can be in any currency, but they must all be in the same currency. Even though currency isn't used in the model, we could have a field in the settings file to define the currency just so there's a record, or add an optional column here
All combinations of commodity and region are required
Derived from original file input/Projections.csv

Region: str
Commodity: str
Price: float

CommodityTrade.csv

Net import/export of commodities into each region (base year)
Positive equals net import, negative equals net export
Lack of a row for a commodity/region combination implies zero net import
Derived from original files input/BaseYearExport.csv and input/BaseYearImport.csv. I've combined imports and exports to have a single field NetImport as I assume this is all that matters, but I may be wrong

Commodity: str
Region: str
NetImport: float
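
As a rough sketch of how imports and exports could collapse into a single NetImport field, assuming each original file reduces to (commodity, region, value) records. The `net_import` helper and the sample values are hypothetical and not taken from the default model:

```python
from collections import defaultdict


def net_import(imports, exports):
    """Combine import and export records into net import per (commodity, region).

    Each record is a (commodity, region, value) tuple. Zero totals are dropped,
    because a missing row already implies zero net import.
    """
    totals = defaultdict(float)
    for commodity, region, value in imports:
        totals[(commodity, region)] += value
    for commodity, region, value in exports:
        totals[(commodity, region)] -= value
    return {key: value for key, value in totals.items() if value != 0}


# Hypothetical sample data for illustration only
imports = [("gas", "R1", 5.0), ("electricity", "R1", 2.0)]
exports = [("gas", "R1", 1.5), ("electricity", "R1", 2.0)]
net = net_import(imports, exports)
```

Here gas ends up as a net import, while electricity cancels out and needs no row at all.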

Technology/

Technology.csv

Core properties of a particular type of technology (e.g. wind turbine)
Derived from original files technodata/{sector}/Technodata.csv

Name: str
Type: str
Fuel: str
EndUse: str
Level: str["fixed", "flexible"]
Sector: str

TechnologyInstallation.csv

Properties of a particular installation of a technology in a region (base year)
Lack of a row for a technology/region combination implies that the technology isn't installed in that region
Derived from original files technodata/{sector}/Technodata.csv

Technology: str
Region: str
CapPar: float
CapExp: float
FixPar: float
FixExp: float
VarPar: float
VarExp: float
MaxCapacityAddition: float
MaxCapacityGrowth: float
TotalCapacityLimit: float
TechnicalLife: float
ScalingSize: float
Efficiency: float
InterestRate: float
UtilizationFactor: float, null
MinimumServiceFactor: float, null

Flow.csv

Documents the flow of commodities through each technology installation (base year)
Negative equals net consumption, positive equals net production
Lack of a row for a particular technology/region/commodity combination implies that the commodity is not consumed/produced by the technology installation
Derived from original files technodata/{sector}/CommIn.csv and technodata/{sector}/CommOut.csv. I've combined inputs and outputs into a single field for net flow as I assume that's all that matters, but again I may be wrong

Technology: str
Region: str
Commodity: str
Flow: float

Allocation.csv

Allocation of each technology installation to agents
Share for each technology installation must add to one
Lack of a row for a particular technology/region/agent combination implies an allocation of zero
Derived from technodata/{sector}/Technodata.csv

Technology: str
Region: str
Agent: str
Share: float
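
The sum-to-one rule could be checked with something like the following. This is only a sketch: the sample rows, the tolerance and the `invalid_allocations` helper are assumptions made for illustration.

```python
import csv
import io
import math
from collections import defaultdict

# Hypothetical Allocation.csv content; heatpump is deliberately under-allocated
ALLOCATION_CSV = """Technology,Region,Agent,Share
gasboiler,R1,A1,0.6
gasboiler,R1,A2,0.4
heatpump,R1,A1,0.7
"""


def invalid_allocations(rows, tol=1e-9):
    """Return {(technology, region): total share} for totals that differ from one."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["Technology"], row["Region"])] += float(row["Share"])
    return {
        key: total
        for key, total in totals.items()
        if not math.isclose(total, 1.0, abs_tol=tol)
    }


rows = list(csv.DictReader(io.StringIO(ALLOCATION_CSV)))
bad = invalid_allocations(rows)
```

Only technology/region pairs that appear in the file are checked, which matches the rule that a missing row means an allocation of zero.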

Consumption.csv

Consumption of a technology installation in the base year
Derived from files in technodata/preset/

Technology: str
Region: str
Timeslice: str|int
Quantity: float

ExistingCapacity.csv

Existing technology installation capacity
Users can define capacity for the current year, and decommissioning profile over future years
Lack of a row for a particular technology/region/year combination implies zero capacity
Derived from technodata/{sector}/ExistingCapacity.csv

Technology: str
Region: str
Year: int
Capacity: float

Trend/

This folder allows the user to add data for future years
Anything in here is optional
The currently-used rules of flat-forward extension and linear interpolation will apply for all tables. For example, if no data is defined for future years, the model will assume flat-forward extension from the base year.
If data for the base year is redefined here, this will override anything defined in the tables above
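
To make these rules concrete, here is a minimal sketch of how a single parameter could be resolved for a given year. The `value_in_year` function is hypothetical and is not how MUSE currently implements this. It also assumes that years before the first defined year take the first defined value.

```python
def value_in_year(base_year, base_value, trend, year):
    """Resolve a parameter for `year` from a base-year value and optional trend rows.

    `trend` maps year -> value. A trend entry for the base year overrides the
    base table. Years between defined points are linearly interpolated, and
    years beyond the last defined point use flat-forward extension.
    """
    points = {base_year: base_value}
    points.update(trend)  # trend data for the base year wins over the base table
    years = sorted(points)
    if year <= years[0]:
        return points[years[0]]  # assumption: flat before the first defined year
    if year >= years[-1]:
        return points[years[-1]]  # flat-forward extension
    for lo, hi in zip(years, years[1:]):
        if lo <= year <= hi:
            frac = (year - lo) / (hi - lo)
            return points[lo] + frac * (points[hi] - points[lo])


# No trend rows: the base-year value is carried forward unchanged
constant = value_in_year(2020, 10.0, {}, 2050)
# One trend row in 2030: 2025 is interpolated, 2040 is extended flat
midpoint = value_in_year(2020, 10.0, {2030: 20.0}, 2025)
extended = value_in_year(2020, 10.0, {2030: 20.0}, 2040)
```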

AllocationTrend.csv

Allocation data for future years

Year: int
Technology: str
Region: str
Agent: str
Share: float

CommodityPriceProjection.csv

Price projection for future years

Year: int
Region: str
Commodity: str
Price: float

CommodityTradeTrend.csv

Commodity trade for future years

Year: int
Commodity: str
Region: str
NetImport: float

ConsumptionTrend.csv

Technology consumption for future years

Year: int
Technology: str
Region: str
Timeslice: str
Quantity: float

FlowTrend.csv

Flow data for future years

Year: int
Technology: str
Region: str
Commodity: str
Flow: float

TechnologyInstallationTrend.csv

Gives the ability to change any parameters for a technology installation in future years
You would have one row per parameter change, indicating the relevant field

Year: int
Technology: str
Region: str
Field: str["CapPar", "CapExp", "FixPar", "FixExp", "VarPar", "VarExp", "MaxCapacityAddition", "MaxCapacityGrowth", "TotalCapacityLimit", "TechnicalLife", "ScalingSize", "Efficiency", "InterestRate", "UtilizationFactor", "MinimumServiceFactor"]
Value: float
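
A sketch of how these one-row-per-change records could be read and grouped into a time series per parameter. The sample rows and the `load_trends` helper are made up for illustration; the allowed field names are the ones listed above.

```python
import csv
import io
from collections import defaultdict

ALLOWED_FIELDS = {
    "CapPar", "CapExp", "FixPar", "FixExp", "VarPar", "VarExp",
    "MaxCapacityAddition", "MaxCapacityGrowth", "TotalCapacityLimit",
    "TechnicalLife", "ScalingSize", "Efficiency", "InterestRate",
    "UtilizationFactor", "MinimumServiceFactor",
}

# Hypothetical TechnologyInstallationTrend.csv content
TREND_CSV = """Year,Technology,Region,Field,Value
2030,windturbine,R1,CapPar,900
2040,windturbine,R1,CapPar,800
2030,windturbine,R1,Efficiency,0.45
"""


def load_trends(text):
    """Group trend rows into {(technology, region, field): {year: value}}."""
    trends = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(text)):
        if row["Field"] not in ALLOWED_FIELDS:
            raise ValueError(f"Unknown field: {row['Field']}")
        key = (row["Technology"], row["Region"], row["Field"])
        trends[key][int(row["Year"])] = float(row["Value"])
    return dict(trends)


trends = load_trends(TREND_CSV)
```

Rejecting unknown field names at load time means a typo in the Field column is reported immediately instead of being silently ignored.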

TechnologyInstallationTimeslice.csv

Allows the parameters UtilizationFactor and MinimumServiceFactor to vary during different timeslices (e.g. specific months, times of day etc.)
Any timeslices not defined here will fall back to the default as defined in TechnologyInstallation and/or TechnologyInstallationTrend
Derived from original file technodata/{sector}/TechnodataTimeslices.csv

Year: int
Technology: str
Region: str
Month: str
Day: str
Hour: str
Field: str["UtilizationFactor", "MinimumServiceFactor"]
Value: float
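
The fallback behaviour could look roughly like this. It is a simplified sketch that ignores the Year column, treats a timeslice as a (month, day, hour) tuple, and uses made-up values; `timeslice_value` is a hypothetical helper.

```python
# Hypothetical defaults, as they might be read from TechnologyInstallation.csv
defaults = {
    ("windturbine", "R1"): {"UtilizationFactor": 0.4, "MinimumServiceFactor": 0.0},
}

# Hypothetical overrides keyed by (technology, region, (month, day, hour), field)
overrides = {
    ("windturbine", "R1", ("winter", "weekday", "night"), "UtilizationFactor"): 0.55,
}


def timeslice_value(technology, region, timeslice, field):
    """Use the timeslice-specific value if one is defined, else the default."""
    key = (technology, region, timeslice, field)
    if key in overrides:
        return overrides[key]
    return defaults[(technology, region)][field]


winter_night = timeslice_value(
    "windturbine", "R1", ("winter", "weekday", "night"), "UtilizationFactor"
)
summer_day = timeslice_value(
    "windturbine", "R1", ("summer", "weekday", "day"), "UtilizationFactor"
)
```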

I've had a go at restructuring the default model into this format, and attached the result below. This should clarify what I have in mind. I've also included the default model in the original format for comparison:

new_default.zip
old_default.zip

I haven't modified the settings file yet, but this should be mostly unchanged.

Advantages compared to current system

Database format ensures self-consistency and prevents redundancy
Users never have to worry about adding columns to tables, or creating new tables. All changes are made by adding rows to existing tables.
It's clearer what's mandatory and what's optional, especially in relation to time-series data
No need to pad tables with zeros
Each table has a clearer definition/purpose
No need to have different tables for each sector

Notes

This may be simplified further if certain features are deprecated in the future (e.g. price projections, sectors)
I haven't thought much about how to implement this, so any ideas are welcome. If we wanted to use this for MUSE v1, we could just write something that converts from this format to the original format, and leave the rest of the code unchanged (at least as a first step).
The original model format also includes the ability to use a regression function to define service demand. I haven't looked too much into this yet so I haven't included this in the current schema, but it shouldn't be too difficult
There are a few things I still need to wrap my head around, so there might be some things that need to be changed. Especially around agents vs agent shares, and the "fixed" vs "flexible" level fields.

Discussion points

Does this make sense?
Is it an improvement on what we currently have?
How might we implement this?
Do we like having separate tables for the base year and future years, or should these be combined?
Do we want to consider this for MUSE v1, or leave until MUSE v2?
Any other comments/suggestions?

tsmbland · 2024-04-29T16:02:12Z

tsmbland
Apr 29, 2024
Maintainer Author

@dalonsoa @alexdewar Thoughts on this?


dalonsoa · 2024-05-03T07:45:59Z

dalonsoa
May 3, 2024
Maintainer

Wow! This is a big change!

To be honest, I - we all - will need to sit together and think about the implications of these changes. For a long time.

A few comments that come to my mind:

As inputs are still provided as CSV files, I feel this affects the inner workings of MUSE, which would use a DB to store all of that information, much more than the input layer. So, I feel this will definitely need to be left for MUSEv2. The number of things that will need changing is enormous and equivalent to re-writing the code from scratch. A DB is, in practice, a global variable and when MUSE was being re-written from v0, an explicit effort was made to not use global variables to keep the state of the simulation, but rather have the MCA own most of the information and pass the relevant bits to one or other function. I don't have any problem with the DB idea, I think it makes sense - at least for static stuff - but just commenting on the possible implications.

Having said that, MUSEv1 uses pandas and xarray to store all of this information, which are already tables and multidimensional tables that support many DB-style queries (not the relational aspect, of course), so I'm not entirely sure what problem using a DB will solve in practice.

On re-organising the input structure, your version probably makes more sense for users and makes things simpler for them, but we need to consider how those files map to the internal data structures of MUSE. They have been designed with the original input structure in mind - they were developed simultaneously - so changing one without changing the other will require some interface layer to make them work together. MUSEv1, for example, relies heavily on the concept of sector, which Adam says is not really that relevant, while in the new version a sector is just one more entry in a table with, otherwise, not much relevance.

Considering all of this, I would suggest we park the full implementation of this for the design of MUSEv2, and try to figure out which of the advantages you describe can be retrofitted into MUSEv1, and how.


alexdewar · 2024-05-03T15:08:41Z

alexdewar
May 3, 2024
Maintainer

@tsmbland This looks v well considered and seems to make sense (at least from my fairly shallow understanding of the current input format).

I think I agree with @dalonsoa that this is probably something that should wait till MUSE v2 though. A major change in the input file format is essentially a breaking change, so if we were doing semantic versioning that would be a major version bump in any case. Given that MUSE v1 is likely to live on as the "legacy" version for a while, we should probably aim to maintain backwards compatibility for the foreseeable future.

Once I'm a bit more familiar with the codebase I might have some more thoughts about this and we should def keep it in mind when reworking the input layer.


tsmbland · 2024-05-03T15:53:17Z

tsmbland
May 3, 2024
Maintainer Author

Thanks @dalonsoa and @alexdewar

I agree that we definitely don't want to break backwards compatibility in v1. In theory, we could support both formats in v1 by implementing a layer that converts input files from the new format back to the original format, and none of the inner workings would need to change. Obviously this would have no benefits in terms of performance, but would allow users to get comfortable with the new format, and lower the barriers for switching to v2 (for both users and developers). Not saying we need to do this, just an option.
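
As a very rough sketch of what one piece of such a conversion layer might do, here is the net-flow table being split back into separate input and output tables. It assumes, purely for illustration, that the original files are wide tables with one column per commodity, padded with zeros. The `flows_to_wide` helper and the sample rows are hypothetical.

```python
def flows_to_wide(flow_rows, commodities):
    """Split long-format net flows into wide input and output tables.

    `flow_rows` holds (technology, region, commodity, flow) tuples, where
    negative flows are consumption and positive flows are production.
    Returns two dicts keyed by (technology, region), each mapping every
    commodity to a non-negative quantity (zero where no row exists).
    """
    inputs, outputs = {}, {}
    for technology, region, commodity, flow in flow_rows:
        key = (technology, region)
        inputs.setdefault(key, dict.fromkeys(commodities, 0.0))
        outputs.setdefault(key, dict.fromkeys(commodities, 0.0))
        if flow < 0:
            inputs[key][commodity] = -flow
        else:
            outputs[key][commodity] = flow
    return inputs, outputs


# Hypothetical rows in the style of Flow.csv
flow_rows = [
    ("gasboiler", "R1", "gas", -1.2),
    ("gasboiler", "R1", "heat", 1.0),
]
inputs, outputs = flows_to_wide(flow_rows, ["gas", "heat"])
```

A net-flow table cannot represent a technology that both consumes and produces the same commodity, so this only round-trips if net flow really is all that matters.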

In terms of implementation, we wouldn't necessarily need to use an actual database (although it may be a good idea to). I like the database-style format as I think it's an intuitive way of structuring data, but the data could still be read in and stored using pandas and xarray without building an actual database (Update: or an in-memory database).
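
For what it's worth, an in-memory database is cheap to try out with Python's built-in `sqlite3` module. The sketch below uses a heavily cut-down, hypothetical version of two of the tables to show the kind of self-consistency check a relational structure gives for free:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript(
    """
    CREATE TABLE Technology (
        Name TEXT PRIMARY KEY,
        Sector TEXT NOT NULL
    );
    CREATE TABLE TechnologyInstallation (
        Technology TEXT NOT NULL REFERENCES Technology(Name),
        Region TEXT NOT NULL,
        Efficiency REAL,
        PRIMARY KEY (Technology, Region)
    );
    """
)

# Hypothetical rows for illustration only
conn.execute("INSERT INTO Technology VALUES ('windturbine', 'power')")
conn.execute("INSERT INTO TechnologyInstallation VALUES ('windturbine', 'R1', 0.4)")

# An installation that refers to an undefined technology is rejected
try:
    conn.execute("INSERT INTO TechnologyInstallation VALUES ('typo_tech', 'R1', 0.4)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

installed = conn.execute("SELECT COUNT(*) FROM TechnologyInstallation").fetchone()[0]
```

The same check could be written by hand against pandas tables, so this is more about convenience than capability.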


tsmbland · 2024-05-07T14:04:56Z

tsmbland
May 7, 2024
Maintainer Author

@ahawkes @sharwaridixit


ahawkes · 2024-05-13T13:03:00Z

ahawkes
May 13, 2024
Maintainer

This looks good! I think we need a meeting about it too though.

