Use serial update for each upgrades data frame #186

Merged: 49 commits from postproc_per_upgrade into main, Sep 11, 2024

Conversation

@wenyikuang (Collaborator) commented Jun 18, 2024

Pull request overview

This PR is intended to fix #149.

How do I test that it works?

  • With the cycle3_euss_10k_df_2 dataset (3 upgrades), I confirmed that the results in the output parquets match those produced without this change.
  • With the cycle3_euss_full_350k_combined dataset (32 upgrades), I was able to finish the run on my local laptop.

This pull request makes changes to (select all that apply):

  • Documentation
  • Infrastructure (includes apptainer image, buildstock batch, dependencies, continuous integration tests)
  • Sampling
  • Workflow Measures
  • Upgrade Measures
  • Reporting Measures
  • Postprocessing

Author pull request checklist:

  • Tagged the pull request with the appropriate label (documentation, infrastructure, sampling, workflow measure, upgrade measure, reporting measure, postprocessing) to help categorize changes in the release notes.
  • Added tests for new measures
  • Updated measure .xml(s)
  • Register values added to comstock_column_definitions.csv
  • Both options_lookup.tsv files updated
  • 10k+ test run
  • Change documentation written
  • Measure documentation written
  • ComStock documentation updated
  • Changes reflected in example .yml files
  • Changes reflected in README.md files
  • Added 'See ComStock License' language to first two lines of each code file
  • Implemented corresponding measure tests and added the indexing path in test/measure_tests.txt and/or test/resource_measure_tests.txt
  • All new and existing tests pass the CI

Review Checklist

This will not be exhaustively relevant to every PR.

  • Perform a code review on GitHub
  • All related changes have been implemented: data and method additions, changes, tests
  • If fixing a defect, verify by running develop branch and reproducing defect, then running PR and reproducing fix
  • Reviewed change documentation
  • Ensured code files contain License reference
  • Results differences are reasonable
  • Make sure newly added measures have tests and are indexed properly
  • CI status: all tests pass

ComStock Licensing Language - Add to Beginning of Each Code File

# ComStock™, Copyright (c) 2023 Alliance for Sustainable Energy, LLC. All rights reserved.
# See top level LICENSE.txt file for license terms.

TODO:

  • Test with the downloadable dataset and verify that the results match.
  • Clean up the comments and write documentation.

@wenyikuang added the postprocessing (PR improves or adds postprocessing content) and Pull Request - Ready for CI labels Jun 18, 2024
@wenyikuang wenyikuang changed the title Use serial update for each upgrades data frame [WIP] Use serial update for each upgrades data frame Jun 18, 2024
@wenyikuang wenyikuang changed the title [WIP] Use serial update for each upgrades data frame Use serial update for each upgrades data frame Jul 17, 2024
assert isinstance(self.monthly_data, pl.LazyFrame)

# self.data = pl.concat(annual_dfs_to_concat, join='inner', ignore_index=True)
common_columns = set(annual_dfs_to_concat[0].columns)
Collaborator (PR author) commented:

That's my approach to implement the join='inner' , find all the shared columns and select them out.
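
A minimal sketch of that approach, assuming annual_dfs_to_concat is a non-empty list of polars LazyFrames (names other than those in the snippet above are illustrative, not the actual ComStock code):

import polars as pl
from functools import reduce

# Emulate join='inner': keep only the columns shared by every upgrade frame,
# then concatenate the frames vertically.
common_columns = reduce(
    lambda cols, df: cols & set(df.columns),
    annual_dfs_to_concat[1:],
    set(annual_dfs_to_concat[0].columns),
)
ordered_columns = [c for c in annual_dfs_to_concat[0].columns if c in common_columns]
self.data = pl.concat(
    [df.select(ordered_columns) for df in annual_dfs_to_concat],
    how='vertical',
)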

@@ -2571,7 +2643,7 @@ def export_data_and_enumeration_dictionary(self):
     col_enums = []
     if col_def['data_type'] == 'string':
         str_enums = []
-        for enum in self.data.select(col).unique().to_series().to_list():
+        for enum in self.data.columns:
Member commented:

The original code isn't looping through column names, it's getting the unique set of values across all rows within the column.
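
For illustration, the two expressions return different things (a sketch assuming self.data is an eager pl.DataFrame at this point, as the original line implies):

# Original behaviour: the distinct values found in one string column, across all rows.
str_values = self.data.select(col).unique().to_series().to_list()

# Changed line: the names of every column in the frame, which is not equivalent.
col_names = self.data.columns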

@asparke2 (Member) left a comment:

@wenyikuang let's have a discussion about this, I'm not convinced that the scaling weights are working as expected.


self.add_metadata_index_col(upgradIdcount)
self.get_comstock_unscaled_monthly_energy_consumption()
self.add_weighted_energy_savings_columns()
Member commented:

This method requires the self.BLDG_WEIGHT column to exist...but that column is not added until self.add_national_scaling_weights() is called. How is the code working now?


color_map = {'Baseline': self.COLOR_COMSTOCK_BEFORE, upgrade_name: self.COLOR_COMSTOCK_AFTER}
df_upgrade = df_upgrade.collect().to_pandas()
Member commented:

Can we really collect the dataframe at this point without first downselecting to only the columns actually used in the comparison plots? This needs to be tested with a full-sized dataset to make sure the memory usage is reasonable. Otherwise, it seems like it should be kept as a pl.LazyFrame and only collected inside the plotting functions.
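
A sketch of the suggested pattern; the function name and column names here are placeholders, not the actual ComStock plotting code:

import polars as pl

def plot_energy_by_building_type(df_upgrade: pl.LazyFrame, color_map: dict):
    # Materialize only the columns this plot actually uses; keep everything else lazy.
    plot_df = (
        df_upgrade
        .select(['in.comstock_building_type', 'calc.weighted.total_energy_consumption..tbtu'])
        .collect()
        .to_pandas()
    )
    # ... build the comparison plot from plot_df using color_map ...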

@@ -0,0 +1,23 @@
FUNC __main__.main 227.7188 1717631807.5411 254.3594 1717631808.2755 0
Member commented:

Delete this file; it looks like it was accidentally committed from memory profiling.

Collaborator (PR author) replied:

Sure.

@rHorsey (Collaborator) commented Sep 8, 2024

This now also includes sampling v2. I'm pulling in changes directly here.

@rHorsey (Collaborator) commented Sep 8, 2024

@ChristopherCaradonna This is the branch to test the 10k with! I haven't merged main into this, however, so I'm unsure whether there are conflicts, especially with the options lookup.

@rHorsey (Collaborator) commented Sep 11, 2024

Big (and somewhat breaking) merge to speed work towards Nov EUSS Release with new sampling and much better postprocessing. Huge thanks @wenyikuang !!!

@rHorsey rHorsey merged commit b2fb2a3 into main Sep 11, 2024
0 of 3 checks passed
@rHorsey rHorsey deleted the postproc_per_upgrade branch September 11, 2024 01:02
Labels: postprocessing (PR improves or adds postprocessing content), Pull Request - Ready for CI
Projects: None yet
4 participants