🔧 Define process to teardown dormant deployments #1103

Open
shankari opened this issue Jan 10, 2025 · 8 comments

Comments

@shankari
Contributor

We now have a well-defined process to create new deployments. Potential partners:

  • fill out the form
  • submit the signed MOU
  • once the MOU is signed by NREL, we create an environment for them

However, we don't have a process for tearing down deployments, so we keep accumulating data in the running system.

This has several problems:

  • we cannot support unbounded growth of the online data
  • for many programs, we are not even receiving new data, so we are just wasting resources keeping data available online with nobody to access it
  • our MOUs are for fixed time periods, and once the MOU expires, we need to tear down the data collection unless it is extended.

We should come up with a process for teardown.

The process should be initiated by:

  • detecting that data collection has effectively stopped, OR
  • by the MOU expiring
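
The two triggers above could be combined into a single check along these lines. This is only a minimal sketch: the `Deployment` record, its field names, and the 90-day dormancy window are hypothetical placeholders, not anything that exists in the current codebase.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record; field names are assumptions for illustration.
    name: str
    mou_expiry: date
    last_data_received: date


def teardown_reason(
    dep: Deployment,
    today: date,
    dormancy_window: timedelta = timedelta(days=90),  # assumed threshold
) -> Optional[str]:
    """Return why a deployment should enter teardown, or None if it should not."""
    if today > dep.mou_expiry:
        return "mou_expired"
    if today - dep.last_data_received > dormancy_window:
        return "dormant"
    return None
```

An expired MOU is checked first here so that it takes precedence when both conditions hold; whether that ordering matters is an open design question.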

Once the environment is torn down, we need to send the data over to the TSDC for archival.
We also need to think about how to accommodate people who want to continue collecting data; I assume we should move them over to open-access. Ideally we would also copy over their historical data so that they could see their history @JGreenlee

@shankari
Contributor Author

To detect that data collection has effectively stopped, we need to write a script that checks how many active users each deployment has.

At a high level, this involves:

  • marking deployments as active or archived
  • periodically checking all active deployments to determine which ones are dormant

Checking for dormancy involves iterating over all active deployments and seeing how many active users they have.

This, and potentially copying over the data for participants whose deployments are being torn down, but who want to continue collecting data as part of open-access, can also be viewed as creating a federated dataset over all the deployments.
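
The dormancy check described above might look roughly like the following. It is a sketch under stated assumptions: the in-memory `deployments` dict, the `status` and `last_seen` keys, the 30-day activity window, and the `min_active_users` cutoff are all hypothetical; a real script would read these from each deployment's database.

```python
from datetime import datetime, timedelta


def count_active_users(last_seen_by_user, now, window=timedelta(days=30)):
    """Count users whose most recent upload falls within the window (assumed: 30 days)."""
    return sum(1 for ts in last_seen_by_user.values() if now - ts <= window)


def find_dormant(deployments, now, min_active_users=1):
    """Return the names of active deployments with too few active users."""
    dormant = []
    for name, info in deployments.items():
        if info["status"] != "active":
            continue  # archived deployments are not re-checked
        if count_active_users(info["last_seen"], now) < min_active_users:
            dormant.append(name)
    return dormant


# Example with made-up data
now = datetime(2025, 1, 10)
deployments = {
    "program-a": {"status": "active", "last_seen": {"u1": datetime(2025, 1, 8)}},
    "program-b": {"status": "active", "last_seen": {"u2": datetime(2024, 6, 1)}},
    "program-c": {"status": "archived", "last_seen": {}},
}
dormant_now = find_dormant(deployments, now)
```

With this made-up data, only `program-b` is flagged: `program-a` has a recent upload and `program-c` is already archived.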

@shankari
Contributor Author

shankari commented Jan 10, 2025

While discussing federation, I also want to revisit the idea of creating a site at https://openpath.nrel.gov (instead of https://nrel.gov/openpath) that we maintain.

This could contain:

  • list of currently active deployments with some high level summary of how active they are
  • list of archived deployments with links to their TSDC pages
  • maybe an overall running total of the number of deployments and users/data collected over time
  • a blog where we can put release notes instead of emailing all deployers

I'm a bit concerned about getting distracted by this instead of working on scalability or other user-facing improvements, though.

@Abby-Wheelis @JGreenlee @iantei @TeachMeTW thoughts?

@TeachMeTW

TeachMeTW commented Jan 10, 2025

This teardown process seems like a good thing to work on. Right now, scalability-wise we are making pretty good progress:

  • The homepage is quite speedy thanks to the profiledb changes.
  • As for batch loading, Jack has made good progress on upgrading it, and it seems to be nearly done — only minor bug fixes I believe?
  • Regarding server changes, there's nothing glaring in terms of points of concern. My interpretation of the analysis and observations is that there is only 1 or 2 areas to look at.

On the user-facing side, I’m working on improvements now—picking up on Jack's branch (e.g., the work on #1066 (comment)). But a majority of them are already done — possibly all done within the next week or so.

I believe a teardown process wouldn’t bog anyone down. It’s a defined task that we can shift focus to, alongside our current scalability and user-facing improvements.

@JGreenlee

The immediate concerns of pipeline and DB scalability have been eating my time but I recognize that having a teardown process is also important and directly affects the long-term scalability of the platform.
I think we can plan it out in the coming weeks.

As I'm thinking about it, transferring users to open-access seems tricky.
Location data is universal so that's easy, but what if a user was on a configuration that significantly differs from open-access? (may not be a big issue yet but bound to be later)
And how do we make the process go smoothly? Would we send them notifications or prompts inviting them to opt in to the migration? Or would we silently move over anyone who's still active?

As for creating and setting up openpath.nrel.gov, I think it could be a fantastic project for an intern but not until after a teardown process is established and the "federated DB" idea is built out. So maybe spring or summer?

@shankari
Contributor Author

@TeachMeTW with respect, I disagree with #1103 (comment)

None of the changes that you have listed are in production yet, so they are not "fixed" for actual users. The home screen on any production environment is still going to take minutes to load. Just because you have written some code doesn't mean that the problem is fixed.

> Regarding server changes, there's nothing glaring in terms of points of concern. My interpretation of the analysis and observations is that there is only 1 or 2 areas to look at.

I sense a lack of urgency around the pipeline changes on the server. Of the people in our team, everybody who has been collecting data for more than 6 months has their pipeline stuck and all their trips in draft mode. This is definitely an area of concern. Looking at one area (loading composite trips) took Jack and me, working together, a week to identify the cause. We still haven't fixed it fully.

I would like to highlight again that this is not a class project where the work is done if you write some code. The code has to be reviewed, merged, and deployed to production for the problems to be fixed.

@TeachMeTW

@shankari Understood. I get that nothing's in production yet. What I meant was that the improvements to the Home Screen look promising in terms of testing and could really speed things up once deployed. I’m not downplaying the urgency—I know we need to push these changes through review and into production.

On the pipeline issue, I haven’t personally run into the exact problems you mentioned, so my comments were more of a guess than firsthand experience. That said, I now understand the severity of the pipeline clogging -- I will tackle it with urgency and try to work closely with Jack to get it nailed down.

@iantei
Contributor

iantei commented Jan 10, 2025

I think creating https://openpath.nrel.gov/ with all the information about active and dormant deployments is a great idea. I really like the idea of having a blog for communication instead of passing messages to the deployers over email. The site would give a lot of transparency and insight to people who are interested in using the project.
I feel that, in a way, this would also count as a user-facing improvement.
But as @JGreenlee mentioned, it would be a good idea to implement this after the federated DB is built.

Though I have not worked much on the pipeline processing personally, I am inclined to believe that if we distinguish active from dormant data collection and process only the active deployments, that would give the server pipeline processing some breathing room too.

@Abby-Wheelis
Member

I like the ideas presented about the website that we would maintain, and it would support some of the concerns that I have had with updating the NREL websites (mostly, things going out of date and being hard to update). However, I do also agree that scalability is a high priority given the issues we've been encountering recently. It sounds like archiving programs will help with the scalability, so that may be a good place to start.

As far as the user-transfer process goes, I agree that a sudden change in configuration could be really confusing to users. We could probably prevent issues on the backend by not porting their old data over (thinking of how we broke the public dashboard for laos by changing the label options mid-deployment) but it might still confuse users, and they could lose their travel history. If we keep their travel history, that has the potential to break at least the public dashboard, which does not expect mixed data sets ... I'm sure we could update the dashboard to future-proof it against mixed data types (labels, surveys) and label sets, but that would take some time.

I think for that reason we should prompt them to see if they want to keep collecting data. I would imagine (based on patterns I've seen in collected data) that some people are completely disengaged with the app, but still happen to have it running. How important is it to us that we keep these people on as users? We could prompt people to transfer over, but if they don't, especially if they haven't labeled in a while, the users might be "dormant".


5 participants