-
Notifications
You must be signed in to change notification settings - Fork 927
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Implement exclude_none_values in DataCollector #1702
Conversation
196b5c9
to
f6abceb
Compare
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #1702 +/- ##
==========================================
+ Coverage 81.45% 81.47% +0.01%
==========================================
Files 18 18
Lines 1402 1414 +12
Branches 272 278 +6
==========================================
+ Hits 1142 1152 +10
- Misses 214 215 +1
- Partials 46 47 +1
☔ View full report in Codecov by Sentry. |
f6abceb
to
43e9c20
Compare
mesa/datacollection.py
Outdated
def get_reports(agent): | ||
_prefix = (agent.model.schedule.steps, agent.unique_id) | ||
reports = (rep(agent) for rep in rep_funcs) | ||
reports_without_none = tuple(r for r in reports if r is not None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just too unpack this a little.. if the agent does not have the attribute, None
is returned and and there will be no tuple and then no row in the DataFrame
This prevents the issues we saw in Sugarscape with Traders where the Sugar and Spice information was retained as None and due to the number of sugar and spice agents this created memory issues.
However, if the attribute is typed as a number then it will be collected as a NaN. But based on the dynamics of this should have less memory issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just too unpack this a little.. if the agent does not have the attribute, None is returned and and there will be no tuple and then no row in the DataFrame This prevents the issues we saw in Sugarscape with Traders where the Sugar and Spice information was retained as None and due to the number of sugar and spice agents this created memory issues.
Yes, attributes which values are retrieved as None
are excluded.
However, if the attribute is typed as a number then it will be collected as a NaN. But based on the dynamics of this should have less memory issues.
This is not true, as long as you define the function to return None
when appropriate. The NaN values only appear when the report get converted to a DataFrame. This is because a DF is constrained to be a table, so a N/A value for a number is NaN.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just too unpack this a little.. if the agent does not have the attribute, None is returned and and there will be no tuple and then no row in the DataFrame This prevents the issues we saw in Sugarscape with Traders where the Sugar and Spice information was retained as None and due to the number of sugar and spice agents this created memory issues.
Yes, attributes which values are retrieved as
None
are excluded.However, if the attribute is typed as a number then it will be collected as a NaN. But based on the dynamics of this should have less memory issues.
This is not true, as long as you define the function to return
None
when appropriate. The NaN values only appear when the report get converted to a DataFrame. This is because a DF is constrained to be a table, so a N/A value for a number is NaN.
Ahhh got it, very cool
I would say it should be part of datacollector as it is a fundamental feature. |
One reason that I separated it was because dropping The naming |
This supersedes #1701 and #1481. From the user perspective, if they want to collect This implementation is also versatile enough that it enables another use case: say if the user wants such that the agent record is not collected all the time, they can write it in their conditional, e.g. collect record only for every 10th time step, by having the record function to return Edit: clarify 2nd paragraph |
There is no change in the declarative UI, it's still the same |
Isn't @GiveMeMoreData concern still valid? This doesn't seem to handle the case where attributes don't exist in all agents. Also in Python None is often a very valid value and it seems kind of strange to always exclude it from data collection |
No, it's not a problem. When the attribute doesn't exist in all agents, then the reporter is an empty tuple. Also, no
If you don't want to exclude #1701 has another problem that it requires more RAM that sometimes would cause models to not run in a Binder/Colab environment. The Sugarscape with trading model is one example of this situation. |
Yes, sorry, you are correct I was misreading the code.
No I cannot use vanilla DataCollector if I want to collect data from multiple agents. This is the reason #1701 exists.
I haven't dug into the details of #1701, but why would it's memory footprint be higher? It should only collect the attributes specified, same as this PR (I know this one also removes any other None values, but this should hardly matter). I share your concerns about code complexity though, but I think #1701 could be improved in that respect. |
I think what we are actually missing is an interface to the underlying _agent_records. Currently the only public Api to that is getting the data frame out. Then this whole PR could be solved in userland by using
(Kind of what is done in sugarsacpe example) But no one wants to mess with private variables. But again this is different from #1701 |
Can you explain this situation? Under which model would you want
#1701's DF is much more complex, and hence it consumes more memory than the temporary workaround @tpike3 did in the Sugarscape with trading, which was already having not enough memory problem, before he manually removed the unwanted values from the internal attributes.
This requires a 2nd loop. This PR (#1702) scans through the entire agents, entire agent reporter in 1 loop. You should consider this PR to be a multipurpose tool, not just for agent-specific attributes. Just as in a DF, there are various ways to interpret and deal with N/A values, your interpretation of |
I am fascinated by this problem as it represents an interesting development and user dynamic. So my view (at least right now until the discussion changes it).
My suggestion:
Let me know what you think. |
I suppose the 2 main contenders are #1701 and #1702? Should we close #1481, given that it is most complex for the user to use? Yeah I will merge the 2 classes and add a flag to drop
This can be remedied by 1. documenting in the useful code snippets section, 2. examples, in particular the trading Sugarscape. I can do point 1 in a separate PR afterward. |
I'm concerned that this is an implicit behavior. |
59f580d
to
47118de
Compare
I have merged the 2 classes into 1. The net diff of this PR is now smaller. |
47118de
to
d91e45e
Compare
That's fair, I am good with the way you set it up |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
- Due to the high volume of discussion on this, Collecting data from multiple differentiable classes of agents #1701, and Fix #1419. DataCollector accepts an arbitrary schedule at creation (d… #1481, I will merge in 48 hours if no further comments.
- @Corvince , @GiveMeMoreData, @woolgathering, @EwoutH please provide any thoughts or concerns you may have.
d91e45e
to
3c52958
Compare
I am not sure about this one. In comparison #1701 provides such a nice API where you can provide an explicit list of attributes you want to collect for each agent. This allows a fine-grained control over what data is collected. Whereas with this PR we have to tell users: yeah just write all attributes you want to collect in one list. Don't worry, If you turn on this flag it will rely on the fact that attributes will default to None if they don't exist and remove them. Oh but please don't collect any attributes that might truly be None, those will also be removed. I also don't think this is needed for memory issues. I think a more logical approach would be to write out/save _agent_records every n steps and flush the _agent_records dictionary. So something like def step():
...
self.dc.collect()
df = self.dc.get_agent_vars_df()
df.save(...)
self._agent_records = {} Then the Mesa API would only require a method to flush the agent records (or add a flag to get_agent_vars()). But this is another discussion. |
So this was my final comment. I won't continue spamming this PR. I am not completely opposed to merging this but I definitely don't see this as superseding #1701 |
@tpike3 fyi @rht I like the changes you made and I think they almost solve the problem I had in my case. But I belive there is still a functionality missing. |
@Corvince These are great points and I absolutely agree on the API intuitiveness. I put this in the category of so easy its hard. I think we are at a fundamental trade off between code elegance and user understanding. My thought right now is to try and have the best of both worlds, and mitigate the lack of intuitiveness by putting examples of agents_by_type_collection in the (1) tutorial, (2) putting the Complexity Explorer and (3) making a gist, so users have lots of examples to see how to do it. Feel free to provide a counter argument and one way to assess is if we merge and keeping getting questions on data collection by type then we know we need to make explicit. |
@GiveMeMoreData ohhh interesting, I am not able to think through this use case. If one is collecting different agent attributes they will be collected with Let me know (For the record I think your code is great! It reminds me of when I rewrote batchrunner_MP and then @Corvince seeing what I did created batch_run in like 1/3 the lines of code, so regardless of how this moves forward your code was absolutely instrumental in developing this solution.) |
@tpike3 I'm not aware of all the use cases for lambda functions, but they are absolutelly allowed by the library and at least now there is no hint that using them in case of differentiable agents will result in AttributeError, which is currently the case. One simple usecase would be to collect the results of a method implemented in agent. I think that we should change code to either handle lambdas without errors or improve documentation to make it clear that in case of differentiable agents data collection is possible only when described as {"value": "val"}. Also thank you for the very kind words :) |
This still has memory issue where the DF is huge. And the DF is more complex. because you need extra column to tell which agents have the particular reports.
The Besides, I don't see why one would pick |
I'd say the trade-off is that you have an explicit agent-specific API, but complex inner working of the DF, vs a multi-purpose feature that happens to solve #1419, with a simple inner working. At the end of the day, for #1701, I still need to see an example/tutorial in order to know the structure of the agent-specific DF. |
3c52958
to
d4e173a
Compare
Again, my point is that this is not a suitable solution for multi agent data collection. This thread already identified two limitations: 1) not possible to collect None values and 2) not possible to use lambda/function reporters if they use unavailable attributes. Honestly I think multi agent data collection is such a fundamental feature that it will be very strange to have different constraints from collecting data from a single agent type. But, and this is a big but, this PR is mergeable as is and as @rht showed has also a general purpose. So it's better to have some immediate solution than to have a perfect solution that will never be build. So I would say feel free to merge. But this whole thread got me thinking about the purpose of data collection. I have an idea for a fundamentally different approach, but I will post it as a discussion topic in the next few days. |
Does that mean you can no longer write a model that collects agent-specific attributes if the 2 features you requested are unavailable? Somehow these features matter more than other more pressing issues that have already been said several times. I have already asked an example of such model, but haven't been provided any. |
@Corvince I am intrigued to see what you are going to come up with, based on @GiveMeMoreData's comments I did start playing around with integrating lambda functions for different agent types and that also started me thinking about the the idea of datacollection that optimizes APIs, computational efficiency and readable code. I however did not have and breakthrough thoughts. Regardless, this code is a great improvement and to @EwoutH comments about contributions we got to be better so I am going to merge. |
This this issue ever get resolved? Especially in the case of:
Since I’m touching the DataCollector in #1838 again, I’m considering if it needs to be tested with multiple agent classes, each with multiple types of agents reporter syntaxes (strings, lambda, etc.) |
Following again (this is @woolgathering) since I've returned from tooling-land to get back into simulations. In my mind, it would be nice to test with multiple agent classes and reporter syntaxes so cover all the bases. My feeling is that if we're mucking in it now we might as well cover as much as is reasonable. I would expect some of my use cases to run into those problems. |
There is a solution proposed in #1813 (comment), but haven't been implemented. |
Is there consensus about that solution being the right way to go (among @tpike3 @jackiekazil @Corvince @GiveMeMoreData)? If so, would you be willing to open a PR? I feel we have talked so much about configuring the DataCollector for multiple agent types, we should figure it out once and for all (looked it up, literally since early 2017, nearly 7 years ago, this discussion has been ongoing on and of). |
Fixes #1419.
Came up with this solution while reviewing #1701; thank you @GiveMeMoreData. This has 2 benefits:
In this case, if you create a DataFrame out of the agent records, the agents that don't have the said attribute will have nan values in the column (if it the column is typed as a number).
One point to discuss is whether I should merge
DataCollectorWithoutNone
intoDataCollector
, and have a flag inDataCollector
to enable/disable of the exclusion ofNone
values.@woolgathering what do you think of this solution for your model?