Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fromiterator withcapacity #44

Closed

Conversation

chris-ha458
Copy link
Contributor

@chris-ha458 chris-ha458 commented Sep 20, 2023

This implements iterator.size_hint() based bounds to call with_capacity.

The last commit 5f5aa09 uses the new capacity based implementation to change the for based implementation to a fold() based one.

Arguably it is more idiomatic, but it is not necessarily more concise or clear compared to the for based version. However, as an objective benefit beyond the subjective, it can make a potential rayon implementation with into_par_iter easier.

If the 5f5aa09 seem out of scope or irrelevant to with_capacity, it can easily be reverted and prepared as part of a new PR.

@coriolinus
Copy link
Owner

I'm sorry, I know I previously said that this was a good idea, but on further thought I think it might not be, particularly if we're targeting high-scale data operations. The semantics of size_hint do not match the semantics of Counter::with_capacity, and I am increasingly of the opinion that it would be a mistake to try to conflate them.

Consider the input vec![0; 1024].into_iter().collect::<Counter<_>>(). That will allocate a Counter whose internal map is sized for 1024 items, and then it will use precisely one of those slots.

What matters for Counter is the number of distinct possible values for its parametric type T: 2 for a bool, 256 for any 8-bit sized value, n for an n-ary unit enum, etc. That's the real factor that governs the reasonable preallocation, not the size of the input.

Now imagine the usage in a data processing center. It is not unreasonable to expect data streams with terabytes flowing through. We absolutely do not wish to attempt to allocate an interior map which can handle tera-scale distinct values.

If Rust supported negative trait bounds or specialization, we could write our own trait, implement it for some common types, and let end-users implement it for their own types. We could then base the default with_capacity in these implementations not on size_hint but on <T as ExpectedDistinctValues>::expected_distinct_values(). However, those features are not supported in stable Rust.

Perhaps you have a convincing counter-argument. In that case, I'd like to hear it! But as of now, my inclination is to close this un-merged.

@chris-ha458
Copy link
Contributor Author

What matters for Counter is the number of distinct possible values for its parametric type T: 2 for a bool, 256 for any 8-bit sized value, n for an n-ary unit enum, etc. That's the real factor that governs the reasonable preallocation, not the size of the input.

This makes sense to me.

I think for the potential HPC usages, we could implement something that is relevant when it becomes necessary and the constraints and bottlenecks are clearer. Before that, the current new() could suffice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants