Generic Collections #201

jackfirth · 2021-12-16T04:20:46Z

Rhombus needs proper collections, such as lists and sets. This discussion is an informal summary of my design thoughts thus far and the direction I'm heading in.

Terminology

When speaking of collections generally and not specific Racket APIs, here's what I mean by the following terms:

Sequence — a homogeneous container of values, called the sequence elements, that can be accessed one at a time in series
Collection — a sequence whose size is finite
Ordered collection — a collection where the order of elements in the collection matters, and different orderings of the same elements are considered different collections
Unordered collection — a collection where the order of elements in the collection doesn't matter, and different orderings of the same elements are considered the same collection
Sorted collection — an ordered collection where the collection's elements are always in a specific sorted order
List — an ordered collection (example: a checklist of instructions to follow)
Set — an unordered collection that does not allow duplicate elements (example: the collection of packages installed on your machine)
Multiset — an unordered collection of values that allows duplicates (example: a shopping cart full of items)
Mutable collection — a collection that can change over time, gaining or losing elements
Immutable collection — a collection that cannot ever change
Destructive update — a change to a mutable collection
Functional update — a change to an immutable collection where a new, changed collection is created and the original collection is left unchanged
Persistent collection — an immutable collection that supports efficient functional updates
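
As a rough illustration of the last few terms, here is a minimal sketch in Python (not Racket or Rhombus syntax; the package names are made-up sample data). It contrasts a destructive update on a mutable set with a functional update on an immutable one. Note that Python's frozenset is immutable but not persistent in the sense defined above, since the union copies every element.

```python
# Destructive update: the mutable set itself changes.
installed = {"racket", "rebellion"}
installed.add("resyntax")

# Functional update: the original frozenset is left unchanged
# and a new, changed collection is created.
base = frozenset({"racket", "rebellion"})
extended = base | {"resyntax"}

print(sorted(installed))  # ['racket', 'rebellion', 'resyntax']
print(sorted(base))       # ['racket', 'rebellion']
print(sorted(extended))   # ['racket', 'rebellion', 'resyntax']
```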

Background

Today Racket has the following collection abstractions:

"Sequences" — a generic interface (using the prop:sequence struct type property, not racket/generic) that's similar to the definition of sequence I used above, except Racket sequences allow each "element" to be an arbitrary number of values, similar to how functions can return multiple values by using the (values ...) form.
"Lists" — persistent linked lists that support efficient inserts and removals at the beginning of the list. Racket lists do not support efficient random access; retrieving the Nth element takes time linear in N. Even just determining the size of a Racket list takes time linear in the length of the list.
Improper lists — linked lists where the last link is another element instead of null, like '(1 2 3 . 4). Their existence mainly arose as an artifact of Racket's representation of lists as cons pairs without any enforcement that the cdr of the last cons pair is null. They have all the problems of Racket lists and a handful more and come with very few advantages to make up for it.
Vectors — fixed-size contiguous lists that can be either immutable (but not persistent) or mutable. Vectors support constant-time random access.
mlists — mutable linked lists that exist mostly as a legacy API for porting Scheme code. Like ordinary Racket lists, random access is linear.
gvectors, or growable vectors — mutable contiguous lists that support efficient insertion at the end of the list via amortization.
Sets — a generic interface (in the sense of racket/generic) for mutable and immutable sets. The default implementation is an immutable set backed by a hash table.

Racket lists, improper lists, vectors, mlists, gvectors, and sets can only be manipulated generically through the prop:sequence protocol and racket/sequence APIs. This protocol only allows iterating list elements one at a time, so determining the size of a list can't be done efficiently through the sequence APIs.

Racket's list implementations also force the user to choose between efficient random access and immutability. In theory one could use immutable vectors, but all of Racket's vector APIs produce mutable vectors by default. Even if you stick with mutable vectors, you're still forced to choose between vectors and gvectors. The latter supports insertion, but the former is better supported. For example, there's no gvector-sort! operation like vector-sort!, so if you want to collect an unknown number of elements one at a time and then sort them, you have to copy the elements from a gvector into a vector partway through.

Proposed interfaces

My general proposal for Rhombus is that we define a few collection interfaces and offer multiple implementations. To start with, I propose this interface hierarchy:

To address mutability, I propose that each collection type — List, Set, and Multiset — come in four flavors: an immutable interface, a mutable interface, a read-only view interface, and a write-only builder interface. The division of operations between these interfaces would look like this:

Queries and other read-only operations go on the view interface.
Functional updates go on the immutable interface, which is a subtype of the view interface.
Destructive updates go on the mutable interface, which is a subtype of the view interface.
The builder interface is for efficiently building immutable collections using destructive updates. Each builder is a mutable object with operations for inserting collection elements into it, but no operations for reading the collection. The builder.build() operation creates an immutable collection from a builder. Builders offer the benefits of immutable collections without the performance costs of persistent updates, provided users are fine with performing all of their insertions into the collection up front.
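
The division of operations described above could be sketched roughly as follows. This is Python rather than Rhombus, and every name in it (ListView, ImmutableList, MutableList, ListBuilder, TupleList, ArrayListBuilder, get, size, added, add) is a hypothetical stand-in chosen for illustration, not the proposed API. The tuple-backed implementation copies on every functional update, so it is immutable but not persistent.

```python
from abc import ABC, abstractmethod
from typing import Generic, TypeVar

T = TypeVar("T")


class ListView(ABC, Generic[T]):
    """Read-only operations shared by the immutable and mutable flavors."""

    @abstractmethod
    def get(self, index: int) -> T: ...

    @abstractmethod
    def size(self) -> int: ...


class ImmutableList(ListView[T]):
    """Adds functional updates: each one returns a new list."""

    @abstractmethod
    def added(self, element: T) -> "ImmutableList[T]": ...


class MutableList(ListView[T]):
    """Adds destructive updates: each one changes the list in place."""

    @abstractmethod
    def add(self, element: T) -> None: ...


class ListBuilder(ABC, Generic[T]):
    """Write-only: insertion plus build(), and no read operations."""

    @abstractmethod
    def add(self, element: T) -> "ListBuilder[T]": ...

    @abstractmethod
    def build(self) -> ImmutableList[T]: ...


class TupleList(ImmutableList[T]):
    """Toy immutable list backed by a tuple (copies on update)."""

    def __init__(self, items=()):
        self._items = tuple(items)

    def get(self, index):
        return self._items[index]

    def size(self):
        return len(self._items)

    def added(self, element):
        return TupleList(self._items + (element,))


class ArrayListBuilder(ListBuilder[T]):
    """Toy builder that accumulates into a growable buffer."""

    def __init__(self):
        self._buffer = []

    def add(self, element):
        self._buffer.append(element)
        return self

    def build(self):
        return TupleList(self._buffer)


original = ArrayListBuilder().add(1).add(2).build()
extended = original.added(3)
print(original.size(), extended.size())  # 2 3
```

The builder deliberately does not implement ListView, so code holding a builder cannot read from it until build() has been called.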

For lists, this results in an interface hierarchy like so:

Further thoughts

There's a lot more I've put thought into but haven't written down here yet. Most of these I've explored or implemented in my Rebellion package, so I'll include links to the relevant bits for each item:

Implementation choices, especially for persistent collections. RRB trees are one example.
Sorted collections, comparators, and ranges. See rebellion/collection/sorted-set for examples.
Maps and other "bicollections", like multimaps. See rebellion/collection/multidict for examples.
Collection views, e.g. list.sublist(3, 8) should return a List implementation backed by the original list. map.keys() would similarly return a Set implementation backed by the map. Views make sorted collections far more powerful; see sorted-subset in Rebellion for an example. Read-only views of mutable collections are another useful case.
Optimized implementations of empty collections and singleton collections.
Range-based collections, like sets of ranges (e.g. a set of the numbers in the ranges [3, 8), [10, 14], and (18, 25)) and maps from ranges to values. See rebellion/collection/range-set for examples.
Thread-safe usage of mutable collections.
Expressing stream pipeline operations on sequences in terms of transducers (see rebellion/streaming/transducer) and reducers (see rebellion/streaming/reducer).
for comprehensions.
Map entry views, similar to Rust's Entry API.
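
To make the collection views item above more concrete, here is a small Python sketch of the idea. SublistView is a hypothetical class written for this example; it supports only non-negative integer indexes and assumes the backing list does not shrink. The second half uses Python's built-in dict.keys(), which already behaves as a view backed by the map.

```python
from collections.abc import Sequence


class SublistView(Sequence):
    """Read-only window onto positions [start, stop) of a backing list."""

    def __init__(self, backing, start, stop):
        self._backing = backing
        self._start = start
        self._stop = stop

    def __len__(self):
        return self._stop - self._start

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # No elements are copied: reads go straight to the backing list.
        return self._backing[self._start + index]


letters = list("abcdefghij")
window = SublistView(letters, 3, 8)
letters[3] = "D"     # a change to the original list...
print(list(window))  # ...is visible through the view: ['D', 'e', 'f', 'g', 'h']

ages = {"ada": 36}
names = ages.keys()  # a set-like view backed by the dict
ages["alan"] = 41
print(sorted(names))  # ['ada', 'alan']
```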

samth · 2021-12-17T00:05:26Z

You've given multisets a prominent place here (and multimaps in rebellion), which suggests that you or others find them useful a lot more than I have. Can you give some examples of use cases for them that you've run into?

1 reply

jackfirth Dec 17, 2021
Collaborator Author Sponsor

Any time I'm counting things, really. For example, in a recent Advent of Code puzzle I built a list of multisets to store the counts of how many ones and zeros there were at each position in a collection of bitstrings. They're also handy for figuring out how often different elements occur in some dataset. There are more examples in the Guava (Google's core Java libraries) wiki documentation on Guava's multisets. The basic idea is to look for anywhere you have a Map<E, PositiveInteger> and check if you're representing counts of elements - if you are, odds are using a Multiset<E> instead will be more ergonomic.
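
A rough sketch of that bitstring-counting pattern, written in Python with collections.Counter standing in for a multiset (the bitstrings are made-up sample data, not the actual puzzle input):

```python
from collections import Counter

bitstrings = ["00100", "11110", "10110"]

# One multiset per bit position, counting the ones and zeros seen there.
counts_by_position = [Counter(column) for column in zip(*bitstrings)]

print(counts_by_position[0]["1"])  # 2
print(counts_by_position[0]["0"])  # 1

# The most frequent bit at each position.
most_common_bits = "".join(
    counts.most_common(1)[0][0] for counts in counts_by_position
)
print(most_common_bits)  # 10110
```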

rocketnia · 2022-01-02T08:10:58Z

> To address mutability, I propose that each collection type — List, Set, and Multiset — come in four flavors: an immutable interface, a mutable interface, a read-only view interface, and a write-only builder interface.

Where is a builder needed? I can imagine using a mutable collection at first and then creating an immutable copy of it once all the elements are in place. Would a builder be more efficient or something?

> Views make sorted collections far more powerful; see sorted-subset in Rebellion for an example.

Views are a bit unusual to me. Even though sorted-subset is an interesting example of a view, I'm not sure where I would use sorted-subset. Have you made use of it a lot? Do you have any other examples of where views come in handy?

I might have an example in mind. I've lately been thinking of views as a compelling technique for working with strings and other data of arbitrary size without doing so many copies.

(In particular, I'm thinking of using views in Punctaffy so that I can iterate over the faces of a higher-dimensional snippet without constructing each face as I go along. With a view approach, the faces could be cursors referring to landmarks in the snippet's representation, rather than being fully copied snippets of their own.)

Perhaps builders are justified for the same reason: Once you're done with a builder, you can get an immutable collection from it without doing a deep copy of its contents; you just move its whole internal buffer over.

13 replies

sorawee Jan 4, 2022

> Builders make it impossible to skip that step.

That sounds like a disadvantage rather than an advantage.

Can we have a copy-on-write collection? Users can naturally use a mutable collection to add stuff, and the conversion from mutable collection to immutable collection can then be done in constant time.

jackfirth Jan 4, 2022
Collaborator Author Sponsor

> It sounds like both binary heaps and sorted sets benefit from being built all at once from an array. So why define an incremental builder type for those collections at all, if all it does in the incremental part is building an array? What does it give you that an array doesn't?

Builders encapsulate that for you, and they take care of other details like checking for duplicates and sorting. They can also use better strategies for choosing the initial array capacity and the growth rate as elements are added to it, for instance reserving additional capacity more aggressively and choosing a larger initial capacity since unused space can be freed when build() is called.
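
As a sketch of the kind of encapsulation described here, the hypothetical SortedSetBuilder below (Python, invented for this example) accumulates elements in a plain buffer and defers sorting and duplicate removal to build(). Python lists do not expose their capacity, so the capacity and growth-rate tuning mentioned above is not shown, and a tuple stands in for an immutable sorted set.

```python
class SortedSetBuilder:
    """Write-only builder: cheap appends now, one sort and dedupe at build()."""

    def __init__(self):
        self._buffer = []

    def add(self, element):
        # No ordering or duplicate check yet; just remember the element.
        self._buffer.append(element)
        return self

    def build(self):
        # Sort once, then drop adjacent duplicates in a single pass.
        unique = []
        for element in sorted(self._buffer):
            if not unique or unique[-1] != element:
                unique.append(element)
        return tuple(unique)


result = SortedSetBuilder().add(3).add(1).add(3).add(2).build()
print(result)  # (1, 2, 3)
```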

> That sounds like a disadvantage rather than an advantage.

The advantage is that less code will use mutable collections when it doesn't actually need to mutate them. If users want mutable collections, they can just use mutable collections. But if they only want to mutate a collection locally while they build it, then read from it afterwards, builders fit that pattern better and give better performance and a better API for it.

> Can we have a copy-on-write collection? Users can naturally use a mutable collection to add stuff, and the conversion from mutable collection to immutable collection can then be done in constant time.

I'm not sure what implementation you're envisioning here. Do you mean something like implementing the MutableList interface with a mutable box containing a persistent immutable list?
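
For concreteness, one possible reading of "a mutable box containing a persistent immutable list" is sketched below in Python. Cons, BoxedStack, snapshot, and to_list are hypothetical names invented for this example, and a stack is used instead of a full list interface to keep it short. Taking an immutable snapshot is constant time because it only hands out the current persistent list, which later pushes never modify.

```python
from typing import Any, NamedTuple, Optional


class Cons(NamedTuple):
    """One cell of a persistent singly linked list; None is the empty list."""
    head: Any
    tail: Optional["Cons"]


class BoxedStack:
    """Mutable interface implemented as a box holding a persistent list."""

    def __init__(self):
        self._box = None

    def push(self, element):
        # Functional update on the persistent list, then rebind the box.
        self._box = Cons(element, self._box)

    def snapshot(self):
        # O(1): the immutable list is shared, not copied.
        return self._box


def to_list(cell):
    """Copy a persistent list into a Python list, for printing."""
    items = []
    while cell is not None:
        items.append(cell.head)
        cell = cell.tail
    return items


stack = BoxedStack()
stack.push(1)
stack.push(2)
frozen = stack.snapshot()
stack.push(3)  # does not affect the earlier snapshot

print(to_list(frozen))            # [2, 1]
print(to_list(stack.snapshot()))  # [3, 2, 1]
```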

rocketnia Jan 4, 2022

> However, users often won't bother with that last step and will just pass around the mutable collection they built instead. Builders make it impossible to skip that step.

How so? If a user doesn't want to pass around mutable collections as though they're immutable, then why does the mere existence of builders in the language stop them from doing that? Is there a reason they have to use builders? Is there a different kind of convenience they get from builders that makes up for the fact they can't skip this step anymore?

rocketnia Jan 4, 2022

> Builders encapsulate that for you, and they take care of other details like checking for duplicates and sorting. They can also use better strategies for choosing the initial array capacity and the growth rate as elements are added to it, for instance reserving additional capacity more aggressively and choosing a larger initial capacity since unused space can be freed when build() is called.

I think this is getting somewhere for me! :)

I wanted to believe some deduplication or sorting was being done incrementally like that, but it sounds like binary heaps and sorted sets weren't good examples. Is there another example of a data structure where it would happen?

As for the array capacity growth rate, that's interesting. It sounds like the advantage there isn't quite as simple as I was imagining back when I said a builder might let us "get an immutable collection from it without doing a deep copy," but it sounds like there is some other substantial advantage there.

> The advantage is that less code will use mutable collections when it doesn't actually need to mutate them.

I do like this kind of encapsulation in principle. If a lot of code interacts with a builder, it'll probably be easier to reason about that code if it's encapsulated behind a builder type rather than just being represented as a mutable array. I haven't quite come up with a situation where a lot of code interacts with a builder...

...But I suppose it can be a pretty common thing in code that iterates over collections and performs side effects along the way. The builder pattern probably tidies up that code so that it's like a transducer pipeline, iterating over some set of collections and building up another set of collections along the way. Some collections participating in this technique might be ideal to build incrementally, and some might be ideal to build all at once from an array, but what they all have in common is this imperative iteration style.

When I put it that way, it seems to clash a little with the functional iteration style Racket has with its for constructs and named let loops. Should there be a kind of immutable builder for use in this kind of looping code? Would those be Rebellion's reducers?

jackfirth Jan 4, 2022
Collaborator Author Sponsor

> How so? If a user doesn't want to pass around mutable collections as though they're immutable, then why does the mere existence of builders in the language stop them from doing that? Is there a reason they have to use builders? Is there a different kind of convenience they get from builders that makes up for the fact they can't skip this step anymore?

What I meant was "using builders makes it impossible to skip that step". They don't do anything if you don't use them. The docs can certainly recommend using them, however, and Resyntax could suggest replacing temporary mutable collections with builders.

> Should there be a kind of immutable builder for use in this kind of looping code? Would those be Rebellion's reducers?

Yup, those are reducers. Builders encapsulate the best algorithms for iteratively building collections. Stream pipelines built out of sequences, transducers, and reducers encapsulate the mutable processing performed during iteration. My vision is that most users get the benefits of the builders "for free" by just using a standard intoSet() reducer, which internally uses a set builder. The builders are there for use cases that don't quite fit the usual stream pipeline structure, or for users who are building new data structures on top of existing collections.

olopierpa · 2022-01-03T12:40:48Z

On January 3, 2022, at 13:08, Ross Angle wrote:

> Assuming that we can use mutable collections as builders this way, what advantage do we get by adding a dedicated builder type to the mix? It sounds like @jackfirth's saying there are some collection types that have builders that are even more efficient than mutable collections since they don't have to allow read access until the collection is fully built, but I've had trouble coming up with an example of that.

For example, a binary heap can be built all at once in O(n) time, in contrast to inserting elements one by one into a mutable data structure, which requires O(n log n) time.
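
The difference can be seen with Python's standard heapq module, used here only as an illustration (the values are made-up sample data). heapq.heapify builds a heap from a whole array in linear time, while repeated heapq.heappush calls cost O(log n) each in the worst case.

```python
import heapq

values = [9, 4, 7, 1, 8, 2]

# Bulk construction: heapify rearranges the whole array in place in O(n).
bulk = list(values)
heapq.heapify(bulk)

# Incremental construction: n pushes, each O(log n) in the worst case.
incremental = []
for value in values:
    heapq.heappush(incremental, value)

# Both are valid min-heaps over the same elements,
# though their internal layouts may differ.
print(bulk[0], incremental[0])  # 1 1
```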
