Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Default impls that load data #5554

Open
sffc opened this issue Sep 18, 2024 · 11 comments
Open

Remove Default impls that load data #5554

sffc opened this issue Sep 18, 2024 · 11 comments
Labels
needs-approval One or more stakeholders need to approve proposal

Comments

@sffc
Copy link
Member

sffc commented Sep 18, 2024

Currently we have a number of impl Default on classes that have a new() function that load data. The data is usually singleton but doesn't need to be (see SentenceSegmenter for instance).

This is bad because:

  1. The contract of impl Default is to fill in default values for fields when there are reasonable default values. But these are opaques, not bag-of-field structs, and impl Default is just not in the right spirit.
  2. The impl Default is gated on the compiled_data feature, which is dangerous since it can be disabled.
  3. impl Default might do work, since it is loading data. It should ideally exist only when POD can be filled in.
  4. It's also very confusing and misleading to allow anyone to write code such as that in TimeZone + ResolvedTimeZone #5549, where an unnamed positional argument that isn't an options bag is elided with &Default::default().

We never had these impls until we added compiled_data constructors, and then we added them only because Clippy was telling us to. I don't think we ever really had a thorough discussion about it.

I hope this is a no-brainer and non-controversial.

@Manishearth @robertbastian @zbraniecki @hsivonen

@sffc sffc added the needs-approval One or more stakeholders need to approve proposal label Sep 18, 2024
@sffc sffc added this to the ICU4X 2.0 ⟨P1⟩ milestone Sep 18, 2024
@robertbastian
Copy link
Member

robertbastian commented Sep 18, 2024

  1. The contract of impl Default is to fill in default values for fields when there are reasonable default values. But these are opaques, not bag-of-field structs, and impl Default is just not in the right spirit.

That's not what that says. It says

A trait for giving a type a useful default value.

Sometimes, you want to fall back to some kind of default value, and don’t particularly care what it is. This comes up often with structs that define a set of options

"This comes up often with structs that define a set of options", but is in no way limited to that. Compiled data is reasonable default data, that's why we're publishing it.

  1. The impl Default is gated on the compiled_data feature, which is dangerous since it can be disabled.

So? Some APIs are gated, and that includes impls.

  1. impl Default might do work, since it is loading data. It should ideally exist only when POD can be filled in.

It only exists when new has no arguments. In that case we're in the singleton path and there is no work.

  1. It's also very confusing and misleading to allow anyone to write code such as that in TimeZone + ResolvedTimeZone #5549, where an unnamed positional argument that isn't an options bag is elided with &Default::default().

Call sites can be written using Foo::default() instead. This applies the same way to all types implementing Default: you can write &mut Default::default() when you need to pass a &mut Vec<T>, that's not a reason for Vec not to implement Default.

@robertbastian
Copy link
Member

The issue title misrepresents what's happening. There is never any "data loading" in a Default impl. Even in SentenceSegmenter which you use as an example, there is not:

impl Default for SentenceSegmenter {
    fn default() -> Self {
        Self::new()
    }
}

impl SentenceSegmenter {
    pub fn new() -> Self {
        Self {
            payload: DataPayload::from_static_ref(
                crate::provider::Baked::SINGLETON_SENTENCE_BREAK_DATA_V2_MARKER,
            ),
            payload_locale_override: None,
        }
    }
}

@Manishearth
Copy link
Member

I don't think I agree with any of these points.

  • The contract of impl Default is to fill in default values for fields when there are reasonable default values. But these are opaques, not bag-of-field structs, and impl Default is just not in the right spirit.

I don't agree with this. I've seen Default for all kinds of types, and the docs seem to agree with that idea, they just give "bag-of-stuff" structs as a prominent example.

  • The impl Default is gated on the compiled_data feature, which is dangerous since it can be disabled.

Seems fine. A lot of things are gated on compiled_data. I don't see Default as special. If people are using the Default impl and it gets disabled, in a different world they'd be calling ::new() and having it be disabled.

  • impl Default might do work, since it is loading data. It should ideally exist only when POD can be filled in.

There's no expectation of Default to be cheap.

  • It's also very confusing and misleading to allow anyone to write code such as that in Concept: TimeZone + FormattableTimeZone #5549, where an unnamed positional argument that isn't an options bag is elided with &Default::default().

This is a rather subjective style choice and i don't see a need to force it on our clients. They can always call Normalizer::default() if they wish to be explicit (as can we). I'm generally not a fan of blocking off ways of writing code for clients because of our own style preferences.

@sffc
Copy link
Member Author

sffc commented Sep 18, 2024

Ok so it's not noncontroversial

I'm reasonably convinced by the counter-arguments on points 1. On point 4, I'm mainly concerned when it is used for passing these things as unnamed positional arguments, but we can avoid that by not adding such APIs. I'm not convinced by the counter-arguments on points 2 and 3, and there are more I'd like to add:

  1. The impl Default is gated on the compiled_data feature, which is dangerous since it can be disabled.

I'll add to this: the fact that it is gated on compiled_data means that it is not a universal fixed default! Like, I don't want people's Default::default() to break when they enable --no-default-features. Default is sometimes used in bounds for other functions, too. Programmers can write code in a style that makes it difficult for them to break out of using the Default impl for a particular type.

  1. impl Default might do work, since it is loading data. It should ideally exist only when POD can be filled in.

I used SentenceSegmenter as an example because it has two fields, one singleton and one non-singleton. Currently the non-singleton has no data in und, but if it did, it would need to be assembled. I recall seeing some code landing in maybe Collator or Normalizer that loaded und data from a key with non-singleton data in a const constructor.

To respond to @Manishearth's "There's no expectation of Default to be cheap": my expectation is that it is cheap. For example, it is used in bounds on things like constructing an array and other operations that where the impl is called a large number of times. Is your argument an "in my experience" argument or "this is a documented convention" argument?

New points:

  1. These Default impls are not stable and are not in fact universal default values. They depend on data version, producing different behavior under cargo update. The fact that they only exist under compiled_data could be used as evidence that we believe that these values are not universal.
  2. Not all types have a pub fn new(). There's nothing "special" separating types that support pub fn new() and pub fn new(&DataLocale). It is inconsistent to have these impls on some types but not others. We could add impl Default that loads data for "und" on all other types to add consistency, but that drives home the point that these should not be considered universally default values.
  3. We never approved as a group of adding the Default impls; we only approved adding pub const fn new() functions. So I would really like to take this discussion from the angle of "we don't have these impls; are they motivated to add?" Other than the clippy lint, what is the motivation?

@sffc
Copy link
Member Author

sffc commented Sep 18, 2024

On the point 7: we have generally been conservative with adding impls to our types, and I think it has served us well. We have a very large API surface, and every impl adds cost. The only impl we agreed to universally add has been impl Debug. We add other impls such as PartialEq, Ord, Hash, Clone, ... only when independently motivated. That yardstick of "independently motivated" should apply here, too.

@Manishearth
Copy link
Member

I'll add to this: the fact that it is gated on compiled_data means that it is not a universal fixed default! Like, I don't want people's Default::default() to break when they enable --no-default-features. Default is sometimes used in bounds for other functions, too. Programmers can write code in a style that makes it difficult for them to break out of using the Default impl for a particular type.

Right, but if they're doing that they presumably need some default, without Default they'd be calling ::new().

"There's no expectation of Default to be cheap": my expectation is that it is cheap.

Okay, you should not hold on to that expectation because it is not one followed by the ecosystem.

My argument is one based on the ecosystem, and the fact that it is not documented as being cheap. I would consider this something that needs to be explicitly documented were it an expectation.

Worth noting that our Default impls still are pretty cheap, baked data loading is extremely fast. They're just not necessarily entirely devoid of runtime behavior.

5. These Default impls are not stable and are not in fact universal default values. They depend on data version, producing different behavior under cargo update. The fact that they only exist under compiled_data could be used as evidence that we believe that these values are not universal.

I don't consider this to be a requirement of the Default trait either.

  • Not all types have a pub fn new(). There's nothing "special" separating types that support pub fn new() and pub fn new(&DataLocale). It is inconsistent to have these impls on some types but not others. We could add impl Default that loads data for "und" on all other types to add consistency, but that drives home the point that these should not be considered universally default values.

I'm open to having some types have pub fn new_und() / pub fn new(&DataLocale) or something and not having Default on them. I think this argument works for those types but not for the general case of singletons.

7. We never approved as a group of adding the Default impls; we only approved adding pub const fn new() functions. So I would really like to take this discussion from the angle of "we don't have these impls; are they motivated to add?" Other than the clippy lint, what is the motivation?

I can provide the motivation for the clippy lint: Default impls enable a whole bunch of convenient functions like mem::take and stuff. It's annoying when you're working with a type that could have a Default but chooses not to.

I definitely do not think this is a case in which the clippy lint is wrong. My position is that deviating from clippy here is something that needs justification, not the new impl.

@sffc
Copy link
Member Author

sffc commented Sep 19, 2024

I can provide the motivation for the clippy lint: Default impls enable a whole bunch of convenient functions like mem::take and stuff. It's annoying when you're working with a type that could have a Default but chooses not to.

Example developer journey: they are using postcard data. They apply mem::take to a field involving a property value or some other type implementing Default. Everything compiles. Then two months later they realize that they should disable the compiled_data feature. But doing so causes mem::take to fail to compile. They don't fully understand why so they leave it alone. In the end, they are linking both compiled data and postcard data, all implicitly.

@Manishearth
Copy link
Member

Then two months later they realize that they should disable the compiled_data feature. But doing so causes mem::take to fail to compile.

This ... seems like exactly what we want to happen? They now are forced to consider where their data should come from. That's good. That should exactly be a factor of the decision to disable the feature.

I don't really think "they don't fully understand so they leave it alone" is that likely. It doesn't seem that confusing. We could add a "enabled by feature foo" tag on the impl if we really want to improve this experience.

@sffc
Copy link
Member Author

sffc commented Sep 19, 2024

This ... seems like exactly what we want to happen? They now are forced to consider where their data should come from. That's good. That should exactly be a factor of the decision to disable the feature.

My point is that they should be forced to consider where their data comes from up front by using an API explicitly documented as "with compiled data", which was key to us agreeing to have the compiled data functions have short, convenient names.

@sffc
Copy link
Member Author

sffc commented Sep 19, 2024

We could add a "enabled by feature foo" tag on the impl if we really want to improve this experience.

The problem with docs on trait impls is that they are not discoverable. Often, Default is used implicitly, as you noted with mem::take, so these docs won't get read often, and by the time they do, the programmer has already accidentally linked the compiled data implicitly.

@Manishearth
Copy link
Member

The problem with docs on trait impls is that they are not discoverable

I mean, if you get a "missing Default impl" when you play with features I don't really see why it would be hard to figure out what happened. The docs on the trait impl are just an added bonus.

I'm not convinced of the sketched developer story being particularly onerous.

the programmer has already accidentally linked the compiled data implicitly.

I think that's fine? For singletons if someone is using a type they almost certainly need the data and 99% of users will get the data from the compiled data; the main benefit of loading data from a file for singletons is if it's some large program that wishes to lazily load everything. I can see a potential additional benefit of wishing to load the latest data.

I want most of our singleton users to be able to just derive(Default) where necessary.

People who care about data to that granularity are going to be caeful about bounds anyway.

Just out of curiosity, this is the current list of gated Default impls:

[15:10:32] मanishearth@manishearth-glaptop2 ~/dev/icu4x ^_^ 
$ rg -U "compiled_data.*\nimpl Default for"
components/normalizer/src/uts46.rs
38:#[cfg(feature = "compiled_data")]
39:impl Default for Uts46MapperBorrowed<'static> {
141:#[cfg(feature = "compiled_data")]
142:impl Default for Uts46Mapper {

components/locale/src/directionality.rs
40:#[cfg(feature = "compiled_data")]
41:impl Default for LocaleDirectionality {

components/locale/src/expander.rs
212:#[cfg(feature = "compiled_data")]
213:impl Default for LocaleExpander {

components/normalizer/src/properties.rs
51:#[cfg(feature = "compiled_data")]
52:impl Default for CanonicalCompositionBorrowed<'static> {
122:#[cfg(feature = "compiled_data")]
123:impl Default for CanonicalComposition {
196:#[cfg(feature = "compiled_data")]
197:impl Default for CanonicalDecompositionBorrowed<'static> {
463:#[cfg(feature = "compiled_data")]
464:impl Default for CanonicalDecomposition {
553:#[cfg(feature = "compiled_data")]
554:impl Default for CanonicalCombiningClassMapBorrowed<'static> {
644:#[cfg(feature = "compiled_data")]
645:impl Default for CanonicalCombiningClassMap {

components/segmenter/src/grapheme.rs
134:#[cfg(feature = "compiled_data")]
135:impl Default for GraphemeClusterSegmenter {

components/segmenter/src/sentence.rs
114:#[cfg(feature = "compiled_data")]
115:impl Default for SentenceSegmenter {

components/locale/src/canonicalizer.rs
199:#[cfg(feature = "compiled_data")]
200:impl Default for LocaleCanonicalizer {

components/casemap/src/titlecase.rs
207:#[cfg(feature = "compiled_data")]
208:impl Default for TitlecaseMapper<CaseMapper> {

components/casemap/src/closer.rs
43:#[cfg(feature = "compiled_data")]
44:impl Default for CaseMapCloser<CaseMapper> {

components/casemap/src/casemapper.rs
41:#[cfg(feature = "compiled_data")]
42:impl Default for CaseMapper {

components/timezone/src/windows_tz.rs
41:#[cfg(feature = "compiled_data")]
42:impl Default for WindowsTimeZoneMapper {

components/timezone/src/zone_offset.rs
19:#[cfg(feature = "compiled_data")]
20:impl Default for ZoneOffsetCalculator {

components/timezone/src/ids.rs
96:#[cfg(feature = "compiled_data")]
97:impl Default for TimeZoneIdMapper {
468:#[cfg(feature = "compiled_data")]
469:impl Default for TimeZoneIdMapperWithFastCanonicalization<TimeZoneIdMapper> {

components/timezone/src/metazone.rs
21:#[cfg(feature = "compiled_data")]
22:impl Default for MetazoneCalculator {
[15:10:45] मanishearth@manishearth-glaptop2 ~/dev/icu4x ^_^ 
$ rg -U "compiled_data.*\nimpl Default for"
components/normalizer/src/properties.rs
51:#[cfg(feature = "compiled_data")]
52:impl Default for CanonicalCompositionBorrowed<'static> {
122:#[cfg(feature = "compiled_data")]
123:impl Default for CanonicalComposition {
196:#[cfg(feature = "compiled_data")]
197:impl Default for CanonicalDecompositionBorrowed<'static> {
463:#[cfg(feature = "compiled_data")]
464:impl Default for CanonicalDecomposition {
553:#[cfg(feature = "compiled_data")]
554:impl Default for CanonicalCombiningClassMapBorrowed<'static> {
644:#[cfg(feature = "compiled_data")]
645:impl Default for CanonicalCombiningClassMap {

components/normalizer/src/uts46.rs
38:#[cfg(feature = "compiled_data")]
39:impl Default for Uts46MapperBorrowed<'static> {
141:#[cfg(feature = "compiled_data")]
142:impl Default for Uts46Mapper {

components/segmenter/src/grapheme.rs
134:#[cfg(feature = "compiled_data")]
135:impl Default for GraphemeClusterSegmenter {

components/segmenter/src/sentence.rs
114:#[cfg(feature = "compiled_data")]
115:impl Default for SentenceSegmenter {

components/locale/src/directionality.rs
40:#[cfg(feature = "compiled_data")]
41:impl Default for LocaleDirectionality {

components/locale/src/expander.rs
212:#[cfg(feature = "compiled_data")]
213:impl Default for LocaleExpander {

components/locale/src/canonicalizer.rs
199:#[cfg(feature = "compiled_data")]
200:impl Default for LocaleCanonicalizer {

components/casemap/src/titlecase.rs
207:#[cfg(feature = "compiled_data")]
208:impl Default for TitlecaseMapper<CaseMapper> {

components/casemap/src/casemapper.rs
41:#[cfg(feature = "compiled_data")]
42:impl Default for CaseMapper {

components/casemap/src/closer.rs
43:#[cfg(feature = "compiled_data")]
44:impl Default for CaseMapCloser<CaseMapper> {

components/timezone/src/ids.rs
96:#[cfg(feature = "compiled_data")]
97:impl Default for TimeZoneIdMapper {
468:#[cfg(feature = "compiled_data")]
469:impl Default for TimeZoneIdMapperWithFastCanonicalization<TimeZoneIdMapper> {

components/timezone/src/windows_tz.rs
41:#[cfg(feature = "compiled_data")]
42:impl Default for WindowsTimeZoneMapper {

components/timezone/src/metazone.rs
21:#[cfg(feature = "compiled_data")]
22:impl Default for MetazoneCalculator {

components/timezone/src/zone_offset.rs
19:#[cfg(feature = "compiled_data")]
20:impl Default for ZoneOffsetCalculator {

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-approval One or more stakeholders need to approve proposal
Projects
None yet
Development

No branches or pull requests

3 participants