Skip to content
This repository has been archived by the owner on Feb 27, 2024. It is now read-only.

List cabal file revisions #9

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

ElvishJerricco
Copy link

This is obviously a breaking change, so this would necessitate a version bump.

@@ -30,7 +31,7 @@ type HackageDB = Map PackageName PackageData

type PackageData = Map Version VersionData

data VersionData = VersionData { cabalFile :: !GenericPackageDescription
data VersionData = VersionData { cabalFileRevisions :: NonEmpty GenericPackageDescription
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storing different revisions in a list means that I cannot easily find a given revision. Before this change, you had essentially a lookup PackageName -> Version -> GenericPackageDescription, which would always give you the latest available revision. If we'd like to make the code aware of revisions, then IMHO it should come in the form of a lookup:

PackageName -> Version -> Revision -> GenericPackageDescription

One possible alternative interface could be to extend our notion of Version with something that is aware of revisions. I suppose that would end up being more space efficient that to create a whole lot of singletons Maps for the majority of package versions, which have never been revised.

@ElvishJerricco
Copy link
Author

Yea. Revisions in the Hackage db are not very well defined; a revision is defined by just abusing the tar format and adding another entry with the same path, with the order of said entries being the order of the revisions. So it's very tempting to think of revisions as a list. But I think you're right that hackage-db should be the improvement on this :)

The other thing to consider is that we always know there's at least one revision (which I failed to represent in the unparsed module), and many people are always going to want the latest one. So in addition to the collection of revisions, we may want guaranteed shortcut access to the latest one. What would this look like in code?

@ElvishJerricco
Copy link
Author

Though now that I think about it, what kind of lookup could we possibly have other than an integer lookup? And isn't that ergonomically the same thing as a list, given !!? Plus NonEmpty has a well defined last, which can be the convenient way to get the latest revision.

@peti
Copy link
Member

peti commented May 29, 2018

I don't know. When I think of a container that's good for indexed access, then I don't think of a linked list. I think of an array. An array can access every element in constant time, but a list needs O(n). Also, the type "list" does not capture the fact well that this container is ordered.

@ElvishJerricco
Copy link
Author

Switching to an array would be good, though I'm not sure if there's anything like NonEmptyVector. What kind of data structure would imply the ordering you expect?

@ElvishJerricco
Copy link
Author

Also, what array type should I use? It appears my use of NonEmpty is the reason this failed travis, because it wasn't there in older GHCs.

@peti
Copy link
Member

peti commented Jun 2, 2018

I'm not sure if there's anything like NonEmptyVector.

I am not aware of a non-empty vector type either.

What kind of data structure would imply the ordering you expect?

In an array, it's obvious that revision i will be store in the ith location. In a linked list, that's not 100% obvious, IMHO, because one might make a case that the most frequently needed revision -- the latest one -- should be first so that it's quick to access. Also, a reversed list is more efficient to construct. With an array, you can access all revisions in O(1) time anyway, so "ordering" is a non-issue there, IMHO.

Personally, I'd go with whatever is the simplest choice from the array package.

@domenkozar
Copy link
Member

domenkozar commented Jul 29, 2018

Doesn't a package always have r0? That means you can just maybeTail and error the package name if it happens.

@Ericson2314
Copy link
Member

Ericson2314 commented Oct 18, 2018

Is the tar tricked used for new versions, or just new revisions? If the later, it can't be more than like 5-ish, so Array is quite a premature optimization. I would much prefer NonEmpty for correctness.

@ElvishJerricco
Copy link
Author

@Ericson2314 only revisions.

@Ericson2314
Copy link
Member

OK then my vote is reverse ordered NonEmpty. And we can fix those ancient GHCs by importing semigroup, though I'd just drop them from CI.

@qrilka
Copy link

qrilka commented Dec 4, 2018

@peti @ElvishJerricco what's the state of this PR? We're looking into adding proper support of custom Hackage revisions into stack2nix and this appears to be a step in that direction

@qrilka
Copy link

qrilka commented Dec 4, 2018

Actually it appears that http://hackage.haskell.org/package/cabal2nix-2.12/docs/Distribution-Nixpkgs-Haskell-Hackage.html duplicates hackage-db a bit so revisions could be added there too, I see that cabal2nix also includes hashes but couldn't those be added to hackage-db so this duplication could be removed?

@ElvishJerricco
Copy link
Author

@qrilka The hashes contained there are the same as what's implemented in hackage-db without this PR. The problem is that they only tell you the hash of one revision, not all revisions.

@qrilka
Copy link

qrilka commented Dec 4, 2018

@ElvishJerricco I understand that, it was just a side question why Distribution.Hackage.DB.Parsed doesn't include those hashes already so parsing could be done in one pass without doing extra lookups in unparsed HackageDB.
Actually I'm doing a branch of cabal2nix taking use of this PR as it is, any idea about its current fate?

@qrilka
Copy link

qrilka commented Dec 4, 2018

Simple compatibility is quite trivial - qrilka/cabal2nix@5b5f25f and my larger task is to allow something like cabal://acme-missiles-0.3@sha256:2ba66a092a32593880a87fb00f3213762d7bca65a687d45965778deb8694c5d1 and cabal://acme-missiles-0.3@rev:0

@peti
Copy link
Member

peti commented Dec 4, 2018

Why doesn't include Distribution.Hackage.DB.Parsed those hashes already?

cabal2nix needs the hash of the revised Cabal file (if such an revision exists). That information is not included in the tarball released by Hackage. Therefore, hackage-db does not offer that information. So cabal2nix computes that hash itself and adds it into its own variant of the HackageDB type.

@ElvishJerricco
Copy link
Author

@peti Which information is not included in the tarball? The cabal file hashes certainly are included.

@peti
Copy link
Member

peti commented Dec 4, 2018

The cabal file hashes certainly are included.

They are?

$ json_pp </tmp/extract-cabal-install-tarball/zlib/0.6.1.1/package.json
{
   "signatures" : [],
   "signed" : {
      "_type" : "Targets",
      "version" : 0,
      "expires" : null,
      "targets" : {
         "<repo>/package/zlib-0.6.1.1.tar.gz" : {
            "hashes" : {
               "md5" : "d841c86562542a55dd086df69764ee17",
               "sha256" : "c5f5b4285473657a7997d74f7642f3e7bda40f92c3c5d49471a899e27a4ba735"
            },
            "length" : 142902
         }
      }
   }
}

Where can I find that hash?

@ElvishJerricco
Copy link
Author

@peti Oh my bad. Yea, it's not recorded conveniently. But the cabal files themselves are present, and you can simply hash the file yourself.

@peti
Copy link
Member

peti commented Dec 4, 2018

But the cabal files themselves are present, and you can simply hash the file yourself.

No kidding. That's exactly what cabal2nix is doing.

@ElvishJerricco
Copy link
Author

@peti Oh, so you're suggesting that hackage-db should leave it to the users to compute the hashes, just like the tarball?

@peti
Copy link
Member

peti commented Dec 4, 2018

You're suggesting that hackage-db should leave it to the users to compute the hashes, just like the tarball?

Yes. That is what the current implementation does. We have the split between parsed and unparsed representations precisely so that users can implement their own types that capture exactly that amout of data that they need and nothing else.

@qrilka
Copy link

qrilka commented Dec 11, 2018

Sorry for raising this again but what's the state of this PR? Could I help getting it merged?

@qrilka
Copy link

qrilka commented Dec 11, 2018

@ElvishJerricco that PR (if you mean NixOS/cabal2nix#393 ) depends on this one and it's not a problem to update it - it shouldn't be a blocker
As for the array type, I'm inclined to agree with @Ericson2314 that array here (when there should be very few revision per version) is a bit premature optimization.

@ElvishJerricco
Copy link
Author

@qrilka I don't think the array type choice is about performance. I think it's more about providing a satisfying API that makes it clear what we're representing and how to get the Cabal file you want. For instance it needs to be immediately obvious how to get the latest revision, but a list doesn't tell you which order it's in

@peti
Copy link
Member

peti commented Dec 14, 2018

Sorry for raising this again but what's the state of this PR? Could I help getting it merged?

I am not happy about the API that this PR implements and I am not going to merge it in its current state. I don't like the use of a list to store the different revisions; I'd prefer an array or some other container that has an intuitive Int -> GenericPackageDescription lookup, preferably with O(1) complexity when accessing the latest revision, which is the one most people will want.

Secondly, I am not convinced that the normal Hackage DB API should expose revisions at all. The way I understand revisions, there should be no reason to prefer a Cabal file in revision n if n+1 exists, i.e. you generally want the latest one. I believe cabal-install doesn't support choosing revisions either. There's only the choice between --pristine or the latest version, but nothing in between.

Carrying all revisions of all Cabal files in the HackageDB increases its size quite a bit and I'm not sure what is gained by adding all that infomation.

Now, I realize that some tools may need access to revisions. I would argue, though, that these tools would be better off implementing whatever it is they need, exactly, on top of the "unparsed" API, which exists precisely so that such a thing is possible (and cabal2nix does it that way, too).

@ElvishJerricco
Copy link
Author

The hackage db tarball explicitly exposes all revisions. If we mean to reflect the tarball exactly, we must expose them as well. It is cabal-install who ignores earlier revisions. Other tools need earlier revisions, and are based on the hackage tarball.

I don't really care one way or the other about the array type (I just need a concrete suggestion), but suggesting that we shouldn't even expose the revisions is definitely wrong.

@peti
Copy link
Member

peti commented Dec 14, 2018 via email

@ElvishJerricco
Copy link
Author

The directory structure of the tarball when extracted is decidedly not the schema of the tarball. The tar file format is he schema, and hackage keeps the revisions recorded there explicitly and intentionally. I think this should absolutely be respected, because anything which consumes the tarball should be capable of consuming this library instead. Does that sound like a reasonable goal?

@peti
Copy link
Member

peti commented Dec 14, 2018 via email

@Ericson2314
Copy link
Member

commercialhaskell/stack@cbf7986 this commit shows two things.

  1. The stack world tries to pin an exact revision.

  2. There is in fact extant two tarball versions, indicating a schema change. The schemas differ in including revisions (and perhaps other details).

@ElvishJerricco
Copy link
Author

@peti yea, the 01-index.tar version of the index came about to add revisions, leaving the old tar schema untouched in 00-index.tar.

Would an API like this be acceptable?

data VersionData = VersionData { cabalFile         :: ByteString
                               , metaFile          :: ByteString
                               , previousRevisions :: Vector ByteString -- ^ Oldest is first; excludes latest.
                               }

This probably requires no changes from consumers of this library, provides safe access to the latest cabalFile, and constant time access to all revisions that came before it.

@ElvishJerricco
Copy link
Author

Alright, I've got two different approaches to this. There's a minor difficulty in the code that reads the tar file; if it encounters a package.json first, it just sets cabalFile to the empty bytestring and vice versa. So you have to be able to detect whether that cabal file is real or a dummy.

I've got two approaches I've just pushed.

  1. master...ElvishJerricco:cabal-file-revisions-validation-inplace

    Just ignores the file if it is the empty bytestring.

  2. master...ElvishJerricco:cabal-file-revisions-validation-pass

    Creates an intermediate type that allows Nothings to represent unfound files. It is validated and converted to the regular Unparsed type.

The former is probably the better one to merge, just because it's a lot simpler and probably more performant. But the latter is more technically satisfying to me, since it seems a little more "correct", so to speak.

@peti
Copy link
Member

peti commented Dec 21, 2018

Would an API like this be acceptable?

data VersionData = VersionData { cabalFile         :: ByteString
                               , metaFile          :: ByteString
                               , previousRevisions :: Vector ByteString -- ^ Oldest is first; excludes latest.
                               }

Unfortunately, that API is somewhat wasteful for users who don't care about revisions (like most of my tools, which always use the latest revision).

The more I think about this issue, the more it becomes apparent that there can never be a single HackageDB type that is good for all purposes. I have identified the following conflicting use-cases so far:

  • Some tools are happy processing the contents of Hackages as a stream of entries. These tools can run in constant space no matter how large the tarball is. Other tools need access to Hackage in a way that cannot be streamed. For example, if you want to honor "preferred-versions" in some way, then you have to parse the entire tarball before you know the proper contents of those files, so your space requirements grow linearly with the size of the tarball.

  • Some tools need all of Hackage, maybe because they want to render its contents to HTML or because they want to compute statistics. Other tools need only a subset of Hackage, e.g. all available information about one particular package. Or maybe they need only the latest version of every package, etc...

    Now, a data structure that makes the first kind of tool happy is wasteful for the second kind. One could argue that constructing a Map of Hackage which is then filtered to contain only those entries that actually matter is fine in a lazily evaluated language, but my experience is that this is not true in practice, though, because the compiler cannot possibly optimize the code to the extend that would be possible theoretically, and so you'll most probably end up having crazy memory requirements unless you litter the code with explicit INLINE statements and all kinds of other pragmas.

  • Some tools care about revisions and need them explicitly represented in the HackageDB, others don't. If you store all revisions, then a lot of space is wasted for those tools that don't need them. If you don't store them, then the data structure is useless for those that do need them. In some occasions tools may need support for selecting one particular revision, but they may not need access to all of them. In that case, a revision-less data structure will suffice, but you'll need a good way to choose efficiently which revision you're interested in.

  • Some tools need access to the additional meta information found in package.json files these days. Other tools don't care about those at all.

  • Some tools want to honor "preferred-versions", i.e. deprecated versions of a package shouldn't show up in HackageDB, even. Such a tool must necessarily parse the entire tarball before it can do anything. Other tools don't care for "preferred-versions" and don't need it represented implicitly or explicitly in the data structure.

  • Some tools need access to the Cabal file in ByteString form, e.g. because they want to compute a hash of that file (like cabal2nix does). Other tools don't care and would rather see the parsed GenericPackageDescription type -- or the finalized PackageDescription type, even.

Now, it seems to me that the only way to cover those use cases is to make it very easy to define your own HackageDB type and parse the tarball into that format as efficiently as possible. The t/new-api branch contains to experimental versions of the code that's currently found in Unparsed.hs. Both modules implement parsing and traversal of the tarball, but defer to user code to actually handle the contents of the data files as they are encountered. https://github.com/peti/hackage-db/blob/t/new-api/new-callback-api.hs does so in a straight-forward callback function style. https://github.com/peti/hackage-db/blob/t/new-api/new-class-api.hs implements virtually the same approach but using statically typed type-classes.

I've not yet decided which one is nicer. The callback function Builder can be more general, it appears, because I can easily define an instance that works for all MonadState HackageDB m. It seems like I cannot define the same class instance without relying on UndecidableInstances or type families, but I might be wrong.

peti added a commit that referenced this pull request Dec 21, 2018
See #9 (comment) for
details on the motivation behind new-callback-api.hs and new-class-api.hs.

Dropped support for older compilers. Building with GHC versions prior to 7.10.x
is too much effort.

Added parseIso8601 function to Distribution.Hackage.DB.Utility.
peti added a commit that referenced this pull request Dec 21, 2018
See #9 (comment) for
details on the motivation behind new-callback-api.hs and new-class-api.hs.

Dropped support for older compilers. Building with GHC versions prior to 7.10.x
is too much effort.

Added parseIso8601 function to Distribution.Hackage.DB.Utility.
@ElvishJerricco
Copy link
Author

What about a stream style API, like the tar library?

data HackageEntry
  = PreferredVersions BSL.ByteString
  | CabalFile Version BSL.ByteString
  | MetaFile Version BSL.ByteString

data HackageEntries
  = Done
  | Next PackageName EpochTime HackageEntry HackageEntries
  | Fail SomeException

parseTarball :: BSL.ByteString -> HackageEntries

This gives the same basic concept as your branches, but a little simpler. Also makes it easy to make a Conduit BSL.ByteString HackageEntry m ()

That said, I think the scope you're trying to achieve here might be much greater than it needs to be. I don't think this library needs to be the optimal algorithm for every possible use case. In my opinion, the purpose of this library is to simply provide an in-memory representation of the tarball's schema. Just because cabal2nix doesn't need revisions doesn't mean hackage-db shouldn't have them. IMO this library should just faithfully represent the tarball with a data structure, and doing that means representing the revisions. Anyone who doesn't need them can just ignore them.

@peti
Copy link
Member

peti commented Dec 22, 2018 via email

@ElvishJerricco
Copy link
Author

@peti To me, the stream style API is just a lot simpler to work with. For instance, if the user wants to terminate early, they can just return from their function. Whereas with your Builder implementation, they have to wrap and unwrap all their code with MaybeT. More complex goals are likely going to require creating new Applicatives altogether, which is burdensome. I just don't find the use of Applicatives to be a particularly ergonomic way to deal with this. Seems a lot easier to just case on an Entries type.

We can easily define convenient APIs on top of 'parseTarball' that provide multiple, efficient views into the data

But is that API an actually useful library? At that point, we're basically talking about a thin wrapper around the tar library that just tags each entry with the metadata about what it is in the Hackage DB and fails on invalid Hackage tarballs. That's kind of useful, but the stateful process of building the large Map representing the whole tarball is the main reason I find this library useful to begin with, not simply reading the tarball.

@peti
Copy link
Member

peti commented Dec 22, 2018

The stream style API is just a lot simpler to work with. For instance, if the user wants to terminate early, they can just return from their function. Whereas with your Builder implementation, they have to wrap and unwrap all their code with MaybeT.

I don't expect that many users will ever use the Builder API. It's low-level code intended to create a custom HackageDB representation very efficiently. That HackageDB is what most users will use. The point is, though, that users who have specialized requirements -- like cabal2nix does -- can use Builder to fulfill them if they like to. Most people won't, I suppose, but I will.

CPS is more complicated than processing a sequence of data types, no doubt. The advantage is, though, that it gets the job done without creating any intermediate data structures, which is a nightmare memory-wise given the size of the Hackage tarball.

We can easily define convenient APIs on top of 'parseTarball' that provide multiple, efficient views into the data

But is that API an actually useful library?

As I said, the point of hackage-db is not Builder. It's those multiple efficient views into the data that are implemented on top of Builder.

@ElvishJerricco
Copy link
Author

My understanding is that the space overhead of the intermediate data structure is O(1), not O(size of db). And that constant should be very tiny; the size of a single constructor. I guess you could argue it pressures the GC a little and might increase pause frequency, but one constructor allocation per entry is surely negligible compared to the allocation the user code is likely to be doing for each entry.


Regardless, none of this addresses my criticism that the view this library provides should be representative of the hackage tarball schema, which does include revisions. Whatever HackageDB type we export should include them IMO.

@Ericson2314
Copy link
Member

Ericson2314 commented Dec 23, 2018

Yeah until we actually run into empirical performance problems, this entire discussion seems wildly premature. Let's just merge something simple that supports both use-cases. If performance issues arise in practice, I'm sure the rest of us would be happy finish the streaming API, or whatever else is needed.

@peti
Copy link
Member

peti commented Dec 23, 2018

@ElvishJerricco,

My understanding is that the space overhead of the intermediate data structure is O(1), not O(size of db).

Yes, space overhead is probably O(1) if the code is structured correctly and doesn't fall into any lazy evaluation traps. Time overhead is O(n) though.

Anyhow, I like the CPS Builder API, and since it's supposed to be an implementation detail anyway, I see no strong argument against it.

None of this addresses my criticism that the view this library provides should be representative of the hackage tarball schema, which does include revisions. Whatever HackageDB type we export should include them IMO.

I see no reason why hackage-db has to have one view. I think it's perfectly fine to have more than one. It already does have more than one, i.e. Parsed and Unparsed. I don't understand why that would be a problem.

@Ericson2314,

Yeah until we actually run into empirical performance problems, this entire discussion seems wildly premature.

I have run into performance problems. Keep in mind that I have been using this code for production purposes for a long time.

@ElvishJerricco
Copy link
Author

Yes, space overhead is probably O(1) if the code is structured correctly and doesn't fall into any lazy evaluation traps. Time overhead is O(n) though.

It's rather difficult to create problems with the streaming api due to laziness in my opinion. And the O(n) time overhead is multiplied by an extremely small constant. User code is certain to make this negligible.

Anyway, the discussion of the implementation detail seems, at this point, off topic. That change can be made separately, before or after this PR.

As for this PR, @peti are you suggesting that you would be ok with having another doubling of modules, half with revisions and half without?

@ElvishJerricco
Copy link
Author

It'd also be good to know what your actual performance constraints are, so we can tell what is and isn't negligible. But I guess thats also off topic.

@peti
Copy link
Member

peti commented Dec 24, 2018 via email

@ElvishJerricco
Copy link
Author

Yes, that's exactly what I'm suggesting.

Cool, I can get behind that.

As for why my API idea is better, as I said, I just think it's far simpler and easier to use, with effectively no drawbacks.

@Ericson2314
Copy link
Member

@peti and second, to been if there are empirical performance problems, they could well be unaffected by all of this.

That said, if you want to double the interface for other reasons I agree that's fine. Do your have any PR for that work—something Will and I can contribute to? It would be a shame for this more ambitious design to also keep Will blocked on it landing for far longer.

@qrilka
Copy link

qrilka commented Feb 15, 2019

@peti are there any plans to resolve this issue? In December you said that you have some API that you like and at the same time I can't find any published code like that. Is there some way to help revisions appear in hackage-db and in cabal2nix in the end?

@peti
Copy link
Member

peti commented Feb 15, 2019 via email

@qrilka
Copy link

qrilka commented Feb 15, 2019

Oh, that's great to hear, I'll try to create something.

sternenseemann pushed a commit to NixOS/cabal2nix that referenced this pull request Jul 21, 2022
See NixOS/hackage-db#9 (comment) for
details on the motivation behind new-callback-api.hs and new-class-api.hs.

Dropped support for older compilers. Building with GHC versions prior to 7.10.x
is too much effort.

Added parseIso8601 function to Distribution.Hackage.DB.Utility.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants