List cabal file revisions #9
@@ -30,7 +31,7 @@ type HackageDB = Map PackageName PackageData

 type PackageData = Map Version VersionData

-data VersionData = VersionData { cabalFile :: !GenericPackageDescription
+data VersionData = VersionData { cabalFileRevisions :: NonEmpty GenericPackageDescription
Storing different revisions in a list means that I cannot easily find a given revision. Before this change, you had essentially a lookup PackageName -> Version -> GenericPackageDescription, which would always give you the latest available revision. If we'd like to make the code aware of revisions, then IMHO it should come in the form of a lookup:
PackageName -> Version -> Revision -> GenericPackageDescription
One possible alternative interface could be to extend our notion of Version with something that is aware of revisions. I suppose that would end up being more space efficient than creating a whole lot of singleton Maps for the majority of package versions, which have never been revised.
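To make the two shapes discussed here concrete, the following is only a rough sketch, not code from this PR. All type synonyms are hypothetical stand-ins (String for GenericPackageDescription, a list of Int for Version), and the function names are made up for illustration. It contrasts an extra level of Map nesting with a revision-aware key.

```haskell
import qualified Data.Map.Strict as Map
import Data.Map.Strict (Map)

-- Hypothetical stand-ins; the real library uses Cabal's own types.
type PackageName = String
type Version     = [Int]
type Revision    = Int
type CabalFile   = String  -- stands in for GenericPackageDescription

-- Shape A: one more level of nesting. Every unrevised version
-- carries a singleton inner Map.
type NestedDB = Map PackageName (Map Version (Map Revision CabalFile))

lookupNested :: PackageName -> Version -> Revision -> NestedDB -> Maybe CabalFile
lookupNested pn v r db = Map.lookup pn db >>= Map.lookup v >>= Map.lookup r

-- Shape B: the revision becomes part of the key, so no extra Map
-- is allocated per version.
type FlatDB = Map PackageName (Map (Version, Revision) CabalFile)

lookupFlat :: PackageName -> Version -> Revision -> FlatDB -> Maybe CabalFile
lookupFlat pn v r db = Map.lookup pn db >>= Map.lookup (v, r)

-- Latest revision of a version: the greatest key (v, r) for this v.
latestFlat :: PackageName -> Version -> FlatDB -> Maybe (Revision, CabalFile)
latestFlat pn v db = do
  versions      <- Map.lookup pn db
  ((v', r), cf) <- Map.lookupLE (v, maxBound) versions
  if v' == v then Just (r, cf) else Nothing
```

With shape B the "latest revision" shortcut is a single ordered lookup, at the cost of a slightly less obvious key type.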
Yea. Revisions in the Hackage db are not very well defined; a revision is defined by just abusing the tar format and adding another entry with the same path, with the order of said entries being the order of the revisions. So it's very tempting to think of revisions as a list. But I think you're right that hackage-db should be the improvement on this :) The other thing to consider is that we always know there's at least one revision (which I failed to represent in the unparsed module), and many people are always going to want the latest one. So in addition to the collection of revisions, we may want guaranteed shortcut access to the latest one. What would this look like in code? |
Though now that I think about it, what kind of lookup could we possibly have other than an integer lookup? And isn't that ergonomically the same thing as a list, given |
I don't know. When I think of a container that's good for indexed access, then I don't think of a linked list. I think of an array. An array can access every element in constant time, but a list needs O(n). Also, the type "list" does not capture the fact well that this container is ordered. |
Switching to an array would be good, though I'm not sure if there's anything like NonEmptyVector. What kind of data structure would imply the ordering you expect? |
Also, what array type should I use? It appears my use of NonEmpty is the reason this failed travis, because it wasn't there in older GHCs. |
I am not aware of a non-empty vector type either.
In an array, it's obvious that revision i will be stored in the ith location. In a linked list, that's not 100% obvious, IMHO, because one might make a case that the most frequently needed revision -- the latest one -- should be first so that it's quick to access. Also, a reversed list is more efficient to construct. With an array, you can access all revisions in O(1) time anyway, so "ordering" is a non-issue there, IMHO. Personally, I'd go with whatever is the simplest choice from the |
Doesn't a package always have r0? That means you can just maybeTail and error the package name if it happens. |
Is the tar trick used for new versions, or just new revisions? If the latter, it can't be more than like 5-ish, so Array is quite a premature optimization. I would much prefer NonEmpty for correctness. |
@Ericson2314 only revisions. |
OK then my vote is reverse ordered |
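A reverse-ordered NonEmpty could look roughly like the sketch below. This is an illustration under assumptions, not the PR's code: CabalFile is a String stand-in, and the Revisions wrapper and its functions are hypothetical names. It relies on Data.List.NonEmpty from base, which (as noted above) is missing from older GHCs.

```haskell
import Data.List.NonEmpty (NonEmpty, (<|))
import qualified Data.List.NonEmpty as NE

type CabalFile = String  -- hypothetical stand-in for the parsed Cabal file

-- Revisions stored newest first; the type guarantees that r0 exists.
newtype Revisions = Revisions (NonEmpty CabalFile)

-- Build from tarball order (oldest entry first). Nothing if there are no entries.
fromTarOrder :: [CabalFile] -> Maybe Revisions
fromTarOrder = fmap (Revisions . NE.reverse) . NE.nonEmpty

-- Appending a newer revision is an O(1) cons.
addRevision :: CabalFile -> Revisions -> Revisions
addRevision cf (Revisions rs) = Revisions (cf <| rs)

-- The latest revision is always the head, in O(1).
latest :: Revisions -> CabalFile
latest (Revisions rs) = NE.head rs

-- Revision number i (0 = original upload). O(n), with n expected to be small.
revision :: Int -> Revisions -> Maybe CabalFile
revision i (Revisions rs)
  | i < 0 || i >= n = Nothing
  | otherwise       = Just (rs NE.!! (n - 1 - i))
  where
    n = NE.length rs
```

Hiding the list behind a newtype addresses the "which order is it in" concern: callers only see latest and revision, never the raw ordering.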
@peti @ElvishJerricco what's the state of this PR? We're looking into adding proper support of custom Hackage revisions into |
Actually it appears that http://hackage.haskell.org/package/cabal2nix-2.12/docs/Distribution-Nixpkgs-Haskell-Hackage.html duplicates |
@qrilka The hashes contained there are the same as what's implemented in hackage-db without this PR. The problem is that they only tell you the hash of one revision, not all revisions. |
@ElvishJerricco I understand that, it was just a side question why |
Simple compatibility is quite trivial - qrilka/cabal2nix@5b5f25f and my larger task is to allow something like cabal://acme-missiles-0.3@sha256:2ba66a092a32593880a87fb00f3213762d7bca65a687d45965778deb8694c5d1 and cabal://acme-missiles-0.3@rev:0 |
|
@peti Which information is not included in the tarball? The cabal file hashes certainly are included. |
They are?
Where can I find that hash? |
@peti Oh my bad. Yea, it's not recorded conveniently. But the cabal files themselves are present, and you can simply hash the file yourself. |
No kidding. That's exactly what |
@peti Oh, so you're suggesting that |
Yes. That is what the current implementation does. We have the split between parsed and unparsed representations precisely so that users can implement their own types that capture exactly that amount of data that they need and nothing else. |
Sorry for raising this again but what's the state of this PR? Could I help getting it merged? |
@ElvishJerricco that PR (if you mean NixOS/cabal2nix#393 ) depends on this one and it's not a problem to update it - it shouldn't be a blocker |
@qrilka I don't think the array type choice is about performance. I think it's more about providing a satisfying API that makes it clear what we're representing and how to get the Cabal file you want. For instance it needs to be immediately obvious how to get the latest revision, but a list doesn't tell you which order it's in |
I am not happy about the API that this PR implements and I am not going to merge it in its current state. I don't like the use of a list to store the different revisions; I'd prefer an array or some other container that has an intuitive Secondly, I am not convinced that the normal Hackage DB API should expose revisions at all. The way I understand revisions, there should be no reason to prefer a Cabal file in revision n if n+1 exists, i.e. you generally want the latest one. I believe Carrying all revisions of all Cabal files in the HackageDB increases its size quite a bit and I'm not sure what is gained by adding all that information. Now, I realize that some tools may need access to revisions. I would argue, though, that these tools would be better off implementing whatever it is they need, exactly, on top of the "unparsed" API, which exists precisely so that such a thing is possible (and |
The hackage db tarball explicitly exposes all revisions. If we mean to reflect the tarball exactly, we must expose them as well. It is cabal-install who ignores earlier revisions. Other tools need earlier revisions, and are based on the hackage tarball. I don't really care one way or the other about the array type (I just need a concrete suggestion), but suggesting that we shouldn't even expose the revisions is definitely wrong. |
The hackage db tarball explicitly exposes all revisions.
No, it does not. Revisions are by no means "explicit" in the tarball.
Newer revisions of the same file simply overwrite the old one. If you
extract that tarball into a directory tree, you won't see a single
"revision" anywhere; you'll just see the respective latest version of
every file.
I don't really care one way or the other about the array type (I just
need a concrete suggestion)
I don't have a concrete suggestion for you, I'm afraid. If I had one,
I'd just implement it myself. If you would like to continue working on
this feature, then the only feasible way forward is to be willing to
explore the design space together with me. This means that there has to
be a willingness to try and fail and to take each other's opinions
seriously.
Suggesting that we shouldn't even expose the revisions is definitely
wrong.
Talking about design decisions in terms of "right" and "wrong" is not
very meaningful. There is no correct choice. There are different choices
that achieve different goals. If your goal is to expose revisions in the
default HackageDB module at all costs, then we have different goals.
Neither cabal2nix nor cabal2obs -- which are the main consumers of this
library -- need explicit support for revisions. Therefore, I don't
perceive this as an essential feature. If you are willing to accept that
and work constructively to find a way forward that makes both of us happy,
then let's do it. If you think that my point of view is wrong, however,
then you are probably better off forking this library.
|
The directory structure of the tarball when extracted is decidedly not the schema of the tarball. The tar file format is the schema, and hackage keeps the revisions recorded there explicitly and intentionally. I think this should absolutely be respected, because anything which consumes the tarball should be capable of consuming this library instead. Does that sound like a reasonable goal? |
The tar format was chosen *long* before revisions existed. I don't see how that choice constitutes a mandate for HackageDB to expose revisions in its default API.
Anyhow, I don't want revisions exposed in the default API simply because most tools that use this library (and that I am aware of) don't need them. The general consensus appears to be that you always use the latest one.
|
commercialhaskell/stack@cbf7986 this commit shows two things.
|
@peti yea, the 01-index.tar version of the index came about to add revisions, leaving the old tar schema untouched in 00-index.tar. Would an API like this be acceptable?
data VersionData = VersionData { cabalFile :: ByteString
                               , metaFile :: ByteString
                               , previousRevisions :: Vector ByteString -- ^ Oldest is first; excludes latest.
                               }
This probably requires no changes from consumers of this library, provides safe access to the latest cabalFile, and constant time access to all revisions that came before it. |
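As a rough illustration of how a tar reader might maintain such a record, here is a sketch under stated assumptions. It swaps in Data.Sequence for the proposed Vector purely to stay within the packages that ship with GHC, so indexing is O(log n) rather than constant time. The helper names (firstUpload, reviseCabalFile, lookupRevision) are hypothetical and not part of hackage-db.

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as BC
import Data.Sequence (Seq, (|>))
import qualified Data.Sequence as Seq

data VersionData = VersionData
  { cabalFile         :: ByteString      -- latest revision, always present
  , metaFile          :: ByteString
  , previousRevisions :: Seq ByteString  -- oldest first; excludes the latest
  }

-- The first entry seen for a version is revision 0.
firstUpload :: ByteString -> ByteString -> VersionData
firstUpload cf mf = VersionData cf mf Seq.empty

-- A later tar entry with the same path: the current latest file moves
-- to the end of the history and the new file becomes the latest.
reviseCabalFile :: ByteString -> VersionData -> VersionData
reviseCabalFile newer vd = vd
  { cabalFile         = newer
  , previousRevisions = previousRevisions vd |> cabalFile vd
  }

-- Revision i, where the highest valid index is the latest file.
lookupRevision :: Int -> VersionData -> Maybe ByteString
lookupRevision i vd
  | i == Seq.length prev = Just (cabalFile vd)
  | otherwise            = Seq.lookup i prev
  where
    prev = previousRevisions vd

-- Example value: a version that has been revised once.
example :: VersionData
example = reviseCabalFile (BC.pack "rev1") (firstUpload (BC.pack "rev0") (BC.pack "{}"))
```

Consumers that only read cabalFile are unaffected by the extra field, which is the compatibility argument made above.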
Alright, I've got two different approaches to this. There's a minor difficulty in the code that reads the tar file; if it encounters a I've got two approaches I've just pushed.
The former is probably the better one to merge, just because it's a lot simpler and probably more performant. But the latter is more technically satisfying to me, since it seems a little more "correct", so to speak. |
Unfortunately, that API is somewhat wasteful for users who don't care about revisions (like most of my tools, which always use the latest revision). The more I think about this issue, the more it becomes apparent that there can never be a single
Now, it seems to me that the only way to cover those use cases is to make it very easy to define your own I've not yet decided which one is nicer. The callback function |
See #9 (comment) for details on the motivation behind new-callback-api.hs and new-class-api.hs. Dropped support for older compilers. Building with GHC versions prior to 7.10.x is too much effort. Added parseIso8601 function to Distribution.Hackage.DB.Utility.
What about a stream style API, like the tar library?
data HackageEntry
  = PreferredVersions BSL.ByteString
  | CabalFile Version BSL.ByteString
  | MetaFile Version BSL.ByteString
data HackageEntries
  = Done
  | Next PackageName EpochTime HackageEntry HackageEntries
  | Fail SomeException
parseTarball :: BSL.ByteString -> HackageEntries
This gives the same basic concept as your branches, but a little simpler. Also makes it easy to make a That said, I think the scope you're trying to achieve here might be much greater than it needs to be. I don't think this library needs to be the optimal algorithm for every possible use case. In my opinion, the purpose of this library is to simply provide an in-memory representation of the tarball's schema. Just because cabal2nix doesn't need revisions doesn't mean hackage-db shouldn't have them. IMO this library should just faithfully represent the tarball with a data structure, and doing that means representing the revisions. Anyone who doesn't need them can just ignore them. |
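To show how a consumer might use such a stream, here is a minimal sketch. It repeats the two data types from the comment with hypothetical stand-ins for PackageName, Version and EpochTime, and it assumes entries arrive in tarball order with non-decreasing timestamps. The functions allRevisions and entriesUntil are made-up names, and parseTarball itself is not implemented here.

```haskell
import Control.Exception (SomeException)
import qualified Data.ByteString.Lazy as BSL
import qualified Data.Map.Strict as Map
import Data.Map.Strict (Map)

-- Hypothetical stand-ins for the real Cabal and unix types.
type PackageName = String
type Version     = [Int]
type EpochTime   = Int

data HackageEntry
  = PreferredVersions BSL.ByteString
  | CabalFile Version BSL.ByteString
  | MetaFile Version BSL.ByteString

data HackageEntries
  = Done
  | Next PackageName EpochTime HackageEntry HackageEntries
  | Fail SomeException

-- Collect every Cabal file revision per package version, oldest first.
allRevisions :: HackageEntries
             -> Either SomeException (Map (PackageName, Version) [BSL.ByteString])
allRevisions = go Map.empty
  where
    go acc Done     = Right (Map.map reverse acc)
    go _   (Fail e) = Left e
    go acc (Next pn _ (CabalFile v bs) rest) =
      -- Prepend each new revision; the lists are reversed once at the end.
      let acc' = Map.insertWith (++) (pn, v) [bs] acc
      in  acc' `seq` go acc' rest
    go acc (Next _ _ _ rest) = go acc rest

-- Early termination: cut the stream at a timestamp to view an older snapshot.
entriesUntil :: EpochTime -> HackageEntries -> HackageEntries
entriesUntil t (Next pn ts e rest)
  | ts <= t   = Next pn ts e (entriesUntil t rest)
  | otherwise = Done
entriesUntil _ other = other
```

Because the stream is an ordinary lazy data structure, stopping early is just a matter of not demanding the rest of it.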
What about a stream style API, like the tar library?
Well, I don't know. I have experimented with the CPS API now and quite like it. What are the respective advantages and disadvantages of such an API compared to the one I've created in your opinion?
I think the scope you're trying to achieve here might be much greater than it needs to be. I don't think this library needs to be the optimal algorithm for every possible use case.
I agree trying to consider every possible use case is hopeless. And I am not trying to do that. The use cases I'm considering are the ones I, personally, have in my own projects that use hackage-db.
Just because cabal2nix doesn't need revisions doesn't mean hackage-db shouldn't have them. [...] Anyone who doesn't need them can just ignore them.
There is no need for cabal2nix to ignore revisions. We can easily define convenient APIs on top of 'parseTarball' that provide multiple, efficient views into the data: one view that's suitable for those who want revisions and one view for those who don't want them. That recognition that one size does not fit all is the whole point of this exercise.
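One way to picture "multiple views into the data" is a record of callbacks that is folded over the index entries. The sketch below is purely hypothetical and is not the code in master: RawEntry stands in for whatever the tar reader yields, the list argument stands in for the reader itself, and Builder, runBuilder, latestOnly and revisionCounts are illustrative names only.

```haskell
import qualified Data.ByteString.Lazy as BSL
import Data.List (foldl')
import qualified Data.Map.Strict as Map
import Data.Map.Strict (Map)

-- Hypothetical stand-ins for the real types.
type PackageName = String
type Version     = [Int]

-- A raw index entry, standing in for what a tar reader would produce.
data RawEntry
  = RawCabal PackageName Version BSL.ByteString
  | RawMeta  PackageName Version BSL.ByteString

-- A view is described by how it absorbs each kind of entry.
data Builder a = Builder
  { onCabalFile :: PackageName -> Version -> BSL.ByteString -> a -> a
  , onMetaFile  :: PackageName -> Version -> BSL.ByteString -> a -> a
  }

-- Feed entries to the callbacks in order, strictly accumulating the view.
runBuilder :: Builder a -> a -> [RawEntry] -> a
runBuilder b = foldl' step
  where
    step acc (RawCabal pn v bs) = onCabalFile b pn v bs acc
    step acc (RawMeta  pn v bs) = onMetaFile  b pn v bs acc

-- View 1: keep only the latest Cabal file; later entries overwrite earlier ones.
latestOnly :: Builder (Map (PackageName, Version) BSL.ByteString)
latestOnly = Builder
  { onCabalFile = \pn v bs -> Map.insert (pn, v) bs
  , onMetaFile  = \_ _ _ -> id
  }

-- View 2: a revision-aware view that counts revisions per version.
revisionCounts :: Builder (Map (PackageName, Version) Int)
revisionCounts = Builder
  { onCabalFile = \pn v _ -> Map.insertWith (+) (pn, v) 1
  , onMetaFile  = \_ _ _ -> id
  }
```

Each consumer pays only for the data its own callbacks retain, which is the "one size does not fit all" point made above.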
|
@peti To me, the stream style API is just a lot simpler to work with. For instance, if the user wants to terminate early, they can just return from their function. Whereas with your Builder implementation, they have to wrap and unwrap all their code with
But is that API an actually useful library? At that point, we're basically talking about a thin wrapper around the |
I don't expect that many users will ever use the CPS is more complicated than processing a sequence of data types, no doubt. The advantage is, though, that it gets the job done without creating any intermediate data structures, which is a nightmare memory-wise given the size of the Hackage tarball.
As I said, the point of hackage-db is not |
My understanding is that the space overhead of the intermediate data structure is O(1), not O(size of db). And that constant should be very tiny; the size of a single constructor. I guess you could argue it pressures the GC a little and might increase pause frequency, but one constructor allocation per entry is surely negligible compared to the allocation the user code is likely to be doing for each entry. Regardless, none of this addresses my criticism that the view this library provides should be representative of the hackage tarball schema, which does include revisions. Whatever |
Yeah, until we actually run into empirical performance problems, this entire discussion seems wildly premature. Let's just merge something simple that supports both use-cases. If performance issues arise in practice, I'm sure the rest of us would be happy to finish the streaming API, or whatever else is needed. |
Yes, space overhead is probably O(1) if the code is structured correctly and doesn't fall into any lazy evaluation traps. Time overhead is O(n) though. Anyhow, I like the CPS Builder API, and since it's supposed to be an implementation detail anyway, I see no strong argument against it.
I see no reason why
I have run into performance problems. Keep in mind that I have been using this code for production purposes for a long time. |
It's rather difficult to create problems with the streaming api due to laziness in my opinion. And the O(n) time overhead is multiplied by an extremely small constant. User code is certain to make this negligible. Anyway, the discussion of the implementation detail seems, at this point, off topic. That change can be made separately, before or after this PR. As for this PR, @peti are you suggesting that you would be ok with having another doubling of modules, half with revisions and half without? |
It'd also be good to know what your actual performance constraints are, so we can tell what is and isn't negligible. But I guess that's also off topic. |
Will Fancher writes:
It's rather difficult to create problems with the streaming api due to laziness in my opinion. And the O(n) time overhead is multiplied by an extremely small constant. User code is certain to make this negligible.
Be this as it may, I have come up with an internal API that I like and which works well. If you want to sway me to use something else, then you'll have to argue that your preferred API has significant advantages over mine. Arguing that your preferred API is not significantly worse than mine is not going to convince me.
Are you suggesting that you would be ok with having another doubling of modules, half with revisions and half without?
Yes, that's exactly what I'm suggesting.
|
Cool, I can get behind that. As for why my API idea is better, as I said, I just think it's far simpler and easier to use, with effectively no drawbacks. |
@peti and second, even if there are empirical performance problems, they could well be unaffected by all of this. That said, if you want to double the interface for other reasons I agree that's fine. Do you have any PR for that work, something Will and I can contribute to? It would be a shame for this more ambitious design to also keep Will blocked on it landing for far longer. |
@peti are there any plans to resolve this issue? In December you said that you have some API that you like and at the same time I can't find any published code like that. Is there some way to help revisions appear in |
Hi Kirill,
In December you said that you have some API that you like and at the same time I can't find any published code like that.
the code is in 'master'. I have tested it against my own packages that rely on hackage-db and it has worked well, but I haven't made a Hackage release yet because I already know of other changes that I'd like to make to the API to facilitate features like efficient filtering. I haven't been able to work on it as much as I should, unfortunately, because of "real-life" time constraints, but I still care about this subject and would like to develop it further.
Is there some way to help revisions appear in hackage-db and in cabal2nix in the end?
If you could suggest an API for a revision-aware hackage-db that works on top of the new "builder" API, then I'd be happy to review and eventually merge it.
Best regards,
Peter
|
Oh, that's great to hear, I'll try to create something. |
See NixOS/hackage-db#9 (comment) for details on the motivation behind new-callback-api.hs and new-class-api.hs. Dropped support for older compilers. Building with GHC versions prior to 7.10.x is too much effort. Added parseIso8601 function to Distribution.Hackage.DB.Utility.
This is obviously a breaking change, so this would necessitate a version bump.