Is uproot.update not yet supported? #381
Comments
The update method, which is meant to take an existing ROOT file and write objects to it, is not on our current development roadmap. |
Ok I see, thanks. Then I would suggest updating the README, since it reads as if create, recreate, and update are all supported. |
Actually, this shouldn't be closed because the existence of |
+1 for would use |
I definitely would use this as well. |
On the scale of easy features vs hard features, this is a very hard one, unfortunately. We'd have to be able to pick up any ROOT file, regardless of how its internal structure is configured, and work with it in our scheme. A poor man's solution would be for
The stub exists because it's a very natural thing to want. In general, the desirability of features has no correlation with their ease of implementation, such that some extremely trivial implementations have enormous benefits and some insurmountably difficult implementations have only marginal value. It sounds like this feature is both desirable and hard, so it's worth considering, but we'd need an interested developer to spend (what might turn out to be) a few months on it. |
Would it simplify the problem if you only accept files that were made by uproot and raise an exception in other cases? I suspect this would cover most use cases and it leaves the option of future extensions without risking backwards incompatibility. |
That would simplify the problem. It hadn't occurred to me that this is the desirable use-case: opening and re-opening with uproot, as opposed to opening with ROOT, re-opening with uproot. It could be a check for structure, rather than explicitly for who created it, so that a file made by ROOT might fit. But then, that could be a confusing error for the user: one ROOT file can be updated in uproot while another can't, and there doesn't seem to be any significant difference between the ROOT files. (We're talking about invisible-to-the-user differences in where blocks of data are allocated, which can turn on minor details like how many times it's been opened, how many or what types of objects have been written, whether anything has ever been replaced in-place or deleted at any point in the file's history, etc.) |
I think the simplest solution of |
I think that sounds reasonable: copying whatever objects don't have uproot equivalents, and allowing updates to those that do. I don't know if what I was trying to do was necessarily smart, but my use case was needing histograms in a certain directory structure in the file to make things compatible with existing code. I couldn't figure out how to create directories in uproot, so I figured I could open a file that works and just update the histograms, but I got stuck because "recreate" didn't work and I couldn't update the TH1 names. |
The reason is that you can't create directories with uproot (#138). It's one thing to read a ROOT file, jumping to where the pointers take you, skipping the parts that aren't relevant for what you're trying to do, and it's another thing entirely to write a structure, byte for byte, that ROOT will accept. You have to understand all the structures, at least well enough to make one valid state. To update a ROOT file, you have to pick up any (or a large set of) valid states and continue where they left off. The reason we punted on adding directories is that it multiplied the number of things we had to think about. ("What if the user adds directories first, then a histogram?" "What if they write a histogram, then add directories, then another histogram?" "What if some baskets of a TTree are interleaved with both of them?" It was a combinatoric problem.)
It's not (just) an issue of types of objects we don't recognize. What I'm thinking about is the fact that a ROOT file is a filesystem; different regions of bytes correspond to different objects. If you add and remove objects, this space gets fragmented: "sliding everything to the left" when you delete an object would be prohibitively expensive, and anyway it would require a lot of pointers to be updated, so like a good filesystem, ROOT doesn't do this. Instead, it has a serialized linked list of the objects that do exist with a table of free space (
The simplification that we're talking about is to deal not with all of the possible states that this filesystem can get into, but only with the subset that we've already understood. I think in practice this would mean that 99% of ROOT files produced by ROOT would not be updatable in uproot. That would definitely include the use-case you're talking about: using ROOT to create the directories and uproot to add additional objects. We definitely want to be on the same page about what counts as "done" before launching into a project like this.
(Personal rant at the bottom of this comment: the "filesystem in a file" technology existed prior to ROOT. Zip files have been ubiquitous since 1989 and are the basis for many types of files that have to store a lot of objects, such as Java JARs and Python wheels. Zip doesn't have the "update" feature, to delete objects and recover their space, but all the dbm implementations do. In fact, dbm is a protocol established in 1979 so that you can swap out different implementations. sdbm is public domain, still used by Perl and Ruby, and BerkeleyDB, written in 1991, is a high-quality variant with users like sendmail, RPM, Bitcoin, and Oracle NoSQL. Maybe there was a technical reason one of these standards couldn't be adopted, but I really wish one had. It's not that our objects, like |
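As a rough illustration of the free-space bookkeeping described in the comment above, here is a toy model in Python. It is not ROOT's actual on-disk format; `ToyFile` and its methods are hypothetical names invented for this sketch. It only shows why deleting objects leaves holes, and how a writer can either reuse a hole (first fit) or ignore the holes and append at the end of the file.

```python
class ToyFile:
    """Toy 'filesystem in a file': named objects occupy byte ranges.

    Hypothetical model for illustration only, not ROOT's real layout.
    """

    def __init__(self):
        self.end = 0       # current end of the file, in bytes
        self.objects = {}  # name -> (start, size)
        self.free = []     # (start, size) holes left behind by deletions

    def write(self, name, size, reuse_free=True):
        if reuse_free:
            # First fit: take the first hole that is large enough.
            for i, (start, hole) in enumerate(self.free):
                if hole >= size:
                    if hole == size:
                        del self.free[i]
                    else:
                        # Shrink the hole; the remainder stays free.
                        self.free[i] = (start + size, hole - size)
                    self.objects[name] = (start, size)
                    return start
        # No suitable hole (or reuse disabled): append at the end.
        start = self.end
        self.end += size
        self.objects[name] = (start, size)
        return start

    def delete(self, name):
        # Nothing is moved; the byte range is just recorded as free.
        self.free.append(self.objects.pop(name))


f = ToyFile()
f.write("hist1", 100)
f.write("tree", 400)
f.write("hist2", 100)
f.delete("tree")                                      # leaves a 400-byte hole
pos_reuse = f.write("hist3", 150)                     # lands inside the hole
pos_append = f.write("hist4", 300, reuse_free=False)  # goes to the end
```

A writer that only ever appends (`reuse_free=False`) is simpler, but the file keeps growing and the holes are never recovered; a writer that reuses holes has to read and maintain the free list correctly for every state the file might be in.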
For any readers, it might be important to know that uproot cannot do this yet and assigns new objects at the end of the file (#135). |
I am confused. Does this mean that uproot can actually update a file with new objects as long as they are appended at the end of the file and this could "simply" be enabled for the update method? If that is true, couldn't the problem of managing data blocks in the best possible way be solved at a later stage, so that people could already use this feature? |
Even if we don't plan to delete objects from a ROOT file or take advantage of empty spaces and keep it defragmented, appending does require changing pointers in a variety of places. (Here, "pointers" means byte positions in the file or in a nested range within the file.) To do that without corrupting people's files, we'd have to understand the set of possible states ROOT files can be in better than we do now. The list of free spaces is some sort of linked list (in file byte positions), but there's also a footer after it that would have to be copied if we're going to expand that list. We could try adding an append-only update feature, testing it on all the files we have, and then adding a warning that it's experimental and shouldn't be used on valuable files because what we don't know can corrupt files, then collect feedback from users who do encounter exceptional cases. However, that's a greater level of engagement than I'm able to support right now. Append-only is a good idea to simplify the problem, but it's still going to require a lot of work. (And then after that, surely someone will ask why they can't delete objects or why their output files are so large. However, doing it in two stages like this does help to break down the problem.) |
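To make the point about pointers concrete, here is a minimal sketch of an append-only update on a made-up binary layout. Everything in it is an assumption for illustration: the format (a 4-byte header holding the byte position of a trailing directory, followed by object payloads and the directory itself) and the function names are invented and have nothing to do with ROOT's real structures. Even in this tiny format, appending one object means relocating the trailing directory and rewriting the header pointer.

```python
import io
import struct

HEADER = struct.Struct(">I")  # byte position of the directory
COUNT = struct.Struct(">I")   # number of directory entries
ENTRY = struct.Struct(">II")  # (start, length) of one object


def _write_directory(buf, entries):
    """Write the directory at the current position and point the header at it."""
    dir_pos = buf.tell()
    buf.write(COUNT.pack(len(entries)))
    for entry in entries:
        buf.write(ENTRY.pack(*entry))
    buf.truncate()                   # drop any stale bytes after the directory
    buf.seek(0)
    buf.write(HEADER.pack(dir_pos))  # the pointer that must be updated


def create(buf, payloads):
    buf.seek(HEADER.size)            # leave room for the header
    entries = []
    for data in payloads:
        entries.append((buf.tell(), len(data)))
        buf.write(data)
    _write_directory(buf, entries)


def read_directory(buf):
    buf.seek(0)
    (dir_pos,) = HEADER.unpack(buf.read(HEADER.size))
    buf.seek(dir_pos)
    (count,) = COUNT.unpack(buf.read(COUNT.size))
    return dir_pos, [ENTRY.unpack(buf.read(ENTRY.size)) for _ in range(count)]


def append(buf, data):
    dir_pos, entries = read_directory(buf)
    buf.seek(dir_pos)                # new object overwrites the old directory
    buf.write(data)
    entries.append((dir_pos, len(data)))
    _write_directory(buf, entries)   # directory is rewritten further along


def read_object(buf, index):
    _, entries = read_directory(buf)
    start, length = entries[index]
    buf.seek(start)
    return buf.read(length)


buf = io.BytesIO()
create(buf, [b"first", b"second"])
append(buf, b"third")
objects = [read_object(buf, i) for i in range(3)]
```

Note that `append` overwrites the old directory before the new one exists, so an interruption at that moment would leave the header pointing at garbage. That is the kind of corruption risk the comment warns about, and a real format has many more such pointers than this sketch does.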
Ok, very interesting. Thanks a lot for the clarification! |
When trying to uproot.update a ROOT file, uproot raises a TypeError:
From what I understand, it looks like the update constructor passes too few arguments to the _openfile function it calls.
Is this feature meant to be working, or is it still unimplemented, judging from the NotImplementedError that is still present in the constructor?
PS: Thank you for this awesome Python package, this is really bringing back some fun to the daily ROOT file massage :-)
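The kind of failure described in this report can be reproduced with a small stand-alone sketch. The classes below are hypothetical stand-ins, not uproot's actual code: they only assume, as the report suggests, that the update constructor calls a shared `_openfile` helper with fewer arguments than the helper requires, so the TypeError fires before a `NotImplementedError` placed later in the constructor is ever reached.

```python
class Recreate:
    """Hypothetical writer whose constructor calls the helper correctly."""

    def __init__(self, path, compression=None):
        self._openfile(path, compression)

    def _openfile(self, path, compression):
        self.path = path
        self.compression = compression


class Update(Recreate):
    """Hypothetical stub: calls the helper with one argument too few."""

    def __init__(self, path):
        self._openfile(path)         # TypeError is raised here...
        raise NotImplementedError    # ...so this line is never reached


try:
    Update("example.root")
except TypeError as err:
    message = str(err)
except NotImplementedError:
    message = "not implemented"
```

If that is what is happening, the TypeError is a side effect of the stub being out of date with the helper's signature rather than a sign that the feature is partly working.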