Feature theme suggestion: data versioning #5918
Replies: 13 comments
-
It would be great to have some clarification about "data artefacts". I'll try to answer the question based on my understanding... All outputs ( Many outputs example:
In the old/released version it works pretty much the same way, but the syntax and the dependency logic are a bit different.
-
I think he meant versioning of individual data files, e.g. the ability to roll back some CSV file to a previous version. But I think that goes in the wrong direction. The script that generated that CSV should be versioned and reproducible, hence versioning the code should be enough. That's one of the main reasons to have reproducible pipelines: so you don't have to version outputs. I can imagine valid arguments that some data files may not belong to the pipeline and are outputs from external sources. In that case I think it's best to fall back to git or git-lfs.
-
@dmpetrov @villasv apologies for having introduced the redundant terminology; by artefacts, I meant outputs indeed. I will try to elucidate.

Sometimes you want to preserve not just the skeleton of the pipeline that created an output, but also the output itself, mainly the final output of the pipeline. That's because it may typically take hours to days for the machine learning process to re-create it. For example, in a certain kind of workflow, you'd later compare that output to newer ones you are creating, and/or possibly revert to it down the road. Alternatively, if you are adding evaluation steps as you learn more about the data, you may have reason to go back and evaluate against older outputs of the same pipeline, without re-building them from scratch. Waiting hours or days to reproduce a previous version of the output would be counter-productive in these cases. Or, you found a bug in your evaluation script, and would like to re-evaluate some older outputs.

So by versioning outputs, I do mean providing options to keep some of them some of the time, granted that traceability to the version of the code/pipeline that created each one is preserved (and not just keeping the latest version of output files, which is what git-lfs could satisfy).
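The "keep some outputs, with traceability to the code that produced them" idea can be sketched as a content-addressed cache plus a small metafile that records which code version produced each output. This is only an illustrative Python sketch of the concept (the function names, `.output_cache` directory, and metafile layout are invented here, not DVC's actual API):

```python
import hashlib
import json
from pathlib import Path

CACHE = Path(".output_cache")

def save_output(output_path: str, code_version: str) -> str:
    """Copy an output into a content-addressed cache and record which
    code version (e.g. a git commit hash) produced it. Returns the hash."""
    data = Path(output_path).read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    CACHE.mkdir(exist_ok=True)
    (CACHE / digest).write_bytes(data)  # content-addressed copy
    meta = {"code_version": code_version, "sha256": digest}
    # This tiny metafile is what you would commit to git,
    # instead of the (possibly huge) output itself.
    Path(output_path + ".meta.json").write_text(json.dumps(meta, indent=2))
    return digest

def restore_output(meta_path: str, dest: str) -> None:
    """Bring back an old output from the cache without re-running the pipeline."""
    meta = json.loads(Path(meta_path).read_text())
    Path(dest).write_bytes((CACHE / meta["sha256"]).read_bytes())
```

Because git versions the metafile alongside the code, checking out an old commit gives you the hash of the matching output, which can then be restored from the cache in seconds instead of retrained over hours.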
-
Hmm. Yeah. Comparing before/after a pipeline change makes sense, even though I usually try my best to avoid that (e.g. by always keeping the "current" and "best" outputs active in the pipeline). But you're right that sometimes I wish I could revert my changes, and I end up forcing myself to grab a coffee while the change is undone and I reproduce an old result. Not sure how easy it would be to correctly "check out" the correct output cache at different stages. I think "keep some of them some of the time" is a sensible request, and it's more like extra caching than the full-fledged versioning of those files that I imagined before.
-
Thank you guys for the clarification! Yeah, this is the most interesting subject about DVC... As you can see, there are two different reproducibility "philosophies": version only the code (data can easily be derived from the code), or version both code and data. Makefile (and its analogs) versions only code. DVC versions code and data. Why version code+data? Two advantages:

1. Previous commits. This situation was well explained by @matanster: sometimes we don't want to wait 15 minutes, or even 5 hours, to reproduce a previous version that we had already built yesterday. With DVC you can do
2. Repo sharing. The scenario is the same, but we avoid rebuilding on different machines: I can train a model on my 12Gb GPU machine, sync the result to cloud storage, and then resync and reuse it from my laptop:

Okay, so DVC uses the code+data "philosophy". Does it also support the code-only philosophy that @villasv described? To some extent, yes. On a local machine it is code+data. But when you share a Git repository, you share only code, and the code is still reproducible. If you share a repository AND access to your synced cloud storage, then you make it code+data sharing. This is a DVC feature I'm very proud of: we were able to separate code reproducibility from code+data reproducibility and stay compatible with Git where

My personal opinion: code-based reproducibility is the right way to share results, whereas code+data-based reproducibility is a kind of optimization, and it is the most convenient way to work on models by yourself or in a team (using DVC sharing).

What is implemented today? Both of these scenarios are implemented in the old, released DVC version and in the new one. Also, in the new version, you will be able to commit outputs to Git directly if needed. So, to answer the original question: yes, this is already baked in. And this is the most important and interesting part of DVC, with many stories behind it :)
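The "don't rebuild what you already built yesterday" advantage boils down to checksumming a stage's dependencies and skipping the build when nothing has changed. Here is a minimal Python sketch of that idea; the `repro` function and the `.repro_state.json` state file are invented for illustration and are not how DVC actually stores its state:

```python
import hashlib
import json
from pathlib import Path

STATE = Path(".repro_state.json")

def _checksum(path: str) -> str:
    """Hash a dependency file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

def repro(deps, outs, build):
    """Re-run `build` only if a dependency changed since the last run,
    or if an expected output is missing. Returns what happened."""
    state = json.loads(STATE.read_text()) if STATE.exists() else {}
    current = {d: _checksum(d) for d in deps}
    outs_exist = all(Path(o).exists() for o in outs)
    if outs_exist and current == state.get("deps"):
        return "cached"  # nothing changed: reuse yesterday's outputs
    build()              # otherwise rebuild (hours of training, etc.)
    STATE.write_text(json.dumps({"deps": current}))
    return "rebuilt"
```

The sharing scenario is the same mechanism with the cache moved to remote storage: the machine that trained the model uploads the checksummed outputs, and another machine downloads them instead of rebuilding.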
-
Ah, I see. I imagined that the extra caching was happening but wasn't sure it was being fully exposed as a feature, and it is. These first few paragraphs deserve to become a blog post eventually, if they aren't one already :)
-
Yes, I'm going to publish all this information before the next release.
-
Hi, I just discovered DVC yesterday, and it seems very close to what we need in our team as well. Thanks for developing it! I'd like to strongly support the point that code+data versioning is extremely important in practice. @dmpetrov Your scenario about training cnn_model.p on the GPU machine and then syncing it to the laptop(s) of multiple team members is exactly the kind of situation I'm interested in. I look forward with great interest to your blog post explaining this. However, let's say you change your CNN a bit, and then generate a new version of cnn_model.p. How will
-
@alexanderkoller thank you for your feedback! We are taking the final steps toward releasing the new DVC version in mid-March.
In general, DVC stores all meta information in your Git repository,
-
@alexanderkoller Regarding the version selection... "Adding all versions of cnn_model.p to a Git repository seems clunky" But it is nice to have the entire history of an ML project. I personally try to keep all the attempts I made. However, I don't keep them as linear changes in a single branch. I make a separate branch for each of my hyperparameters and then merge the best branch/params into master. It's like a feature branch in software development, except you can have 15 features/branches and only one will be used/merged. If you see value in your failed (not merged) experiments, you can even push them to
-
We are discussing dataset scenarios in #1487. Guys, please feel free to join the discussion.
-
@dmpetrov, do you think we can close this? Looks like DVC already supports
-
@MrOutis I think the discussion started after
So, it might be resolved with proper git tags and
Btw, I wonder if
-
I'm not sure if this is already baked in or not. It would be a great feature theme to automatically version data artefacts, especially the final outputs of a workflow. On the one hand this is on par with and in lockstep with what git is about; on the other hand it might be a whole feature theme to consider with great care, rather than a small addition.
Anyway, the motivation is that data processing, and machine learning in particular, is a very iterative process, and we gain a lot by being able to version the code and workflow that created a result along with the result itself. This would seem to elegantly materialize what we call reproducible data science.