libdvc for performance and interfacing
#6547
Replies: 3 comments 3 replies
-
I did a pretty quick test with the following script, which corresponds to `dvc add` in functionality. It doesn't do everything
This is what I get for
and the similar command
I used
So for 1000 files,

@shcheklein suggested that if the functionality added to DVC were added to another piece of software, it would slow that down too, so a full-fledged DVC clone in another language may have similar performance. It's true that there are theoretical bounds on algorithms. You can only find the minimum element of a list in O(n) time.

One problem, I believe, is architecture. AFAIU DVC loads all classes for all features for all commands. When I run

It may not show up in profilers because I believe there is no single place where the software waits. It runs many very small tasks, checking this and that, and all of these add up to a significant amount. Our basic performance bug seems to be caused by this, because individual functions, like calculating the hash of a file, are certainly optimized. But when I type

I have two initial observations:
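To make the startup-cost concern above concrete, here is a minimal sketch of a command dispatcher that imports a command's module only when that command is requested, and times the import. This is not DVC's actual code: the `COMMANDS` table is hypothetical, and standard-library modules stand in for real command implementations.

```python
import importlib
import time

# Hypothetical command table: command name -> module that implements it.
# Standard-library modules stand in for real command modules (assumption).
COMMANDS = {
    "add": "hashlib",
    "get": "urllib.request",
    "checkout": "shutil",
}


def load_command(name):
    """Import only the module needed for `name` and return it."""
    return importlib.import_module(COMMANDS[name])


def timed_load(name):
    """Return (module, seconds spent importing it) for one command."""
    start = time.perf_counter()
    module = load_command(name)
    return module, time.perf_counter() - start


if __name__ == "__main__":
    for command in COMMANDS:
        module, elapsed = timed_load(command)
        print(f"{command}: loaded {module.__name__} in {elapsed * 1000:.2f} ms")
```

For a per-module breakdown of real startup cost, CPython's `python -X importtime` prints the time spent in each import.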
-
Thanks for the suggestions, @iesahin 🙏

My 2c: this is not really about the language dvc is written in; we just haven't spent enough time optimizing it properly yet. We've done some optimizations before, but they kept crumbling because of the architectural mess we've accumulated over the years. In the past few months, we've put significant effort into cleaning up the data management architecture as well as some other subsystems, and we are now starting to carefully tweak it to improve performance.

Also, it is important to remember that even though that bash script does a bunch of things like
-
My test result

My script

```python
import os

# Generate 1000 files of 32 KiB random data under data/.
os.makedirs("data", exist_ok=True)
for i in range(1000):
    with open(f"data/{i}", "wb") as f_w:
        f_w.write(os.urandom(2**15))
```
```bash
# Bash version: hash each file, copy it into the cache, write a .dvc file.
directory=data
out=/tmp/cache-${RANDOM}
ls -1 "$directory" | while read -r f ; do
    md5s=$(md5sum "${directory}/${f}" | cut -f 1 -d ' ')
    cachedir="${md5s:0:2}"
    mkdir -p "${out}/${cachedir}"
    # Cache path: first two hex digits as directory, the rest as file name.
    echo "${directory}/${f} -> ${out}/${cachedir}/${md5s:2}"
    cp "${directory}/${f}" "${out}/${cachedir}/${md5s:2}"
    cat > "${directory}/${f}.dvc" << EOF
md5: ${md5s}
EOF
done
```
```python
# Python version: hash each file, copy it into the cache, write a .dvc file.
import hashlib
import os
import random
import shutil

directory = 'data'
out = '/tmp/cache-{}'.format(random.randint(1, 99999))
for file in os.listdir(directory):
    full_file = f'{directory}/{file}'
    hash_md5 = hashlib.md5()
    with open(full_file, 'rb') as f_r:
        hash_md5.update(f_r.read())
    md5s = hash_md5.hexdigest()
    # Cache path: first two hex digits as directory, the rest as file name.
    cachedir = f'{out}/{md5s[:2]}'
    print(f"{full_file} -> {cachedir}/{md5s[2:]}")
    os.makedirs(cachedir, exist_ok=True)
    shutil.copy(full_file, f'{cachedir}/{md5s[2:]}')
    # Same one-line content as the bash heredoc.
    with open(f'{full_file}.dvc', 'w') as f_w:
        f_w.write(f"md5: {md5s}\n")
```

So the speed difference mostly comes from the complex logic in DVC, not from the computations or data transfer. Rewriting all of DVC in Go or C could help; rewriting only part of the computations will not.
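As a rough way to check the claim that the computation itself is cheap, the sketch below times MD5 hashing alone for the same workload (1000 buffers of 32 KiB), kept in memory so that disk I/O is excluded. The function names are illustrative, and the number it prints depends entirely on the machine.

```python
import hashlib
import os
import time


def md5_many(buffers):
    """Return the MD5 hex digest of each buffer."""
    return [hashlib.md5(buf).hexdigest() for buf in buffers]


def time_hashing(n_files=1000, size=2**15):
    """Hash n_files random in-memory buffers; return (digests, seconds)."""
    buffers = [os.urandom(size) for _ in range(n_files)]
    start = time.perf_counter()  # time only the hashing, not data generation
    digests = md5_many(buffers)
    return digests, time.perf_counter() - start


if __name__ == "__main__":
    digests, elapsed = time_hashing()
    print(f"hashed {len(digests)} buffers in {elapsed:.4f} s")
```

Comparing this figure with the wall-clock time of the full scripts above gives a sense of how much of the total is hashing and how much is everything else.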
-
One of the important issues that we aim to improve is DVC's performance in common tasks. Recently I began to wonder whether moving certain parts, e.g., filesystem and networking, to a `libdvc` written in C/Go/Rust would improve performance. I'm not sure about this, as the current packages probably use some optimized compiled code and the investment may not be worth it.

Another aspect may be to have a common library that can be used from other languages like R or Julia, but this requires a rather full-featured `libdvc`, and that's another topic. What I ask here is rewriting stable functionality that won't change in the short term in C (or something that can export "extern C") and using `ctypes` or `cffi` to call that from Python.

I believe this idea was considered before, as "libdvc" is mentioned in a commit (#123) 4 years ago, but it was probably considered not worth it.

If it seems OK to test some functionality, e.g., `add`, `checkout`, `get`, in a forked DVC and see whether it's worth it, I can reserve some time for this.
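For reference, calling a C function through `ctypes` looks roughly like the sketch below. No `libdvc` exists to load, so the C standard library and its `strlen` stand in for it here; a real `libdvc` would be loaded by its own path and expose its own functions. The sketch assumes a POSIX system.

```python
import ctypes
import ctypes.util

# Stand-in for a hypothetical libdvc shared library: the C standard library.
# If find_library returns None, CDLL(None) exposes the symbols already
# loaded into the process (POSIX only).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Call the C function and return its result as a Python int."""
    return libc.strlen(data)


if __name__ == "__main__":
    print(c_strlen(b"libdvc"))
```

Each call crosses the Python/C boundary, so this approach pays off mainly when a single call does a substantial amount of work.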