Blob / File iterators not working with python 3 #51
Here is an iterator that takes a section argument (0..127):
The iterator approach appears to work; I started getting valid blobs and content. I rewrote it as:

```python
import sys
import os
from pathlib import Path
import oscar

blob_fbase = "/da5_data/All.blobs/blob_{section}.{ext}"

def iter_blobs(section):
    p_idx = Path(blob_fbase.format(section=section, ext="idx"))
    p_bin = Path(blob_fbase.format(section=section, ext="bin"))
    with p_idx.open("rt") as idx_f, p_bin.open("rb") as bin_f:
        for idx_line in idx_f:
            fields = idx_line.rstrip().split(";")
            _hash = fields[3]
            if len(fields) > 4:
                _hash = fields[4]
            offset = int(fields[1])
            length = int(fields[2])
            bin_f.seek(offset, os.SEEK_SET)
            val = oscar.decomp(bin_f.read(length))
            yield (_hash, val)
```

Though I can't connect the blobs to files or commits yet:
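As an aside, the index-line field layout the iterator relies on can be checked in isolation. The sample line below is synthetic (not real data from All.blobs), just to show the parsing:

```python
# Synthetic .idx line in the layout the iterator assumes:
# fields[1] = offset, fields[2] = length, fields[3] = blob hash,
# and fields[4] (when present) replaces fields[3] as the hash.
line = "0;1024;517;deadbeef;cafebabe"

fields = line.rstrip().split(";")
_hash = fields[4] if len(fields) > 4 else fields[3]
offset = int(fields[1])
length = int(fields[2])
print(_hash, offset, length)  # cafebabe 1024 517
```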
Guessing a language based on text content has not been very performant (it requires seeking and reading whole blobs), so a quick check for supported language file extensions would be preferred (e.g. skip if no files in ...).
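A quick extension check along those lines might look like the sketch below. The extension set is an arbitrary illustration, not anything defined by oscar:

```python
from pathlib import PurePosixPath

# Illustrative allow-list -- the real set would depend on which
# languages you actually want to detect (assumption, not from oscar).
SUPPORTED_EXTS = {".py", ".c", ".h", ".java", ".js", ".go", ".rb"}

def has_supported_file(filenames):
    """Return True if any filename carries a supported extension."""
    return any(PurePosixPath(f).suffix.lower() in SUPPORTED_EXTS
               for f in filenames)

print(has_supported_file(["README.md", "src/main.py"]))  # True
print(has_supported_file(["LICENSE", "notes.txt"]))      # False
```

This only inspects the filename strings, so it avoids seeking into the .bin file at all for blobs that can be skipped.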
Each blob may have multiple filenames, and b2f gets that, but it makes sense to have the filenames in the order in which the blobs are stored (as when you process all the blobs), to avoid lookup delays, e.g.:
It makes little sense to get commits via b2c:

Python currently has pathnames hard-wired. Perhaps you can fix that?
Indeed, reading idxf is much faster. I updated my iterator; does this seem right?

```python
import gzip
import os
from pathlib import Path, PurePosixPath
import oscar

blob_fbase = "/da5_data/All.blobs/blob_{section}.{ext}"

def iter_blobs(section, blobfilter=lambda b: True, filefilter=lambda fnames: True):
    """
    blobfilter(str) -> bool
    filefilter(List[PurePosixPath]) -> bool
    all provided filters must pass
    """
    p_idx = Path(blob_fbase.format(section=section, ext="idxf"))
    p_bin = Path(blob_fbase.format(section=section, ext="bin"))
    with gzip.open(p_idx, "rt") as idx_f, p_bin.open("rb") as bin_f:
        for idx_line in idx_f:
            fields = idx_line.rstrip().split(";")
            _hash = fields[3]
            filenames = tuple(PurePosixPath(x) for x in fields[4:])
            offset = int(fields[1])
            length = int(fields[2])
            if not blobfilter(_hash):
                continue
            if not filefilter(filenames):
                continue
            bin_f.seek(offset, os.SEEK_SET)
            val = oscar.decomp(bin_f.read(length))
            yield (_hash, val)
```

I'll look into fixing up the paths in oscar, but most of the newer path functionality in Python was added in Python 3 and is not available in 2.x.
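For illustration, filters matching those signatures could be defined as below. The hash prefix and the extension criterion are arbitrary examples, not anything prescribed by oscar:

```python
from pathlib import PurePosixPath

# Example blobfilter: keep only hashes starting with a given prefix
# (arbitrary criterion, just to show the str -> bool signature).
def blobfilter(b):
    return b.startswith("00")

# Example filefilter: require at least one Python file among the names
# (List[PurePosixPath] -> bool, matching the docstring above).
def filefilter(fnames):
    return any(p.suffix == ".py" for p in fnames)

# These would then be passed as, e.g.:
#   for h, val in iter_blobs(0, blobfilter=blobfilter, filefilter=filefilter):
#       ...
print(blobfilter("00abc123"))                     # True
print(filefilter([PurePosixPath("pkg/mod.py")]))  # True
print(filefilter([PurePosixPath("README.md")]))   # False
```

Because the filters run before the seek into the .bin file, filtered-out blobs cost only the index-line parse.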
I still saw a lot of Python 2 supporting code in oscar; if I can get rid of that, I can help a lot more. The default python is still 2.7 on the da systems, so I am not sure if that would impact anyone still running really old code. (The da machines' Python 3 is 3.6, which is now "end of life" as well and no longer receives security updates.)
I will see what the options are to move to RHEL 8 or 9. Meanwhile, on da5 I installed 3.8:
To install system-wide packages you can use
What encoding / character set is the
Encoding is set to C everywhere on the command line. The content from the wild could be in any encoding whatsoever. The filename comes from the tree, and the tree has whatever encoding the user had.
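Given that, decoding defensively on the consumer side seems safest. A minimal sketch; the fallback order is an assumption, not oscar's behavior:

```python
def decode_any(raw, encodings=("utf-8", "latin-1")):
    """Try each encoding in order. latin-1 maps every byte to a
    character and so never fails, making it a catch-all last resort."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Unreachable while latin-1 is in the list, but be explicit:
    return raw.decode("utf-8", errors="replace")

print(decode_any(b"caf\xc3\xa9"))  # valid UTF-8 -> 'café'
print(decode_any(b"caf\xe9"))      # invalid UTF-8, latin-1 fallback -> 'café'
```

Filenames pulled from trees could be run through the same helper before any extension matching.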
They seem to be different kinds of errors:
Not sure if this one is Python 3 specific or not: