
Blob / File iterators not working with python 3 #51

Open
robobenklein opened this issue Mar 27, 2023 · 10 comments
@robobenklein

They seem to be different kinds of errors:

bklein3@da4 ~> python3 test_oscar_iterators.py
Traceback (most recent call last):
  File "test_oscar_iterators.py", line 3, in <module>
    print(f"first file: {next(File.all())}")
  File "oscar.pyx", line 597, in all
  File "oscar.pyx", line 592, in all_keys
  File "oscar.pyx", line 514, in oscar._get_tch
TypeError: expected bytes, str found

Not sure if this one is Python 3 specific or not:

bklein3@da4 ~> python3 test_oscar_iterators.py
Traceback (most recent call last):
  File "test_oscar_iterators.py", line 3, in <module>
    print(f"first blob: {next(Blob.all())}")
  File "oscar.pyx", line 607, in all
KeyError: 'blob_sequential_idx'

The test script (test_oscar_iterators.py):
from oscar import Blob, File, Commit

print(f"first file: {next(File.all())}")
print(f"first blob: {next(Blob.all())}")
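
The TypeError suggests a Python 3 str path is reaching a Cython function that expects bytes. A minimal sketch of the mismatch (open_tch below is a hypothetical stand-in, not oscar's actual API):

```python
# Hypothetical stand-in for a bytes-only Cython call such as oscar._get_tch;
# not oscar's actual code.
def open_tch(path):
    if not isinstance(path, bytes):
        raise TypeError("expected bytes, str found")
    return path

path = "/da5_fast/b2cFullU.1.tch"

try:
    open_tch(path)  # a Python 3 str fails
except TypeError as e:
    print(e)  # expected bytes, str found

print(open_tch(path.encode("utf-8")))  # encoding the path first succeeds
```

If this is indeed the cause, encoding key/path arguments (and decoding results) at the Python/Cython boundary would be the fix.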
@audrism
Contributor

audrism commented Mar 27, 2023

Here is an iterator that takes an argument (0..127) and extracts the content of all blobs. I have not had a chance to integrate it into oscar yet, but it goes over the blob sequence as stored; you just need to choose which tree-sitter language to apply.

import sys

import oscar

sections = 128  # blobs are sharded into 128 sections
sec = sys.argv[1]
fbase = "/da5_data/All.blobs/blob_"

with open(fbase + sec + ".idx", "r") as idx, open(fbase + sec + ".bin", "rb") as bin_f:
    for line in idx:
        fields = line.rstrip().split(";")
        # the blob sha is in field 3, or field 4 when an extra field is present
        sha = fields[4] if len(fields) > 4 else fields[3]
        off = int(fields[1])
        length = int(fields[2])
        bin_f.seek(off, 0)
        val = bin_f.read(length)
        print(oscar.decomp(val))

@robobenklein
Author

The iterator approach appears to work; I started getting valid blobs and content. I rewrote it as:

import sys
import os
from pathlib import Path
import oscar

blob_fbase = "/da5_data/All.blobs/blob_{section}.{ext}"

def iter_blobs(section):
    p_idx = Path(blob_fbase.format(section=section, ext="idx"))
    p_bin = Path(blob_fbase.format(section=section, ext="bin"))

    with p_idx.open("rt") as idx_f, p_bin.open("rb") as bin_f:
        for idx_line in idx_f:
            fields = idx_line.rstrip().split(";")
            _hash = fields[3]
            if len(fields) > 4:
                _hash = fields[4]
            offset = int(fields[1])
            length = int(fields[2])
            bin_f.seek(offset, os.SEEK_SET)
            val = oscar.decomp(bin_f.read(length))
            yield (_hash, val)
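
As a self-contained sanity check of the same seek/read logic, one can build a tiny .idx/.bin pair and read it back. Note that zlib stands in for oscar.decomp here, which is an assumption; WoC blobs use their own compression:

```python
import os
import tempfile
import zlib

blobs = [b"first blob content", b"second blob content"]

with tempfile.TemporaryDirectory() as d:
    idx_path = os.path.join(d, "blob_0.idx")
    bin_path = os.path.join(d, "blob_0.bin")

    # Write an idx/bin pair in the same layout: index;offset;length;sha
    off = 0
    with open(idx_path, "w") as idx_f, open(bin_path, "wb") as bin_f:
        for i, blob in enumerate(blobs):
            comp = zlib.compress(blob)
            idx_f.write(f"{i:08d};{off};{len(comp)};{i:040x}\n")
            bin_f.write(comp)
            off += len(comp)

    # Read back exactly the way the iterator does
    with open(idx_path) as idx_f, open(bin_path, "rb") as bin_f:
        for line in idx_f:
            fields = line.rstrip().split(";")
            offset, length = int(fields[1]), int(fields[2])
            bin_f.seek(offset, os.SEEK_SET)
            print(zlib.decompress(bin_f.read(length)))
```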

Though I can't connect the blobs to files or commits yet:

OSError: Failed to close .tch "b'/da5_fast/b2cFullU.1.tch'": file not found
Exception ignored in: 'oscar.Hash.__del__'
OSError: Failed to close .tch "b'/da5_fast/b2cFullU.1.tch'": file not found
Traceback (most recent call last):
  File "test_oscar_iterators.py", line 8, in <module>
    print(b.commit_shas)
  File "oscar.pyx", line 344, in oscar.cached_property.wrapper
  File "oscar.pyx", line 718, in oscar.Blob.commit_shas
  File "oscar.pyx", line 574, in oscar._Base.read_tch
  File "oscar.pyx", line 523, in oscar._get_tch
  File "oscar.pyx", line 459, in oscar.Hash.__cinit__
OSError: Failed to open .tch file "b'/da5_fast/b2cFullU.1.tch'": file not found

Guessing a language to use based on text content has not been very performant (it requires seeking and reading whole blobs), so a quick check for supported language file extensions would be preferred (e.g. skip a blob if it has no files in py|cs|c|cpp|...). getValues b2f works, but the oscar version should too.

@audrism
Contributor

audrism commented Apr 10, 2023

Each blob may have multiple filenames, and b2f gets that, but it makes sense to have the filenames in the same order in which the blobs are stored (as when you process all the blobs) to avoid lookup delays, e.g.:

zcat /da5_data/All.blobs/blob_61.idxf|head
00000000;0;17787;bde50e34d01322144639695d0608ec14144ed84f;.fr-M0hgqN/drivers/hwmon/w83627hf.c
00000001;17787;14237;bdf75c98a1635cdf1a187210428ee1f1810230ea;drivers/mtd/ubi/build.c
00000002;32024;11296;3d74047fbb50e15c76ad1648bb7391674a95b0d0;lib/ETController.class.php
00000003;43320;1281;bd5dd0c6fba898a417858bfa52c9709027d93e92;examples/ex02c_color_palette.c
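
For reference, each line above splits on ';' into a sequential index, byte offset, compressed length, blob sha1, and then one or more filenames:

```python
# Parse one idxf line from the sample output above
line = ("00000000;0;17787;bde50e34d01322144639695d0608ec14144ed84f;"
        ".fr-M0hgqN/drivers/hwmon/w83627hf.c")

fields = line.rstrip().split(";")
seq, offset, length = int(fields[0]), int(fields[1]), int(fields[2])
sha = fields[3]
filenames = fields[4:]  # a blob can appear under several filenames

print(seq, offset, length)  # 0 0 17787
print(sha)                  # bde50e34d01322144639695d0608ec14144ed84f
print(filenames)            # ['.fr-M0hgqN/drivers/hwmon/w83627hf.c']
```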

@audrism
Contributor

audrism commented Apr 10, 2023

It makes little sense to get commits via b2c; b2fFullU would give the full list of files:

ls /da?_fast/b2fFullU.0.tch
/da4_fast/b2fFullU.0.tch

The python module currently has pathnames hard-wired
and does not find anything that is not on /da5_fast/.

Perhaps you can fix that?
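
One way to un-hard-wire the lookup would be to glob across the /da?_fast mounts. find_tch below is a hypothetical helper (not part of oscar), demonstrated against a temporary directory standing in for /da4_fast:

```python
import glob
import os
import tempfile

def find_tch(name, section, roots="/da?_fast"):
    """Hypothetical helper: locate name.section.tch on any matching mount."""
    matches = sorted(glob.glob(os.path.join(roots, f"{name}.{section}.tch")))
    return matches[0] if matches else None

# Demonstration with a temp dir standing in for the real mounts:
with tempfile.TemporaryDirectory() as d:
    root = os.path.join(d, "da4_fast")
    os.mkdir(root)
    open(os.path.join(root, "b2fFullU.0.tch"), "w").close()
    print(find_tch("b2fFullU", 0, roots=os.path.join(d, "da?_fast")))
```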

@robobenklein
Author

Indeed, reading the idxf is much faster. I updated my iterator; does this seem right?

import gzip
import os
from pathlib import Path, PurePosixPath

import oscar

blob_fbase = "/da5_data/All.blobs/blob_{section}.{ext}"

def iter_blobs(section, blobfilter=lambda b: True, filefilter=lambda fnames: True):
    """
    blobfilter(str) -> bool
    filefilter(List[PurePosixPath]) -> bool

    all provided filters must pass
    """
    p_idx = Path(blob_fbase.format(section=section, ext="idxf"))
    p_bin = Path(blob_fbase.format(section=section, ext="bin"))

    with gzip.open(p_idx, "rt") as idx_f, p_bin.open("rb") as bin_f:
        for idx_line in idx_f:
            fields = idx_line.rstrip().split(";")
            _hash = fields[3]
            filenames = tuple(PurePosixPath(x) for x in fields[4:])
            offset = int(fields[1])
            length = int(fields[2])
            if not blobfilter(_hash):
                continue
            if not filefilter(filenames):
                continue
            bin_f.seek(offset, os.SEEK_SET)
            val = oscar.decomp(bin_f.read(length))
            yield (_hash, val)
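
Hypothetical filters for the iterator above, e.g. skipping blobs whose filenames have no supported source extension (the extension set here is illustrative):

```python
from pathlib import PurePosixPath

SUPPORTED_EXTS = {".py", ".cs", ".c", ".cpp", ".h"}

def has_supported_ext(filenames):
    """filefilter: keep only blobs with at least one supported extension."""
    return any(p.suffix.lower() in SUPPORTED_EXTS for p in filenames)

print(has_supported_ext([PurePosixPath("lib/foo.py")]))  # True
print(has_supported_ext([PurePosixPath("README.md")]))   # False

# On the WoC servers:
# for sha, content in iter_blobs(0, filefilter=has_supported_ext):
#     ...
```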

I'll look into fixing up the paths in oscar, but most of the newer path-handling functionality (e.g. pathlib) was added in Python 3 and is not available in 2.x.

@audrism
Contributor

audrism commented Apr 10, 2023

  1. Please keep in mind that the idxf files do not go to the very end of the data
  2. I thought python2 no longer works
  3. Adding @loconous to write a very simple test suite to check whether all the maps are accessible from python3: basically a doesItWork.py that anyone can run and that produces output indicating what maps/objects are available

@robobenklein
Author

I still see a lot of Python 2 support code in oscar; if I can get rid of that I can help a lot more.

The default python is still 2.7 on the da systems, so I am not sure whether removing it would impact anyone still running really old code. (The da systems' python3 is 3.6, which is now "end of life" as well and no longer gets security updates.)

@audrism
Contributor

audrism commented Apr 10, 2023

I will see what the options are for moving to RHEL 8 or 9,
as RHEL 7 does not support Python 3 well.

Meanwhile, on da5 I installed 3.8:

yum -y install rh-python38
scl enable rh-python38 bash
python3 -V
Python 3.8.14

To install system-wide packages you can use
instructions from here: https://developers.redhat.com/blog/2018/08/13/install-python3-rhel#why_use_red_hat_software_collections

scl enable rh-python38 bash
mkdir ~/pydev
cd ~/pydev

python3 -m venv py38-venv
source py38-venv/bin/activate

(py38-venv) $ python3 -m pip install ...some modules...

@robobenklein
Author

What encoding / character set is the idxf?

  File "/home/bklein3/WorldSyntaxTree/wsyntree_collector/wociterators.py", line 27, in iter_blobs
    for idx_line in idx_f:
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcf in position 7771: invalid continuation byte

@audrism
Contributor

audrism commented Apr 11, 2023

The encoding is everywhere set to C on the command line.

The stuff from the wild could be in any encoding whatsoever: the filename comes from the tree, and the tree has whatever encoding the user had.
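
Since filenames can be arbitrary bytes, one way to read the idxf without crashing is to decode with errors="surrogateescape" (gzip.open accepts an errors= argument in text mode), which round-trips un-decodable bytes losslessly. A sketch with an illustrative line containing a non-UTF-8 byte:

```python
raw = b"00000000;0;10;deadbeef;lib/\xcfController.php\n"  # illustrative line

# A strict decode fails, just like the traceback above:
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("strict decode fails:", e.reason)  # invalid continuation byte

# surrogateescape maps bad bytes to surrogates and back losslessly:
line = raw.decode("utf-8", errors="surrogateescape")
print(line.encode("utf-8", errors="surrogateescape") == raw)  # True

fields = line.rstrip().split(";")
print(fields[3])  # deadbeef
```

In the iterator this would be gzip.open(p_idx, "rt", errors="surrogateescape"), or opening in binary mode and splitting on b";".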
