
Blob / File iterators not working with python 3 #51

Open
robobenklein opened this issue Mar 27, 2023 · 10 comments
@robobenklein

They seem to be different kinds of errors:

bklein3@da4 ~> python3 test_oscar_iterators.py
Traceback (most recent call last):
  File "test_oscar_iterators.py", line 3, in <module>
    print(f"first file: {next(File.all())}")
  File "oscar.pyx", line 597, in all
  File "oscar.pyx", line 592, in all_keys
  File "oscar.pyx", line 514, in oscar._get_tch
TypeError: expected bytes, str found

Not sure if this one is Python 3 specific or not:

bklein3@da4 ~> python3 test_oscar_iterators.py
Traceback (most recent call last):
  File "test_oscar_iterators.py", line 3, in <module>
    print(f"first blob: {next(Blob.all())}")
  File "oscar.pyx", line 607, in all
KeyError: 'blob_sequential_idx'

The test script (test_oscar_iterators.py):
from oscar import Blob, File, Commit

print(f"first file: {next(File.all())}")
print(f"first blob: {next(Blob.all())}")
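
The TypeError suggests a Python 3 str path is reaching a Cython function that expects bytes. A minimal sketch of the mismatch (open_tch below is a hypothetical stand-in, not oscar's actual API):

```python
# Hypothetical stand-in for a bytes-only Cython call such as oscar._get_tch;
# not oscar's actual code.
def open_tch(path):
    if not isinstance(path, bytes):
        raise TypeError("expected bytes, str found")
    return path

path = "/da5_fast/b2cFullU.1.tch"

try:
    open_tch(path)  # a Python 3 str fails
except TypeError as e:
    print(e)  # expected bytes, str found

print(open_tch(path.encode("utf-8")))  # encoding the path first succeeds
```

If this is indeed the cause, encoding key/path arguments (and decoding results) at the Python/Cython boundary would be the fix.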
@audrism
Contributor

audrism commented Mar 27, 2023

Here is an iterator that takes an argument (0..127) and extracts the content of all blobs. I have not had a chance to integrate it into oscar yet, but it goes over the blob sequence as stored; you just need to choose which tree-sitter language to apply.

import sys

import oscar

sections = 128  # blobs are sharded into 128 sections
sec = sys.argv[1]
fbase = "/da5_data/All.blobs/blob_"

with open(fbase + sec + ".idx", "r") as idx, open(fbase + sec + ".bin", "rb") as bin_f:
    for line in idx:
        fields = line.rstrip().split(";")
        # the blob sha is in field 3, or field 4 when an extra field is present
        sha = fields[4] if len(fields) > 4 else fields[3]
        off = int(fields[1])
        length = int(fields[2])
        bin_f.seek(off, 0)
        val = bin_f.read(length)
        print(oscar.decomp(val))

@robobenklein
Author

The iterator approach appears to work; I started getting valid blobs and content. I rewrote it as:

import sys
import os
from pathlib import Path
import oscar

blob_fbase = "/da5_data/All.blobs/blob_{section}.{ext}"

def iter_blobs(section):
    p_idx = Path(blob_fbase.format(section=section, ext="idx"))
    p_bin = Path(blob_fbase.format(section=section, ext="bin"))

    with p_idx.open("rt") as idx_f, p_bin.open("rb") as bin_f:
        for idx_line in idx_f:
            fields = idx_line.rstrip().split(";")
            _hash = fields[3]
            if len(fields) > 4:
                _hash = fields[4]
            offset = int(fields[1])
            length = int(fields[2])
            bin_f.seek(offset, os.SEEK_SET)
            val = oscar.decomp(bin_f.read(length))
            yield (_hash, val)
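
As a self-contained sanity check of the same seek/read logic, one can build a tiny .idx/.bin pair and read it back. Note that zlib stands in for oscar.decomp here, which is an assumption; WoC blobs use their own compression:

```python
import os
import tempfile
import zlib

blobs = [b"first blob content", b"second blob content"]

with tempfile.TemporaryDirectory() as d:
    idx_path = os.path.join(d, "blob_0.idx")
    bin_path = os.path.join(d, "blob_0.bin")

    # Write an idx/bin pair in the same layout: index;offset;length;sha
    off = 0
    with open(idx_path, "w") as idx_f, open(bin_path, "wb") as bin_f:
        for i, blob in enumerate(blobs):
            comp = zlib.compress(blob)
            idx_f.write(f"{i:08d};{off};{len(comp)};{i:040x}\n")
            bin_f.write(comp)
            off += len(comp)

    # Read back exactly the way the iterator does
    with open(idx_path) as idx_f, open(bin_path, "rb") as bin_f:
        for line in idx_f:
            fields = line.rstrip().split(";")
            offset, length = int(fields[1]), int(fields[2])
            bin_f.seek(offset, os.SEEK_SET)
            print(zlib.decompress(bin_f.read(length)))
```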

Though I can't connect the blobs to files or commits yet:

OSError: Failed to close .tch "b'/da5_fast/b2cFullU.1.tch'": file not found
Exception ignored in: 'oscar.Hash.__del__'
OSError: Failed to close .tch "b'/da5_fast/b2cFullU.1.tch'": file not found
Traceback (most recent call last):
  File "test_oscar_iterators.py", line 8, in <module>
    print(b.commit_shas)
  File "oscar.pyx", line 344, in oscar.cached_property.wrapper
  File "oscar.pyx", line 718, in oscar.Blob.commit_shas
  File "oscar.pyx", line 574, in oscar._Base.read_tch
  File "oscar.pyx", line 523, in oscar._get_tch
  File "oscar.pyx", line 459, in oscar.Hash.__cinit__
OSError: Failed to open .tch file "b'/da5_fast/b2cFullU.1.tch'": file not found

Guessing a language to use based on text content has not been very performant (it requires seeking and reading whole blobs), so a quick check for supported language file extensions would be preferred (e.g. skip a blob if it has no files in py|cs|c|cpp|...). getValues b2f works, but the oscar version should too.

@audrism
Contributor

audrism commented Apr 10, 2023

Each blob may have multiple filenames, and b2f gets that, but it makes sense to have the filenames in the same order in which the blobs are stored (as when you process all the blobs) to avoid lookup delays, e.g.:

zcat /da5_data/All.blobs/blob_61.idxf|head
00000000;0;17787;bde50e34d01322144639695d0608ec14144ed84f;.fr-M0hgqN/drivers/hwmon/w83627hf.c
00000001;17787;14237;bdf75c98a1635cdf1a187210428ee1f1810230ea;drivers/mtd/ubi/build.c
00000002;32024;11296;3d74047fbb50e15c76ad1648bb7391674a95b0d0;lib/ETController.class.php
00000003;43320;1281;bd5dd0c6fba898a417858bfa52c9709027d93e92;examples/ex02c_color_palette.c
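
For reference, each line above splits on ';' into a sequential index, byte offset, compressed length, blob sha1, and then one or more filenames:

```python
# Parse one idxf line from the sample output above
line = ("00000000;0;17787;bde50e34d01322144639695d0608ec14144ed84f;"
        ".fr-M0hgqN/drivers/hwmon/w83627hf.c")

fields = line.rstrip().split(";")
seq, offset, length = int(fields[0]), int(fields[1]), int(fields[2])
sha = fields[3]
filenames = fields[4:]  # a blob can appear under several filenames

print(seq, offset, length)  # 0 0 17787
print(sha)                  # bde50e34d01322144639695d0608ec14144ed84f
print(filenames)            # ['.fr-M0hgqN/drivers/hwmon/w83627hf.c']
```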

@audrism
Contributor

audrism commented Apr 10, 2023

It makes little sense to get commits via b2c; b2fFullU would give the full list of files:

ls /da?_fast/b2fFullU.0.tch
/da4_fast/b2fFullU.0.tch

The python module currently has pathnames hard-wired
and does not find anything that is not on /da5_fast/.

Perhaps you can fix that?
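
One way to un-hard-wire the lookup would be to glob across the /da?_fast mounts. find_tch below is a hypothetical helper (not part of oscar), demonstrated against a temporary directory standing in for /da4_fast:

```python
import glob
import os
import tempfile

def find_tch(name, section, roots="/da?_fast"):
    """Hypothetical helper: locate name.section.tch on any matching mount."""
    matches = sorted(glob.glob(os.path.join(roots, f"{name}.{section}.tch")))
    return matches[0] if matches else None

# Demonstration with a temp dir standing in for the real mounts:
with tempfile.TemporaryDirectory() as d:
    root = os.path.join(d, "da4_fast")
    os.mkdir(root)
    open(os.path.join(root, "b2fFullU.0.tch"), "w").close()
    print(find_tch("b2fFullU", 0, roots=os.path.join(d, "da?_fast")))
```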

@robobenklein
Author

Indeed, reading the idxf is much faster. I updated my iterator; does this seem right?

import gzip
import os
from pathlib import Path, PurePosixPath

import oscar

blob_fbase = "/da5_data/All.blobs/blob_{section}.{ext}"

def iter_blobs(section, blobfilter=lambda b: True, filefilter=lambda fnames: True):
    """
    blobfilter(str) -> bool
    filefilter(List[PurePosixPath]) -> bool

    all provided filters must pass
    """
    p_idx = Path(blob_fbase.format(section=section, ext="idxf"))
    p_bin = Path(blob_fbase.format(section=section, ext="bin"))

    with gzip.open(p_idx, "rt") as idx_f, p_bin.open("rb") as bin_f:
        for idx_line in idx_f:
            fields = idx_line.rstrip().split(";")
            _hash = fields[3]
            filenames = tuple(PurePosixPath(x) for x in fields[4:])
            offset = int(fields[1])
            length = int(fields[2])
            if not blobfilter(_hash):
                continue
            if not filefilter(filenames):
                continue
            bin_f.seek(offset, os.SEEK_SET)
            val = oscar.decomp(bin_f.read(length))
            yield (_hash, val)
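
Hypothetical filters for the iterator above, e.g. skipping blobs whose filenames have no supported source extension (the extension set here is illustrative):

```python
from pathlib import PurePosixPath

SUPPORTED_EXTS = {".py", ".cs", ".c", ".cpp", ".h"}

def has_supported_ext(filenames):
    """filefilter: keep only blobs with at least one supported extension."""
    return any(p.suffix.lower() in SUPPORTED_EXTS for p in filenames)

print(has_supported_ext([PurePosixPath("lib/foo.py")]))  # True
print(has_supported_ext([PurePosixPath("README.md")]))   # False

# On the WoC servers:
# for sha, content in iter_blobs(0, filefilter=has_supported_ext):
#     ...
```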

I'll look into fixing up the paths in oscar, but most of the newer path-handling functionality (e.g. pathlib) was added in Python 3 and is not available in 2.x.

@audrism
Contributor

audrism commented Apr 10, 2023

  1. Please keep in mind that the idxf files do not go to the very end of the data
  2. I thought python2 no longer works
  3. Adding @loconous to write a very simple test suite to check whether all the maps are accessible from python3: basically a doesItWork.py that anyone can run and that produces output indicating what maps/objects are available

@robobenklein
Author

I still see a lot of Python 2 support code in oscar; if I can get rid of that I can help a lot more.

The default python is still 2.7 on the da systems, so I am not sure whether removing it would impact anyone still running really old code. (The da systems' python3 is 3.6, which is now "end of life" as well and no longer gets security updates.)

@audrism
Contributor

audrism commented Apr 10, 2023

I will see what the options are for moving to RHEL 8 or 9,
as RHEL 7 does not support Python 3 well.

Meanwhile, on da5 I installed 3.8:

yum -y install rh-python38
scl enable rh-python38 bash
python3 -V
Python 3.8.14

To install system-wide packages you can use
instructions from here: https://developers.redhat.com/blog/2018/08/13/install-python3-rhel#why_use_red_hat_software_collections

scl enable rh-python38 bash
mkdir ~/pydev
cd ~/pydev

python3 -m venv py38-venv
source py38-venv/bin/activate

(py38-venv) $ python3 -m pip install ...some modules...

@robobenklein
Author

What encoding / character set is the idxf?

  File "/home/bklein3/WorldSyntaxTree/wsyntree_collector/wociterators.py", line 27, in iter_blobs
    for idx_line in idx_f:
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcf in position 7771: invalid continuation byte

@audrism
Contributor

audrism commented Apr 11, 2023

The encoding is everywhere set to C on the command line.

The stuff from the wild could be in any encoding whatsoever: the filename comes from the tree, and the tree has whatever encoding the user had.
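
Since filenames can be arbitrary bytes, one way to read the idxf without crashing is to decode with errors="surrogateescape" (gzip.open accepts an errors= argument in text mode), which round-trips un-decodable bytes losslessly. A sketch with an illustrative line containing a non-UTF-8 byte:

```python
raw = b"00000000;0;10;deadbeef;lib/\xcfController.php\n"  # illustrative line

# A strict decode fails, just like the traceback above:
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("strict decode fails:", e.reason)  # invalid continuation byte

# surrogateescape maps bad bytes to surrogates and back losslessly:
line = raw.decode("utf-8", errors="surrogateescape")
print(line.encode("utf-8", errors="surrogateescape") == raw)  # True

fields = line.rstrip().split(";")
print(fields[3])  # deadbeef
```

In the iterator this would be gzip.open(p_idx, "rt", errors="surrogateescape"), or opening in binary mode and splitting on b";".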
