Creating TextGrid object fails for large TextGrid files #6

Open
hal3003 opened this issue May 6, 2021 · 7 comments

Comments

@hal3003

hal3003 commented May 6, 2021

I'm creating a TextGrid object from a file like so:

  try:
      grid = textgrids.TextGrid(arg)

This works well for smaller files (~300 KB), but fails with the following error message for larger files (~1.7 to 3 MB):

    Traceback (most recent call last):
      File "***", line 9, in <module>
        grid = textgrids.TextGrid(arg)
      File "***/.virtualenvs/test/lib/python3.9/site-packages/textgrids/__init__.py", line 151, in __init__
        self.read(self.filename)
      File "***/.virtualenvs/test/lib/python3.9/site-packages/textgrids/__init__.py", line 402, in read
        self.parse(data)
      File "***/.virtualenvs/test/lib/python3.9/site-packages/textgrids/__init__.py", line 288, in parse
        self._parse_long(buff)
      File "***/.virtualenvs/test/lib/python3.9/site-packages/textgrids/__init__.py", line 359, in _parse_long
        x0, x1 = [float(grab(s)) for s in data[p:p + 2]]
      File "***/.virtualenvs/test/lib/python3.9/site-packages/textgrids/__init__.py", line 359, in <listcomp>
        x0, x1 = [float(grab(s)) for s in data[p:p + 2]]
      File "***/.virtualenvs/test/lib/python3.9/site-packages/textgrids/__init__.py", line 339, in <lambda>
        grab = lambda s: s.split(' = ')[1]
    IndexError: list index out of range

Might that be a bug in the library, or some Python limitation?

@maxime-fily

Hi, I'm also a user.
I have had the same issue with a 500 KB textgrid. I suspect the issue occurs in the parse function. The line that causes it is:

  grab = lambda s: s.split(' = ')[1]

But there are limits to what I can do to work around the issue:
First, it seems to me that defining grab may not be the cause of the issue, but rather its use on the very next line. My lack of knowledge of these functions, and of how to track down errors in them, leaves me helpless in such situations.
Second, there could be a quick fix if we knew which size limits trigger the issue. @Legisign: this may be a distant memory, but it would be of great help to us if that information were available.

@Legisign
Owner

Legisign commented May 20, 2021

That’s curious. That line only defines a function; I could have used the longer form with def, but since it only takes the value of a key = value pair, a lambda seemed good enough.
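To illustrate the difference between defining and calling, here is a minimal sketch with made-up input strings (not taken from any real textgrid). Defining the lambda raises nothing; the IndexError only appears when it is called with a string that lacks ' = '. The innermost traceback frame still points at the lambda's own source line, because that is where the failing expression lives, which may explain why the definition line shows up in the report.

```python
import traceback

# Same one-liner as in the traceback above; defining it cannot fail.
grab = lambda s: s.split(' = ')[1]

def last_frame_name(line):
    """Call grab() on a line and return the name of the innermost failing frame."""
    try:
        grab(line)
    except IndexError as exc:
        frames = traceback.extract_tb(exc.__traceback__)
        return frames[-1].name
    return None  # grab() succeeded

print(grab('xmin = 0.5'))               # a well-formed key = value pair
print(last_frame_name('no separator'))  # the failure is located inside the lambda
```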

However, I could understand if calling the function caused a crash when the string supplied as a parameter is not of the form key = value. And that could mean one of two things:

  1. the file is detected as a long-form textgrid file when it’s actually a short-form one, or
  2. the parser tries to read a key = value pair when one is not coming up in the input.
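One way to check the second possibility without touching the library is to scan the file for lines that a long-form parser would probably not expect. The sketch below is a rough diagnostic, not part of textgrids: the name `suspicious_lines` is made up, and the notion of an "expected" line (blank, containing ` = `, ending in `:`, or the `tiers? <exists>` line) is an assumption about the long format. A label that contains a line break, for instance, would show up here.

```python
def suspicious_lines(text):
    """Return (line_number, line) pairs that a naive long-form parser
    splitting on ' = ' would probably not expect.

    Hypothetical helper; the rules below are assumptions, not the
    library's own logic.
    """
    hits = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are harmless
        if ' = ' in line:
            continue  # looks like key = value
        if line.endswith(':'):
            continue  # e.g. 'item []:' or 'intervals [1]:'
        if line.startswith('tiers?'):
            continue  # the 'tiers? <exists>' header line
        hits.append((number, raw))
    return hits


# Made-up fragment in which one label spans two lines.
sample = '''xmin = 0
xmax = 1
text = "first line
second line of the same label"
'''
for number, line in suspicious_lines(sample):
    print(number, repr(line))
```

To run it on a real file, read the file into a string first. Praat can write text files as UTF-8 or UTF-16, so the encoding passed to `open()` may need adjusting.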

Part of the difficulty is that the Praat developers didn’t think it worthwhile to have distinct headers for the long and short-form textgrid files; both have file type "ooTextFile" and object class "TextGrid".
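Since the header is identical, any detection has to look past it. Here is a minimal sketch of one possible heuristic, assuming that in a long-form file the first non-blank line after the two header lines starts with `xmin`, while a short-form file has a bare number there. The function name `looks_long_form` is hypothetical, and this is not necessarily how textgrids decides.

```python
def looks_long_form(text):
    """Guess whether a text-form TextGrid is long-form.

    Assumption: after the 'File type' and 'Object class' header lines,
    a long-form file continues with 'xmin = ...', a short-form file
    with a bare number.
    """
    body = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(body) < 3:
        raise ValueError('too short to be a TextGrid')
    return body[2].startswith('xmin')


# Made-up file beginnings for the two variants.
long_head = 'File type = "ooTextFile"\nObject class = "TextGrid"\n\nxmin = 0\nxmax = 2.3\n'
short_head = 'File type = "ooTextFile"\nObject class = "TextGrid"\n\n0\n2.3\n'
print(looks_long_form(long_head), looks_long_form(short_head))
```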

@maxime-fily

Dear Tommi,
I have made an MWE with a textgrid attached. Should I e-mail them to you?
Also, see my responses below. Hopefully this is nothing and the package will work for every type :)
(By the way, I use it and am trying to test it; so far I have found only one issue, which we can discuss later.)

> That’s curious. That line only defines a function; I could have used the longer form with def, but since it only takes the value of a key = value pair, a lambda seemed good enough.
>
> However, I could understand if calling the function caused a crash when the string supplied as a parameter is not of the form key = value. And that could mean one of two things:
>
> 1. the file is detected as a long-form textgrid file when it’s actually a short-form one, or

This is very unlikely, because the package works, ceteris paribus, for small textgrids.

> 2. the parser tries to read a `key = value` pair when one is not coming up in the input.

I've tried to see how grab works, and it seems simple enough, but I don't understand why the error is reported at line 372, where grab is defined, and not where it is used. But again, I am not a pro at debugging packages or functions.

> Part of the difficulty is that the Praat developers didn’t think it worthwhile to have distinct headers for the long and short-form textgrid files; both have file type "ooTextFile" and object class "TextGrid".

I myself don't really understand why there should be a need for two types of files, but anyway. I don't know the history.

Cheers anyway :)

@Legisign
Owner

Legisign commented May 20, 2021

> I have made an MWE with a textgrid attached. Should I e-mail them to you?

That might be helpful…

I think there might be a third option too. As the value is converted to float as soon as it is read, it might trigger an error if it tried to read a non-numeric value. However, that seems unlikely too.
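A quick check points the same way: a non-numeric value would make `float()` raise ValueError, whereas the traceback in this issue shows IndexError, which comes from indexing the result of `split()`. A minimal sketch with made-up input strings (the helper name `failure_kind` is hypothetical):

```python
# Same one-liner as in the library's traceback.
grab = lambda s: s.split(' = ')[1]

def failure_kind(line):
    """Return the name of the exception raised by float(grab(line)), or None."""
    try:
        float(grab(line))
    except (IndexError, ValueError) as exc:
        return type(exc).__name__
    return None

# Well-formed pair, non-numeric value, missing ' = ' separator.
for line in ('xmin = 0.25', 'xmin = abc', 'xmin 0.25'):
    print(repr(line), failure_kind(line))
```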

I didn’t actually expect anyone to come up with a huge textgrid file as they tend to be significantly smaller than the sound files. The script basically just reads them in full instead of carefully reading line-by-line or chunk-by-chunk. In my setting (using a physical computer instead of, say, a virtual environment) it could gobble whatever I tried to feed it.

> I myself don't really understand why there should be a need for two types of files

I quite agree. In fact, both text file formats are poorly designed. The short form was probably meant for conserving some space but retaining legibility for humans: it’s actually quite close to the binary format in how it’s structured. The long form, on the other hand, is very readable for humans but a pain in the a*se to parse programmatically. No doubt the creators of Praat cannot change the format to JSON or XML or whatever any longer because everyone already has so many text-form textgrids lying around.

@hal3003
Author

hal3003 commented May 21, 2021

> I didn’t actually expect anyone to come up with a huge textgrid file as they tend to be significantly smaller than the sound files. The script basically just reads them in full instead of carefully reading line-by-line or chunk-by-chunk. In my setting (using a physical computer instead of, say, a virtual environment) it could gobble whatever I tried to feed it.

I'm running my script on bare metal too. My TextGrid files are transcriptions of sociolinguistic interviews of about an hour in length. With four transcription tiers I end up with files between 1 and 3+MB.

For the time being, I'll try to split the files into smaller parts.

> I myself don't really understand why there should be a need for two types of files

I guess only the Praat developers know that.

@Legisign
Owner

I looked at it over the weekend and it’s a tough call.

  • There’s very little sense in iterating over the file (instead of reading it in one chunk), because the textgrid is organized so that the outer loop consists of tiers which are (usually) as long as the whole textgrid. The intervals and points you might reasonably expect to read item-by-item are in the inner loop.
  • And in any case, iterating over the file would mean the basic structure would need to be changed. Now the whole textgrid is one OrderedDict which can be searched and manipulated as any dict can. If it were only a window into the textgrid, one would need different containers.
  • Also, I began to suspect that the real issue isn’t so much the memory requirement of the data structures themselves but the fact that memory is simultaneously needed for (a) the unparsed file, (b) the OrderedDict as it is being built, and (c) the temporary variables and structures that are used in the parsing.
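The simultaneous-memory hypothesis can be tested with the standard-library `tracemalloc` module. The sketch below is only an approximation: `toy_parse` is a stand-in that keeps a line list and an OrderedDict alive together, not the library's real parser, and the input is synthetic rather than a real textgrid. It is also worth keeping in mind that running out of memory in Python normally raises MemoryError, not IndexError.

```python
import tracemalloc
from collections import OrderedDict

def toy_parse(text):
    """Stand-in parser (hypothetical): builds a line list and a dict together."""
    lines = text.splitlines()
    grid = OrderedDict()
    for i, line in enumerate(lines):
        key, _, value = line.partition(' = ')
        grid[f'{key}{i}'] = value
    return grid

def peak_bytes(func, *args):
    """Run func(*args) and return the peak traced memory in bytes."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

# Roughly 2 MB of synthetic 'key = value' lines. The text is created before
# tracing starts, so the reported peak covers only the parsing step.
text = ''.join(f'xmin = {i}.0\n' for i in range(150_000))
print(len(text), peak_bytes(toy_parse, text))
```

If the peak for a few megabytes of input stays far below the available RAM, memory pressure becomes a less likely explanation than a malformed or unexpected line.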

So yeah, it’s not optimized in any way, but then again, I’m not a real programmer myself :) Huge files might need an altogether different kind of implementation.

@maxime-fily

> I looked at it over the weekend and it’s a tough call.
>
> * There’s very little sense in iterating over the file (instead of reading it in one chunk), because the textgrid is organized so that the outer loop consists of tiers which are (usually) as long as the whole textgrid. The intervals and points you might reasonably expect to read item-by-item are in the inner loop.
>
> * And in any case, iterating over the file would mean the basic structure would need to be changed. Now the whole textgrid is one `OrderedDict` which can be searched and manipulated as any `dict` can. If it were only a window into the textgrid, one would need different containers.
>
> * Also, I began to suspect that the real issue isn’t so much the memory requirement of the data structures themselves but the fact that memory is simultaneously needed for (a) the unparsed file, (b) the `OrderedDict` as it is being built, and (c) the temporary variables and structures that are used in the parsing.

I agree. It probably isn't a matter of overall size, but of a temporary variable.

> So yeah, it’s not optimized in any way, but then again, I’m not a real programmer myself :) Huge files might need an altogether different kind of implementation.

Totally cool. With a warning for big TextGrids and some good practices on TextGrid file size, it'll work just fine. I wish I could help, but I'd need to learn a bit more about programming. Maybe a pointer (if that exists in Python, I don't even know) could solve the issue, dunno...
