Skip to content

Conversation

@ooprathamm
Copy link
Contributor

Closes #759

@mr-tz
Copy link
Collaborator

mr-tz commented Apr 9, 2024

overall this appears to add complexity vs. reducing it. can you walk me through the benefits of this approach?

@ooprathamm
Copy link
Contributor Author

pkgutil.get_data is supposed to be better at getting a resource from a package.

Profiling load_databases:
image
image
(Similar perf across multiple runs though)

  • As QS is distributed as a bundled .exe pkgutil references directly to these embedded files, rather than relative to floss.qs.db path.
  • Directly reading bytes, rather than 2 step approach to find path and then read_bytes()

Also pyinstaller .spec guide mentions both the approaches without specifing any added benefits [using .spec]

@williballenthin could you provide us with your concerns for opening the issue .

@williballenthin
Copy link
Collaborator

williballenthin commented Apr 10, 2024

I understand pkgutil to expose an interface that Python packagers will implement when building single-file executables, like PyInstaller. The get_data routine lets you access supporting data (non-.py files) found within a module.

This doesn't really matter for packagers that extract a copy of the module to the file system during loading, like PyInstaller. For these packagers, we can use pathlib to access file system paths reletive to Python module contents. For PyInstaller, pkgutil is probably a small shim over file system operations, anyways.

However, there are packagers that don't write module content to disk. They might keep Python byte code in memory, having read it from a compressed resource or even from the network. In this case, we can't rely on using pathlib to access data files. We should use pkgutil instead.

But! We might decide that we don't want to support those packagers (which seem to be uncommon) and consider this out of scope. Or, we could follow the spec and be prepared to switch to something other than PyInstaller.

Does that make sense? @mr-tz @ooprathamm

Copy link
Collaborator

@williballenthin williballenthin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should also try to find one of these packagers and see if it works with the proposed changes. I suspect Nuitka might work for this.


@classmethod
def from_file(cls, path: pathlib.Path) -> "ExpertStringDatabase":
def from_file(cls, package: str, resource: str) -> "ExpertStringDatabase":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's keep from_file around since it's really useful for dev and testing. but let's add a from_pkgutil to address the feature request. and please factor out the common code into a sub routine.

Copy link
Contributor Author

@ooprathamm ooprathamm Apr 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which routine will the get_default_databases use for now ? Or we intend to keep it as a future utility.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please prefer pkgutil

@ooprathamm
Copy link
Contributor Author

For simplicity, the code for using from_file is commented out.
Or we could leave something like BASE_PATH = pathlib.Path(floss.qs.db.__file__).parent constant to infer on own.

Copy link
Collaborator

@williballenthin williballenthin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these changes look great!

I still think we should try to find a package that uses the pkgutil interface. Do you have some time to find one @ooprathamm?

@ooprathamm
Copy link
Contributor Author

Current Scenario -
Pyinstaller - Uses a bootloader that uncompresses data files to a temp folder. [Limitation]

Other Executable Builder Packages-

  • py2exe - Windows Only, data files bundled alongside the .exe file in the directory structure
  • cx_Freeze - Doesn't embed data files to executables. Keeps data files in a directory structure
  • Pyoxidizer [unmaintained now] - pkgutil compatability mentioned for importing modules from memory, for loading data files encourages the use of importlib.resources, doesn't mention pkgutil

Viable Option -

@williballenthin
Copy link
Collaborator

first off, great research @ooprathamm!

next, the pyoxidizer documentation is outstanding. it does nice job of summarizing the options. and since it doesn't touch on pkgutil, I wonder if we're using that wrong? or is it the new interface alluded to at the end?

based on that article, what do you think the best strategy is, @ooprathamm? and, would you consider trying out pyoxidizer on FLOSS to see if it works with our code?

@ooprathamm
Copy link
Contributor Author

ooprathamm commented Apr 14, 2024

On further looking into PyOxidizer.
pkgutil.get_data() enhancement for in_memory rsrc still pending github

The incompatibility is further confirmed here. (comparing different utilities ).

For now, It might not be good to consider PyOxidizer

@williballenthin
Copy link
Collaborator

ok sounds good

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants