@jungnitz jungnitz commented Aug 8, 2025

Hi, thanks for the library!

When using the streams for text extraction, GlobalRefs are dropped after the thread has already been detached from the VM. This leads to a warning from the jni crate:

Dropping a GlobalRef in a detached thread. Fix your code if this message appears frequently (see the GlobalRef docs).

I fixed this by including the AttachGuard in the JReaderInputStream struct, so that the thread is detached only after all GlobalRefs in it are dropped.
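To illustrate the ordering this relies on, here is a minimal sketch that does not use the jni crate at all. `MockGlobalRef`, `MockAttachGuard`, and `ReaderStream` are hypothetical stand-ins for `GlobalRef`, `AttachGuard`, and `JReaderInputStream`, and the field names are made up. It only demonstrates the Rust rule that struct fields are dropped in declaration order, so a guard declared last outlives the references declared before it.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Shared log used to record the order in which values are dropped.
type Log = Rc<RefCell<Vec<&'static str>>>;

/// Hypothetical stand-in for jni's `GlobalRef`.
struct MockGlobalRef {
    name: &'static str,
    log: Log,
}

impl Drop for MockGlobalRef {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

/// Hypothetical stand-in for jni's `AttachGuard`; dropping it "detaches" the thread.
struct MockAttachGuard {
    log: Log,
}

impl Drop for MockAttachGuard {
    fn drop(&mut self) {
        self.log.borrow_mut().push("detach thread");
    }
}

/// Hypothetical stand-in for the stream struct. Fields are dropped in
/// declaration order, so the guard is declared last.
#[allow(dead_code)]
struct ReaderStream {
    reader: MockGlobalRef,
    buffer: MockGlobalRef,
    _guard: MockAttachGuard,
}

/// Builds a stream, drops it, and returns the recorded drop order.
fn drop_order() -> Vec<&'static str> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    {
        let _stream = ReaderStream {
            reader: MockGlobalRef { name: "drop reader", log: Rc::clone(&log) },
            buffer: MockGlobalRef { name: "drop buffer", log: Rc::clone(&log) },
            _guard: MockAttachGuard { log: Rc::clone(&log) },
        };
        // `_stream` goes out of scope here and its fields are dropped in order.
    }
    let order = log.borrow().clone();
    order
}
```

Calling `drop_order()` records both mock references before the mock detach. Whether the real struct needs the guard as its last field depends on how it is laid out in this PR, which the sketch does not attempt to reproduce.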

(Also, the tagged v0.3.0 commit is not part of any branch in this repository, and main is outdated.)

@jungnitz
Author

Ah, I see. This seems to be incompatible with the Python bindings, because pyobjects have to be Send, and Send is not implemented for AttachGuard.
I have put the changes behind a feature flag so that they no longer break the Python builds. Depending on whether this change would be included in a minor or patch release, one could also make it a default feature and disable it for the Python bindings, which I personally would prefer.
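As a rough sketch of what such gating could look like (not the actual code in this PR): the feature name `hold-attach-guard` and the types `NonSendGuard` and `GatedReaderStream` below are hypothetical. The guard stand-in is made non-Send with a raw-pointer marker, and the field only exists when the feature is enabled, so builds without the feature keep a Send stream.

```rust
use std::marker::PhantomData;

/// Hypothetical stand-in for jni's `AttachGuard`. The raw-pointer marker
/// makes this type `!Send`, which is the property that breaks the bindings.
#[allow(dead_code)]
struct NonSendGuard {
    _not_send: PhantomData<*mut ()>,
}

/// Hypothetical stream struct. The guard field is compiled in only when the
/// (assumed) `hold-attach-guard` feature is enabled.
#[allow(dead_code)]
struct GatedReaderStream {
    bytes_read: usize,
    #[cfg(feature = "hold-attach-guard")]
    _guard: NonSendGuard,
}

impl GatedReaderStream {
    fn new() -> Self {
        GatedReaderStream {
            bytes_read: 0,
            #[cfg(feature = "hold-attach-guard")]
            _guard: NonSendGuard { _not_send: PhantomData },
        }
    }
}

/// Reports whether this build keeps the guard inside the stream.
fn holds_guard() -> bool {
    cfg!(feature = "hold-attach-guard")
}

// Compile-time check: without the feature, the stream must remain `Send`.
#[cfg(not(feature = "hold-attach-guard"))]
const _: fn() = || {
    fn require_send<T: Send>() {}
    require_send::<GatedReaderStream>();
};
```

Making the feature a default and having the Python bindings depend on the crate with `default-features = false` would match the preference stated above. The trade-off is that those builds would presumably still see the warning.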

mjpowersjr added a commit to ProSync/extractous that referenced this pull request Nov 21, 2025
This commit addresses multiple critical stability issues and enhances
file format support through dependency upgrades and metadata improvements.

Key Changes:
- Upgrade Apache Tika from 2.9.2 to 2.9.3 (last stable 2.x release)
- Fix JNI memory management in JReaderInputStream Drop implementation
- Add 33 missing OOXML class definitions to GraalVM reflection metadata
  - XLSX support: CTWorkbook, CTWorksheet, CTCell, CTRow classes
  - PPTX SmartArt: CTDiagram, CTDataModel, CTLayoutNode classes
  - Visio support: PageContentsDocument, VisioDocument classes
  - Additional XMLBeans and Word processing classes
- Fix documentation inconsistencies in config defaults (4 fixes)
- Add workspace sections to Cargo.toml files for independent builds

Issues Addressed:
- yobix-ai#64: JNI memory management and GlobalRef cleanup order
- yobix-ai#60: RuntimeException with XLSX files
- yobix-ai#58: JniError with PPTX files containing SmartArt
- yobix-ai#40: Visio graphics in DOC files causing failures
- yobix-ai#44: Inconsistent config default documentation

All changes focus on stability and file format compatibility without
adding new features. Tests pass successfully with core extraction
functionality verified.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>