PDF-HUL-45 : What logic is being checked for malformed filters? #971

Open
asciim0 opened this issue Nov 12, 2024 · 9 comments

@asciim0
Contributor

asciim0 commented Nov 12, 2024

I'm curious what logic is actually being checked for PDF-HUL-45 error messages. I very much appreciate that filter arrays are now supported and no longer throw an error; however, most of the invalid manipulations I make to filter dictionaries pass validation as well.

Please see the attached file to try out various dictionary manipulations. For example, the object should be (and currently is):

22 0 obj << /BitsPerComponent 8 /ColorSpace 23 0 R /Filter /DCTDecode /Height 1042 /Name /X /Subtype /Image /Type /XObject /Width 736 /Length 114577 >>

Changing the filter from /DCTDecode to /DXTDecode, for example, or to some other fictitious name, still results in a well-formed and valid file.

Could you tell me what exactly JHOVE is checking in a filter dictionary?

malformednew.pdf

@samalloing
Collaborator

Hi @asciim0 ,

I took a quick look at how filters are handled in the PdfStream.java file. The structure is checked, but not the name of the filter itself (the decode parameters are also stored). If there is a complete list of possible filters, I think that check would be easy to add.

Sam
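
For illustration, a name check along those lines could look like the sketch below. It is not taken from JHOVE: the class and method names are hypothetical, and it assumes the filter name has already been extracted as a plain string without the leading slash. The ten names are the standard filters listed in ISO 32000.

```java
import java.util.Set;

// Hypothetical helper, not part of JHOVE.
final class StandardFilterNames {
    // Standard stream filter names defined by ISO 32000.
    private static final Set<String> STANDARD = Set.of(
            "ASCIIHexDecode", "ASCII85Decode", "LZWDecode", "FlateDecode",
            "RunLengthDecode", "CCITTFaxDecode", "JBIG2Decode", "DCTDecode",
            "JPXDecode", "Crypt");

    /** Returns true only for one of the standard filter names. */
    static boolean isStandard(String name) {
        // Set.of(...) rejects null lookups, so guard explicitly.
        return name != null && STANDARD.contains(name);
    }

    public static void main(String[] args) {
        System.out.println(isStandard("DCTDecode")); // true
        System.out.println(isStandard("DXTDecode")); // false
    }
}
```

With a check like this, the /DXTDecode manipulation from the first comment would be reported instead of passing silently.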

@asciim0
Contributor Author

asciim0 commented Nov 15, 2024

Could you please elaborate on what you mean by "the structure is checked"? What does that include? Mandatory keys for all filters? For some?

@samalloing
Collaborator

Just a simple check of the PdfObject type, for example whether it is a PdfArray or a PdfSimpleObject.

@asciim0
Contributor Author

asciim0 commented Nov 15, 2024

I'm sorry, I still don't understand what that simple check means. Do you mean it just checks whether an array is a correct array? Could you describe the logic of the filter-checking code in plain language, by any chance?

@samalloing
Collaborator

Hi Micky,

Sure. The Java code translates the PDF entities into Java objects, so, for example, a PDF array becomes a PdfArray. The code then encodes which types of PDF entity are allowed in a specific case. A Filter can be a PDF object or a PDF array. A Filter can also be an indirect reference; that was not implemented at first, so it gave an error (PDF-HUL-45) until it was added. What the current code does is test whether the filter is a PDF object, a PDF array or an indirect reference. If the PDF contains something else there, say a dictionary, there will be an error. It also checks that the array itself is correct.

Hope this makes it clear

Sam

@asciim0
Contributor Author

asciim0 commented Nov 19, 2024

Just to make sure we're talking about the same thing here:
What is being checked is the value of the key Filter, right? As per the spec (ISO 32000-2:2020, sect 7.4, Table 5), that can be:

  • a name
  • an array of zero, one or several names (of filter(s))

Does that align with what is being checked? Or isn't it the value of the /Filter at all that is being checked?
What I'm trying to do is trigger the PDF-HUL-45 rule by manipulating a file containing a filter ... but I can change the value to pretty much whatever I want to (nothing, integer, indirect reference) and the file is still validated as well-formed and valid.
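
To make the expectation concrete, here is a minimal sketch of the check I would expect from the spec wording above. It is purely illustrative and not JHOVE code: `Name` is a hypothetical stand-in for a PDF name object, a `java.util.List` stands in for a PDF array, and any indirect reference is assumed to have been resolved already.

```java
import java.util.List;

// Hypothetical sketch, not part of JHOVE.
final class FilterValueCheck {
    /** Stand-in for a PDF name object such as /DCTDecode. */
    static final class Name {
        final String value;
        Name(String value) { this.value = value; }
    }

    /** True if the value is a name or an array (possibly empty) of names. */
    static boolean isValidFilterValue(Object value) {
        if (value instanceof Name) {
            return true;
        }
        if (value instanceof List) {
            for (Object element : (List<?>) value) {
                if (!(element instanceof Name)) {
                    return false;
                }
            }
            return true;
        }
        return false; // integers, dictionaries, a missing value, ...
    }

    public static void main(String[] args) {
        System.out.println(isValidFilterValue(new Name("DCTDecode")));  // true
        System.out.println(isValidFilterValue(Integer.valueOf(8)));     // false
    }
}
```

Note that this only checks the type of the value; it says nothing about whether the name is a known filter.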

@samalloing
Collaborator

Sure! No, the value of the filter is not checked. What is checked is whether it is "an array of zero, one or several names (of filter(s))", and whether it is a PDF object or an indirect reference. But this is just the structure of the PDF. What I mean is that in your example a filter is at "17 0 obj"; that is the only thing that is checked. If you want to trigger PDF-HUL-45, I'll send you an example.

@asciim0
Contributor Author

asciim0 commented Nov 22, 2024

I took a look at the file that Sam shared with me. It seems that the error is triggered by the indirect reference. The error was thrown at the end of obj 183:
183 0 obj [/ASCII85Decode /LZWDecode]

obj 183 is referenced by obj 184:
184 0 obj <</Filter 183 0 R /Length 185 0 R>> stream

As far as I understand the spec, arrays (like all objects) can be represented by indirect objects and filter values can be names or arrays ... and therefore also indirect objects. The syntax of the array looks fine.
I therefore believe that there is still a possible case of a false positive for this error, as shown here.
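
A rough sketch of the behaviour I would expect, under my own assumptions: `Ref` is a hypothetical stand-in for an indirect reference, names are modelled as plain strings, and the object table is a simple map. None of this is JHOVE's actual API. The reference is resolved first, and only the resolved object is type-checked.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of JHOVE.
final class IndirectFilterDemo {
    /** Stand-in for an indirect reference such as "183 0 R". */
    static final class Ref {
        final int objectNumber;
        Ref(int objectNumber) { this.objectNumber = objectNumber; }
    }

    /** Follows references to a direct object; null if a target is missing. */
    static Object resolve(Object value, Map<Integer, Object> objects) {
        int hops = 0;
        while (value instanceof Ref) {
            if (++hops > objects.size()) {
                return null; // guard against reference cycles
            }
            value = objects.get(((Ref) value).objectNumber);
        }
        return value;
    }

    /** True for a name (String) or an array (List) containing only names. */
    static boolean isNameOrArrayOfNames(Object value) {
        if (value instanceof String) {
            return true;
        }
        if (!(value instanceof List)) {
            return false;
        }
        for (Object element : (List<?>) value) {
            if (!(element instanceof String)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<Integer, Object> objects = new HashMap<>();
        objects.put(183, List.of("ASCII85Decode", "LZWDecode")); // obj 183
        Object filter = new Ref(183);                             // /Filter 183 0 R
        System.out.println(isNameOrArrayOfNames(resolve(filter, objects))); // true
    }
}
```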

I also still don't understand what the malformed filter then checks :-P

carlwilson self-assigned this Dec 5, 2024
carlwilson added this to the JHOVE 1.34 milestone Dec 5, 2024
@carlwilson
Member

carlwilson commented Jan 21, 2025

Hi @asciim0, I've taken a look and here's the best answer I can come up with. For reference the code for filter extraction is here: https://github.com/openpreserve/jhove/blob/rel/1.32/jhove-modules/pdf-hul/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/PdfStream.java#L136

  1. Get the PdfObject with the dictionary key Filter. If that's null, then check for stream filters (dictionary key FFilter). If no filters found that's it.
  2. If we have found either Filters or FFilters, then get the parameter object, DecodeParms or FDecodeParms respectively.
  3. IF the Filter/FFilter object is a single string filter:
    a. IF it's a PdfSimpleObject then create a filter from the string value.
    b. Cast the params to PdfDictionary if they are an instance of PdfDictionary.
  4. IF the Filter/FFilter object is a PdfArray instance:
    a. Obtain a Vector<PdfObject> from the filter array.
    b. Assume that params are an array and cast them as such. This unchecked cast can lead to a PDF-HUL-45 error if the params aren't an array; safe to say that there should be an explicit check first.
    c. Loop through the filters and:
      i. Cast the filter value from PdfObject to PdfSimpleObject and add the string value to the filter array. Again, this is an unchecked cast; if it fails, a PDF-HUL-45 error is raised.
      ii. Resolve the param indirect reference as a PdfObject; if it's a PdfSimpleObject and the string value is null, then move on to the next filter.
      iii. If the param object is NOT a PdfSimpleObject with a null string value, then cast the param object as a PdfDictionary and add it to the param array. Again, this is an unchecked cast; if it fails, a PDF-HUL-45 error is raised.
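
To illustrate the mechanism in step 4.c.i, here is a simplified sketch rather than the actual JHOVE code. Names are modelled as `String`, the array as a `List`, and it assumes the failed cast surfaces as a `ClassCastException` that is mapped to the error id. The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of JHOVE.
final class UncheckedCastDemo {
    static final String ERROR_ID = "PDF-HUL-45";

    /**
     * Copies filter names out of the array using unchecked casts.
     * Returns null on success, or the error id if any cast fails.
     */
    static String extractNames(List<?> filterArray, List<String> names) {
        try {
            for (Object filter : filterArray) {
                names.add((String) filter); // fails for non-name entries
            }
            return null;
        } catch (ClassCastException e) {
            return ERROR_ID;
        }
    }

    public static void main(String[] args) {
        // A fictitious name has the right type, so nothing is reported.
        System.out.println(extractNames(List.of("DXTDecode"), new ArrayList<>()));      // null
        // An integer in the array has the wrong type, so the cast fails.
        System.out.println(extractNames(List.of("LZWDecode", 42), new ArrayList<>())); // PDF-HUL-45
    }
}
```

This shows why a bogus filter name passes while a wrongly typed entry does not: only the type is exercised, never the value.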

That's it. To summarise, the raising of PDF-HUL-45 is simply a function of one of the failing casts above, so it represents some very simple type checking. It can also be triggered by a failure to resolve an indirect reference. The error really represents a failure to parse the filter rather than a rigorous check of the filter against the specification. There are two pieces of code that do the same thing slightly differently, depending on whether you have a filter array or a single filter, which is dangerous. The array code in particular makes several assumptions that will lead to errors. In both cases only the "happy path" has really been considered. I suspect that if only a malformed single filter is present the code won't even flag the issue. This probably merits a "fix filter checking" issue.
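
As a starting point for such an issue, the explicit checks could be sketched roughly as follows. This is an assumption-laden illustration, not a proposed patch: names are modelled as `String`, arrays as `List`, dictionaries as `Map`, and the class name and messages are invented. A single method handles both the single-filter and the array case, so the two paths cannot drift apart.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch, not part of JHOVE.
final class ExplicitParamCheck {
    /** Returns a problem description, or empty if filter and params agree in shape. */
    static Optional<String> check(Object filter, Object params) {
        if (filter instanceof String) {
            // A single filter may have no params or a single dictionary.
            if (params != null && !(params instanceof Map)) {
                return Optional.of("single filter requires dictionary parameters");
            }
            return Optional.empty();
        }
        if (filter instanceof List) {
            // A filter array may have no params or an array of params.
            if (params != null && !(params instanceof List)) {
                return Optional.of("filter array requires array parameters");
            }
            return Optional.empty();
        }
        return Optional.of("filter is neither a name nor an array");
    }

    public static void main(String[] args) {
        System.out.println(check("FlateDecode", null).isPresent());               // false
        System.out.println(check(List.of("FlateDecode"), Map.of()).isPresent());  // true
    }
}
```

Checking with `instanceof` before casting turns an accidental cast failure into a deliberate, describable validation result.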
