PDF-HUL-45 : What logic is being checked for malformed filters? #971
Comments
Hi @asciim0, I took a quick look at how filters are handled in the PdfStream.java file. The structure is checked, but not the name of the filter itself (the decode parameters are also stored). If there is a complete list of possible filters, I think that would be easy to add. Sam
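On the question of a complete list: ISO 32000-1 (PDF 1.7) defines ten standard stream filter names, so a name check could be little more than a set lookup. The sketch below is only an illustration and not JHOVE code. The class and method names are hypothetical, and it ignores the abbreviated filter names that are permitted only in inline images.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of JHOVE.
class StandardFilterNames {
    // The ten standard stream filter names from ISO 32000-1.
    static final Set<String> STANDARD = Collections.unmodifiableSet(new HashSet<>(Arrays.asList(
            "ASCIIHexDecode", "ASCII85Decode", "LZWDecode", "FlateDecode", "RunLengthDecode",
            "CCITTFaxDecode", "JBIG2Decode", "DCTDecode", "JPXDecode", "Crypt")));

    // Accepts the name with or without its leading slash.
    static boolean isStandard(String name) {
        String bare = name.startsWith("/") ? name.substring(1) : name;
        return STANDARD.contains(bare);
    }
}
```

With a lookup like this, `StandardFilterNames.isStandard("/DCTDecode")` is true, while a fictitious name such as `/DXTDecode` is rejected. Whether an unknown name should make a file invalid or only produce a warning is a separate policy decision.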
Could you please elaborate on what you mean by "the structure is checked"? What does that include? Mandatory keys for all filters? For some?
Just a simple check of the PdfObject type, for example whether it is a PdfArray or a PdfSimpleObject.
I'm sorry, I still don't understand what that simple check means. Do you mean it just checks whether an array is a correct array? Could you give a logic translation of the code for checking filters, perchance?
Hi Micky, Sure. The Java code translates the PDF entities into Java objects, so a PDF array, for example, becomes an array object. The code then enforces which type of PDF entity is allowed in a given place. In this specific case, a Filter can be a PDF object or a PDF array. A Filter can also be an indirect reference. That was not implemented at first, so it gave an error (PDF-HUL-45) until it was added. What the current code does is test whether the filter is a PDF object, a PDF array or an indirect reference. If the PDF has something else there, say a dictionary, there will be an error. It also checks that the array itself is correct. Hope this makes it clear. Sam
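As a logic translation, the type check described above might look roughly like the following. This is a minimal sketch under assumptions, not the JHOVE implementation: `Name` and `Ref` are hypothetical stand-ins for the parsed PDF entities, a `java.util.List` stands in for a PDF array, and references are accepted without being resolved.

```java
import java.util.List;

// Hypothetical stand-ins for parsed PDF entities; these are not JHOVE's classes.
class FilterTypeCheck {
    static final class Name {
        final String value;
        Name(String value) { this.value = value; }
    }

    static final class Ref {
        final int objectNumber;
        Ref(int objectNumber) { this.objectNumber = objectNumber; }
    }

    // True when the /Filter value has an allowed type: a name, an indirect
    // reference, or an array whose elements are names or references.
    // The spelling of the filter name is deliberately NOT inspected.
    static boolean isAllowedFilterValue(Object value) {
        if (value instanceof Name || value instanceof Ref) {
            return true;
        }
        if (value instanceof List) {
            for (Object item : (List<?>) value) {
                if (!(item instanceof Name || item instanceof Ref)) {
                    return false;
                }
            }
            return true; // also covers the empty array
        }
        return false; // e.g. a dictionary or a number
    }
}
```

Under this reading, a name such as `/DXTDecode` passes because it is still a name, whereas a dictionary in place of the filter fails.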
Just to make sure we're talking about the same thing here:
Does that align with what is being checked? Or is it not the value of the /Filter entry that is being checked at all?
Sure! No, the value of the filter is not checked. What is checked is whether it is "an array of zero, one or several names (of filter(s))", and whether it is a PDF object or an indirect reference. But this is just the structure of the PDF. What I mean is that in your example a filter is at "17 0 obj". That is the only thing that is checked. If you want to trigger PDF-HUL-45, I'll send you an example.
I took a look at the file that Sam shared with me. It seems that what triggers the error is the indirect reference leading to an error. The error was thrown at the end of obj 183: obj 183 is referenced by obj 184: As far as I understand the spec, arrays (like all objects) can be represented by indirect objects, and filter values can be names or arrays ... and therefore also indirect objects. The syntax of the array looks fine. I also still don't understand what the malformed-filter check then actually checks :-P
Hi @asciim0, I've taken a look and here's the best answer I can come up with. For reference the code for filter extraction is here: https://github.com/openpreserve/jhove/blob/rel/1.32/jhove-modules/pdf-hul/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/PdfStream.java#L136
That's it. To summarise, the raising of PDF-HUL-45 is simply a function of one of the casts above failing, so it represents some very simple type checking. It can also be triggered by a failure to resolve an indirect reference. The error really represents a failure to parse the filter rather than a rigorous check of the filter against the specification. There are two pieces of code that do the same thing slightly differently, depending on whether you have a filter array or a single filter, which is dangerous. The array code in particular makes several assumptions that will lead to errors. In both cases only the "happy path" has really been considered. I suspect that if only a malformed single filter is present, the code won't even flag the issue. This probably merits a "fix filter checking" issue.
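One possible shape for such a fix is to send the single-filter and array cases through the same code path, so that any failing cast or unresolvable reference is reported in the same way. The sketch below is an assumption-laden illustration and not the code in PdfStream.java: `Name`, `Ref`, `MalformedFilterException`, the `objects` lookup table and the message text are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a single parsing path; not JHOVE's classes or API.
class UnifiedFilterParsing {
    static final class Name {
        final String value;
        Name(String value) { this.value = value; }
    }

    static final class Ref {
        final int objectNumber;
        Ref(int objectNumber) { this.objectNumber = objectNumber; }
    }

    static final class MalformedFilterException extends RuntimeException {
        MalformedFilterException(String message, Throwable cause) { super(message, cause); }
    }

    // Follows one level of indirection; anything that is not a Ref is returned as-is.
    static Object resolve(Object value, Map<Integer, Object> objects) {
        if (!(value instanceof Ref)) {
            return value;
        }
        Object target = objects.get(((Ref) value).objectNumber);
        if (target == null) {
            throw new IllegalStateException("unresolved reference");
        }
        return target;
    }

    // Returns the filter names, treating a single filter as a one-element array.
    static List<String> filterNames(Object filterEntry, Map<Integer, Object> objects) {
        if (filterEntry == null) {
            return Collections.emptyList(); // stream has no /Filter entry
        }
        try {
            Object resolved = resolve(filterEntry, objects);
            List<?> items = (resolved instanceof List)
                    ? (List<?>) resolved
                    : Collections.singletonList(resolved);
            List<String> names = new ArrayList<>();
            for (Object item : items) {
                Name name = (Name) resolve(item, objects); // a failing cast means malformed
                if (name == null) {
                    throw new IllegalStateException("null array element");
                }
                names.add(name.value);
            }
            return names;
        } catch (ClassCastException | IllegalStateException e) {
            // Illustrative message text only.
            throw new MalformedFilterException("PDF-HUL-45: malformed filter", e);
        }
    }
}
```

Because a lone filter is wrapped in a one-element list, a malformed single filter hits the same cast as a malformed array element and can no longer slip through unflagged.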
I'm curious what logic is actually being checked for PDF-HUL-45 error messages. I very much appreciate the fact that filter arrays are now supported and no longer throw an error. However, it seems that most of the incorrect manipulations I make to filter dictionaries pass validation as well.
Please see the attached file to try out various dictionary manipulations, e.g.:
The obj should be (and currently is):
malformednew.pdf
22 0 obj << /BitsPerComponent 8 /ColorSpace 23 0 R /Filter /DCTDecode /Height 1042 /Name /X /Subtype /Image /Type /XObject /Width 736 /Length 114577 >>
Changing the filter from /DCTDecode to /DXTDecode, for example, or to some other fictitious name, still results in a well-formed and valid file.
Could you tell me what exactly JHOVE is checking in a filter dictionary?