Prevent user from using illegal characters in output filename #834

bnewc · 2024-06-12T01:19:55Z

This should resolve issue #362 by validating that output filenames contain no illegal characters.

I've used the Python re module for validation, and created IllegalOutputFilenameException to be raised in the case of an invalid filename. I also updated test_document.py to include a unit test for this new exception type.

Filenames are checked against a universal set of illegal characters: <>:"|?*. I believe it's best practice to make the character set system-agnostic, though someone with more knowledge of the security implications could update the feature to derive the character set directly from the host system using sanitize_filename from the pathvalidate library.

My branch is open to edits if there are any changes that need to be made.

almet · 2024-06-12T08:24:09Z

Hi, and thanks for your work on this :-)

After a quick check on the python docs, I found that by default, sanitize_pathname targets all platforms, so we can probably let the library take care of the checks directly, and remove the regexp:

The default target platform is universal. i.e. the sanitized file name is valid for any platform.

Also, the CI is currently failing because of our linting rules. It asks to apply the following diff (replacing double quotes with simple ones):

--- dangerzone/errors.py	2024-06-12 02:21:28.668833+00:00
+++ dangerzone/errors.py	2024-06-12 02:21:57.251937+00:00
@@ -44,11 +44,11 @@
 
 class IllegalOutputFilenameException(DocumentFilenameException):
     """Exception for when the output file contains illegal characters."""
 
     def __init__(self) -> None:
-        super().__init__("Filename must not contain the following characters: <>:\"|?*")        
+        super().__init__('Filename must not contain the following characters: <>:"|?*')

In case it helps, you can reproduce the linter work by running this command locally:

poetry run make lint

almet · 2024-06-12T09:02:33Z

After a quick check on the python docs, I found that by default, sanitize_pathname targets all platforms, so we can probably let the library take care of the checks directly, and remove the regexp:

I just realized that pathvalidate is not in the standard python library, and this changes my thoughts.

I'm wondering if we should take the same approach that's the Django slugify() is taking, which seem to cope well with unicode chars by using unicodedata.normalize("NFKC", text).

bnewc · 2024-06-12T16:36:00Z

After making the linter's suggested changes, it still takes issue with that line, suggesting I change it to be... exactly the same?

--- /root/project/dangerzone/errors.py  2024-06-12 15:36:20.384450+00:00
+++ /root/project/dangerzone/errors.py  2024-06-12 15:36:50.243414+00:00
@@ -44,11 +44,11 @@
 
 class IllegalOutputFilenameException(DocumentFilenameException):
     """Exception for when the output file contains illegal characters."""
 
     def __init__(self) -> None:
-        super().__init__('Filename must not contain the following characters: <>:"|?*')        
+        super().__init__('Filename must not contain the following characters: <>:"|?*')

bnewc · 2024-06-12T16:39:22Z

I just realized that pathvalidate is not in the standard python library, and this changes my thoughts.

That's why I was also uncertain about using it :)

Could you clarify what you're thinking about the slugify() function? I'm not sure how checking if the text is Unicode or ASCII would catch illegal characters a user may enter, since realistically they'd be a subset of Unicode or ASCII.

bnewc · 2024-06-12T16:44:28Z

After syncing my branch with the main dev branch, two new test failures are occurring.

A conversion error in Bookworm for the sample-hwp.hwp test:

ERROR    dangerzone.isolation_provider.base:base.py:170 [doc z1s5Yj] 0% \ 
Some unexpected error occurred while converting the document

A Dangerzone build error in Focal:

Error: error building at STEP "RUN apt-get update     && apt-get install -y \
--no-install-recommends dh-python make build-essential git fakeroot libqt5gui5 \
libxcb-cursor0 pipx python3 python3-dev python3-venv python3-stdeb \
python3-all     && rm -rf /var/lib/apt/lists/*": error while running runtime: exit status 100

I believe this is unrelated to my additions, but I wanted to point it out.

almet · 2024-06-13T13:43:38Z

-        super().__init__('Filename must not contain the following characters: <>:"|?*')        
+        super().__init__('Filename must not contain the following characters: <>:"|?*')

This is actually fixing the extra space on the top line. You can run the black formatter locally with poetry run make black .

Could you clarify what you're thinking about the slugify() function? I'm not sure how checking if the text is Unicode or ASCII would catch illegal characters a user may enter, since realistically they'd be a subset of Unicode or ASCII.

My thinking was about taking a different way than the one you provided, and rather than displaying an error, replace the "unsafe" chars with safer ones.

After some more thinking, I believe the approach you provide here makes more sense, because some of the chars that we would consider "unsafe" can actually meaningful (think accents, for instance). Sorry for bothering you with this 🙄

Let me know when you've had the time to include the linting changes, and we should be good to go!

It's possible that the tests failures you're been seeing are now fixed in the main branch. If it's the case, rebasing on top of the latest main branch would make them go away (thanks for reporting)

bnewc · 2024-06-15T01:31:51Z

Currently, the commit is throwing an exception on the following test in Windows:

 filename = 'D:\\a\\dangerzone\\dangerzone\\something_else.pdf'

This indicates a challenge with the validation structure. validate_output_filename(), which I modified, checks not only the user extension but also the full filepath, which on Windows will include a colon and backslashes. On Linux it will include forward slashes. These are all illegal characters for filenames, but not for filepaths. Interestingly, on the Windows GUI build, if you include a backslash in your safe extension name, the document will successfully convert and saved under only the extension name in the selected directory. For instance, choosing the safe extension \safe for questionable.docx will output filepath save_dir\safe.pdf, rather than save_dir\questionable\safe.pdf.

For now, I can remove the colon from the illegal character set to make the code functional, and effectively warn users about other illegal characters, but this will not fully address issue #362 . We need to either use something like sanitize_filename, or restructure the validation process.

apyrgio · 2024-06-17T12:49:17Z

Hey @bnewc, thanks a lot for you work on this!

You know what, a quick fix for the above would be to get just the filename (e.g., with pathlib.Path.name), and perform your check for just that part. This way, you can keep the : and other characters (such as / and \) in the illegal character set.

I'm afraid though that even this check has its caveats. Linux filesystems allow the use of virtually every character, and users may already have such files. It would not be nice to error on these files, when they are actually perfectly acceptable. So, my suggestion would be to use a different illegal character set, per OS. For Windows, we can target NTFS, and for macOS we can target HFS+. You can read more here.

Finally, note that the main idea of #362 was to not let the user proceed, until the safe extension contains legal characters. So, the best way to tackle this would be to also add check for illegal chars in this method:

dangerzone/dangerzone/gui/main_window.py

Lines 768 to 780 in e81ecbc

    
           def check_safe_extension_is_valid(self) -> bool: 
        
               if self.save_checkbox.checkState() == QtCore.Qt.Unchecked: 
        
                   # ignore validity if not saving file 
        
                   self.safe_extension_invalid.hide() 
        
                   return True 
        
               if self.safe_extension.hasAcceptableInput(): 
        
                   self.safe_extension_invalid.hide() 
        
                   return True 
        
               else: 
        
                   # prevent starting conversion until correct 
        
                   self.safe_extension_invalid.show() 
        
                   return False

In any case, the PR is in a good direction, so I would consider this part as a good-to-have, only if you have time to work on it.

apyrgio · 2024-06-25T09:36:58Z

Hi @bnewc. We're in the process of releasing 0.7.0, so I wanted to ask if you still have time to work on this issue. If not, that's totally fine, we can continue from where you left off and merge it.

Thanks again for all your work 🙂

bnewc · 2024-06-26T02:16:59Z

I'll see what I can do!

bnewc · 2024-06-28T06:34:27Z

The GUI should now inform users of illegal characters in the output safe extension, and prevent conversion until the characters are removed.

However, the commit is still failing a number of checks. The linter check issue appears to be related to the Ubuntu Focal workaround for PySide2. I see how to fix that and will do so when I have a bit more time.

Additionally, it fails checks that expect an IllegalOutputFilenameException where none is raised. This is due to a discrepancy between the characters considered illegal by the test, and those considered illegal by the standards @apyrgio linked for Linux (where only \ is illegal). If someone would advise me on how to proceed, I can either make the tests match the HFS+ standard, or make the validation stricter to comply with the tests.

apyrgio · 2024-07-01T14:38:16Z

Hi @bnewc. Just writing here quick that we're currently in the middle of releasing Dangerzone 0.7.0, and I haven't found the time to look into the latest changes in your PR. Once I manage to find some time though, I'll definitely get back to this.
Thanks for your patience 🙂

apyrgio

I found some time to test this PR, and check out the test failures. Thanks a lot for updating our GUI checks, I think these should detect 99% of the errors.

I have also added some comments on what to check per OS. Tell me what you think.

dangerzone/document.py

apyrgio · 2024-07-08T13:48:28Z

dangerzone/gui/main_window.py

+            QRegExValidator = QtGui.QRegExpValidator
+        self.dot_pdf_validator = QRegExValidator(QRegEx(r".*\.[Pp][Dd][Ff]"))
+        if platform.system() == "Linux":
+            illegal_chars_regex = r"[\\]"


Here you can check for the / character (but not the \ one).

(Might be worth having a string constant for that)

I've switched the regex contents from \\ to / and the warning widget catches / on my Linux system. Why might a string constant be preferable?

dangerzone/gui/main_window.py

tests/test_document.py

bnewc · 2024-07-24T00:51:49Z

That all makes sense. I don't have time at the moment, but I should be able to implement the suggestions you made in the next couple of weeks.

almet · 2024-09-11T14:18:08Z

Hi @bnewc, I hope you're doing well.

We're currently planning the 0.8.0 release, which we plan on releasing for mid October. Don't hesitate to let us know if you need some help on going forward on this issue, it would be great to have it included then!

bnewc · 2024-09-21T17:17:08Z

Hi @almet , I'm implementing the suggestions made by @apyrgio . I'll let you know if I run into any issues.

bnewc · 2024-09-22T00:23:31Z

Per my testing, the GUI now works as intended. However, some fault was introduced when I merged your main branch into mine. Now nearly every conversion test fails.

The exception tracebacks are not giving me much info, but I'm still investigating. If you want to take a look, something happened between these two commits.

apyrgio · 2024-09-24T13:10:30Z

Awesome, thanks a lot @bnewc. Taking a look right now and will let you know. As for the errors, we are aware of those, and they are tracked in #928. In a nutshell, the latest gVisor release somehow does not work well in a nested container. We have a workaround in the issue to work with the previous release, if you want to test out stuff.

bnewc · 2024-09-24T17:53:05Z

Thanks for the update, @apyrgio. I'll test out my fixes with the previous gVisor release.

bnewc · 2024-09-24T19:20:36Z

I can confirm that using the previous gVisor release fixes the issue of failing conversions.

bnewc · 2024-09-25T01:28:47Z

The CI tests pass with my changes and the previous version of gVisor. The issue should be fixed once you resolve #928.

apyrgio · 2024-09-25T09:12:10Z

Gave it one more look. Awesome work @bnewc! I'll make sure to merge this once our CI tests work again. For the time being, I've approved this PR.

apyrgio · 2024-10-07T15:10:59Z

Hm, I wanted to rebase and squash your commits on top of our main branch, but I guess I closed your PR. Weird...

Anyways, I opened a new PR (#942), where the commits are squashed, and your authorship is retained. Once the tests pass, I'll merge this PR. Cheers!

apyrgio reviewed Jul 8, 2024

View reviewed changes

bnewc force-pushed the main branch from d16a806 to a2bb6c3 Compare September 21, 2024 21:54

apyrgio approved these changes Sep 25, 2024

View reviewed changes

apyrgio closed this Oct 7, 2024

apyrgio force-pushed the main branch from 5bff305 to 2751895 Compare October 7, 2024 15:05

apyrgio mentioned this pull request Oct 7, 2024

Prevent user from using illegal characters in output filename #942

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prevent user from using illegal characters in output filename #834

Prevent user from using illegal characters in output filename #834

bnewc commented Jun 12, 2024 •

edited

Loading

almet commented Jun 12, 2024 •

edited

Loading

almet commented Jun 12, 2024

bnewc commented Jun 12, 2024

bnewc commented Jun 12, 2024

bnewc commented Jun 12, 2024 •

edited

Loading

almet commented Jun 13, 2024 •

edited

Loading

bnewc commented Jun 15, 2024

apyrgio commented Jun 17, 2024

apyrgio commented Jun 25, 2024

bnewc commented Jun 26, 2024

bnewc commented Jun 28, 2024 •

edited

Loading

apyrgio commented Jul 1, 2024

apyrgio left a comment

apyrgio Jul 8, 2024

bnewc Sep 21, 2024

bnewc commented Jul 24, 2024

almet commented Sep 11, 2024

bnewc commented Sep 21, 2024

bnewc commented Sep 22, 2024

apyrgio commented Sep 24, 2024

bnewc commented Sep 24, 2024 •

edited

Loading

bnewc commented Sep 24, 2024

bnewc commented Sep 25, 2024 •

edited

Loading

apyrgio commented Sep 25, 2024

apyrgio commented Oct 7, 2024

Prevent user from using illegal characters in output filename #834

Prevent user from using illegal characters in output filename #834

Conversation

bnewc commented Jun 12, 2024 • edited Loading

almet commented Jun 12, 2024 • edited Loading

almet commented Jun 12, 2024

bnewc commented Jun 12, 2024

bnewc commented Jun 12, 2024

bnewc commented Jun 12, 2024 • edited Loading

almet commented Jun 13, 2024 • edited Loading

bnewc commented Jun 15, 2024

apyrgio commented Jun 17, 2024

apyrgio commented Jun 25, 2024

bnewc commented Jun 26, 2024

bnewc commented Jun 28, 2024 • edited Loading

apyrgio commented Jul 1, 2024

apyrgio left a comment

Choose a reason for hiding this comment

apyrgio Jul 8, 2024

Choose a reason for hiding this comment

bnewc Sep 21, 2024

Choose a reason for hiding this comment

bnewc commented Jul 24, 2024

almet commented Sep 11, 2024

bnewc commented Sep 21, 2024

bnewc commented Sep 22, 2024

apyrgio commented Sep 24, 2024

bnewc commented Sep 24, 2024 • edited Loading

bnewc commented Sep 24, 2024

bnewc commented Sep 25, 2024 • edited Loading

apyrgio commented Sep 25, 2024

apyrgio commented Oct 7, 2024

bnewc commented Jun 12, 2024 •

edited

Loading

almet commented Jun 12, 2024 •

edited

Loading

bnewc commented Jun 12, 2024 •

edited

Loading

almet commented Jun 13, 2024 •

edited

Loading

bnewc commented Jun 28, 2024 •

edited

Loading

bnewc commented Sep 24, 2024 •

edited

Loading

bnewc commented Sep 25, 2024 •

edited

Loading