Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent user from using illegal characters in output filename #834

Closed
wants to merge 0 commits into from

Conversation

bnewc
Copy link
Contributor

@bnewc bnewc commented Jun 12, 2024

This should resolve issue #362 by validating that output filenames contain no illegal characters.

I've used the Python re module for validation, and created IllegalOutputFilenameException to be raised in the case of an invalid filename. I also updated test_document.py to include a unit test for this new exception type.

Filenames are checked against a universal set of illegal characters: <>:"|?*. I believe it's best practice to make the character set system-agnostic, though someone with more knowledge of the security implications could update the feature to derive the character set directly from the host system using sanitize_filename from the pathvalidate library.

My branch is open to edits if there are any changes that need to be made.

@almet
Copy link
Contributor

almet commented Jun 12, 2024

Hi, and thanks for your work on this :-)

After a quick check on the python docs, I found that by default, sanitize_pathname targets all platforms, so we can probably let the library take care of the checks directly, and remove the regexp:

The default target platform is universal. i.e. the sanitized file name is valid for any platform.


Also, the CI is currently failing because of our linting rules. It asks to apply the following diff (replacing double quotes with simple ones):

--- dangerzone/errors.py	2024-06-12 02:21:28.668833+00:00
+++ dangerzone/errors.py	2024-06-12 02:21:57.251937+00:00
@@ -44,11 +44,11 @@
 
 class IllegalOutputFilenameException(DocumentFilenameException):
     """Exception for when the output file contains illegal characters."""
 
     def __init__(self) -> None:
-        super().__init__("Filename must not contain the following characters: <>:\"|?*")        
+        super().__init__('Filename must not contain the following characters: <>:"|?*')

In case it helps, you can reproduce the linter work by running this command locally:

poetry run make lint

@almet
Copy link
Contributor

almet commented Jun 12, 2024

After a quick check on the python docs, I found that by default, sanitize_pathname targets all platforms, so we can probably let the library take care of the checks directly, and remove the regexp:

I just realized that pathvalidate is not in the standard python library, and this changes my thoughts.

I'm wondering if we should take the same approach that's the Django slugify() is taking, which seem to cope well with unicode chars by using unicodedata.normalize("NFKC", text).

@bnewc
Copy link
Contributor Author

bnewc commented Jun 12, 2024

After making the linter's suggested changes, it still takes issue with that line, suggesting I change it to be... exactly the same?

--- /root/project/dangerzone/errors.py  2024-06-12 15:36:20.384450+00:00
+++ /root/project/dangerzone/errors.py  2024-06-12 15:36:50.243414+00:00
@@ -44,11 +44,11 @@
 
 class IllegalOutputFilenameException(DocumentFilenameException):
     """Exception for when the output file contains illegal characters."""
 
     def __init__(self) -> None:
-        super().__init__('Filename must not contain the following characters: <>:"|?*')        
+        super().__init__('Filename must not contain the following characters: <>:"|?*')

@bnewc
Copy link
Contributor Author

bnewc commented Jun 12, 2024

I just realized that pathvalidate is not in the standard python library, and this changes my thoughts.

That's why I was also uncertain about using it :)

Could you clarify what you're thinking about the slugify() function? I'm not sure how checking if the text is Unicode or ASCII would catch illegal characters a user may enter, since realistically they'd be a subset of Unicode or ASCII.

@bnewc
Copy link
Contributor Author

bnewc commented Jun 12, 2024

After syncing my branch with the main dev branch, two new test failures are occurring.

A conversion error in Bookworm for the sample-hwp.hwp test:

ERROR    dangerzone.isolation_provider.base:base.py:170 [doc z1s5Yj] 0% \ 
Some unexpected error occurred while converting the document

A Dangerzone build error in Focal:

Error: error building at STEP "RUN apt-get update     && apt-get install -y \
--no-install-recommends dh-python make build-essential git fakeroot libqt5gui5 \
libxcb-cursor0 pipx python3 python3-dev python3-venv python3-stdeb \
python3-all     && rm -rf /var/lib/apt/lists/*": error while running runtime: exit status 100

I believe this is unrelated to my additions, but I wanted to point it out.

@almet
Copy link
Contributor

almet commented Jun 13, 2024

-        super().__init__('Filename must not contain the following characters: <>:"|?*')        
+        super().__init__('Filename must not contain the following characters: <>:"|?*')

This is actually fixing the extra space on the top line. You can run the black formatter locally with poetry run make black .


Could you clarify what you're thinking about the slugify() function? I'm not sure how checking if the text is Unicode or ASCII would catch illegal characters a user may enter, since realistically they'd be a subset of Unicode or ASCII.

My thinking was about taking a different way than the one you provided, and rather than displaying an error, replace the "unsafe" chars with safer ones.

After some more thinking, I believe the approach you provide here makes more sense, because some of the chars that we would consider "unsafe" can actually meaningful (think accents, for instance). Sorry for bothering you with this 🙄

Let me know when you've had the time to include the linting changes, and we should be good to go!


It's possible that the tests failures you're been seeing are now fixed in the main branch. If it's the case, rebasing on top of the latest main branch would make them go away (thanks for reporting)

@bnewc
Copy link
Contributor Author

bnewc commented Jun 15, 2024

Currently, the commit is throwing an exception on the following test in Windows:

 filename = 'D:\\a\\dangerzone\\dangerzone\\something_else.pdf'

This indicates a challenge with the validation structure. validate_output_filename(), which I modified, checks not only the user extension but also the full filepath, which on Windows will include a colon and backslashes. On Linux it will include forward slashes. These are all illegal characters for filenames, but not for filepaths. Interestingly, on the Windows GUI build, if you include a backslash in your safe extension name, the document will successfully convert and saved under only the extension name in the selected directory. For instance, choosing the safe extension \safe for questionable.docx will output filepath save_dir\safe.pdf, rather than save_dir\questionable\safe.pdf.

For now, I can remove the colon from the illegal character set to make the code functional, and effectively warn users about other illegal characters, but this will not fully address issue #362 . We need to either use something like sanitize_filename, or restructure the validation process.

@apyrgio
Copy link
Contributor

apyrgio commented Jun 17, 2024

Hey @bnewc, thanks a lot for you work on this!

You know what, a quick fix for the above would be to get just the filename (e.g., with pathlib.Path.name), and perform your check for just that part. This way, you can keep the : and other characters (such as / and \) in the illegal character set.

I'm afraid though that even this check has its caveats. Linux filesystems allow the use of virtually every character, and users may already have such files. It would not be nice to error on these files, when they are actually perfectly acceptable. So, my suggestion would be to use a different illegal character set, per OS. For Windows, we can target NTFS, and for macOS we can target HFS+. You can read more here.

Finally, note that the main idea of #362 was to not let the user proceed, until the safe extension contains legal characters. So, the best way to tackle this would be to also add check for illegal chars in this method:

def check_safe_extension_is_valid(self) -> bool:
if self.save_checkbox.checkState() == QtCore.Qt.Unchecked:
# ignore validity if not saving file
self.safe_extension_invalid.hide()
return True
if self.safe_extension.hasAcceptableInput():
self.safe_extension_invalid.hide()
return True
else:
# prevent starting conversion until correct
self.safe_extension_invalid.show()
return False

In any case, the PR is in a good direction, so I would consider this part as a good-to-have, only if you have time to work on it.

@apyrgio
Copy link
Contributor

apyrgio commented Jun 25, 2024

Hi @bnewc. We're in the process of releasing 0.7.0, so I wanted to ask if you still have time to work on this issue. If not, that's totally fine, we can continue from where you left off and merge it.

Thanks again for all your work 🙂

@bnewc
Copy link
Contributor Author

bnewc commented Jun 26, 2024

I'll see what I can do!

@bnewc
Copy link
Contributor Author

bnewc commented Jun 28, 2024

The GUI should now inform users of illegal characters in the output safe extension, and prevent conversion until the characters are removed.

However, the commit is still failing a number of checks. The linter check issue appears to be related to the Ubuntu Focal workaround for PySide2. I see how to fix that and will do so when I have a bit more time.

Additionally, it fails checks that expect an IllegalOutputFilenameException where none is raised. This is due to a discrepancy between the characters considered illegal by the test, and those considered illegal by the standards @apyrgio linked for Linux (where only \ is illegal). If someone would advise me on how to proceed, I can either make the tests match the HFS+ standard, or make the validation stricter to comply with the tests.

@apyrgio
Copy link
Contributor

apyrgio commented Jul 1, 2024

Hi @bnewc. Just writing here quick that we're currently in the middle of releasing Dangerzone 0.7.0, and I haven't found the time to look into the latest changes in your PR. Once I manage to find some time though, I'll definitely get back to this.
Thanks for your patience 🙂

Copy link
Contributor

@apyrgio apyrgio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found some time to test this PR, and check out the test failures. Thanks a lot for updating our GUI checks, I think these should detect 99% of the errors.

I have also added some comments on what to check per OS. Tell me what you think.

dangerzone/document.py Outdated Show resolved Hide resolved
dangerzone/document.py Outdated Show resolved Hide resolved
QRegExValidator = QtGui.QRegExpValidator
self.dot_pdf_validator = QRegExValidator(QRegEx(r".*\.[Pp][Dd][Ff]"))
if platform.system() == "Linux":
illegal_chars_regex = r"[\\]"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you can check for the / character (but not the \ one).

(Might be worth having a string constant for that)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've switched the regex contents from \\ to / and the warning widget catches / on my Linux system. Why might a string constant be preferable?

dangerzone/gui/main_window.py Outdated Show resolved Hide resolved
tests/test_document.py Outdated Show resolved Hide resolved
@bnewc
Copy link
Contributor Author

bnewc commented Jul 24, 2024

That all makes sense. I don't have time at the moment, but I should be able to implement the suggestions you made in the next couple of weeks.

@almet
Copy link
Contributor

almet commented Sep 11, 2024

Hi @bnewc, I hope you're doing well.

We're currently planning the 0.8.0 release, which we plan on releasing for mid October. Don't hesitate to let us know if you need some help on going forward on this issue, it would be great to have it included then!

@bnewc
Copy link
Contributor Author

bnewc commented Sep 21, 2024

Hi @almet , I'm implementing the suggestions made by @apyrgio . I'll let you know if I run into any issues.

@bnewc
Copy link
Contributor Author

bnewc commented Sep 22, 2024

Per my testing, the GUI now works as intended. However, some fault was introduced when I merged your main branch into mine. Now nearly every conversion test fails.

The exception tracebacks are not giving me much info, but I'm still investigating. If you want to take a look, something happened between these two commits.

@apyrgio
Copy link
Contributor

apyrgio commented Sep 24, 2024

Awesome, thanks a lot @bnewc. Taking a look right now and will let you know. As for the errors, we are aware of those, and they are tracked in #928. In a nutshell, the latest gVisor release somehow does not work well in a nested container. We have a workaround in the issue to work with the previous release, if you want to test out stuff.

@bnewc
Copy link
Contributor Author

bnewc commented Sep 24, 2024

Thanks for the update, @apyrgio. I'll test out my fixes with the previous gVisor release.

@bnewc
Copy link
Contributor Author

bnewc commented Sep 24, 2024

I can confirm that using the previous gVisor release fixes the issue of failing conversions.

@bnewc
Copy link
Contributor Author

bnewc commented Sep 25, 2024

The CI tests pass with my changes and the previous version of gVisor. The issue should be fixed once you resolve #928.

@apyrgio
Copy link
Contributor

apyrgio commented Sep 25, 2024

Gave it one more look. Awesome work @bnewc! I'll make sure to merge this once our CI tests work again. For the time being, I've approved this PR.

@apyrgio
Copy link
Contributor

apyrgio commented Oct 7, 2024

Hm, I wanted to rebase and squash your commits on top of our main branch, but I guess I closed your PR. Weird...

Anyways, I opened a new PR (#942), where the commits are squashed, and your authorship is retained. Once the tests pass, I'll merge this PR. Cheers!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants