
per-source collection numbering can collide after restoration of a larger collection from a smaller backup #7132

Open
cfm opened this issue Mar 6, 2024 · 3 comments


@cfm (Member) commented Mar 6, 2024

Description

For the collection C(S, t) for a source S at time t, collection-level numbering can collide (and effectively lock) if a smaller collection C(S, 1) is restored from backup over a larger collection C(S, 2). This is a by-product of #5225.
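
To illustrate the failure mode, here is a minimal Python sketch, assuming (as the traceback below suggests) that reply filenames are derived from a per-source counter stored in the database. All paths and names are illustrative, not SecureDrop's actual storage layout:

```python
from pathlib import Path

store = Path("/tmp/store/source-abc123")  # illustrative source directory
store.mkdir(parents=True, exist_ok=True)

def reply_path(counter: int) -> Path:
    # Assumed naming scheme: "<counter>-<designation>-reply.gpg"
    return store / f"{counter}-illustrative-designation-reply.gpg"

# Steps (2) and (4): two replies are written; the DB counter ends at 2.
reply_path(1).touch()
reply_path(2).touch()

# Step (5): restoring the backup rewinds the DB counter to 1 but leaves
# 2-...-reply.gpg on disk, because the unarchive only adds/overwrites files.
db_counter_after_restore = 1

# Step (6): the next reply increments the restored counter and targets a
# path that still exists; mode "xb" mirrors redwood's refusal to overwrite,
# raising errno 17 (EEXIST) just like the traceback below.
try:
    with reply_path(db_counter_after_restore + 1).open("xb") as f:
        f.write(b"...ciphertext...")
except FileExistsError as e:
    print(e)  # [Errno 17] File exists: '/tmp/store/source-abc123/2-...'
```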

Steps to Reproduce

During #7121:

  1. Create source S and submit something.
  2. Reply to source S, resulting in collection C(S, 1).
  3. Back up the server, including collection C(S, 1).
  4. Reply again to source S, resulting in collection C(S, 2) > C(S, 1).
  5. Restore the backup from step (3).
  6. Reply again to source S.

Expected Behavior

The reply is saved successfully.

Actual Behavior

HTTP 500 error with log messages:

[Wed Mar 06 17:38:32.029048 2024] [wsgi:error] [pid 753:tid 129806460811008] [remote 127.0.0.1:38650]   File "/var/www/securedrop/journalist_app/main.py", line 144, in reply
[Wed Mar 06 17:38:32.029050 2024] [wsgi:error] [pid 753:tid 129806460811008] [remote 127.0.0.1:38650]     EncryptionManager.get_default().encrypt_journalist_reply(
[Wed Mar 06 17:38:32.029052 2024] [wsgi:error] [pid 753:tid 129806460811008] [remote 127.0.0.1:38650]   File "/var/www/securedrop/encryption.py", line 166, in encrypt_journalist_reply
[Wed Mar 06 17:38:32.029065 2024] [wsgi:error] [pid 753:tid 129806460811008] [remote 127.0.0.1:38650]     redwood.encrypt_message(
[Wed Mar 06 17:38:32.029067 2024] [wsgi:error] [pid 753:tid 129806460811008] [remote 127.0.0.1:38650] redwood.RedwoodError: Io(Os { code: 17, kind: AlreadyExists, message: "File exists" })

Comments

The reply from step (4) is not removed from disk when the backup from step (3), which does not include it, is restored in step (5).

ansible.builtin.unarchive does not appear to have an argument like rsync --delete.
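
Since the unarchive step cannot delete, a fix would need an explicit wipe before extraction. A minimal Python sketch of those "delete, then extract" semantics, with illustrative paths and no claim about how SecureDrop's actual restore role is structured:

```python
import shutil
import tarfile
from pathlib import Path

def strict_restore(backup_tarball: Path, store_dir: Path) -> None:
    # Delete first, so files created after the backup (e.g. the reply from
    # step (4)) cannot survive the restore...
    shutil.rmtree(store_dir, ignore_errors=True)
    store_dir.mkdir(parents=True)
    # ...then extract, leaving the disk state exactly equal to the backup.
    with tarfile.open(backup_tarball) as tar:
        tar.extractall(path=store_dir)
```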

@zenmonkeykstop (Contributor) commented Mar 6, 2024

Yikes. Probably not a regression, but it is a nasty bug in the backup/restore procedure when restoring over updated data.

I think one mitigating factor in practice is that the inaccessible reply from C(S, 2) will be deleted by the nightly orphaned-submission job. But that also implies that any new submissions or replies since the backup would be nuked, so you'd always just be restoring to the backup point anyway.
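
For concreteness, a sketch of what such a cleanup might do, assuming it can compare on-disk paths against filenames recorded in the database (an assumed shape, not the actual SecureDrop job):

```python
from pathlib import Path

def delete_orphans(store_dir: Path, db_filenames: set[str]) -> None:
    """Remove files on disk that no database record points to."""
    for path in store_dir.rglob("*.gpg"):
        if str(path.relative_to(store_dir)) not in db_filenames:
            path.unlink()  # e.g. the stranded 2-...-reply.gpg from step (4)
```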

We could make that explicit by nuking the in-place data first or asking folks to do so. At the very least this would require a docs update to make the backup behaviour explicit.

(One reason we may not have encountered this in the wild is that folks may only be restoring backups after complete data loss, and to new instances, in which case it wouldn't be an issue.)

@cfm changed the title from "per-source collection numbering can collide after restoration of a larger collection from backup" to "per-source collection numbering can collide after restoration of a larger collection from a smaller backup" on Mar 6, 2024
@cfm (Member, Author) commented Mar 6, 2024

I support formalizing the restore operation as a filesystem-level as well as database-level equivalent of a strict DROP TABLE; CREATE TABLE; INSERT INTO, rather than the filesystem upsert it effectively is now. Whether that strictness is immediate (enforced by the restore operation) or eventual (enforced by the nightly clean-up job) is, I agree, mostly an edge concern, much as I'd like to reduce it to zero.

@zenmonkeykstop (Contributor) commented Mar 6, 2024
Running the orphaned-submission job as part of the restore operation seems doable. We'd need to be more intentional about quiescing the app first, though, otherwise there would be more edge cases to get snagged on.
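
A rough sketch of that quiescing step, with the service name as an assumption rather than what a SecureDrop server necessarily runs:

```python
import subprocess
from collections.abc import Callable

def with_app_quiesced(cleanup: Callable[[], None]) -> None:
    # Stop the web app so nothing can mint a new reply while the
    # restore/cleanup runs; restart it no matter what happens.
    subprocess.run(["systemctl", "stop", "apache2"], check=True)
    try:
        cleanup()  # e.g. delete_orphans(...) as sketched above
    finally:
        subprocess.run(["systemctl", "start", "apache2"], check=True)
```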

@cfm mentioned this issue Mar 6, 2024