-
Notifications
You must be signed in to change notification settings - Fork 27
feat(storage): Git-backed storage adapter (prototype) #71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat(storage): Git-backed storage adapter (prototype) #71
Conversation
Summary of ChangesHello @Prateekiiitg56, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a foundational prototype for a Git-backed storage system designed to manage immutable patient vaults. The primary goal is to leverage Git's inherent versioning capabilities to establish a robust, auditable history for clinical data. The implementation includes a flexible storage interface, a concrete Git-based adapter with optimistic locking for concurrency control, and a functional Flask application to demonstrate the system's core features like file uploads, history tracking, and version diffing. This work lays the groundwork for a versioned data management approach, emphasizing accountability and data integrity. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a prototype for a Git-backed storage adapter, which is a creative and powerful approach for maintaining an audit trail of patient data. The implementation is well-structured with a base adapter interface, a Git implementation, and a demo Flask application. The code is generally clear and includes tests for the core functionality.
My review focuses on security and design improvements. I've identified critical security vulnerabilities, including path traversal and command injection, that should be addressed even in this prototype stage. I've also suggested some design enhancements to the storage adapter interface and implementation to improve robustness and auditability. These changes will help build a more secure and maintainable foundation for this feature.
| filename = f.filename | ||
| relative_path = os.path.join(patient_id, filename) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a path traversal vulnerability here. Both patient_id and f.filename come from user input and are used to construct a file path without proper sanitization. A malicious user could provide values like ../ to read or write files outside the intended storage directory. This is a critical security risk.
You should sanitize both inputs. For example:
from werkzeug.utils import secure_filename
# In upload()
patient_id = request.form.get('patient_id')
if not patient_id or os.path.sep in patient_id or '..' in patient_id:
return jsonify({'error': 'valid patient_id is required'}), 400
f = request.files.get('file')
if not f:
return jsonify({'error': 'file is required'}), 400
filename = secure_filename(f.filename)
if not filename:
return jsonify({'error': 'invalid filename'}), 400
relative_path = os.path.join(patient_id, filename)
# ...This sanitization should be applied in all endpoints that accept these parameters (/history, /file, /diff, etc.).
| if a and not b: | ||
| range_spec = f"{a}..HEAD" | ||
| elif a and b: | ||
| range_spec = f"{a}..{b}" | ||
| else: | ||
| # only b provided -> HEAD..b | ||
| range_spec = f"HEAD..{b}" | ||
|
|
||
| try: | ||
| diff_text = repo.git.diff(range_spec, '--', rel) | ||
| except Exception as e: | ||
| return jsonify({'error': str(e)}), 400 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The range_spec variable is constructed from user-provided a and b parameters and passed directly to repo.git.diff(). Since repo.git is a wrapper for executing git commands on the command line, this is vulnerable to command injection. A malicious user could craft the a or b parameters to execute arbitrary shell commands on the server.
You must validate that a and b are valid Git commit hashes before using them. For example:
import re
HEX_SHA_RE = re.compile(r'^[0-9a-f]{4,40}$')
a = request.args.get('a')
b = request.args.get('b')
if a and not HEX_SHA_RE.match(a):
return jsonify({'error': 'invalid format for parameter a'}), 400
if b and not HEX_SHA_RE.match(b):
return jsonify({'error': 'invalid format for parameter b'}), 400
# ... then construct range_spec and call repo.git.diff| parts = relative_path.split(os.sep) | ||
| if len(parts) < 2: | ||
| raise ValueError("relative_path must start with '<patient_id>/...'") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a defense-in-depth measure, the storage adapter itself should validate the relative_path to prevent path traversal attacks. Currently, a path like 'patientA/../patientB/file.txt' would allow writing to patientB's storage area from a context that should be restricted to patientA.
All public methods of the adapter that accept relative_path (save, get, history, head) should validate it. This ensures the adapter is secure even if the calling code fails to sanitize its inputs.
Example validation:
# At the start of a method using relative_path
if os.path.isabs(relative_path) or '..' in relative_path.split(os.sep):
raise ValueError("Invalid relative_path: path cannot be absolute or contain '..'")| def save_with_parent(self, relative_path: str, data: bytes, user_id: str, action: str, parent: Optional[str] = None, message: Optional[str] = None) -> str: | ||
| """Save bytes with optimistic locking using a parent commit hash. If parent is provided, the adapter should verify | ||
| that the repository HEAD matches parent before committing. Returns new commit hash.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The save_with_parent method is a core part of the storage adapter's public interface for optimistic locking, but it's not defined as an abstract method. This means a subclass could be created without implementing it, leading to unexpected behavior.
To enforce the contract of the StorageAdapter, you should mark save_with_parent as an abstract method. The same applies to the head method, which is also essential for the optimistic locking flow.
| def save_with_parent(self, relative_path: str, data: bytes, user_id: str, action: str, parent: Optional[str] = None, message: Optional[str] = None) -> str: | |
| """Save bytes with optimistic locking using a parent commit hash. If parent is provided, the adapter should verify | |
| that the repository HEAD matches parent before committing. Returns new commit hash.""" | |
| @abstractmethod | |
| def save_with_parent(self, relative_path: str, data: bytes, user_id: str, action: str, parent: Optional[str] = None, message: Optional[str] = None) -> str: | |
| """Save bytes with optimistic locking using a parent commit hash. If parent is provided, the adapter should verify | |
| that the repository HEAD matches parent before committing. Returns new commit hash.""" |
| try: | ||
| head = repo.head.commit.hexsha | ||
| except Exception: | ||
| head = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Catching a broad Exception can hide bugs and make debugging difficult. In this case, repo.head.commit is known to raise a ValueError on a new repository with no commits. You should catch the specific exception to make the code more robust and explicit about what it's handling. This also applies to the head method.
| try: | |
| head = repo.head.commit.hexsha | |
| except Exception: | |
| head = None | |
| try: | |
| head = repo.head.commit.hexsha | |
| except ValueError: # repo.head.commit raises ValueError on a new repo | |
| head = None |
| f.write(data) | ||
|
|
||
| repo.index.add([os.path.relpath(full_path, repo.working_tree_dir)]) | ||
| actor = Actor("BHV System", "no-reply@example.com") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The commit author is currently hardcoded as 'BHV System'. For a system that provides an audit trail, it's crucial to record who performed the action. The user_id is available in this method and should be used to set the commit author. This will significantly improve the usefulness of the Git history for auditing purposes. You could use the user's ID for the name and construct an email, e.g., f"{user_id}@example.com".
| actor = Actor("BHV System", "no-reply@example.com") | |
| actor = Actor(user_id, f"{user_id}@example.com") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Adds a prototype Git-backed storage layer for per-patient “vaults” with optimistic locking, plus a minimal Flask demo app, docs, and tests to validate the approach end-to-end.
Changes:
- Introduces a
StorageAdapterinterface and aGitAdapterimplementation backed by per-patient local Git repos. - Adds a demo Flask app with upload/history/diff endpoints (plus simple admin HTML views) and maps optimistic-lock conflicts to HTTP 409.
- Adds unit + integration tests and an end-to-end demo script for reviewers.
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 22 comments.
Show a summary per file
| File | Description |
|---|---|
bhv/storage/base.py |
Defines the storage adapter interface used by the git-backed implementation. |
bhv/storage/git_adapter.py |
Implements Git-backed persistence, history, and optimistic-lock-aware saves. |
bhv/storage/errors.py |
Introduces a Conflict exception for optimistic locking failures. |
bhv/app.py |
Adds demo HTTP endpoints for upload/history/diff and simple admin HTML views. |
tests/test_git_adapter.py |
Unit tests for save/get/history and optimistic locking conflict behavior. |
tests/test_app_integration.py |
Flask integration tests for upload flow and HTTP 409 conflict mapping. |
scripts/demo_run.py |
Scripted end-to-end demo using Flask’s test client. |
requirements.txt |
Adds runtime dependencies for the demo/adapter. |
bhv/__init__.py |
Minimal package initializer for the demo package. |
README_GIT_ADAPTER.md |
Draft documentation for running and evaluating the prototype. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| repo = self._ensure_repo(patient_id) | ||
| rel_path = os.path.join(*parts[1:]) | ||
| if version is None: |
Copilot
AI
Feb 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rel_path = os.path.join(*parts[1:]) produces platform-specific separators (e.g., dir\\file.txt on Windows). That value is later used as a git pathspec (iter_commits(paths=rel_path)) and as a tree path (commit.tree / rel_path), both of which are typically POSIX (/) paths inside git objects. To avoid Windows-only breakage for nested paths, normalize git-internal paths to / (e.g., via posixpath.join(...) or PurePosixPath).
| patient_id = parts[0] | ||
| repo = self._ensure_repo(patient_id) | ||
| lock = self._locks[patient_id] | ||
| full_path = os.path.join(repo.working_tree_dir, *parts[1:]) | ||
| os.makedirs(os.path.dirname(full_path), exist_ok=True) | ||
|
|
||
| # Default save uses no parent check | ||
| return self.save_with_parent(relative_path, data, user_id, action, parent=None, message=message) | ||
|
|
||
| def save_with_parent(self, relative_path: str, data: bytes, user_id: str, action: str, parent: Optional[str] = None, message: Optional[str] = None) -> str: | ||
| parts = relative_path.split(os.sep) |
Copilot
AI
Feb 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In save(), repo/lock/full_path are computed and the directory is created, but then save_with_parent() is called which repeats the same repo/path setup and os.makedirs(). Consider refactoring so the setup happens in only one place (e.g., have save() just delegate to save_with_parent() after validation), to avoid duplicated logic and inconsistencies.
| patient_id = parts[0] | |
| repo = self._ensure_repo(patient_id) | |
| lock = self._locks[patient_id] | |
| full_path = os.path.join(repo.working_tree_dir, *parts[1:]) | |
| os.makedirs(os.path.dirname(full_path), exist_ok=True) | |
| # Default save uses no parent check | |
| return self.save_with_parent(relative_path, data, user_id, action, parent=None, message=message) | |
| def save_with_parent(self, relative_path: str, data: bytes, user_id: str, action: str, parent: Optional[str] = None, message: Optional[str] = None) -> str: | |
| parts = relative_path.split(os.sep) | |
| # Default save uses no parent check and delegates to save_with_parent | |
| return self.save_with_parent(relative_path, data, user_id, action, parent=None, message=message) | |
| def save_with_parent(self, relative_path: str, data: bytes, user_id: str, action: str, parent: Optional[str] = None, message: Optional[str] = None) -> str: | |
| parts = relative_path.split(os.sep) | |
| if len(parts) < 2: | |
| raise ValueError("relative_path must start with '<patient_id>/...'") |
| return self.save_with_parent(relative_path, data, user_id, action, parent=None, message=message) | ||
|
|
||
| def save_with_parent(self, relative_path: str, data: bytes, user_id: str, action: str, parent: Optional[str] = None, message: Optional[str] = None) -> str: | ||
| parts = relative_path.split(os.sep) |
Copilot
AI
Feb 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
save_with_parent() assumes relative_path has at least two components and immediately does patient_id = parts[0] without a length check. For invalid inputs (empty string, no separator) this can raise IndexError or create a repo with an unintended name. Mirror the len(parts) < 2 validation used in save()/get()/history()/head() here as well.
| parts = relative_path.split(os.sep) | |
| parts = relative_path.split(os.sep) | |
| if len(parts) < 2: | |
| raise ValueError("relative_path must start with '<patient_id>/...'") |
| patient_id = request.form.get('patient_id') | ||
| user_id = request.form.get('user_id', 'anonymous') | ||
| action = request.form.get('action', 'upload') |
Copilot
AI
Feb 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
patient_id is taken directly from user input and becomes part of the filesystem path (and repo name). Without validation/normalization, values like ../outside or ones containing path separators can escape STORAGE_ROOT and write to unexpected locations. Add strict validation for patient_id (allowlist characters) before constructing any paths.
| rows.append(f"<li><strong>{escape(item['datetime'])}</strong> - {escape(item['author'])} - {escape(item['message'])} - <a href='/admin/diff/{patient_id}/{filename}?a={item['hexsha']}'>diff from here to HEAD</a></li>") | ||
| body = "<h1>History for " + escape(filename) + "</h1><ul>" + "".join(rows) + "</ul>" |
Copilot
AI
Feb 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In admin_history, patient_id, filename, and item['hexsha'] are interpolated directly into the href attribute without URL-encoding/escaping. This can break the HTML (e.g., quotes in the path) and can open XSS vectors in an admin context. Use url_for(...) plus URL-encoding for path/query components, and escape any attribute values you render into HTML.
| commit1 = adapter.save(rel, b'first version', user_id='user1', action='create') | ||
| commit2 = adapter.save(rel, b'second version', user_id='user2', action='edit') |
Copilot
AI
Feb 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variable commit1 is not used.
| commit1 = adapter.save(rel, b'first version', user_id='user1', action='create') | |
| commit2 = adapter.save(rel, b'second version', user_id='user2', action='edit') | |
| adapter.save(rel, b'first version', user_id='user1', action='create') | |
| adapter.save(rel, b'second version', user_id='user2', action='edit') |
| adapter = GitAdapter(tmp) | ||
| rel = os.path.join('patientA', 'notes.txt') | ||
| commit1 = adapter.save(rel, b'first version', user_id='user1', action='create') | ||
| commit2 = adapter.save(rel, b'second version', user_id='user2', action='edit') |
Copilot
AI
Feb 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variable commit2 is not used.
| commit2 = adapter.save(rel, b'second version', user_id='user2', action='edit') | |
| adapter.save(rel, b'second version', user_id='user2', action='edit') |
|
|
||
|
|
||
| @app.route('/diff/<patient_id>/<path:filename>', methods=['GET']) | ||
| def diff_versions(patient_id, filename): |
Copilot
AI
Feb 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
diff_versions returns tuple of size 2 and tuple of size 3.
diff_versions returns tuple of size 2 and tuple of size 3.
| def _setup_storage(tmpdir): | ||
| # Replace the app's storage with a GitAdapter backed by tmpdir | ||
| adapter = GitAdapter(tmpdir) | ||
| import bhv.app as appmod |
Copilot
AI
Feb 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Module 'bhv.app' is imported with both 'import' and 'import from'.
| @@ -0,0 +1,77 @@ | |||
| import tempfile | |||
| import io | |||
| import os | |||
Copilot
AI
Feb 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Import of 'os' is not used.
| import os |
…nd templated UI - Add run.py: single-command app launcher - Implement bhv/db.py: MongoDB + TinyDB fallback DB abstraction - Implement bhv/full_app.py: minimal Flask app with routes for signup, login, logout, upload, my_entries, admin, history, diff, file_version - Add templates: base.html (header/nav/footer), index.html, signup.html, login.html, upload.html, patient.html (my entries), admin.html, history.html, diff.html - Add static/css/alaska.css: Alaska-themed design (teal accents, minimal cards, responsive layout) - Update requirements.txt: add TinyDB for embedded DB fallback - Templates support Bootstrap-like card layout with forms and file listings - Routes support role-based access control (patients view own entries, admins see all) - Storage integration: uses GitAdapter for versioned commits per file save
- Add flask-wtf CSRF protection to all forms - Implement email validation and password strength check (min 6 chars) - Add Google OAuth endpoint (/auth/google) for user auto-creation and login - Add Google Sign-In button to login and signup pages using Google Identity Services - Set secure session cookie flags (HttpOnly, SameSite, Secure in production) - Add persistent session management (7-day lifetime) - Update signup/login to handle OAuth and email-based auth seamlessly - Add .env.example with documentation for GOOGLE_CLIENT_ID and MONGO_URI
This PR adds a prototype Git-backed storage adapter for immutable patient vaults and a minimal demo app for reviewers to validate the approach.
Summary
StorageAdapterinterface:bhv/storage/base.py(includes optimistic-lock-awaresave_with_parentandhead).GitAdapterinbhv/storage/git_adapter.pyusing GitPython. Each patient has a per-patient local Git repo underdata/storage/<patient_id>/.save_with_parent(relative_path, parent=hexsha)checksHEAD == parentand raisesConflictif not.bhv/storage/errors.pywithConflictexception and maps it to HTTP 409 in the demo app.bhv/app.py:POST /upload(acceptsparentoptional field)GET /history/<patient>/<file>GET /diff/<patient>/<file>?a=&b=/admin/history/...,/admin/diff/...tests/test_git_adapter.py— unit tests for adapter behaviortests/test_app_integration.py— Flask integration tests that validate upload, optimistic-locking, and conflict (409)scripts/demo_run.py— automated end-to-end demo using Flask test client.git/pr_body.txt(present for convenience)Why this approach
Notes and limitations (important for reviewers)
data/storage/<patient_id>/. For stronger tamper evidence and backup, we should push to a remote and optionally sign commits.Testing & how to run locally
2 Run test:
python -m pytest -q
3.Run demo script (end-to-end via Flask test client):
python scripts/demo_run.py
4.Optional: run the demo server and exercise endpoints:
python -m bhv.app
then use curl/Postman against http://localhost:5000
Reviewer checklist
Design: Approve per-patient git repo approach and optimistic locking model.
Edge cases: Check bhv/storage/git_adapter.py handling of binary files, path sanitization, and large files.
Integration: Where to update MongoDB records with commit/head after save.
Auth: Confirm auth/admin gating for endpoints.
Hardening: Suggest production steps: remote push, commit signing, distributed locking
Here is some images :
@mdxabu Please look into this and let me know if this approach is suitable or not