Anthropic pdf #4
base: main
add BotHandler class + update architecture
update llmlite library + small changes
refactor collections respource and chromadb modules
feat: timer decorator
add check collection creation status logic
chromad persistence
… + update chromadb version to fix folder deletion
improve collection error handling + add sqlalchemy + improve flyio co…
Caution: Review failed. Failed to post review comments.

Walkthrough

Adds Docker/Fly/GCloud configs, environment templates, and pins Python to 3.10.13. Introduces Flask app setup with SQLAlchemy, sessions, and centralized error handling. Adds API namespaces for Vertex/Gemini, Anthropic, OpenAI audio, and collections management with ChromaDB storage, web crawling, and PDF processing utilities. Expands dependency manifests.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant Client
participant API as Flask API
participant Coll as Collections NS
participant Store as ChromaStorage
participant WC as WebCrawling
participant PDF as PdfHandler
participant DB as SQLAlchemy(DB)
rect rgba(230,245,255,0.5)
Client->>Coll: POST /collections/document {chatId,url,fileName}
Coll->>Coll: spawn background handler
Coll-->>Client: 202 {collectionName}
end
alt URL is PDF
Coll->>PDF: pdf_to_chunks(url)
PDF-->>Coll: chunks
else URL is HTML
Coll->>WC: get_web_content(url)
WC-->>Coll: urlText chunks
end
Coll->>Store: store_text_array(..., collectionName)
Store-->>Coll: index created
Coll->>DB: (on error) save CollectionError
DB-->>Coll: ack
Client->>Coll: GET /collections/document/{collectionName}
alt exists
Coll-->>Client: 200 {status:DONE, price, tokens}
else processing
Coll-->>Client: 200 {status:PROCESSING}
else error
Coll-->>Client: 200 {status:DONE,error:INVALID_COLLECTION}
end
Client->>Coll: POST /collections/query {collectionName,prompt,conversation}
Coll->>Store: get_vector_index(collectionName)
Store-->>Coll: vector index
Coll-->>Client: 200 {message, tokens, price}
```
```mermaid
sequenceDiagram
autonumber
participant Client
participant Anth as Anthropic NS
participant AC as Anthropic Client
Client->>Anth: POST /anthropic/completions {messages,stream}
alt stream=true
Anth->>AC: messages.create(stream=True)
AC-->>Anth: event stream
Anth-->>Client: SSE data chunks
else
Anth->>AC: messages.create(stream=False)
AC-->>Anth: response
Anth-->>Client: 200 JSON
end
Anth-->>Client: (on error) CustomError -> JSON handler
```
```mermaid
sequenceDiagram
autonumber
participant Client
participant Vertex as Vertex NS
participant GAI as Vertex/Gemini SDK
Client->>Vertex: POST /vertex/completions[/gemini] {input,stream}
Vertex->>GAI: init + generate (stream or not)
alt stream
GAI-->>Vertex: token stream
Vertex-->>Client: SSE chunks
else
GAI-->>Vertex: full response
Vertex-->>Client: 200 JSON
end
Vertex-->>Client: errors via CustomError
```
```mermaid
sequenceDiagram
autonumber
participant Client
participant OAI as OpenAI NS
participant FS as File System
participant Whisper as OpenAI Whisper
Client->>OAI: POST /openai/upload-audio (multipart)
OAI->>FS: save file
OAI->>Whisper: transcribe(path)
Whisper-->>OAI: text
OAI->>FS: delete file
OAI-->>Client: 200 {transcription}
OAI-->>Client: 500 CustomError on failures
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 38
🧹 Nitpick comments (10)
.gitignore (1)
1-14: LGTM! Consider adding additional common patterns. The ignore rules appropriately exclude runtime data directories, environment files, and build artifacts. However, consider adding these common patterns for completeness:
```diff
+# Additional Python patterns
+*.pyc
+*.pyo
+*.egg-info/
+dist/
+build/
+.pytest_cache/
+.coverage
+htmlcov/
+
+# Additional virtual environment patterns
+.venv/
+env/
+ENV/
```

.dockerignore (1)
1-5: Expand .dockerignore to reduce build context size and improve security. The current patterns are a good start, but the Docker build context could be significantly reduced by adding more exclusions. This improves build speed and prevents accidentally copying sensitive or unnecessary files into the image.
```diff
 **/.DS_Store
 .vscode/
 __pycache__/
 chroma/
 .env
+
+# Git and version control
+.git/
+.gitignore
+.gitattributes
+
+# Environment files
+.env.*
+*.env
+
+# Documentation
+*.md
+docs/
+
+# Tests
+tests/
+test_*.py
+*_test.py
+.pytest_cache/
+.coverage
+htmlcov/
+
+# CI/CD
+.github/
+.gitlab-ci.yml
+
+# Build artifacts
+*.pyc
+*.pyo
+*.egg-info/
+dist/
+build/
+
+# Data directories
+data/
+uploads/
+*.db
+*.sqlite
+
+# Virtual environments
+venv/
+.venv/
+env/
+
+# IDE
+.idea/
+*.swp
+*.swo
+
+# Backup files
+*.bkp
+*.bak
+requirements.txt.bkp*
```

requirements.txt.bkp2 (2)
80-80: Consider migration to openai 1.x for improved typing and API stability.
`openai==0.28.1` uses the legacy global-function API. The 1.x client offers:
- Improved typing
- Explicit client instantiation
- Better retry/backoff behavior
- More maintainable code
Note that upgrading from 0.28.1 to 1.x is a breaking change requiring code updates across all OpenAI API calls.
Based on learnings.
127-127: Update werkzeug to the latest 3.0.x patch release.
`werkzeug==3.0.1` should be updated to 3.0.2, which includes fixes for routing behavior, type handling, and multipart parser improvements. Based on learnings.
```diff
-werkzeug==3.0.1; python_version >= '3.8'
+werkzeug==3.0.2; python_version >= '3.8'
```

.env.example (1)
8-8: Add trailing newline for POSIX compliance. Consider adding a blank line at the end of the file to follow POSIX text file conventions.
services/timer_decorator.py (1)
8-8: Prefer logging over print for production code. Using `print()` for timing information bypasses the application's logging configuration and cannot be easily controlled, filtered, or redirected. Consider using the `logging` module instead. Apply this diff:
```diff
 import time
+import logging
+
+logger = logging.getLogger(__name__)

 def timer(func):
     def wrapper(*args, **kwargs):
         start = time.time()
         result = func(*args, **kwargs)
         end = time.time()
-        print(f"{func.__name__} took {end - start:.6f} seconds")
+        logger.info(f"{func.__name__} took {end - start:.6f} seconds")
         return result
     return wrapper
```
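As a stand-alone illustration, the suggested logging-based timer can be exercised like this; `slow_add` is a made-up example function, not repository code:

```python
# Self-contained sketch of the logging-based timer suggested above.
import logging
import time

logger = logging.getLogger(__name__)

def timer(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        # logger.info can be filtered or redirected, unlike print()
        logger.info("%s took %.6f seconds", func.__name__, time.time() - start)
        return result
    return wrapper

@timer
def slow_add(a, b):
    time.sleep(0.01)
    return a + b
```

The decorated function behaves exactly as before; only the side channel for timing output changes.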
res/custom_error.py (1)
1-6: Consider adding docstring and type hints for better API documentation. The CustomError class would benefit from:
- A docstring explaining its purpose and usage
- Type hints for the constructor parameters (error_code could be int, message should be str)
Apply this diff to add documentation and type hints:
```diff
 class CustomError(Exception):
+    """Custom exception for API error handling with error codes.
+
+    Attributes:
+        error_code: Numeric error code for categorizing errors
+        message: Human-readable error description
+    """
-    def __init__(self, error_code, message):
+    def __init__(self, error_code: int, message: str) -> None:
         self.error_code = error_code
         self.message = message
         super().__init__(f"Error with code {error_code}: {message}")
```

Dockerfile (1)
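A usage sketch of the annotated class; the class body is repeated so the snippet is self-contained, and `lookup` is a hypothetical caller, not part of the codebase:

```python
# Sketch: raising and catching CustomError with the annotated constructor.
class CustomError(Exception):
    """Custom exception for API error handling with error codes."""

    def __init__(self, error_code: int, message: str) -> None:
        self.error_code = error_code
        self.message = message
        super().__init__(f"Error with code {error_code}: {message}")

def lookup(collection: str):
    # Illustrative failure path for a missing collection
    raise CustomError(404, f"collection '{collection}' not found")

try:
    lookup("missing")
except CustomError as err:
    code, text = err.error_code, str(err)
```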
6-6: Remove or document commented-out configuration lines. Lines 6 and 33 contain commented-out code:
- GOOGLE_APPLICATION_CREDENTIALS environment variable
- Alternative Flask run command
If these are temporary, consider removing them. If they serve as configuration examples, move them to documentation or add explanatory comments.
Also applies to: 33-33
res/llm_exceptions.py (1)
1-15: Simplify exception definitions - custom init is redundant. All four exception classes define identical `__init__` methods that simply forward arguments to `Exception.__init__`. This is unnecessary because Python's default exception handling already does this. The classes can be simplified to just the class declaration, or given meaningful init signatures if they need specific attributes. Apply this diff to simplify and add documentation:
+"""Domain-specific exceptions for LLM operations.""" + class PdfFileInvalidFormat(Exception): - def __init__(self,*args,**kwargs): - Exception.__init__(self,*args,**kwargs) + """Raised when a PDF file has an invalid or unsupported format.""" + pass class InvalidCollectionName(Exception): - def __init__(self,*args,**kwargs): - Exception.__init__(self,*args,**kwargs) + """Raised when a collection name is invalid or malformed.""" + pass class InvalidCollection(Exception): - def __init__(self,*args,**kwargs): - Exception.__init__(self,*args,**kwargs) + """Raised when referencing a non-existent or invalid collection.""" + pass class DatabaseError(Exception): - def __init__(self,*args,**kwargs): - Exception.__init__(self,*args,**kwargs) + """Raised when a database operation fails.""" + passAlternatively, if you need to store specific attributes, define meaningful constructors:
```python
class DatabaseError(Exception):
    """Raised when a database operation fails."""

    def __init__(self, operation: str, original_error: Exception):
        self.operation = operation
        self.original_error = original_error
        super().__init__(f"Database error during {operation}: {original_error}")
```

services/telegram.py (1)
3-15: Consider adding docstrings and error handling. The `BotHandler` class would benefit from:
- Class and method docstrings explaining parameters and behavior
- Error handling for bot API failures (network errors, invalid tokens, etc.)
- Type hints for better IDE support
Example enhancement:
```diff
 class BotHandler:
+    """Wrapper for Telegram bot operations using telebot.
+
+    Args:
+        token: Telegram bot API token
+    """
-    def __init__(self, token):
+    def __init__(self, token: str) -> None:
         self.bot = telebot.TeleBot(token)

-    def edit_message(self, msg, chat_id, message_id):
+    def edit_message(self, msg: str, chat_id: int, message_id: int) -> None:
+        """Edit an existing message.
+
+        Args:
+            msg: New message text
+            chat_id: Target chat ID
+            message_id: Message ID to edit
+        """
         self.bot.edit_message_text(text=msg, chat_id=chat_id, message_id=message_id)

-    def send_message(self, msg, chat_id):
+    def send_message(self, msg: str, chat_id: int) -> None:
+        """Send a new message.
+
+        Args:
+            msg: Message text
+            chat_id: Target chat ID
+        """
         self.bot.send_message(chat_id=chat_id, text=msg)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (3)
- `.DS_Store` is excluded by `!**/.DS_Store`
- `Pipfile.lock` is excluded by `!**/*.lock`
- `res/.DS_Store` is excluded by `!**/.DS_Store`
📒 Files selected for processing (38)
- `.dockerignore` (1 hunks)
- `.env.example` (1 hunks)
- `.gcloudignore` (1 hunks)
- `.gitignore` (1 hunks)
- `.python-version` (1 hunks)
- `Dockerfile` (1 hunks)
- `Pipfile` (1 hunks)
- `README.md` (1 hunks)
- `apis/__init__.py` (1 hunks)
- `apis/anthropic_resource.py` (1 hunks)
- `apis/collections/__init__.py` (1 hunks)
- `apis/collections/collections_helper.py` (1 hunks)
- `apis/collections/collections_resource.py` (1 hunks)
- `apis/llms_resource.py` (2 hunks)
- `apis/openai_resource.py` (1 hunks)
- `apis/vertex_resource.py` (3 hunks)
- `app.yaml` (1 hunks)
- `config.py` (1 hunks)
- `fly.production.toml` (1 hunks)
- `fly.toml` (1 hunks)
- `main.py` (1 hunks)
- `models/__init__.py` (1 hunks)
- `models/collection_error_model.py` (1 hunks)
- `requirements.txt` (1 hunks)
- `requirements.txt.bkp` (1 hunks)
- `requirements.txt.bkp2` (1 hunks)
- `res/__init__.py` (1 hunks)
- `res/custom_error.py` (1 hunks)
- `res/llm_exceptions.py` (1 hunks)
- `res/service_accoun_example.json` (1 hunks)
- `res/text_messages.py` (1 hunks)
- `services/__init__.py` (1 hunks)
- `services/pdf.py` (1 hunks)
- `services/telegram.py` (1 hunks)
- `services/timer_decorator.py` (1 hunks)
- `services/web_crawling.py` (1 hunks)
- `storages/__init__.py` (1 hunks)
- `storages/chromadb_storage.py` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (15)
models/__init__.py (1)
- models/collection_error_model.py (1): CollectionError (9-30)

apis/collections/__init__.py (1)
- apis/collections/collections_helper.py (1): CollectionHelper (7-55)

services/__init__.py (4)
- services/telegram.py (1): BotHandler (3-15)
- services/web_crawling.py (1): WebCrawling (5-61)
- services/pdf.py (1): PdfHandler (7-47)
- services/timer_decorator.py (1): timer (3-10)

res/__init__.py (3)
- res/text_messages.py (1): EngMsg (1-22)
- res/llm_exceptions.py (4): InvalidCollectionName (5-7), PdfFileInvalidFormat (1-3), DatabaseError (13-15), InvalidCollection (9-11)
- res/custom_error.py (1): CustomError (1-5)

storages/__init__.py (1)
- storages/chromadb_storage.py (1): ChromaStorage (10-84)

apis/llms_resource.py (4)
- res/text_messages.py (1): EngMsg (1-22)
- res/custom_error.py (1): CustomError (1-5)
- apis/vertex_resource.py (3): post (59-103), post (109-153), data_generator (47-54)
- apis/openai_resource.py (1): post (31-66)

apis/openai_resource.py (2)
- res/text_messages.py (1): EngMsg (1-22)
- res/custom_error.py (1): CustomError (1-5)

apis/collections/collections_resource.py (6)
- res/text_messages.py (1): EngMsg (1-22)
- apis/collections/collections_helper.py (6): CollectionHelper (7-55), reset_database (46-47), get_collection_name (15-16), is_pdf_url (18-19), get_collection (21-23), collection_query (25-35)
- services/web_crawling.py (2): WebCrawling (5-61), get_web_content (10-33)
- services/pdf.py (2): PdfHandler (7-47), pdf_to_chunks (12-15)
- models/collection_error_model.py (2): CollectionError (9-30), save (19-27)
- storages/chromadb_storage.py (5): reset_database (83-84), get_collection_name (28-35), store_text_array_from_url (49-57), store_text_array (60-68), get_collection (45-47)

apis/anthropic_resource.py (3)
- res/text_messages.py (1): EngMsg (1-22)
- res/custom_error.py (1): CustomError (1-5)
- services/pdf.py (2): PdfHandler (7-47), get_pdf_from_url (18-25)

services/pdf.py (1)
- services/timer_decorator.py (1): timer (3-10)

apis/vertex_resource.py (4)
- res/text_messages.py (1): EngMsg (1-22)
- res/custom_error.py (1): CustomError (1-5)
- apis/collections/collections_resource.py (5): data_generator (18-20), post (26-41), post (45-67), post (157-188), get (94-132)
- apis/llms_resource.py (2): data_generator (11-13), post (18-43)

apis/collections/collections_helper.py (2)
- storages/chromadb_storage.py (6): get_collection_name (28-35), get_collection (45-47), get_vector_index (71-78), delete_collection (80-81), get_existing_collection (38-43), reset_database (83-84)
- res/llm_exceptions.py (1): InvalidCollectionName (5-7)

storages/chromadb_storage.py (2)
- apis/collections/collections_helper.py (4): get_collection_name (15-16), get_collection (21-23), delete_collection (37-44), reset_database (46-47)
- apis/collections/collections_resource.py (1): get (94-132)

main.py (1)
- res/custom_error.py (1): CustomError (1-5)

models/collection_error_model.py (1)
- res/llm_exceptions.py (1): DatabaseError (13-15)
🪛 ast-grep (0.39.5)
main.py
[warning] 63-63: Detected Flask app with debug=True. Do not deploy to production with this flag enabled as it will leak sensitive information. Instead, consider using Flask configuration variables or setting 'debug' using system environment variables.
Context: app.run(debug=True)
Note: [CWE-489] Active Debug Code. [REFERENCES]
- https://labs.detectify.com/2015/10/02/how-patreon-got-hacked-publicly-exposed-werkzeug-debugger/
(debug-enabled-python)
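A minimal sketch of the suggested fix, assuming the `DEBUG` variable already listed in `.env.example`; `debug_from_env` is an illustrative helper name, not existing code:

```python
# Read the debug flag from the environment instead of hard-coding debug=True.
import os

def debug_from_env(default: bool = False) -> bool:
    """Return True only when DEBUG is set to a recognized truthy value."""
    value = os.getenv("DEBUG")
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes", "on")

# app.run(debug=debug_from_env())  # instead of app.run(debug=True)
```

This keeps debug mode off by default in production while still allowing it locally via the environment.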
🪛 Checkov (3.2.334)
res/service_accoun_example.json
[medium] 5-6: Private Key
(CKV_SECRET_13)
🪛 dotenv-linter (3.3.0)
.env.example
[warning] 1-1: [QuoteCharacter] The value has quote characters (', ")
(QuoteCharacter)
[warning] 2-2: [QuoteCharacter] The value has quote characters (', ")
(QuoteCharacter)
[warning] 2-2: [UnorderedKey] The AI21_API_KEY key should go before the OPENAI_API_KEY key
(UnorderedKey)
[warning] 3-3: [UnorderedKey] The DEBUG key should go before the OPENAI_API_KEY key
(UnorderedKey)
[warning] 5-5: [QuoteCharacter] The value has quote characters (', ")
(QuoteCharacter)
[warning] 5-5: [UnorderedKey] The CHROMA_SERVER_HOST key should go before the DEBUG key
(UnorderedKey)
[warning] 6-6: [QuoteCharacter] The value has quote characters (', ")
(QuoteCharacter)
[warning] 6-6: [UnorderedKey] The CHROMA_SERVER_HTTP_PORT key should go before the DEBUG key
(UnorderedKey)
[warning] 7-7: [QuoteCharacter] The value has quote characters (', ")
(QuoteCharacter)
[warning] 8-8: [EndingBlankLine] No blank line at the end of the file
(EndingBlankLine)
[warning] 8-8: [QuoteCharacter] The value has quote characters (', ")
(QuoteCharacter)
[warning] 8-8: [UnorderedKey] The CHROMA_SERVER_PATH key should go before the DEBUG key
(UnorderedKey)
🪛 OSV Scanner (2.2.3)
requirements.txt
[HIGH] 63-63: jinja2 3.1.3: Jinja2 vulnerable to sandbox breakout through attr filter selecting format method
[HIGH] 63-63: jinja2 3.1.3: Jinja has a sandbox breakout through malicious filenames
[HIGH] 63-63: jinja2 3.1.3: Jinja vulnerable to HTML attribute injection when passing user input as keys to xmlattr filter
[HIGH] 63-63: jinja2 3.1.3: Jinja has a sandbox breakout through indirect reference to format method
[HIGH] 140-140: werkzeug 3.0.2: Werkzeug debugger vulnerable to remote execution when interacting with attacker controlled domain
[HIGH] 140-140: werkzeug 3.0.2: Werkzeug safe_join not safe on Windows
[HIGH] 140-140: werkzeug 3.0.2: Werkzeug possible resource exhaustion when parsing file data in forms
[HIGH] 20-20: aiohttp 3.9.3: aiohttp vulnerable to Denial of Service when trying to parse malformed POST requests
[HIGH] 20-20: aiohttp 3.9.3: aiohttp Cross-site Scripting vulnerability on index pages for static file handling
[HIGH] 20-20: aiohttp 3.9.3: aiohttp allows request smuggling due to incorrect parsing of chunk extensions
[HIGH] 20-20: aiohttp 3.9.3: AIOHTTP is vulnerable to HTTP Request/Response Smuggling through incorrect parsing of chunked trailer sections
[HIGH] 15-15: certifi 2024.2.2: undefined
(PYSEC-2024-230)
[HIGH] 15-15: certifi 2024.2.2: Certifi removes GLOBALTRUST root certificate
[HIGH] 52-52: gunicorn 21.2.0: Gunicorn HTTP Request/Response Smuggling vulnerability
[HIGH] 52-52: gunicorn 21.2.0: Request smuggling leading to endpoint restriction bypass in Gunicorn
[CRITICAL] 53-53: h11 0.14.0: h11 accepts some malformed Chunked-Encoding bodies
[HIGH] 59-59: idna 3.6: undefined
(PYSEC-2024-60)
[HIGH] 59-59: idna 3.6: Internationalized Domain Names in Applications (IDNA) vulnerable to denial of service from specially crafted inputs to idna.encode
[CRITICAL] 69-69: langchain 0.0.306: undefined
(PYSEC-2023-162)
[CRITICAL] 69-69: langchain 0.0.306: undefined
(PYSEC-2023-205)
[CRITICAL] 69-69: langchain 0.0.306: undefined
(PYSEC-2024-115)
[CRITICAL] 69-69: langchain 0.0.306: undefined
(PYSEC-2024-118)
[CRITICAL] 69-69: langchain 0.0.306: undefined
(PYSEC-2024-43)
[CRITICAL] 69-69: langchain 0.0.306: Denial of service in langchain-community
[CRITICAL] 69-69: langchain 0.0.306: Langchain SQL Injection vulnerability
[CRITICAL] 69-69: langchain 0.0.306: LangChain Server Side Request Forgery vulnerability
[CRITICAL] 69-69: langchain 0.0.306: Langchain Server-Side Request Forgery vulnerability
[CRITICAL] 69-69: langchain 0.0.306: langchain vulnerable to arbitrary code execution
[CRITICAL] 69-69: langchain 0.0.306: Langchain vulnerable to arbitrary code execution via the evaluate function in the numexpr library
[CRITICAL] 69-69: langchain 0.0.306: LangChain directory traversal vulnerability
[CRITICAL] 69-69: langchain 0.0.306: langchain Server-Side Request Forgery vulnerability
[CRITICAL] 69-69: langchain 0.0.306: LangChain vulnerable to arbitrary code execution
[CRITICAL] 69-69: langchain 0.0.306: langchain vulnerable to path traversal
[CRITICAL] 71-71: litellm 0.1.793: Arbitrary file deletion in litellm
[CRITICAL] 71-71: litellm 0.1.793: LiteLLM has Server-Side Template Injection vulnerability in /completions endpoint
[CRITICAL] 71-71: litellm 0.1.793: litellm passes untrusted data to eval function without sanitization
[CRITICAL] 71-71: litellm 0.1.793: LiteLLM Has a Leakage of Langfuse API Keys
[CRITICAL] 71-71: litellm 0.1.793: SQL injection in litellm
[CRITICAL] 71-71: litellm 0.1.793: LiteLLM Vulnerable to Denial of Service (DoS) via Crafted HTTP Request
[CRITICAL] 71-71: litellm 0.1.793: LiteLLM Has an Improper Authorization Vulnerability
[CRITICAL] 71-71: litellm 0.1.793: LiteLLM Server-Side Request Forgery (SSRF) vulnerability
[CRITICAL] 71-71: litellm 0.1.793: LiteLLM Reveals Portion of API Key via a Logging File
[CRITICAL] 71-71: litellm 0.1.793: litellm vulnerable to remote code execution based on using eval unsafely
[CRITICAL] 71-71: litellm 0.1.793: LiteLLM Vulnerable to Denial of Service (DoS)
[CRITICAL] 71-71: litellm 0.1.793: SQL injection in litellm
[CRITICAL] 71-71: litellm 0.1.793: litellm vulnerable to improper access control in team management
[CRITICAL] 72-72: llama-index 0.8.39.post2: undefined
(PYSEC-2024-12)
[CRITICAL] 72-72: llama-index 0.8.39.post2: undefined
(PYSEC-2024-192)
[CRITICAL] 72-72: llama-index 0.8.39.post2: undefined
(PYSEC-2025-11)
[CRITICAL] 72-72: llama-index 0.8.39.post2: SQL injection in llama-index
[CRITICAL] 72-72: llama-index 0.8.39.post2: llama-index vulnerable to arbitrary code execution
[CRITICAL] 72-72: llama-index 0.8.39.post2: LlamaIndex vulnerable to data loss through hash collisions in its DocugamiReader class
[CRITICAL] 72-72: llama-index 0.8.39.post2: LlamaIndex Improper Handling of Exceptional Conditions vulnerability
[CRITICAL] 72-72: llama-index 0.8.39.post2: LlamaIndex vulnerable to Creation of Temporary File in Directory with Insecure Permissions
[CRITICAL] 72-72: llama-index 0.8.39.post2: LlamaIndex Uncontrolled Resource Consumption vulnerability
[CRITICAL] 72-72: llama-index 0.8.39.post2: RunGptLLM class in LlamaIndex has a command injection
[CRITICAL] 72-72: llama-index 0.8.39.post2: llama_index vulnerable to SQL Injection
[HIGH] 82-82: nltk 3.8.1: undefined
(PYSEC-2024-167)
[HIGH] 82-82: nltk 3.8.1: ntlk unsafe deserialization vulnerability
[HIGH] 93-93: protobuf 4.25.3: protobuf-python has a potential Denial of Service issue
[HIGH] 120-120: starlette 0.37.2: Starlette has possible denial-of-service vector when parsing large files in multipart forms
[HIGH] 120-120: starlette 0.37.2: Starlette Denial of service (DoS) via multipart/form-data
[CRITICAL] 137-137: waitress 2.1.2: undefined
(PYSEC-2024-210)
[CRITICAL] 137-137: waitress 2.1.2: undefined
(PYSEC-2024-211)
[CRITICAL] 137-137: waitress 2.1.2: Waitress vulnerable to DoS leading to high CPU usage/resource exhaustion
[CRITICAL] 137-137: waitress 2.1.2: Waitress has request processing race condition in HTTP pipelining with invalid first request
🪛 Ruff (0.13.3)
apis/llms_resource.py
30-30: Avoid equality comparisons to True; use data['stream']: for truth checks
Replace with data['stream']
(E712)
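A small illustration of why E712 matters: equality against `True` only matches the bool `True`, while a truthiness check accepts any truthy value. The `data` payload here is a stand-in, not the actual request body:

```python
# E712 illustration with a hypothetical payload.
data = {"stream": "yes"}

equality_check = data["stream"] == True   # noqa: E712  False: "yes" != True
truthiness_check = bool(data["stream"])   # True: non-empty string is truthy
```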
35-35: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
36-36: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
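The two patterns Ruff asks for here (TRY400 and B904) can be sketched together; all names below are illustrative, not taken from the codebase:

```python
# Sketch: chain exceptions with `from err` and log with logging.exception.
import logging

logger = logging.getLogger(__name__)

class CustomError(Exception):
    pass

def handle_request():
    try:
        raise ValueError("bad payload")
    except ValueError as err:
        logger.exception("request failed")            # TRY400: records traceback
        raise CustomError("request failed") from err  # B904: explicit chaining

try:
    handle_request()
except CustomError as err:
    cause = err.__cause__  # chaining preserves the original error
```

Chaining makes tracebacks show "The above exception was the direct cause of ..." so handler bugs are distinguishable from the original failure.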
37-37: Do not catch blind exception: Exception
(BLE001)
40-40: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
41-41: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
41-41: f-string without any placeholders
Remove extraneous f prefix
(F541)
apis/openai_resource.py
20-20: Consider moving this statement to an else block
(TRY300)
21-21: Do not catch blind exception: Exception
(BLE001)
60-60: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
61-61: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
62-62: Do not catch blind exception: Exception
(BLE001)
65-65: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
66-66: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
apis/collections/collections_resource.py
36-36: Abstract raise to an inner function
(TRY301)
37-37: Do not catch blind exception: Exception
(BLE001)
38-38: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
40-40: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
41-41: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
64-64: Do not catch blind exception: Exception
(BLE001)
66-66: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
67-67: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
69-69: Unused method argument: file_name
(ARG002)
77-77: Abstract raise to an inner function
(TRY301)
77-77: Avoid specifying long messages outside the exception class
(TRY003)
84-84: Abstract raise to an inner function
(TRY301)
84-84: Avoid specifying long messages outside the exception class
(TRY003)
85-85: Do not catch blind exception: Exception
(BLE001)
85-85: Local variable e is assigned to but never used
Remove assignment to unused variable e
(F841)
128-128: Do not catch blind exception: Exception
(BLE001)
129-129: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
131-131: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
132-132: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
146-146: Local variable e is assigned to but never used
Remove assignment to unused variable e
(F841)
148-148: Do not catch blind exception: Exception
(BLE001)
150-150: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
151-151: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
165-165: Local variable url is assigned to but never used
Remove assignment to unused variable url
(F841)
177-177: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
178-178: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
182-182: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
183-183: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
184-184: Do not catch blind exception: Exception
(BLE001)
186-186: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
187-187: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
188-188: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
apis/anthropic_resource.py
104-104: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
105-105: Do not catch blind exception: Exception
(BLE001)
106-106: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
106-106: Use explicit conversion flag
Replace with conversion flag
(RUF010)
107-107: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
152-152: Abstract raise to an inner function
(TRY301)
158-158: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
159-159: Do not catch blind exception: Exception
(BLE001)
160-160: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
160-160: Use explicit conversion flag
Replace with conversion flag
(RUF010)
161-161: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
248-248: Abstract raise to an inner function
(TRY301)
253-253: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
254-254: Do not catch blind exception: Exception
(BLE001)
255-255: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
255-255: Use explicit conversion flag
Replace with conversion flag
(RUF010)
256-256: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
services/pdf.py
20-20: Probable use of requests call without timeout
(S113)
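One way to address S113 is to always pass an explicit timeout. The helper below is hypothetical (services/pdf.py uses `requests` directly); the session is injected so the sketch runs without the network:

```python
# Sketch of the S113 fix: never call session.get without a timeout.
def fetch_url(url, session, timeout=10.0):
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response.content

# Stub objects show the call shape without touching the network.
class StubResponse:
    content = b"%PDF-1.4"
    def raise_for_status(self):
        pass

class StubSession:
    def get(self, url, timeout=None):
        assert timeout is not None, "a timeout must always be supplied"
        return StubResponse()

body = fetch_url("https://example.com/doc.pdf", StubSession())
```

With a real `requests.Session` the same call shape applies; without a timeout, a hung server can block the worker indefinitely.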
46-46: Do not catch blind exception: Exception
(BLE001)
47-47: Use explicit conversion flag
Replace with conversion flag
(RUF010)
apis/vertex_resource.py
97-97: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
98-98: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
99-99: Do not catch blind exception: Exception
(BLE001)
102-102: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
103-103: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
121-121: Local variable parameters is assigned to but never used
Remove assignment to unused variable parameters
(F841)
139-139: Avoid equality comparisons to True; use data['stream']: for truth checks
Replace with data['stream']
(E712)
147-147: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
148-148: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
149-149: Do not catch blind exception: Exception
(BLE001)
152-152: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
153-153: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
apis/collections/collections_helper.py
35-35: Avoid specifying long messages outside the exception class
(TRY003)
services/web_crawling.py
17-17: Probable use of requests call without timeout
(S113)
32-32: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
33-33: Use raise without specifying exception name
Remove exception name
(TRY201)
storages/chromadb_storage.py
12-12: Unused method argument: settings
(ARG002)
29-29: Probable use of insecure hash functions in hashlib: md5
(S324)
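A sketch of an S324-safe alternative, assuming the hash is used to derive a collection name from a URL; the helper name and prefix are illustrative, not the actual `get_collection_name` implementation:

```python
# Derive a stable, deterministic collection name with sha256 instead of md5.
import hashlib

def collection_name_from_url(url: str) -> str:
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return "col_" + digest[:16]

name = collection_name_from_url("https://example.com/doc.pdf")
```

If the digest is genuinely non-security-relevant, another option is `hashlib.md5(data, usedforsecurity=False)` on Python 3.9+, which also silences the warning.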
41-41: Consider moving this statement to an else block
(TRY300)
42-42: Local variable e is assigned to but never used
Remove assignment to unused variable e
(F841)
56-56: Local variable index is assigned to but never used
Remove assignment to unused variable index
(F841)
67-67: Local variable index is assigned to but never used
Remove assignment to unused variable index
(F841)
main.py
62-62: Possible binding to all interfaces
(S104)
64-64: Use of debug=True in Flask app detected
(S201)
models/collection_error_model.py
25-25: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
25-25: Avoid specifying long messages outside the exception class
(TRY003)
26-26: Do not catch blind exception: Exception
(BLE001)
27-27: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
27-27: Create your own exception
(TRY002)
27-27: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (12)
.gcloudignore (1)
15-15: LGTM! Excluding the `chroma/` directory from GCP builds is appropriate, as it likely contains large vector database files that should not be uploaded to the cloud. This aligns with similar exclusions in `.gitignore` and `.dockerignore`.

app.yaml (1)
2-2: Clarify whether 'test' in the service name is intentional. The service name `llm-vector-backend-test` includes "test", which might cause confusion if this configuration is used for production deployments. Consider using environment-specific service names or removing the "test" suffix if this is intended for production use.

Pipfile (2)
32-33: Clarify the reason for downgrading Python from 3.11.5 to 3.10.13. The Python version requirement was changed from 3.11.5 to 3.10.13, which is a significant downgrade. While this aligns with the Dockerfile and `.python-version` file, please confirm:
- Is this downgrade due to dependency compatibility requirements?
- Are there specific packages that don't support Python 3.11?
- Have you considered the performance and feature improvements in Python 3.11?

Additionally, as noted in the `.python-version` review, verify that 3.10.13 is the latest patch in the 3.10.x series.
7-27: LGTM! Version pinning improves reproducibility. The shift from wildcard version specifications to explicit pins (e.g., `flask==2.3.3`, `gunicorn==21.2.0`, `chromadb==0.4.13`) is a positive change that ensures consistent builds across environments. This follows best practices for production deployments.

.python-version (1)
1-1: Confirm Python 3.10.x patch and security status
`.python-version` is set to 3.10.13. No advisories were found for the pip "python" package via GitHub, but please check python.org for the latest 3.10.x patch and review the official CPython security bulletins (CVE listings) to ensure no critical fixes are pending.

res/text_messages.py (1)
10-22: LGTM! Well-structured API documentation constants. The new namespace descriptions and parameter documentation strings are clear and consistent, providing good API documentation support.
README.md (1)
1-13: LGTM! Clear and concise deployment documentation. The fly.io deployment instructions are straightforward and align with the deployment configurations in the PR.
fly.toml (1)
1-21: LGTM! Standard Fly.io development configuration. The configuration is well-structured with appropriate settings for a development environment, including HTTPS enforcement, auto-scaling, and persistent data mounting.
fly.production.toml (1)
1-21: LGTM! Production configuration separates concerns appropriately. The production configuration correctly uses a separate app name (`harmony-llm-api`) and data mount (`llm_api_data`) from the development setup. While there is duplication with `fly.toml`, this is acceptable for deployment configurations where explicit separation provides clarity and safety.
apis/collections/__init__.py (1)
1-2: LGTM! Clean package interface. The re-exports follow standard Python package organization patterns and provide a clean public API.
services/__init__.py (1)
1-4: LGTM! The package initialization cleanly exports service utilities, making them accessible at the package level.
models/__init__.py (1)
1-10: No changes required for circular import. The `db` instance is defined before importing `CollectionError`, and the relative import in `collection_error_model.py` is correct, so there's no runtime circular import issue.
| OPENAI_API_KEY="XXXXX" | ||
| AI21_API_KEY="XXXXX" | ||
| DEBUG=True | ||
| TESTING=False | ||
| CHROMA_SERVER_HOST="XXXXXX" | ||
| CHROMA_SERVER_HTTP_PORT="8000" | ||
| WEB_CRAWLER_HTTP="XXXXX" | ||
| CHROMA_SERVER_PATH="./chroma" No newline at end of file |
🛠️ Refactor suggestion | 🟠 Major
Remove unnecessary quotes from environment variable values.
In .env files, quotes around values are treated as literal characters and become part of the value itself. Remove the quotes unless you specifically need them as part of the string.
Apply this diff:
-OPENAI_API_KEY="XXXXX"
-AI21_API_KEY="XXXXX"
+OPENAI_API_KEY=XXXXX
+AI21_API_KEY=XXXXX
DEBUG=True
TESTING=False
-CHROMA_SERVER_HOST="XXXXXX"
-CHROMA_SERVER_HTTP_PORT="8000"
-WEB_CRAWLER_HTTP="XXXXX"
-CHROMA_SERVER_PATH="./chroma"
+CHROMA_SERVER_HOST=XXXXXX
+CHROMA_SERVER_HTTP_PORT=8000
+WEB_CRAWLER_HTTP=XXXXX
+CHROMA_SERVER_PATH=./chroma
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| OPENAI_API_KEY="XXXXX" | |
| AI21_API_KEY="XXXXX" | |
| DEBUG=True | |
| TESTING=False | |
| CHROMA_SERVER_HOST="XXXXXX" | |
| CHROMA_SERVER_HTTP_PORT="8000" | |
| WEB_CRAWLER_HTTP="XXXXX" | |
| CHROMA_SERVER_PATH="./chroma" | |
| OPENAI_API_KEY=XXXXX | |
| AI21_API_KEY=XXXXX | |
| DEBUG=True | |
| TESTING=False | |
| CHROMA_SERVER_HOST=XXXXXX | |
| CHROMA_SERVER_HTTP_PORT=8000 | |
| WEB_CRAWLER_HTTP=XXXXX | |
| CHROMA_SERVER_PATH=./chroma |
🧰 Tools
🪛 dotenv-linter (3.3.0)
[warning] 1-1: [QuoteCharacter] The value has quote characters (', ")
(QuoteCharacter)
[warning] 2-2: [QuoteCharacter] The value has quote characters (', ")
(QuoteCharacter)
[warning] 2-2: [UnorderedKey] The AI21_API_KEY key should go before the OPENAI_API_KEY key
(UnorderedKey)
[warning] 3-3: [UnorderedKey] The DEBUG key should go before the OPENAI_API_KEY key
(UnorderedKey)
[warning] 5-5: [QuoteCharacter] The value has quote characters (', ")
(QuoteCharacter)
[warning] 5-5: [UnorderedKey] The CHROMA_SERVER_HOST key should go before the DEBUG key
(UnorderedKey)
[warning] 6-6: [QuoteCharacter] The value has quote characters (', ")
(QuoteCharacter)
[warning] 6-6: [UnorderedKey] The CHROMA_SERVER_HTTP_PORT key should go before the DEBUG key
(UnorderedKey)
[warning] 7-7: [QuoteCharacter] The value has quote characters (', ")
(QuoteCharacter)
[warning] 8-8: [EndingBlankLine] No blank line at the end of the file
(EndingBlankLine)
[warning] 8-8: [QuoteCharacter] The value has quote characters (', ")
(QuoteCharacter)
[warning] 8-8: [UnorderedKey] The CHROMA_SERVER_PATH key should go before the DEBUG key
(UnorderedKey)
🤖 Prompt for AI Agents
.env.example lines 1-8: the example contains unnecessary double quotes around
values which become part of the environment values; remove the surrounding
quotes so keys are like OPENAI_API_KEY=XXXXX, AI21_API_KEY=XXXXX, DEBUG=True,
TESTING=False, CHROMA_SERVER_HOST=XXXXXX, CHROMA_SERVER_HTTP_PORT=8000,
WEB_CRAWLER_HTTP=XXXXX, CHROMA_SERVER_PATH=./chroma (i.e., edit each line to
drop the leading and trailing quotes).
| os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True) | ||
| filename = secure_filename(pdf.filename) | ||
| file_path = os.path.join(app.config['UPLOAD_FOLDER'], filename) | ||
| pdf.save(file_path) | ||
| pdf_file = open(file_path, 'rb') | ||
| pdf_text = extract_text(pdf_file) | ||
| return { | ||
| "text": pdf_text, | ||
| "path": file_path | ||
| } |
Close the uploaded PDF before returning it
get_pdf_text leaves the file handle open, so when the caller later tries to delete the temporary file (or reuse the filename), the open descriptor can block cleanup—especially on Windows where removing an open file raises an exception. Wrap the open call in a context manager (or close it explicitly) before returning to avoid leaking descriptors and to guarantee os.remove succeeds.
🤖 Prompt for AI Agents
In apis/anthropic_resource.py around lines 52 to 61, the function opens the
uploaded PDF with open(...) and returns while the file handle remains open;
update the code to open the file in a context manager so it is closed before
returning (e.g., use with open(file_path, 'rb') as pdf_file: pdf_text =
extract_text(pdf_file)), then return the text and path after the with-block so
no file descriptor is leaked and downstream os.remove or reuse of the filename
will succeed.
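The pattern the prompt describes can be sketched generically. A plain-text read stands in for pdfminer's `extract_text` here, and the directory and filename handling mirrors the handler above; all names are illustrative rather than the project's actual API:

```python
import os
import tempfile

def get_text_and_path(data: bytes, upload_dir: str, filename: str) -> dict:
    """Save uploaded bytes, then extract text with the handle closed on exit."""
    os.makedirs(upload_dir, exist_ok=True)
    file_path = os.path.join(upload_dir, filename)
    with open(file_path, "wb") as f:
        f.write(data)
    # The context manager guarantees the descriptor is released before we
    # return, so a later os.remove(file_path) cannot fail on Windows.
    with open(file_path, "rb") as f:
        text = f.read().decode("utf-8")  # stand-in for extract_text(f)
    return {"text": text, "path": file_path}

upload_dir = tempfile.mkdtemp()
result = get_text_and_path(b"hello resume", upload_dir, "cv.pdf")
os.remove(result["path"])  # succeeds: no open handle remains
```

Because the `with` block closes the descriptor before the function returns, the caller is free to delete or overwrite the file immediately.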
| elif (url and url.lower().endswith('.pdf')): | ||
| pdf_handler = PdfHandler() | ||
| data = pdf_handler.get_pdf_from_url(url) | ||
| pdf_text = extract_text(data) | ||
| if (pdf_text): | ||
| messages = create_message(pdf_text) | ||
| response = client.messages.create( | ||
| messages=messages, model=model, system=system, max_tokens=max_tokens) | ||
| responseJson = extract_response_data(response) | ||
| return responseJson, 200 |
Convert max_tokens to an int for URL-based PDFs
In the URL code path max_tokens remains a string ('1000' default or user input). Passing a string to client.messages.create raises a validation error because the SDK expects an integer. Mirror the uploaded-file branch: cast with int(max_tokens) before invoking the API.
- response = client.messages.create(
- messages=messages, model=model, system=system, max_tokens=max_tokens)
+ response = client.messages.create(
+ messages=messages, model=model, system=system, max_tokens=int(max_tokens))

🤖 Prompt for AI Agents
In apis/anthropic_resource.py around lines 141 to 150, the URL-based PDF branch
passes max_tokens as a string to client.messages.create which causes SDK
validation errors; cast max_tokens to an integer before calling the API (e.g.,
use int(max_tokens)) mirroring the uploaded-file branch so
client.messages.create receives an int value.
| pdf_file = args.get('pdf', None) | ||
| job_description = args.get('jobDescription') | ||
| model = args['model'] if args.get('model') is not None else 'claude-3-opus-20240229' | ||
| max_tokens = args['maxTokens'] if args.get('maxTokens') is not None else '1024' | ||
| if pdf_file and job_description: | ||
| resume = get_pdf_text(pdf_file) | ||
|
|
||
| prompt = f""" | ||
| Please analyze the following resume in relation to the given job description: | ||
| Job Description: | ||
| {job_description} | ||
| Resume: | ||
| {resume} | ||
| Provide the following: | ||
| 1. A score out of 100 for the matching of the candidate to the job description. | ||
| 2. A short paragraph explaining the reasoning behind the score. | ||
| 3. Three single-sentence suggestions for improving the resume, separated by newlines and numbered. | ||
| Please format your response as follows: | ||
| Score: {{score}} | ||
| Reasoning: {{reasoning}} | ||
| Suggested Improvements: | ||
| 1. {{improvement1}} | ||
| 2. {{improvement2}} | ||
| 3. {{improvement3}} | ||
| """ | ||
|
|
||
| response = client.messages.create( | ||
| model=model, | ||
| max_tokens=int(max_tokens), | ||
| messages=[ | ||
| {"role": "user", "content": prompt} | ||
| ] | ||
| ) |
Remove the temporary CV file
get_pdf_text returns both text and the saved path, but this branch never deletes the uploaded file afterwards. Every CV request therefore leaves a file behind in UPLOAD_FOLDER, causing unbounded disk growth. After you extract the text (and use it in the prompt), call os.remove(resume["path"]) (after closing the file per earlier comment) to clean up.
🤖 Prompt for AI Agents
In apis/anthropic_resource.py around lines 185 to 224, get_pdf_text returns both
extracted text and the saved file path but the code never deletes the uploaded
CV, leaving files in UPLOAD_FOLDER; after calling get_pdf_text and using
resume["text"] in the prompt, close any open file handles if applicable and call
os.remove(resume["path"]) (with a try/except around the remove to avoid crashing
if deletion fails) to delete the temporary file immediately after building the
prompt/response.
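A minimal sketch of the cleanup step the prompt asks for, assuming the `{"text": ..., "path": ...}` shape returned by `get_pdf_text` above; the helper name is hypothetical:

```python
import os
import tempfile

def cleanup_upload(path: str) -> bool:
    """Best-effort removal of the temporary CV file; never raises."""
    try:
        os.remove(path)
        return True
    except OSError:
        # Deletion failure (file locked, already gone) must not fail the request.
        return False

# Simulate the shape returned by get_pdf_text: {"text": ..., "path": ...}
fd, path = tempfile.mkstemp(suffix=".pdf")
os.close(fd)
resume = {"text": "extracted text", "path": path}

removed = cleanup_upload(resume["path"])        # file existed -> True
removed_again = cleanup_upload(resume["path"])  # already gone -> False
```

Calling this right after the prompt is built keeps `UPLOAD_FOLDER` from growing without bound, and the swallowed `OSError` keeps a failed delete from turning into a 500.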
| collection = collection_helper.get_collection(collection_name) | ||
| if (collection): | ||
| embeddings_number = collection.count() | ||
| current_app.logger.info(f'******* Number of embeddings: {embeddings_number}') | ||
| response = { | ||
| "price": embeddings_number * 0.05, # TBD | ||
| "status": 'DONE', | ||
| "error": None | ||
| } | ||
| else: | ||
| response = { | ||
| "price": 0, | ||
| "status": 'PROCESSING', | ||
| "error": None | ||
| } | ||
| return make_response(jsonify(response), 200) |
Distinguish pending collections before marking them DONE
collection_helper.get_collection always calls create_collection(..., get_or_create=True), so even when ingestion hasn’t completed yet this line creates a brand‑new, empty collection. The if (collection) branch is therefore always entered and status is reported as DONE even though collection.count() can still be zero. We lose the intended PROCESSING vs DONE distinction and surface “done” to clients while indexing is still running. Fetch the collection without get_or_create (e.g., get_existing_collection) or treat count()==0 as PROCESSING to keep the status accurate.
🤖 Prompt for AI Agents
In apis/collections/collections_resource.py around lines 110 to 125, the current
call to collection_helper.get_collection uses get_or_create semantics so it
always returns a (possibly empty) collection and causes status to be set to
DONE; update the logic to either (preferred) call a non-creating fetch (e.g.,
collection_helper.get_existing_collection or equivalent) and only enter the DONE
branch if an existing collection is returned and collection.count() > 0, or
(alternative) keep using get_collection but treat collection.count() == 0 as
PROCESSING (set price=0, status='PROCESSING', error=None) and only set
status='DONE' when count() > 0; ensure the response still returns 200 and
preserve logging with clear messages about whether collection was missing, empty
(processing) or done.
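The alternative fix (treat `count() == 0` as still processing) can be sketched with a stub collection, since only the `count()` method matters for the branching; the stub and the 0.05 per-embedding price are taken from the snippet above:

```python
class FakeCollection:
    """Stub mirroring the count() method of a Chroma collection."""
    def __init__(self, n: int):
        self._n = n

    def count(self) -> int:
        return self._n

def collection_status(collection) -> dict:
    """Report DONE only when embeddings actually exist."""
    if collection is not None and collection.count() > 0:
        n = collection.count()
        return {"price": n * 0.05, "status": "DONE", "error": None}
    # Missing or still-empty collection: ingestion has not finished.
    return {"price": 0, "status": "PROCESSING", "error": None}

print(collection_status(FakeCollection(10))["status"])  # DONE
print(collection_status(FakeCollection(0))["status"])   # PROCESSING
print(collection_status(None)["status"])                # PROCESSING
```

With this shape, a `get_or_create` fetch that silently creates an empty collection no longer flips the status to DONE prematurely.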
| def timer(func): | ||
| def wrapper(*args, **kwargs): | ||
| start = time.time() | ||
| result = func(*args, **kwargs) | ||
| end = time.time() | ||
| print(f"{func.__name__} took {end - start:.6f} seconds") | ||
| return result | ||
| return wrapper No newline at end of file |
🛠️ Refactor suggestion | 🟠 Major
Add @functools.wraps to preserve function metadata.
The decorator wrapper doesn't preserve the wrapped function's metadata (__name__, __doc__, __module__, etc.), which can break introspection, documentation tools, and debugging.
Apply this diff:
import time
+from functools import wraps
def timer(func):
+ @wraps(func)
def wrapper(*args, **kwargs):
start = time.time()
result = func(*args, **kwargs)
end = time.time()
print(f"{func.__name__} took {end - start:.6f} seconds")
return result
return wrapper
📝 Committable suggestion
| def timer(func): | |
| def wrapper(*args, **kwargs): | |
| start = time.time() | |
| result = func(*args, **kwargs) | |
| end = time.time() | |
| print(f"{func.__name__} took {end - start:.6f} seconds") | |
| return result | |
| return wrapper | |
| import time | |
| from functools import wraps | |
| def timer(func): | |
| @wraps(func) | |
| def wrapper(*args, **kwargs): | |
| start = time.time() | |
| result = func(*args, **kwargs) | |
| end = time.time() | |
| print(f"{func.__name__} took {end - start:.6f} seconds") | |
| return result | |
| return wrapper |
🤖 Prompt for AI Agents
In services/timer_decorator.py around lines 3 to 10, the timer decorator does
not preserve the wrapped function's metadata; add functools.wraps to the wrapper
to copy __name__, __doc__, __module__, etc.: import functools at top if not
present and annotate the inner wrapper with @functools.wraps(func) so the
original function metadata is preserved while keeping the current timing logic
and return value.
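To see what the suggestion buys, here is the patched decorator together with a quick check that the decorated function keeps its own metadata instead of reporting as `wrapper`:

```python
import functools
import time

def timer(func):
    @functools.wraps(func)  # copies __name__, __doc__, __module__, __wrapped__
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(f"{func.__name__} took {end - start:.6f} seconds")
        return result
    return wrapper

@timer
def add(a, b):
    """Return the sum of a and b."""
    return a + b

print(add.__name__)  # 'add', not 'wrapper'
print(add.__doc__)   # the original docstring survives
```

Without `@functools.wraps`, `add.__name__` would be `'wrapper'` and `add.__doc__` would be `None`, which breaks `help()`, Sphinx autodoc, and any code that dispatches on function names.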
| response = requests.get(request_url) | ||
| response.raise_for_status() |
Add a timeout to the crawler request.
This requests.get call can hang indefinitely if the crawler backend stalls. Add a timeout so API workers aren't pinned waiting forever.
Use:
- response = requests.get(request_url)
+ response = requests.get(request_url, timeout=30)

📝 Committable suggestion
| response = requests.get(request_url) | |
| response.raise_for_status() | |
| response = requests.get(request_url, timeout=30) | |
| response.raise_for_status() |
🧰 Tools
🪛 Ruff (0.13.3)
17-17: Probable use of requests call without timeout
(S113)
🤖 Prompt for AI Agents
In services/web_crawling.py around lines 17 to 18, the requests.get call can
hang indefinitely; add a timeout parameter (e.g., timeout=10) to the
requests.get invocation and update error handling to catch
requests.exceptions.Timeout (and optionally
requests.exceptions.RequestException) so the worker logs the timeout and fails
fast instead of waiting forever.
| f"Webcrawling {url} => Tags processed: {len(result['elements']) if 'elements' in result else 0}" | ||
| ) | ||
| chunks = self.parse_web_content(result['elements']) | ||
| return { | ||
| "urlText": chunks, |
Guard missing elements in crawler response.
If the crawler responds without an elements key, this line raises and the endpoint 500s. You already log with a fallback count, so fall back here as well.
Fix with:
- chunks = self.parse_web_content(result['elements'])
+ elements = result.get('elements', [])
+ chunks = self.parse_web_content(elements)

📝 Committable suggestion
| f"Webcrawling {url} => Tags processed: {len(result['elements']) if 'elements' in result else 0}" | |
| ) | |
| chunks = self.parse_web_content(result['elements']) | |
| return { | |
| "urlText": chunks, | |
| f"Webcrawling {url} => Tags processed: {len(result['elements']) if 'elements' in result else 0}" | |
| ) | |
| elements = result.get('elements', []) | |
| chunks = self.parse_web_content(elements) | |
| return { | |
| "urlText": chunks, |
🤖 Prompt for AI Agents
In services/web_crawling.py around lines 21 to 25, the code assumes
result['elements'] exists and will raise if the crawler response lacks an
'elements' key; change to read elements = result.get('elements', []) (or
equivalent) and pass that safe empty list into self.parse_web_content, and use
elements for any length/count logic so the function returns an empty chunks list
instead of 500ing when 'elements' is missing.
| for item in no_duplicates: | ||
| words = item.split() | ||
| word_count = len(words) | ||
|
|
||
| if current_word_count + word_count <= max_words: | ||
| concatenated_text += item + ' ' | ||
| current_word_count += word_count | ||
| else: | ||
| chunks.append(concatenated_text) | ||
| concatenated_text = '' | ||
| current_word_count = 0 | ||
|
|
Fix dropped content when a chunk exceeds max_words.
When the current item pushes the chunk past the word limit, you append the existing buffer but never add the overflowing item itself. Those tokens are silently lost. Reload the buffer with the current item before continuing.
Apply:
- if current_word_count + word_count <= max_words:
- concatenated_text += item + ' '
- current_word_count += word_count
- else:
- chunks.append(concatenated_text)
- concatenated_text = ''
- current_word_count = 0
+ if current_word_count + word_count <= max_words:
+ concatenated_text += item + ' '
+ current_word_count += word_count
+ else:
+ if concatenated_text:
+ chunks.append(concatenated_text.rstrip())
+ concatenated_text = item + ' '
+ current_word_count = word_count

📝 Committable suggestion
| for item in no_duplicates: | |
| words = item.split() | |
| word_count = len(words) | |
| if current_word_count + word_count <= max_words: | |
| concatenated_text += item + ' ' | |
| current_word_count += word_count | |
| else: | |
| chunks.append(concatenated_text) | |
| concatenated_text = '' | |
| current_word_count = 0 | |
| for item in no_duplicates: | |
| words = item.split() | |
| word_count = len(words) | |
| if current_word_count + word_count <= max_words: | |
| concatenated_text += item + ' ' | |
| current_word_count += word_count | |
| else: | |
| if concatenated_text: | |
| chunks.append(concatenated_text.rstrip()) | |
| concatenated_text = item + ' ' | |
| current_word_count = word_count |
🤖 Prompt for AI Agents
In services/web_crawling.py around lines 46 to 57, when an item causes the
current chunk to exceed max_words the code appends the existing buffer but drops
the overflowing item; change the logic so after appending concatenated_text you
reload the buffer with the current item (concatenated_text = item + ' ') and set
current_word_count = word_count so the item is preserved; if you want to handle
items longer than max_words later, add a TODO or separate splitting logic, but
do not reset the buffer to empty and lose the item.
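Put together as a self-contained function (names illustrative), the corrected loop carries the overflowing item into the next chunk and also flushes the final partial buffer, which the excerpt above does not show:

```python
def chunk_texts(items, max_words=400):
    """Greedily pack items into chunks of at most max_words words,
    carrying an overflowing item into the next chunk instead of dropping it."""
    chunks = []
    buffer = ""
    count = 0
    for item in items:
        n = len(item.split())
        if count + n <= max_words:
            buffer += item + " "
            count += n
        else:
            if buffer:
                chunks.append(buffer.rstrip())
            # Reload the buffer with the overflowing item so it is kept.
            buffer = item + " "
            count = n
    if buffer:  # flush the last partial chunk
        chunks.append(buffer.rstrip())
    return chunks

parts = ["one two three", "four five", "six seven eight nine"]
print(chunk_texts(parts, max_words=5))
# ['one two three four five', 'six seven eight nine']
```

Note that a single item longer than `max_words` still becomes its own oversized chunk; splitting such items would need separate logic.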
| client_settings = Settings( | ||
| is_persistent= True, | ||
| allow_reset=True | ||
| ) | ||
|
|
||
| path = config.CHROMA_SERVER_PATH | ||
| chromadb = ChromaStorage(path, client_settings) |
Avoid module-level singleton initialization - creates testing and configuration issues.
Creating the chromadb singleton at import time introduces several problems:
- Testing difficulties: Cannot easily mock or replace the storage for unit tests
- Configuration timing: Config must be loaded before any import of this module
- Multiple app instances: Cannot have different storage instances for different Flask apps
- Error handling: Import-time failures are harder to debug and recover from
- Security risk: `allow_reset=True` is dangerous in production as it permits database deletion
Consider using a factory pattern or lazy initialization:
from .chromadb_storage import ChromaStorage
from chromadb.config import Settings
from config import config
+_chromadb_instance = None
+
+def get_chromadb():
+ """Get or create the ChromaDB storage singleton."""
+ global _chromadb_instance
+ if _chromadb_instance is None:
+ client_settings = Settings(
+ is_persistent=True,
+ allow_reset=config.CHROMA_ALLOW_RESET # Make configurable
+ )
+ path = config.CHROMA_SERVER_PATH
+ _chromadb_instance = ChromaStorage(path, client_settings)
+ return _chromadb_instance
-client_settings = Settings(
- is_persistent= True,
- allow_reset=True
-)
-
-path = config.CHROMA_SERVER_PATH
-chromadb = ChromaStorage(path, client_settings)
+# For backward compatibility, expose as module-level attribute
+chromadb = None  # Will be initialized on first get_chromadb() call

Then update `config.py` to add `CHROMA_ALLOW_RESET` with a safe default (False for production).
🤖 Prompt for AI Agents
In storages/__init__.py around lines 6 to 12, the module currently creates a
ChromaStorage singleton at import time which causes
testing/configuration/instance management and security problems; change this to
a factory or lazy initializer function (e.g.,
get_chroma_storage(app_config=None) or create_chroma_storage(path=None,
allow_reset=None)) that constructs and returns a ChromaStorage instance on
demand, remove the module-level chromadb variable, read allow_reset from
config.CHROMA_ALLOW_RESET (add this flag to config.py with default False), and
ensure the factory validates config and handles/propagates creation errors
rather than performing import-time side effects so callers (including tests and
multiple app instances) can control lifecycle and mocking.
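The lazy-initialization shape can be demonstrated without ChromaDB at all; a stub class stands in for `ChromaStorage`, and the function and parameter names are assumptions for illustration:

```python
class FakeStorage:
    """Stand-in for ChromaStorage; records its construction arguments."""
    def __init__(self, path, allow_reset):
        self.path = path
        self.allow_reset = allow_reset

_storage = None

def get_storage(path="./chroma", allow_reset=False):
    """Create the storage on first use instead of at import time."""
    global _storage
    if _storage is None:
        _storage = FakeStorage(path, allow_reset)
    return _storage

a = get_storage()
b = get_storage(path="/ignored")  # later calls reuse the existing instance
print(a is b)  # True: one instance, created lazily
```

Because nothing runs at import time, tests can reset `_storage` (or monkeypatch `get_storage`) to inject a mock, and `allow_reset` defaults to the safe value instead of being hard-coded to `True`.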
Summary by CodeRabbit