While automating document ingestion, I encountered an issue where the script would halt whenever it came across a defective file. This is a real problem with large volumes of files, since manually identifying the offending file is impractical. To mitigate this, I propose adding error handling that moves defective files to a 'defective' folder and continues with the next file.
To accomplish this, modifications have been made to the 'ingest_folder.py' and 'ingest_service.py' files:
ingest_folder.py:
Modified the '_do_ingest_one' method to handle exceptions so the ingestion process can proceed despite errors on specific files:

```python
def _do_ingest_one(self, changed_path: Path) -> None:
    try:
        if changed_path.exists():
            self.logger.info(f"Started ingesting file={changed_path}")
            self.ingest_service.ingest_file(changed_path.name, changed_path)
            self.logger.info(f"Completed ingesting file={changed_path}")
    except Exception as e:
        # Catch Exception rather than BaseException so that
        # KeyboardInterrupt and SystemExit still propagate.
        self.logger.error(f"Failed to ingest document: {changed_path}, with exception: {e}")
        self.move_defective_file(changed_path)
```
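The `move_defective_file` helper referenced above is not shown in the snippet; here is a minimal sketch of what it could look like, assuming a sibling 'defective' folder and using only the standard library (the function name and default folder are my assumptions, not confirmed project code):

```python
import shutil
from pathlib import Path


def move_defective_file(changed_path: Path, defective_dir: Path = Path("defective")) -> Path:
    """Move a file that failed ingestion into a 'defective' folder.

    Hypothetical helper: creates the folder on first use and returns
    the file's new location so the caller can log it.
    """
    defective_dir.mkdir(parents=True, exist_ok=True)
    target = defective_dir / changed_path.name
    shutil.move(str(changed_path), str(target))
    return target
```

Note that `shutil.move` will overwrite an existing file of the same name in the target folder, which is usually acceptable for quarantined files but worth knowing.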
ingest_service.py:
Modified the 'ingest_file' and 'bulk_ingest' methods to handle exceptions:
```python
def ingest_file(self, file_name: str, file_data: Path) -> list[IngestedDoc]:
    self.logger.info(f"Ingesting file_name={file_name}")
    try:
        documents = self.ingest_component.ingest(file_name, file_data)
        self.logger.info(f"Finished ingestion file_name={file_name}")
        return [IngestedDoc.from_document(document) for document in documents]
    except Exception as e:
        self.logger.error(f"Failed to ingest {file_name} due to {e}")
        return []

def bulk_ingest(self, files: list[tuple[str, Path]]) -> list[IngestedDoc]:
    self.logger.info("Bulk ingesting files: %s", [f[0] for f in files])
    ingested_docs = []
    for file_name, file_data in files:
        try:
            documents = self.ingest_file(file_name, file_data)
            ingested_docs.extend(documents)
        except Exception as e:
            self.logger.error(f"Failed to ingest {file_name} in bulk due to {e}")
    self.logger.info("Finished bulk ingestion.")
    return ingested_docs
```

(The original snippet mixed a module-level `logger` with `self.logger` and `List[...]` with `list[...]`; both are unified above.)
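To show the continue-on-error pattern in isolation, here is a minimal, self-contained sketch of the same loop, with the service dependencies replaced by a plain callback (names like `bulk_ingest_demo` and `ingest_one` are illustrative, not project code):

```python
from pathlib import Path
from typing import Callable


def bulk_ingest_demo(
    files: list[tuple[str, Path]],
    ingest_one: Callable[[str, Path], str],
) -> tuple[list[str], list[tuple[str, str]]]:
    """Ingest each file in turn; a failure is recorded, not fatal.

    Returns (successful results, [(file_name, error message), ...]).
    """
    ingested: list[str] = []
    failed: list[tuple[str, str]] = []
    for file_name, file_data in files:
        try:
            ingested.append(ingest_one(file_name, file_data))
        except Exception as e:
            # One bad file no longer aborts the whole batch.
            failed.append((file_name, str(e)))
    return ingested, failed
```

Returning the failures alongside the successes (instead of only logging them) also makes this behavior easy to unit-test.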