@santoshborse, I was working my way through your PyData NYC tutorial, but I'm hitting a problem with cell 3.2 of `dpk_intro_1_python.ipynb`. I tried installing dpk 0.2.1, which gives the error below. If I install dpk 0.2.2 instead, I get `No module named 'docling.backend.docling_parse_v2_backend'` when running the same cell. Any thoughts on how I might fix this?
16:05:12 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.JSON: 'application/json'>, 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}
INFO:pdf2parquet_transform:pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.JSON: 'application/json'>, 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}
16:05:12 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
16:05:12 INFO - code location None
INFO:data_processing.runtime.execution_configuration:code location None
16:05:12 INFO - data factory data_ is using local data access: input_folder - input/solar-system output_folder - output/01_parquet_out
INFO:data_processing.data_access.data_access_factory_base07e751be-a466-43c5-8b1c-c20fe1535242:data factory data_ is using local data access: input_folder - input/solar-system output_folder - output/01_parquet_out
16:05:12 INFO - data factory data_ max_files -1, n_sample -1
INFO:data_processing.data_access.data_access_factory_base07e751be-a466-43c5-8b1c-c20fe1535242:data factory data_ max_files -1, n_sample -1
16:05:12 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']
INFO:data_processing.data_access.data_access_factory_base07e751be-a466-43c5-8b1c-c20fe1535242:data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']
16:05:12 INFO - orchestrator pdf2parquet started at 2024-12-03 16:05:12
INFO:data_processing.runtime.pure_python.transform_orchestrator:orchestrator pdf2parquet started at 2024-12-03 16:05:12
16:05:12 INFO - Number of files is 2, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.0551910400390625, 'total_file_size': 0.11101436614990234}
INFO:data_processing.runtime.pure_python.transform_orchestrator:Number of files is 2, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.0551910400390625, 'total_file_size': 0.11101436614990234}
16:05:12 INFO - Initializing models
INFO:pdf2parquet_transform:Initializing models
Fetching 9 files: 100%
9/9 [00:00<00:00, 426.19it/s]
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/data_processing/runtime/pure_python/transform_orchestrator.py", line 84, in orchestrate
_process_transforms(
File "/usr/local/lib/python3.10/dist-packages/data_processing/runtime/pure_python/transform_orchestrator.py", line 153, in _process_transforms
executor = PythonTransformFileProcessor(
File "/usr/local/lib/python3.10/dist-packages/data_processing/runtime/pure_python/transform_file_processor.py", line 46, in __init__
self.transform = transform_class(self.transform_params)
File "/usr/local/lib/python3.10/dist-packages/pdf2parquet_transform.py", line 105, in __init__
self._converter = DocumentConverter(
File "/usr/local/lib/python3.10/dist-packages/docling/document_converter.py", line 54, in __init__
self.model_pipeline = pipeline_cls(
File "/usr/local/lib/python3.10/dist-packages/docling/pipeline/standard_model_pipeline.py", line 24, in __init__
LayoutModel(
File "/usr/local/lib/python3.10/dist-packages/docling/models/layout_model.py", line 46, in __init__
self.layout_predictor = LayoutPredictor(
File "/usr/local/lib/python3.10/dist-packages/docling_ibm_models/layoutmodel/layout_predictor.py", line 96, in __init__
raise FileNotFoundError("Missing ONNX file: {}".format(self._onnx_fn))
FileNotFoundError: Missing ONNX file: /root/.cache/huggingface/hub/models--ds4sd--docling-models/snapshots/a8a57426c20d9f7bc0343cfd84e8b439425e5561/model_artifacts/layout/beehive_v0.0.5/model.pt
16:05:17 ERROR - Exception during execution Missing ONNX file: /root/.cache/huggingface/hub/models--ds4sd--docling-models/snapshots/a8a57426c20d9f7bc0343cfd84e8b439425e5561/model_artifacts/layout/beehive_v0.0.5/model.pt: None
ERROR:data_processing.runtime.pure_python.transform_orchestrator:Exception during execution Missing ONNX file: /root/.cache/huggingface/hub/models--ds4sd--docling-models/snapshots/a8a57426c20d9f7bc0343cfd84e8b439425e5561/model_artifacts/layout/beehive_v0.0.5/model.pt: None
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/data_processing/runtime/pure_python/transform_orchestrator.py", line 104, in orchestrate
stats["processing_time"] = round(stats["processing_time"], 3)
KeyError: 'processing_time'
16:05:17 ERROR - Exception during execution 'processing_time': None
ERROR:data_processing.runtime.pure_python.transform_orchestrator:Exception during execution 'processing_time': None
16:05:17 INFO - Completed execution in 0.083 min, execution result 1
INFO:data_processing.runtime.pure_python.transform_launcher:Completed execution in 0.083 min, execution result 1
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<timed exec> in <module>
Exception: ❌ Job failed
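Since both failures look like a version mismatch between dpk and the docling packages, it may help to list what is installed in the notebook environment. Below is a minimal sketch using only the standard library. The distribution names in `PACKAGES` are my assumptions about what is relevant here (they are not taken from the tutorial), and `installed_versions` is just an illustrative helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed distribution names; adjust to whatever the environment uses.
PACKAGES = [
    "data-prep-toolkit",
    "docling",
    "docling-core",
    "docling-ibm-models",
    "docling-parse",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Print one line per package so the output can be pasted into the issue.
for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

Running this in a fresh cell before cell 3.2, once for each dpk version, would show which docling versions each install pulls in.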