
@rh-rahulshetty
Collaborator

Issue

Identified a scenario where the entire model pipeline fails when timestamp extraction fails for a single log line, for example because of an unknown timestamp format.

Log line (partially redacted for safety):

Sep 02 21:33:49 ip-10-***** kubenswrapper[3872]: E0902 21:33:49.941756    3872 event.go:346] "Server rejected event (will not retry!)" err="Timeout: request did not complete within requested timeout - conte.......

Error:

INFO: Pandarallel will run on 4 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
INFO: Pandarallel will run on 4 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
Namespace(input_files=['../../debug/5d900be3/0.log'], time_range='all-data', output_dir='./tmp/output', debug_mode=True, process_log_files=True, process_txt_files=False, model_type=<ModelType.ZERO_SHOT: 'zero_shot'>, model_name=<ZeroShotModels.CROSSENCODER: 'cross-encoder/nli-MiniLM2-L6-H768'>, clean_up=False)
Input files:
['../../debug/5d900be3/0.log']
../../debug/5d900be3/0.log patoolib.is_archive(file_): False
../../debug/5d900be3/0.log os.path.isdir(file_): False
'.csv' in extensions: False
'.xml' in extensions: False
Files to process:
../../debug/5d900be3/0.log
100%|█████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.66it/s]
Debug mode is set to: True
Starting pandarallel for log processing
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib64/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rashetty/Desktop/LogAn/open-source/logan/.venv/lib64/python3.11/site-packages/pandarallel/core.py", line 95, in __call__
    result = self.work_function(
             ^^^^^^^^^^^^^^^^^^^
  File "/home/rashetty/Desktop/LogAn/open-source/logan/.venv/lib64/python3.11/site-packages/pandarallel/data_types/series.py", line 26, in work
    return data.apply(
           ^^^^^^^^^^^
  File "/home/rashetty/Desktop/LogAn/open-source/logan/.venv/lib64/python3.11/site-packages/pandas/core/series.py", line 4924, in apply
    ).apply()
      ^^^^^^^
  File "/home/rashetty/Desktop/LogAn/open-source/logan/.venv/lib64/python3.11/site-packages/pandas/core/apply.py", line 1427, in apply
    return self.apply_standard()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/rashetty/Desktop/LogAn/open-source/logan/.venv/lib64/python3.11/site-packages/pandas/core/apply.py", line 1507, in apply_standard
    mapped = obj._map_values(
             ^^^^^^^^^^^^^^^^
  File "/home/rashetty/Desktop/LogAn/open-source/logan/.venv/lib64/python3.11/site-packages/pandas/core/base.py", line 921, in _map_values
    return algorithms.map_array(arr, mapper, na_action=na_action, convert=convert)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rashetty/Desktop/LogAn/open-source/logan/.venv/lib64/python3.11/site-packages/pandas/core/algorithms.py", line 1743, in map_array
    return lib.map_infer(values, mapper, convert=convert)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib.pyx", line 2972, in pandas._libs.lib.map_infer
  File "/home/rashetty/Desktop/LogAn/open-source/logan/preprocessing/preprocessing.py", line 828, in <lambda>
    df[['timestamps', 'epoch', 'text', 'preprocessed_text', 'numeric_count', "total_count", "token_count"]] = df['text'].parallel_apply(lambda log: pd.Series(self.process_fn(log, self.timezone_dict, self.master_timestamp_list, self.master_format_list)))
                                                                                                                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rashetty/Desktop/LogAn/open-source/logan/preprocessing/preprocessing.py", line 667, in process_fn
    timestamp, ts = self.extract_ts(log, rbr, timezone_dict, master_timestamp_list, master_format_list)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rashetty/Desktop/LogAn/open-source/logan/preprocessing/preprocessing.py", line 638, in extract_ts
    timestamp, ts, future_flag = self.master_datetime_extractor(log, timezone_dict, master_timestamp_list, master_format_list)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rashetty/Desktop/LogAn/open-source/logan/preprocessing/preprocessing.py", line 584, in master_datetime_extractor
    future_flag = parsed_date.timestamp() >= current_datetime.timestamp()
                  ^^^^^^^^^^^^^^^^^^^^^^^
ValueError: year 0 is out of range
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/rashetty/Desktop/LogAn/open-source/logan/run_log_diagnosis.py", line 101, in <module>
    preprocessing_obj.preprocess(args.input_files, 
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rashetty/Desktop/LogAn/open-source/logan/preprocessing/preprocessing.py", line 828, in preprocess
    df[['timestamps', 'epoch', 'text', 'preprocessed_text', 'numeric_count', "total_count", "token_count"]] = df['text'].parallel_apply(lambda log: pd.Series(self.process_fn(log, self.timezone_dict, self.master_timestamp_list, self.master_format_list)))
                                                                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rashetty/Desktop/LogAn/open-source/logan/.venv/lib64/python3.11/site-packages/pandarallel/core.py", line 333, in closure
    results_promise.get()
  File "/usr/lib64/python3.11/multiprocessing/pool.py", line 774, in get
    raise self._value
ValueError: year 0 is out of range

Fix

To resolve the issue, this PR wraps the timestamp extraction in a simple try/except block and logs the failure, so that one unparseable log line does not abort the whole run.
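
A minimal sketch of the idea, not the actual diff: the helper name `safe_extract_ts`, the `extractor` callable, and the `(None, None)` fallback are hypothetical and chosen only for illustration. `ValueError` is the exception seen in the traceback above. `OverflowError` and `OSError` are included on the assumption that they are worth guarding too, since `datetime.timestamp()` is documented to raise them for out-of-range dates on some platforms.

```python
import logging
from typing import Callable, Optional, Tuple

logger = logging.getLogger(__name__)


def safe_extract_ts(
    log: str,
    extractor: Callable[[str], Tuple[str, float]],
) -> Tuple[Optional[str], Optional[float]]:
    """Call `extractor` on one log line; return (None, None) if it raises.

    `extractor` is a stand-in for the real timestamp extraction call and is
    expected to return a (timestamp_string, epoch_seconds) pair.
    """
    try:
        return extractor(log)
    except (ValueError, OverflowError, OSError) as exc:
        # Log a truncated copy of the line so the failure can be traced
        # without flooding the output with very long log lines.
        logger.warning(
            "Timestamp extraction failed for log line %r: %s", log[:80], exc
        )
        return None, None
```

With a wrapper like this, a worker that hits a line such as the one shown above would return an empty timestamp for that row instead of raising inside `parallel_apply`. Downstream code would then need to tolerate missing timestamps.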

Signed-off-by: Rahul Shetty <rashetty@redhat.com>
@rh-rahulshetty rh-rahulshetty self-assigned this Nov 28, 2025
@rh-rahulshetty rh-rahulshetty added the bug Something isn't working label Nov 28, 2025