Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filelog receiver reads old logs even when start_at is set to end #36091

Open
obs-gh-enricogiorio opened this issue Oct 30, 2024 · 11 comments
Open

Comments

@obs-gh-enricogiorio
Copy link

Hi @djaglowski I am still an issue where filelog starts reading from beginning of files despite a fix that was mentioned in this other issue.
That issue mentions that when the max concurrent files is reached, the receiver enters "batch" mode, and starts reading from the beginning of the files, regardless of what "start_at" is set to.
I made sure start_at is set to end and I have checked the receiver's logs, nothing mentions that any batching happened.
Also given the amount of log files I have in the system, I can pretty much ensure that the default limit of 1024 concurrent files has ever been reached.

For instance, on 16/10/24 13:22:24.363 CEST, the receiver read a log line with timestamp 14/05/24 14:40:41.042 CEST, which is >5 months old.
image

From this screenshot you can see:

BUNDLE_TIMESTAMP is when the log line was parsed (technically it's when it reached our system, but it's likely just a few seconds earlier)
timestamp is the log's timestamp
delta is the discrepancy between the two. The logs you see are 4 months old
The purple bits above are when the data falls over time. You can see that this did not happen constantly but only twice, with many instances of very old logs being read all together

This is how I configured the receiver:

  filelog:
    exclude:  '[]'
    include:  '["/var/log/pods/*/*/*.log", "/var/log/kube-apiserver-audit.log"]'
    include_file_name: false
    include_file_path: true
    exclude_older_than: 24h
    operators:
    - id: container-parser
      max_log_size: 102400
      type: container
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval:  30s
      max_elapsed_time: 5m
    start_at: "end"
    storage: file_storage
@djaglowski
Copy link
Member

For instance, on 16/10/24 13:22:24.363 CEST, the receiver read a log line with timestamp 14/05/24 14:40:41.042 CEST, which is >5 months old.

Just to be clear, neither start_at or exclude_older_than settings are intended to filter logs based on their timestamps. These settings are both concerned with aspects of the files themselves.

The start_at setting only has an effect when the receiver first starts running. From then on, any file which appears to be a new file will be ingested from the beginning of the file. Given that, is there any chance that these files containing old logs were moved, or could new logs have been written to the files?

@VihasMakwana
Copy link
Contributor

Perhaps related to #32727. Not sure though

@VihasMakwana
Copy link
Contributor

VihasMakwana commented Oct 31, 2024

Maybe, what's happening is:

  • You recently updated a file which was 4 months old.
  • By this time, the collector was already running, and setting start_at wouldn’t impact the later set of matched files
  • The receiver considers that the 4 months old file new (because it was updated) and read everything from beginning.

@obs-gh-enricogiorio
Copy link
Author

I need to check if it's even possible that k8s does that. The files I am parsing are all and only kubernetes log files (pod logs).

Just to be clear: what you are saying is that k8s might create a new log file (maybe mv an old file), so the receiver will (obviously) start reading it from the beginning, but that file might contain old log lines?

And one more clarification before I start looking at the files (I think I will look into whether there are files with creation_timestamp that is more recent than the first log row's timestamp).
file_storage is optional: if you don't set it, it won't use any external storage, it just keeps pointers in memory. This means that upon restart, the start_at: end is enforced. HOWEVER, if you set an external storage (either file or db), you DON'T start at the end, but you pick up from where you left, right?
So is my understanding correct that start_at: end is only relevant when not setting file_storage?

@obs-gh-enricogiorio
Copy link
Author

obs-gh-enricogiorio commented Nov 6, 2024

I confirm that this issue is NOT due to k8s renaming files, but it happens at startup.

I am deploying a stack of agent replicas via helm charts. When I run helm install ..., the following error shows up instantly in the agent container logs:

2024-11-06T10:23:57.194Z        error   internal/queue_sender.go:100    Exporting failed. Dropping data.        {"kind": "exporter", "data_type": "logs", "name": "otlphttp/observe/base", "error": "not retryable error: Permanent error: rpc error: code = Unknown desc = error exporting items, request to https://102.collect.observe-o2.com/v2/otel/v1/logs responded with HTTP Status Code 422, Message=some or all observation timestamps are too old or too far in the future: [1730547034602821748, 1730888635972996101), Details=[]", "dropped_items": 4096}
go.opentelemetry.io/collector/exporter/exporterhelper/internal.NewQueueSender.func1
        /home/runner/work/observe-agent/observe-agent/vendor/go.opentelemetry.io/collector/exporter/exporterhelper/internal/queue_sender.go:100
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
        /home/runner/work/observe-agent/observe-agent/vendor/go.opentelemetry.io/collector/exporter/internal/queue/bounded_memory_queue.go:52
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
        /home/runner/work/observe-agent/observe-agent/vendor/go.opentelemetry.io/collector/exporter/internal/queue/consumers.go:43

(this is due to our observability backend that collects data sent via OTEL agent rejecting the logs with 422 because they have outdated timestamps).
The error comes from the exporter, but obviously this is because filelog has read some old logs that are supposed to be left behind.

Is it possible that a change in the agent config/code makes the receiver start from the beginning of the logs at startup, even if we specify start_at: end?
I don't see why it would start at the beginning at startup otherwise...

@djaglowski
Copy link
Member

Just to be clear: what you are saying is that k8s might create a new log file (maybe mv an old file), so the receiver will (obviously) start reading it from the beginning, but that file might contain old log lines?

No, the receiver will not read a moved file at all. It will recognize it as the same file it's previously seen and ignore it (unless there are new lines to read, in which case it will read those)

file_storage is optional: if you don't set it, it won't use any external storage, it just keeps pointers in memory. This means that upon restart, the start_at: end is enforced. HOWEVER, if you set an external storage (either file or db), you DON'T start at the end, but you pick up from where you left, right?

Correct

So is my understanding correct that start_at: end is only relevant when not setting file_storage?

Pretty much, but to be more specific, it is only relevant the first time the receiver polls for files. If there is a storage extension with previously saved state, then the receiver knows it is not the first time it's polled for files.

@djaglowski
Copy link
Member

Is it possible that a change in the agent config/code makes the receiver start from the beginning of the logs at startup, even if we specify start_at: end?

If the files appear after the receiver has already been running, then it knows they are new files and reads them from the beginning. start_at: end is just a way to say "i don't want to read logs that already exist at the time the receiver starts". If they are found later, then they are read regardless of start_at because they are considered new.

@go-follow
Copy link
Contributor

We also have a need to read files from the end even after the receiver starts. For example, when we are already using storage and need to add a huge new file, but there is no need to read it from the beginning

@go-follow
Copy link
Contributor

Maybe it would be better to use the start_at setting together with the storage extension to not only start the receiver, but also for all new files (with offset = 0)? I could try to implement this

@djaglowski
Copy link
Member

We also have a need to read files from the end even after the receiver starts. For example, when we are already using storage and need to add a huge new file, but there is no need to read it from the beginning

Maybe it would be better to use the start_at setting together with the storage extension to not only start the receiver, but also for all new files (with offset = 0)? I could try to implement this

It's not clear to me what you're suggesting. How should the receiver (with or without the storage extension) know whether a file is a new file that should be consumed entirely vs a new file whose contents should be ignored initially?

@go-follow
Copy link
Contributor

I suggest adding an additional condition to check the offset when reading the file. This additional check could look something like this:

if !f.FromBeginning || (m.Offset <= 0 && f.StartAt == "end") {
  var info os.FileInfo
  if info, err = r.file.Stat(); err != nil {
	return nil, fmt.Errorf("stat: %w", err)
  }
  r.Offset = info.Size()
}

The point is that if the offset is 0 then this is a new file and it is better not to read it from the beginning, unless the option startAt explicitly has the value "beginning"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants