[chore][fileconsumer/archive] - Add archive read logic #35798

Merged
merged 55 commits into from
Dec 9, 2024

Conversation

VihasMakwana
Contributor

This PR follows #35098.

Description

  • This PR adds the core logic for matching from the archive. See the linked comment on #32727 for the core logic.

Future PRs

  • As of now, we don't keep track of the most recently written index across collector restarts. This is simple to accomplish, and we can use the persister for it. I haven't implemented it in the current PR because I want to keep your focus solely on the reading part. We can address it in this PR (later) or independently in a separate PR.
  • Testing and Enabling: Once we establish common ground on reading from the archive, we can proceed with testing and enabling the configuration.

@VihasMakwana
Contributor Author

@djaglowski can you take a fresh look? I've removed the Record; each index of the result slice now corresponds to the fingerprint at the same index in the input slice. Let me know what you think about it!


func Mod(x, y int) int {
	return ((x % y) + y) % y
}
Contributor Author

@VihasMakwana VihasMakwana Nov 9, 2024

This is needed because in Go the % operator computes a remainder, not a mathematical modulus, so it can return a negative integer when the left operand is negative.
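
For readers less familiar with this behavior, the following is a minimal, self-contained sketch of how `%` treats a negative left operand and how the `Mod` helper from this thread normalizes the result. The `main` function and the sample values are illustrative only and are not part of the PR.

```go
package main

import "fmt"

// Mod returns a non-negative result for any x as long as y > 0.
func Mod(x, y int) int {
	return ((x % y) + y) % y
}

func main() {
	x := -1
	fmt.Println(x % 3)     // -1: the result takes the sign of the left operand
	fmt.Println(Mod(x, 3)) // 2: wrapped into the range [0, 3)
	fmt.Println(Mod(4, 3)) // 1: non-negative inputs are unaffected
}
```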

Member

My understanding is that it only returns negative if you apply it to a negative value, so can't we just avoid that and compute this inline?

mostRecentIndex = (x - 1 + y) % y

or

mostRecentIndex--
if mostRecentIndex < 0 {
  mostRecentIndex += t.pollsToArchive
}

func (f *Fingerprint) GetFingerprint() *Fingerprint {
	return f
}

Contributor Author

I've added this so that Fingerprint can implement the Matchable interface, which gives us a unified return type for

func (t *fileTracker) FindFiles(fps []*fingerprint.Fingerprint) []fileset.Matchable {
	// To minimize disk access, we first access the index, then review unmatched files and update the metadata, if found.

I think this is neater than returning []any or creating a struct to capture the fingerprint/metadata.

Member

We don't need to return the fingerprints that didn't match. Just return []*reader.Metadata and leave the not found items empty.
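
A rough sketch of the shape being suggested here: the result slice has the same length and order as the input, and a nil entry marks a fingerprint that was not found. The `Metadata` type, the string fingerprints, and `findAll` are simplified stand-ins invented for illustration. They are not the real `reader.Metadata` or `FindFiles`.

```go
package main

import "fmt"

// Metadata is a simplified, hypothetical stand-in for reader.Metadata.
type Metadata struct {
	Offset int64
}

// findAll returns one entry per input fingerprint, in the same order.
// A nil entry means no match was found for the fingerprint at that index.
// Fingerprints are modeled as plain strings purely for illustration.
func findAll(fps []string, known map[string]*Metadata) []*Metadata {
	result := make([]*Metadata, len(fps))
	for i, fp := range fps {
		if md, ok := known[fp]; ok {
			result[i] = md
		}
	}
	return result
}

func main() {
	known := map[string]*Metadata{"aaa": {Offset: 10}, "ccc": {Offset: 30}}
	result := findAll([]string{"aaa", "bbb", "ccc"}, known)
	for i, md := range result {
		if md == nil {
			fmt.Printf("fps[%d]: not found\n", i)
			continue
		}
		fmt.Printf("fps[%d]: offset %d\n", i, md.Offset)
	}
}
```

Because the caller already holds the input slice, the unmatched fingerprint at index `i` is still available as `fps[i]`, so nothing is lost by leaving that slot nil.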

"go.opentelemetry.io/collector/component/componenttest"
)

func TestFindFilesOrder(t *testing.T) {
Contributor Author

Let me know your thoughts on this test @djaglowski.

@VihasMakwana
Contributor Author

@djaglowski I imagine you were at KubeCon last week. Just checking in to see if you're back and reviewing.


mostRecentIndex := util.Mod(t.archiveIndex-1, t.pollsToArchive)
matchedMetadata := make([]fileset.Matchable, len(fps))
indices := make(map[int]bool) // Track fp indices of original fps slice
Member

This complicates the code way more than necessary. The result slice can indicate just as well whether a fingerprint has been matched. Just check if the corresponding value is nil.

if md := data.Match(fps[index], fileset.StartsWith); md != nil {
	// update the matched metadata for this index
	matchedMetadata[index] = md
	delete(indices, index)
Member

Using a map and deleting keys from it may be marginally more performant but I don't buy that it's worth the complexity. Just loop over the result slice and only attempt to match an item if the corresponding result is nil.
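
As a sketch of this simpler approach, the loop below walks several sets in order and only tries to match entries whose result is still nil. The types and the `matchAcrossSets` name are hypothetical simplifications: fingerprints are plain strings and each archive set is a map, which is not how the real fileset works.

```go
package main

import "fmt"

// Metadata is a simplified, hypothetical stand-in for reader.Metadata.
type Metadata struct {
	Offset int64
}

// matchAcrossSets walks the given sets in order and fills result[i] the first
// time fps[i] is found. Entries that are already non-nil are skipped, so no
// separate map of pending indices is needed.
func matchAcrossSets(fps []string, sets []map[string]*Metadata) []*Metadata {
	result := make([]*Metadata, len(fps))
	for _, set := range sets {
		for i, fp := range fps {
			if result[i] != nil {
				continue // matched in an earlier set
			}
			if md, ok := set[fp]; ok {
				result[i] = md
			}
		}
	}
	return result
}

func main() {
	newest := map[string]*Metadata{"aaa": {Offset: 1}}
	older := map[string]*Metadata{"aaa": {Offset: 2}, "bbb": {Offset: 3}}
	result := matchAcrossSets([]string{"aaa", "bbb", "ccc"}, []map[string]*Metadata{newest, older})
	for i, md := range result {
		fmt.Println(i, md != nil)
	}
}
```

This sketch always visits every set. An early exit once no nil entries remain could be added on top of the same nil check.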

Contributor Author

Thanks for pointing this out. I've updated the code.

matchedMetadata := make([]*reader.Metadata, len(fps))

// continue executing the loop until either all records are matched or all archive sets have been processed.
for i := 0; i < t.pollsToArchive; i, mostRecentIndex = i+1, util.Mod(mostRecentIndex-1, t.pollsToArchive) {
Member

Can we move the buffer index update out of the control loop to make this easily readable?

Suggested change
- for i := 0; i < t.pollsToArchive; i, mostRecentIndex = i+1, util.Mod(mostRecentIndex-1, t.pollsToArchive) {
+ for i := 0; i < t.pollsToArchive; i++ {

Contributor Author

Yup. I've removed the Mod function and now compute the index inline. Should be good now.
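
One possible arrangement of the simplified loop, shown as a standalone sketch rather than the merged code: the loop header only counts iterations, and the ring-buffer index is stepped back inside the body with an explicit wrap-around. The `visitOrder` function is hypothetical, and it assumes `archiveIndex` is the next slot to be written, so the most recent data sits one slot before it.

```go
package main

import "fmt"

// visitOrder lists the ring-buffer slots from the most recently written one
// back to the oldest. archiveIndex is assumed to be the next slot to write,
// with 0 <= archiveIndex < pollsToArchive.
func visitOrder(archiveIndex, pollsToArchive int) []int {
	order := make([]int, 0, pollsToArchive)
	mostRecentIndex := archiveIndex
	for i := 0; i < pollsToArchive; i++ {
		// Step back one slot, wrapping around without a Mod helper.
		mostRecentIndex--
		if mostRecentIndex < 0 {
			mostRecentIndex += pollsToArchive
		}
		order = append(order, mostRecentIndex)
	}
	return order
}

func main() {
	fmt.Println(visitOrder(0, 3)) // [2 1 0]
	fmt.Println(visitOrder(2, 4)) // [1 0 3 2]
}
```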


Member

@djaglowski djaglowski left a comment

The code is making more sense to me now. This round of feedback is all nits but I think it will help readability.

pkg/stanza/fileconsumer/internal/tracker/tracker.go: 5 review threads (outdated, resolved)
@VihasMakwana
Contributor Author

@djaglowski does this look good to go?

@djaglowski djaglowski merged commit 1a90009 into open-telemetry:main Dec 9, 2024
158 checks passed
@github-actions github-actions bot added this to the next release milestone Dec 9, 2024
sbylica-splunk pushed a commit to sbylica-splunk/opentelemetry-collector-contrib that referenced this pull request Dec 17, 2024
AkhigbeEromo pushed a commit to sematext/opentelemetry-collector-contrib that referenced this pull request Jan 13, 2025