
Reporting indexing events to SDR based on solr responses #1366

Merged: 2 commits, Mar 1, 2024

Conversation

thatbudakguy
Member

@thatbudakguy thatbudakguy commented Feb 27, 2024

Closes #1321

This refactors the SolrBetterJsonWriter so that there's a data structure that connects the desired actions (add/update vs. delete) to each document/ID. We then use that data to report the status of each record after communicating with Solr, so that we can dispatch one message per druid indicating whether the addition or deletion succeeded.

The calls that currently do this in the indexer config are removed in favor of doing it in the writer, since it's more accurate to report based on the actual Solr response.
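The mapping described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual SolrBetterJsonWriter code: it assumes each context is a plain hash with `:action`, `:druid`, and `:doc` keys, whereas the real writer works with Traject contexts.

```ruby
# Hypothetical sketch of a batch that pairs each document/ID with its
# desired Solr action, so the writer can report per-druid outcomes
# after the Solr call. Names and structure are illustrative only.
class Batch
  def initialize(contexts)
    @contexts = contexts
  end

  # One [action, druid, data] tuple per record: :delete carries just
  # the doc id, :add carries the full document hash. Memoized so the
  # same array serves both JSON generation and reporting.
  def actions
    @actions ||= @contexts.map do |ctx|
      if ctx[:action] == :delete
        [:delete, ctx[:druid], ctx[:doc]['id']]
      else
        [:add, ctx[:druid], ctx[:doc]]
      end
    end
  end

  # After Solr responds, build one report message per druid.
  # Non-SDR content has no druid and is skipped.
  def report_messages(success:)
    actions.filter_map do |action, druid, _data|
      next unless druid
      "#{action} #{success ? 'succeeded' : 'failed'} for druid:#{druid}"
    end
  end
end
```

With a structure like this, the writer can send the batch to Solr and then dispatch one SDR event per druid based on the actual response, instead of reporting optimistically from the indexer config.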

@thatbudakguy thatbudakguy changed the title report via traject writer Reporting indexing events to SDR based on solr responses Feb 27, 2024
@thatbudakguy thatbudakguy marked this pull request as ready for review February 27, 2024 23:00
t0 = context.clipboard[:benchmark_start_time]
t1 = Time.now

logger.debug('geo_config.rb') { "Processed #{context.output_hash['id']} (#{t1 - t0}s)" }
SdrEvents.report_indexing_success(record.druid, target: settings['purl_fetcher.target'])
Member Author

moved to writer

@@ -170,7 +170,6 @@ def geoserver_url(record)
# delete records from the index
context.output_hash['id'] = ["stanford-#{druid}"]

SdrEvents.report_indexing_deleted(druid, target: settings['purl_fetcher.target'])
Member Author

moved to writer

Comment on lines +117 to +118
# Collection of Traject contexts to be sent to solr
class Batch
Contributor

I'm having trouble understanding what the responsibility of this class is. It seems like it does several things, which makes me wonder if it should be several classes.

Member Author

I initially extracted it because I needed an underlying data structure that mapped desired actions to documents. That data structure could then be used to generate the JSON to solr as well as to report about the status of the actions. I could instead extract the reporting-related stuff to methods directly on the writer that take a Batch, I guess?

Contributor

Yeah, it feels to me like the reporting stuff should be separate if possible.

Member Author

Moved it to the SdrEvents class.

# and data is either the doc id or the full doc hash. Druid is empty for
# non-SDR content.
def actions
@contexts.map do |context|
Contributor

Should this be memoized? It's called in generate_json and then again on the same data in report_success, right?

Member Author

Yeah, that's a good idea

Member Author

done
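The memoization agreed on above uses Ruby's standard `||=` idiom: the first call computes and caches the result, and later calls (e.g. from both generate_json and report_success) reuse the same array. A minimal illustration, with a call counter added purely to make the caching visible:

```ruby
# Illustrative example of ||= memoization (not the actual PR code).
class Example
  def initialize
    @calls = 0
  end

  # Computed once; subsequent calls return the cached @actions array.
  def actions
    @actions ||= begin
      @calls += 1
      [[:add, 'doc1'], [:delete, 'doc2']]
    end
  end

  attr_reader :calls
end
```

One caveat with `||=`: if the computation could legitimately return `nil` or `false`, it would be recomputed on every call, so a `defined?(@actions)` guard is safer in that case. Here the result is always an array, so `||=` is fine.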

Comment on lines 150 to 152
"delete: #{JSON.generate(data)}"
when :add
"add: #{JSON.generate(doc: data)}"
Contributor

I know this isn't your code, but these keys should be json strings. https://solr.apache.org/guide/8_0/uploading-data-with-index-handlers.html#UploadingDatawithIndexHandlers-SendingJSONUpdateCommands
So: JSON.generate(add: { doc: data })

Member Author

I wondered about this! Much more intuitive to just build it as a regular hash instead of concat-ing a string. I'll change it

Contributor

Unfortunately, it's Solr's own specific flavor of JSON which allows duplicate keys, so I think you'll still have to concat some strings to do it.

Member Author

added quotes
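The constraint discussed above is that Solr's JSON update syntax permits repeated `"add"`/`"delete"` keys in one object, which standard JSON generators refuse to emit, so each command string is generated separately (with properly quoted keys) and joined by hand. A hedged sketch of that approach, assuming `actions` is a list of `[action, data]` pairs as in the surrounding discussion:

```ruby
require 'json'

# Sketch of building a Solr JSON update body. Because Solr accepts
# duplicate "add"/"delete" keys in a single object, we can't build one
# Ruby hash and serialize it; instead each command is serialized on its
# own and the pieces are concatenated into the enclosing object.
def solr_update_body(actions)
  commands = actions.map do |action, data|
    case action
    when :delete
      "\"delete\": #{JSON.generate(data)}"
    when :add
      "\"add\": #{JSON.generate(doc: data)}"
    end
  end
  "{#{commands.join(',')}}"
end
```

For example, `solr_update_body([[:add, { 'id' => 'stanford-ab123cd4567' }], [:delete, 'stanford-xy987zw6543']])` yields a single object containing both an `"add"` and a `"delete"` command.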

@thatbudakguy thatbudakguy force-pushed the report-via-traject-writer branch from 6f400ca to f48aac7 on February 29, 2024 23:18
@thatbudakguy thatbudakguy requested a review from jcoyne February 29, 2024 23:19
@jcoyne jcoyne merged commit c5e47a4 into main Mar 1, 2024
2 checks passed
@jcoyne jcoyne deleted the report-via-traject-writer branch March 1, 2024 15:25
@cbeer
Member

cbeer commented Mar 2, 2024

This broke indexing from FOLIO. See #1383.

I've reverted it in #1384


Successfully merging this pull request may close these issues.

Report indexing to SDR based on traject writer