
Conversation

@t-kalinowski t-kalinowski (Collaborator) commented Sep 23, 2025

The primary motivation for this function is to speed up store creation using parallelism. It offloads the work of reading, converting to markdown, chunking, and generating embeddings to mirai background workers. This substantially reduces store creation time. For example, creating a store from quarto.org processes about 90% of the site in ~20 seconds before hitting a TPM rate limit of 1 million tokens per minute with OpenAI. Previously it would take around 10 minutes total.

This will require a solution for rate-limit handling, either through smart backoff or support for using the batch API. These can be addressed in follow-up PRs.

We now disable the background workers from retrying the request after a 429. The main thread then tries to generate embeddings with retries, and blocks until the embedding request succeeds. This gracefully leads to backpressure and lets the whole queue be rate-limited without spamming excessive retries.
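The retry split described above can be sketched as follows. This is an illustrative sketch only: `do_embed()`, `worker_embed()`, `main_embed()`, and the `max_retries` argument are hypothetical names, not the actual ragnar internals.

```r
# On a mirai worker: attempt embedding once, with retries disabled.
# On a 429 (or any error), hand the unembedded chunks back to the
# main process instead of retrying in the background.
worker_embed <- function(store, chunks) {
  tryCatch(
    do_embed(store, chunks, max_retries = 0L),
    error = function(e) chunks
  )
}

# On the main thread: retry with backoff, blocking until the request
# succeeds. While this loop waits, no new jobs are launched, so the
# whole queue is throttled to the provider's rate limit.
main_embed <- function(store, chunks, max_tries = 8L) {
  for (i in seq_len(max_tries)) {
    result <- tryCatch(do_embed(store, chunks), error = identity)
    if (!inherits(result, "error")) {
      return(result)
    }
    Sys.sleep(min(2^i, 60)) # exponential backoff, capped at 60s
  }
  stop(result)
}
```

Because only the single main thread retries, the workers stay busy with reading and chunking while embedding requests queue up behind the rate limit.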

@dfalbel dfalbel (Collaborator) left a comment

Nice!

I would abstract the queue implementation from actual ingest code so it's easier to read.

R/ingest.R (outdated)

```r
if (mirai::is_error_value(result)) {
  cond <- attributes(result)
  class(cond) <- cond$condition.class
  stop(cond)
```
Do we want to fully stop when something goes wrong? I think it would be nice to recover, and then offer an easy way to get the paths that didn't work.

@shikokuchuo shikokuchuo (Member) left a comment

I'd consider adding a way for users to use existing mirai daemons that they may have set up (to allow remote daemons). This could be by default or opt-in via an argument.

If you need to make anything conditional, you may use mirai::daemons_set().

An alternative is to do what purrr does, and abdicate responsibility to the user. E.g. offer a parallel = FALSE argument and for the TRUE branch, guard that with a mirai::require_daemons() call.
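The purrr-style approach could look roughly like this. A minimal sketch: the `parallel` argument and the function body are hypothetical, but `mirai::require_daemons()` and `mirai::daemons_set()` are the functions mentioned above.

```r
ragnar_store_ingest <- function(paths, store, parallel = FALSE) {
  if (parallel) {
    # Error early, with an informative message, if the user has not
    # set up daemons themselves (e.g. via mirai::daemons(4), possibly
    # pointing at remote machines).
    mirai::require_daemons()
    # ... dispatch jobs with mirai::mirai() ...
  } else {
    # ... process paths sequentially on the main thread ...
  }
}

# Alternatively, make parallelism conditional on existing daemons:
parallel <- mirai::daemons_set()
```

This abdicates daemon setup and teardown to the user, which keeps the function simple but means it cannot assume anything about worker state.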

@t-kalinowski t-kalinowski (Collaborator, Author) commented:

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +147 to +155
```r
.launch_jobs <- function() {
  while (
    length(.pending) &&
      (length(.finished) + length(.active)) < max_uncollected
  ) {
    mirai_args <- .pending[[1L]]
    .pending <<- .pending[-1L]
    .active[[length(.active) + 1L]] <<-
      inject(mirai::mirai(!!!mirai_args, .compute = .compute))
```


[P0] Reference rlang::inject when launching mirai jobs

The new queue helper calls inject(mirai::mirai(!!!mirai_args, .compute = .compute)), but inject is not defined or imported anywhere in the package. As soon as ragnar_store_ingest() schedules its first job, R will error with could not find function "inject" and the parallel ingestion never starts. Prefixing the call with rlang:: (or importing it) is required for this function to work at all.
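For reference, the fix is to namespace (or import) `inject()` from rlang, which splices the stored argument list into the `mirai()` call:

```r
# rlang::inject() evaluates its argument after expanding !!! splices,
# so the saved list of arguments is passed to mirai::mirai() as if
# written out by hand.
.active[[length(.active) + 1L]] <<-
  rlang::inject(mirai::mirai(!!!mirai_args, .compute = .compute))
```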


Comment on lines +111 to 118
```r
do_ingest_remote_work <- function(path, store, prepare, embed = TRUE) {
  chunks <- prepare(path)
  if (embed) {
    tryCatch(
      chunks <- do_embed(store, chunks),
      error = function(e) NULL
    )
    # cli::cli_progress_update()
  }
```


[P1] Propagate embedding failures from workers

Inside do_ingest_remote_work() the embedding call is wrapped with tryCatch(..., error = function(e) NULL) and any failure is discarded before returning chunks to the main process. When the embedder relies on packages not attached in mirai workers (as in the new test that intentionally omits stats::runif), the error is swallowed and ingestion succeeds silently because the main thread re‑embeds later. This prevents callers from learning that their embed function is misconfigured and causes the added expect_error() test to fail. Consider letting the error bubble up or returning it so that the main loop can stop or handle it explicitly.
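One way to let the main loop see the failure is to return the condition object instead of discarding it. A sketch, with `do_embed()` as in the excerpt above and the collection code simplified:

```r
# On the worker: capture the error object rather than swallowing it.
# `identity` returns the condition, so `chunks` becomes the error value.
if (embed) {
  chunks <- tryCatch(do_embed(store, chunks), error = identity)
}
chunks

# On the main thread, when collecting a finished job:
result <- mirai::collect_mirai(job)
if (inherits(result, "error")) {
  # Stop here, or record the failing path and continue, so callers
  # learn that their embed function is misconfigured.
  stop(result)
}
```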


@t-kalinowski t-kalinowski (Collaborator, Author) commented:

Thanks @shikokuchuo. I’d prefer to keep daemons’ lifetimes tied to the function call, since we configure global state in each worker. That way it stays simple and avoids the potential for leaking state or misconfigured workers.

@t-kalinowski t-kalinowski merged commit 027c4e6 into main Sep 30, 2025
5 checks passed