light processing on over threshold errors, and keep track of record count #330
Conversation
Coverage report (generated by python-coverage-comment-action)
Looks good, just a few comments to discuss.
@@ -199,7 +205,6 @@ def validate_batch_csv(
    )
    yield results

    print("Processing other logic errors")
    for validation_results, _ in validate_chunks(
I say we take out anything that isn't the lf/parquet validation path, because I think we've proven the batch CSV approach isn't the way to go. Either here or in a new story.
Let's deal with it in another story. I think I replicated parts of that code and abstracted it out to what we have now for lazyframes; the thought is to still be able to do CLI CSV, but it would then fall into using the lazyframe batch processing.
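For context, a minimal sketch (assuming polars; the function names here are illustrative placeholders, not the repo's actual API) of what routing CLI CSV input through lazyframe batch processing could look like:

```python
import polars as pl

def validate_chunk(df: pl.DataFrame) -> dict:
    # placeholder: the real pipeline would run the schema checks per chunk here
    return {"rows": df.height, "errors": 0}

def validate_csv_via_lazyframe(path: str, batch_size: int = 50_000):
    # scan the CSV lazily, treating every column as text, then hand
    # fixed-size slices to the shared per-chunk validation
    lf = pl.scan_csv(path, infer_schema_length=0)
    total_rows = lf.select(pl.len()).collect().item()
    for offset in range(0, total_rows, batch_size):
        chunk = lf.slice(offset, batch_size).collect()  # materialize one batch at a time
        yield validate_chunk(chunk)
```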
    validation_results = format_findings(
        validation_results, schema.name.value, checks
    )

    error_counts, warning_counts = get_scope_counts(validation_results)
    results = ValidationResults(
        record_count=df.height,
        error_counts=error_counts,
What is the intended use of this?
Gonna guess you meant to comment on the record_count part; it's so the aggregator knows the total records of the submitted file, so we don't do that processing in the API, lessening the load there, and truly just have the API upload the file and nothing else.
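To illustrate the idea (a simplified stand-in, not the project's actual ValidationResults class): the chunk validator captures df.height once, so the aggregator can total records without ever reopening the submitted file.

```python
from dataclasses import dataclass, field
import polars as pl

@dataclass
class ChunkResults:  # simplified stand-in for the project's ValidationResults
    record_count: int
    error_counts: dict = field(default_factory=dict)
    warning_counts: dict = field(default_factory=dict)

def summarize_chunk(df: pl.DataFrame, error_counts: dict, warning_counts: dict) -> ChunkResults:
    # record_count comes straight from the frame height; no extra pass over the file
    return ChunkResults(record_count=df.height,
                        error_counts=error_counts,
                        warning_counts=warning_counts)

# downstream, the aggregator can total the submission without touching the raw file:
# total_records = sum(r.record_count for r in all_chunk_results)
```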
register_schema = get_register_schema(context)
- validation_results = validate(register_schema, pl.DataFrame({"uid": all_uids}), 0)
+ validation_results = validate(register_schema, pl.DataFrame({"uid": all_uids}), 0, True)
I know this will probably never happen, but say a bank submits an SBLAR with over 1 million entries and they accidentally use the same UID for every row. We'd end up with register errors beyond our max limit, as well as moving on to processing other logic errors beyond the max limit. It doesn't need to happen here, but I think we should look at treating these the same.
maybe, although this part isn't doing it on the submitted lf/df; it's on the information we already gathered from the previous pass (syntax errors), so it's a lot less intensive, and it's currently not counting towards the limit.
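For the record, a rough sketch (polars assumed; names are illustrative, not the repo's API) of why that register-level pass stays cheap: it only groups the UID column already collected during the syntax pass, and a findings cap could be applied the same way as the main error limit.

```python
import polars as pl

def find_duplicate_uids(all_uids: list[str], max_findings: int = 1_000_000) -> pl.DataFrame:
    # operates on the already-collected UIDs, not on the full submitted lazyframe
    uid_df = pl.DataFrame({"uid": all_uids})
    dupes = (
        uid_df.group_by("uid")
        .agg(pl.len().alias("n"))
        .filter(pl.col("n") > 1)   # UIDs appearing more than once
    )
    # optionally cap how many findings are reported, mirroring the max-error limit
    return dupes.head(max_findings)

print(find_duplicate_uids(["A1", "A1", "B2"]))
```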