Investigating 191520-2024 for Performance Optimization #61

cristianvasquez · 2024-09-24T11:47:43Z

cristianvasquez
Sep 24, 2024
Maintainer

The time required for transformations is generally acceptable. However, some transformations occasionally time out.

An example is 191520-2024 with mappings/package_cn_v1.6, which times out on my machine.

This particular case provides an opportunity to investigate and identify any performance bottlenecks.


schivmeister · 2024-09-25T08:14:06Z

schivmeister
Sep 25, 2024
Collaborator

Thanks for reporting. This notice took ~7 mins with RMLMapper 6.2.2 (Java 17) on my Core i7-8565U laptop, which is six generations behind current hardware technology.

I took the liberty of gleaning some statistics from our internal test data staging ground (a superset of this project, since not all of it ends up here). The median file size is about 22KB and the average about 54KB. In terms of distribution, files of 0-255KB make up nearly 98% of the total.

find -type f -printf '%s %p\n' | numfmt --to=iec | sort -h

Statistic	Size (in bytes)
Average	53833.40558
Median	21923
Minimum	1361
Maximum	1784048

This implies that anything > 255KB can be considered relatively large for our purposes, and that this notice is an outlier (aside from the *100_lots ones, which are ~1.8MB).
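For anyone who wants to reproduce these figures without a spreadsheet, here is a minimal Python sketch. It assumes the files sit under a single directory tree and that "255KB" means 255 × 1024 bytes (matching the IEC units of numfmt above). The names `size_stats` and `threshold` are made up for illustration.

```python
from pathlib import Path
from statistics import mean, median


def size_stats(root, threshold=255 * 1024):
    """Summarise file sizes (in bytes) for all regular files under root."""
    sizes = [p.stat().st_size for p in Path(root).rglob("*") if p.is_file()]
    if not sizes:
        raise ValueError(f"no files found under {root}")
    return {
        "count": len(sizes),
        "average": mean(sizes),
        "median": median(sizes),
        "minimum": min(sizes),
        "maximum": max(sizes),
        # Fraction of files whose size is at or below the threshold.
        "share_within_threshold": sum(s <= threshold for s in sizes) / len(sizes),
    }
```

Calling `size_stats(".")` from the data directory should give the same kind of summary as the table above, plus the share of files at or below the threshold.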

While the remote call for hashing a URI part is an inefficient network operation, it usually only affects the runtime performance on the first run for the same set of rules, as something (likely RMLMapper) appears to be doing some sort of caching, which is good.

The performance impact seen here is introduced solely by the local transformation. Therefore, we are bottlenecked by tooling (RMLMapper) and can do very little about performance in this regard.


cristianvasquez · 2024-09-25T14:34:46Z

cristianvasquez
Sep 25, 2024
Maintainer Author

Thank you very much for the numbers! This information is valuable for configuring the pipeline to handle such outliers.

On my machine, CPU usage was low, but there was not enough memory.

I believe a pipeline could split the XML into chunks, each with fewer lots.

I'll move this issue to discussions.
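A rough sketch of what such chunking could look like, in Python. It assumes the lots are direct children of the root element and that everything else can simply be copied into every chunk. Neither assumption has been checked against the real notice structure. `split_by_lots`, `lot_tag` and `lots_per_chunk` are hypothetical names; for a real notice, `lot_tag` would be the namespace-qualified tag of the lot element.

```python
import copy
import xml.etree.ElementTree as ET


def split_by_lots(xml_text, lot_tag, lots_per_chunk):
    """Return XML strings, each holding at most lots_per_chunk lot elements.

    Assumption: lots are direct children of the root. All non-lot children
    are copied into every chunk unchanged.
    """
    if lots_per_chunk < 1:
        raise ValueError("lots_per_chunk must be at least 1")
    root = ET.fromstring(xml_text)
    children = list(root)
    lot_positions = [i for i, child in enumerate(children) if child.tag == lot_tag]
    if not lot_positions:
        return [xml_text]  # nothing to split on

    chunks = []
    for start in range(0, len(lot_positions), lots_per_chunk):
        keep = set(lot_positions[start:start + lots_per_chunk])
        new_root = ET.Element(root.tag, dict(root.attrib))
        new_root.text = root.text
        for i, child in enumerate(children):
            # Keep every non-lot child, plus only the lots of this chunk.
            if child.tag != lot_tag or i in keep:
                new_root.append(copy.deepcopy(child))
        chunks.append(ET.tostring(new_root, encoding="unicode"))
    return chunks
```

Two caveats: ElementTree rewrites namespace prefixes unless they are registered with `ET.register_namespace`, and any element that refers to a lot outside its own chunk would end up with a dangling reference, so the outputs of the chunks would need to be merged and checked afterwards.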

1 reply

schivmeister Sep 26, 2024
Collaborator

That is quite odd. Memory consumption on my end was significant, at around 500-700MB as reported by time -v <command> (and I assume part of that is JRE overhead), but not enough to hard-lock the machine or swap out to disk on a system with at least 8GB of RAM.

If it becomes a problem for the pipeline, chunking on the lots as you suggest is definitely a good idea. However, I would first investigate whether it is indeed a memory-bound problem, as transformation is expected to be CPU-bound for the most part.
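One way to check this from a script is a small wrapper along the lines of time -v, sketched below in Python. It is Unix-only (it relies on the standard `resource` module), and `measure` is a name made up for illustration. Note that `ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS, and that it is the high-water mark over all child processes waited for so far, not only the last one.

```python
import resource
import subprocess
import sys
import time


def measure(cmd):
    """Run cmd and report wall time, CPU time and peak RSS of child processes."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    proc = subprocess.run(cmd, check=False)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # User plus system CPU time consumed by the child during this run.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return {
        "returncode": proc.returncode,
        "wall_seconds": wall,
        "cpu_seconds": cpu,
        "cpu_utilisation": cpu / wall if wall > 0 else 0.0,
        # Platform-dependent unit: KB on Linux, bytes on macOS.
        "peak_rss_raw": after.ru_maxrss,
    }


if __name__ == "__main__":
    # Stand-in workload; replace with the actual transformation command.
    print(measure([sys.executable, "-c", "sum(range(10**6))"]))
```

As a rule of thumb, a CPU utilisation close to 1 (or above, for multi-threaded work) points to a CPU-bound run, while low utilisation together with a peak RSS close to the available memory would be consistent with swapping.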

cristianvasquez · 2024-10-16T14:21:55Z

cristianvasquez
Oct 16, 2024
Maintainer Author

I'll leave a link here to a benchmark of different RML engines; it might be relevant.

