Working on making the conversion to hdt faster #578

Open · wants to merge 19 commits into dev from working-on-making-the-conversion-to-hdt-faster

Conversation

hmottestad (Contributor)

This is all work in progress.

ate47 (Collaborator) commented Jan 20, 2025

It seems interesting. I wrote this part years ago when I was still an intern, so it'll probably be easy to find even more to patch. But did you measure the time to parse the RDF file compared to the indexing time? As I remember, it was a small part.
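
For reference, one way to measure that split, as a rough standalone sketch (PhaseTiming, parseOnly, and convertToHdt are hypothetical stand-ins, not the project's actual API; wire them to a parse-only pass and to the full conversion, e.g. HDTManager.generateHDTDisk):

import java.util.concurrent.TimeUnit;

public class PhaseTiming {
    // Hypothetical stand-ins for the real entry points.
    static void parseOnly(String path) { /* stream the RDF file, discard triples */ }
    static void convertToHdt(String path) { /* full HDT generation */ }

    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        String file = "dataset.nt"; // assumed input file
        long parseMs = timeMillis(() -> parseOnly(file));
        long totalMs = timeMillis(() -> convertToHdt(file));
        System.out.printf("parse: %d ms, full conversion: %d ms (parse share: %.1f%%)%n",
                parseMs, totalMs, totalMs == 0 ? 0.0 : 100.0 * parseMs / totalMs);
    }
}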

Comment on lines 153 to 171
private final ExceptionIterator<T, E> in1;
private final ExceptionIterator<T, E> in2;
private final Comparator<T> comp;
private final int chunkSize = 1024 * 4;
// Could be a ForkJoinPool.commonPool(), or a custom pool
private final Executor executor = Executors.newVirtualThreadPerTaskExecutor();

private final Deque<T> chunk1 = new ArrayDeque<>();
private final Deque<T> chunk2 = new ArrayDeque<>();

// Local buffer to store merged chunks
private final Deque<T> buffer = new ArrayDeque<>();

hmottestad (Contributor, Author)

I've been experimenting with making MergeExceptionIterator parallel. The code reads up to chunkSize (4096) elements from each child iterator concurrently and stores them in the two fields chunk1 and chunk2.
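
Roughly, the prefetch step looks like this (a minimal sketch using plain Iterator in place of ExceptionIterator; ChunkPrefetcher and its method names are illustrative, not the PR's actual code):

import java.util.Deque;
import java.util.Iterator;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;

final class ChunkPrefetcher<T> {
    private static final int CHUNK_SIZE = 1024 * 4;
    private final Executor executor = Executors.newVirtualThreadPerTaskExecutor();

    // Drain up to CHUNK_SIZE elements from each child iterator in parallel,
    // then wait for both reads to finish before merging the two chunks.
    void fillChunks(Iterator<T> in1, Deque<T> chunk1, Iterator<T> in2, Deque<T> chunk2) {
        CompletableFuture<Void> f1 = CompletableFuture.runAsync(() -> fill(in1, chunk1), executor);
        CompletableFuture<Void> f2 = CompletableFuture.runAsync(() -> fill(in2, chunk2), executor);
        CompletableFuture.allOf(f1, f2).join();
    }

    private void fill(Iterator<T> in, Deque<T> chunk) {
        while (chunk.size() < CHUNK_SIZE && in.hasNext()) {
            chunk.addLast(in.next());
        }
    }
}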

For some reason I'm getting an exception when reading from the two child iterators, even when I don't read concurrently.

[KWayMerger#2Worker#0] .. [#         ] 10.00 reading triples part2 175300000
[KWayMerger#2Worker#1] .. [#         ] 10.00 reading triples part2 178200000
[KWayMerger#2Worker#10] . [#         ] 10.00 reading triples part2 177800000
[KWayMerger#2Worker#11] . [#         ] 10.00 reading triples part2 178100000
[KWayMerger#2Worker#12] . [#         ] 10.00 reading triples part2 178400000
[KWayMerger#2Worker#13] . [#         ] 10.00 reading triples part2 177200000
[KWayMerger#2Worker#14] . [#         ] 10.00 reading triples part2 178300000
[KWayMerger#2Worker#2] .. [#         ] 10.00 reading triples part2 175200000
[KWayMerger#2Worker#3] .. [#         ] 10.00 reading triples part2 176700000
[KWayMerger#2Worker#4] .. [#         ] 10.00 reading triples part2 178000000
[KWayMerger#2Worker#5] .. [#         ] 10.00 reading triples part2 177300000
[KWayMerger#2Worker#6] .. [#         ] 10.00 reading triples part2 175100000
[KWayMerger#2Worker#7] .. [#         ] 10.00 reading triples part2 177100000
[KWayMerger#2Worker#8] .. [#         ] 10.00 reading triples part2 174900000
[KWayMerger#2Worker#9] .. [#         ] 10.00 reading triples part2 177900000
[main] .................. [####      ] 40.00 Create mapped and sort triple file
Exception in thread "main" com.the_qa_company.qendpoint.core.exceptions.ParserException: com.the_qa_company.qendpoint.core.util.concurrent.KWayMerger$KWayMergerException: java.io.IOException: Triple got null node, but not all the nodes are 0! 2 0 17
        at com.the_qa_company.qendpoint.core.hdt.impl.HDTDiskImporter.compressTriples(HDTDiskImporter.java:253)
        at com.the_qa_company.qendpoint.core.hdt.impl.HDTDiskImporter.runAllSteps(HDTDiskImporter.java:357)
        at com.the_qa_company.qendpoint.core.hdt.HDTManagerImpl.doGenerateHDTDisk0(HDTManagerImpl.java:475)
        at com.the_qa_company.qendpoint.core.hdt.HDTManagerImpl.doGenerateHDTDisk(HDTManagerImpl.java:436)
        at com.the_qa_company.qendpoint.core.hdt.HDTManagerImpl.doGenerateHDTDisk(HDTManagerImpl.java:421)
        at com.the_qa_company.qendpoint.core.hdt.HDTManager.generateHDTDisk(HDTManager.java:818)
        at com.the_qa_company.qendpoint.core.tools.RDF2HDT.execute(RDF2HDT.java:205)
        at com.the_qa_company.qendpoint.core.tools.RDF2HDT.main(RDF2HDT.java:326)
Caused by: com.the_qa_company.qendpoint.core.util.concurrent.KWayMerger$KWayMergerException: java.io.IOException: Triple got null node, but not all the nodes are 0! 2 0 17
        at com.the_qa_company.qendpoint.core.util.io.compress.MapCompressTripleMerger.mergeChunks(MapCompressTripleMerger.java:232)
        at com.the_qa_company.qendpoint.core.util.concurrent.KWayMerger$MergeTask.run(KWayMerger.java:220)
        at com.the_qa_company.qendpoint.core.util.concurrent.KWayMerger$Worker.runException(KWayMerger.java:285)
        at com.the_qa_company.qendpoint.core.util.concurrent.ExceptionThread.run(ExceptionThread.java:125)
Caused by: java.io.IOException: Triple got null node, but not all the nodes are 0! 2 0 17
        at com.the_qa_company.qendpoint.core.util.io.compress.CompressTripleReader.setAllOrEnd(CompressTripleReader.java:76)
        at com.the_qa_company.qendpoint.core.util.io.compress.CompressTripleReader.hasNext(CompressTripleReader.java:64)
        at com.the_qa_company.qendpoint.core.iterator.utils.MergeExceptionIterator.fillBuffer(MergeExceptionIterator.java:222)
        at com.the_qa_company.qendpoint.core.iterator.utils.MergeExceptionIterator.hasNext(MergeExceptionIterator.java:185)
        at com.the_qa_company.qendpoint.core.iterator.utils.MergeExceptionIterator.fillBuffer(MergeExceptionIterator.java:222)
        at com.the_qa_company.qendpoint.core.iterator.utils.MergeExceptionIterator.hasNext(MergeExceptionIterator.java:185)
        at com.the_qa_company.qendpoint.core.iterator.utils.MergeExceptionIterator.fillBuffer(MergeExceptionIterator.java:222)
        at com.the_qa_company.qendpoint.core.iterator.utils.MergeExceptionIterator.hasNext(MergeExceptionIterator.java:185)
        at com.the_qa_company.qendpoint.core.iterator.utils.MergeExceptionIterator.fillBuffer(MergeExceptionIterator.java:222)
        at com.the_qa_company.qendpoint.core.iterator.utils.MergeExceptionIterator.hasNext(MergeExceptionIterator.java:185)
        at com.the_qa_company.qendpoint.core.util.io.compress.MapCompressTripleMerger.mergeChunks(MapCompressTripleMerger.java:221)
        ... 3 more

hmottestad (Contributor, Author)

Would you have any insight, @ate47, into why I'm getting this exception?

hmottestad (Contributor, Author)

Never mind, I figured it out. My approach is wrong: the sorting can't be batched the way I first thought.

@hmottestad force-pushed the working-on-making-the-conversion-to-hdt-faster branch from 292e0df to 6c534fe on February 18, 2025 at 12:22