Outsourcing and improving the posting writing of IndexImpl.Text.cpp #1699
base: master
…yet, commit is used to initialize branch
…quency and gap compressed lists.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1699      +/-   ##
==========================================
- Coverage   89.86%   89.86%   -0.01%
==========================================
  Files         389      391       +2
  Lines       37308    37317       +9
  Branches     4204     4203       -1
==========================================
+ Hits        33527    33535       +8
- Misses       2485     2487       +2
+ Partials     1296     1295       -1
The most important thing is:
Please tell me where you have actually changed something and where you have only extracted code, so that we can properly review the nontrivial changes.
I really like this idea; everything that makes the index class smaller is good.
src/index/IndexImpl.Text.cpp
Outdated
std::ranges::copy(TextIndexReadWrite::readFreqComprList<Id, Score>(
                      tbmd._cl._nofElements, tbmd._cl._startScorelist,
                      static_cast<size_t>(tbmd._cl._lastByte + 1 -
                                          tbmd._cl._startScorelist),
                      textIndexFile_, &Id::makeFromInt),
                  idTable.getColumn(2).begin());
What do you think of the following (index-breaking) suggestion, which possibly makes this code simpler
(maybe we can postpone it to another PR if it would stall your work here):
- We consistently compress and store the bits of the ID directly (as they are also consecutive for positive integers, the gap encoding and frequency encoding should still work). This gets rid of all the `Id::makeFromBlaIndex(BlaIndex::make(...))` calls in the `transform` and `copy` calls.
Again, after some thought: please keep this idea in mind, but it is probably something for a future change, as it is rather intrusive.
This would work fine in theory, but there is one slight problem, which has to do with the simple8b encoding that is applied after the gap encoding. If we gap-encode IDs, the first element is the starting ID itself, stored without any encoding. Because IDs use their first few bits to determine what type of ID they are, there will be a one somewhere in the first 4 bits of that ID. This becomes a problem for simple8b, since it only works for uint64_t values whose first 4 bits are 0.
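The concern can be made concrete with a small sketch. It assumes the usual Simple8b layout of 4 selector bits plus 60 payload bits per 64-bit word, and a purely hypothetical ID layout in which the top 4 bits carry a type tag. The names `gapEncode`, `fitsIntoSimple8b`, and `withTypeTag` are illustrative and are not taken from the code under review.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simple8b stores 4 selector bits and 60 payload bits per 64-bit word,
// so every value to be packed must be smaller than 2^60.
constexpr uint64_t kSimple8bMaxValue = (uint64_t{1} << 60) - 1;

// Gap-encode an ascending list: the first element is kept verbatim,
// every later element is replaced by its difference to the predecessor.
std::vector<uint64_t> gapEncode(const std::vector<uint64_t>& sorted) {
  std::vector<uint64_t> gaps;
  gaps.reserve(sorted.size());
  uint64_t previous = 0;
  for (uint64_t value : sorted) {
    assert(value >= previous);  // the input must be sorted ascending
    gaps.push_back(value - previous);
    previous = value;
  }
  return gaps;
}

// True if every value leaves the top 4 bits free for the selector.
bool fitsIntoSimple8b(const std::vector<uint64_t>& values) {
  for (uint64_t value : values) {
    if (value > kSimple8bMaxValue) return false;
  }
  return true;
}

// Hypothetical ID layout (an assumption for illustration only):
// the top 4 bits hold a type tag, the lower 60 bits hold the payload.
constexpr uint64_t withTypeTag(uint64_t tag, uint64_t payload) {
  return (tag << 60) | payload;
}
```

With plain integers `{100, 105, 110}` the gaps are `{100, 5, 5}` and all of them fit. With tagged IDs the later gaps are still small, because the tags cancel out in the subtraction, but the first element keeps its tag bits and no longer fits. One conceivable workaround, not discussed in the thread, would be to store the first element separately.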
std::vector<uint64_t> textRecordList(firstElements.begin(),
                                     firstElements.end());
std::vector<WordIndex> wordIndexList(secondElements.begin(),
                                     secondElements.end());
std::vector<Score> scoreList(thirdElements.begin(), thirdElements.end());

GapEncode<uint64_t> textRecordEncoder(textRecordList);
FrequencyEncode<WordIndex> wordIndexEncoder(wordIndexList);
FrequencyEncode<Score> scoreEncoder(scoreList);
Do these really need a `vector`, or can we make them work with the lazy `views` directly? (I will see once I get there.)
    off_t& currentOffset) {
  TextIndexReadWrite::writeVectorAndMoveOffset(encodedVector_, nofElements, out,
                                               currentOffset);
}
Can you point out to me (via comments) the places in this file where you have changed anything beyond just copying and extracting the code here?
The reason is that a lot of the code here needs modernization, but I would prefer to do the extraction quickly first and then modernize in a separate step.
See comments of 09d4a97
explicit GapEncode(const TypedVector& vectorToEncode);

void writeToFile(ad_utility::File& out, size_t nofElements,
                 off_t& currentOffset);
- I think you can get away with a lazy view as the input to the constructor.
- Would the code in `IndexImpl.cpp` become simpler if you make a static function that does the encoding + writing in one step (same for the other encoders)?
(This is also maybe something for a separate PR. Your work here is valuable: now that the code has been moved to a separate file, it has a size where we can see the possible improvements much more easily.)
Conformance check passed ✅ No test result changes.
Quality Gate passed
This PR further cleans up the IndexImpl.Text file while also improving the functionality of the frequency and gap encoding. This extends to the possibility of better compressing and storing floats or doubles.
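For readers unfamiliar with the term, here is a rough sketch of frequency encoding on a list of scores. It is not the implementation in this PR: `FrequencyEncodedSketch` and `frequencyEncode` are hypothetical names, and the sketch assumes that values are replaced by small integer codes, with the most frequent value receiving code 0. Because the codebook stores the original values, the same scheme applies to floats or doubles as long as they can be compared (NaN values are not handled here).

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

template <typename T>
struct FrequencyEncodedSketch {
  std::vector<T> codebook;      // codebook[code] is the original value
  std::vector<uint64_t> codes;  // one code per input element
};

template <typename T>
FrequencyEncodedSketch<T> frequencyEncode(const std::vector<T>& input) {
  // Count how often each distinct value occurs.
  std::map<T, uint64_t> counts;
  for (const T& value : input) ++counts[value];

  // Order the distinct values by descending frequency. The stable sort keeps
  // ties in ascending value order, which makes the result deterministic.
  std::vector<std::pair<T, uint64_t>> byFrequency(counts.begin(), counts.end());
  std::stable_sort(
      byFrequency.begin(), byFrequency.end(),
      [](const auto& a, const auto& b) { return a.second > b.second; });

  // Assign codes 0, 1, 2, ... in that order and translate the input.
  FrequencyEncodedSketch<T> result;
  std::map<T, uint64_t> codeOf;
  for (const auto& entry : byFrequency) {
    codeOf[entry.first] = result.codebook.size();
    result.codebook.push_back(entry.first);
  }
  result.codes.reserve(input.size());
  for (const T& value : input) result.codes.push_back(codeOf.at(value));
  return result;
}
```

For the input `{0.5, 2.0, 0.5, 0.5, 2.0, 1.5}` the codebook becomes `{0.5, 2.0, 1.5}` and the codes become `{0, 1, 0, 0, 1, 2}`. The small codes are then a good fit for a bit-packing scheme such as simple8b.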