Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Outsourcing and improving the posting writing of IndexImpl.Text.cpp #1699

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

Flixtastic
Copy link
Contributor

This PR is to further clean up the IndexImpl.Text file while also improving the functionality of the frequency and gap encoding. This extends to a possibilty to better compress and store floats or doubles.

Copy link

codecov bot commented Jan 4, 2025

Codecov Report

Attention: Patch coverage is 93.01310% with 16 lines in your changes missing coverage. Please review.

Project coverage is 89.86%. Comparing base (acb6633) to head (777329a).

Files with missing lines Patch % Lines
src/index/TextIndexReadWrite.cpp 91.52% 6 Missing and 4 partials ⚠️
src/index/TextIndexReadWrite.h 90.62% 2 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1699      +/-   ##
==========================================
- Coverage   89.86%   89.86%   -0.01%     
==========================================
  Files         389      391       +2     
  Lines       37308    37317       +9     
  Branches     4204     4203       -1     
==========================================
+ Hits        33527    33535       +8     
- Misses       2485     2487       +2     
+ Partials     1296     1295       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The most important thing is:
Please tell me in which places you have changed something, and where you only extracted, s.t. we can properly review the nontrivial changes.
I really like this idea, everything that makes the index class smaller is good.

Comment on lines 596 to 601
std::ranges::copy(TextIndexReadWrite::readFreqComprList<Id, Score>(
tbmd._cl._nofElements, tbmd._cl._startScorelist,
static_cast<size_t>(tbmd._cl._lastByte + 1 -
tbmd._cl._startScorelist),
textIndexFile_, &Id::makeFromInt),
idTable.getColumn(2).begin());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of the following (index-breaking) suggestion, which makes this code possibly simpler
(maybe we can postpone it to another PR, if this stalls your work here):

  1. We consistently directly compress and store the bits of the ID (as they are also consecutive for positive integers, the gap encoding and frequency encoding should still work). This gets rid of all the Id::makeFromBlaIndex(BlaIndex::make(...)) calls in the transform and copy calls.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again: After some thought please remember this idea, but probably this is for future changes, as it is rather intrusive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would theoretically work fine but there is one slight Problem. This problem has do to with simple8b encoding after gap encoding. If we try to gap encode IDs the first element will be the starting ID without any encoding. Because IDs use their first few bits to determine what type of ID they are there will be a one in the first 4 bits of the ID. This then becomes a problem in simple8b encoding, since it only works for uint64_t with the first 4 bits being 0.

Comment on lines +41 to +49
std::vector<uint64_t> textRecordList(firstElements.begin(),
firstElements.end());
std::vector<WordIndex> wordIndexList(secondElements.begin(),
secondElements.end());
std::vector<Score> scoreList(thirdElements.begin(), thirdElements.end());

GapEncode<uint64_t> textRecordEncoder(textRecordList);
FrequencyEncode<WordIndex> wordIndexEncoder(wordIndexList);
FrequencyEncode<Score> scoreEncoder(scoreList);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these really need a vector or can we make them work with the lazy views directly (I will see once I get there).

src/index/TextIndexReadWrite.cpp Outdated Show resolved Hide resolved
off_t& currentOffset) {
TextIndexReadWrite::writeVectorAndMoveOffset(encodedVector_, nofElements, out,
currentOffset);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you iin this file point out to me the places (via comments) where you have changed anything except for just copying and extracting it here?

The reason is, that a lot of code requires modernization here, but I would prefer to first quickly do the extraction, and then modernize in a separate step.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments of 09d4a97

Comment on lines +147 to +150
explicit GapEncode(const TypedVector& vectorToEncode);

void writeToFile(ad_utility::File& out, size_t nofElements,
off_t& currentOffset);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I think you can get away with a lazy view as the input to the constructor.
  2. Would the code in IndexImpl.cpp become simpler if you make a static function that does the encoding + writing in one step (same for the other encoders).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Also maybe part for a separate PR, your work here is valuable, by moving it to a separate file it now has a size where we can see the possible improvements much simpler.

src/index/TextIndexReadWrite.h Outdated Show resolved Hide resolved
@Flixtastic Flixtastic requested a review from joka921 January 9, 2025 19:38
@sparql-conformance
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants