Skip to content

Conversation

@tadzik
Copy link
Contributor

@tadzik tadzik commented Jul 8, 2020

Also fixes the build to use an up-to-date version of xapian/emscripten

@petersalomonsen
Copy link
Contributor

Recent emscripten versions doesn't include IDBFS by default, it has to be added by a compiler switch -lidbfs.js (https://emscripten.org/docs/api_reference/Filesystem-API.html#file-systems)

Also it's better now to install emscripten the "regular" way in the build-pipeline, instead of using docker. See how it's done here: https://github.com/petersalomonsen/wasm-git/blob/master/.travis.yml#L8

@tadzik
Copy link
Contributor Author

tadzik commented Jul 8, 2020

Recent emscripten versions doesn't include IDBFS by default, it has to be added by a compiler switch -lidbfs.js (https://emscripten.org/docs/api_reference/Filesystem-API.html#file-systems)

Ah, that changes everything! Thanks. I'll give this a shot – I didn't even know where to look for it.

@tadzik tadzik force-pushed the tadzik/new-folder-count-api branch 4 times, most recently from e50dd89 to 7a30702 Compare July 13, 2020 12:20
… again

Somewhen between June 2019 and now one of these components doesn't
produce a working wasm anymore (complains about missing IDBFS).
Preferably we'd track down exactly when and why this happens
(and fix it), but for now let's at least have a working build.
@tadzik tadzik force-pushed the tadzik/new-folder-count-api branch from 7a30702 to 5810da2 Compare July 13, 2020 12:33

try {
{
Xapian::Enquire enquire(dbc->db);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@petersalomonsen it feels like this should be doable with an mset.get_termfreq("XFseen") on the full query below here, but for some reason that always yields exactly the same number as the total number of messages. Anything obvious I'm missing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the delay @tadzik, just saw your message. From the docs it's not clear to me if termfreq returns number of occurrences in the resultset or the entire index, maybe @ojwb can tell.

Anyway it would be good if you created a test case in https://github.com/runbox/runbox-searchindex/blob/master/ts/xapian/xapian.test.ts so that it would be easier for others to play with this new function. I found it useful/easier with a test-driven approach when working on this library (there's also a watcher that pick up changes in the ts-files and re-run the tests).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the number of documents from the entire index that the term occurs in.

It should give exactly the same number as db.get_termfreq("XFseen") would, but if the term was in the query and contributed to the relevance weight then it returns the cached value rather than asking the database, which potentially saves having to read from disk.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know, thanks. Is there anything else I could use here to avoid doing two separate queries?

Copy link
Contributor

@petersalomonsen petersalomonsen Jul 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't think of any better solution right now. Seems like you have to query it twice.

Would it be an idea with a more generic method to get results count for any query ( and instead call it with and without the seen flag in the querystring from Javascript )? Then it could be used for other kinds of result counts too. Wouldn't increase performance, but give more reusability.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I finally got to clarifying the docs here: xapian/xapian@092d915

@tadzik tadzik changed the title Hardwire xapian and emscripten versions to make sure the index builds again Add a dedicated function for calculating folder counts Jul 13, 2020
@tadzik tadzik marked this pull request as ready for review July 21, 2020 13:50
@tadzik tadzik requested a review from petersalomonsen July 21, 2020 13:51
Copy link
Contributor

@petersalomonsen petersalomonsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to see a test for your new getFolderMessageCounts method. Also consider my other comment about a generic method for getting result set counts from any query.

And squash to one commit :-)

great to see you hacking with emscripten and xapian!

@tadzik
Copy link
Contributor Author

tadzik commented Jul 22, 2020

squash to one commit :-)

I can squash the maxsize fixup, but wouldn't the build process change be best kept separate, as it could've broken things independently of the new feature?

I would like to see a test for your new getFolderMessageCounts method. Also consider my other comment about a generic method for getting result set counts from any query.

The generic method could be useful potentially (so far we've been using sortedXapianQuery for this, but it proved unbearably slow for large folders) – here I wanted to make this as specialized as possible, since it's already proven to be a bit of a hotpath.

Testing is another story though, and not an easy one: tests currently rely on runbox7lib, which is built from runbox7 itself. runbox7 however depends on runbox-searchindex. So the C++ definition seemingly needs to be released first in runbox-searchindex, so that runbox7 can then update its depedendency version, build runbox7lib with the updated xapianapi, and only then the runbox7lib dependency can be used here to have the tests written for it – so seemingly the only way to break the circularity would be to reimplement part of xapianapi here for the sake of tests until runbox7lib actually gets it – did I get it right, or did I miss something?

Meanwhile, ngmakelib doesn't build runbox7lib correctly anymore, but that's another issue to bring up in another repo :)

@petersalomonsen
Copy link
Contributor

I can squash the maxsize fixup, but wouldn't the build process change be best kept separate, as it could've broken things independently of the new feature?

True, good to keep the build process change separate.

The generic method could be useful potentially (so far we've been using sortedXapianQuery for this, but it proved unbearably slow for large folders) – here I wanted to make this as specialized as possible, since it's already proven to be a bit of a hotpath.

yeah that sounds reasonable :-)

Testing is another story though, and not an easy one: tests currently rely on runbox7lib, which is built from runbox7 itself. runbox7 however depends on runbox-searchindex. So the C++ definition seemingly needs to be released first in runbox-searchindex, so that runbox7 can then update its depedendency version, build runbox7lib with the updated xapianapi, and only then the runbox7lib dependency can be used here to have the tests written for it – so seemingly the only way to break the circularity would be to reimplement part of xapianapi here for the sake of tests until runbox7lib actually gets it – did I get it right, or did I miss something?

yeah I see, I think the intention was to make that cleaner at some point. I think I would move the XapianAPI typescript definition to this library ( so that you don't have to import it from Runbox7 ). As for the other stuff imported from RB7 ( IndexingTools, MessageInfo, MailAddressInfo ), you could probably just copy these sources directly into here? These are just used for testing, and I would believe not so critical to have the same origin. Or move them in here, if it doesn't make to much of a maintenance problem for RB7.

Meanwhile, ngmakelib doesn't build runbox7lib correctly anymore, but that's another issue to bring up in another repo :)

The runbox7lib is built for the use on the server, but the most important part of it comes from here if I remember correct. So maybe consider making this the common library for server/client reuse.


string queryString = "folder:\"";
queryString += folderName;
queryString += "\"";
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to build a query string and parse it to create a Xapian::Query object, and it's both slower and introduces corner cases (for example, what if folderName contains the character "?)

You can create the Xapian::Query objects for the terms directly via the API, and compose them as you want - e.g.:

Xapian::Query query("XFOLDER:" + folderName);
query &= ~Xapian::Query("XFseen");

Or you can explicitly construct a parent object specifying OP_AND_NOT if you don't like operator overloading.

);

enquire.set_query(query);
Xapian::MSet mset = enquire.get_mset(0, UINT_MAX);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This results in Xapian having to return a result set containing every single document matching the query, but you only want to know the number of entries there would be in that set, not what any of them actually are.

You can tell Xapian you are only interested in the number of results by specifying 0 for maxitems and then a large number for checkatleast instead:

Xapian::MSet mset = enquire.get_mset(0, 0, UINT_MAX);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That gives me a .size() of 0, but the numbers from .get_matches_estimated() seem to me matching reality (I don't have a big enough static dataset to verify this on, unfortunately) – can I reasonably assume that with a big enough checkatlest the estimated numbers will be equal to the current state of things?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry - yes, you also need to look at .get_matches_estimated() instead, and the estimate should be exact in this case since we're forcing the matcher to consider all possibly matching documents.

@tadzik
Copy link
Contributor Author

tadzik commented Aug 6, 2020

The runbox7lib is built for the use on the server, but the most important part of it comes from here if I remember correct. So maybe consider making this the common library for server/client reuse.

I've done so now. runbox7lib contents are being removed in https://github.com/runbox/runbox7/pull/714/files and moved here with this PR. I've made changes to the build process to make sure dist/ ends up containing everything necessary. Runbox7 builds and works when switched over to this (upcoming) 0.2.0 :)

I've also updated the existing tests to move away from the deprecated mocha-typescript – I now have all the tools required to actually write tests for the new API :P

tadzik added a commit that referenced this pull request Aug 7, 2020
This should make sure that Xapian doesn't actually allocate the results
themselves, but only calculate their amount.

Thanks to ojwb in #4 (comment)
tadzik added a commit that referenced this pull request Aug 7, 2020
Should give us a bit of a speedup and avoid bugs in some edge cases.

Thanks to ojwb (#4 (review))
@tadzik
Copy link
Contributor Author

tadzik commented Aug 7, 2020

@petersalomonsen I think it has it all now :)

I moved optimizations suggested by @ojwb to a separate PR here: #5. It seems like they actually slow things down (for a folder size of 1000 like in the tests), so some further digging is needed there.

Copy link
Contributor

@petersalomonsen petersalomonsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good @tadzik

@gtandersen
Copy link

Excellent @tadzik, and thanks for the help @petersalomonsen and @ojwb.

tadzik added 2 commits August 12, 2020 18:18
From now on, this package will be the home of all the relevant modules,
hopefully making the maintenance easier.

The tests have also been migrated to @testdeck/mocha, which is
apparently a successor to mocha-typescript.
@tadzik tadzik force-pushed the tadzik/new-folder-count-api branch from 8f308a1 to f6b3361 Compare August 12, 2020 16:18
@tadzik
Copy link
Contributor Author

tadzik commented Aug 12, 2020

Forcepushed to clean up history a bit, I think that's as good as it gets. Further optimizations will (hopefully) follow separately.

@tadzik tadzik merged commit 83a0499 into master Aug 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants