-
Notifications
You must be signed in to change notification settings - Fork 3
Add a dedicated function for calculating folder counts #4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Recent emscripten versions doesn't include IDBFS by default, it has to be added by a compiler switch Also it's better now to install emscripten the "regular" way in the build-pipeline, instead of using docker. See how it's done here: https://github.com/petersalomonsen/wasm-git/blob/master/.travis.yml#L8 |
Ah, that changes everything! Thanks. I'll give this a shot – I didn't even know where to look for it. |
e50dd89 to
7a30702
Compare
… again Somewhen between June 2019 and now one of these components doesn't produce a working wasm anymore (complains about missing IDBFS). Preferably we'd track down exactly when and why this happens (and fix it), but for now let's at least have a working build.
7a30702 to
5810da2
Compare
|
|
||
| try { | ||
| { | ||
| Xapian::Enquire enquire(dbc->db); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@petersalomonsen it feels like this should be doable with an mset.get_termfreq("XFseen") on the full query below here, but for some reason that always yields exactly the same number as the total number of messages. Anything obvious I'm missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry for the delay @tadzik, just saw your message. From the docs it's not clear to me if termfreq returns number of occurrences in the resultset or the entire index, maybe @ojwb can tell.
Anyway it would be good if you created a test case in https://github.com/runbox/runbox-searchindex/blob/master/ts/xapian/xapian.test.ts so that it would be easier for others to play with this new function. I found it useful/easier with a test-driven approach when working on this library (there's also a watcher that pick up changes in the ts-files and re-run the tests).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the number of documents from the entire index that the term occurs in.
It should give exactly the same number as db.get_termfreq("XFseen") would, but if the term was in the query and contributed to the relevance weight then it returns the cached value rather than asking the database, which potentially saves having to read from disk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to know, thanks. Is there anything else I could use here to avoid doing two separate queries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't think of any better solution right now. Seems like you have to query it twice.
Would it be an idea with a more generic method to get results count for any query ( and instead call it with and without the seen flag in the querystring from Javascript )? Then it could be used for other kinds of result counts too. Wouldn't increase performance, but give more reusability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I finally got to clarifying the docs here: xapian/xapian@092d915
petersalomonsen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to see a test for your new getFolderMessageCounts method. Also consider my other comment about a generic method for getting result set counts from any query.
And squash to one commit :-)
great to see you hacking with emscripten and xapian!
I can squash the maxsize fixup, but wouldn't the build process change be best kept separate, as it could've broken things independently of the new feature?
The generic method could be useful potentially (so far we've been using sortedXapianQuery for this, but it proved unbearably slow for large folders) – here I wanted to make this as specialized as possible, since it's already proven to be a bit of a hotpath. Testing is another story though, and not an easy one: tests currently rely on runbox7lib, which is built from runbox7 itself. runbox7 however depends on runbox-searchindex. So the C++ definition seemingly needs to be released first in runbox-searchindex, so that runbox7 can then update its depedendency version, build runbox7lib with the updated xapianapi, and only then the runbox7lib dependency can be used here to have the tests written for it – so seemingly the only way to break the circularity would be to reimplement part of xapianapi here for the sake of tests until runbox7lib actually gets it – did I get it right, or did I miss something? Meanwhile, ngmakelib doesn't build runbox7lib correctly anymore, but that's another issue to bring up in another repo :) |
True, good to keep the build process change separate.
yeah that sounds reasonable :-)
yeah I see, I think the intention was to make that cleaner at some point. I think I would move the XapianAPI typescript definition to this library ( so that you don't have to import it from Runbox7 ). As for the other stuff imported from RB7 ( IndexingTools, MessageInfo, MailAddressInfo ), you could probably just copy these sources directly into here? These are just used for testing, and I would believe not so critical to have the same origin. Or move them in here, if it doesn't make to much of a maintenance problem for RB7.
The runbox7lib is built for the use on the server, but the most important part of it comes from here if I remember correct. So maybe consider making this the common library for server/client reuse. |
|
|
||
| string queryString = "folder:\""; | ||
| queryString += folderName; | ||
| queryString += "\""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to build a query string and parse it to create a Xapian::Query object, and it's both slower and introduces corner cases (for example, what if folderName contains the character "?)
You can create the Xapian::Query objects for the terms directly via the API, and compose them as you want - e.g.:
Xapian::Query query("XFOLDER:" + folderName);
query &= ~Xapian::Query("XFseen");
Or you can explicitly construct a parent object specifying OP_AND_NOT if you don't like operator overloading.
| ); | ||
|
|
||
| enquire.set_query(query); | ||
| Xapian::MSet mset = enquire.get_mset(0, UINT_MAX); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This results in Xapian having to return a result set containing every single document matching the query, but you only want to know the number of entries there would be in that set, not what any of them actually are.
You can tell Xapian you are only interested in the number of results by specifying 0 for maxitems and then a large number for checkatleast instead:
Xapian::MSet mset = enquire.get_mset(0, 0, UINT_MAX);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That gives me a .size() of 0, but the numbers from .get_matches_estimated() seem to me matching reality (I don't have a big enough static dataset to verify this on, unfortunately) – can I reasonably assume that with a big enough checkatlest the estimated numbers will be equal to the current state of things?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh sorry - yes, you also need to look at .get_matches_estimated() instead, and the estimate should be exact in this case since we're forcing the matcher to consider all possibly matching documents.
I've done so now. runbox7lib contents are being removed in https://github.com/runbox/runbox7/pull/714/files and moved here with this PR. I've made changes to the build process to make sure I've also updated the existing tests to move away from the deprecated mocha-typescript – I now have all the tools required to actually write tests for the new API :P |
This should make sure that Xapian doesn't actually allocate the results themselves, but only calculate their amount. Thanks to ojwb in #4 (comment)
Should give us a bit of a speedup and avoid bugs in some edge cases. Thanks to ojwb (#4 (review))
|
@petersalomonsen I think it has it all now :) I moved optimizations suggested by @ojwb to a separate PR here: #5. It seems like they actually slow things down (for a folder size of 1000 like in the tests), so some further digging is needed there. |
petersalomonsen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good @tadzik
|
Excellent @tadzik, and thanks for the help @petersalomonsen and @ojwb. |
From now on, this package will be the home of all the relevant modules, hopefully making the maintenance easier. The tests have also been migrated to @testdeck/mocha, which is apparently a successor to mocha-typescript.
8f308a1 to
f6b3361
Compare
|
Forcepushed to clean up history a bit, I think that's as good as it gets. Further optimizations will (hopefully) follow separately. |
Also fixes the build to use an up-to-date version of xapian/emscripten