Commit
Adds full OpenSearch engine support (#200)
Use OpenSearch with all the notebooks by running: docker compose up opensearch

* - 4.7 - listing organization
  - skg.generate_request => skg.transform_request
  - Fixes Ch6 bugs
  - 15.18 listing organization
* - Applies listing print standards to all chapters
  - Ch3. Updates print statements, organizes variable positioning
  - rerank_quantity => rerank_count
  - What happened to 10's feature calculator?
  - Finish rerank_query refactor
  - Chapter 11: Updates formatting of displaying data
  - SKG: {!${defType} ...}
* SKG debug fix
* Rework ch3 tf-idf calculations; sync with manuscript
* Ch4, 5, 6: Listings and results synchronized
* Ch7 and Ch8: Listings synced
* Ch9 mostly consistent
* Syncs listing 8.1
* Ch10, Ch11 sync'd with manuscript and signed off
* Ch8 sync, ch10 n => k
* ch9: Correct, Deterministic, sync'd with manuscript, signing off
* - ch12, 13, 14, and 15 finalized and synced with manuscript
  Ch12 still has minor numeric variance.
  Ch14 still uses pre-computed model
* quantization WIP
* Intermediate Ch13 Quantization Faiss impl
* Further improvements. Quantization code still needs cleanup and Listing 13.26 implemented
* Another checkpoint. 3/5 working (PQ and engine left). Code 50% cleaned
* Temp state. Just for reference. Mid refactor
* Listing 13.21 int8 scalar quantization done
* Another checkpoint
* Adds stubbed rerank recall method
* Small refactors as I'm working on manuscript listings
* Another checkpoint.
* Fix index writing error w/ PQ, increase M to get PQ recall up
* Checkpoint on evaluate rerank
* Checkpoint. Only one bug in calculate_recall(). Picking up engine search in the morning
* Cleanup quantization code (minus last engine example) + rename notebook
* Include SolrEngine
* Reorganize to match manuscript listing flow
* Refactor and cleanup quantization code for listings
* Quantization
* Revert back from quantization_type to quantization_size
  Reducing scope. Only scalar and binary will be supported, and both of them work fine with "size".
* Finalizes quantization listings
* Cleanup to sync with manuscript
* - All chapters verified, functional and consistent (See lingering issues)
  - Sets Jupyter to use environment home directory instead of notebook directory for execution
  - Product Display: Updates templates\search-results.html product search result html rendering to have better spacing and image sizes
  - Adds missing product images that emerged from removing "\N"
  - SolrCollection Vector Search: De-normalize returned score to be accurate with Cos/Dot similarity
  --------------------------------------------
  - Ch5: Parameterizes print_graph() to de-dup
  - Ch5: Remove old sections at end of SKG notebook
  - Ch6: Reduces listing complexity by extracting noisy print logic
  - Ch6: Tidies outputs and corrects organization of listings across cha
  - Ch6: Adds missing Listing 6.4 to notebook
  - Ch6: Refactors various spell check functions for specificity
  - Ch7: Further refactors SS functions
  - Ch7: Organizes and orders listings (and links/refs) from each notebook to be consistent with manuscript
  - Ch7: Fixes semantic function bug and get_enrichment NRE bug
  - Ch7: Adds Listing 7.14 Splade
  - Ch8: Minor formatting, typo correction and cell order
  - Ch9: Minor formatting, search result limit=10, and small refactorings
  - Ch10: Correctly uses LTR.enable_ltr (not engine.enable_ltr)
  - Ch11: Cell organization/labelling
  - Ch12: Locks numpy to 1.23.5 and adds Numpy seed for consistent session synthesis
  - Ch13: Cell organization, consistent result ranking, formatting, Fix 13.12 bug
  - Ch13: De-normalize vector scores from engine vector search
  - Ch13: Tidies cross-encoder code
  - Ch14: Minor formatting, path corrections
  ------------------------------------------
  Lingering issues:
  - Image path: Undo hack when image is rendered from HTML (not using Jupyter Home path, using Chapter path), Fix image renderings in Ch11
  - Ch7: Still needs fix for Iframe clearing 5 seconds after being rendered, verify semantic-search endpoint
  - Ch12: Still 2 squirrelly results. 99% consistent otherwise
  - Ch13: 13.3 Amendum failing (Has been for some time?)
* Finalizes cross encoder code
* Basic OS implementation
* OpenSearch Spell checking
* Vector search, schema improvements, spark
* UBI: UBI Plugin added to image, docker organization, opensearch configs, basic
* Generating Signals collection from UBI
* Unsaved
* Fixes signals boosting
* This is a checkpoint
* Make opensearch-docker-entrypoint.sh executable
* Fixes build.
* Fixes build and correctly integrates with ubi 1.0.0 for OS 2.14
* Docker file updates
* check
* check again
* check again
* check again again
* Fixes keystroke error that broke 9.18
* Check
* check
* Updates, organizes and cleans dep stack. Updates applied throughout the stack
* OpenSearch integration:
  - Abstract all references to "solr"
  - Fixes filter issue (with *)
  - Fixes outdoor schema
  - Partial cleaning imports
  - implements engine.name (never used in codebase, but mentioned in literature as an available property)
  ---
  Pending known issues: supplement query to account for edismax logic
* Intermediate checkpoint. Still fixing Ch9 and cleaning
* - Adds optional engine override param
  - Intermediate locking of Solr dependent chapters to Solr
  - Fixes various bugs regarding OpenSearch integrations
* Ch8 completely functional except index time boosting query side
* Chapter 8 finished for OpenSearch except with Spark Bulk Indexing failures in 8.8
* UBI Checkpoint: 3 pending issues listed in notebook
* Corrects two queries missing un-aliased variable references
* Corrects 8.8
* Fix index loader so 8.8 runs + improve 8.8 variables/revert line order
* Add back new line near end of listing 8.8
* OS LTR all features except model idempotency and explore candidates. Needs cleaning
* All LTR working except fuzzy/bigrams
* OpenSearch LTR complete
* Corrects ch 13, 14, 15. Improves 8.8
* - Implements OpenSearch Sparse Semantic Search completely semantic functions
  - Finalizes SemanticSearch abstractions and loading
  - OpenSearch code cleaned
* All chapters verified
* Corrects vector size for outdoors embedding collection
* - Removes all usages and importing of set_engine
  - Removes many other unused imports
  - Adds collection.get_engine_name() which is used in several multi-methods
  - Identified spark view from collection creation not using "_id" for ids. Issue does not currently affect codebase
  - Adds versions to 4 un-versioned dependencies
  - removes set_engine example from welcome (and fixes bad english)
  - Fixes health check messages
  - implements to_queries for semantic search query generation which got lost somewhere along the way
  Chapter fixes:
  - Ch4 Cleans up logs
  - Ch5 Removes hard coding of engine and adds dual-indexing
  - Ch6 Removes engine hardcoding, reruns notebooks with default engine
  - Ch7 Reruns notebooks with default engine
  - Ch8 removes hard coding
  - Ch9 Verifies OpenSearch functionality and reruns notebooks with default engine
  - Ch10 Reruns with default engine, removes extra logging
  - Ch12 Reruns with default engine, removes extra logging
  - Ch13 Verifies OpenSearch functionality, reruns with default search engine
  - Ch14 re-verified chapter with default search engine
* - Reruns 6.2 to have no logging output
  - Fixes id being returned from spark for OpenSearch
  - dual indexing for ch7
* Add link to engines/README.md

---------

Co-authored-by: Daniel Crouch <dcrouch26@users.noreply.github.com>
Co-authored-by: Trey Grainger <code@treygrainger.com>