From 52d0fc59a34583339f835e3ac60469feb8eed95b Mon Sep 17 00:00:00 2001
From: Bill Dueber
Date: Thu, 26 Sep 2024 13:23:55 -0400
Subject: [PATCH 1/4] Start working on docs

---
 README.md   | 117 +++++++++++++++++++---------------------------------
 compose.yml |   9 ----
 2 files changed, 43 insertions(+), 83 deletions(-)

diff --git a/README.md b/README.md
index 6a289b7..5bbcda4 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,47 @@
 # Dromedary -- Middle English Dictionary Application

-A new discovery system for the Middle English Dictionary.
+A new(-ish, these days) discovery system for the Middle English Dictionary.
+
+## Repositories
+
+* **Public repository for the app**: https://github.com/mlibrary/dromedary
+* **Private repository for the argo build**: https://github.com/mlibrary/middle-english-argocd
+
+If you need access to the private repo, get in touch with A&E.
+
+## Set up your development environment
+
+The development environment is set up to use `docker compose` to manage
+the rails application, solr, and zookeeper (used to manage solr).
+
+To build and start the application:
+
+```shell
+docker compose build
+docker compose up -d
+```
+
+### Test access to the application and solr
+
+* **Error page**: http://localhost:3000/. Don't let that confuse you.
+* **Splash page**: http://localhost:3000/m/middle-english-dictionary.
+* **Solr admin**:
+  * **url**: http://localhost:9172
+  * **username**: solr
+  * **password**: SolrRocks
+
+**NOTE** At this point you can't do any searches, because there's no data in the
+solr yet.
+
+### Indexing a file locally
+
+The local
+
+
+
+
+# OLD STUFF

 * [Indexing new data](docs/indexing.md), when new data is made available.
@@ -52,70 +93,8 @@ docker-compose build --build-arg ARCH=arm64 solr
```shell
docker-compose up -d
```
NOTES
* The ***sidekiq*** container will exit because we have yet to do a ***bundle install!***.
> ### Install bundler
> ```shell
> docker-compose exec -- gem install 'bundler:~>2.2.21'
> RUN bundle config --local build.sassc --disable-march-tune-native
> ```
> This was moved into the Dockerfile so it is no longer necessary, but it is left here as a reminder so it will not be forgotten.
>
> Need to revisit why setting the bundler version is necessary in the first place!
### Configure bundler
```shell
docker-compose exec -- app bundle config set rubygems.pkg.github.com <USERNAME>:<TOKEN>
```
The above command creates the following file: ./bundle/config
```yaml
---
BUNDLE_RUBYGEMS__PKG__GITHUB__COM: "<USERNAME>:<TOKEN>"
```
NOTES
* A [personal access token (classic)](https://github.com/settings/tokens) is a token you have generated that can be used to access the [GitHub API](https://docs.github.com/en); it needs the read:packages scope.
* Replace `<USERNAME>:<TOKEN>` with your GitHub username and personal access token.
### Bundle install
```shell
docker-compose exec -- app bundle install
```
NOTES
* The environment variable **BUNDLE_PATH** is set to **/var/opt/app/gems** in the **Dockerfile**.
### Yarn install
```shell
docker-compose exec -- app yarn install
```
NOTES
* Investigate using a volume for the **node_modules** directory like we do for **gems**
### Setup databases
```shell
docker-compose exec -- app bundle exec rails db:setup
```
If you need to recreate the databases, run db:drop and then db:setup.
```shell
docker-compose exec -- app bundle exec rails db:drop
docker-compose exec -- app bundle exec rails db:setup
```
NOTES
* Names of the databases are defined in **./config/database.yml** (a sketch follows below)
* The environment variable **DATABASE_URL** takes precedence over configured values.
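For orientation, here's a minimal sketch of what a matching `development` entry in
**./config/database.yml** could look like. The host, port, username, and password below
mirror the `db` service defined in `compose.yml`; the database name is illustrative,
not copied from the repo:

```yaml
# Sketch only -- see ./config/database.yml for the real values.
development:
  adapter: postgresql
  host: db                         # the compose service name
  port: 5432
  username: postgres
  password: postgres               # POSTGRES_PASSWORD from compose.yml
  database: dromedary_development  # illustrative name
```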
### Create solr collections
```shell
docker-compose exec -- solr solr create_collection -d dromedary -c dromedary-development
docker-compose exec -- solr solr create_collection -d dromedary -c dromedary-test
```
If you need to recreate a collection, run delete and then create_collection (e.g., for dromedary-test):
```shell
docker-compose exec -- solr solr delete -c dromedary-test
docker-compose exec -- solr solr create_collection -d dromedary -c dromedary-test
```
NOTES
* Names of the solr collections are defined in the **./config/blacklight.yml** file.
* The environment variable **SOLR_URL** takes precedence over configured values.
-### Start development rails server
-```shell
-docker-compose exec -- app bundle exec rails s -b 0.0.0.0
-```
+
 Verify the application is running at http://localhost:3000/

## Bring it all down then back up again
```shell
docker-compose down
```
```shell
docker-compose up -d
```
-```shell
-docker-compose exec -- app bundle exec rails s -b 0.0.0.0
-```
-The gems, database, and solr redis use volumes to persit between the ups and downs of development.
-When things get flakey you have the option to simply delete any or all volumes after you bring it all down.
-If you remove all volumes just repeat the [Development quick start](#development-quick-start), otherwise
-you'll need to run the appropriate steps depending on which volumes you deleted:
-* For gems run the [Bundle install](#bundle-install) step.
-* For database run the [Setup databases](#setup-databases) step.
-* For solr run the [Create solr collections](#create-solr-collections) step.
diff --git a/compose.yml b/compose.yml
index 7920d83..2a8bccd 100644
--- a/compose.yml
+++ b/compose.yml
@@ -43,15 +43,6 @@ services:
       - -b
       - 0.0.0.0

-  db:
-    image: postgres:12-alpine
-    ports:
-      - "5432:5432"
-    environment:
-      - POSTGRES_PASSWORD=postgres
-      - PGDATA=/var/lib/postgresql/data/db
-    volumes:
-      - db:/var/lib/postgresql/data

   solr:
     build: solr/.

From fc3efa354978090058bba0dde002fcabdbd935d2 Mon Sep 17 00:00:00 2001
From: Bill Dueber
Date: Thu, 26 Sep 2024 13:50:19 -0400
Subject: [PATCH 2/4] Mre

---
 README.md        |  15 ++++++-
 docs/indexing.md | 115 +----------------------------------------------
 2 files changed, 14 insertions(+), 116 deletions(-)

diff --git a/README.md b/README.md
index 5bbcda4..9eae04f 100644
--- a/README.md
+++ b/README.md
@@ -23,7 +23,6 @@ docker compose up -d

 ### Test access to the application and solr

-
 * **Error page**: http://localhost:3000/. Don't let that confuse you.
 * **Splash page**: http://localhost:3000/m/middle-english-dictionary.
 * **Solr admin**:
   * **url**: http://localhost:9172
   * **username**: solr
   * **password**: SolrRocks

 **NOTE** At this point you can't do any searches, because there's no data in the
 solr yet.

+
 ### Indexing a file locally

-The local
+NOTE: You can't index a file locally through the administration interface -- that's
+hooked directly to an AWS bucket, and won't affect your local install at all
+(it'll replace the `preview` solr data).
+
+* Make sure
+
+```shell
+docker compose run app -- bin/index_new_file.rb /<path-to-file>.zip
+```
+
+Give it however long it takes (a couple minutes for a minimal file,
+and around an hour for a full file).
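+
+While it runs, you can follow along in the rails container logs (a quick sketch;
+`app` is the service name from `compose.yml`):
+
+```shell
+docker compose logs -f app
+```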
+You'll know it's done when the # OLD STUFF diff --git a/docs/indexing.md b/docs/indexing.md index 8995f45..d3f6294 100644 --- a/docs/indexing.md +++ b/docs/indexing.md @@ -1,115 +1,2 @@ -# Indexing (new) MED data - -Note on permissions: Updating the data requires: - * An account on the dev machine (nectar) that is a member of the group 'dromedary-staging' on the dev machine (to -put `In_progress_MEC_files.zip` up there) - * Permission to deploy dromedary via `moku` - -Contact A&E if you need either. - -## Kubernetes - -You must be behind the LIT firewall to build the data image. -`$ docker build -t ghcr.io/mlibrary/dromedary/dromedary_data:latest -f data.Dockerfile .` -If the package is private (currently the case) then the pod will need an [ImagePullSecret](https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/) - - -## Synopsis - -* Download content of [the box folder](https://umich.box.com/s/ah2imm5webu32to343p2n6xur828zi5w), -which creates `In_progress_MEC_files.zip`. Copy that file to the dev server (currently -nectar) in the directory `/hydra-dev/dromedary-data/build` - -From any machine behind the LIT firewall, run (cut/paste) the following commands for -the instance you're targeting. - -* Prepare the data for indexing (always done on staging, doesn't actually affect anything right away): - * `ssh deployhost exec dromedary-staging "bin/dromedary newdata prepare /hydra-dev/dromedary-data/build/In_progress_MEC_files.zip"` -* Actually index the data, which takes the relevant site out of commission for a while (30mn?) - * Testing: `ssh deployhost exec dromedary-testing "bin/dromedary newdata index"` - * Staging: `ssh deployhost exec dromedary-staging "bin/dromedary newdata index"` - * Production: `ssh deployhost exec dromedary-production "bin/dromedary newdata index"` - -Anyone deploying must have been given permission to do so from A&E. - -## Overview - -Deployment of both code and data is done from a unix-like command line. This can -be in a terminal window on you Macintosh or from any server we use in the -library (e.g., malt). - -To deploy new data, there are just a few steps: - -* [Deploy new _code_](docs/deploying.md) if need be (almost certainly not) -* Get the new data as a .zip file and upload it to the development machine -* _Prepare_ the new data for indexing -* _Index_ the new data - -## 1. Get the new data as a zip file and put it on the dev server - -[2019-12-16 the dev server is "nectar"] - -When new data is ready, Paul will announce that it's ready to go -and in Box. - -* Got to [the box folder](https://umich.box.com/s/ah2imm5webu32to343p2n6xur828zi5w) -where we keep everything and click on the "Download" button. This will put a -file called "In_progress_MEC_files.zip" wherever things download for you. -* Copy that file to the development box. Right now the dev server is `nectar`, -but if things change just substitute the new machine. - -You now have the new files on your desktop/laptop. We need to get them -to the dev server. - -_If you have a program you use to upload/download from servers_ go ahead -and use it. If not, the shell command on a Mac would be: - -```bash -scp ~/Downloads/In_progress_MEC_files.zip nectar:/hydra-dev/dromedary-data/build -``` - -If this doesn't make sense or you get errors (maybe you're not allowed to -deploy dromedary?) ask A&E for help. - -## Prepare the new data for later indexing (do this *once*) - -We need to turn the raw data into something the program can use. 
This
-doesn't change anything running, it just makes new files on disk
-that can be used later to actually update a running instance.
-
-_This only needs to be done once_. All three instances
-(testing/staging/production) use the same prepared data for indexing,
-and we always run the prepare on staging because otherwise there
-are permissions errors.
-
-(You can just copy and paste this)
-
-```bash
-ssh deployhost exec dromedary-staging "bin/dromedary newdata prepare /hydra-dev/dromedary-data/build/In_progress_MEC_files.zip"
-```
-
-_This will take quite a while_! Like, upwards of an hour.
-
-Sadly, there's no way to get any feedback from the deployhost commands, so
-unless you see an error you just have to assume it's all
-going well.
-
-## Index the new data into a specific instance
-
-Now the data can be indexed into one of our three instanced: testing, staging,
-or production.
-
-_Each instance has it's own solr and must have its index created separately_!
-
-NOTE: While indexing is occurring, the instance in question will display a message stating that
-"The MED is temporarily unavailable". Once indexing is finished, everything
-will be back to normal.
-
- * Testing: `ssh deployhost exec dromedary-testing "bin/dromedary newdata index"`
- * Staging: `ssh deployhost exec dromedary-staging "bin/dromedary newdata index"`
- * Production: `ssh deployhost exec dromedary-production "bin/dromedary newdata index"`
-
-This doesn't take nearly as long as the prepare -- more like 30mn or so.
-
-

From d63a7e665128cd60fa1eda450ff976c6243ba071 Mon Sep 17 00:00:00 2001
From: Bill Dueber
Date: Thu, 26 Sep 2024 23:29:08 -0400
Subject: [PATCH 3/4] Finish a lot of the docs

---
 README.md                                     |  81 +++++-----
 app/controllers/catalog_controller.rb         |   7 +-
 compose.yml                                   |   9 ++
 docs/application_code.md                      |  52 +++++++
 docs/autocomplete_setup.md                    |  84 +++-------
 docs/configuration.md                         |  64 ++++++++
 docs/dromedary_executable.md                  |  87 -----------
 docs/indexing.md                              |  65 ++++++++
 docs/setting_up.md                            |  71 +++++++++
 ...tting_up_dev_environment_on_unix_or_mac.md | 147 ------------------
 docs/setting_up_with_k8s.md                   |  50 ------
 docs/solr.md                                  |   2 +
 lib/med_installer/indexing_steps.rb           |   5 -
 13 files changed, 339 insertions(+), 385 deletions(-)
 create mode 100644 docs/application_code.md
 create mode 100644 docs/configuration.md
 delete mode 100644 docs/dromedary_executable.md
 create mode 100644 docs/setting_up.md
 delete mode 100644 docs/setting_up_dev_environment_on_unix_or_mac.md
 delete mode 100644 docs/setting_up_with_k8s.md
 create mode 100644 docs/solr.md

diff --git a/README.md b/README.md
index 9eae04f..60af0bd 100644
--- a/README.md
+++ b/README.md
@@ -2,54 +2,59 @@

 A new(-ish, these days) discovery system for the Middle English Dictionary.

-## Repositories
+Confusingly, there are three separate repositories:
+  * [dromedary](https://github.com/mlibrary/dromedary), this repo, is the
+    **Rails application**. The name was given to the project
+    when someone decided we should start naming projects with nonsense words.
+  * [middle_english_dictionary](https://github.com/mlibrary/middle_english_dictionary) is
+    not, as one might expect, the Middle English Dictionary code. Instead,
+    it's the code that pulls out indexable data from each little
+    XML file, inserts things like links to the OED and DOE, and serves
+    as the basis for building solr documents.
+  * [middle-english-argocd](https://github.com/mlibrary/middle-english-argocd) (_private_) is the argocd setup which deals with environment
+    variables and secrets, and serves to push the application to production. It also
+    has a small-but-valid .zip file under the `sample_data` directory.
+
+## Documentation
+* [Setting up a development environment](docs/setting_up.md) runs through
+  how to get the docker-compose-based dev environment off the ground and
+  index some data into its local solr.
+* [Overview of indexing](docs/indexing.md) talks about what the indexing
+  process does, where the important files are, and what code might be
+  interesting.
+* [Configuration](docs/configuration.md) does a _very_ brief run through
+  the important ENV values. In general, the [compose.yml](compose.yml) file,
+  the argocd repository, and especially [lib/dromedary/services.rb](lib/dromedary/services.rb)
+  are the best places to see what values are available to change. _Don't do that
+  unless you know what you're doing, though_.
+* [Solr setup](docs/solr.md) looks at the interesting bits of the
+  solr configuration, in particular the suggesters (for autocomplete).
+* [Tour of the application code](docs/application_code.md) is a quick look at how
+  the MED differs from a stock Rails application.
+* [Deployment to production](docs/deployment.md) shows the steps for building the
+  correct image and getting it running on the production cluster, as well as
+  how to roll back if something went wrong.
+
+### Access links
+* **Public-facing application**: https://quod.lib.umich.edu/m/middle-english-dictionary/
+* **"Preview" application with exposed Admin panel**: https://preview.med.lib.umich.edu/m/middle-english-dictionary/admin
+
+### About upgrading
+
+This repo currently runs on Ruby 2.x and Blacklight 5.x, and there are no plans
+to upgrade either.
 
 
-### Indexing a file locally
 
-NOTE: You can't index a file locally through the administration interface -- that's 
-hooked directly to an AWS bucket, and won't affect your local install at all
-(it'll replace the `preview` solr data).
+
-* Make sure +
+
-```shell
-docker compose run app -- bin/index_new_file.rb /<path-to-file>.zip
-```
-
-Give it however long it takes (a couple minutes for a minimal file,
-and around an hour for a full file).
-You'll know it's done when the
 # OLD STUFF
@@ -107,6 +112,7 @@ docker-compose up -d

 Verify the application is running at http://localhost:3000/

+
 ## Bring it all down then back up again
 ```shell
 docker-compose down
@@ -114,3 +120,6 @@ docker-compose down
 ```shell
 docker-compose up -d
 ```
+
+Note that there's no data in it yet, so lots of actions will throw errors. It's time
+to index some data.
\ No newline at end of file
diff --git a/app/controllers/catalog_controller.rb b/app/controllers/catalog_controller.rb
index 777327e..f6bda9a 100644
--- a/app/controllers/catalog_controller.rb
+++ b/app/controllers/catalog_controller.rb
@@ -187,7 +187,12 @@ class CatalogController < ApplicationController
   # solr request handler? The one set in config[:default_solr_parameters][:qt],
   # since we aren't specifying it otherwise.

-  # config.add_search_field 'Keywords', label: 'Everything'
+
+  ######################### WHAT ARE THE DOLLAR-SIGN VARIABLES??? ############
+  # These are sent to solr as the actual string (e.g., solr gets "$everything_qf").
+  # They are then expanded within the solr process based on configuration
+  # files there. See, e.g., solr/dromedary/conf/solrconfig_med/entry_searches/everything_search.xml
+  # for an example.

   config.add_search_field("anywhere", label: "Entire entry") do |field|
     field.qt = "/search"
diff --git a/compose.yml b/compose.yml
index 2a8bccd..7920d83 100644
--- a/compose.yml
+++ b/compose.yml
@@ -43,6 +43,15 @@ services:
       - -b
       - 0.0.0.0

+  db:
+    image: postgres:12-alpine
+    ports:
+      - "5432:5432"
+    environment:
+      - POSTGRES_PASSWORD=postgres
+      - PGDATA=/var/lib/postgresql/data/db
+    volumes:
+      - db:/var/lib/postgresql/data

   solr:
     build: solr/.
diff --git a/docs/application_code.md b/docs/application_code.md
new file mode 100644
index 0000000..3c44b68
--- /dev/null
+++ b/docs/application_code.md
@@ -0,0 +1,52 @@
+# Tour of the application code
+
+In general, the MED is a "normal" Blacklight (v5) application, with a lot
+of added stuff.
+
+Like most Blacklight apps, the heart of the configuration is
+in the [Catalog Controller](../app/controllers/catalog_controller.rb),
+specifying the search and facet fields. This is repeated for the
+other controllers, and they're fairly straightforward so long as
+you're willing to take on faith that Rails will find things at the
+right time.
+
+
+## Models
+The files in [models](../app/models) are unusual for a Rails app in that we're not using
+a backing database, and thus not deriving them from ActiveRecord.
+The directory itself is empty of anything interesting.
+
+The _actual_ objects representing each of the many, many layers of a dictionary
+entry are defined in the [middle_english_dictionary](https://github.com/mlibrary/middle_english_dictionary)
+repository. The nomenclature can be confusing, since it's derived from
+the jargon of the field, but none of the objects are particularly complex.
+
+## Presenters
+
+The meat of the interface is actually built into the presenters. The
+[quotes presenter](../app/presenters/quotes/index_presenter.rb)
+is indicative of the setup, pulling in lots of XSLT and getting values
+out of the document with XSLT transforms or by querying the
+`Nokogiri::XML` document directly.
+
+## lib/med_installer
+
+...is all indexing code, and isn't used in the rest of the application.
See
+[indexing.md](indexing.md) for a brief overview.
+
+## /solr
+
+...has all the solr configuration in it, as well as the
+[Dockerfile](../solr/Dockerfile) and the [container initialization](../solr/container/solr_init.sh)
+code. See [solr.md](solr.md) for more.
+
+## Everything else
+
+...is basically just normal views and utility objects
+(most importantly the [services.rb](../lib/dromedary/services.rb) file).
+
+
+
+
+
diff --git a/docs/autocomplete_setup.md b/docs/autocomplete_setup.md
index ea520ff..a60c245 100644
--- a/docs/autocomplete_setup.md
+++ b/docs/autocomplete_setup.md
@@ -4,21 +4,17 @@ The blacklight autocomplete setup is pretty brittle and needs some mucking
 about with to get it to work with multiple different input boxes that
 target different solr autocomplete endpoints.

-## Three things that need doing
+## Three things that needed doing

 * Add a new autocomplete handler in the solr config
 * Make changes to `config/autocomplete.yml`
-* Add the option to the dropdown where the users determines what to search
-
-If you're adding a whole new autocomplete input field in the HTML,
-you'll also need to:
-* Create the input box with
-* Make changes to `autocomplete.js.erb`
-
+* Add javascript code to trigger autocomplete
+  to the dropdown where the user determines what to search

 ## Configure the new autocomplete handler in the solr config

-Suggesters live in `solr/med/conf/solrconfig_med/suggesters`. You
-can pattern a new one off of the ones in there.
+Suggesters live in
+[solr/dromedary/conf/solrconfig_med/suggesters](../solr/dromedary/conf/solrconfig_med/suggesters).
+You can pattern a new one off of the ones in there.

 Of course, if you're using an existing handler in a new context (say,
 you want autocomplete for headwords again in an advanced search box), you
 can just use the handler that's already been defined and skip ahead to
 making and configuring the new search input field and dropdown.

 Things to note:
+
+* You can pick any names for the suggester handler, but it
+  must be used _identically_ in three places:
+  * The top `<str name="name">`
+  * The name of the `suggest.dictionary`
+  * The reference in the `<arr name="components">` block.
 * The `field` is the name of the field you're basing this on. It *must*
-be a stored field!
+be a stored field! The code we have now builds a special field for this
+instead of trying to force an existing field to work.
 * `"suggestAnalyzerFieldType"` is probably the same fieldType used for the
 field you're indexing in this typeahead field, but if you have the know-how
 and think it should be different, go for it.

## Add new suggester to autocomplete in `config/autocomplete.yml`

```yaml
# Autocomplete setup.
# The format is:
# search_name:
#    solr_endpoint: path_to_solr_handler
#    search_component_name: mySuggester
#
# The search_name is the name given the search in the
# `config.add_search_field(name, ...)` in catalog_controller
#
# The "keyword" config mirrors the default blacklight setup

default: &default
  keyword:
    solr_endpoint: path_to_solr_handler,
    search_component_name: "mySuggester"
  h:
    solr_endpoint: headword_only_suggester
    search_component_name: headword_only_suggester
  hnf:
    solr_endpoint: headword_and_forms_suggester
    search_component_name: headword_and_forms_suggester
  oed:
    solr_endpoint: oed_suggester
    search_component_name: oed_suggester

development:
  <<: *default

test:
  <<: *default

production:
  <<: *default

```
+## Add new suggester to the autocomplete configuration
+Pattern match from another entry
+and add it to [autocomplete.yml](../config/autocomplete.yml)

 ## Catalog controller configuration

-Now load the autocomplete setup into your blacklight configuration.
-
 ```ruby

 # Autocomplete on multiple fields. See config/autocomplete.yml
 config.autocomplete = ActiveSupport::HashWithIndifferentAccess.new Rails.application.config_for(:autocomplete)
 ```

-## Adding a whole new dropdown
+## Adding a whole different search box
+
+I don't expect this will happen at this point, but the knowledge may
+well come in handy on other projects.
+
+### Adding a whole new dropdown
+
 * Put a data attribute `data-autocomplete-config` on your text box to
 reference which typeahead configuration should be used (e.g., `h` or
 `hnf` in the config example above).
the correct index when a user picks, e.g., "headword only."

## Side note: this overrides blacklight code

I don't actually monkey-patch, but I do use `Module#prepend`. The code
is in [autocomplete_override_code.rb](../config/initializers/autocomplete_override_code.rb).

-If Blacklight ever changes the autocomplete setup to allow this sort of
-thing, we'll need to re-evaluate whether these extensions are necessary.
diff --git a/docs/configuration.md b/docs/configuration.md
new file mode 100644
index 0000000..d649eb4
--- /dev/null
+++ b/docs/configuration.md
@@ -0,0 +1,64 @@
+# Configuration
+
+Configuration of this application has...grown organically. Thus, lots
+of things are spread out over a few locations.
+
+## ./config
+
+**The normal rails `config` directory** has all the "normal" stuff
+in it, including the [`routes.rb`](../config/routes.rb) file which can be inspected to
+understand what things are exposed.
+
+Additions to the normal rails/blacklight stuff include:
+  * [autocomplete.yml](../config/autocomplete.yml) configures the exposed
+    names and the solr targets for each autocomplete field's `suggester` handler.
+  * [autocomplete_override_code.rb](../config/initializers/autocomplete_override_code.rb),
+    which provides extra code (via `Module#prepend`) to deal with the fact
+    that we're running multiple suggesters.
+  * [load_local_config.rb](../config/load_local_config.rb) had so much
+    stripped away that it's now mostly utility code to reload the
+    `hyp_to_bibid` data from the current solr and extract data from
+    the name of the underlying collection.
+
+## Controllers
+
+The [CatalogController](../app/controllers/catalog_controller.rb) is the heart
+of any Blacklight application, specifying how to talk to solr, what fields to
+expose as searchable, etc.
We have controllers for all the different aspects
+of the site (e.g., bibliography, quotes), so make sure you're looking at
+the right one.
+
+## Mystery Solr Configuration in the controllers
+
+The solr configuration in the controllers includes variables that look like
+`$everything_qf` where normally you'd have a list of fields. The decision was
+made to actually store this configuration in _solr_, so sending that
+magic reference will cause solr (on its end) to use the values defined in
+the XML up there. See [headword_and_forms_search.xml](../solr/dromedary/conf/solrconfig_med/entry_searches/headword_and_forms_search.xml)
+for a representative sample.
+
+
+## The Dromedary::Services object
+
+The [services object](../lib/dromedary/services.rb) is really the heart of
+all the configuration. Every effort has been made to push everything
+through it, as opposed to directly using ENV variables and such.
+
+Instead of just being a passthrough for the environment, though, the
+Services object also includes useful things derived from those
+variables, including things like a connection object for the
+configured solr.
+
+Essentially everything you need to understand how the application is
+influenced "from the outside" (e.g., the argocd config) is in this file.
+
+## AnnoyingUtilities
+
+The [annoying_utilities.rb](../lib/annoying_utilities.rb) file once did
+all the little, annoying substitutions that were necessary to run the application
+on two relatively-different staging and production platforms. Now that
+it's all container-based, all of that code has been ripped out and
+replaced with thin wrappers around `Dromedary::Services`. Mentioned
+here because it's still in the code in some places and might be
+confusing.
+
diff --git a/docs/dromedary_executable.md b/docs/dromedary_executable.md
deleted file mode 100644
index 06c97c3..0000000
--- a/docs/dromedary_executable.md
+++ /dev/null
@@ -1,87 +0,0 @@
-# The `bin/dromedary` helper script
-
-The dromedary source has a script in `bin/dromedary` that can be used
-for all sorts of things by developers.
-
-Just running `bin/dromedary` gives you a list of the top-level commands,
-and running `bin/dromedary subcommand -h` will show you _that_ level
-of commands.
-
-There are only a few that are useful -- most of the rest are used to build
-up these.
-
-## NOTE: REMOTE EXEC
-
-Any `bin/dromedary` command can be run on one of the installed
-instances with the format:
-
-`bin/dromedary remote ""`.
-
-So, for example, to reload a solr core on staging, I'd do
-
-`bin/dromedary remote staging "solr reload"`
-
-This constructs an ssh string that calls through deployhost.
-
-## Deploying the code
-
-_Time to run_: 10mn? But doesn't matter much because nothing switches over
-until the whole deploy is done.
-
-* `bin/dromedary deploy [staging|training|production] [branch]` Runs a deploy. Exactly
-the same as using the ssh command.
-
-Note that if the solr config has changed, you'll also want to reload the core:
-
-* `bin/dromedary remote "solr reload"`
-
-## Maintenance mode
-
-_Time to run_: Nothingish.
-
-* `bin/dromedary maintenance_mode on`
-* `bin/dromedary maintenance_mode off`
-
-## Index and such
-
-_Time to run_: An hour or more for the prepare step; about half an hour
-for the reindex.
-
-* `bin/dromedary newdata prepare ` Extract data from the zipfile
-into the build directory and convert to files needed for indexing.
-* `bin/dromedary newdata index`. Index files from the build directory
-into solr.
-
-There are also finer-grained versions of these
-
-* `bin/dromedary extract ` Extract from the zipfile into the data dir
-* `bin/dromedary convert ` Convert raw xml to files we actually use
-* `bin/dromedary index copy_from_build` Copy built files from build dir to data dir for indexing
-* `bin/dromedary index entries` Index just the entries from `#{data_dir}/`.
-* `bin/dromedary index bib` Index just the biblio stuff
-* `bin/dromedary index full` Index both entries and bib with one command and then rebuild the suggesters.
-* `bin/dromedary index hyp_to_bibid`. Create the mapping from HYP ids (RID) to bib IDs
-
-
-## Solr
-
-_Time to run_: Totally depends on the state of the index.
-The `reload` is instantaneous, though, which is nice.
-
-All these will give more information if called with `-h`.
-```bash
-> bin/dromedary solr -h
-
-Commands:
-  dromedary solr commit              # Force solr to commit
-  dromedary solr empty               # Delete all documents in the solr
-  dromedary solr install             # Download and install solr to the given directory
-  dromedary solr optimize            # Optimize solr index
-  dromedary solr rebuild_suggesters  # Tell solr to rebuild all the suggester indexes
-  dromedary solr reload              # Tell solr to reload the solr config without restarting
-  dromedary solr shell               # Get a shell connected to solr, optionally with collections
-  dromedary solr start [RAILS_ENV]   # Start the solr referenced in .solr
-  dromedary solr stop [RAILS_ENV]    # ...or stop it
-  dromedary solr up                  # Check to see if solr is up
-
-```
diff --git a/docs/indexing.md b/docs/indexing.md
index d3f6294..d4c4cab 100644
--- a/docs/indexing.md
+++ b/docs/indexing.md
@@ -1,2 +1,67 @@
 # Overview of the indexing process

+Indexing the MED data is a little different than most of what we
+do in that it's not really _data_. What we have is 50k
+little _documents_, and trying to treat them as data lost
+about six weeks when we started this project.
+
+**The only source of truth for what happens during indexing** is
+the steps in the file [indexing_steps.rb](../lib/med_installer/indexing_steps.rb).
+At one point this was all driven by a CLI application, and there are still
+vestiges of it lying around (including some calls to the old CLI code
+in that file).
+
+Details about access to solr and such are pulled in through
+environment variables. See [configuration](configuration.md)
+for a few more details.
+
+The indexing process has a few steps:
+
+* **Unzip** the file. This assumes the structure it's always had,
+  with smaller zip files within it for each letter.
+* **Extract XML files** and for each one create:
+  * a `MiddleEnglishDictionary::Entry` object
+  * various bibliography-related objects (manuscript, stencil, etc.)
+  * a more useful mapping to the external dictionaries (Dictionary of Old English
+    and the Oxford English Dictionary)
+  * **Create the solr configset and collection** based on values
+    from `services.rb`. We make a new configset every time, even though
+    it hardly ever changes, because it's cheap and less confusing.
+  * **Index the solr documents** using `traject` and two rule sets:
+    * [main_indexing_rules](../indexer/main_indexing_rules.rb) does most of
+      the heavy lifting, both building indexes and stuffing the actual
+      XML snippets into the solr document for later display. It loads up the
+      [quote_indexer.rb](../indexer/quote/quote_indexer.rb) as well.
+    * [bib_indexing_rules](../indexer/bib_indexing_rules.rb) are, obviously,
+      more focused on the bibliographic data (author, manuscript, etc.).
+  * **Create the "hyperbib" mapping file**. The bibliographic bits used
+    to be called the "HyperBib" (back when "Hyper" meant "HyperMedia"). A
+    file mapping bibliographic entries to word entries is created
+    and stored in solr as a single unique record (with `type=hyp_to_bibid`).
+    It's read into memory when the application boots up or the alias
+    changes which underlying collection it's connected to.
+  * **Build the suggester indexes**. The MED has several `suggester` handlers
+    defined in the solr config which are used to provide field-specific
+    typeahead in the search box. These are "built" by sending Solr a
+    command to build them.
+  * **Move the `med-preview` alias**. Upon completion, the collection alias
+    `med-preview` will be swapped from the collection it was pointing at
+    before to the one we just created. The "release" is just doing the
+    same thing with the `med-production` alias.
+
+Again: the only code that runs during indexing is the stuff in or referenced
+by [indexing_steps.rb](../lib/med_installer/indexing_steps.rb). I've run the
+indexer under code coverage and culled dead code based on that, but there's no
+guarantee that there isn't some dead indexing code lying in wait for the
+unprepared.
+
+## XSLT Files
+
+The MED has been using XSLT to transform the little XML documents into something else
+since the beginning, and we leveraged that knowledge to develop this version.
+
+Essentially, the XSL is used to pull the data we want out of the XML files.
+An attempt was initially made to treat the XML as just a serialization of
+an underlying data model, but the structures vary wildly. Treating each
+file as a _document_ is the model under which they were created, and we
+eventually followed suit.
diff --git a/docs/setting_up.md b/docs/setting_up.md
new file mode 100644
index 0000000..01fdb1b
--- /dev/null
+++ b/docs/setting_up.md
@@ -0,0 +1,71 @@
+# Setting up a development environment
+
+The development environment is set up to use `docker compose` to manage
+the rails application, solr, and zookeeper (which is used to manage solr).
+
+## Requirements
+
+You'll need docker and docker-compose ready to run. Everything else should be
+taken care of inside the containers.
+
+To build and start the application:
+
+```shell
+docker compose build
+docker compose up -d
+```
+
+## Test access to the application and solr
+
+* **Not the home page**: http://localhost:3000/. Don't let that confuse you.
+* **Actual home page**: http://localhost:3000/m/middle-english-dictionary.
+* **MED admin**: http://localhost:3000/m/middle-english-dictionary/admin
+* **Solr admin**:
+  * **url**: http://localhost:9172 (you can change the port in the `compose.yml` file)
+  * **username**: solr
+  * **password**: SolrRocks
+
+**NOTE** At this point you can't do any searches, because there's no data in the
+solr yet.
+
+
+### Indexing a file locally
+
+NOTE: You _can't index a file locally through the administration interface_ -- that's
+hooked directly to an AWS bucket, and won't affect your local install at all
+(it'll replace the `med-preview` solr data in the production environment!).
+
+There is a very small file, useful for development,
+in the private [middle-english-argocd](https://github.com/mlibrary/middle-english-argocd)
+repository at `sample_data/MED_A_SMALL.zip`. It has about 150 records, up through a bunch
+of the ones that start with `ab`, which is enough to test indexing, typeahead, and
+searching.
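+
+Copy the zip somewhere the `app` container can see it first. A sketch, assuming your
+checkout is mounted into the container and you drop the file into a `data/` directory
+(the path the command below expects; adjust if your copy lives elsewhere):
+
+```shell
+# Run from the root of your dromedary checkout; the source path is
+# wherever you cloned middle-english-argocd.
+mkdir -p data
+cp ../middle-english-argocd/sample_data/MED_A_SMALL.zip data/
+```
+
+Then kick off the indexer: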
+```shell
+docker compose run app -- bin/index_new_file.rb data/MED_A_SMALL.zip
+```
+
+Give it however long it takes (a few minutes for the minimal file,
+and up to an hour for a full file).
+
+You'll know it's done when the [admin page](http://localhost:3000/m/middle-english-dictionary/admin)
+shows that the new collection is set and is aliased by `med-preview` -- or, of course,
+when the logs stop scrolling by.
+
+
+### Test the full application
+
+At this point, you should have working typeahead (well, for a few words that start with _ab_)
+and search capabilities.
+
+
+## Working on the application
+
+NOTE that the solr is not set up to be durable (i.e., every time you bring down solr,
+the data is lost). If you're just working on the app, you can bring just the app
+container up and down by itself, and leave solr/zookeeper running so as to not lose the index.
+
+```shell
+docker compose down app
+docker compose up app
+```
\ No newline at end of file
diff --git a/docs/setting_up_dev_environment_on_unix_or_mac.md b/docs/setting_up_dev_environment_on_unix_or_mac.md
deleted file mode 100644
index e0eea04..0000000
--- a/docs/setting_up_dev_environment_on_unix_or_mac.md
+++ /dev/null
@@ -1,147 +0,0 @@
-# Setting up a Dromedary Development Environment
-
-...on a unix-like (linux or macintosh)
-
-Dromedary has three basic parts:
-
-* The custom code, meaning
-  * the dromedary application (this repository) -- the rails application
-  * the middle_english_dictionary gem -- code to pull data out of the
-    XML files and turn it into useful ruby objects used by dromedary.
-* The data, which includes:
-  * the entries as individual XML files, which has entry and quote information
-  * the file `bib_all.xml` which holds all the bib information
-  * a `MED2OED_links.` mapping MED entries to OED entries (where `YYYYMM`
-    is the year/month of its creation)
-  * a `MED2DOE_links.` mapping MED entries to DOE entries (where `YYYYMM`
-    is the year/month of its creation)
-* A solr installation
-
-
-## 1. Setting up the application
-
-We assume you have `git`, `ruby`, and `java` set up. If you don't, and you're
-not sure how to proceed, find someone to help. Everyone at the library is
-very friendly :-)
-
-```bash
-# Choose a place to put everything
-export MEDDIR=~/devel/med
-
-# Make the directory if it doesn't exist and go there
-mkdir -p $MEDDIR
-MEDDIR=$(cd $MEDDIR; pwd) # make it absolute
-cd $MEDDIR
-
-
-# Clone the repository and get the gems
-# You must be on the lit network (via being at the office
-# or using the VPN) for the bundle install to work.
-
-git clone git@github.com:mlibrary/dromedary.git
-cd $MEDDIR/dromedary
-bundle install --path=.bundle
-
-# IF you need to mess with the middle_english_dictionary gem
-# because the structure of the XML files has changed,
-# get it, too.
-
-# (only if you need it)
-# git clone git@github.com:mlibrary/middle_english_dictionary.git
-
-```
-
-You now have the custom code, templates, etc. we use in the application.
-
-## 2. Setting up solr
-
-Unfortunately, there's no magic link that will get you the latest version of
-solr. You'll have to go to [the solr download page](http://lucene.apache.org/solr/downloads.html)
-to get a link to a solr `.tgz` distribution. We target at least solr 6.6.5.
-
-```bash
-export SOLR_URL=http://archive.apache.org/dist/lucene/solr/6.6.5/solr-6.6.5.tgz
-
-cd $MEDDIR
-curl $SOLR_URL | tar xzf -
-
-# This will create a directory called, e.g., 'solr-6.6.5'.
Create a -# symlink so we can find it more easily as $MEDDIR/solr - -ln -s $MEDDIR/solr-6.6.5 $MEDDIR/solr - -``` - -Finally, make a directory in which to keep the data - -```bash -mkdir -p $MEDDIR/data -``` - -You should now have a directory structure of the form: -```ruby -med - - data - - dromedary - - solr # symlink to below - - solr-X.Y.Z -``` - -## 3. Making a local index in your local solr - -### a. Settings - -First, you'll need to create `config/local.settings.yml`. Obviously, -change the `/Users/dueberb/devel/med` bit to point to your own -med directory. - -NOTE: this doesn't currently work as written (26/01/2022). Modifying -`config/settings/development.yml` to have the correct data dir and add the -build dir does seem to work. - -```yaml -data_dir: /Users/dueberb/devel/med/data -build_dir: /Users/dueberb/devel/med/data/build - -blacklight: - url: http://localhost:9639/solr/med - -``` - -The `blacklight.url` will be used to start/stop/reload the med core. - -### b. Fire up solr and load the core for the first time - -Starting solr is easy: `bin/dromedary solr start` - -May need to ensure there is a `solr.xml` file in `dromedary/solr`. -Also may need to add `lucene-analyzers-icu-X.Y.Z.jar` where X/Y/Z is the solr version - to `dromedary/solr/lib` to get `ICUNormalizer2CharFilterFactory`. - -Loading the core is more annoying for now - -``` -cd solr/med -curl "http://localhost:9639/solr/admin/cores?action=CREATE&name=med&config=solrconfig.xml&dataDir=data&instanceDir=$(pwd)&wt=json" -``` - -If you accidentally left a `core.properties` file laying around, it'll tell you that the core already -exists, even if it isn't currently loaded. You can just run the curl command again. - -### c. Index the data for the first time - -First, get the `In_progress_MEC_files.zip` file as showing in [the indexing document](indexing.md). -(Note: this zip may not include a directory.) -Assuming you get it downloaded into ~/Downloads, you can: - -`bin/dromedary newdata prepare ~/Downloads/In_progress_MEC_files.zip` -`bin/dromedary newdata index` - -The former pre-processes the data into `../data/build`, the other -actually copies it to the real data dir and indexes it to solr. - -### d. Fire up the server and take it for a ride! - -`bundle exec puma` - - diff --git a/docs/setting_up_with_k8s.md b/docs/setting_up_with_k8s.md deleted file mode 100644 index dd6171a..0000000 --- a/docs/setting_up_with_k8s.md +++ /dev/null @@ -1,50 +0,0 @@ -# Setting up MED on Kubernetes - -blah - -## Solr Operator - -For secure, reliable Solr in the kubernetes cluster! Cribbed heavily from [the HathiTrust docs](https://github.com/hathitrust/hathitrust_catalog_indexer/blob/78631a3d0831653f038222b644e6ffc83d5f8294/solr/solrcloud/README.md). - -### Set up via Helm -The LIT k8s cluster already has the appropriate Helm charts installed, but you will need to install `helm` on the machine you are using to interact with Kubernetes. (Installing helm charts on minikube is beyond the scope of this documentation.) 
-From inside the `dromedary` github repository:
-```bash
-$ helm install middle-english apache-solr/solr \
-    --version 0.6 \
-    --namespace middle-english-testing \
-    -f solr-helm-values.yml
-```
-(or staging, or production)
-
-You can retrieve the `admin` password that will have been created for you like so:
-```bash
-kubectl -n middle-english-testing get secret middle-english-solrcloud-security-bootstrap -o jsonpath='{.data.admin}' | base64 -d
-```
-Depending on your shell and the characters in the password, you may or may not be able to assign this password to a shell variable!
-
-Port-forward to do the next steps, in Lens or in the terminal:
-```bash
-kubectl -n middle-english-testing port-forward service/middle-english-solrcloud-common 8983:80
-```
-
-### Upload the configuration
-First, zip up the configuration:
-```bash
-cd [git repo directory]/solr/med/conf/
-zip -r ../../../middle-english.zip .
-```
-
-Then upload it:
-```bash
-curl -u "admin:$SOLR_PASS" -X PUT --header "Content-Type: application/octet-stream" \
-    --data-binary @middle-english.zip \
-    "http://localhost:8983/api/cluster/configs/middle-english"
-```
-
-### Create a collection
-Like a core, but cloud-y.
-```bash
-curl -u "admin:$SOLR_PASS" "http://localhost:8983/solr/admin/collections?action=CREATE&name=middle-english&numShards=1&replicationFactor=3&maxShardsPerNode=2&collection.configName=middle-english"
-```
\ No newline at end of file
diff --git a/docs/solr.md b/docs/solr.md
new file mode 100644
index 0000000..14ea5a8
--- /dev/null
+++ b/docs/solr.md
@@ -0,0 +1,2 @@
+# Solr configuration for the MED
+
diff --git a/lib/med_installer/indexing_steps.rb b/lib/med_installer/indexing_steps.rb
index daf51d7..01a8cb0 100644
--- a/lib/med_installer/indexing_steps.rb
+++ b/lib/med_installer/indexing_steps.rb
@@ -86,11 +86,6 @@ def index

       @build_collection = create_configset_and_collection!

-      # Delete any leftover crap from when we had different aliases.
-      @connection.aliases.each do |a|
-        a.delete! unless [Dromedary::Services[:production_alias], Dromedary::Services[:preview_alias]].include? a.name
-      end
-
       create_combined_documents

       # TODO: SolrCloud::Collection should have a `#url` method, for god's sake

From ec6d89bd1363b9c74a0b3e46477d8258158bf9ae Mon Sep 17 00:00:00 2001
From: Bill Dueber
Date: Fri, 27 Sep 2024 09:50:30 -0400
Subject: [PATCH 4/4] Finish docs

---
 docs/solr.md | 113 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 113 insertions(+)

diff --git a/docs/solr.md b/docs/solr.md
index 14ea5a8..7dbfbbb 100644
--- a/docs/solr.md
+++ b/docs/solr.md
@@ -1,2 +1,115 @@
 # Solr configuration for the MED

+The MED data isn't super-complicated, and the solr install follows suit.
+
+There are a few things worth noting.
+
+## Use of XML ENTITY declarations to include files
+
+The configuration makes use of XML includes (the `ENTITY` declarations
+that pull in other files) to keep things a little easier to manage.
+For both the `solrconfig.xml` and (to a lesser extent) the
+`schema.xml` files, the most interesting stuff is in the
+subdirectories, implemented as files expanded into the
+main configuration.
+
+## Extra .jar files
+
+Additional .jar files are used, in particular the ICU code for
+both tokenization and normalization.
The necessary files are located
+in [solr/lib](../solr/lib) and are copied into the image via the
+[solr Dockerfile](../solr/Dockerfile).
+
+## Search parameters defined in the solr configuration
+
+The search parameters (what fields to search, relevance boosts, etc.)
+live in the solr configuration and are just referenced in the
+Blacklight code. So, when the CatalogController references
+`$headword_and_forms_qf`, you need to look at
+[headword_and_forms_search.xml](../solr/dromedary/conf/solrconfig_med/entry_searches/headword_and_forms_search.xml)
+to see what's going on and make changes.
+
+All the configuration assumes we're targeting an eDismax search handler.
+There are `qf` and `pf` variables defined right in the `solrconfig.xml`
+for each of the handlers, but those are there to act as defaults. The
+"real" configuration is in the smaller, included XML files.
+
+## Multiple search handlers
+
+Most solr configurations at the library use a single primary handler for searches,
+usually exposed as `select`. This `solrconfig.xml` uses several separate requestHandlers,
+each tuned to what kind of search the user is doing:
+* _search_ for entries
+* _bibsearch_ for bibliography, and
+* _quotesearch_ for quotation searches.
+
+There are also a couple of special handlers for dealing with individual documents
+and the hyp-to-bibid mapping record.
+In all cases, what's in the actual `solrconfig.xml` file is a skeleton,
+with the meat of the definitions in the files under
+[solrconfig_med](../solr/dromedary/conf/solrconfig_med).
+
+## Multiple Suggesters
+
+In solr parlance, a "suggester" is a handler and associated index
+designed expressly for doing autocomplete. Stock Blacklight only
+deals with a single index, so some changes have been made
+to allow distinct indexes to be used depending on what field
+has been selected in the search dropdown.
+
+Each suggester is built off a particular field, and needs to be rebuilt
+whenever the index changes (which is done when doing a normal
+indexing routine via the Admin page or locally with
+[index_new_file.rb](../bin/index_new_file.rb)).
+
+The suggester terminology is a bit opaque, so adding a new suggester is
+probably best done by pattern matching on what's already there.
+
+Here's an example, from [headword_only_suggester.xml](../solr/dromedary/conf/solrconfig_med/suggesters/headword_only_suggester.xml)
+
+```xml
+<searchComponent name="headword_only_suggester" class="solr.SuggestComponent">
+  <lst name="suggester">
+    &common_suggester_components;
+    <str name="name">headword_only_suggester</str>
+    <str name="suggestAnalyzerFieldType">me_text</str>
+    <str name="field">headword_only_suggestions</str>
+    <str name="indexPath">headword_only_suggester_index</str>
+  </lst>
+</searchComponent>
+
+<requestHandler name="/headword_only_suggester" class="solr.SearchHandler" startup="lazy">
+  <lst name="defaults">
+    <str name="suggest">true</str>
+    <str name="suggest.count">15</str>
+    <str name="suggest.dictionary">headword_only_suggester</str>
+  </lst>
+  <arr name="components">
+    <str>headword_only_suggester</str>
+  </arr>
+</requestHandler>
+```
+
+Things to note:
+
+* You can pick any names for the suggester handler, but it
+  must be used _identically_ in three places:
+  * The top `<str name="name">` in the `searchComponent` at the top
+  * The name of the `suggest.dictionary` in the `requestHandler`
+  * The reference in the `<arr name="components">` block at the bottom.
+* The `field` is the name of the field you're indexing. It *must*
+  be a stored field! The current setup builds up a special field for this,
+  instead of trying to make an existing field work.
+* `"suggestAnalyzerFieldType"` is probably the same fieldType used for the
+  field you're indexing in this typeahead field -- it controls what
+  analysis chain (if any) is run on the data, and is just a reference to
+  a type as defined with a `fieldType` in your `schema.xml`. Generally,
+  you'll want to use the same `fieldType` as you do for the data in
+  the searchable parts of the index.
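+
+Once a suggester is defined and some data has been indexed, you can build and
+sanity-check it directly. A sketch, assuming the handler is registered at
+`/headword_only_suggester` as above, and using the local dev port and credentials
+from `compose.yml` (`suggest.build` and `suggest.q` are standard Solr suggester
+parameters; substitute your actual collection name):
+
+```shell
+curl -u solr:SolrRocks \
+  "http://localhost:9172/solr/<collection>/headword_only_suggester?suggest.q=ab&suggest.build=true"
+```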
If you're building a new suggester for another field, you'll need to
+make sure to also reference it correctly in the search dropdown
+and add it to [autocomplete.yml](../config/autocomplete.yml), as sketched below.
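+
+For reference, the shape of an entry in that file, patterned on the existing ones
+(the `quote` key and `quote_suggester` names here are hypothetical):
+
+```yaml
+# config/autocomplete.yml (sketch)
+default: &default
+  quote:
+    solr_endpoint: quote_suggester          # path of the solr requestHandler
+    search_component_name: quote_suggester  # matches the suggest.dictionary name
+
+development:
+  <<: *default
+```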