Releases: unt-libraries/catalog-api
Release v2.0
Administrative Notes
You've probably noticed that this project appeared to be abandoned for the last few years. We moved to an internal GitLab repository in 2020 while working on some very large updates needed for our production systems. We've finally reached a point where things are stable enough to release again publicly.
Beware that this release comprises several years of changes; the API itself is not even backwards compatible with the previous version. Here in the release notes I am providing an overview of the big changes to watch out for, but there are too many to list comprehensively.
Going forward, I plan to take a hybrid approach to managing and updating this project: most of the day-to-day work will happen internally in GitLab, and I will push changes to GitHub when it makes sense to do so.
The problem I have consistently run into with this project is that I have never fully separated the generic / publicly-useful components from the components that implement very UNT-specific features. Of course the latter have had to take precedence at the expense of the former.
I am not aware of anyone else using this project in production, so I have not been as careful as I would otherwise be to make sure there's a stable migration path (with appropriate deprecations, etc.).
Migration Notes
I highly, highly recommend redeploying from scratch and not attempting to migrate old installations.
If you have custom code built on this, I sincerely and wholeheartedly apologize.
I've moved what was the master branch for this project (on GitHub) to old-master and what was the discover branch to old-discover. The new default branch is main.
Overview of Large-Scale Changes
Solr
- The old Solr version that was bundled in the repository (I think it was version 4.5) has been removed. It's been replaced with the /solrconf directory, which contains only the applicable configuration files, schemas, etc.
- The Docker setup now uses the official Solr images.
- The Solr configuration files have been updated to work with Solr 9.2.
- Solr cores have been changed.
  - bibdata no longer exists. It has been replaced by discover-01 and discover-02, the cores that power our Blacklight instance.
  - marc no longer exists. We never used the MARC record API view in production, and the index was large enough that we decided just to get rid of it.
  - libguides was never actually used by the catalog-api project, and we moved these files so they're managed by a different repository.
- Added optional support for multi-server Solr architectures. For each Haystack connection, you can now specify a separate SEARCH URL and UPDATE URL. Search/query requests are made against SEARCH and update requests are made to UPDATE. This should support either a legacy / user-managed replication scenario or a SolrCloud-based one.
- Added optional support for specifying how Solr replication happens when a hard "commit" is issued. (I.e., you can issue a manual command to perform replication instead of using polling.)
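To illustrate the SEARCH / UPDATE split described above, here is a minimal sketch of routing requests by type. The setting keys (SEARCH_URL, UPDATE_URL), the host names, and the solr_url_for helper are all hypothetical; check the README and the env template for the real setting names.

```python
# Hypothetical connection settings. The key names and hosts are assumptions
# made for illustration; they are not the project's actual settings.
CONNECTIONS = {
    "default": {
        "SEARCH_URL": "http://solr-replica:8983/solr/discover-01",
        "UPDATE_URL": "http://solr-leader:8983/solr/discover-01",
    },
}


def solr_url_for(connection_name, operation, connections=CONNECTIONS):
    """Return the URL to use for a 'search' or 'update' request."""
    conf = connections[connection_name]
    if operation == "search":
        return conf["SEARCH_URL"]
    if operation == "update":
        return conf["UPDATE_URL"]
    raise ValueError(f"unknown operation: {operation!r}")
```

In a single-server setup, both URLs would simply point at the same core.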
Django and Python Dependencies
- The target Django version is 3.2.
- The target Python version is 3.10.
- All other dependencies are as up-to-date as possible, except pymarc, which is pinned to the 4.x line for now (due to breaking changes in v5).
The API
- The marc API resource has been removed because we never used it.
- The bib API resource is completely different: we've reimplemented all fields to utilize a different Solr schema, the one that our Blacklight instance uses (discover-01 and discover-02).
- The id field for bibs, items, and eresources is now the III record number (minus the check digit), not the internal Sierra database ID.
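If you have record numbers that still carry a check digit, you need to strip it before using them as an id. The sketch below assumes the commonly documented III scheme (weight the digits 2, 3, 4, ... from right to left, sum, take mod 11, and use "x" for a remainder of 10). The function names are hypothetical and are not part of catalog-api.

```python
def sierra_check_digit(digits):
    """Compute the check digit for a string of record-number digits.

    Assumes the mod-11 scheme: the rightmost digit is weighted 2, the next
    3, and so on; a remainder of 10 is written as 'x'.
    """
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=2))
    remainder = total % 11
    return "x" if remainder == 10 else str(remainder)


def strip_check_digit(recnum):
    """Drop a trailing check digit from a record number such as 'b10243641'.

    Raises ValueError if the final character is not the expected check
    digit, so a number that has no check digit is not silently truncated.
    """
    prefix, digits, check = recnum[0], recnum[1:-1], recnum[-1].lower()
    if not digits.isdigit() or sierra_check_digit(digits) != check:
        raise ValueError(f"{recnum!r} does not end in a valid check digit")
    return prefix + digits
```

For example, under this scheme the digits 1024364 sum to 78, and 78 mod 11 is 1, so strip_check_digit("b10243641") yields "b1024364".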
Export Processes
- All of the old SolrMarc-based processes have been removed, and SolrMarc has been removed as a dependency.
- All-new Python-based MARC-translation processes have been added. These were developed specifically for building data for our Blacklight app. These are in export.sierramarc and export.marcparse.
Redis
- I highly recommend that people upgrade the version of Redis being used to the latest.
- Added support for supplying Redis passwords via protected Env variables, using the legacy single-password method for the 'default' user. The Celery and Appdata Redis instances can have separate passwords.
- Lots of optimization and changes to the utils.redisobjs Redis interface.
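As a rough illustration of the password support described above, this sketch builds a Redis URL from an environment variable, using the password-only form that authenticates as the 'default' user. The variable names and the redis_url helper are hypothetical; the real setting names are in the env template.

```python
import os
from urllib.parse import quote


def redis_url(host, port, db, password_var, env=os.environ):
    """Build a redis:// URL, adding a password if the env variable is set.

    The ':password@' form (empty username) is the single-password style of
    authentication, which applies to the 'default' Redis user.
    """
    password = env.get(password_var)
    # Percent-encode the password so characters like '@' or '/' are safe.
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"redis://{auth}{host}:{port}/{db}"
```

Because each call names its own variable, the Celery and Appdata instances can read separate passwords, e.g. redis_url("localhost", 6379, 0, "CELERY_REDIS_PASSWORD") (a made-up variable name).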
New / Changed Settings
There are several new or updated settings. See the README and/or the env template file (in /django/sierra/sierra/settings/.env.template) for details.
Release v1.4
v1.4 makes back-end improvements geared toward improving our ability to keep this software up-to-date. It mostly comprises adding tests and testing infrastructure, updating dependencies and making requisite changes to the software, and refactoring to better support more targeted tests.
The goal is to have fully updated dependencies (to Django 1.11) and move to Python 3 by 2020. Then we'll work on supporting Django 2.
Upgrading to v1.4
For Docker TESTING and DEV environments, after updating an existing working copy to the latest version, you should do the following:
- Rebuild the Docker images (./docker-compose.sh build).
- Re-initialize test data (./init-dockerdata.sh -f tests).
- If you can blow away your dev environment, I'd recommend running ./init-dockerdata.sh -f dev as well, but note that will completely reset everything, including your Solr indexes.
- Otherwise, issuing a ./docker-compose.sh run --rm manage-dev migrate command should do the trick.
Non-Docker environments should either redeploy or update the working copy and run migrations depending on the particulars of the environment.
Note that, unless you're redeploying or re-initializing your Django database, you'll want to go in and delete the AllToSolr ExportType manually. (I didn't write a data migration to do this.)
Changes that May Affect Custom Catalog API Code
To be completely honest this has been in the works for about a year, and I have not really kept track of what changes might break custom code. If you've forked this repository and done a lot of customization, you'll want to be careful when applying this update.
Changes to the export app are most extreme, but I've tried to keep the methods and signatures for the export.exporter.Exporter base class the same, so this shouldn't break any custom Exporter sub-classes you may have created. Changes to signatures and such in v1.3 were actually more problematic for existing implementations. There are some new abstract base classes that you may find helpful, but the core API is still the same.
Deprecated and/or Removed Functionality
- The export.batch_exporters.AllToSolr exporter that was deprecated in v1.3 has been removed in v1.4.
- The export.basic_exporters.BibsDownloadMarc exporter is now deprecated. It will be removed in the next version, v1.5.
Django 1.8 and More-recent Dependencies
The supported version of Django is now 1.8, from 1.7. Other dependencies have been updated as far as they can go while still supporting Django 1.8.
Tests, Tests, and More Tests
To ensure the dependency-update process didn't break anything, we've added near-comprehensive test coverage; not necessarily strict unit tests, but tests that at least make sure to touch all of the major end-user-facing functions. The full test suite currently runs ~2300 individual tests. This is the main reason this update took nearly a year—with tests now in place, future dependency updates should be much faster and easier.
Tests were particularly challenging to write due to this system's dependence on a legacy vendor database (Sierra) plus Solr, which doesn't have any existing test-data-generation modules equivalent to model-mommy. Test data and fixtures are included to simulate data flow from Sierra (via a test PostgreSQL Sierra database, including fixtures exported from our live catalog) into Solr. I've built modules for generating Solr test data a la model-mommy (utils.test_helpers.solr_factories and utils.test_helpers.solr_test_profiles) and added some helper factories for generating objects that work well as pytest fixtures to clean up Solr test data after each test or after each pytest module, depending on what's needed. This has led to a somewhat more complex test infrastructure than I really wanted, but it works.
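The general idea of a model-mommy-style factory with cleanup can be sketched as below. This is not the code in utils.test_helpers; FakeSolr, SolrDocFactory, and the field generators are hypothetical stand-ins, and an in-memory dict takes the place of a real Solr core.

```python
import itertools
import random
import string


class FakeSolr:
    """In-memory stand-in for a Solr core, used only for this sketch."""

    def __init__(self):
        self.docs = {}

    def add(self, docs):
        for doc in docs:
            self.docs[doc["id"]] = doc

    def delete(self, ids):
        for doc_id in ids:
            self.docs.pop(doc_id, None)


class SolrDocFactory:
    """Generate test docs from per-field generators and track them."""

    def __init__(self, conn, field_gens):
        self.conn = conn
        self.field_gens = field_gens
        self.created_ids = []

    def make(self, n=1, **overrides):
        """Create n docs; keyword overrides replace generated values."""
        docs = []
        for _ in range(n):
            doc = {name: gen() for name, gen in self.field_gens.items()}
            doc.update(overrides)
            docs.append(doc)
        self.conn.add(docs)
        self.created_ids.extend(doc["id"] for doc in docs)
        return docs

    def cleanup(self):
        """Remove every doc this factory created."""
        self.conn.delete(self.created_ids)
        self.created_ids = []


counter = itertools.count(1)
bib_profile = {
    "id": lambda: f"b{next(counter)}",
    "title": lambda: "".join(random.choices(string.ascii_lowercase, k=8)),
}
```

In a pytest suite, a fixture would yield the factory and call cleanup() during teardown, scoped per test or per module as needed.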
Also, the Sierra DB tests that were originally meant to be run via the normal Django test runner have been moved to pytest. This addresses issue #40.
Changes to Sierra Models
Django 1.8 made some changes to models that broke some of the workarounds we'd employed, e.g., to accommodate the lack of single primary keys in some Sierra tables and Django's lack of support for composite keys. This release improves the Sierra models in the following ways.
- Adds base.fields.VirtualCompField, a virtual composite field that works as a proper PK for Sierra models. Notably, it does NOT work in relationships as a foreign key, but the Sierra models don't need to employ it that way.
- Adds base.models.ModelWithAttachedName base class for Sierra Property or Type models that have an associated PropertyName or TypeName table containing names or labels for a particular property by language, linked to the IIILanguage table. Previously it was assumed these labels always used the default language, but this base class implements support for multiple languages. A new setting has been added (III_LANGUAGE_CODE) to define the primary language, but the class allows you to request a label using a different language, if it's available in your system.
- ForeignKey relations that had a unique=True property have been converted to OneToOne relationship types.
New manage.py Command for Creating APIUsers
Previously, in order to create new APIUsers, you had to use the Python shell and run Python commands manually. This adds the importapiusers management command via the api app to allow you to import a CSV file to generate new APIUsers from the command line.
Refactored Code
This set of changes contains a LOT of refactored code to help break things into more granular units to make them more testable. Although I wasn't really aiming to do consistently granular unit tests, in some cases the methods or functions I wrote originally were awful and needed to be refactored.
Particularly, all exporters have been completely refactored. I've added a few types of intermediate base classes for helping compose complex export workflows more easily.
Bug Fixes
There are some fixes for various bugs or issues that came to light while writing tests, particularly surrounding API behavior.
- API filters that use the utils.solr.Queryset exclude method now work correctly. Previously, in certain circumstances, they would lead to default View querysets losing their default filters.
- API filter operators such as gt, gte, lt, etc. now handle queries against string fields that contain spaces. Previously these filters were not escaping the spaces correctly in the underlying Solr query, which led to errors.
- Incorrect syntax for API filters often led to 500 server errors if/when Solr raised an error because it couldn't parse the underlying query (which was not useful or informative for end users). Now incorrect API filter syntax raises a proper 400 error with a message explaining the issue.
- API filters using the in and range operators can now escape or quote commas to include them as part of a data value. Previously commas were only used as delimiters in the query argument.
- Shelflistitem manifests are now created/updated/refreshed for the correct item locations whenever any exporter that utilizes the shelflistitem Haystack index runs. The list of item locations needing a shelflistitem manifest refresh is stored and tracked as items are indexed, whether it's an ItemsToSolr, BibsAndAttachedToSolr, or other type of exporter. (Which addresses #16.)
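To show what escaping commas in an in or range filter argument means in practice, here is a minimal sketch that splits on unescaped commas only. It assumes a backslash is the escape character and does not cover the quoting form; split_filter_values is a hypothetical helper, not the API's actual parser.

```python
import re


def split_filter_values(raw):
    """Split a comma-delimited filter argument into its values.

    A comma preceded by a backslash is treated as data, not a delimiter.
    (An escaped backslash right before a comma is not handled here.)
    """
    parts = re.split(r"(?<!\\),", raw)
    # Drop the escape character from the commas that were kept as data.
    return [part.replace("\\,", ",") for part in parts]
```

With this approach the argument Smith\, John,Doe yields two values, "Smith, John" and "Doe", instead of three.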
BibsDownloadMarc Exporter Deprecation
- The export.basic_exporters.BibsDownloadMarc exporter is now deprecated. All of the SolrMarc-related conversion logic has been moved from exporters down into a lower-level custom Solr backend for Haystack. A separate exporter that converts bibs to MARC and saves them to the filesystem is no longer needed.
Release v1.3.1
In this release I've added a hotfix for issue #46, where an exporter job that runs and finds no qualifying records to fetch from Sierra raises an uncaught exception. Now the empty record set is detected during job initialization and a "nothing to do" message is logged, skipping the code that raised the error.
Release v1.3
This comprises changes to improve how Exporters and Celery tasks run, including tracking tasks in Redis via export.tasks.JobPlan, improved prefetching, new .env settings for better memory management, better DB connection management, and better handling of return values from exporter jobs.
Changes that May Affect You or Your Custom Catalog API Code:
- The export.exporter.Exporter class's parallel attribute was removed. Now all exporters are assumed to be run in asynchronous fashion via Celery chords.
- The signatures of the export_records and delete_records methods on export.exporter.Exporter and its subclasses have changed: vals is no longer a valid kwarg.
- New export.exporter.Exporter method: compile_vals. The default implementation should continue to work similarly to the previous release. Subclasses may implement their own method to control how the lists of return values from export_records and delete_records (for that class) are combined to pass to final_callback.
- export.batch_exporters.AllToSolr is now deprecated. A warning will be written to the exporter log if you try to use it. It will be removed entirely in v1.4.
- New Django and .env settings: EXPORTER_MAX_RC_CONFIG and EXPORTER_MAX_DC_CONFIG. Now you can set the max_rec_chunk and/or max_del_chunk parameters for each exporter on an env-by-env basis, to help manage memory differently for environments that may be more or less memory-constrained. (The new settings are optional: if provided, they override class attributes, which are still used by default.)
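The override behavior of the new chunk-size settings can be sketched as follows. The shape of the setting (a dict keyed by exporter class name), the default values, and the get_chunk_size helper are assumptions for illustration; the README describes the actual format.

```python
class Exporter:
    # Hypothetical class-level defaults, used when no override is set.
    max_rec_chunk = 3000
    max_del_chunk = 1000


class BibsToSolr(Exporter):
    max_rec_chunk = 2000


def get_chunk_size(exporter_class, attr, overrides):
    """Return the env-level override if present, else the class attribute."""
    return overrides.get(exporter_class.__name__, getattr(exporter_class, attr))


# Stand-in for a parsed EXPORTER_MAX_RC_CONFIG value (format assumed).
max_rc_config = {"BibsToSolr": 500}
```

Here get_chunk_size(BibsToSolr, "max_rec_chunk", max_rc_config) returns the override, 500, while an exporter that isn't listed falls back to its class attribute.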
Other General Changes from v1.2 to v1.3
- Pernicious issues with the reliability of large, long-running export jobs are finally resolved. The task that chops a job into smaller chunks and dispatches those to individual Celery tasks now creates and uses a JobPlan, cached in Redis (apart from the Celery result broker), to manage that work. Jobs are now divided into one or more chords of predictable sizes (200 chunks by default), rather than creating a single chord of potentially thousands of chunks. This ensures that errors are handled appropriately and the final_callback method always runs. Plus, now, if a job raises errors that lead to an entire chunk being skipped, that portion of the JobPlan persists in Redis, and you can see exactly what record PKs were skipped after the job completes.
- Most exporter classes in export.basic_exporters were not using comprehensive enough lists of prefetch_related and select_related relationships, which led to some inefficiencies. BibsAndAttachedToSolr, which is the job we run nightly to sync bib records, was especially problematic. This release updates the relevant exporter attributes, which leads to some large performance gains. (To help mitigate the increased memory usage, new Django / .env settings were added to change max_rec_chunk and max_del_chunk configuration on an env-by-env, exporter-by-exporter basis.)
- I've always had occasional weird problems with the Celery tasks trying to reuse stale database connections, which would raise exceptions. Because I didn't understand the problem well originally, my previous solution had been pretty awful: drop the default connection at the beginning of each task and then use extensive try/except blocks to try to catch OperationalError exceptions and simply retry whatever caused the exception. I did some research into the issue and discovered that it isn't an uncommon problem with Celery: Celery tasks fall outside the normal Django request cycle, and so it doesn't manage connections for you the same way. This release fixes the problem more completely: each task is wrapped in a function that calls close_if_unusable_or_obsolete on each connection before and after the task runs. This ensures that connections are closed properly before and after each task.
- Overall, with the improvements in this release, we have seen a 4-5X increase in our production exporters' throughput. We can run BibsToSolr over our entire database (~3.2 million records) in around 5 hours. This used to be a multi-day operation.
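The way a job is divided into chunks and then into chords of a predictable size can be sketched like this. It is only an illustration of the partitioning arithmetic, with hypothetical function names; the real JobPlan also caches its state in Redis and dispatches Celery chords.

```python
def chunked(seq, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def plan_job(record_pks, max_rec_chunk, chunks_per_chord=200):
    """Group record PKs into chunks, then group the chunks into chords.

    Returns a list of chords; each chord is a list of chunks, and each
    chunk is a list of record PKs handled by one task.
    """
    chunks = chunked(list(record_pks), max_rec_chunk)
    return chunked(chunks, chunks_per_chord)
```

For instance, 4,500 records with a chunk size of 10 gives 450 chunks, which become three chords of 200, 200, and 50 chunks rather than one chord of 450.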