Merge pull request #1721 from CentreForDigitalHumanities/feature/document-frontend-settings

update + expand documentation
JeltevanBoheemen authored Dec 6, 2024
2 parents 8910a1a + 36bda81 commit 2c8093b
Showing 17 changed files with 362 additions and 116 deletions.
6 changes: 3 additions & 3 deletions documentation/Celery.md
@@ -11,11 +11,11 @@ Celery is used for

## Running celery

See the repository readme for installation instructions. For development, it is possible to run I-analyzer without celery if you are not intending to use any of the functions listed above.
See [first-time setup](./First-time-setup.md) for installation instructions. For development, it is possible to run I-analyzer without celery if you are not intending to use any of the functions listed above.

### Redis

Celery uses [redis](https://www.redis.io/) as a backend for task data. Start Redis by running `redis-server` in a terminal.
Celery uses [redis](https://www.redis.io/) as a backend for task data. After installing prerequisites, start Redis by running `redis-server` in a terminal.

If you wish to run Redis from a non-default hostname and/or port, or pick which database to use, specify this in your `/backend/ianalyzer/settings_local.py` as
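The snippet itself is collapsed in this diff view. As an illustrative sketch only (the setting names below are the standard Celery ones, not necessarily the exact names the project uses):

```python
# Hypothetical sketch of a settings_local.py override for a non-default
# Redis host/port/database; actual setting names may differ in the project.
REDIS_HOST = 'localhost'
REDIS_PORT = 6379
REDIS_DB = 0

CELERY_BROKER_URL = f'redis://{REDIS_HOST}:{REDIS_PORT}/{REDIS_DB}'
CELERY_RESULT_BACKEND = CELERY_BROKER_URL
```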

@@ -61,7 +61,7 @@ Then open `localhost:5555` in your browser to see the flower interface.
## Developing with celery

- The arguments and outputs for celery tasks must be JSON-serialisable. For example, a task function can have a user ID string as an argument, but not a `CustomUser` object.
- Use `group` to run tasks in parallel and `chain` to run tasks in series.
- Use `group` to run tasks in parallel and `chain` to run tasks in series. You can use groups in chains, chains in groups, chains in chains, etc.
- You can use flower (see above) for an overview of your celery tasks. Note that groups and chains are not tasks themselves, and will not show up as tasks on Flower.
- For easier debugging and testing, keep your tasks simple and factor out complicated functionality into 'normal' functions.
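The composition rules above can be sketched in plain Python, with ordinary functions standing in for `celery.group` and `celery.chain` (so the sketch runs without a broker; the real API differs):

```python
import json

def assert_json_serialisable(*args):
    # Task arguments must survive a JSON round trip:
    # e.g. a user ID string is fine, a CustomUser instance is not.
    json.dumps(args)

def chain_apply(funcs, value):
    # 'chain' semantics: run in series, each task feeding the next.
    for f in funcs:
        value = f(value)
    return value

def group_apply(funcs, value):
    # 'group' semantics: run in parallel (sequentially in this sketch),
    # all tasks receiving the same input.
    return [f(value) for f in funcs]

# Groups and chains nest freely: here a chain ends in a group.
double = lambda x: x * 2
inc = lambda x: x + 1
assert_json_serialisable(3)
result = chain_apply([inc, lambda x: group_apply([double, inc], x)], 3)
```

With input `3`, the chain first increments to `4`, then the nested group maps it to `[8, 5]`.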

4 changes: 2 additions & 2 deletions documentation/Corpus-database-models.md
@@ -25,7 +25,7 @@ Python definitions can be loaded into the database with the `loadcorpora` Django

This command will parse any configured python corpora and save a database representation for them. If the python corpus cannot be loaded, the `Corpus` object will still exist in the database, but it will be inactive.

If a corpus by the same name already exists in the database, the command will completely overwrite its `CorpusConfiguration` and `Field` instances. This means that, aside from adjusting permissions, changing the database representation of a corpus with a Python definition is always temporary. If you want to make permanent changes to the corpus, adjust the Python definition and run `loadcorpora` again.
If a corpus by the same name already exists in the database, the command will completely overwrite its `CorpusConfiguration` and `Field` instances. This means that changing the database representation of a corpus with a Python definition is always temporary (except for adjusting permissions). If you want to make permanent changes to the corpus, adjust the Python definition and run `loadcorpora` again.

## Corpus visibility

@@ -46,4 +46,4 @@ Removing a corpus from the settings will not delete the `Corpus` object. It has

Since the underlying `Corpus` is not actually deleted, related search history, downloads, tags, and permissions will be preserved. If you reinstate the corpus in settings, all of these will function as before.

At this point, you can also remove the `Corpus` object completely, which will remove all related data.
At this point, you can activate the corpus again and use it as a database-only corpus, or you can remove the `Corpus` object completely, which will remove all related data.
1 change: 1 addition & 0 deletions documentation/Corpus-definitions.md
@@ -49,6 +49,7 @@ Database-only corpora do not support some advanced functionality. Notably:
- word models (i.e. word embeddings)
- media attachments to documents
- updating the data of a corpus instead of re-indexing it from scratch
- named entity annotations

### Python class

4 changes: 3 additions & 1 deletion documentation/Corpus-validation.md
@@ -16,7 +16,9 @@ A corpus that does not meet this check cannot be searched in the frontend. This

A corpus must pass this check to be set to `active` - which enables the corpus in the search interface.

The `ready_to_publish` validation is not used directly when handling views, because it can include some non-trivial checks. For Python corpora, `active` is simply set by running `ready_to_publish()` after importing the corpus definition.
The `ready_to_publish` validation is not executed when handling views, because it can include some non-trivial checks. Instead, we check whether `active` is `True`, which implies that the corpus passed this validation.

For Python corpora, `active` is automatically set by running `ready_to_publish()` after importing the corpus definition. Database-only corpora are inactive by default, and have to be activated manually, which will trigger the validation.

## API

10 changes: 5 additions & 5 deletions documentation/Django-project-settings.md
@@ -6,7 +6,7 @@ This file describes how to configure project settings in Django.

We keep different settings files to handle different environments.

`settings.py` is the default settings file in a development file. The version in the repository is replaced in our deployment setup. This means that what you write here will affect all development environments, but not production environments. Developers can override settings in their own environment using `settings_local`, but this is a good place for sensible defaults.
`settings.py` is the default settings file in a development setting. The version in the repository is replaced in our deployment setup. This means that what you write here will affect all development environments, but not production environments. Developers can override settings in their own environment using `settings_local`, but this is a good place for sensible defaults.

`common_settings.py` is intended for "universal" project settings that apply in both production and development servers. It is imported by `settings.py` on both development and production.

@@ -52,9 +52,7 @@ The values in the dictionary give specifications.

By default, an elasticsearch server will have security features enabled; you can turn this off for a local development server (see [first-time setup](./First-time-setup.md)). Otherwise, the server configuration must specify an API key.

Create an API key for the server: see [creating an API key](https://www.elastic.co/guide/en/elasticsearch/reference/current/security-api-create-api-key.html). Note down the `'id'` and `'api_key'` values of the response.

Add the following values to the configuration:
To create an API key for the server, see [creating an API key](https://www.elastic.co/guide/en/elasticsearch/reference/current/security-api-create-api-key.html). Note down the `'id'` and `'api_key'` values of the response. Add the following values to the configuration:

- `'certs_location'`: Fill in the following path: `{your_elasticsearch_directory}/config/certs/http_ca.crt`
- `'api_id'`: the ID of the API key
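Putting the keys from the list above together, a server entry might look like the following sketch (the dictionary and key names follow the list above; the setting name `SERVERS` and all values are hypothetical placeholders):

```python
# Hypothetical sketch of a server configuration entry; paths and
# credentials are placeholders, and the setting name may differ.
SERVERS = {
    'default': {
        'certs_location': '/path/to/elasticsearch/config/certs/http_ca.crt',
        'api_id': 'my-api-key-id',        # the 'id' from the create-API-key response
        'api_key': 'my-api-key-secret',   # the 'api_key' from the response
    }
}
```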
@@ -67,6 +65,8 @@ If you name one of the servers `'default'`, it will act as the default for all corpora.

If you don't assign a default server this way, the server for each corpus must be configured explicitly in `CORPUS_SERVER_NAMES` (see below).

Unit tests for the backend will assume that there is a default server configured and use that one. Unit tests can create test indices (always named `test-*`), which will be deleted during teardown.

### `CORPORA`

A dictionary that specifies Python corpus definitions that should be imported in your project.
@@ -79,7 +79,7 @@ CORPORA = {
}
```

The key of the corpus must match the name of the corpus class. This match is not case-sensitive, and your key may include extra non-alphabetic characters (they will be ignored when matching). For example, `'times'` is a valid key for the `Times` class. It will usually match the filename as well, but this is not strictly necessary.
The key of the corpus must match the name of the corpus class. This match is not case-sensitive, and your key may include extra non-alphabetic characters (they will be ignored when matching). For example, `'times'` is a valid key for the `Times` class, and so is `'TIMES_1'`. It will usually match the filename as well, but this is not strictly necessary.
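The matching rule can be sketched as follows (a hypothetical reimplementation for illustration, not the project's actual code):

```python
import re

def keys_match(key: str, class_name: str) -> bool:
    # Compare a CORPORA key against the corpus class name,
    # ignoring case and any non-alphabetic characters.
    normalise = lambda s: re.sub(r'[^a-z]', '', s.lower())
    return normalise(key) == normalise(class_name)

# Both 'times' and 'TIMES_1' are valid keys for the Times class.
assert keys_match('times', 'Times')
assert keys_match('TIMES_1', 'Times')
```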

### `CORPUS_SERVER_NAMES`

63 changes: 63 additions & 0 deletions documentation/Downloads.md
@@ -0,0 +1,63 @@
# Downloads

I-analyzer offers several types of downloads to users. This document gives a high-level overview of the types of downloads that exist and where they are implemented.

## Downloading search results

We distinguish between two types of downloads: *direct* downloads and *scheduled* downloads.

For the user, a direct download means their browser will start downloading the file then and there. With a scheduled download, the user will receive an [email](./Email.md) when their download is complete. Scheduled downloads are only available if the user is signed in.

I-analyzer will automatically choose which type of download to use, based on the number of documents. The cutoff point is configured in the [frontend environment](./Frontend-environment-settings.md#directdownloadlimit).

### Direct downloads

Direct downloads are executed synchronously. There is an API endpoint to request the download, which will return the requested file.

### Scheduled downloads

Scheduled downloads are run with [Celery](./Celery.md).

The server will query elasticsearch to fetch matching documents. This is done in batches of 10,000 documents using the [scroll api](https://elasticsearch-py.readthedocs.io/en/v8.15.1/api/elasticsearch.html#elasticsearch.client.Elasticsearch.scroll).

Each batch of documents is appended to a CSV file in the server file system ([configured with `CSV_FILES_PATH`](./Django-project-settings.md#csv_files_path)). This means the server does not need to hold the complete results in memory.
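The batch-and-append pattern can be sketched independently of elasticsearch (a hypothetical helper in plain Python, standing in for the scroll loop):

```python
import csv
import io

def write_batches(batches, out):
    # Append each batch of documents to the CSV as it arrives, so the
    # full result set never has to sit in memory at once.
    writer = None
    for batch in batches:
        for doc in batch:
            if writer is None:
                writer = csv.DictWriter(out, fieldnames=list(doc))
                writer.writeheader()
            writer.writerow(doc)

buf = io.StringIO()
write_batches([[{'id': 1, 'text': 'a'}], [{'id': 2, 'text': 'b'}]], buf)
```

In the real task, each `batch` would be one page of scroll results and `out` a file under `CSV_FILES_PATH`.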

When the CSV file is complete, the user receives an email.

When the user downloads the complete file, they can choose additional options; at this point, this is just a choice for the file encoding. (We offer utf-16 encoding for compatibility with Microsoft Excel.)

File encoding is less time-consuming to process than fetching data, so it is handled at this point rather than in the initial processing. It also means the user can request a different encoding without re-doing the download.

When the user requests the download, the backend will either stream the file as-is, or, if the encoding needs to be changed, save a *converted* CSV file and stream that.

## Downloading visualisation results

### Downloading image files

When a user views a visualisation, they can always choose between a graphical view and a table.

With the graphical view, the user can download the graph as a PNG file. We use the `html-to-image` library to render the image from the page. The [VisualizationComponent](../frontend/src/app/visualization/visualization.component.ts) contains a method to select the HTML element that should be rendered, based on the type of visualisation.

### Downloading table data

The table view can be downloaded as a CSV. This file is generated by the frontend, using the data it already has available.

### Downloading full data

Some visualisations base their result on a sample of documents to limit computation time, but offer the user an option to download statistics for the full data.

This happens for the term frequency visualisation and the ngram visualisation.

For these downloads, a request is sent to the backend and handled asynchronously, similar to the scheduled downloads. When the user downloads the file, they can choose the encoding, and also pick between long and wide format.

## Downloads in the database

The [`Download` model](../backend/download/models.py) is used to keep track of a user's downloads.

The table includes all search results downloads, and full data downloads for visualisations. It does not include other visualisation downloads, as those are generated in the frontend.

## Download limit

Each user account has a download limit. By default, this is 10,000 documents. You can raise this in the admin site to allow individual users to download more documents.

Use this with caution on production servers. Note that the server may also have request timeouts that will effectively prevent users from being able to download large files, even if they are allowed to generate them.
