diff --git a/documentation/Celery.md b/documentation/Celery.md index 6b14e4058..cf0c8e412 100644 --- a/documentation/Celery.md +++ b/documentation/Celery.md @@ -11,11 +11,11 @@ Celery is used for ## Running celery -See the repository readme for installation instructions. For development, it is possible to run I-analyzer without celery if you are not intending to use any of the functions listed above. +See [first-time setup](./First-time-setup.md) for installation instructions. For development, it is possible to run I-analyzer without celery if you are not intending to use any of the functions listed above. ### Redis -Celery uses [redis](https://www.redis.io/) as a backend for task data. Start Redis by running `redis-server` in a terminal. +Celery uses [redis](https://www.redis.io/) as a backend for task data. After installing prerequisites, start Redis by running `redis-server` in a terminal. If you wish to run Redis from a non-default hostname and/or port, or pick which database to use, specify this in your `/backend/ianalyzer/settings_local.py` as @@ -61,7 +61,7 @@ Then open `localhost:5555` in your browser to see the flower interface. ## Developing with celery - The arguments and outputs for celery tasks must be JSON-serialisable. For example, a task function can have a user ID string as an argument, but not a `CustomUser` object. -- Use `group` to run tasks in parallel and `chain` to run tasks in series. +- Use `group` to run tasks in parallel and `chain` to run tasks in series. You can use groups in chains, chains in groups, chains in chains, etc. - You can use flower (see above) for an overview of your celery tasks. Note that groups and chains are not tasks themselves, and will not show up as tasks on Flower. - For easier debugging and testing, keep your tasks simple and outfactor complicated functionality to 'normal' functions. 
diff --git a/documentation/Corpus-database-models.md b/documentation/Corpus-database-models.md index 1680c4a70..f2012842c 100644 --- a/documentation/Corpus-database-models.md +++ b/documentation/Corpus-database-models.md @@ -25,7 +25,7 @@ Python definitions can be loaded into the database with the `loadcorpora` Django This command will parse any configured python corpora and save a database representation for them. If the python corpus cannot be loaded, the `Corpus` object will still exist in the database, but it will be inactive. -If a corpus by the same name already exists in the database, the command will completely overwrite its `CorpusConfiguration` and `Field` instances. This means that, aside from adjusting permissions, changing the database representation of a corpus with a Python definition is always temporary. If you want to make permanent changes to the corpus, adjust the Python definition and run `loadcorpora` again. +If a corpus by the same name already exists in the database, the command will completely overwrite its `CorpusConfiguration` and `Field` instances. This means that changing the database representation of a corpus with a Python definition is always temporary (except for adjusting permissions). If you want to make permanent changes to the corpus, adjust the Python definition and run `loadcorpora` again. ## Corpus visibility @@ -46,4 +46,4 @@ Removing a corpus from the settings will not delete the `Corpus` object. It has Since the underlying `Corpus` is not actually deleted, related search history, downloads, tags, and permissions will be preserved. If you reinstate the corpus in settings, all of these will function as before. -At this point, you can also remove the `Corpus` object completely, which will remove all related data. +At this point, you can activate the corpus again and use it as a database-only corpus, or you can remove the `Corpus` object completely, which will remove all related data. 
diff --git a/documentation/Corpus-definitions.md b/documentation/Corpus-definitions.md index c4fab4ae8..9b4c8d9b2 100644 --- a/documentation/Corpus-definitions.md +++ b/documentation/Corpus-definitions.md @@ -49,6 +49,7 @@ Database-only corpora do not support some advanced functionality. Notably: - word models (i.e. word embeddings) - media attachments to documents - updating the data of a corpus instead of re-indexing it from scratch +- named entity annotations ### Python class diff --git a/documentation/Corpus-validation.md b/documentation/Corpus-validation.md index 42942bea7..3c2de72d1 100644 --- a/documentation/Corpus-validation.md +++ b/documentation/Corpus-validation.md @@ -16,7 +16,9 @@ A corpus that does not meet this check cannot be searched in the frontend. This A corpus must pass this check to be set to `active` - which enables the corpus in the search interface. -The `ready_to_publish` validation is not used directly when handling views, because it can include some non-trivial checks. For Python corpora, `active` is simply set by running `ready_to_publish()` after importing the corpus definition. +The `ready_to_publish` validation is not executed when handling views, because it can include some non-trivial checks. Instead, we check whether `active` is `True`, which implies that the corpus passed this validation. + +For Python corpora, `active` is automatically set by running `ready_to_publish()` after importing the corpus definition. Database-only corpora are inactive by default, and have to be activated manually, which will trigger the validation. ## API diff --git a/documentation/Django-project-settings.md b/documentation/Django-project-settings.md index 14ee8d264..bad084343 100644 --- a/documentation/Django-project-settings.md +++ b/documentation/Django-project-settings.md @@ -6,7 +6,7 @@ This file describes how to configure project settings in Django. We keep different settings files to handle different environments. 
-`settings.py` is the default settings file in a development file. The version in the repository is replaced in our deployment setup. This means that what you write here will affect all development environments, but not production environments. Developers can override settings in their own environment using `settings_local`, but this is a good place for sensible defaults. +`settings.py` is the default settings file in a development setting. The version in the repository is replaced in our deployment setup. This means that what you write here will affect all development environments, but not production environments. Developers can override settings in their own environment using `settings_local`, but this is a good place for sensible defaults. `common_settings.py` is intended for "universal" project settings that apply in both production and development servers. It is imported by `settings.py` on both development and production. @@ -52,9 +52,7 @@ The values in the dictionary give specifications. By default, an elasticsearch server will have security features enabled; you can turn this off for a local development server (see [first-time setup](./First-time-setup.md)). Otherwise, the server configuration must specify an API key. -Create an API key for the server: see [creating an API key](https://www.elastic.co/guide/en/elasticsearch/reference/current/security-api-create-api-key.html). Note down the `'id'` and `'api_key'` values of the response. - -Add the following values to the configuration: +To create an API key for the server, see [creating an API key](https://www.elastic.co/guide/en/elasticsearch/reference/current/security-api-create-api-key.html). Note down the `'id'` and `'api_key'` values of the response. 
Add the following values to the configuration: - `'certs_location'`: Fill in the following path: `{your_elasticsearch_directory}/config/certs/http_ca.crt` - `'api_id'`: the ID of the API key @@ -67,6 +65,8 @@ If you name one of the servers `'default'`, it will act as the default for all c If you don't assign a default server this way, the server for each corpus must be configured explicitly in `CORPUS_SERVER_NAMES` (see below). +Unit tests for the backend will assume that there is a default server configured and use that one. Unit tests can create test indices (always named `test-*`), which will be deleted during teardown. + ### `CORPORA` A dictionary that specifies Python corpus definitions that should be imported in your project. @@ -79,7 +79,7 @@ CORPORA = { } ``` -The key of the corpus must match the name of the corpus class. This match is not case-sensitive, and your key may include extra non-alphabetic characters (they will be ignored when matching). For example, `'times'` is a valid key for the `Times` class. It will usually match the filename as well, but this is not strictly necessary. +The key of the corpus must match the name of the corpus class. This match is not case-sensitive, and your key may include extra non-alphabetic characters (they will be ignored when matching). For example, `'times'` is a valid key for the `Times` class, and so is `'TIMES_1'`. It will usually match the filename as well, but this is not strictly necessary. ### `CORPUS_SERVER_NAMES` diff --git a/documentation/Downloads.md b/documentation/Downloads.md new file mode 100644 index 000000000..680bcbca2 --- /dev/null +++ b/documentation/Downloads.md @@ -0,0 +1,63 @@ +# Downloads + +I-analyzer offers several types of downloads to users. This document gives a high-level overview of the types of downloads that exist and where they are implemented. + +## Downloading search results + +We distinguish between two types of downloads: *direct* download and *scheduled* downloads. 
+ +For the user, a direct download means their browser will start downloading the file then and there. With a scheduled download, the user will receive an [email](./Email.md) when their download is complete. Scheduled downloads are only available if the user is signed in. + +I-analyzer will automatically choose which type of download to use, based on the number of documents. The cutoff point is configured in the [frontend environment](./Frontend-environment-settings.md#directdownloadlimit). + +### Direct downloads + +Direct downloads are executed synchronously. There is an API endpoint to request the download, which will return the requested file. + +### Scheduled downloads + +Scheduled downloads are run with [Celery](./Celery.md). + +The server will query elasticsearch to fetch matching documents. This is done in batches of 10.000 documents using the [scroll api](https://elasticsearch-py.readthedocs.io/en/v8.15.1/api/elasticsearch.html#elasticsearch.client.Elasticsearch.scroll). + +Documents are written to a CSV file in the server file system ([configured with `CSV_FILES_PATH`](./Django-project-settings.md#csv_files_path)) per batch. This means the server does not need to store the complete results in memory. + +When the CSV file is complete, the user receives an email. + +When the user downloads the complete file, they can choose additional options; at this point, this is just a choice for the file encoding. (We offer utf-16 encoding for compatibility with Microsoft Excel.) + +File encoding is less time-consuming to process than fetching data, so it is handled at this point rather than in the initial processing. It also means the user can request a different encoding without re-doing the download. + +When the user requests the download, the backend will either stream the file as-is, or, if the encoding needs to be changed, save a *converted* CSV file and stream that. 
+ +## Downloading visualisation results + +### Downloading image files + +When a user views a visualisation, they can always choose between a graphical view and a table. + +With the graphical view, the user can download the graph as a PNG file. We use the `html-to-image` library to render the image from the page. The [VisualizationComponent](../frontend/src/app/visualization/visualization.component.ts) contains a method to select the HTML element that should be rendered, based on the type of visualisation. + +### Downloading table data + +The table view can be downloaded as a CSV. This file is generated by the frontend, using the data it already has available. + +### Downloading full data + +Some visualisations base their result on a sample of documents to limit computation time, but offer the user an option to download statistics for the full data. + +This happens for the term frequency visualisation and the ngram visualisation. + +For these downloads, a request is sent to the backend and handled asynchronously - similar to the scheduled download. When the user downloads the file, they can choose the encoding, and also pick between long and wide format. + +## Downloads in the database + +The [`Download` model](../backend/download/models.py) is used to keep track of a user's downloads. + +The table includes all search results downloads, and full data downloads for visualisations. It does not include other visualisation downloads, as those are generated in the frontend. + +## Download limit + +Each user account has a download limit. By default, this is 10.000 documents. You can set this in the admin site, to allow individual users to download more documents. + +Use this with caution on production servers. Note that the server may also have request timeouts that will effectively prevent users from being able to download large files, even if they are allowed to generate them. 
diff --git a/documentation/Frontend-deep-routing-and-state-management.md b/documentation/Frontend-deep-routing-and-state-management.md index 2aa5eac0c..7ef7bec77 100644 --- a/documentation/Frontend-deep-routing-and-state-management.md +++ b/documentation/Frontend-deep-routing-and-state-management.md @@ -1,6 +1,12 @@ # Frontend: deep routing and state management -Complex state management, especially state management that uses deep routing, is preferably handled through models. Typically, a model is just a normal class that you can instantiate in a component: +State management in the frontend can get messy very quickly. I-analyzer captures information about the state of the frontend in the query parameters in the URL. This allows users to bookmark queries, visualisations, etc., but can make state management more complicated. This document gives a broad introduction to how we approach state management and synchronising query parameters in the frontend. + +## Using models + +Complex state management, especially state management that uses deep routing, is preferably handled through models. A model is essentially just an abstraction of the behaviour of the application. It may contain variables, observables, functions, etc. to "model" that behaviour and keep an internal state. + +At its simplest, a model is just a class that you can instantiate in a component: ```typescript class MyModel { @@ -8,7 +14,7 @@ class MyModel { } ``` -A _model_ handles data, API calls, and the state of the user's workflow, and is kept separate from Angular components. Components act as _views_ and/or _controllers_, and their script is aimed at translating between the UI and the model. +A model handles data, API calls, and the state of the user's workflow, and is kept separate from Angular components, directives, and pipes. The job of a component/directive/pipe is to translate between the user interface and the model. 
 This keeps components lightweight and makes them more flexible: because the core data is implemented independently, you can split a component, swap a checkbox for a dropdown, etc. without messing up the data. @@ -16,9 +22,11 @@ This keeps components lightweight and makes them more flexible: because the core Usually, a model is created in the highest component that needs it, by calling `new MyModel()` - either when the component is created or when the model becomes relevant. You can share an instance of a model between components by providing it as input. +If this results in cumbersome input chains, you can also use [dependency injection](https://v16.angular.io/guide/dependency-injection) to provide models. You can make a model `Injectable` to provide it via dependency injection; for example, the [AuthService](../frontend/src/app/services/auth.service.ts) is used to model the user's authentication, and the [DropdownService](../frontend/src/app/shared/dropdown/dropdown.service.ts) is used to model the state of a single dropdown. Alternatively, you can provide a service that exposes a model, such as the [CorpusService](../frontend/src/app/services/corpus.service.ts) which components can use to access the `Corpus` model of the active corpus. + ### Dependency injection: providing services to models -More complex models may require services to make API calls, or listen to the global state of the application. +More complex models may require application services to make API calls, or listen to the global state of the application. In those cases, provide the service as a private member variable in the constructor of the model: @@ -32,7 +40,7 @@ class MyModel { } ``` -When you create the model in a component (or directive), the required services should be required to the component through dependency injection. +If the model instance is not provided through dependency injection, you will need to inject the dependencies in the component that creates the model. 
For example: ```typescript @Component{ @@ -50,43 +58,48 @@ class MyComponent { } ``` -### Dependency injection: providing models +## Route parameters: StoreSync + +A simple model can just use internal variables to track whatever state it needs to. However, if the model is tracking some query parameters in the route, state management becomes more complex. -In some cases, it may make sense to add an `@Injectable` decorator to the model so it can be provided through dependency injection. This makes sense when, in your injection context, you'll have a clear sense of "the" instance that can be provided, so you're not handling multiple instances. Making a model available through DI also requires that all arguments used in the constructor are also available through DI. +Many different parts of the application have *something* to do with the route and depend on each other. Binding the state of a model to the route can also create non-responsive controls or feedback loops if not handled correctly. -The advantage is that: -- the arguments to the the model are automatically injected -- it can simplify the lifecycles of components/directives because the model is immediately available in the constructor, rather than the first `ngOnInit`/`ngOnChanges` call. +To handle route synchronisation, models can extend the [`StoreSync` class](/frontend/src/app/store/store-sync.ts). Objects in this class connect to a `Store` that will keep track of their parameters. -However, we don't currently apply this anywhere. +A "store" is essentially an abstraction for "a place that stores information". The query parameters in the route are a type of store, but an object in memory can also function as a store. Different types of stores are explained in more detail below. As an abstract concept, it has a few important properties: -## Storing model states +- It is a [key-value store](https://en.wikipedia.org/wiki/Key%E2%80%93value_database). 
A model will always work with specific keys in the store, and ignore everything else. +- Multiple models can be connected to the same store, allowing each model to store some information in it. This describes the behaviour we want to implement: multiple parts of the application synchronising their state with the route parameters. +- Because the store construction is primarily designed for route parameters, the *values* in the store are typically short and simple. They're parameters, not data. This isn't enforced on a technical level, but it's useful in understanding how stores are used. -To unify the state management of models, models should extend the [`StoreSync` class](/frontend/src/app/store/store-sync.ts.ts). Objects in this class connect to a `Store` that will keep track of their parameters. +If two models connect to the same store but their parameters do not overlap, they will act independently. If their parameters overlap, the state of those parameters will be synchronised between the models. -Stores keep track of data that is updated over time, and are explained in more detail below. You can connect multiple models to the same store. (Usually from different classes.) 
+### Implementation A minimal implementation of a stored model would look something like this: ```typescript +// the type of the model's internal state interface MyState { foo: string } class MyModel extends StoreSync { - keysInStore = ['foo']; + keysInStore = ['foo']; // lists the keys in the store that the model connects to constructor(store: Store) { - super(store); - this.connectToStore(); + super(store); // instantiate the parent class + this.connectToStore(); // subscribe to the store } + // translates the representation in the store to the internal state of the model storeToState(params): MyState { return { foo: params['foo'] || '' }; } + // translates the internal state of the model to the representation in the store stateToStore(state: MyState) { return { foo: state.foo || null @@ -95,10 +108,10 @@ class MyModel extends StoreSync { } ``` -The `StoreSync` class provides the main ways in which you can interact with the model: +As a child of the `StoreSync` class, `MyModel` will now include: - a BehaviorSubject `state$` which tracks the latest state of the model based on the store. - a method `setParams()` which updates some or all of the properties in the model's state by sending an update to the store. -- a method `complete()`: the model will stop observing the store and reject any further updates. It will also send an update to the store to reset its own state. +- a method `complete()`: the model will stop observing the store and reject any further updates. It will also send an update to the store to reset its parameters. See [store-sync.ts](/frontend/src/app/store/store-sync.ts) for the exact specification. 
 In practice, the methods can be used like this: @@ -110,9 +123,10 @@ See [store-sync.ts](/frontend/src/app/store/store-sync.ts) for the exact specifi class MyComponent implements OnInit, OnDestroy { myModel: MyModel; + constructor(private routerStoreService: RouterStoreService) { } + ngOnInit() { - const store = new SimpleStore(); - this.myModel = newModel(store); + this.myModel = new MyModel(this.routerStoreService); console.log(this.myModel.state$.value); // { 'foo': '' } @@ -134,38 +148,10 @@ Notes: The methods `storeToState` and `stateToStore` have to be implemented on the model class. They translate between stored strings and whatever is more convenient as an internal state. This is trivial in the example above, but often comes in handy. These functions must be each other's inverse. There should be unit tests to confirm this. -Note that the constructor of `MyModel` calls the method `connectToStore`. This initialises the `state$` observable based on the current state of the store, and creates a subscription to the store. You should call this method in the constructor. It's not called in the constructor of `StoreSync` because you may want to set some properties specific to your model before your call it (`connectToStore` uses `storeToState` to set the initial state). +Note that the constructor of `MyModel` calls the method `connectToStore`. This initialises the `state$` observable based on the current state of the store, and creates a subscription to the store. You should call this method in the constructor. It's not called in the constructor of `StoreSync` because you may want to set some properties specific to your model before you call it (`connectToStore` uses `storeToState` to set the initial state). `keysInStore` specifies the specific keys in the store's state that the model interacts with. The model will only listen to changes in those keys, and will reset them when it is completed. 
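
The requirement that `storeToState` and `stateToStore` are each other's inverse can be checked with a small round-trip test. The sketch below is only an illustration: it copies the two conversion functions from the `MyModel` example into standalone functions so that it runs without Angular or the `StoreSync` class, and the names `StoredParams` and `roundTrips` are hypothetical helpers that do not exist in the codebase.

```typescript
// Standalone copies of the conversion functions from the MyModel example,
// so this sketch runs without Angular or the StoreSync class.
interface MyState {
    foo: string;
}

// Hypothetical type for the stored representation (assumption for this sketch).
type StoredParams = { [key: string]: string | null | undefined };

// translates the representation in the store to the internal state
function storeToState(params: StoredParams): MyState {
    return { foo: params['foo'] || '' };
}

// translates the internal state to the representation in the store
function stateToStore(state: MyState): StoredParams {
    return { foo: state.foo || null };
}

// Round trip: state -> store -> state should give back an equal state.
function roundTrips(state: MyState): boolean {
    const restored = storeToState(stateToStore(state));
    return restored.foo === state.foo;
}

// Check both a non-empty state and the empty (reset) state.
const results = [{ foo: 'bar' }, { foo: '' }].map(roundTrips);
```

In a real unit test, the same comparison would be made on the model instance itself, using whatever equality check suits the shape of the model's state.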
-### Using StoreSync as a base class for components or directives - -It is technically possible to use `StoreSync` as a parent class for a component or directive, rather than a data model. That will look something like this: - -```typescript -type Data = { foo: string }; - -@Component{ - selector: 'my-component', - templateUrl: './my-component.component.html', -} -class MyComponent extends StoreSync implements OnDestroy { - - constructor( - routerStoreService: RouterStoreService, - ) { - super(routerStoreService); - this.connectToStore(); - } - - ngOnDestroy() { - this.complete(); - } -} -``` - -This isn't recommended, as it suggests your component is handling significant state management that would be more maintainable if were outfactored to a model (for the reasons describe in the first section). - ## Stores A store keeps track of the states of one or more models. There are two store classes: @@ -178,6 +164,36 @@ Crucially, `StoreSync` models don't care which store class you use: they impleme - `currentParams()`: get the current state synchronously, rather than as an observable - `paramUpdates$`: an observer to which updates are pushed -The state of a store is always an object. Each model class listens to a pre-defined list of keys in this object. If you set the value of a key to `null`, it will be removed. +The state of a store can be represented as a key-value object. Each model class listens to a pre-defined list of keys in this object. If you set the value of a key to `null`, it will be removed. Because the `RouterStoreService` stores data in the address bar of the browser, the value of stored keys is always a string. So if you call `store.paramUpdates$.next({a: 5})`, the `params$` observable will return the value as `{a: '5'}`. The `SimpleStore` mimics this behaviour for consistency. 
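
As a rough illustration of the string conversion and `null` removal described above, the following sketch applies updates to a plain parameter object. The `applyUpdate` function and the `Params` and `ParamUpdate` types are hypothetical names for this example; they are not part of `SimpleStore` or `RouterStoreService`, whose actual implementations may differ.

```typescript
// Hypothetical illustration; not the actual store implementation.
type Params = Record<string, string>;
type ParamUpdate = Record<string, unknown>;

// Merge an update into the current parameters, mimicking how values
// look after a round trip through the URL query string.
function applyUpdate(current: Params, update: ParamUpdate): Params {
    const next: Params = { ...current };
    for (const [key, value] of Object.entries(update)) {
        if (value === null) {
            delete next[key]; // a null value removes the key
        } else {
            next[key] = String(value); // other values are stored as strings
        }
    }
    return next;
}

// Setting a number stores it as a string.
const afterSet = applyUpdate({}, { a: 5, b: 'text' });
// Setting a key to null removes it; other keys are left alone.
const afterReset = applyUpdate(afterSet, { a: null });
```

This is why `storeToState` implementations should expect strings (or missing keys) as input, and parse numbers, dates, and the like themselves.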
+ +## What to store and where to store it + +Keep in mind that the primary purpose of the store system is to handle route parameters, and the primary purpose of route parameters is reproducibility for researchers. Just because a model has some notion of a "state" does not mean it needs to use a `Store`. + +If a model does use the route, it may make sense that the model is tracking *some* keys in the store, but also keeps some extra variables or observables to handle things that don't need to be represented in the route, and which will reset when you refresh the page. + +### `RouterStoreService` and `SimpleStore` + +In practice, models often use the `RouterStoreService` when they are being used in components, but in a unit test, you substitute the `SimpleStore` for an easier setup. + +However, some models, like the `QueryModel`, are used with both store types during runtime. The `QueryModel` is instantiated with the `RouterStoreService` when it concerns the main query made by the user. But if we want to construct a query to generate a link, run a request with an extra filter, etc., we can instantiate a query model with the `SimpleStore`, which will not be synchronised with the route. + +This is an important reason why stores are separated from models, instead of being built into them. + +## Testing store-synced models + +As mentioned above, `StoreSync` models are typically instantiated with the `RouterStoreService` during runtime, but you can use the `SimpleStore` during testing. The [tests for the `SearchTabs` model](../frontend/src/app/search/search-tabs.spec.ts) are a minimal example of such tests. + +Examples of tests: + +- For a possible `state`, assert that `model.storeToState(model.stateToStore(state))` equals `state`. +- Initialise the model with an empty store and check the initial state. +- Initialise the model with a non-empty store and check the initial state. This simulates loading the page from a link with query parameters. 
+- Try calling a method of the model that should update the state, and check the effect. +- Update the store directly, and check that the model reflects it. This simulates what happens when the user uses back/forward navigation in the browser, or another model updating the same parameter. (The latter is not always applicable, but the former is.) + +The purpose of these tests is to verify the model's conversion between its internal state and the store, and the ways in which the model is meant to react to changes in the parameters. + +For an individual model, you do *not* need to test the core logic of the `StoreSync` and `Store` classes, such as whether the model actually ignores other keys in the store, or whether `SimpleStore` and `RouterStoreService` are compatible. (Those classes have their own unit tests; feel free to expand those, of course.) diff --git a/documentation/Frontend-environment-settings.md b/documentation/Frontend-environment-settings.md new file mode 100644 index 000000000..8c9688b3f --- /dev/null +++ b/documentation/Frontend-environment-settings.md @@ -0,0 +1,92 @@ +# Frontend environment settings + +The frontend contains an `environment.ts` file that can be used to edit settings for the specific environment. + +## List of settings + +### `production` + +Type: boolean + +This will [enable production mode](https://v16.angular.io/api/core/enableProdMode) for Angular. + +### `appName` + +Type: string + +The name of the application that should be shown to users. This is used in page titles and the like. + +### `aboutPage` + +Type: string + +Different servers typically require different about pages. We keep several pages in [`/frontend/src/assets/about/`](/frontend/src/assets/about/); this setting determines which file is used. + +The name must match the filename without a path or file extension. + +### `apiUrl` + +Type: string + +The URL to the backend API endpoint. 
If the backend is served on the same domain as the frontend, you can use an absolute path (e.g. `/api/`) instead of a full URL. Relative paths (e.g. `api/`) are not supported. + +### `adminUrl` + +The URL to the Django admin site. See the documentation for `apiUrl`. + +### `samlLogoutUrl` + +The URL to the page where SAML users can log out. See the documentation for `apiUrl`. + +### `showSolis` + +Type: boolean + +Whether to show the option for SAML login in the login or registration form. + +### `runInIFrame` + +Type: boolean + +Set to `true` if this instance is intended to be embedded in an iframe, rather than visited directly. + +This will affect the styling of the site; the main change is that the main navigation and footer will be hidden, which allows the site to fit into a different page. Note that this also limits options for users to navigate the site. + +The effect is purely aesthetic. It does not adjust any server-side configurations, e.g. to set the `X-Frame-Options` header. + +### `directDownloadLimit` + +Type: number + +Sets the cutoff point between [direct downloads and scheduled downloads](./Downloads.md#downloading-search-results). + +### `version` + +Type: string + +Sets the semantic version of I-analyzer that is displayed in the footer. + +You could set this manually, but in most cases, you will import it from `version.ts`. That file is updated when you build the frontend, based on the version number in [`package.json`](../package.json). See [Making a release](./Making-a-release.md). + +### `sourceURL` + +Type: string + +The URL to the source repository, which is linked in the footer. + +Change this if you create a fork of I-analyzer. + +### `logos` + +Can be used to add additional logos to the page footer. + +Type: either `undefined` or an array of objects. 
Each item must match the following interface: + +```ts +interface Logo { + path: string, // URL of the image source + url: string, // URL that the image should link to + alt: string, // alt text for the image + width: number, // width of the image in pixels +} +``` diff --git a/documentation/Frontend-models.md b/documentation/Frontend-models.md new file mode 100644 index 000000000..2b9fa7e1f --- /dev/null +++ b/documentation/Frontend-models.md @@ -0,0 +1,64 @@ +# Frontend: models + +This document covers some of the core models that make up the frontend. + +I-analyzer has [a general system for route parameter management](./Frontend-deep-routing-and-state-management.md) which many of these models use. + +## QueryModel + +The QueryModel is an essential model to the functionality of I-analyzer. In essence, a QueryModel represents a query, which defines a set of documents in a corpus. Different parts of the application build on queries to represent pages of results, visualisations, etc. + +When the frontend creates an [API query](./Query-api.md) for a request, it typically uses a QueryModel to generate a [compound query clause](https://www.elastic.co/guide/en/elasticsearch/reference/current/compound-queries.html) that defines which documents to select. + +The internal structure of the query model is based on the query interface that the application offers to users. It consists of an optional query string (using [simple query string syntax](https://www.elastic.co/guide/en/elasticsearch/reference/8.11/query-dsl-simple-query-string-query.html#simple-query-string-syntax)), a list of fields on which the query string should be applied, and a list of filters. See the [elasticsearch documentation about query context and filter context](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html). + +### Using the query model + +The query model controls a lot of variables (the query string, search fields, and the setting for each filter). 
Modules that represent the *outcome* of a query (documents, statistics, etc.) observe the `update` subject on the query model. This is a single subject that signals when the query has changed in some way that affects the results. + +If a module is representing the results of a query, it should always use the `update` subject and not subscribe to individual parameters. That way, if we ever need to change the parameters of the query model, other modules will continue functioning as normal. + +Some modules may require an adjusted version of the query. For instance, the multiple choice filter fetches the options for a keyword field, but that means it needs to ignore an active filter on the field itself. In such cases, you can use the `clone()` method of the query model to create a copy that can be adjusted without affecting the original. + +In other cases, it makes sense to construct a new query model that is not based on the "main" query of the user. For instance, depending on the corpus settings, documents can include a link to their "context". To do this, you can simply create a new `QueryModel` instance and use the `toQueryParams()` method to generate a link. + +(When cloning a query model or creating a new one, the new model should not use the same store as the main query, as that would synchronise the models.) + +### Search filters + +The query model includes an array of `filters`; each is an object that implements the `FilterInterface`. The concrete class of the filter is based on the type of data it controls. + +- If a field has a filter widget, it will have a `RangeFilter`, `DateFilter`, etc. suited to the type of widget. +- Some corpus fields don't show a filter widget by default, but still support filtering on a specific value. The "document context" system makes use of this feature. These fields are also tracked in the model as `AdHocFilter`s. +- Finally, a query may also include a `TagFilter` which allows the user to sort on their assigned tags.
+ +The route parameters will store a filter's value if it is active, and nothing if it is inactive. However, filters also have a "toggle" option, which allows the user to turn off an active filter but still retain its value. + +For the route parameters, there is no distinction between deactivating a filter or resetting it: in both cases, the filter is not used, so the distinction does not matter for reproducing results. This means that the translation between the internal state of the filter and the route parameters is not exactly one-to-one. + +## Results + +It's a common situation that you need a model that will: + +- observe a query model +- keep track of a few additional parameters +- request some data based on the query model and its own parameters + +The [`Results` class](../frontend/src/app/models/results.ts) provides the basic logic to do this, so it's a suitable base for fetching search results, visualisation data, or statistics based on the query. + +A typical example is the [`FrequentWordsResults` class](../frontend/src/app/models//results.ts) which fetches the most common words in a text field. This is used to generate the wordcloud visualisation. + +Like any model connected to a store, a `Results` model must implement `stateToStore` and `storeToState`. This translates the model's own parameters to their representation in the store. + +In addition, the model must implement a `fetch` method. This describes how to get the latest result based on the query and parameters. In a typical scenario, `fetch` makes a request to the backend and returns the response. + +The `Results` class is responsible for knowing when to fetch new results, switching to the latest result, catching errors, and multicasting result observables. To keep track of the results, you can observe the `result$`, `error$` and `loading$` observables on the model. + +If you subscribe to any of these observables, `fetch` will be called every time the query or parameters update. 
(If there are no observers, nothing is fetched.) If you want to cache results, you'll usually need to store them as class members. For instance, the [`MapDataResults` class](../frontend/src/app/models/map-data.ts) caches the coordinates for the centre of the map, so it is not refetched every time. + +## Found document + +The [`FoundDocument` class](../frontend/src/app/models/found-document.ts) represents a single document in the corpus. (It's not called `Document` to avoid confusion with the DOM interface.) + +The core usage is just to display the values of the document, but the class also includes some methods to view and assign tags, link to the context of the document, view annotations, etc. + diff --git a/documentation/Indexing-on-server.md b/documentation/Indexing-on-server.md index 53fb3d92b..3a613ff85 100644 --- a/documentation/Indexing-on-server.md +++ b/documentation/Indexing-on-server.md @@ -46,7 +46,7 @@ Note that removing an alias does not remove the index itself, but removing an in ## Indexing from multiple corpus definitions If you have separate datasets for different parts of a corpus, you may combine them by setting the `ES_INDEX` variable in the corpus definitions to the same `overarching-corpus` index name. -Then, you can define multiple corpora in the deployment module, e.g., +Then, you can define multiple corpora in the deployment settings module, e.g., ``` CORPORA = { 'corpus1': 'path/to/corpus1', diff --git a/documentation/Making-a-release.md b/documentation/Making-a-release.md index b1d0aa943..6971d41bf 100644 --- a/documentation/Making-a-release.md +++ b/documentation/Making-a-release.md @@ -53,6 +53,6 @@ The release notes should include: ## Deploy on production -Check your list of configuration changes and update the deployment module, if needed. +Check your list of configuration changes and update the deployment configuration, if needed. Deploy the `master` branch on the production server. 
diff --git a/documentation/Notes-for-development.md b/documentation/Notes-for-development.md index a590e5f47..64f47c387 100644 --- a/documentation/Notes-for-development.md +++ b/documentation/Notes-for-development.md @@ -13,17 +13,16 @@ The above steps do not actually install the package; you can do this at any stag ## Testing -Backend tests exist in the `backend` directory. They are typically located in a `tests` subdirectory of the module they apply to. Run tests by calling `pytest` (or `python -m pytest`) from `/backend`. Assess code coverage by running `coverage run --m py.test && coverage report`. +### Backend -When writing new backend tests, you can use the fixtures in the `conftest.py` for the module. For example, in the `api` module, you can do the following in order to test a view. +Backend tests exist in the `backend` directory. They are typically located in a `tests` subdirectory of the package they apply to. Run tests by calling `pytest` (or `python -m pytest`) from `/backend`. Assess code coverage by running `coverage run --m py.test && coverage report`. -```py -def test_some_view(client): - response = client.get('/some/route') - assert response.status_code == 200 - # etcetera -``` +When writing new backend tests, you can use the fixtures in the `conftest.py` for the package. [`backend/conftest.py`](../backend/conftest.py) defines fixtures for the whole project, including some that are used automatically. -For further details, consult the source code in `conftest.py` of the module. +For example, the project conftest defines an `auth_user` fixture that creates a user account; this is widely used to test authentication and user data. The [conftest for the `tag` app](../backend/tag/conftest.py) includes a fixture `auth_user_tag` that creates a tag for the user, which is a useful starting point for many of the tests in this app, but not used elsewhere in the project. + +Some backend tests require Elasticsearch.
If the backend cannot connect to Elasticsearch during testing, these tests will be skipped. (So if you see a lot of skipped tests in the test output, it's because Elasticsearch isn't available.) + +### Frontend Tests are also available for the `frontend`, they should be run from that directory using Angular. Frontend tests can be run with `yarn test-front`. diff --git a/documentation/Overview.md b/documentation/Overview.md index 0c1750b09..09b37eeed 100644 --- a/documentation/Overview.md +++ b/documentation/Overview.md @@ -2,63 +2,67 @@ The application consists of a backend, implemented in [Django](https://www.djangoproject.com/) and a frontend implemented in [Angular](https://angular.io/). -## Directory structure +## Backend -The I-analyzer backend (`/backend`) is a python/Django app that provides the following functionality: - -- A 'users' module that defines user accounts. - -- A 'corpora' module containing corpus definitions and metadata of all corpora that are defined in Python. (Corpora can also be defined as database objects.) For each Python corpus added in I-analyzer, this module defines how to extract document contents from its source files and sets parameters for displaying the corpus in the interface, such as sorting options. - -- An 'addcorpus' module which manages the functionality to extract data from corpus source files (given the definition) and save this in an elasticsearch index. Source files can be XML or HTML format (which are parsed with `beautifulsoup4` + `lxml`) or CSV. This module also provides the basic data structure for corpora. - -- An 'es' module which handles the communication with elasticsearch. 
The data is passed through to the index using the `elasticsearch` package for Python (note that `elasticsearch-dsl` is not used, since its [documentation](https://elasticsearch-dsl.readthedocs.io/en/latest) at the time seemed less immediately accessible than the [low-level](https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html) version). - -- An 'api' module that that enables users to search through an ElasticSearch index of a text corpus and stream search results into a CSV file. The module also performs more complex analysis of search results for visualisations. - -- A 'visualizations' module that does the analysis for several types of text-based visualisations. - -- A 'downloads' module that collects results into csv files. - -- A 'wordmodels' module that handles functionality related to word embeddings. - -`ianalyzer/frontend` is an [Angular 13](https://angular.io/) web interface. - -# Backend - -The backend has three responsibilities: +The backend has three core responsibilities: - managing the database containing information on users and access rights to corpora - communication with Elasticsearch - analysis tasks -`ianalyzer/settings.py` is the source of truth for all kinds of settings and variables. Local envirnment settings can be added in `ianalyzer/settings_local.py`, which will be imported in the settings file. 
+### Packages + +The backend consists of the following packages: + +- `es`: core functionality for working with elasticsearch +- `addcorpus`: models and core functionality for corpus definitions +- `indexing`: create an index for a corpus +- `users`: user models and management +- `tag`: allow users to tag documents +- `api`: parse queries, task-related views, search history +- `visualization`: compute visualisation data +- `download`: generate and manage CSV downloads +- `media`: return media attachments for documents +- `corpora`: corpus definitions +- `corpora_test`: corpus definitions for unit tests +- `ianalyzer`: central project app + +The following graph defines a rough map of the dependencies between packages. + +```mermaid +graph TD; + es-->addcorpus; + addcorpus-->users; + addcorpus-->wordmodels; + addcorpus-->media; + tag-->api; + visualization-->download; + users-->tag; + api-->visualization; + addcorpus-->indexing; + media-->corpora; + media-->corpora_test; +``` + +To avoid circular imports, it's preferred that you follow the directions laid out in the graph. For example, if you want to write a function that will have something to do with tags and visualisations, add a module in `visualizations` that imports modules from `tag`, rather than vice versa. ### Database -The database uses postgreSQL. The top-level `manage.py` provides commands to run migrations from the command line (see also README.md). The tables of the databaseare defined `models.py` files for each app in the backend. The location of the database and access credentials are provide in `ianalyzer/settings.py` (or `settings_local.py`). - -The following models are important: `CustomUser` add additional fields to Django's `User` model. Users are linked to `Group`s, which gives them access to a collection of corpora. Therefore, `Group` has a foreign key (many-to-many) with `Corpus`, which corresponds to the names of corpora a user can access. 
-In addition, the `Query` and `Download` models are used for the user's search and download history, respectively. +The database uses postgreSQL. If you're not familiar with database management in a Django project, see the [Django documentation on database models](https://docs.djangoproject.com/en/5.1/topics/db/models/). -Django provides an admin intervace for the application: Users and Groups objects can be created and edited, and Corpora can be linked to Groups. This interface lives next to the frontend provided through Angular, and is only accessible to staff users. +The SQL database is used to handle user-related data and corpus metadata. It does *not* store the contents of corpora; this is handled through Elasticsearch. ### Elasticsearch -The backend provides functionality to make an Elasticsearch index through the command line: `manage.py` calls `es_index.py` to do so. `es_index.py` in turn relies on the settings in `ianalyzer/settings.py` of where the corpus definitions and the source data are located. The corpus definitions of already integrated corpora are currently bundled in `backend/corpora`. The corpus definitions can be located anywhere on the filesystem, however. -Currently, the frontend constructs the request body, largely based on an Elasticsearch Simple Query String, and this is forwarded by the backend to Elasticsearch (in `es/views.py`). The response from Elasticsearch is then passed back to the frontend. +I-analyzer uses [Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html) to manage and search the data in corpora. -### Analysis -The backend also contains functions to: -- conduct analysis for the wordcloud, term frequency, ngrams and related words visualisations in the frontend. -- assemble CSV downloads of search results. +Typically, you will use I-analyzer to create, manage, and search an index for a corpus in Elasticsearch. 
(You could also use I-analyzer to search an existing index, but the application is not designed for this.) Where possible, we try to rely on the search and analysis functions built into Elasticsearch, rather than build our own. # Frontend -The core of the frontend application is the `src/app/search/search.component`, providing the template for the user interface. This interface allows users to query results using Simple Query String syntax and offers various filters. The search component can show either the search results component, which shows documents matching the search, or visualisations. +The frontend is an Angular web application. + +The core of the frontend application is the `src/app/search/search.component`, providing the template for the user interface. This interface allows users to query results using Simple Query String syntax and offers various filters. The search component can show either the search results component, which shows documents matching the search, visualisations, or a download menu. There are various visualizations in `src/app/visualization/`, with `visualization.component` as the main component which checks which visualization type is to be displayed. As a rule, visualisations that directly show the results of an aggregation search formulate aggregation request in the frontent, while other visualizations (wordcloudngram, term frequency) let the backend handle analysis. For corpora with word models, the `src/app/word-models/` provides an interface with various visualisations for viewing word similarity. - - -For more about users, authentication and authorization, see [Authentication and authorization](./Authentication-and-authorization.md) diff --git a/documentation/README.md b/documentation/README.md index d4bf977a4..0920bcd67 100644 --- a/documentation/README.md +++ b/documentation/README.md @@ -25,6 +25,7 @@ This directory contains documentation for developers. 
- [Celery](./Celery.md) (used for downloads and visualisations) - [Email](./Email.md) (used for downloads and authentication) +- [Downloads](./Downloads.md) - [Adding word models](./Adding-word-models.md) - [Query API](./Query-api.md) diff --git a/documentation/Writing-a-corpus-definition-in-JSON.md b/documentation/Writing-a-corpus-definition-in-JSON.md index 31cc42312..ecbd5650b 100644 --- a/documentation/Writing-a-corpus-definition-in-JSON.md +++ b/documentation/Writing-a-corpus-definition-in-JSON.md @@ -1,8 +1,10 @@ # Writing a corpus definition in JSON -Database-only corpora support a JSON format for creating corpus definitions. This format is implemented in the backend API of I-analyzer. Like Python definitions, a JSON definition can be used to store and share a configuration for a corpus. +Database-only corpora support a JSON format for creating or editing corpus definitions. Like Python definitions, a JSON definition can be used to store and share a configuration for a corpus. -The format is defined in [corpus.schema.json](/backend/addcorpus/schemas/corpus.schema.json). +The format is defined in [corpus.schema.json](/backend/addcorpus/schemas/corpus.schema.json). You can find an example in the [test JSON definition](../backend/corpora_test/basic/mock_corpus.json). + +We do not (currently) have a guide to writing JSON definitions, though the JSON schema includes descriptions for each field. ## Importing and exporting definitions @@ -11,4 +13,4 @@ You can import and export JSON definitions through the frontend. Visit `/corpus- Some notes on importing and exporting JSON definitions: - A JSON definition is less detailed than the database model. This is because the database model must also support Python corpora (which offer more customisation) and legacy options. If you edit a corpus through the admin, exporting it to JSON and importing it again may include some normalisation. 
-- Some properties of the corpus are not handled through the JSON interface, though they are supported in database-only corpora. Currently, these can only be configured in the admin. These are the corpus image, documentation pages, and data directory. +- Some properties of the corpus are not handled through the JSON interface, though they are supported in database-only corpora. Currently, these can only be configured in the admin. These are the corpus image, documentation pages, and data directory. You can edit these properties in the admin site once you have uploaded the JSON definition. diff --git a/documentation/Writing-a-corpus-definition-in-Python.md b/documentation/Writing-a-corpus-definition-in-Python.md index 6e3b1984f..aae170b64 100644 --- a/documentation/Writing-a-corpus-definition-in-Python.md +++ b/documentation/Writing-a-corpus-definition-in-Python.md @@ -49,7 +49,7 @@ The following attributes are required for a corpus to function. | `max_date` | `datetime.date` | The maximum date for the data - analogous to `min_date`. | | `category` | `str` | The type of data in the corpus. See the [options for categories](/backend/addcorpus/constants.py). | | `languages` | `List[str]` | A list of IETF tags of the languages used in your corpus. Corpus languages are intended as a way for users to select interesting datasets, so only include languages for which your corpus contains a meaningful amount of data. The list should go from most to least frequent. You can also include `''` for "unknown". | -| `es_index` | `str` | The name of the elasticsearch index. In development, the corpus name will do. On a production cluster, you may need to use a particular prefix. | +| `es_index` | `str` | The name of the elasticsearch index. In development, the corpus name will do. On a production cluster, you may need to use a particular prefix. If the name starts with `test-`, the index may be deleted when running unit tests; do this for test corpora, don't do it elsewhere.
| | `data_directory` | `Optional[str]` | Path to the directory containing source files. Always get this from the setttings. You can also set this to `None`; usually because you are getting source data from an API instead of a local directory. | | `fields` | `List[Field]` | The fields for the corpus. See [defining fields](#definining-fields). | diff --git a/frontend/src/environments/environment.ts b/frontend/src/environments/environment.ts index bdd3770b3..844aa0583 100644 --- a/frontend/src/environments/environment.ts +++ b/frontend/src/environments/environment.ts @@ -2,6 +2,8 @@ // The build system defaults to the dev environment which uses `environment.ts`, but if you do // `ng build --env=prod` then `environment.prod.ts` will be used instead. // The list of which env maps to which file can be found in `.angular-cli.json`. + +// see /documentation/Frontend-environment-settings.md for a description of available settings import { version } from './version'; export const environment = {