feat: Insert history and versions endpoint #33

n-dusan · 2024-04-23T14:07:46Z

Closes #17, #24

This PR is the first-pass implementation for inserting data from an RDF git repo into the database. We refer to the type of data in RDF that we work with as changes, versions or historical information about documents or collections. We refer to documents as legal material in a stele. Documents can get organized/categorized into (sub)directories, which we call collections. In addition to tracking changes to documents, we also track changes to collections of documents.

Rationale

The end goal is to implement an API that lists all the significant dates (that we refer to also as versions) that affected legal documents or collections for a stele across history.

Notes

To support the end-goal, this pr's adds:

database access and models the tables necessary for the versions API;
load/parse change information from a stele RDF data repository.

For database access we used sqlx, for parsing/working with the RDF, we used sophia.

Overview

Added initial database SQL tables necessary for implementing the historical versions API.
Added db module that's supposed to be generic over database connections. Currently supports connecting and inserting changes to SQLite database with sqlx. There's also Postgres, because we left open the possibility to easily extend into Postgres. Postgres support is still TBD. If we do decide that only SQLite is the way to go, we'll remove the Postgres code in a separate pr.
The changes module parses the changes into a graph and inserts the data into db. The rdf module has utility methods supposed to reduce boilerplate and make working with the graph easier.
migrations directory has the initial database schema/tables.
models contains sqlx structs that map to database tables.
statements module has SQL inserts and queries modules for SQL statements.

Introduce the /_api/versions endpoint. Idea being we query the database and, depending on the url request, return either all document or collection changes. The changes are mapped onto a response::Versions json response.

To implement in a separate pr

Add integration testing framework that has a small RDF repository fixture. Then we call in the insert history command and assert expected states.
I'm thinking of potentially adding traits that tie database methods to the model in models. Something like a PublicationManager, for example. Using traits could make the inserts and queries more legible.

We'll need to update the insert changes logic again, so I vote to defer this and this issue at that point.

tombh

This looks really good! I can't see any significant issues!

The biggest thing to me is actually the commit history. I think it'd be useful to start getting into the habit of git commit --amend and git rebase --interactive so that as much as possible the commits tell a narrative-like story. I guess you'll be doing a squash rebase to merge this PR? Of course that's fine, but I think it'd be worth bearing in mind how the commit history reads as you're actually in the process of developing the PR. I'm guessing that with the help of --amend and rebase --interactive this PR could probably be represented in something like ~5 commits? Well-constructed commits are in themselves documentation, think of how git blame can add context to future developers (including yourself!) when doing refactors or bug fixing. And of course expressive, story-telling commits can add another medium of context and communication to help the reviewer.

With such habits it can also be good to try to make sure that every commit passes CI. This adds a layer of confidence to each commit, expressing to future developers that reliability of the given code changes. It can also help with merge conflicts, especially those arising from rebased merges, where each commit's conflicts can be more reliably resolved by running the test suite at each step of the rebase.

tombh · 2024-05-09T13:24:21Z

src/db/models/publication.rs

+    #[allow(clippy::unwrap_in_result, clippy::unwrap_used)]
+    fn from_row(row: &AnyRow) -> Result<Self, sqlx::Error> {
+        Ok(Self {
+            name: row.try_get("name").unwrap(),


Is there a reason not to use ? instead of unwrap()?

Was not sure sqlx integrates with anyhow this well, but it worked! Thank you

tombh · 2024-05-09T13:24:45Z

src/db/models/publication.rs

+            date: row.try_get("date").unwrap(),
+            stele: row.try_get("stele").unwrap(),
+            revoked: row.try_get("revoked").unwrap(),
+            last_valid_publication_name: row.try_get("last_valid_publication_name").ok(),


Same here, is ? not possible?

tombh · 2024-05-09T13:25:17Z

src/db/models/publication_version.rs

+    #[allow(clippy::unwrap_in_result, clippy::unwrap_used)]
+    fn from_row(row: &AnyRow) -> Result<Self, sqlx::Error> {
+        Ok(Self {
+            version: row.try_get("version").unwrap(),


? instead of unwrap()?

tombh · 2024-05-09T13:42:24Z

src/db/statements/queries.rs

+        DatabaseKind::Sqlite => {
+            let mut connection = conn.pool.acquire().await?;
+
+            let statement = format!("


Have you seen the sqlx::query_file_as! macro? https://github.com/launchbadge/sqlx/tree/main/examples/postgres/files It allows you to keep SQL statements in .sql files. I think it's something to bear in mind as SQL statements get bigger. It's obviously not critical, but I think being able to write SQL in actual .sql files helps keep SQL sane, because then we get things like editor syntax highlighting and linters like https://github.com/sqlfluff/sqlfluff

Did not see it, that looks awesome. I've got a second pr ready for review hopefully soon, so I'd defer adding separate sql files later. Created an issue.

tombh · 2024-05-09T13:47:34Z

src/history/rdf/graph.rs

+/// Stelae representation of an RDF graph.
+pub struct StelaeGraph {
+    /// The underlying graph.
+    pub g: FastGraph,


What about fast_graph instead of g?

tombh · 2024-05-09T13:51:22Z

src/history/rdf/graph.rs

+        loop {
+            let el_uri = format!("http://www.w3.org/1999/02/22-rdf-syntax-ns#_{i}");
+            let elem_iri = SimpleTerm::Iri(IriRef::new_unchecked(MownStr::from_str(&el_uri)));
+            if self


I think this would be easier to read if it were first put in a variable, like you do with item a few lines below.

Thanks, changed it

tombh · 2024-05-09T13:51:37Z

src/history/rdf/graph.rs

+    pub fn items(&self) -> anyhow::Result<Vec<SimpleTerm>> {
+        let container = &self.uri;
+        let mut i = 1_u32;
+        let mut l_ = vec![];


A more descriptive name might be useful.

tombh · 2024-05-09T13:57:40Z

src/server/publish.rs

+                "error: could not connect to database. Confirm that DATABASE_URL env var is set correctly."
+            );
+            tracing::error!("Error: {:?}", err);
+            process::exit(1);


This is a minor point, but generally we only want one central call to exit() in a program. The idea being that we bubble up all errors with ? and only at some common overarching point in the code do we decide to exit based on the received error. This can get hard with threads and async, etc. But ideally, centralising the exit() can help us avoid some unexpected bugs and generally improive architecture. But don't worry about it too much, it's just something to bear in mind.

Thanks, that makes a lot of sense. I've created an issue for it for now.

tombh · 2024-05-09T13:58:43Z

src/server/publish.rs

@@ -220,7 +251,7 @@ pub fn init_app(
                                "Failed to initialize routes for root Stele: {}",
                                root.get_qualified_name()
                            );
-                            std::process::exit(1);
+                            process::exit(1);


Minor point about centralising exit()

tombh · 2024-05-09T13:58:49Z

src/server/publish.rs

@@ -259,7 +290,7 @@ pub fn init_app(
                                            "Failed to initialize routes for Stele: {}",
                                            guarded_stele.get_qualified_name()
                                        );
-                                        std::process::exit(1);
+                                        process::exit(1);


Minor point about centralising exit()

tombh · 2024-05-09T14:37:46Z

Also, would you be interested in a commit lint? It's really easy to add to CI, add a commitlintrc.yaml following https://commitlint.js.org/reference/configuration.html (Eg https://github.com/openlawlibrary/pygls/blob/main/commitlintrc.yaml) and the Github Action is then a oneliner: npx commitlint --from origin/main --to HEAD --verbose --config commitlintrc.yaml

*use sqlx* Intended for serving historical documents and loading data from Postgres or SQlite *add generic connection over database* - Introduce a `db` module responsible for managing database - Add `connect` function that's generic over SQLite and Postgres *init database and apply migrations* Defaults to local SQLite database if `DATABASE_URL` is missing. After connecting, run any/all migrations that are necessary. *generate initial empty migration* To manage migrations, it's first necessary to install `sqlx-cli`. - `cargo install sqlx-cli`. Then, `sqlx migrate add <name>` to add a new migration. *introduce build.rs build script* Adding new migrations won't trigger cargo recompilation, so sqlx strongly suggests we trigger a recompilation on new migrations. *add top level db module* *connect to db on app start* After adding another field to the `AppState` struct the integration tests started failing. However, for the archive tests we don't really care about having database connection. To resolve, introduce a trait `GlobalState` that `init_app` implements and initializes the archive. This way we're flexibile in our testing framework and can use different structs to initialize the global app state. *create db.sqlite3 if not exists* *add postgres directory* We are not sure whether we'll support Postgres, but add directory

- Add `sophia` dependency for working with the rdf - Add `history` module - Use opaque `Any` type for database This abstracts away the concrete database pool that's used by underlying sqlx. Trade-off is that we need the database-url to differentiate between databases

The purpose of the submodule is to store any/all functions that work with rdf. Currently the underlying implementation relies on `sophia` to parse the graph and retrieve triples. Add `namespaces.rs` module that stores rdf namespaces. Currently stores the `oll` and `dcterms` ontology. I expected the `dcterms` to be supported out-of-the-box, but `sophia` supports only the most frequently used namespaces. *Add sqlx chrono feature* Will be necessary to work with dates in the database. *Add statements module for inserts and queries in db* The idea being that this module stores queries/insert statements for interacting with both databases. *Add chrono as dependency* Necessary because chrono exports serde Serialize and Deserialize traits. We then use these traits to map sqlx queries to structs. *Add models mod for sqlx structs* Created a centralized place for all database related structs. This approach is convenient because the sqlx queries will then serialize/deserialize to these structs using serde. *Rename from upsert to create* Much easier to reason about what the function is supposed to do. *Wrap insert changes into an atomic transaction*

The idea being we iterate over all publications using libgit2, and then for each publication we parse .rdf xml files into an rdf graph. After the graph is loaded, we iterate over document and collection versions from the graph and insert each into db. *Add more sqlx structs for database* *Wrap sophia graph with our StelaeGraph* Goal is to move similar code for working with rdf graph into a separate wrapper struct. This simplifies working with graphs as it hides away the rdf graph results complexity. *Use composite keys instead of autoincrement synthetic keys* *Split functionalities for working with rdf graph on* Since the graph queries will vary on the type of information that we need, the easiest is to create utility functions that return (s,p,o) depending on the queries.

This commit has the following significant changes: *Bulk insert multiple document and collection change objects into db* Idea being we create a query-builder and insert multiple values into an insert statement. On Windows, sqlx throws an error when inserting 4000+ rows at once. To resolve, chunk the rows to 2000 per bulk insert. A dd `BATCH_SIZE` constant, indicating how many rows will get inserted at once. Set to 2000 per insert. *Upgrade rustc compiler version* *Add utility methods for working with the `sophia` graph* RDF elements can be stored in a container, e.g. <rdf:_1 resource="..."> <rdf:_2 resource="..."> To resolve, introduce a `Bag` struct with `items` helper method. The idea being we iterate over all elements inside a container. This logic is similar to rdflib implementation of `items` in Python. To better understand this iteration logic, please see `rdflib/container.py` in rdflib Python. *Add missing sqlx structs to map objects to database rows* *Revoke same date publications* Set a boolean revoked to True for publications that exist on same date. Revoked publications aren't accounted for when querying for versions for a publication. *Support for publication partial update* Re-use the existing insert function, as most of the code is the same. The only difference is that we skip parsing/loading publications which are already inserted in the db. For publication directories which aren't in the DB, we continue inserting changes.

Makes the variables easier to read.

n-dusan · 2024-05-14T09:43:16Z

@tombh Thank you, this has been insightful! Agree about the commit history, so I've brought the PR down from 98 to 7 commits.

Also a commit linter would be amazing. But I'd say we add that in another pr.

I don't think we need to address all of these comments right now, since I have another PR ready for review hopefully soon. I've created a couple of issues [1, 2, 3].

tombh

Just a couple of ok()s to look at. Otherwise it looks good!

tombh · 2024-05-19T20:02:07Z

src/db/models/publication.rs

+            stele: row.try_get("stele")?,
+            revoked: row.try_get("revoked")?,
+            last_valid_publication_name: row.try_get("last_valid_publication_name").ok(),
+            last_valid_version: row.try_get("last_valid_version").ok(),


The .ok()s can be ? as well right?

Looks like sqlx can't map ? to Option in FromRow trait by default. So can't be ? right now :-(

tombh · 2024-05-19T20:02:32Z

src/db/models/publication_version.rs

+            version: row.try_get("version")?,
+            publication: row.try_get("publication")?,
+            stele: row.try_get("stele")?,
+            build_reason: row.try_get("build_reason").ok(),


.ok() to ?

tombh · 2024-05-19T20:02:43Z

src/db/statements/queries.rs

+                .bind(name)
+                .fetch_one(&mut *connection)
+                .await
+                .ok()


.ok() to ?

tombh · 2024-05-19T20:03:05Z

src/db/statements/queries.rs

+                .bind(stele)
+                .fetch_one(&mut *connection)
+                .await
+                .ok()


.ok() to ?

- Introduce a `routing` module, intended as a central place to map all routing in stelae app. Introduced the `_api` scope with the versions endpoint - feat(api): initial versions endpoint - feat(app): call register_api routing function - feat(api): extract stele name from request

This commit introduces a new endpoint in stelae to get all the meaningful dates on which a document or a collection changed. Below is a summary of significant changes: - Add a new `api` module as a centralized place to look at all the endpoints stelae has. - refact: rename publish to app - feat(db): add Manager traits for model The reasoning being it's easier visually figure out which database queries work with which data structure. - feat(db): add version struct - feat(db): implement document and collection version queries Idea being we query for all the versions in which document or a collection changed. - feat(api): get all versions for a publication Executes the database query for document or library versions. - feat(api): map request/response body to structs Idea being it's more legibile to reason about the structure of the api. Rely on serde to serialize/deserialize values from and into json. - Introduce the `requests` and `response` modules for mapping the endpoint response to historical data for presentation - feat(api): support for historical messages In addition to the versions, we also display HTML messages that: i) show on which historical publication the user is currently on, ii) through which dates the currently viewed version was valid from (start and end date). If comparing versions, we show how many changes happened between the compare and version date.

- fix(api): assign Current value to variable after historical messages - refact: separate formatting logic into associated function `format_display_date` - lots of clippy lints

For the API, move the `versions.rs` into `versions/mod.rs`. The two modules for request and response are now in `request/mod.rs` and `response/mod.rs`. Historical messages are in the `response/messages.rs` module

These modules should make the code base more legibile. These changes include: - Moving `serve` current documents endpoint into `server/api/serve` module. This way we'll centralize all future endpoint logic in the `api` module - Centralize routing into `routes` mod. We'll expect to add new 'static' routing, while also keeping our dynamic logic in the same place. - Centralize global app state into `state` mod. - Small updates following the re-org of current codebase refact: rename structs in `state.rs` module to fix linting issues Address clippy lint warnings/pedantic errors related to generics in function arguments

Add `initialize_guarded_dynamic_routes` and `initialize_dynamic_routes` functions which, depending on whether the guarded header exists, initialize the app with or without the guarded header.

For queries which we want to exit out of earlier, use `?`. However, I've left .ok for queries which should continue executing (so expecting a null value), instead of propagating an error to the caller where not expected.

n-dusan · 2024-05-20T14:32:54Z

Thanks for the rebase pr! I've now rebased your #40 into this PR and addressed comments from the other PR and most of the .ok's are now ?. Had to keep some of them, since in some cases we don't really want to early exit the function and propagate the error, so None was expected.

n-dusan added the enhancement New feature or request label Apr 23, 2024

n-dusan self-assigned this Apr 23, 2024

n-dusan force-pushed the ndusan/add-database-access branch from bf11102 to 46b4c8d Compare April 23, 2024 14:09

n-dusan marked this pull request as ready for review April 23, 2024 17:47

n-dusan requested a review from tombh April 23, 2024 18:24

tombh requested changes May 9, 2024

View reviewed changes

n-dusan added 5 commits May 13, 2024 20:40

n-dusan force-pushed the ndusan/add-database-access branch from 6445909 to bffdd23 Compare May 13, 2024 18:41

This was referenced May 14, 2024

Add commit lint #34

Open

Use sqlx::query_file_as! macro #35

Open

Centralize error handling #36

Open

n-dusan added 2 commits May 14, 2024 11:36

refact: use ? with anyhow for FromRow trait

5ea1eda

refact: use more descriptive names

e8f7b81

Makes the variables easier to read.

n-dusan requested a review from tombh May 14, 2024 09:46

n-dusan mentioned this pull request May 15, 2024

feat: Versions API #38

Closed

tombh requested changes May 19, 2024

View reviewed changes

n-dusan added 8 commits May 20, 2024 16:30

fix(api): correctly build the versions response

894f64c

- fix(api): assign Current value to variable after historical messages - refact: separate formatting logic into associated function `format_display_date` - lots of clippy lints

refact: re-organize versions endpoint

676a9d8

For the API, move the `versions.rs` into `versions/mod.rs`. The two modules for request and response are now in `request/mod.rs` and `response/mod.rs`. Historical messages are in the `response/messages.rs` module

feat: add unit tests for historical messages

4400c15

refact(api): split init code into two separate functions

3194010

Add `initialize_guarded_dynamic_routes` and `initialize_dynamic_routes` functions which, depending on whether the guarded header exists, initialize the app with or without the guarded header.

refact(api): use ? instead of .ok where possible

8e1b6a3

For queries which we want to exit out of earlier, use `?`. However, I've left .ok for queries which should continue executing (so expecting a null value), instead of propagating an error to the caller where not expected.

n-dusan requested a review from tombh May 20, 2024 14:33

n-dusan changed the title ~~feat: Insert history~~ feat: insert history and versions endpoint May 20, 2024

n-dusan changed the title ~~feat: insert history and versions endpoint~~ feat: Insert history and versions endpoint May 20, 2024

tombh approved these changes May 21, 2024

View reviewed changes

n-dusan merged commit a3a77bc into main May 21, 2024
4 checks passed

n-dusan deleted the ndusan/add-database-access branch May 21, 2024 11:45

n-dusan mentioned this pull request May 21, 2024

Serve historical documents #24

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Insert history and versions endpoint #33

feat: Insert history and versions endpoint #33

n-dusan commented Apr 23, 2024 •

edited

Loading

tombh left a comment

tombh May 9, 2024

n-dusan May 13, 2024

tombh May 9, 2024

tombh May 9, 2024

tombh May 9, 2024

n-dusan May 14, 2024

tombh May 9, 2024

tombh May 9, 2024

n-dusan May 14, 2024

tombh May 9, 2024

n-dusan May 14, 2024

tombh May 9, 2024

n-dusan May 14, 2024 •

edited

Loading

tombh May 9, 2024

tombh May 9, 2024

tombh commented May 9, 2024

n-dusan commented May 14, 2024

tombh left a comment

tombh May 19, 2024

n-dusan May 20, 2024 •

edited

Loading

tombh May 19, 2024

tombh May 19, 2024

tombh May 19, 2024

n-dusan commented May 20, 2024 •

edited

Loading

feat: Insert history and versions endpoint #33

feat: Insert history and versions endpoint #33

Conversation

n-dusan commented Apr 23, 2024 • edited Loading

Rationale

Notes

Overview

To implement in a separate pr

tombh left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

n-dusan May 14, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tombh commented May 9, 2024

n-dusan commented May 14, 2024

tombh left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

n-dusan May 20, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

n-dusan commented May 20, 2024 • edited Loading

n-dusan commented Apr 23, 2024 •

edited

Loading

n-dusan May 14, 2024 •

edited

Loading

n-dusan May 20, 2024 •

edited

Loading

n-dusan commented May 20, 2024 •

edited

Loading