Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Insert history and versions endpoint #33

Merged
merged 15 commits into from
May 21, 2024
Merged

Conversation

n-dusan
Copy link
Contributor

@n-dusan n-dusan commented Apr 23, 2024

Closes #17, #24

This PR is the first-pass implementation for inserting data from an RDF git repo into the database. We refer to the type of data in RDF that we work with as changes, versions or historical information about documents or collections. We refer to documents as legal material in a stele. Documents can get organized/categorized into (sub)directories, which we call collections. In addition to tracking changes to documents, we also track changes to collections of documents.

Rationale

The end goal is to implement an API that lists all the significant dates (that we refer to also as versions) that affected legal documents or collections for a stele across history.

Notes

To support the end-goal, this pr's adds:

  1. database access and models the tables necessary for the versions API;
  2. load/parse change information from a stele RDF data repository.

For database access we used sqlx, for parsing/working with the RDF, we used sophia.

Overview

  • Added initial database SQL tables necessary for implementing the historical versions API.
  • Added db module that's supposed to be generic over database connections. Currently supports connecting and inserting changes to SQLite database with sqlx. There's also Postgres, because we left open the possibility to easily extend into Postgres. Postgres support is still TBD. If we do decide that only SQLite is the way to go, we'll remove the Postgres code in a separate pr.
  • The changes module parses the changes into a graph and inserts the data into db. The rdf module has utility methods supposed to reduce boilerplate and make working with the graph easier.
  • migrations directory has the initial database schema/tables.
  • models contains sqlx structs that map to database tables.
  • statements module has SQL inserts and queries modules for SQL statements.

Introduce the /_api/versions endpoint. Idea being we query the database and, depending on the url request, return either all document or collection changes. The changes are mapped onto a response::Versions json response.

To implement in a separate pr

  • Add integration testing framework that has a small RDF repository fixture. Then we call in the insert history command and assert expected states.
  • I'm thinking of potentially adding traits that tie database methods to the model in models. Something like a PublicationManager, for example. Using traits could make the inserts and queries more legible.

We'll need to update the insert changes logic again, so I vote to defer this and this issue at that point.

@n-dusan n-dusan added the enhancement New feature or request label Apr 23, 2024
@n-dusan n-dusan self-assigned this Apr 23, 2024
@n-dusan n-dusan marked this pull request as ready for review April 23, 2024 17:47
@n-dusan n-dusan requested a review from tombh April 23, 2024 18:24
Copy link
Contributor

@tombh tombh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good! I can't see any significant issues!

The biggest thing to me is actually the commit history. I think it'd be useful to start getting into the habit of git commit --amend and git rebase --interactive so that as much as possible the commits tell a narrative-like story. I guess you'll be doing a squash rebase to merge this PR? Of course that's fine, but I think it'd be worth bearing in mind how the commit history reads as you're actually in the process of developing the PR. I'm guessing that with the help of --amend and rebase --interactive this PR could probably be represented in something like ~5 commits? Well-constructed commits are in themselves documentation, think of how git blame can add context to future developers (including yourself!) when doing refactors or bug fixing. And of course expressive, story-telling commits can add another medium of context and communication to help the reviewer.

With such habits it can also be good to try to make sure that every commit passes CI. This adds a layer of confidence to each commit, expressing to future developers that reliability of the given code changes. It can also help with merge conflicts, especially those arising from rebased merges, where each commit's conflicts can be more reliably resolved by running the test suite at each step of the rebase.

#[allow(clippy::unwrap_in_result, clippy::unwrap_used)]
fn from_row(row: &AnyRow) -> Result<Self, sqlx::Error> {
Ok(Self {
name: row.try_get("name").unwrap(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason not to use ? instead of unwrap()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was not sure sqlx integrates with anyhow this well, but it worked! Thank you

date: row.try_get("date").unwrap(),
stele: row.try_get("stele").unwrap(),
revoked: row.try_get("revoked").unwrap(),
last_valid_publication_name: row.try_get("last_valid_publication_name").ok(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, is ? not possible?

#[allow(clippy::unwrap_in_result, clippy::unwrap_used)]
fn from_row(row: &AnyRow) -> Result<Self, sqlx::Error> {
Ok(Self {
version: row.try_get("version").unwrap(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

? instead of unwrap()?

DatabaseKind::Sqlite => {
let mut connection = conn.pool.acquire().await?;

let statement = format!("
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you seen the sqlx::query_file_as! macro? https://github.com/launchbadge/sqlx/tree/main/examples/postgres/files It allows you to keep SQL statements in .sql files. I think it's something to bear in mind as SQL statements get bigger. It's obviously not critical, but I think being able to write SQL in actual .sql files helps keep SQL sane, because then we get things like editor syntax highlighting and linters like https://github.com/sqlfluff/sqlfluff

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did not see it, that looks awesome. I've got a second pr ready for review hopefully soon, so I'd defer adding separate sql files later. Created an issue.

/// Stelae representation of an RDF graph.
pub struct StelaeGraph {
/// The underlying graph.
pub g: FastGraph,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about fast_graph instead of g?

loop {
let el_uri = format!("http://www.w3.org/1999/02/22-rdf-syntax-ns#_{i}");
let elem_iri = SimpleTerm::Iri(IriRef::new_unchecked(MownStr::from_str(&el_uri)));
if self
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be easier to read if it were first put in a variable, like you do with item a few lines below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, changed it

pub fn items(&self) -> anyhow::Result<Vec<SimpleTerm>> {
let container = &self.uri;
let mut i = 1_u32;
let mut l_ = vec![];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more descriptive name might be useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

"error: could not connect to database. Confirm that DATABASE_URL env var is set correctly."
);
tracing::error!("Error: {:?}", err);
process::exit(1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a minor point, but generally we only want one central call to exit() in a program. The idea being that we bubble up all errors with ? and only at some common overarching point in the code do we decide to exit based on the received error. This can get hard with threads and async, etc. But ideally, centralising the exit() can help us avoid some unexpected bugs and generally improive architecture. But don't worry about it too much, it's just something to bear in mind.

Copy link
Contributor Author

@n-dusan n-dusan May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that makes a lot of sense. I've created an issue for it for now.

@@ -220,7 +251,7 @@ pub fn init_app(
"Failed to initialize routes for root Stele: {}",
root.get_qualified_name()
);
std::process::exit(1);
process::exit(1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor point about centralising exit()

@@ -259,7 +290,7 @@ pub fn init_app(
"Failed to initialize routes for Stele: {}",
guarded_stele.get_qualified_name()
);
std::process::exit(1);
process::exit(1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor point about centralising exit()

@tombh
Copy link
Contributor

tombh commented May 9, 2024

Also, would you be interested in a commit lint? It's really easy to add to CI, add a commitlintrc.yaml following https://commitlint.js.org/reference/configuration.html (Eg https://github.com/openlawlibrary/pygls/blob/main/commitlintrc.yaml) and the Github Action is then a oneliner: npx commitlint --from origin/main --to HEAD --verbose --config commitlintrc.yaml

*use sqlx*

Intended for serving historical documents and loading data from Postgres
or SQlite

*add generic connection over database*

- Introduce a `db` module responsible for managing database
- Add `connect` function that's generic over SQLite and Postgres

*init database and apply migrations*

Defaults to local SQLite database if `DATABASE_URL` is missing.
After connecting, run any/all migrations that are necessary.

*generate initial empty migration*

To manage migrations, it's first necessary to install `sqlx-cli`.
- `cargo install sqlx-cli`.

Then, `sqlx migrate add <name>` to add a new migration.

*introduce build.rs build script*

Adding new migrations won't trigger cargo recompilation,
so sqlx strongly suggests we trigger a recompilation on new migrations.

*add top level db module*

*connect to db on app start*

After adding another field to the `AppState` struct the integration tests started failing.
However, for the archive tests we don't really care about having database connection.
To resolve, introduce a trait `GlobalState` that `init_app` implements and initializes the archive.
This way we're flexibile in our testing framework and can use different structs to initialize the global app state.

*create db.sqlite3 if not exists*

*add postgres directory*

We are not sure whether we'll support Postgres, but add directory
- Add `sophia` dependency for working with the rdf
- Add `history` module
- Use opaque `Any` type for database

  This abstracts away the concrete database pool that's used by underlying sqlx.
  Trade-off is that we need the database-url to differentiate between databases
The purpose of the submodule is to store any/all functions that work
with rdf. Currently the underlying implementation relies on `sophia` to
parse the graph and retrieve triples.

Add `namespaces.rs` module that stores rdf namespaces. Currently stores
the `oll` and `dcterms` ontology. I expected the `dcterms` to be
supported out-of-the-box, but `sophia` supports only the most frequently
used namespaces.

*Add sqlx chrono feature*

Will be necessary to work with dates in the database.

*Add statements module for inserts and queries in db*

The idea being that this module stores queries/insert
statements for interacting with both databases.

*Add chrono as dependency*

Necessary because chrono exports serde Serialize and Deserialize traits.
We then use these traits to map sqlx queries to structs.

*Add models mod for sqlx structs*

Created a centralized place for all database related structs.
This approach is convenient because the sqlx queries will then
serialize/deserialize to these structs using serde.

*Rename from upsert to create*

Much easier to reason about what the function is supposed to do.

*Wrap insert changes into an atomic transaction*
The idea being we iterate over all publications using libgit2, and then
for each publication we parse .rdf xml files into an rdf graph.
After the graph is loaded, we iterate over document and collection
versions from the graph and insert each into db.

*Add more sqlx structs for database*

*Wrap sophia graph with our StelaeGraph*

Goal is to move similar code for working with rdf graph into a separate wrapper struct.
This simplifies working with graphs as it hides away the rdf graph results complexity.

*Use composite keys instead of autoincrement synthetic keys*

*Split functionalities for working with rdf graph on*

Since the graph queries will vary on the type of information that we
need, the easiest is to create utility functions that return (s,p,o)
depending on the queries.
This commit has the following significant changes:

*Bulk insert multiple document and collection change objects into db*

Idea being we create a query-builder and insert multiple values into an insert statement.

On Windows, sqlx throws an error when inserting 4000+ rows at once.
To resolve, chunk the rows to 2000 per bulk insert. A
dd `BATCH_SIZE` constant, indicating how many rows will get inserted at once. Set to 2000 per insert.

*Upgrade rustc compiler version*

*Add utility methods for working with the `sophia` graph*

RDF elements can be stored in a container, e.g.
<rdf:_1 resource="...">
<rdf:_2 resource="...">
To resolve, introduce a `Bag` struct with `items` helper method.
The idea being we iterate over all elements inside a container. This logic
is similar to rdflib implementation of `items` in Python.
To better understand this iteration logic, please see `rdflib/container.py` in rdflib Python.

*Add missing sqlx structs to map objects to database rows*

*Revoke same date publications*

Set a boolean revoked to True for publications that exist on same date.
Revoked publications aren't accounted for when querying for versions for a publication.

*Support for publication partial update*

Re-use the existing insert function, as most of the code is the same.
The only difference is that we skip parsing/loading publications which
are already inserted in the db. For publication directories which aren't
in the DB, we continue inserting changes.
@n-dusan
Copy link
Contributor Author

n-dusan commented May 14, 2024

@tombh Thank you, this has been insightful! Agree about the commit history, so I've brought the PR down from 98 to 7 commits.

Also a commit linter would be amazing. But I'd say we add that in another pr.

I don't think we need to address all of these comments right now, since I have another PR ready for review hopefully soon. I've created a couple of issues [1, 2, 3].

@n-dusan n-dusan requested a review from tombh May 14, 2024 09:46
@n-dusan n-dusan mentioned this pull request May 15, 2024
Copy link
Contributor

@tombh tombh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of ok()s to look at. Otherwise it looks good!

stele: row.try_get("stele")?,
revoked: row.try_get("revoked")?,
last_valid_publication_name: row.try_get("last_valid_publication_name").ok(),
last_valid_version: row.try_get("last_valid_version").ok(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The .ok()s can be ? as well right?

Copy link
Contributor Author

@n-dusan n-dusan May 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like sqlx can't map ? to Option in FromRow trait by default. So can't be ? right now :-(

version: row.try_get("version")?,
publication: row.try_get("publication")?,
stele: row.try_get("stele")?,
build_reason: row.try_get("build_reason").ok(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.ok() to ?

.bind(name)
.fetch_one(&mut *connection)
.await
.ok()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.ok() to ?

.bind(stele)
.fetch_one(&mut *connection)
.await
.ok()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.ok() to ?

- Introduce a `routing` module, intended as a central place
to map all routing in stelae app. Introduced the `_api` scope
with the versions endpoint
- feat(api): initial versions endpoint
- feat(app): call register_api routing function
- feat(api): extract stele name from request
This commit introduces a new endpoint in stelae to get all the
meaningful dates on which a document or a collection changed.

Below is a summary of significant changes:

- Add a new `api` module as a centralized place to look at all the endpoints stelae
  has.
- refact: rename publish to app
- feat(db): add Manager traits for model

  The reasoning being it's easier visually figure out which database queries work with which data structure.

- feat(db): add version struct
- feat(db): implement document and collection version queries

 Idea being we query for all the versions in which document or a collection changed.

- feat(api): get all versions for a publication

 Executes the database query for document or library versions.

- feat(api): map request/response body to structs

 Idea being it's more legibile to reason about the structure of the api.
 Rely on serde to serialize/deserialize values from and into json.

- Introduce the `requests` and `response` modules for mapping the
  endpoint response to historical data for presentation
- feat(api): support for historical messages

 In addition to the versions, we also display HTML messages that: i)
 show on which historical publication the user is currently on, ii)
 through which dates the currently viewed version was valid from
 (start and end date). If comparing versions, we show how many
 changes happened between the compare and version date.
- fix(api): assign Current value to variable after historical messages
- refact: separate formatting logic into associated function
  `format_display_date`
- lots of clippy lints
For the API, move the `versions.rs` into `versions/mod.rs`. The two
modules for request and response are now in `request/mod.rs` and
`response/mod.rs`. Historical messages are in the `response/messages.rs`
module
These modules should make the code base more legibile.

These changes include:
- Moving `serve` current documents endpoint into `server/api/serve`
  module.

  This way we'll centralize all future endpoint logic in the `api` module

- Centralize routing into `routes` mod.

 We'll expect to add new 'static' routing, while also keeping our
 dynamic logic in the same place.

- Centralize global app state into `state` mod.

- Small updates following the re-org of current codebase

refact: rename structs in `state.rs` module to fix linting issues

Address clippy lint warnings/pedantic errors related to generics in function arguments
Add `initialize_guarded_dynamic_routes` and `initialize_dynamic_routes`
functions which, depending on whether the guarded header exists,
initialize the app with or without the guarded header.
For queries which we want to exit out of earlier, use `?`.
However, I've left .ok for queries which should continue executing (so
expecting a null value), instead of propagating an error to the caller
where not expected.
@n-dusan
Copy link
Contributor Author

n-dusan commented May 20, 2024

Thanks for the rebase pr! I've now rebased your #40 into this PR and addressed comments from the other PR and most of the .ok's are now ?. Had to keep some of them, since in some cases we don't really want to early exit the function and propagate the error, so None was expected.

@n-dusan n-dusan requested a review from tombh May 20, 2024 14:33
@n-dusan n-dusan changed the title feat: Insert history feat: insert history and versions endpoint May 20, 2024
@n-dusan n-dusan changed the title feat: insert history and versions endpoint feat: Insert history and versions endpoint May 20, 2024
@n-dusan n-dusan merged commit a3a77bc into main May 21, 2024
4 checks passed
@n-dusan n-dusan deleted the ndusan/add-database-access branch May 21, 2024 11:45
@n-dusan n-dusan mentioned this pull request May 21, 2024
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Insert changes in SQLite
2 participants