Comparing changes

base repository: awagen/kolibri
base: v0.1.4
head repository: awagen/kolibri
compare: main

Commits on Feb 14, 2023

  1. - adding maxLoadTimeInMinutes property for limiting time loading of resources is allowed to take, adding configured timeout to resource directive
    
    processing in JobManagerActor and WorkManagerActor
    - minor cleanup
    awagen committed Feb 14, 2023
    a6a215e View commit details
  2. - minor cleanup

    - adding KOLIBRI_BLOCKING_DISPATCHER_POOL_SIZE property to docker-compose example
    awagen committed Feb 14, 2023
    f813136 View commit details

Commits on Feb 15, 2023

  1. - correcting JudgementProviderFormat

    - on creation of JudgementProvider passing object with more information passed from the judgement source (e.g sorted judgement values up to configured max size for later calculations assuming comparison with ideal sorting (e.g NDCG)) besides pure judgement mappings
    - adding config value for max allowed time for resource loading to example docker-compose.yaml
    awagen committed Feb 15, 2023
    205a95b View commit details
  2. - adding config properties for the frequency of batch processing status updates (batchStateUpdateInitialDelayInSeconds, batchStateUpdateIntervalInSeconds)
    
    - version bump to 0.1.5
    awagen committed Feb 15, 2023
    2f0ab91 View commit details
  3. ac797be View commit details
  4. - adding ResourceDirective for judgement provider

    - adding separate structDef for supplier of JudgementProvider
    - adjusted JudgementProvider interface to get rid of retrieval of all judgements (not needed and can be large)
    - adjusting example job definitions
    - adjusting mapping in JudgementData data loading
    - adjusting tests
    awagen committed Feb 15, 2023
    9c917ca View commit details
  5. 338cee6 View commit details

Commits on Feb 17, 2023

  1. - set executor for local state distributor

    - judgement availability metrics calculation
    - adding judgement availability metric aggregation type mapping
    awagen committed Feb 17, 2023
    74c3476 View commit details

Commits on Feb 19, 2023

  1. 9051949 View commit details

Commits on Apr 16, 2023

  1. Merge pull request #6 from awagen/feature/processing-adjustment

    Feature/processing adjustment
    awagen authored Apr 16, 2023
    965ff5b View commit details
  2. - moving storage-related classes to kolibri-storage

    - adjusting build structure
    - adding folders for the intended separation of execution mechanism via akka vs zio
    awagen committed Apr 16, 2023
    928237f View commit details

Commits on Apr 17, 2023

  1. 3665a20 View commit details
  2. - copying resource files to fleet-akka

    - TODO: revisit SerializationSpec due to failing SearchEvaluationDefinition serialization test
    awagen committed Apr 17, 2023
    99f2f41 View commit details
  3. c3820d3 View commit details

Commits on Apr 18, 2023

  1. - removed akka-http dependency of CredentialsProvider

    - removed akka-http-spray-json dependency
    - cleanup build.sbt
    - moved grafana, prometheus, test-files and scripts folders to fleet-akka
    - added mainClass for assembly in fleet-akka
    - moved docker-compose and Dockerfile to fleet-akka
    awagen committed Apr 18, 2023
    a87e138 View commit details
  2. - cleanup kolibri-base configs

    - adjustment Dockerfile, docker-compose.yml
    - move of README.md to kolibri-fleet-akka
    - assembly script kolibri-fleet-akka
    awagen committed Apr 18, 2023
    cb44de5 View commit details
  3. - adding assembly merge strategy for application and logback configs

    - setting docker-compose image reference to new official docker repository (replacing kolibri-base with kolibri-fleet-akka)
    awagen committed Apr 18, 2023
    b12483a View commit details
  4. 0b19564 View commit details
  5. 57badc8 View commit details
  6. - got rid of static config reference for ExecutionJsonProtocol (enables moving it back into kolibri-base)
    awagen committed Apr 18, 2023
    e287a04 View commit details

Commits on Apr 19, 2023

  1. - continue making json formats independent of AppConfig to prepare moving them back to kolibri-base
    awagen committed Apr 19, 2023
    f61d756 View commit details
  2. 3e28282 View commit details
  3. 7edbda3 View commit details
  4. - continue format adjustments

    awagen committed Apr 19, 2023
    48cb740 View commit details
  5. - continue format adjustments

    awagen committed Apr 19, 2023
    96167e4 View commit details
  6. 4d4870a View commit details
  7. - license statements

    awagen committed Apr 19, 2023
    099de4f View commit details
  8. 0483a45 View commit details
  9. - correcting script naming

    awagen committed Apr 19, 2023
    0f1fb7c View commit details

Commits on Apr 20, 2023

  1. 5bbdbf1 View commit details

Commits on Apr 21, 2023

  1. - preparing AppConfig / AppProperties with needed modules

    - adding zio libs with newest versions
    awagen committed Apr 21, 2023
    36f190e View commit details
  2. d6e58bb View commit details
  3. - added delete call to writer trait. TODO: fill in missing implementations
    
    - setting up properties and basic traits for handling of file-based sync via claims and directives
    - basic README
    awagen committed Apr 21, 2023
    b3564a9 View commit details
  4. - log configs

    - writer corrections
    awagen committed Apr 21, 2023
    f98f03e View commit details

Commits on Apr 22, 2023

  1. a97b49e View commit details
  2. - adjustment NameFormats

    - extension claim functions to include claimTopic
    awagen committed Apr 22, 2023
    5ca30f4 View commit details

Commits on Apr 26, 2023

  1. a216027 View commit details

Commits on Apr 30, 2023

  1. - added atomic storage of mappings

    - minor adjustments
    awagen committed Apr 30, 2023
    bcb1420 View commit details

Commits on May 1, 2023

  1. - adding get method to AtomicMapPromiseStore

    - moving calculation definitions to kolibri-definitions
    - relaxed type of RetrievalFailCause from Exception to Throwable
    - wrapping SearchEvaluationJsonProtocol formats into case class to pass dependencies / passing resourceProvider to calculation definitions
    awagen committed May 1, 2023
    6c50b48 View commit details

Commits on May 2, 2023

  1. 0a7f8e9 View commit details
  2. 2f16191 View commit details

Commits on May 3, 2023

  1. - kolibri-fleet-zio:

      - extending AppConfig
      - adding definitions of jobs and tasks and task execution
      - fixed directory path resolution
      - added simple test for task execution
      - added zio junit test runner to test dependencies
    awagen committed May 3, 2023
    f33d1fc View commit details
  2. kolibri-fleet-zio:

    - correction property paths in AppProperties
    - ZIOConfig for effectful global state creation
    - added finding and registering of new jobs and their job definitions
    - adjusting FileStorageJobHandler
    - adding JobState to distinguish successfully parsed job definitions from invalid formats
    - adjusting App to incorporate initialization of ZIOConfig and job finder scheduling
    awagen committed May 3, 2023
    f325281 View commit details

Commits on May 4, 2023

  1. - fixed conditional in ZIOConfig init

    - adjusted job definition to provide split in batches
    - extended functionality of JobHandler to both register new existing jobs as well as store job definitions and batch files on submission of new job
    - adjusted folder resolutions to just take job identifier
    - extended API endpoints with job submission endpoint (stores job definition and creates single batch open-task files)
    - relaxed limitation of LocalDirectoryFileWriter to insist on directory path starting from root
    awagen committed May 4, 2023
    757116c View commit details
  2. - for job submission endpoint fixed reaction to check whether data already exists (not doing anything in this case)
    awagen committed May 4, 2023
    30fda29 View commit details

Commits on May 5, 2023

  1. ## kolibri-datatypes

    - making IndexedGenerator extend Iterable
    - adding immutable MetricAggregation
    - adding immutable MetricDocument, splitting into packages by mutable / immutable classes
    - adding immutable Aggregators
    - adding test for immutable MetricAggregation  / MetricDocument
    ## kolibri-definitions
    - making Batch type parameter covariant
    awagen committed May 5, 2023
    7327f82 View commit details
  2. ## kolibri-fleet-zio

    - temp state of adjustments for batch job claiming and work handling
    - cleanup ZIODIConfig
    - FileStorageWorkHandler test
    awagen committed May 5, 2023
    48db46d View commit details

Commits on May 10, 2023

  1. - extending FileWriter interface (TODO: implement new methods for the non-local-filesystem writers)
    
    - adjusting folder structure for jobs / tasks and status folders
    - defined JobDirectives and Rules to derive a single JobAction on a set of job-level directives
    - adjusted JobStateHandler interface and FileStorageJobStateHandler implementation
    - added orchestrator to take care of all steps in the scheduled processing update per node
    - defining current state of jobs with OpenJobsSnapshot / JobStateSnapshot
    - defined BatchProcessingStates, JobDefinitionLoadStates, JobProcessingStates
    - added example job structure to resources/testdata
    - added test for FileStorageJobStateHandler
    - added test for OpenJobsSnapshot
    awagen committed May 10, 2023
    f19efe4 View commit details
  2. de6c9b1 View commit details

Commits on May 11, 2023

  1. ## kolibri-fleet-zio

    - added option to hardcode node-hash (mainly for testing purposes)
    - test for FileStorageClaimHandler
    - pulled test object creation from FileStorageJobStateHandlerSpec to utility class TestObjects
    awagen committed May 11, 2023
    3261681 View commit details
Showing 837 changed files with 29,519 additions and 23,339 deletions.
10 changes: 6 additions & 4 deletions .github/workflows/scala.yml
@@ -18,7 +18,9 @@ jobs:
       with:
         distribution: adopt
         java-version: 11
-    - name: Run tests
-      run: sbt test
-    - name: Run integration tests
-      run: sbt IntegrationTest/test
+    - name: Run tests (with coverage)
+      run: sbt coverage test
+    - name: Run integration tests (with coverage)
+      run: sbt coverage IntegrationTest/test
+    - name: Generate aggregated coverage report
+      run: sbt coverageAggregate
178 changes: 162 additions & 16 deletions README.md
@@ -1,49 +1,195 @@
[![Scala CI](https://github.com/awagen/kolibri/actions/workflows/scala.yml/badge.svg?event=push)](https://github.com/awagen/kolibri/actions/workflows/scala.yml)

# Kolibri
The repository combines the distinct Kolibri projects.
Kolibri is a tool for relevancy explorations of search systems and for batch processing.
It implements a task queue that only needs status information kept in a storage (currently local file storage, AWS S3 and GCP GCS are implemented).
The nodes used for submitting new jobs or retrieving status information (such as requested by the kolibri-watch UI) only need to be
able to connect to the defined storage (and to the target systems, in case requesting them is part of the executed job). Who does what is then negotiated based on this data.
It therefore does not matter whether you run it on your local machine, connect from your local machine to cloud storage, connect together with several colleagues to cloud storage,
deploy instances in the cloud, or mix all of those.

**What you get - Relevancy Explorations**
- Not dependent on any target system client: as long as you can request the target system and get a json back,
you are good.
- UI-supported composition of experiments
- parameter permutations with full flexibility over altering url parameters, headers and bodies. The permutations
in a grid search can be reduced effectively by mapping specific conditional lists of parameter values to other parameter values
(such as mapping a query value to a list of filter values that is specific to that query).
- metrics (well-known information retrieval (IR) metrics such as DCG, NDCG, ERR, Precision and Recall, as well
as histograms of fields, Jaccard comparison of results for distinct search systems, and more)
- judgement lists
- batching (by any parameter)
- tagging allows any granularity of generated results. Default is on a parameter level (such as query).
- UI-supported aggregation
- aggregate from folders by regex or by defining specific results to aggregate
- use group-definitions to aggregate by group
- use weight definitions to allow for a relative grading of sample-importance (such as weighting down queries that
have a lower frequency)
- UI-supported creation and visualization of summaries
- provides both an estimation of the effect that varying each single parameter has on each calculated metric, to quickly focus on the
right levers of result quality, and representative parameter configurations for the groups of "close to best" and "close to worst".
- visualization of metric behavior on parameter changes
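
The reduction of grid-search permutations through mapped parameter values, as described in the list above, can be illustrated with a minimal sketch. It is not Kolibri's API: `PermutationSketch`, `cartesian`, `withMapped` and the parameter names `q`, `rows` and `fq` are hypothetical and only show why mapping a query to its own filter values yields fewer combinations than a full cartesian product.

```scala
// Hypothetical sketch: names and structure are illustrative, not Kolibri's API.
object PermutationSketch {

  type Params = Map[String, String]

  /** Full cartesian product over independent parameters (plain grid search). */
  def cartesian(values: Seq[(String, Seq[String])]): Seq[Params] =
    values.foldLeft(Seq(Map.empty[String, String])) { case (acc, (name, vs)) =>
      for (params <- acc; v <- vs) yield params + (name -> v)
    }

  /**
   * Extend each combination with a dependent parameter whose allowed values
   * are looked up by the value of a key parameter (e.g. query -> filters).
   * Combinations whose key value has no mapping are dropped.
   */
  def withMapped(base: Seq[Params],
                 keyParam: String,
                 mappedParam: String,
                 mapping: Map[String, Seq[String]]): Seq[Params] =
    for {
      params <- base
      key    <- params.get(keyParam).toSeq
      mapped <- mapping.getOrElse(key, Seq.empty)
    } yield params + (mappedParam -> mapped)

  def main(args: Array[String]): Unit = {
    val base = cartesian(Seq("q" -> Seq("shoes", "tv"), "rows" -> Seq("10", "20")))
    val mapping = Map("shoes" -> Seq("brand:a", "brand:b"), "tv" -> Seq("size:55"))
    // 6 combinations instead of the 12 a full product over all three filters would give
    withMapped(base, "q", "fq", mapping).foreach(println)
  }
}
```

In this example the two queries combined with two `rows` values give 4 base combinations. Attaching only the filters mapped to each query results in 6 requests, whereas combining every query with all three filter values would give 12.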


**What you get - Batch Processing**

Kolibri implements a storage-based task queue mechanism that does not require setup of any queue or database.
It further provides a task-sequence based composition of jobs that can be used for arbitrary use-cases.
While relevancy explorations (see above) are the main use-case for the development of the UI and the out-of-the-box
job definitions, it can be used as a general batch processor.


![Alt text](images/kolibri.svg?raw=true "Kolibri")

## Kolibri DataTypes
Provides basic data structures used throughout Kolibri to simplify data
processing.

Documentation: <https://awagen.github.io/kolibri/kolibri-datatypes/>
Documentation: <https://awagen.github.io/kolibri_archive/kolibri-datatypes/>

## Kolibri Base
Provides cluster forming, webserver and worker nodes, and batch execution logic including
jobs regarding batch search evaluation / optimization, requesting the search system
## Kolibri Definitions
Contains functionality for composing ```what``` to execute, irrespective of the particular execution mechanism.
Provides job definitions regarding batch search evaluation / optimization, requesting the search system
and evaluating results based on judgement files and/or custom properties
of the responses of the search system.

Documentation: <https://awagen.github.io/kolibri/kolibri-base/>
## Kolibri Storage
Contains implementations for persistence, such as local as well as cloud-based readers / writers.

## Kolibri Fleet
The project was initially based on an Akka-based execution mechanism that allowed node clustering and
collaborative execution. This has been phased out to make room for a more loosely coupled mechanism,
using the ZIO framework as processing engine, and based on storage-based definition of what needs to be done,
while all connected nodes then negotiate who executes what in a claim-based mechanism.
For the last state including the akka mode, refer to the release tag ```v0.1.5-fleet``` (the tag ```v0.1.5``` represents
the code base when the akka-based execution mode was still within the kolibri-base project, before the split into
```kolibri-definitions``` and ```kolibri-fleet-akka```, while the same tag with the ```-fleet``` suffix represents the
same functionality but split into the above-mentioned sub-projects).
From tag ```v2.0.0``` on, the kolibri-fleet-akka subproject is removed and the documentation on the official doc-website
will move to a legacy-subfolder.
The new mode opens up lean ways of quickly trying things out without the need to deploy: the mechanism does not differentiate
between deployed nodes and nodes connected from anywhere (such as your local machine or those of your colleagues).
The only thing that matters is that those nodes have access to the configured persistence (such as an S3 bucket).

Note that the updated official documentation can be found on <https://awagen.github.io/kolibri>.
You will find the legacy-documentation of the akka-based project for now under
<https://awagen.github.io/kolibri/kolibri_archive/>.

Let's have a look at kolibri-fleet-zio.

### Kolibri Fleet Zio
Provides an API to post new jobs, retrieve node and processing status, provide results to the kolibri-watch UI
and handle the executions. The usage is simple: you post a job definition (each of which comes with a clear definition
of needed fields and types, which is interpreted in the kolibri-watch frontend to ease composition),
then mark it as ready for being processed. All connected nodes will then negotiate via claims who computes which
task. The nodes themselves do not communicate in any way except via the state of the persistence.
That is, every node writes a node health file, claims for tasks it selected for execution, and a processing state
in case a claim was successful and processing has started.
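
How nodes can agree on a task owner purely from persisted claims can be sketched as follows. This is a simplified illustration, not Kolibri's implementation: the `Claim` fields and the rule "earliest claim wins, ties broken by node hash" are assumptions made for the example. The README only states that nodes negotiate via claims written to storage.

```scala
// Hypothetical sketch of claim-based negotiation; not Kolibri's actual code.
object ClaimSketch {

  /** A claim as a node would persist it: who wants which task, and when. */
  final case class Claim(taskId: String, nodeHash: String, timestampMillis: Long)

  /**
   * Assumed rule: the earliest claim wins, ties are broken by node hash.
   * Every node evaluates the same persisted claims with the same rule, so
   * all nodes reach the same decision without talking to each other.
   */
  def winner(claims: Seq[Claim], taskId: String): Option[String] =
    claims
      .filter(_.taskId == taskId)
      .sortBy(c => (c.timestampMillis, c.nodeHash))
      .headOption
      .map(_.nodeHash)

  /** A node starts processing only if its own claim is the winning one. */
  def mayProcess(claims: Seq[Claim], taskId: String, nodeHash: String): Boolean =
    winner(claims, taskId).contains(nodeHash)

  def main(args: Array[String]): Unit = {
    val claims = Seq(
      Claim("batch-0", "nodeB", 100L),
      Claim("batch-0", "nodeA", 100L),
      Claim("batch-1", "nodeB", 50L)
    )
    println(winner(claims, "batch-0"))
    println(mayProcess(claims, "batch-1", "nodeB"))
  }
}
```

The important property is determinism: as long as each node reads the same set of claims, any total ordering over them gives a consistent outcome, so the specific ordering used here is interchangeable.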

![Kolibri Overview](docs_material/kolibri_overview.png?raw=true "Kolibri Overview")

Should a node go down, there will not be any updates anymore to the health status or the processing status.
All other connected nodes check for timeouts on both types of updates and, if one is exceeded, claim the right to clean up
the state. This resets the tasks of the problematic node to open state (making them claimable by other nodes)
and removes the node health file of the respective node. This way the state of the health files can
be understood as state of currently available nodes, irrespective of where they run.
They only need access to the used persistence (read and write). In case the defined jobs require access to any
system during processing, obviously the nodes also need access to those systems.
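
The timeout-based cleanup can be sketched as a pure function over a snapshot of the persisted state. All names (`NodeHealth`, `TaskState`, `Snapshot`, `cleanup`) and the single shared timeout are assumptions for illustration. The sketch also leaves out the step in which a node first claims the right to perform the cleanup.

```scala
// Hypothetical sketch of stale-node cleanup; not Kolibri's actual code.
object CleanupSketch {

  final case class NodeHealth(nodeHash: String, lastUpdateMillis: Long)
  final case class TaskState(taskId: String, nodeHash: String, lastUpdateMillis: Long)

  /** In-memory stand-in for what would be files in the shared persistence. */
  final case class Snapshot(health: Seq[NodeHealth],
                            inProgress: Seq[TaskState],
                            open: Set[String])

  /**
   * Removes health entries that have not been updated within the timeout and
   * moves tasks back to open state if their node is considered down or their
   * own processing state has not been updated within the timeout.
   */
  def cleanup(s: Snapshot, nowMillis: Long, timeoutMillis: Long): Snapshot = {
    def stale(lastUpdate: Long): Boolean = nowMillis - lastUpdate > timeoutMillis

    val deadNodes = s.health.filter(h => stale(h.lastUpdateMillis)).map(_.nodeHash).toSet
    val (reset, keep) =
      s.inProgress.partition(t => deadNodes(t.nodeHash) || stale(t.lastUpdateMillis))

    Snapshot(
      health = s.health.filterNot(h => deadNodes(h.nodeHash)),
      inProgress = keep,
      open = s.open ++ reset.map(_.taskId)
    )
  }
}
```

After `cleanup`, the remaining health entries represent the nodes currently considered available, and every task of a timed-out node is claimable again.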

Currently three types of persistence can be used out of the box: local file storage, AWS S3 buckets and GCP gcs.

#### Notes on throughput
To tune the throughput, check the curves for in/out task-queue element flow
(see grafana board below). If you see fast production of in-flow elements,
and slower consumption (out-flow elements), then you might want to
increase the setting of the env variable `NUM_AGGREGATORS_PER_BATCH`.
In case you see a slow producer, you might want to increase the setting
for `NON_BLOCKING_POOL_THREADS` (in general you should assign more of the
available threads to the async pool compared to the blocking pool).
You might also play with `MAX_PARALLEL_ITEMS_PER_BATCH` and
`CONNECTION_POOL_SIZE_MIN/MAX`.
If you see that at times the app overwhelms the requested target application,
you can restrict the maximal throughput via `MAX_BATCH_THROUGHPUT_PER_SECOND`.


Grafana Board:
![KolibriBase Grafana Board](images/Kolibri-Dashboard-Grafana.png?raw=true "KolibriBase Grafana Board")

## Kolibri Watch
Vue project providing a UI for Kolibri.
The UI allows starting batch executions based on templates and monitoring jobs overall
and single batches in particular, including resource consumption on the nodes.
Jobs can also be killed via UI.
Future iterations will also include result / analysis visualizations.

Documentation: <https://awagen.github.io/kolibri/kolibri-watch/>
Documentation: <https://awagen.github.io/kolibri_archive/kolibri-watch/>

Status overview of cluster:
![KolibriWatch Status](images/kolibri-watch-status.png?raw=true "KolibriWatch Status")

Creating job definitions from templates and starting jobs:
![KolibriWatch Templates](images/kolibri-watch-templates.png?raw=true "KolibriWatch Templates")
![KolibriWatch Status](images/Status.png?raw=true "KolibriWatch Status")

Finished job history:
![KolibriWatch History](images/kolibri-watch-finished-jobs.png?raw=true "KolibriWatch Finished Jobs")
![KolibriWatch History](images/History.png?raw=true "KolibriWatch Finished Jobs")

There are two main options to specify the job to execute:

**1) Creating job definitions from forms defined by structural definitions provided by the kolibri-fleet-zio API.**

Without using a pre-fill of fields by some existing template:
![KolibriWatch Templates](images/Create_Form_Empty.png?raw=true "KolibriWatch Form1")

Using a pre-fill of fields by some existing template (here a small dummy example for a test job that only waits in each batch):
![KolibriWatch Templates](images/Create_Form_SmallExample.png?raw=true "KolibriWatch Form2")

A longer example:
![KolibriWatch Templates](images/Create_Form_FillIn_Template.png?raw=true "KolibriWatch Form3")


**2) Raw edit mode of existing templates**
![KolibriWatch Templates](images/Create_Raw_Template_Edit.png?raw=true "KolibriWatch Form4")


Jobs submitted via the ```Run Template``` button will only be stored as open jobs and displayed on the ```Status``` page.
To start any processing, we still need to place a processing directive, which can be done via the ```Start``` button
beside the listed job. Note that processing can be stopped via the ```Stop``` button once the job is marked to be processed.
Further, the ```Delete``` option removes the whole job definition, effectively stopping any execution on it.
Note that there will be some delay between ```Stop / Delete``` and the node actually stopping the processing.

### Experiment Result Visualization
Kolibri-Watch allows visualization of experiment summary results.
This includes:

- overviews of representative configuration examples for the `best` settings group as well as
the `worst` settings group.

![KolibriWatch WinnerLooser](images/KolibriUIWinnerLooserConfigs.png?raw=true "KolibriWatch WinnerLooser Configs")

- estimation of the effect of single parameter variations for the different result files. Currently this contains
`maxMedianShift` (for each setting of the parameter of interest, calculate the median, then take the difference between the min and max observed median)
and `maxSingleResultShift` (compare the different settings of the parameter of interest within groups of results where all other parameters
are kept constant and only the parameter of interest varies; calculate the difference max - min per group and take the max over all groups).

![KolibriWatch ParameterEffect](images/KolibriUIParameterEffect.png?raw=true "KolibriWatch Parameter Effect")


- Further, it is possible to create charts for single results on demand. For now, line charts and histogram charts are
provided. This is planned to be extended shortly.
The `<>` symbol between charts appears if charts can be merged into each other, displaying both in the same
window. Merging works for an arbitrary number of charts.

![KolibriWatch SingleResult](images/KolibriUISingleResult.png?raw=true "KolibriWatch Single Result")
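
The two parameter-effect estimates named in the list above, `maxMedianShift` and `maxSingleResultShift`, can be written down as a short sketch. It follows the verbal definitions given here and is not taken from the Kolibri code base: the `Row` type and the function signatures are hypothetical, and every row is assumed to contain the parameter of interest.

```scala
// Hypothetical sketch of the two effect estimates; not Kolibri's actual code.
object ParameterEffectSketch {

  /** One result row: the parameter setting and the metric value it produced. */
  final case class Row(params: Map[String, String], metric: Double)

  def median(values: Seq[Double]): Double = {
    require(values.nonEmpty, "median of empty sequence")
    val sorted = values.sorted
    val mid = sorted.size / 2
    if (sorted.size % 2 == 1) sorted(mid) else (sorted(mid - 1) + sorted(mid)) / 2.0
  }

  /** Median of the metric per setting of `param`, then max median - min median. */
  def maxMedianShift(rows: Seq[Row], param: String): Double = {
    require(rows.nonEmpty, "no rows given")
    val medians = rows.groupBy(_.params(param)).values.map(g => median(g.map(_.metric)))
    medians.max - medians.min
  }

  /**
   * Group rows that agree on all parameters except `param`, take max - min of
   * the metric within each group, and return the largest such difference.
   */
  def maxSingleResultShift(rows: Seq[Row], param: String): Double = {
    require(rows.nonEmpty, "no rows given")
    rows.groupBy(_.params - param).values.map { group =>
      val metrics = group.map(_.metric)
      metrics.max - metrics.min
    }.max
  }
}
```

The first estimate summarizes the typical effect of a parameter across all other settings, while the second highlights the largest effect seen in any single otherwise-identical configuration.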



### Kolibri Grafana Dashboard

![KolibriWatch GrafanaBoard1](images/KolibriGrafana1.png?raw=true "KolibriWatch GrafanaBoard1")

![KolibriWatch GrafanaBoard2](images/KolibriGrafana2.png?raw=true "KolibriWatch GrafanaBoard2")

![KolibriWatch GrafanaBoard3](images/KolibriGrafana3.png?raw=true "KolibriWatch GrafanaBoard3")



## Subproject Handling
- executing sbt commands on single projects: include the project sub-path
in the command, such as ```sbt kolibri-base/compile```
in the command, such as ```sbt kolibri-definitions/compile```
- executing commands across all projects according to the dependencies defined in the root build.sbt, such as
compiling in the needed order: ```sbt compile```

19 changes: 16 additions & 3 deletions build.sbt
@@ -1,6 +1,6 @@
 ThisBuild / scalaVersion := "2.13.2"

-ThisBuild / version := "0.1.4"
+ThisBuild / version := "0.2.4"

 // Scala Compiler Options
 ThisBuild / scalacOptions ++= Seq(
@@ -45,13 +45,26 @@ ThisBuild / evictionWarningOptions in update := EvictionWarningOptions.default
 // otherwise changes from projects referenced in dependsOn here dont seem to be picked up from local
 // but need local jar publishing
 lazy val root = (project in file("."))
-  .aggregate(`kolibri-datatypes`, `kolibri-base`)
+  .aggregate(`kolibri-datatypes`, `kolibri-storage`, `kolibri-definitions`, `kolibri-fleet-zio`)
   .settings(update / aggregate := false)
 lazy val `kolibri-datatypes` = (project in file("kolibri-datatypes"))
   .enablePlugins(JvmPlugin)
-lazy val `kolibri-base` = (project in file("kolibri-base"))
+lazy val `kolibri-storage` = (project in file("kolibri-storage"))
   .dependsOn(`kolibri-datatypes` % "compile->compile")
   .enablePlugins(JvmPlugin)
+lazy val `kolibri-definitions` = (project in file("kolibri-definitions"))
+  // storage already includes datatypes, thus only adding storage
+  // dependency here
+  .dependsOn(`kolibri-storage` % "compile->compile")
+  .enablePlugins(JvmPlugin)
+  // extending Test config here to have access to test classpath
+  .configs(IntegrationTest.extend(Test))
+  .settings(
+    Defaults.itSettings
+  )
+lazy val `kolibri-fleet-zio` = (project in file("kolibri-fleet-zio"))
+  .dependsOn(`kolibri-definitions` % "compile->compile")
+  .enablePlugins(JvmPlugin)
   // extending Test config here to have access to test classpath
   .configs(IntegrationTest.extend(Test))
   .settings(