fix(data-computer): update lda parameters #153

Closed
wants to merge 14 commits into from
161 changes: 159 additions & 2 deletions services/data-computer/README.md
@@ -1,10 +1,9 @@
# ws-data-computer@2.12.6
# ws-data-computer@2.13.1

The `data-computer` service offers several **asynchronous** services for simple data computations and transformations.

*All the services offered accept only standard corpus files in tar.gz format as input.*


## Usage

- [v1/tree-segment](#v1%2ftree-segment)
@@ -256,3 +255,161 @@ cat input.tar.gz |curl --data-binary @- -H "X-Hook: https://webhook.site/dce2fe

# When the corpus is processed, get the result
cat output.json |curl --data-binary @- "http://localhost:31976/v1/retrieve" > output.tar.gz
```


### v1/corpus-similarity

Compares short documents (titles, sentences, short *abstracts*) with one another and returns, for each document, the documents that are similar to it.
We recommend using this route with at least 6 or 7 documents in the corpus.

An optional `output` parameter selects the type of output according to its value:
- 0 (default): the algorithm automatically chooses the documents most similar to each document
- 1: the algorithm returns, for each document, all the documents, ranked by proximity (most similar first)
- *n* (with *n* an integer greater than 1): the algorithm returns, for each document, the *n* closest documents, ranked by proximity (most similar first), along with the similarity score associated with each document.

> **Warning**: The ID field is used as the reference for each document.

For example, using `example-similarity-json.tar.gz` with the default output parameter (0), you will get:

```json
[
{
"id": "Titre 1",
"value": {
"similarity": [
"Titre 4",
"Titre 2"
],
"score": [
0.9411764705882353,
0.9349112426035503
]
}
},
{
"id": "Titre 2",
"value": {
"similarity": [
"Titre 1"
],
"score": [
0.9349112426035503
]
}
},
{
"id": "Titre 3",
"value": {
"similarity": [
"Titre 4"
],
"score": [
0.8888888888888888
]
}
},
{
"id": "Titre 4",
"value": {
"similarity": [
"Titre 1"
],
"score": [
0.9411764705882353
]
}
}
]
```

With the parameter output=3, you will get:

```json
[
{
"id": "Titre 1",
"value": {
"similarity": [
"Titre 4",
"Titre 2",
"Titre 3"
],
"score": [
0.9411764705882353,
0.9349112426035503,
0.8757396449704142
]
}
},
{
"id": "Titre 2",
"value": {
"similarity": [
"Titre 1",
"Titre 4",
"Titre 3"
],
"score": [
0.9349112426035503,
0.8888888888888888,
0.8651685393258427
]
}
},
{
"id": "Titre 3",
"value": {
"similarity": [
"Titre 4",
"Titre 1",
"Titre 2"
],
"score": [
0.8888888888888888,
0.8757396449704142,
0.8651685393258427
]
}
},
{
"id": "Titre 4",
"value": {
"similarity": [
"Titre 1",
"Titre 3",
"Titre 2"
],
"score": [
0.9411764705882353,
0.8888888888888888,
0.8888888888888888
]
}
}
]
```

#### URL parameter(s)

| name                | description                                 |
| ------------------- | ------------------------------------------- |
| indent (true/false) | Indent the result returned immediately      |
| output (0,1,n)      | Output selection                            |

#### HTTP header(s)

| name   | description                                                  |
| ------ | ------------------------------------------------------------ |
| X-Hook | URL to call when the result is available (optional)          |

#### Command-line example


```bash
# Send data for batch processing
cat input.tar.gz |curl --data-binary @- -H "X-Hook: https://webhook.site/dce2fefa-9a72-4f76-96e5-059405a04f6c" "http://localhost:31976/v1/corpus-similarity" > output.json

# When the corpus is processed, get the result
cat output.json |curl --data-binary @- "http://localhost:31976/v1/retrieve" > output.tar.gz
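
The two curl calls in the command-line example could be sketched in Python as follows. This is only an illustration under stated assumptions: the route path follows the section title and `examples.http`, the `application/x-tar` content type is taken from `examples.http`, and the function names, the `HOST` constant and the fixed `wait_seconds` delay are hypothetical helpers that are not part of the service.

```python
import time
import urllib.parse
import urllib.request
from typing import Optional

HOST = "http://localhost:31976"  # host and port taken from the curl example


def build_submit_request(corpus: bytes, output: int = 0,
                         hook: Optional[str] = None) -> urllib.request.Request:
    """Build the POST that submits a tar.gz corpus to v1/corpus-similarity."""
    query = urllib.parse.urlencode({"output": output})
    headers = {"Content-Type": "application/x-tar"}
    if hook:
        headers["X-Hook"] = hook  # optional callback URL
    return urllib.request.Request(
        f"{HOST}/v1/corpus-similarity?{query}",
        data=corpus, headers=headers, method="POST",
    )


def build_retrieve_request(ticket: bytes) -> urllib.request.Request:
    """Build the POST that sends the submission response back to v1/retrieve."""
    return urllib.request.Request(f"{HOST}/v1/retrieve", data=ticket, method="POST")


def run(corpus_path: str, result_path: str, wait_seconds: float = 5.0) -> None:
    """Submit a corpus, wait an assumed fixed delay, then download the result."""
    with open(corpus_path, "rb") as handle:
        submit = build_submit_request(handle.read())
    with urllib.request.urlopen(submit) as response:
        ticket = response.read()  # JSON describing how to fetch the result later
    # Assumption: a real client would rather wait for the X-Hook callback.
    time.sleep(wait_seconds)
    with urllib.request.urlopen(build_retrieve_request(ticket)) as response:
        with open(result_path, "wb") as out:
            out.write(response.read())
```

Calling `run("input.tar.gz", "output.tar.gz")` against a running instance would mirror the two curl commands; nothing is sent until `run` is called.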
Binary file not shown.
11 changes: 11 additions & 0 deletions services/data-computer/examples.http
@@ -113,3 +113,14 @@ X-Webhook-Success: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
X-Webhook-Failure: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9

< ./example-json.tar.gz


###
# @name v1CorpusSimilarity
POST {{host}}/v1/corpus-similarity HTTP/1.1
Content-Type: application/x-tar
X-Webhook-Success: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
X-Webhook-Failure: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9

< ./example-similarity-json.tar.gz

2 changes: 1 addition & 1 deletion services/data-computer/package.json
@@ -1,7 +1,7 @@
{
"private": true,
"name": "ws-data-computer",
"version": "2.12.6",
"version": "2.13.1",
"description": "Calculs sur fichier corpus compressé",
"repository": {
"type": "git",
2 changes: 1 addition & 1 deletion services/data-computer/swagger.json
@@ -3,7 +3,7 @@
"info": {
"title": "data-computer - Calculs sur fichier corpus compressé",
"summary": "Calculs sur un corpus compressé",
"version": "2.12.6",
"version": "2.13.1",
"termsOfService": "https://services.istex.fr/",
"contact": {
"name": "Inist-CNRS",
83 changes: 83 additions & 0 deletions services/data-computer/tests.hurl
@@ -72,4 +72,87 @@ delay: 1000
HTTP 200
[{"id":"#1","value":{"sample":2,"frequency":0.6666666666666666,"percentage":null,"sum":0,"count":5,"min":0,"max":0,"mean":0,"range":0,"midrange":0,"variance":0,"deviation":0,"population":3,"input":"a"}},{"id":"#2","value":{"sample":2,"frequency":0.6666666666666666,"percentage":null,"sum":0,"count":5,"min":0,"max":0,"mean":0,"range":0,"midrange":0,"variance":0,"deviation":0,"population":3,"input":"b"}},{"id":"#3","value":{"sample":1,"frequency":0.3333333333333333,"percentage":null,"sum":0,"count":5,"min":0,"max":0,"mean":0,"range":0,"midrange":0,"variance":0,"deviation":0,"population":3,"input":"c"}},{"id":"#4","value":{"sample":2,"frequency":0.6666666666666666,"percentage":null,"sum":0,"count":5,"min":0,"max":0,"mean":0,"range":0,"midrange":0,"variance":0,"deviation":0,"population":3,"input":"a"}},{"id":"#5","value":{"sample":2,"frequency":0.6666666666666666,"percentage":null,"sum":0,"count":5,"min":0,"max":0,"mean":0,"range":0,"midrange":0,"variance":0,"deviation":0,"population":3,"input":"b"}}]

################################ Test for Similarity ################################

POST {{host}}/v1/corpus-similarity
content-type: application/x-tar
x-hook: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
file,example-similarity-json.tar.gz;

HTTP 200
# Capture the computing token
[Captures]
computing_token: jsonpath "$[0].value"
[Asserts]
variable "computing_token" exists

# There should be a waiting time, representing the time taken to process data.
# Fortunately, as the data is sparse and the computing time is short,
# a brief delay is enough.

# Version 4.1.0 of hurl added a delay option, whose value is in milliseconds.
# https://hurl.dev/blog/2023/09/24/announcing-hurl-4.1.0.html#add-delay-between-requests

POST {{host}}/v1/retrieve-json?indent=true
content-type: application/json
[Options]
delay: 1000
```
[
{
"value":"{{computing_token}}"
}
]
```

HTTP 200
[{
"id": "Titre 1",
"value": {
"similarity": [
"Titre 4",
"Titre 2"
],
"score": [
0.9411764705882353,
0.9349112426035503
]
}
},
{
"id": "Titre 2",
"value": {
"similarity": [
"Titre 1"
],
"score": [
0.9349112426035503
]
}
},
{
"id": "Titre 3",
"value": {
"similarity": [
"Titre 4"
],
"score": [
0.8888888888888888
]
}
},
{
"id": "Titre 4",
"value": {
"similarity": [
"Titre 1"
],
"score": [
0.9411764705882353
]
}
}]


# TODO: add the two other routes (v1GraphSegment, v1Lda)
# TODO: add the rapido route
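
As a complement to the exact-match assertion above, a client could run a few structural checks on a retrieved result. The sketch below is a hypothetical helper, not part of the test suite: it assumes the `id` / `value.similarity` / `value.score` layout shown in the expected response, and the rules it checks (parallel lists, descending scores, no document listed as similar to itself) are inferred from the examples rather than stated by the service.

```python
from typing import Any, Dict, List


def check_similarity_result(items: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems found in a corpus-similarity result."""
    problems = []
    for item in items:
        doc_id = item["id"]
        neighbours = item["value"]["similarity"]
        scores = item["value"]["score"]
        # Each neighbour id is expected to have exactly one score.
        if len(neighbours) != len(scores):
            problems.append(f"{doc_id}: {len(neighbours)} ids but {len(scores)} scores")
        # Assumption drawn from the examples: a document never lists itself.
        if doc_id in neighbours:
            problems.append(f"{doc_id}: listed as similar to itself")
        # Most similar first means scores never increase along the list.
        if any(a < b for a, b in zip(scores, scores[1:])):
            problems.append(f"{doc_id}: scores are not in descending order")
    return problems


# First two entries of the expected response above.
SAMPLE = [
    {"id": "Titre 1", "value": {"similarity": ["Titre 4", "Titre 2"],
                                "score": [0.9411764705882353, 0.9349112426035503]}},
    {"id": "Titre 2", "value": {"similarity": ["Titre 1"],
                                "score": [0.9349112426035503]}},
]
```

An empty list returned by `check_similarity_result(SAMPLE)` means none of the three checks failed.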
65 changes: 65 additions & 0 deletions services/data-computer/v1/corpus-similarity.ini
@@ -0,0 +1,65 @@
# OpenAPI Documentation - JSON format (dot notation)
mimeType = application/json

post.operationId = post-v1-corpus-similarity
post.description = Web service computing the similarity between documents of a corpus
post.summary = 3 outputs are available
post.tags.0 = data-computer
post.requestBody.content.application/x-tar.schema.type = string
post.requestBody.content.application/x-tar.schema.format = binary
post.requestBody.required = true
post.responses.default.description = Information for retrieving the data when the time comes
post.parameters.0.description = Indent the resulting JSON
post.parameters.0.in = query
post.parameters.0.name = indent
post.parameters.0.schema.type = boolean
post.parameters.1.description = URL to notify that processing has finished
post.parameters.1.in = header
post.parameters.1.name = X-Webhook-Success
post.parameters.1.schema.type = string
post.parameters.1.schema.format = uri
post.parameters.1.required = false
post.parameters.2.description = URL to notify that processing has failed
post.parameters.2.in = header
post.parameters.2.name = X-Webhook-Failure
post.parameters.2.schema.type = string
post.parameters.2.schema.format = uri
post.parameters.2.required = false

post.parameters.3.in = query
post.parameters.3.name = output
post.parameters.3.schema.type = integer
post.parameters.3.description = Number of similar documents to include in the output: 0 for automatic, 1 to show all, any other number to show at most that many items.


[env]
path = generator
value = corpus-similarity

[use]
plugin = basics
plugin = spawn

# Step 1 (generic): Load the corpus file
[delegate]
file = charger.cfg

# Step 2 (generic): Process the received items asynchronously
[fork]
standalone = true
logger = logger.cfg

# Step 2.1 (specific): Run a computation on all received items
[fork/exec]
# command should be executable !
command = ./v1/corpus-similarity.py
args = fix('-p')
args = env('output', "0")

# Step 2.2 (generic): Save the result and signal that processing is finished
[fork/delegate]
file = recorder.cfg

# Step 3: Immediately return a single item indicating how to retrieve the result once it is ready
[delegate]
file = recipient.cfg
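
The `[fork/exec]` step passes `-p` followed by the `output` value to `corpus-similarity.py`, whose source is not part of this diff. The sketch below is therefore only a guess at how such a script might read that argument and apply the documented modes to an already ranked list. The names `parse_output` and `select_neighbours` are hypothetical, and the automatic mode (0) is deliberately left unimplemented because its selection rule is not described here.

```python
import argparse
from typing import List, Sequence, Tuple

Ranked = List[Tuple[str, float]]  # (document id, score), most similar first


def parse_output(argv: Sequence[str]) -> int:
    """Read the value that follows -p (assumed to carry the `output` mode)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-p", dest="output", type=int, default=0)
    return parser.parse_args(argv).output


def select_neighbours(ranked: Ranked, output: int) -> Ranked:
    """Apply the documented `output` modes to a ranked neighbour list."""
    if output < 0:
        raise ValueError("output must be 0, 1 or a larger integer")
    if output == 0:
        # The automatic selection rule is not given in the source.
        raise NotImplementedError("automatic mode is not reproduced in this sketch")
    if output == 1:
        return list(ranked)  # every document, most similar first
    return list(ranked[:output])  # at most `output` documents


# Ranked neighbours of "Titre 1", copied from the README example (output=3).
TITRE_1 = [
    ("Titre 4", 0.9411764705882353),
    ("Titre 2", 0.9349112426035503),
    ("Titre 3", 0.8757396449704142),
]
```

With this list, `select_neighbours(TITRE_1, parse_output(["-p", "2"]))` keeps the first two pairs, while mode 1 returns all three.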