This repository provides a replication package for the empirical study on the usage of the operator "not" in JSON Schema.
To refer to this replication package in a publication, please use these BibTeX entries.
@misc{Baazizi_Usage_of_Not_2021,
  author = {Baazizi, Mohamed Amine and Colazzo, Dario and Ghelli, Giorgio and Sartiani, Carlo and Scherzinger, Stefanie},
  doi = {10.5281/zenodo.5141378},
  month = {8},
  title = {{"Usage of Not" Replication Package}},
  url = {https://github.com/sdbs-uni-p/usage-of-not-replication},
  year = {2021}
}
@inproceedings{Baazizi:2021:usageofnot,
  author = {Mohamed Amine Baazizi and Dario Colazzo and Giorgio Ghelli and Carlo Sartiani and Stefanie Scherzinger},
  title = {An Empirical Study on the "Usage of Not" in Real-World JSON Schema Documents},
  booktitle = {Proc.\ ER},
  year = {2021}
}
- "Usage of Not" Replication Package
The fastest way to get up and running is to download the pre-built Docker image tarball from Zenodo (DOI: 10.5281/zenodo.5141378) and load it into Docker.
# Download tarball from Zenodo first, then load image
# Please note: the image has a considerable size of 20.4GB (uncompressed), so loading will take some time
docker load < sds_usage-of-not-replication_v1-0.tar.gz
# Start a container from the pre-built image
docker run --name <container-name> -d sds/usage-of-not-replication:1.0
Hint: the --name option is optional but allows you to identify the container easily.
After running the aforementioned commands, the container with the Postgres database server will be running in the background and will already contain all the data, so you can now take your first steps.
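You can confirm that the container is actually up before proceeding (the filter value assumes you used the --name option above):
# List the running container; STATUS should read "Up ..."
docker ps --filter name=<container-name>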
You can also build the image yourself. To do so, follow the steps below:
# Clone git repository
git clone https://github.com/sdbs-uni-p/usage-of-not-replication.git
# Go to repository root
cd usage-of-not-replication
# Download SQL-dump file (git lfs only stores pointer file for large files)
# Pre-requisite: git lfs must be installed (See https://git-lfs.github.com/)
git lfs pull --include "jsonschemacorpus_dump.sql.gz"
# Build docker image
# This image was built with "docker build -t sds/usage-of-not-replication:1.0 ."
# Note: if <container-registry> is omitted, docker.io is used as the container registry
docker build -t <container-registry>/<repository>/<image-name>:<tag> .
# Start a container from your own image
# This image is run with "docker run --name <container-name> -d sds/usage-of-not-replication:1.0"
docker run --name <container-name> -d <container-registry>/<repository>/<image-name>:<tag>
The build process will take a considerable amount of time, since dependencies must be pulled in and installed and the database initialization is performed during the build. As soon as the build is finished, you can run the image as described for the pre-built Docker image (OPTION 1).
Important:
jsonschemacorpus_dump.sql.gz MUST be downloaded via git lfs; otherwise, decompressing the file will fail, since the stored pointer file is not in the correct format. jsonschemacorpus_dump.sql.gz will be 1.5GB in size, whereas the pointer file occupies only a few bytes. A SHA256 sum is computed during the build process to verify that the SQL dump is valid; if it is not, the build process will fail. A quick sanity check is shown below.
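Before building, you can verify with standard UNIX utilities that the real dump (and not the LFS pointer file) was fetched; the expected size is the 1.5GB mentioned above:
# A Git LFS pointer file is only a few bytes of ASCII text
ls -lh jsonschemacorpus_dump.sql.gz   # expect roughly 1.5GB
file jsonschemacorpus_dump.sql.gz     # expect "gzip compressed data", not ASCII text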
Make sure you have completed all the steps in Getting Started, then continue here. To work with the container, perform the following steps:
As the container is now running in the background, you have to get shell access to it as user root in order to connect to the database and run queries.
# Start bash as root within the container
docker exec -u root -it <container-name> bash
# OR: Combine Steps "Get shell-access to container" and "Start psql" into one command to get dropped into the psql console immediately
docker exec -u root -it <container-name> psql jsonschemacorpus
Important: If you change the name of the database (i.e. modify $POSTGRES_DB when building the image) from its default value "jsonschemacorpus", make sure to adjust the psql command above accordingly.
You are now dropped into a shell in the /json-schema-corpus directory inside the container, which contains all relevant files (see Image content). A connection to the database $POSTGRES_DB (default: jsonschemacorpus) can be established with the psql CLI client.
# Start psql client connecting to database $POSTGRES_DB
psql $POSTGRES_DB
# Expected result: psql console (Assuming $POSTGRES_DB=jsonschemacorpus)
psql (12.7 (Ubuntu 12.7-1.pgdg18.04+1))
Type "help" for help.
jsonschemacorpus=#
The psql console will now be displayed (as indicated above), and you can start writing your own queries or execute the queries outlined in the paper. To do so, use the following command:
# In general
\i path/to/file
# E.g.: Run query_02.sql
\i ./sql-queries/query_02.sql
Alternatively, you can execute a SQL script directly from bash:
# In general
psql $POSTGRES_DB -f /path/to/file
# OR
psql $POSTGRES_DB < /path/to/file
# E.g.: Run query_02.sql
psql $POSTGRES_DB -f ./sql-queries/query_02.sql
# OR
psql $POSTGRES_DB < ./sql-queries/query_02.sql
Here are some hints you may find helpful when working with the provided container image:
- Using vi or nano, which are provided inside the container, you can build your own queries or view and adjust the provided queries.
- You can create SQL scripts on your host machine and copy them into the container, or the other way round, using the docker cp command (see the sketch after this list).
- You may find it useful to define a volume mapping between a directory on your host machine and a directory inside the container to share data. To do so, start the container with the -v option and define which directories you would like to map. E.g., mapping the sql-queries folder on the host to the /json-schema-corpus/sql-scripts directory inside the container gives you a convenient way to view and edit queries on your host machine while having them available inside the container for execution (see the sketch after this list). Of course, this is also possible with any other directory (except those relevant for the container to work properly).
- Use \pset pager off in psql console to disable pagination of query results
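A minimal sketch of both approaches (my_query.sql is a placeholder; the -v mapping must be defined when the container is started):
# Copy a script from the host into the running container (swap the arguments to copy back)
docker cp my_query.sql <container-name>:/json-schema-corpus/sql-queries/
# OR: start the container with a volume mapping, sharing the host's sql-queries
# directory as /json-schema-corpus/sql-scripts inside the container
docker run --name <container-name> -d \
  -v "$(pwd)/sql-queries:/json-schema-corpus/sql-scripts" \
  sds/usage-of-not-replication:1.0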
You may be more comfortable using a (web) GUI such as pgAdmin as a client application instead of the psql client provided in the container. Two general approaches for connecting from pgAdmin are described here.
If you already have pgAdmin installed as your default Postgres client, you can simply add a new database server in pgAdmin. This requires you to know the IP address of the container, and there are two general network setups which result in different IP addresses: on the one hand, the isolated network stacks provided by Docker (the default); on the other hand, running the container without network isolation from the host, which might be more convenient (see host networking).
By default, your container is started in an isolated network called "bridge", which Docker creates for you. To obtain the IP address (and port) of the container in this case, you may choose one of the following alternatives:
- Using docker inspect
docker inspect <container-name>
This returns a JSON document whose final section contains the networking details, including the IP address. Here is an excerpt of the relevant section:
"NetworkSettings": {
"Bridge": "",
"SandboxID": "758e7f97f36d2fb6892e3ad73eb23d33b51faa606184483db2fef47e4e4d0d15",
"HairpinMode": false,
"LinkLocalIPv6Address": "",
"LinkLocalIPv6PrefixLen": 0,
// Here you will find all the ports that are exposed on this container
"Ports": {
"5432/tcp": null
},
"SandboxKey": "/var/run/docker/netns/758e7f97f36d",
"SecondaryIPAddresses": null,
"SecondaryIPv6Addresses": null,
"EndpointID": "938340b699e72dce8facae69e27fd7de6fc837efd56b44d43108ec35d8d1f053",
"Gateway": "172.17.0.1",
"GlobalIPv6Address": "",
"GlobalIPv6PrefixLen": 0,
"IPAddress": "172.17.0.3",
"IPPrefixLen": 16,
"IPv6Gateway": "",
"MacAddress": "02:42:ac:11:00:03",
"Networks": {
// Find the "bridge" network
"bridge": {
"IPAMConfig": null,
"Links": null,
"Aliases": null,
"NetworkID": "68e507252eda2dd44c72977272ee4e0c1a6da0bcbf6acca47cce178e5268a06f",
"EndpointID": "938340b699e72dce8facae69e27fd7de6fc837efd56b44d43108ec35d8d1f053",
"Gateway": "172.17.0.1",
// Here you will find the relevant IP address
"IPAddress": "172.17.0.3",
"IPPrefixLen": 16,
"IPv6Gateway": "",
"GlobalIPv6Address": "",
"GlobalIPv6PrefixLen": 0,
"MacAddress": "02:42:ac:11:00:03",
"DriverOpts": null
}
}
}
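If you prefer not to scan the full JSON output, docker inspect also accepts a Go template via the -f flag to print the address directly:
# Print only the container's IP address(es) across its networks
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' <container-name>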
- Using standard UNIX utilities
By running the docker exec command you can execute commands within the container, and can therefore use the installed utilities to get the IP address.
docker exec <container-name> hostname -I
Please be aware that Docker only provides host networking on Linux hosts.
If you don't want the network stack of the container to be isolated from your host, you can attach the container to the so-called "host" network. This is only possible when starting a container; you cannot change it afterwards.
# Use --network flag to connect the container to the "host" network
docker run --name <container-name> -d --network host sds/usage-of-not-replication:1.0
You can now reach the container using the IP address of your host system.
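Any PostgreSQL client on the host can then connect via localhost; for example, with a locally installed psql (the values below are the image defaults listed in the environment variable table, and TRUST authentication means no password is required):
# Connect from the host to the container's Postgres server
psql -h localhost -p 5432 -U root jsonschemacorpus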
In case you don't have pgAdmin installed already, or you don't want to use your local installation, a convenient way to use pgAdmin is Docker. There is an official pgAdmin Docker image available on DockerHub and official documentation on how to use the image on pgadmin.org. You may use this image and user-defined Docker networks to set up the two containers in their own isolated network, or you might go further and use Docker Compose to deploy everything conveniently using a YAML configuration file.
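A minimal sketch of the user-defined network approach (the network name, credentials, and host port are placeholders; dpage/pgadmin4 is the official image, and its environment variables are documented on pgadmin.org):
# Create an isolated network and attach both containers to it
docker network create not-study-net
docker run --name <container-name> -d --network not-study-net sds/usage-of-not-replication:1.0
docker run --name pgadmin -d --network not-study-net -p 8080:80 \
  -e PGADMIN_DEFAULT_EMAIL=admin@example.com \
  -e PGADMIN_DEFAULT_PASSWORD=secret \
  dpage/pgadmin4
# In pgAdmin (http://localhost:8080), register a server with host <container-name>,
# port 5432, user root, database jsonschemacorpus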
The following is an excerpt (Section 2.1 Pattern Queries) from the paper "An Empirical Study on the “Usage of Not” in Real-World JSON Schema Documents" explaining the pattern language used to analyse JSON Schema Documents.
As part of our analysis, we will study which keywords occur below an instance of the not operator. To this aim, we introduce a simple path language, where a path such as .∗∗.not.required matches any path that ends with an object field named required found inside an object field whose name is not.
Paths are expressed using the following language.¹ Pattern matching is defined as in JSONPath², so we only give an informal introduction.
p ::= step | step p
step ::= .key | .∗ | [∗] | .∗∗
filtered p ::= p ? p
The step .∗ retrieves all member values of an object, [∗] retrieves all items of an array, and .∗∗ is the reflexive and transitive closure of the union of .∗ and [∗], navigating to all nodes of the JSON tree to which it is applied.
We use the conditional form p1 ? p2 to denote nodes n that are reached by a path p1, such that if we follow a path p2 starting from n we will arrive at some node. For example, if we have one subtree reached by .∗∗.anyOf that contains three nodes with name not, then we count one instance of the path .∗∗.anyOf ? .∗∗.not, but three instances of the path .∗∗.anyOf.∗∗.not.
¹ The PostgreSQL-internal implementation of JSONPath is underspecified for our purposes.
² Friesen, J.: Extracting JSON Values with JsonPath, pp. 299–322. Apress, Berkeley, CA (2019)
The init.sh script performs the following steps to initialize the database (a conceptual sketch follows the list):
- Configure psql
- Decompress SQL-dump
- Restore database from SQL-dump
- Delete SQL-dump to keep Docker image size as small as possible
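A conceptual sketch of the dump-handling steps (this is not the actual script; file and database names are the image defaults):
# Decompress the SQL dump
gunzip jsonschemacorpus_dump.sql.gz
# Restore the database from the dump
psql jsonschemacorpus < jsonschemacorpus_dump.sql
# Delete the dump to keep the image small
rm jsonschemacorpus_dump.sql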
The image is based on Ubuntu 18.04 and runs a PostgreSQL server listening on 0.0.0.0:5432 when started. You may therefore connect to the database server inside the container from another PostgreSQL client, e.g. pgAdmin, if you wish. Furthermore, the image contains wget, gzip, nano and vi to provide a more convenient way to interact with the system; you can of course pull in more packages at any time. Apart from those packages, the SQL queries presented in the paper and some administrative scripts are provided within $WORKDIR (default: /json-schema-corpus).
# Structure of working directory $WORKDIR
json-schema-corpus
├── scripts
│   ├── check_queries.sh
│   ├── entrypoint.sh
│   └── init.sh
└── sql-queries
    ├── query_01.sql
    ├── query_02.sql
    ├── ...
    └── query_49.sql
Certain environment variables are available to customize the image. Commands outlined in this README may need to be adjusted accordingly when default values are changed (a hedged build example follows the table).
Name | Default value | Description | Note |
---|---|---|---|
POSTGRES_USER | root | Postgres superuser to be created | TRUST authentication (in PostgreSQL) is enabled, so if this is changed, an appropriate Linux user must be created |
POSTGRES_PASSWORD | password | Postgres superuser password | Should not be required, as TRUST authentication is enabled |
POSTGRES_DB | jsonschemacorpus | Name of the created database | |
POSTGRES_PORT | 5432 | Port Postgres listens on | Port will be exposed by the container |
PG_MAJOR_VERSION | 12 | Major version of PostgreSQL | Do not change, to ensure everything works as expected |
SQL_DUMP_FNAME | jsonschemacorpus_dump.sql | Name of the SQL dump file (when decompressed) | When decompressed, the ".gz" extension is dropped |
WORKDIR | /json-schema-corpus | Working directory within the container, containing all artifacts provided in the repo | |
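For example, overriding the database name at build time might look as follows, assuming the Dockerfile exposes these variables as build arguments (an assumption; check the Dockerfile to confirm):
# Hypothetical: override $POSTGRES_DB when building the image
docker build --build-arg POSTGRES_DB=mycorpus -t sds/usage-of-not-replication:1.0 .
Commands later in this README, such as psql jsonschemacorpus, would then use the new name instead.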
Due to the layered architecture of Docker and its copy-on-write strategy, you might encounter performance issues when you perform write operations on the data within the database. Details on these key Docker concepts can be found in the official Docker documentation on storage drivers, especially in the section "Copying makes containers efficient". There you will also find detailed documentation on how to resolve such issues for write-heavy applications using Docker volumes. By making use of volumes and Docker's backup, restore and migration features, you can easily create your own volume containing all the data stored in the database (path /var/lib/postgresql/$PG_MAJOR_VERSION/main within the container) and resolve those performance issues.
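A minimal sketch, assuming the default PG_MAJOR_VERSION of 12 and a named volume called pgdata (when Docker first mounts a named volume over a non-empty image directory, it pre-populates the volume with that directory's content, so the restored database carries over):
# Mount a named volume over the PostgreSQL data directory
docker run --name <container-name> -d \
  -v pgdata:/var/lib/postgresql/12/main \
  sds/usage-of-not-replication:1.0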
Download the pre-built Docker image as a tarball:
- Zenodo, DOI: 10.5281/zenodo.5141378 (Size: 5.2GB)
This work was partly funded by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) grant #385808805. The schema corpus was retrieved using Google BigQuery, supported by Google Cloud.
This work is licensed under a Creative Commons Attribution 4.0 International License.