Tool to build/export JSON documents from nested SQL queries.
Currently the data output is in the bulk format for loading into Elasticsearch. In addition, a base mapping file is auto-generated by inferring the types from the data.
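For reference, the Elasticsearch bulk format pairs an action line with a document line for every record. A two-line excerpt might look like the following sketch; the index, type, and field names are placeholders, and the real values come from your config and schema:

{"index": {"_index": "my_index", "_type": "my_type"}}
{"id": 1, "name": "example"}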
The pre-packaged binary includes drivers for:
- PostgreSQL
- SQLite
- MSSQL
- MySQL/MariaDB
- Oracle
- (ODBC coming soon)
Download it from the releases page.
To keep the library simple, the sqltojson library does not contain any drivers, but it is trivial to create a program with only the driver you want. See the cmd/sqltojson main file for an example.
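As a rough sketch, a custom main only needs to blank-import the driver it wants so it registers with database/sql. The lib/pq import path and the DSN below are assumptions, and the handoff to the library is left as a comment because the actual entry point lives in cmd/sqltojson:

package main

import (
	"database/sql"
	"log"

	// Blank import registers the "postgres" driver with database/sql.
	_ "github.com/lib/pq"
)

func main() {
	// Hypothetical DSN; use the connection settings from your config.
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/mydb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := db.Ping(); err != nil {
		log.Fatal(err)
	}

	// From here, hand the connection off to the sqltojson library the same
	// way the cmd/sqltojson main file does.
}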
Create a config file. See the included config.example.yaml for a real-world example; a minimal sketch also follows the option list below.
- workers - Number of workers. Defaults to 10.
- connections - Maximum number of connections to the database. Defaults to 10.
- files.data - Name of the file to write the data to. Defaults to data.json. For more control, set the name to - (a single dash), which will write the data to stdout. This makes it easy to pipe to gzip or perform other operations, especially for large exports.
- files.mapping - Name of the file to write the mapping to.
- index - Name of the ES index the data applies to.
- type - Name of the ES document type the data applies to.
- schema - The schema of the documents being created.
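Pulling the options above together, a minimal config might look like the sketch below. It assumes files.data and files.mapping nest under a files block; the schema block is omitted here because its query structure is shown in config.example.yaml:

workers: 10
connections: 10
files:
  data: data.json
  mapping: mapping.json
index: my_index
type: my_type
# schema: see config.example.yaml for the nested query structure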
Create a config file and then run.
sqltojson -config config.yaml
If files.data is set to -, the output will be written to stdout. This may be preferable for large exports so the data can be compressed.
sqltojson -config config-stdout.yaml | gzip -c > data.json.gz
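The compressed export can later be streamed straight into a bulk load without writing the uncompressed file back to disk; the host and port are placeholders:

gunzip -c data.json.gz | curl -XPOST http://192.168.99.100:9200/_bulk --data-binary @-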
Requires:
- Glide
- Go >= 1.5 with GO15VENDOREXPERIMENT=1
git clone https://github.com/chop-dbhi/sqltojson.git && cd sqltojson
glide install
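Once the dependencies are vendored, the binary can be built from the cmd/sqltojson package mentioned above. This assumes the clone lives under your GOPATH (e.g. $GOPATH/src/github.com/chop-dbhi/sqltojson):

GO15VENDOREXPERIMENT=1 go build ./cmd/sqltojson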
The output files are named data.json and mapping.json by default.
curl -XPUT http://192.168.99.100:9200/my_index -d @mapping.json
curl -XPOST http://192.168.99.100:9200/_bulk --data-binary @data.json
"Connection reset by peer" during a data load
Elasticsearch has a limit to how large the request body can be for a single HTTP request. This is specified by the http.max_content_length setting, which is 100 MB by default.
One solution is to increase the limit; however, the entire request body must be kept in memory and the setting is limited to about 4 GB, so there may be a hard limit on what is practical.
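If you do raise the limit, the setting goes in elasticsearch.yml; the 500mb value here is only an illustration:

http.max_content_length: 500mb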
A better solution is to chunk up the output file into sizes smaller than that setting and load them sequentially. This can be done using the split command available on Linux and OS X platforms.
Create a directory to split the files into.
mkdir data
cd into it and split the file so the new files are written in the directory. The files will be named like xaa, xab, etc. Choose an even number of lines (the bulk format pairs an action line with a document line, so an odd split would break a pair) such that each resulting file is smaller than the http.max_content_length setting.
cd data
split -l 10000 ../data.json
Loop over each file in order and bulk load it.
for f in `ls . | sort`; do
curl -X POST http://192.168.99.100:9200/_bulk --data-binary "@$f"
done
On a side note, the HTTP response is not terribly informative; it is slated to be changed to a 413 Request Entity Too Large.