Install Docker.
Install Git:
For Debian or Ubuntu:
$ sudo -H apt install -y -V git
For RHEL variants:
$ sudo -H dnf install -y git
Clone this repository:
$ sudo -H git clone https://github.com/ranguba/chupa-text-docker.git /var/lib/chupa-text
Pull Docker images:
$ cd /var/lib/chupa-text
$ sudo -H docker compose pull
If you want to change the subnet of the internal network from
172.18.0.0/24, copy .env.example to .env and edit it:
$ cd /var/lib/chupa-text
$ sudo -H cp -a .env{.example,}
$ sudo -H editor .env
Create log directory:
$ sudo -H mkdir -p /var/log/chupa-text
Install logrotate configuration:
$ sudo -H ln -fs \
  /var/lib/chupa-text/etc/logrotate.d/chupa-text \
  /etc/logrotate.d/chupa-text
Install systemd service file:
For Debian and Ubuntu:
$ sudo -H ln -fs \
  /var/lib/chupa-text/lib/systemd/system/chupa-text.service \
  /lib/systemd/system/chupa-text.service
$ sudo -H systemctl daemon-reload
$ sudo -H systemctl enable --now chupa-text
For RHEL variants:
$ sudo -H ln -fs \
  /var/lib/chupa-text/usr/lib/systemd/system/chupa-text.service \
  /usr/lib/systemd/system/chupa-text.service
$ sudo -H systemctl daemon-reload
$ sudo -H systemctl enable --now chupa-text
You can use ChupaText via HTTP or command line.
http://localhost:20080/ provides a form for text extraction. You can use it from your Web browser.
http://localhost:20080/extraction.json is the Web API endpoint with the following specification:
- HTTP Method: POST
- Content-Type: multipart/form-data
- Parameters: You must specify at least data or uri. You can specify both data and uri. In that case, uri is used as additional information.
  - data: Data to be extracted. Specifying its content-type is helpful because ChupaText then doesn't need to guess it. If ChupaText guesses the content-type, it may detect the wrong one.
  - uri: URI to be extracted.
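The specification above can also be exercised without curl. The following is a minimal sketch in Python using only the standard library. The names build_multipart and extract are hypothetical helpers introduced for illustration, and sending uri as a plain text form field is an assumption rather than something the specification states.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path

# Endpoint taken from the text above; adjust the host and port to your setup.
ENDPOINT = "http://localhost:20080/extraction.json"


def build_multipart(data=None, filename=None, content_type=None, uri=None):
    """Encode the data and/or uri parameters as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    if data is None and uri is None:
        raise ValueError("specify at least one of data or uri")
    boundary = uuid.uuid4().hex
    parts = []
    if data is not None:
        name = filename or "data"
        headers = f'Content-Disposition: form-data; name="data"; filename="{name}"\r\n'
        if content_type:
            # An explicit content type spares ChupaText from guessing it.
            headers += f"Content-Type: {content_type}\r\n"
        parts.append(headers.encode() + b"\r\n" + data + b"\r\n")
    if uri is not None:
        # Assumption: uri is sent as an ordinary text field.
        parts.append(
            b'Content-Disposition: form-data; name="uri"\r\n\r\n'
            + uri.encode()
            + b"\r\n"
        )
    delimiter = f"--{boundary}\r\n".encode()
    closing = f"--{boundary}--\r\n".encode()
    body = b"".join(delimiter + part for part in parts) + closing
    return body, f"multipart/form-data; boundary={boundary}"


def extract(path, content_type=None, endpoint=ENDPOINT):
    """POST a local file to the endpoint and return the parsed JSON."""
    path = Path(path)
    content_type = content_type or mimetypes.guess_type(path.name)[0]
    body, header = build_multipart(path.read_bytes(), path.name, content_type)
    request = urllib.request.Request(
        endpoint, data=body, headers={"Content-Type": header}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the service running, extract("/tmp/sample.pdf", "application/pdf") would send a request similar to a curl call with the --form option.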
Here is a curl command line that extracts the local PDF file at
/tmp/sample.pdf. The --form option sends the request as
multipart/form-data. data=@PATH means that the parameter name is
data and the parameter value is the content of
PATH. ;type=application/pdf specifies the content-type of the data
value:
$ curl \
  --form 'data=@/tmp/sample.pdf;type=application/pdf' \
  http://localhost:20080/extraction.json
This Web API returns the following JSON:
{
  "mime-type": "application/pdf",
  "uri": "file:/home/chupa-text/chupa-text-http-server/sample.pdf",
  "path": "/tmp/sample-36-1ywy0xf.pdf",
  "size": 147159,
  "texts": [
    {
      "mime-type": "text/plain",
      "uri": "file:/home/chupa-text/chupa-text-http-server/sample.txt",
      "path": "/home/chupa-text/chupa-text-http-server/sample.txt",
      "size": 1012,
      "title": "",
      "created_time": "2015-01-22T15:54:11.000Z",
      "source-mime-types": [
        "application/pdf"
      ],
      "creator": "Adobe Illustrator CS3",
      "producer": "Adobe PDF library 8.00",
      "body": "This is sample PDF. ...",
      "screenshot": {
        "mime-type": "image/png",
        "data": "iVBORw...",
        "encoding": "base64"
      }
    }
  ]
}
In most cases, you're interested in the texts values. Each one includes
the extracted text in body and a screenshot in screenshot. A screenshot
has the following keys:
- mime-type: The MIME type of the data. Normally, this is image/png.
- data: The image data encoded by encoding.
- encoding: This is optional. If data is encoded by base64, this value is "base64". If data isn't encoded, this key doesn't exist. Binary image data must be encoded because JSON is a text format and doesn't support binary data. If data is text data such as SVG, this key doesn't exist.
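The keys above suggest how a client might read a response. The following is a minimal sketch, not the project's own client code. The function names screenshot_bytes and summarize are hypothetical, and two points are assumptions: unencoded text data is treated as UTF-8, and an entry in texts may lack a screenshot.

```python
import base64


def screenshot_bytes(screenshot):
    """Return the screenshot image as bytes, honoring the optional encoding key."""
    data = screenshot["data"]
    encoding = screenshot.get("encoding")
    if encoding is None:
        # No encoding key: data is plain text such as SVG.
        # Assumption: the text is UTF-8.
        return data.encode("utf-8")
    if encoding == "base64":
        return base64.b64decode(data)
    raise ValueError(f"unsupported encoding: {encoding}")


def summarize(result):
    """Yield (body, screenshot MIME type, screenshot bytes) for each texts entry."""
    for text in result.get("texts", []):
        body = text.get("body", "")
        screenshot = text.get("screenshot")
        if screenshot is None:
            # Assumption: an entry may come without a screenshot.
            yield body, None, None
        else:
            yield body, screenshot["mime-type"], screenshot_bytes(screenshot)
```

A caller could pass the parsed JSON response to summarize and write each returned byte string to a file whose extension matches the reported MIME type.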
You can use ChupaText as a command line tool with the following command line:
$ sudo /usr/local/bin/docker-compose \
  --file /var/lib/chupa-text/docker-compose.yml \
  exec chupa-text \
  xvfb-run -a chupa-text /tmp/sample.pdf
If your user is a member of the docker group, you can omit sudo as
follows:
$ /usr/local/bin/docker-compose \
  --file /var/lib/chupa-text/docker-compose.yml \
  exec chupa-text \
  xvfb-run -a chupa-text /tmp/sample.pdf
The command line interface uses the same JSON format as the Web API.
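Because the output format is shared, a script can wrap the command line and parse its output like a Web API response. The following is a rough sketch under several assumptions: build_command and extract_via_cli are hypothetical names, the -T option of docker-compose exec (which disables pseudo-TTY allocation) is added here so the output can be captured and is not part of the command shown above, and standard output is assumed to contain only the JSON document.

```python
import json
import subprocess

# Path taken from the command line above.
COMPOSE_FILE = "/var/lib/chupa-text/docker-compose.yml"


def build_command(path, use_sudo=True):
    """Return the argument list for the command shown above."""
    command = [
        "/usr/local/bin/docker-compose",
        "--file", COMPOSE_FILE,
        # -T is an addition for non-interactive use (no pseudo-TTY).
        "exec", "-T", "chupa-text",
        "xvfb-run", "-a", "chupa-text", path,
    ]
    return ["sudo", *command] if use_sudo else command


def extract_via_cli(path, use_sudo=True):
    """Run the command and parse its standard output as JSON."""
    completed = subprocess.run(
        build_command(path, use_sudo),
        check=True,
        capture_output=True,
        text=True,
    )
    # Assumption: stdout holds only the JSON document.
    return json.loads(completed.stdout)
```

Pass use_sudo=False when your user is a member of the docker group.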
- Sutou Kouhei <kou@clear-code.com>
LGPL 2.1 or later.
(Sutou Kouhei has a right to change the license including contributed patches.)