Enabling local data sources
Most of 4CAT's data sources use external APIs. However, the tool is also capable of capturing, storing, and querying locally saved data, for instance with 4chan and 8kun data (see the data source overview for a list of all local data sources). These data are stored in a PostgreSQL database and can be queried with Sphinx search.
This page explains how to enable the collection and querying of local data sources.
The first step is to enable the collection of locally stored data.
We first need to generate the database tables for the local data sources you want to add.
This is done by running the SQL query stored in the database.sql file in the data source's datasources/ folder (e.g. datasources/fourchan/database.sql).
How to run this SQL query will depend on your specific installation. Usually, this involves running a command through psql from the data source folder like so:
psql -U username -d mydatabase -a -f database.sql
On manual, local 4CAT installations, you can also use the query tool in software like pgAdmin. If you're using Docker, the following commands add the database tables for 4chan collection to a fourcat database with user fourcat:
docker exec -it 4cat_backend /bin/bash
cd datasources/fourchan/
psql --host=db --port=5432 --username=fourcat --dbname=fourcat -f database.sql
Once the database tables are generated, let's enable the data source through 4CAT's Web interface.
Navigate to Control Panel -> Settings -> Data sources in 4CAT. Then, enable the desired data sources by ticking their checkboxes.
Enabling a local data source generates a specific menu for that data source on the Data sources settings page in the Control Panel (e.g. "4chan search"). Here you might have to make some adjustments. For imageboard data collection, for instance, you have to specify which boards you want to scrape, e.g. by adding "pol" for 4chan/pol/.
You can add more than one board to the list, e.g. ["pol", "v", "fit"]. You can also specify the interval at which boards are scraped, and whether to download images.
Go to Control Panel -> Restart or Upgrade and click the Restart button. If you're using Docker, you can also use the Docker Desktop interface to stop and start the 4cat_backend container.
After 4CAT restarts, you should begin to see log messages showing collected data.
Congrats! You're collecting data in a local PostgreSQL database. The data source will now show up on the Create dataset page.
However, to execute queries for most local data sources, you will have to run a full-text search engine. To do so, we need to install Sphinx search and index the database.
The instructions will differ based on whether you're using 4CAT through Docker or if you're running it manually.
- Run the command docker exec 4cat_backend python3 helper-scripts/generate_sphinx_config.py to create a Sphinx configuration file, which contains information on all of the enabled local data sources (per the steps above).
- Copy the sphinx.conf file to the host machine's current working directory, so we can edit the file. You can do so through the command:
  docker cp 4cat_backend:/usr/src/app/helper-scripts/sphinx.conf ./
  You will later copy the sphinx.conf file to a new Sphinx container.
- Ensure sql_host is the 4CAT database container name, e.g., sql_host = db (older 4CAT versions did not do this automatically).
- Change the listen hosts from localhost to 0.0.0.0. This allows Sphinx to receive connections from other containers and, if desired, your host machine. The listen lines should then read:
listen = 0.0.0.0:9213
listen = 0.0.0.0:9306:mysql41
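Both edits to sphinx.conf can also be scripted. The following is a minimal sketch, assuming the generated file contains lines of the form sql_host = ... and listen = localhost:...; the function name patch_sphinx_conf is made up for illustration, and the layout of your generated file may differ, so inspect the result before copying it to the Sphinx container.

```shell
#!/bin/sh
# Hypothetical helper (not part of 4CAT): point sql_host at the database
# container and make Sphinx listen on all interfaces instead of localhost.
# Assumes lines of the form "sql_host = <name>" and "listen = localhost:<port>".
patch_sphinx_conf() {
    conf="$1"      # path to sphinx.conf
    db_host="$2"   # database container name, e.g. db
    tmp="$conf.tmp"
    # Write to a temporary file first so the original survives a failed edit.
    sed -e "/^[[:space:]]*sql_host[[:space:]]*=/s/=.*/= $db_host/" \
        -e "/^[[:space:]]*listen[[:space:]]*=/s/localhost:/0.0.0.0:/" \
        "$conf" > "$tmp" && mv "$tmp" "$conf"
}
```

For example, patch_sphinx_conf ./sphinx.conf db rewrites the copy in the current working directory in place.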
Next, create a Sphinx container. This container will index your collected data and allow you to search the data with 4CAT. The Docker image can be found here. To create the container, run the following command:
docker run -it --publish 9306 --name 4cat_sphinx -d macbre/sphinxsearch:3.3.1 /bin/sh
- Run docker network ls to identify the 4CAT network, likely 4cat_default.
- Run docker network connect 4cat_default 4cat_sphinx, assuming 4cat_default is the name of your 4CAT network and you used the --name 4cat_sphinx option when creating the sphinxsearch container in the previous step.
Edit the "Sphinx host" setting in 4CAT via Control Panel -> Settings -> 4CAT Tool Settings.
- Edit "Sphinx host" to either the name of the sphinxsearch container (e.g., 4cat_sphinx) or its IP address. To find the IP address:
  - Run docker network inspect 4cat_default after adding the sphinx container to the network. Find the new sphinx container in the Containers section and copy the IPv4Address, omitting any trailing subnet suffix such as /16.
  - In the 4CAT Control Panel, go to "4CAT Tool Settings" and change the "Sphinx host" value to the Sphinx IP address you just copied.
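The address lookup can be done from the command line instead of reading the JSON output by hand. This is a rough sketch, assuming the network is called 4cat_default and the container 4cat_sphinx as in the steps above; pick_ip and sphinx_ip are illustrative names, not 4CAT commands. docker network inspect reports addresses with a subnet suffix, which pick_ip strips.

```shell
#!/bin/sh
# Read "name address/prefix" lines on stdin and print the bare address
# of the container whose name matches the first argument.
pick_ip() {
    awk -v name="$1" '$1 == name { sub(/\/.*/, "", $2); print $2 }'
}

# Print the IPv4 address of a container on a Docker network.
# Defaults are the example names used on this page; substitute your own.
sphinx_ip() {
    network="${1:-4cat_default}"
    container="${2:-4cat_sphinx}"
    docker network inspect \
        -f '{{range .Containers}}{{.Name}} {{.IPv4Address}}{{println}}{{end}}' \
        "$network" | pick_ip "$container"
}
```

Running sphinx_ip with no arguments should print the address to enter as "Sphinx host". Using the container name instead avoids the lookup altogether and keeps working if the address changes.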
Prior to 2023-07, the host for Sphinx was hard-coded to run alongside 4CAT, but it must be updated for a Docker container setup.
This only affects the 4chan data source. Change this line in datasources/fourchan/search_4chan.py to the Sphinx container IP address.
- Change the MySQLDatabase host (default is localhost) to the Docker IP address found by inspecting the 4CAT Docker network with docker network inspect 4cat_default. (You can copy the file to your host directory in order to edit it via docker cp 4cat_backend:/usr/src/app/datasources/fourchan/search_4chan.py ./ or edit it directly in the container if desired.)
- After updating, copy the file to the 4cat_backend container (i.e., docker cp datasources/fourchan/search_4chan.py 4cat_backend:/usr/src/app/datasources/fourchan/).
Finally, we need to create full-text search indexes for any data that you have already collected. Generating indexes means Sphinx will create fast lookup tables so words can be searched quickly. Afterwards, we start Sphinx by executing ./searchd. Follow these steps:
# Copy the `sphinx.conf` file we generated above to the sphinx bin folder
docker cp sphinx.conf 4cat_sphinx:/opt/sphinx/sphinx-3.3.1/bin/
# Connect to container
docker exec -it 4cat_sphinx /bin/sh
# Navigate to sphinx-3.3.1/bin/
cd /opt/sphinx/sphinx-3.3.1/bin/
# Create data and data/binlog folders IN the sphinx folder (sphinx-3.3.1/data/)
mkdir ../data
mkdir ../data/binlog
# run indexer
./indexer --all
# start searchd
./searchd
This generates full-text search indexes for all the local data sources you enabled and activates Sphinx. Make sure to keep the container running, and run ./searchd again whenever you restart the container!
To index newly collected posts, you can run docker exec 4cat_sphinx /bin/sh -c "cd /opt/sphinx/sphinx-3.3.1/bin/ && ./indexer --all --rotate" whenever the container is running.
- You can check what Sphinx is listening to by running the following command in the sphinx container (docker exec -it sphinx_container_id /bin/bash): netstat -nlp
If you're not using Docker, you can also install and run Sphinx manually.
- Download the Sphinx 3.3.1 source code.
- Create a sphinx directory somewhere in the directory of your 4CAT instance, e.g. 4cat/sphinx/. In it, paste all the unzipped contents of the sphinx-3.3.1.zip file you just downloaded (so that it's filled with the directories api, bin, etc.). In the sphinx directory, also create a folder called data, and in this data directory, one called binlog.
- Add a Sphinx configuration file. You can generate one by running the generate_sphinx_config.py script in the folder helper-scripts. After running generate_sphinx_config.py, a file called sphinx.conf will appear in the helper-scripts directory. Copy-paste this file to the bin folder in your sphinx directory (in the example above: 4cat/sphinx/bin/sphinx.conf).
- Generate indexes for the posts that you already collected (if you haven't run any scrape yet, you can do this later). Generating indexes means Sphinx will create fast lookup tables so words can be searched quickly. In your command line interface, navigate to the bin directory of your Sphinx installation and run the command ./indexer --all (Linux) or indexer.exe --all (Windows). This should generate the indexes.
  - If you get the error No such file or directory, will not index., make sure there's a data folder in the sphinx directory.
- Finally, before executing any search queries, make sure Sphinx is active. In your command line interface, run ./searchd (Linux) or searchd.exe (Windows; see known issues below if you get an error), once again within Sphinx's bin folder. Make sure to leave this process running (you may want to use something like tmux).
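On Linux, the directory layout from the steps above can be prepared with a few commands. This is only a sketch under assumptions: prepare_sphinx_dirs is a made-up name, and the 4CAT root is passed as an argument (defaulting to ./4cat, matching the example path above). It does not download or unzip Sphinx itself.

```shell
#!/bin/sh
# Hypothetical helper: create the folders Sphinx expects inside a 4CAT checkout
# and copy the generated configuration next to the Sphinx binaries.
prepare_sphinx_dirs() {
    root="${1:-./4cat}"   # path to the 4CAT instance (assumed default)
    # The indexer needs data/ and data/binlog/ to exist before it runs.
    mkdir -p "$root/sphinx/data/binlog" "$root/sphinx/bin"
    # Copy sphinx.conf only if generate_sphinx_config.py has already been run.
    if [ -f "$root/helper-scripts/sphinx.conf" ]; then
        cp "$root/helper-scripts/sphinx.conf" "$root/sphinx/bin/sphinx.conf"
    fi
}
```

Calling prepare_sphinx_dirs /path/to/4cat after unzipping Sphinx leaves the existing bin folder untouched, since mkdir -p does nothing when a directory already exists.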
See the Sphinx docs for more information.
You will need to re-run the indexer (docker exec 4cat_sphinx /bin/sh -c "cd /opt/sphinx/sphinx-3.3.1/bin/ && ./indexer --all --rotate" for Docker, ./indexer --all for Linux, and indexer.exe --all for Windows) to update Sphinx's indexes with newly collected data. As your data grows, this can take a lot of time, so we run the indexer nightly via a cronjob script.
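For Docker setups, a nightly re-index could look roughly like the sketch below. It is not the cronjob script referred to above; it simply wraps the Docker indexer command from this page in a function, and the name reindex and the default container name are assumptions.

```shell
#!/bin/sh
# Hypothetical wrapper around the Docker indexer command shown above.
# --rotate lets the indexer swap in new indexes while searchd keeps running.
reindex() {
    container="${1:-4cat_sphinx}"   # name of the Sphinx container (assumed)
    docker exec "$container" /bin/sh -c \
        "cd /opt/sphinx/sphinx-3.3.1/bin/ && ./indexer --all --rotate"
}
```

Saved as a script that ends with a call to reindex, it could be scheduled with a crontab entry such as 0 3 * * * /path/to/reindex.sh, where both the time and the path are placeholders.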
- On Windows, you might encounter the error The code execution cannot proceed because ssleay32.dll was not found (see also this page). This can be solved by downloading Sphinx version 3.1.1 and copy-pasting the following files from the 3.1.1 bin directory to your 3.3.1 bin directory:
  - libeay32.dll
  - msvcr120.dll
  - ssleay32.dll
- On Linux, you might run into permission issues. Make sure to execute the scripts with the right user.
🐈🐈🐈🐈