This repo is being used to prove out Logstash and its ability to assist in replicating an RDS database in Elasticsearch.
- Elasticsearch
- [Logstash][]
- JDBC Driver
- Set up a Python 3 virtual environment and install dependencies
- Store a `dev.env` file at the root of your project with the following configurations
- Configure an alias in your shell profile such that `logstash` executes the `logstash` program on your machine. Example: `alias logstash=/usr/local/bin/logstash`
You'll need to create a local file with the following configurations. This file will be used to populate your environment when calling `app/config.py` explicitly or when running `run_logstash.sh`.
```
ELASTICSEARCH_URL=""
RDS_PASSWORD=""
RDS_USERNAME=""
JDBC_DRIVER_LIBRARY=""
JDBC_CONNECTION_STRING=""
SQL_DB_NAME=""
JDBC_DRIVER_CLASS=""
```

The entrypoint for this project is `run_logstash.sh`. If you step through the file you will notice it's doing a few things.
- `python app/config.py` is called, which parses your `dev.env` file and produces a `dev_env.sh` file that exports all your variables to your primary shell process (a rough sketch of this step follows this list)
- `source dev_env.sh` sources the aforementioned file
- `logstash` comes with a `-r` flag which watches for changes in the `.conf` file that configures the pipeline. This is useful for debugging purposes.
- The `-f` flag indicates that the next argument is the path to the configuration file
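For reference, here is a minimal sketch of what `app/config.py` might do, assuming `dev.env` holds simple `KEY=value` pairs; the function name is illustrative and the real implementation in the repo may differ:

```python
# Sketch only: parse dev.env and emit dev_env.sh with export statements.
from pathlib import Path

def write_env_exports(env_file: str = "dev.env", out_file: str = "dev_env.sh") -> None:
    exports = []
    for raw in Path(env_file).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        exports.append(f"export {key.strip()}={value.strip()}")
    Path(out_file).write_text("\n".join(exports) + "\n")

if __name__ == "__main__":
    write_env_exports()
```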
Things to note:
- If running a pipeline whose output is an Elasticsearch instance, make sure Elasticsearch is up and running (a quick liveness check is sketched below)
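One way to verify that before kicking off a pipeline, assuming the official `elasticsearch` Python client is installed and `ELASTICSEARCH_URL` has been exported from your `dev.env`:

```python
# Liveness check sketch -- assumes the `elasticsearch` Python client.
import os

from elasticsearch import Elasticsearch

es = Elasticsearch(os.environ["ELASTICSEARCH_URL"])
if not es.ping():  # ping() returns False when the cluster is unreachable
    raise SystemExit("Elasticsearch is not reachable; start it before running the pipeline.")
```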
`app/interface.py` is the primary manual interface for interacting with Elasticsearch.
- Mappings for each index are created in `index_settings.py`
- Calling `python app/interface.py` will instantiate the mappings in the Elasticsearch instance for each index defined in `index_settings.py` (a rough sketch of this flow follows this list)
- `search.py` will eventually be used to query against the Elasticsearch instance
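As a rough sketch of that flow, where the mapping body is a made-up stand-in for whatever `index_settings.py` actually defines:

```python
# Sketch of the mapping-instantiation flow described above.
import os

from elasticsearch import Elasticsearch

# Stand-in for index_settings.py; the real mappings live there.
INDEX_SETTINGS = {
    "indexs": {
        "mappings": {
            "properties": {
                "task_id": {"type": "keyword"},
            }
        }
    },
}

es = Elasticsearch(os.environ["ELASTICSEARCH_URL"])
for index, body in INDEX_SETTINGS.items():
    if not es.indices.exists(index=index):
        es.indices.create(index=index, body=body)  # create index with its mappings
```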
Each `ls_conf` file is a different pipeline configuration for use with Logstash.
Each `logstash.conf` file is made up of 3 parts: input, filter, and output. The `*_dev.conf` files are a workaround in which the output dumps the results to a JSON file for quicker development, avoiding the need to query an Elasticsearch index (a quick way to inspect such a dump is sketched below).
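Since Logstash's `file` output writes one JSON event per line, a dev dump can be eyeballed with a few lines of Python; the output path here is hypothetical, so use whatever path your `*_dev.conf` writes to:

```python
# Inspect a *_dev.conf JSON dump -- one event per line; path is hypothetical.
import json

with open("tmp/dev_output.json") as f:
    for line in f:
        event = json.loads(line)
        print(event.get("task_id"), event)
```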
- input
  - establishes a secure connection with the source database via JDBC (Java Database Connectivity)
  - runs a query against the database (queries stored in `sql_scripts`)
- filter
  - parses and reshapes the results of the input query, preparing them for delivery to the output
- output
  - delivers the prepared events to the outbound datastore (Elasticsearch here)
  - there are also various textual output streams that can be used
For more detail, check out the [Logstash docs](https://www.elastic.co/guide/en/logstash/current/index.html).
Note the inline comments for more information...
```conf
input {
  # Connection via the JDBC input plugin. Each ${var} is picked up from the shell environment.
  jdbc {
    jdbc_driver_library => "${JDBC_DRIVER_LIBRARY}"
    jdbc_connection_string => "${JDBC_CONNECTION_STRING}"
    jdbc_driver_class => "${JDBC_DRIVER_CLASS}"
    jdbc_user => "${RDS_USERNAME}"
    jdbc_password => "${RDS_PASSWORD}"
    # Path to the SQL query to be executed and consequently mapped to Elasticsearch
    statement_filepath => "sql_scripts/update_index.sql"
  }
}

filter {
  # Assumes a jsonified RDS query response
  json {
    id => "id"
    source => "rows"
    remove_field => "rows"
  }
}

output {
  # ES configuration. Notice there are no mappings here; mappings are configured by the Python interface as of now.
  elasticsearch {
    document_id => "%{task_id}"
    document_type => "index"
    index => "indexs"
    codec => "json"
    hosts => "${ELASTICSEARCH_URL}"
  }
}
```
- Aggregate
- Mapping
- Indexing
[logstash]: https://www.elastic.co/logstash