Sissy-G ~= Syzygy /ˈsɪzɪdʒi/ ASTRONOMY the alignment of three or more celestial bodies.
This demonstration shows an Airflow integration with Weaviate and OpenAI. Sissy-G Toys is an online retailer for toys and games. The GroundTruth customer analytics application provides marketing, sales and product managers with a one-stop-shop for analytics.
This workflow includes:
- sourcing structured, unstructured and semistructured data from different systems
- ingestion with Astronomer's Python SDK for Airflow
- data quality checks with Great Expectations
- transformations and tests in dbt
- audio file transcription with OpenAI Whisper
- natural language embeddings with OpenAI Embeddings
- vector search and named-entity recognition with Weaviate
- sentiment classification with Keras
All of the above are presented in a Streamlit application.
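To make the vector search step concrete, here is a minimal sketch of nearest-neighbour search by cosine similarity, written in plain Python. It is not the demo's code: the three-dimensional vectors and the review texts are made up for illustration, and the names `cosine_similarity`, `nearest`, and `CORPUS` are hypothetical. Weaviate itself uses an approximate index rather than the brute-force scan shown here, and real OpenAI embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)


def nearest(query, corpus, k=2):
    """Return the k texts whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query, vec), text) for text, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "embeddings" (assumption: invented values, not produced by any model).
CORPUS = {
    "great toy, my kids love it": [0.9, 0.1, 0.0],
    "arrived broken, want a refund": [0.0, 0.2, 0.9],
    "fun board game for the family": [0.8, 0.3, 0.1],
}

if __name__ == "__main__":
    # A query vector pointing in roughly the same direction as the positive reviews.
    print(nearest([1.0, 0.2, 0.0], CORPUS, k=2))
```

In the demo, the query vector would come from embedding the user's search term, which is why the Streamlit app needs an OpenAI key even though the stored data is pre-embedded.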
Your Astro project contains the following files and folders:
- dags: This folder contains the Python files for the Airflow DAG.
- Dockerfile: This file contains a versioned Astro Runtime Docker image that provides a differentiated Airflow experience. If you want to execute other commands or overrides at runtime, specify them here.
- include: This folder contains additional directories for the services that will be built in the demo. Services included in this demo:
  - minio: Object storage used for ingest staging as well as stateful backups for other services.
  - mlflow: A platform for the machine learning lifecycle, including model registry and experiment tracking.
  - weaviate: A vector database.
  - streamlit: A web application framework for building data-centric apps.
- packages.txt: Install OS-level packages needed for the project.
- requirements.txt: Install Python packages needed for the project.
- plugins: Add custom or community plugins for your project to this folder. It is empty by default.
- airflow_settings.yaml: Use this local-only file to specify Airflow Connections, Variables, and Pools instead of entering them in the Airflow UI as you develop DAGs in this project.
Prerequisites:
Docker Desktop or similar Docker services running locally.
An OpenAI account or trial account.
- Install the Astronomer CLI. The Astro CLI is a command-line interface for data orchestration. It allows you to get started with Apache Airflow quickly and can be used with all Astronomer products. It provides a local instance of Airflow if you don't have an existing service.
For macOS:
brew install astro
For Linux:
curl -sSL install.astronomer.io | sudo bash -s
- Clone this repository.
git clone https://github.com/astronomer/airflow-llm-demo
cd airflow-llm-demo
- The data for this demo has been pre-embedded, so the DAG will run without requiring an OpenAI token. However, the Streamlit app uses the Weaviate Q&A and near text modules, and an OpenAI key is required to generate embeddings for the user's question or search term.
If you would like to run the Streamlit application, add your OpenAI API key to the AIRFLOW_CONN_WEAVIATE_DEFAULT variable in the .env file.
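As a rough illustration of what such a .env entry could look like, the sketch below builds a JSON-formatted Airflow connection string in Python. Airflow accepts JSON for `AIRFLOW_CONN_*` environment variables, and `X-OpenAI-Api-Key` is the header Weaviate uses to receive an OpenAI key. Everything else is an assumption: the `conn_type` value, the `host` default, placing the key under `extra`, and the helper name `weaviate_conn_env_line` are hypothetical, so check the .env file shipped with the repository for the actual layout before copying anything.

```python
import json


def weaviate_conn_env_line(openai_api_key, host="http://weaviate:8081"):
    """Build a .env line holding a JSON-formatted Airflow connection.

    Assumptions (not confirmed by this README): the connection type is
    "weaviate", the host value above, and the OpenAI key is passed to
    Weaviate through the connection's "extra" field.
    """
    conn = {
        "conn_type": "weaviate",
        "host": host,
        "extra": {"X-OpenAI-Api-Key": openai_api_key},
    }
    return "AIRFLOW_CONN_WEAVIATE_DEFAULT='" + json.dumps(conn) + "'"


if __name__ == "__main__":
    # Replace the placeholder with a real key; never commit the result.
    print(weaviate_conn_env_line("<your-openai-api-key>"))
```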
- Start Airflow, Minio, Weaviate, Streamlit and MLflow.
astro dev start
- Unpause and trigger the Airflow DAG. The commands below do this from the Astro CLI; the DAG can also be triggered from the Airflow UI.
astro dev run dags unpause customer_analytics
astro dev run dags trigger customer_analytics
Follow the status of the DAG run in the Airflow UI (username: admin, password: admin)
- After the DAG completes, view the customer analytics dashboard in Streamlit.
Streamlit has been installed alongside the Airflow UI in the webserver container.
Connect to the webserver container with the Astro CLI:
astro dev bash -w
Start Streamlit:
cd include/streamlit/src
python -m streamlit run ./streamlit_app.py
Open the Streamlit application in a browser.
Other service UIs are available at the following addresses:
- Airflow: http://localhost:8080 Username:Password is admin:admin
- Minio: http://localhost:9000 Username:Password is minioadmin:minioadmin
- MLflow: http://localhost:5000
- Weaviate: https://link.weaviate.io/3UD9H8z Enter localhost:8081 in the "Self-hosted Weaviate" field.
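If a UI does not load, a quick reachability check can help narrow down which service failed to start. The following is a minimal sketch using only the Python standard library and the addresses listed above. The names `SERVICE_URLS` and `is_reachable` are hypothetical, and the Weaviate entry assumes that the instance on localhost:8081 exposes Weaviate's standard readiness endpoint.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# Addresses taken from the list above. The Weaviate path is an assumption:
# it is Weaviate's readiness endpoint, appended to the localhost:8081 address.
SERVICE_URLS = {
    "Airflow": "http://localhost:8080",
    "Minio": "http://localhost:9000",
    "MLflow": "http://localhost:5000",
    "Weaviate": "http://localhost:8081/v1/.well-known/ready",
}


def is_reachable(url, timeout=3.0, opener=urlopen):
    """Return True if something answers at the URL, False otherwise."""
    try:
        with opener(url, timeout=timeout):
            return True
    except HTTPError:
        # The server answered, just not with a success status (e.g. a login
        # redirect or 403), so the service itself is up.
        return True
    except (URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return False


if __name__ == "__main__":
    for name, url in SERVICE_URLS.items():
        status = "up" if is_reachable(url) else "not reachable"
        print(f"{name:10s} {url} -> {status}")
```

The `opener` parameter exists only so the function can be exercised without a running stack; in normal use the default `urlopen` is fine.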