92 changes: 92 additions & 0 deletions .env.example
@@ -0,0 +1,92 @@
# StructSense Environment Configuration
# Copy this file to .env and configure according to your setup

# ============================================================================
# GROBID Configuration (for PDF Processing)
# ============================================================================
# GROBID is used to extract structured content from PDF files.
# You have multiple options for setting up GROBID:
# 1. Local Docker: http://localhost:8070 (default)
# 2. Hosted service: https://your-grobid-instance.com
# 3. External PDF service: Set EXTERNAL_PDF_EXTRACTION_SERVICE=True
# See docs/GROBID_SETUP.md for detailed setup instructions

# URL of GROBID server or external PDF extraction service
GROBID_SERVER_URL_OR_EXTERNAL_SERVICE=http://localhost:8070

# Whether to use an external PDF extraction service instead of GROBID
# Set to "True" if using a non-GROBID PDF extraction API
# Set to "False" to use GROBID (default)
EXTERNAL_PDF_EXTRACTION_SERVICE=False

# ============================================================================
# Weaviate Configuration (Vector Database)
# ============================================================================
# Weaviate is used for storing and querying ontology data

# HTTP connection settings
WEAVIATE_HTTP_HOST=localhost
WEAVIATE_HTTP_PORT=8080
WEAVIATE_HTTP_SECURE=False

# gRPC connection settings
WEAVIATE_GRPC_HOST=localhost
WEAVIATE_GRPC_PORT=50051
WEAVIATE_GRPC_SECURE=False

# Authentication
# IMPORTANT: Change this to a secure key in production!
WEAVIATE_API_KEY=user-a-key

# Timeout settings (in seconds)
WEAVIATE_TIMEOUT_INIT=30
WEAVIATE_TIMEOUT_QUERY=60
WEAVIATE_TIMEOUT_INSERT=120

# Weaviate collection name for ontology data
ONTOLOGY_DATABASE=ontology_database_agentpy

# ============================================================================
# Ollama Configuration (Local LLM)
# ============================================================================
# Ollama is used for local embeddings and LLM inference

# Ollama API endpoint
OLLAMA_API_ENDPOINT=http://localhost:11434

# Embedding model to use
OLLAMA_MODEL=nomic-embed-text

# ============================================================================
# LLM Configuration (for Agents)
# ============================================================================
# API keys for external LLM providers (if using hosted services)

# OpenAI / OpenRouter
# OPENAI_API_KEY=your-openai-api-key-here
# OPENROUTER_API_KEY=your-openrouter-api-key-here

# Anthropic Claude
# ANTHROPIC_API_KEY=your-anthropic-api-key-here

# Other providers
# DEEPSEEK_API_KEY=your-deepseek-api-key-here

# ============================================================================
# StructSense Configuration
# ============================================================================

# Enable knowledge graph source
ENABLE_KG_SOURCE=false

# Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
LOG_LEVEL=INFO

# ============================================================================
# Notes
# ============================================================================
# - Never commit the .env file to version control
# - Keep your API keys secure
# - See documentation for more configuration options
# - GROBID Setup Guide: docs/GROBID_SETUP.md
# - Docker Setup: docker/readme.md
55 changes: 52 additions & 3 deletions README.md
@@ -6,8 +6,57 @@ Welcome to `structsense`!

Whether you're working with scientific texts, documents, or messy data, `structsense` enables you to transform it into meaningful, structured insights.

### Documentation
The complete documentation for StructSense can be found here: [docs.brainkb.org](http://docs.brainkb.org/structsense_overview.html)
## 📋 Quick Start

### License
### Prerequisites

For PDF processing, StructSense requires a GROBID service. You have multiple options:

1. **Docker (Recommended)**: Run GROBID locally using Docker Compose
2. **Hosted Service**: Use a managed GROBID instance
3. **Manual Installation**: Install GROBID directly

See the [GROBID Setup Guide](docs/GROBID_SETUP.md) for detailed instructions on all setup options.

### Installation

```bash
pip install structsense
```

### Basic Usage

```bash
# Set up your environment variables (see GROBID Setup Guide)
export GROBID_SERVER_URL_OR_EXTERNAL_SERVICE=http://localhost:8070
export EXTERNAL_PDF_EXTRACTION_SERVICE=False

# Run StructSense
structsense-cli extract --source document.pdf --config config.yaml
```

## 📚 Documentation

- **Complete Documentation**: [docs.brainkb.org](http://docs.brainkb.org/structsense_overview.html)
- **GROBID Setup Guide**: [docs/GROBID_SETUP.md](docs/GROBID_SETUP.md)
- **Docker Setup**: [docker/readme.md](docker/readme.md)

## 🔑 Key Features

- **Multi-Agent System**: Orchestrates intelligent agents for structured extraction
- **Flexible PDF Processing**: Supports multiple GROBID deployment options
- **Scientific Text Support**: Optimized for scientific papers and technical documents
- **Ontology Integration**: Aligns extracted terms with standardized ontologies
- **Human-in-the-Loop**: Optional feedback integration for improved accuracy

## ⚙️ Configuration

StructSense uses environment variables for configuration. Key variables:

- `GROBID_SERVER_URL_OR_EXTERNAL_SERVICE`: URL of the GROBID server or the external PDF extraction service (default: `http://localhost:8070`)
- `EXTERNAL_PDF_EXTRACTION_SERVICE`: Set to `True` to use a non-GROBID PDF extraction service instead of GROBID (default: `False`)

See the [GROBID Setup Guide](docs/GROBID_SETUP.md) for complete configuration options.
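
As a rough illustration of how these two variables might be consumed, the sketch below reads them with the defaults listed above. The helper names `parse_bool` and `load_pdf_settings` are hypothetical and not part of the StructSense API; how StructSense itself parses these values is not shown here.

```python
import os
from typing import Mapping, Optional


def parse_bool(value: str) -> bool:
    """Interpret common truthy spellings such as "True", "true", or "1"."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_pdf_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the PDF-extraction settings, falling back to the documented defaults."""
    # Assumption: an explicit mapping can be passed in for testing;
    # otherwise the real process environment is used.
    env = os.environ if env is None else env
    return {
        "url": env.get("GROBID_SERVER_URL_OR_EXTERNAL_SERVICE", "http://localhost:8070"),
        "external": parse_bool(env.get("EXTERNAL_PDF_EXTRACTION_SERVICE", "False")),
    }
```

Calling `load_pdf_settings()` with no arguments reads the current environment, so it reflects whatever was exported in the shell or loaded from `.env`.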

## 📄 License
[Apache License Version 2.0](LICENSE.txt)
73 changes: 68 additions & 5 deletions docker/readme.md
@@ -21,12 +21,75 @@ You can also specify a particular Compose file with the `-f` flag:
docker compose -f custom-compose.yml up
```

## Directory
- Individual
- It consists individual docker compose file.
- Merged
- It contains a single Docker Compose file that consolidates all configurations from the individual files into one unified setup.
## 📁 Directory Structure

- **Individual**: Contains individual Docker Compose files for each service
  - `grobid-service/`: GROBID PDF extraction service (optional)
  - `ollama/`: Ollama LLM service
  - `weaviate-vector-database/`: Weaviate vector database
- **Merged**: Contains a single Docker Compose file that consolidates all configurations from the individual files into one unified setup

## 🔧 Service Components

### Core Services (Root `docker-compose.yaml`)
The root `docker-compose.yaml` includes only the essential services:
- **Weaviate**: Vector database for ontology storage
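
The application reaches this Weaviate container through the `WEAVIATE_*` variables defined in `.env.example`. The following is a minimal sketch of collecting those values, assuming the defaults from that file; the `WeaviateSettings` class and `load_weaviate_settings` function are hypothetical names used only for illustration, not part of StructSense or the Weaviate client.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class WeaviateSettings:
    """Connection settings named after the WEAVIATE_* variables (illustrative only)."""
    http_host: str
    http_port: int
    http_secure: bool
    grpc_host: str
    grpc_port: int
    grpc_secure: bool


def _flag(value: str) -> bool:
    """Treat "True"/"true" as enabled; anything else as disabled."""
    return value.strip().lower() == "true"


def load_weaviate_settings(env: Optional[Mapping[str, str]] = None) -> WeaviateSettings:
    """Read the settings, falling back to the defaults from .env.example."""
    env = os.environ if env is None else env
    return WeaviateSettings(
        http_host=env.get("WEAVIATE_HTTP_HOST", "localhost"),
        http_port=int(env.get("WEAVIATE_HTTP_PORT", "8080")),
        http_secure=_flag(env.get("WEAVIATE_HTTP_SECURE", "False")),
        grpc_host=env.get("WEAVIATE_GRPC_HOST", "localhost"),
        grpc_port=int(env.get("WEAVIATE_GRPC_PORT", "50051")),
        grpc_secure=_flag(env.get("WEAVIATE_GRPC_SECURE", "False")),
    )
```

Keeping the HTTP and gRPC values separate mirrors the layout of `.env.example`, where each transport has its own host, port, and secure flag.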

### Optional Services

#### GROBID Service (Optional)
GROBID is used for PDF extraction, but running it locally is **optional**. Choose one of the following:

1. **Run GROBID via Docker** (Recommended for local development):
   ```bash
   cd docker/individual/grobid-service
   docker compose up -d
   ```

2. **Use a hosted GROBID service**: Configure the URL in your `.env` file
3. **Use an external PDF extraction service**: Set `EXTERNAL_PDF_EXTRACTION_SERVICE=True`

See the [GROBID Setup Guide](../docs/GROBID_SETUP.md) for detailed instructions on all options.
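
Whichever GROBID option you pick, it can help to confirm that the server is reachable before running an extraction. The sketch below assumes a standard GROBID server, which exposes a liveness endpoint at `/api/isalive`; the helper names `isalive_url` and `grobid_is_alive` are illustrative, and the check does not apply to a non-GROBID external service.

```python
import urllib.error
import urllib.request


def isalive_url(base_url: str) -> str:
    """Build the GROBID liveness URL from a base server URL."""
    return base_url.rstrip("/") + "/api/isalive"


def grobid_is_alive(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the server answers the liveness request with HTTP 200."""
    try:
        with urllib.request.urlopen(isalive_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Unreachable host, timeout, HTTP error, or malformed URL:
        # treat all of these as "not alive".
        return False
```

For the local Docker option, `grobid_is_alive("http://localhost:8070")` would be the natural call, since that is the default URL in `.env.example`.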

#### Other Services
- **Ollama**: For running local LLM models
- **Complete Stack**: Use `docker/merged/docker-compose.yaml` to run all services together

## 🎯 Usage Examples

### Start Only Core Services
```bash
# From repository root
docker compose up -d
```

### Start GROBID Service (Optional)
```bash
cd docker/individual/grobid-service
docker compose up -d
```

### Start All Services (Including GROBID)
```bash
cd docker/merged
docker compose up -d
```

### Stop Services
```bash
# Run from the same directory where the services were started
docker compose down
```

## ⚠️ Requirements

Use a recent version of Docker and Docker Compose. Older versions may fail with compatibility errors related to the Compose file format. Minimum versions:

- Docker Engine 20.10+
- Docker Compose V2 (recommended)

## 💡 Tips

- GROBID is **not required** if you're using hosted services or external PDF APIs
- Start only the services you need to save resources
- Use the merged configuration for a complete development environment
- Individual service configurations allow for more flexible deployment