FragmentEngine Content Extractor

A Typesense-powered content extraction and search system for government websites. This system crawls websites, extracts structured content fragments, enriches them with taxonomies, and provides a powerful search API. The backend in turn powers user profiles, personalised content recommendations, retrieval-augmented generation, service checklists, and journey map visualisation.

Features

  • 🕷️ Smart Web Scraping: Concurrent crawling with robots.txt compliance
  • 📊 Rich Taxonomies: Auto-categorization by life events, services, and locations
  • 🔍 Powerful Search: Faceted search with typo tolerance
  • 📦 HTML Preservation: Maintains original styling and structure
  • 🔄 Incremental Updates: Smart versioning prevents duplicates
  • 🚀 Production Ready: Docker deployment with monitoring

Quick Start

Follow this track to get from a fresh install → a Typesense instance full of fragments → the multimodal UI → the agentic evals.

0. Prerequisites

  • Docker + Compose plugin (enable Docker and add your user to the docker group)
  • Node.js 18+ (needed for UI helpers + eval runner)
  • curl and jq for smoke tests

1. Clone and configure

git clone https://github.com/p0ss/FragmentEngine
cd FragmentEngine
cp .env-example .env

# Required secrets
# Edit .env and set TYPESENSE_API_KEY=<your key>

# Optional: enable LiteLLM routing + your preferred model URLs
echo ENABLE_LITELLM=true >> .env
echo OLLAMA_URL=http://172.17.0.1:11434 >> .env  # host.docker.internal on Mac/Win
# Export OPENAI_API_KEY / ANTHROPIC_API_KEY / GROQ_API_KEY if you want remote providers

TYPESENSE_API_KEY must match between .env, UI helpers, and any scripts that talk to Typesense.

2. Bring the stack online

chmod +x deploy.sh
./deploy.sh --crawl   # builds images, boots Typesense + API + MCP, runs the unified scraper once

# Optional: pull in eval tooling (pick what you need)
./deploy.sh --with-evals        # agentic eval harness (API-only, lightweight)
./deploy.sh --with-seo-evals    # Google Search capture/SEO tooling (Puppeteer + Chromium)

  • Re-run ./deploy.sh after tweaking .env
  • Skip --crawl if you want to stage data manually
  • Health checks:

curl http://localhost:8108/health               # Typesense ready?
curl http://localhost:3000/health               # API + MCP server ready?
curl http://localhost:3000/api/llm/models       # Models discovered via LiteLLM/Ollama?

3. Fill Typesense with fragments

# Default crawl (my.gov.au + servicesaustralia.gov.au)
docker-compose run --rm scraper

# Target your own list
TARGET_URLS="https://www.ato.gov.au,https://www.ndis.gov.au" \
  docker-compose run --rm -e TARGET_URLS="$TARGET_URLS" scraper

Confirm ingestion:

source .env
curl -s -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" \
  http://localhost:8108/collections/content_fragments | jq '.num_documents'

If you want per-page aggregates, open pages-analyse.html after the crawl and click Build content_pages in Typesense to populate the derived collection used by the overlap visualisations.
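
If you'd rather confirm the count from Node instead of curl, the same call looks like this (a minimal sketch; the endpoint and num_documents field are exactly the ones used by the curl command above):

// check-ingest.mjs — run with: node check-ingest.mjs (Node 18+, TYPESENSE_API_KEY exported)
const res = await fetch('http://localhost:8108/collections/content_fragments', {
  headers: { 'X-TYPESENSE-API-KEY': process.env.TYPESENSE_API_KEY }
});
const collection = await res.json();
console.log(`content_fragments holds ${collection.num_documents} documents`);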

4. Use the helper UIs to harvest facts & rubric data

Serve the static tools from the repo root so they can reach http://localhost:3000:

python3 -m http.server 4173
# Visit http://localhost:4173/<file>

  • fragments-visual.html: Graph explorer + facet filters for grabbing fragment IDs, context, and example claims. Use this to list the facts you expect (and the hallucinations you want to forbid) before authoring eval prompts.
  • pages-analyse.html: Builds the content_pages collection, compares domains, exports CSV/JSONL, and gives you per-page evidence to cite in rubrics.

Suggested workflow when designing eval samples:

  1. Filter fragments by life event/provider and copy authoritative snippets → these become your ideal answers.
  2. Note fragment IDs + URLs → store them in fragment_ids inside the eval sample for traceability.
  3. Capture “facts we never want to see” and record them as disallowed_claims in the sample.
  4. Use the exported CSV/JSONL as a checklist when filling the eval template.
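
As a sketch of step 4, a script along these lines can turn harvested rows into eval samples. The input field names (question, snippet, fragmentId, forbidden) are hypothetical placeholders for whatever your export actually contains; the output fields match the sample schema shown in step 6:

// build-samples.mjs — illustrative only; adapt the row fields to your export
import fs from 'node:fs';

const rows = fs.readFileSync('export.jsonl', 'utf8')
  .trim().split('\n').map(line => JSON.parse(line));

const samples = rows.map(row => ({
  input: [
    { role: 'system', content: 'You are a helpful assistant with access to the MCP tools.' },
    { role: 'user', content: row.question }
  ],
  ideal: row.snippet,                     // authoritative snippet from fragments-visual.html
  fragment_ids: [row.fragmentId],         // traceability back to the source fragment
  disallowed_claims: row.forbidden ?? [], // facts we never want to see
  eval_type: 'factual_accuracy'
}));

fs.writeFileSync('samples.jsonl', samples.map(s => JSON.stringify(s)).join('\n') + '\n');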

5. Launch the multimodal interface

Using the same static server, open http://localhost:4173/multimode-interface.html.

What you get:

  • Conversational + profile-aware assistant that calls POST /api/llm/chat-with-context
  • Search + journey builder panes sourced directly from Typesense fragments
  • Automatic model dropdown fed by GET /api/llm/models (LiteLLM, Ollama, OpenAI, Anthropic, Groq)
  • Inline diagnostics if the API or models endpoint is unreachable

If the UI cannot reach the API, check Docker logs for api, ensure ports 3000/8108/8081 are open locally, and re-run the health commands from step 2.
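
To call the same chat endpoint from your own code, a request looks roughly like this. The payload fields (model, messages) are an assumption based on how the UI behaves — inspect the browser's network tab or the API source for the exact contract:

// Minimal sketch of POST /api/llm/chat-with-context; payload shape is assumed
const res = await fetch('http://localhost:3000/api/llm/chat-with-context', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'ollama/llama3', // pick an ID returned by GET /api/llm/models
    messages: [{ role: 'user', content: 'How do I enrol in Medicare?' }]
  })
});
console.log(await res.json());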

6. Run the agentic evals

Enable just the tooling you need. ./deploy.sh --with-evals (or ENABLE_AI_EVALS_PIPELINE=true) prepares the adversarial agentic harness; ./deploy.sh --with-seo-evals (or ENABLE_SEO_EVALS_PIPELINE=true) installs the Google Search capture/SEO stack.

cd evals
npm install            # keeps npm scripts happy (no external deps required)
./quick-start.sh      # optional smoke check (verifies services + sample data)

# Full adversarial sweep (baseline + tool-enabled + reviewer loop)
npm run eval:all

# Focus on a single mode if needed
npm run eval:baseline
npm run eval:tools
npm run eval:adversarial

Outputs live in evals/results/<eval>-<mode>-TIMESTAMP.json. Each result contains:

  • iterations: responder + reviewer transcripts when running adversarial
  • score: automatic rubric result (factual_accuracy, completeness, entity_verification)
  • Evidence about which claim triggered a rejection
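
To skim a whole batch of results at once, a small sketch like this works, assuming each result file is a JSON object carrying the score fields listed above:

// summarise-results.mjs — prints one line per result file
import fs from 'node:fs';
import path from 'node:path';

const dir = 'evals/results';
for (const file of fs.readdirSync(dir).filter(f => f.endsWith('.json'))) {
  const result = JSON.parse(fs.readFileSync(path.join(dir, file), 'utf8'));
  const { factual_accuracy, completeness, entity_verification } = result.score ?? {};
  console.log(`${file}: accuracy=${factual_accuracy} completeness=${completeness} entities=${entity_verification}`);
}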

Add or edit cases in evals/registry/data/government-services-grounding/samples.jsonl:

{"input":[{"role":"system","content":"You are a helpful assistant with access to the MCP tools."},
           {"role":"user","content":"How long is Paid Parental Leave?"}],
 "ideal":"18 weeks",
 "facets":{"provider":"Services Australia","life_event":"Having a baby"},
 "fragment_ids":["servicesaustralia.gov.au::ppl-overview"],
 "disallowed_claims":["Paid Parental Leave is 26 weeks"],
 "eval_type":"factual_accuracy"}

The helper UIs make it easy to grab the fragments, acceptable facts, and forbidden claims that populate ideal, fragment_ids, and disallowed_claims. Run npm run eval:adversarial whenever you change a sample to confirm the reviewer can force the responder to stay grounded.
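
Before a sweep, it is also worth sanity-checking new samples for the required fields (a small sketch; the field names come from the example above):

// lint-samples.mjs — flags malformed or incomplete eval samples
import fs from 'node:fs';

const lines = fs.readFileSync(
  'evals/registry/data/government-services-grounding/samples.jsonl', 'utf8'
).trim().split('\n');

lines.forEach((line, i) => {
  let sample;
  try {
    sample = JSON.parse(line);
  } catch {
    console.warn(`line ${i + 1}: malformed JSON`);
    return;
  }
  for (const field of ['input', 'ideal', 'fragment_ids', 'disallowed_claims']) {
    if (!(field in sample)) console.warn(`line ${i + 1}: missing ${field}`);
  }
});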

With those steps you now have: (1) Typesense populated with fragments, (2) the multimodal UI pointing at live data, and (3) an adversarial eval harness guarding regressions.

Usage

Running a Crawl (Unified Scraper)

# Run a full crawl of default targets (my.gov.au + servicesaustralia.gov.au)
docker-compose run --rm scraper

# Or specify multiple targets explicitly
TARGET_URLS="https://my.gov.au,https://www.servicesaustralia.gov.au,https://www.ato.gov.au" \
  docker-compose run --rm -e TARGET_URLS="$TARGET_URLS" scraper

# Tuning options
docker-compose run --rm -e MAX_DEPTH=2 -e CONCURRENCY=8 scraper

If you prefer targeted runs, the compose file also includes:

  • scraper-mygov
  • scraper-servicesaustralia

However, the unified scraper service is recommended to keep a single crawl version and simplify pruning/indexing across domains.

API Examples

Start the API (if it isn't already running)

docker-compose up -d api

Get collection stats

curl http://localhost:3000/api/fragments/stats/overview

Search for content

curl "http://localhost:3000/api/fragments/search?q=medicare"

Get available facets

curl "http://localhost:3000/api/fragments/facets"

Get specific fragment

curl "http://localhost:3000/api/fragments/[fragment-id]"

Get a bulk export

curl "http://localhost:3000/api/fragments/export"

List available chat models (via LiteLLM or Ollama)

curl "http://localhost:3000/api/llm/models"

Integration Example

// In your application
async function getChecklistItems(state, stage, stageVariant) {
  // Facet filters narrow the search to one life event, state, and journey stage
  const params = new URLSearchParams({
    life_event: 'Having a baby',
    state: state,
    stage: stage,
    stage_variant: stageVariant,
    include_html: true,
    per_page: 100
  });

  const response = await fetch(`http://localhost:3000/api/fragments/search?${params}`);
  if (!response.ok) {
    throw new Error(`Fragment search failed: ${response.status}`);
  }
  const data = await response.json();

  // Flatten Typesense hits into checklist-friendly objects
  return data.results.map(hit => ({
    id: hit.document.id,
    title: hit.document.title,
    description: hit.document.content_text,
    url: hit.document.url,
    html: hit.document.content_html,
    provider: hit.document.provider
  }));
}
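
Calling it from your app might then look like this (the stage values are illustrative, not a fixed vocabulary — use whatever your taxonomy defines):

const items = await getChecklistItems('NSW', 'pregnancy', 'first_child');
console.log(`${items.length} checklist items found`);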

Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Scraper   │────▶│  Typesense  │◀────│     API     │
└─────────────┘     └─────────────┘     └─────────────┘
       │                                         ▲
       │                                         │
       ▼                                         │
┌─────────────┐                           ┌─────────────┐
│  Websites   │                           │  Your App   │
└─────────────┘                           └─────────────┘

Configuration

Scraper Configuration (config/scraper-config.js)

  • maxDepth: How deep to follow links (default: 3)
  • maxLinksPerPage: Links to follow per page (default: 10)
  • concurrency: Parallel crawling threads (default: 5)
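
A sketch of how those options might look inside config/scraper-config.js (the values mirror the defaults above; the exact module shape is an assumption — check the file itself for the full set):

// config/scraper-config.js (illustrative excerpt)
module.exports = {
  maxDepth: 3,         // how deep to follow links
  maxLinksPerPage: 10, // links to follow per page
  concurrency: 5       // parallel crawling threads
};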

Taxonomy Configuration (data/seed-taxonomies.json)

  • Life events and their keywords
  • Service categories
  • Government providers
  • State mappings
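
The file is plain JSON; a hypothetical entry might look like the sketch below. The keys here are illustrative, so mirror the structure already present in data/seed-taxonomies.json rather than this example:

{
  "life_events": [
    { "name": "Having a baby", "keywords": ["pregnancy", "newborn", "parental leave"] }
  ],
  "providers": ["Services Australia"],
  "states": { "NSW": "New South Wales" }
}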

Development

Local Development Setup

# Install dependencies
cd scraper && npm install
cd ../api && npm install

# Run Typesense
docker run -p 8108:8108 -v /tmp/typesense-data:/data \
  typesense/typesense:0.25.1 \
  --data-dir /data --api-key=$TYPESENSE_API_KEY   # use the same key as your .env

# Run scraper
cd scraper && npm start

# Run API
cd api && npm start

Chat/LLM Integration

This stack runs LiteLLM by default (via Docker Compose) to proxy requests to local Ollama and/or remote APIs. The UI auto-discovers models from LiteLLM and provides a dropdown to switch.

  • Configure in .env:
    • OLLAMA_URL → where your local Ollama is reachable from within Docker.
      • Linux: http://172.17.0.1:11434
      • Mac/Windows: http://host.docker.internal:11434
    • OPENAI_API_KEY → optional, enables OpenAI models in LiteLLM.
  • The API uses LITELLM_URL=http://litellm:4000 by default (in-compose DNS).

You can verify which models are available via:

curl http://localhost:3000/api/llm/models
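
Or from Node, if you want to feed a model dropdown the way the UI does (the response shape is not documented here — log the raw body and confirm what your LiteLLM/Ollama setup returns before wiring it in):

const res = await fetch('http://localhost:3000/api/llm/models');
console.log(await res.json()); // inspect the shape before building on it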

Adding New Taxonomies

Edit data/seed-taxonomies.json to add new:

  • Life events
  • Categories
  • Providers
  • Stage variants

Monitoring

Check crawl status:

docker-compose logs -f scraper

Check API health:

curl http://localhost:3000/health

View Typesense metrics:

curl http://localhost:8108/metrics.json

Troubleshooting

Crawl is too slow

  • Increase CONCURRENCY in .env
  • Reduce maxDepth for faster initial crawls

Out of memory

  • Adjust Docker memory limits in docker-compose.yml
  • Reduce CONCURRENCY

Missing content

  • Check robots.txt compliance
  • Verify selectors in config/scraper-config.js
  • Check crawl depth settings

License

MIT
