AXEtract is a high-performance, low-cost framework for extracting structured data from web pages. Based on the paper "AXE: Low-Cost Cross-Domain Web Structured Information Extraction", it optimizes the extraction pipeline by using specialized LoRA adapters for pruning and query-specific extraction, enabling state-of-the-art results with small models (e.g., Qwen3-0.6B).
- 🎯 Specialized LoRA Adapters: Uses task-specific adapters for DOM pruning and structured extraction, achieving high accuracy with minimal token overhead.
- ✂️ Smart DOM Pruning: Classifies and prunes irrelevant HTML nodes before passing them to the extractor, significantly reducing context window usage and costs.
- 📍 Grounded XPath Resolution (GXR): Automatically maps extracted JSON fields back to their original source XPaths in the DOM for verification and grounding.
- ⚡ High-Throughput Pipeline: Built-in support for multiple LLM engines, including vLLM for production-grade serving and HuggingFace for local research.
- 🌐 Cross-Domain Versatility: Designed to generalize across various web domains (e-commerce, real estate, listings) without needing domain-specific rules.
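To illustrate the Grounded XPath Resolution (GXR) idea from the feature list, here is a toy resolver sketched with the stdlib `ElementTree` on well-formed markup. It is not the library's implementation: `resolve_xpath` is a hypothetical helper, and real HTML would need a lenient parser and smarter matching.

```python
import xml.etree.ElementTree as ET

def resolve_xpath(root, value):
    """Return an XPath-like locator for the first element whose text
    matches `value`, walking the tree with positional indices."""
    def walk(elem, path):
        if (elem.text or "").strip() == value:
            return path
        counts = {}
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            found = walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")
            if found:
                return found
        return None
    return walk(root, f"/{root.tag}")

html = """<html><body>
  <div><span>Acme Widget</span><span>$19.99</span></div>
</body></html>"""
root = ET.fromstring(html)
print(resolve_xpath(root, "$19.99"))
# -> /html/body[1]/div[1]/span[2]
```

Grounding each extracted field to a locator like this is what lets downstream code verify a prediction against the original DOM instead of trusting the model output blindly.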
AXEtract follows a three-part decoupled pipeline for maximum efficiency:
- Preprocessor: Fetches raw HTML and chunks it into manageable, token-aware fragments.
- AI Extractor: Divided into two stages:
  - Pruner: A lightweight LoRA-adapted LLM filters out noise and selects only the relevant HTML chunks.
  - Extractor: A task-specific LLM maps the pruned HTML content directly to a structured JSON schema or natural-language answer.
- Postprocessor: Validates the output and resolves source XPaths via Grounded XPath Resolution (GXR).
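The Preprocessor's chunking step can be sketched as a greedy token-budgeted splitter. This is a simplified stand-in, not the library's code: it counts whitespace tokens where the real pipeline would use the model tokenizer and DOM-aware boundaries, and `chunk_html` is a hypothetical name.

```python
def chunk_html(html: str, max_tokens: int = 50) -> list[str]:
    """Greedy token-aware chunker: accumulate lines until the next line
    would push the chunk past `max_tokens` (naive whitespace token count
    stands in for the model tokenizer)."""
    chunks, current, count = [], [], 0
    for line in html.splitlines():
        n = len(line.split())
        if current and count + n > max_tokens:
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(line)
        count += n
    if current:
        chunks.append("\n".join(current))
    return chunks

page = "\n".join(f"<li>item {i} description text</li>" for i in range(30))
parts = chunk_html(page, max_tokens=40)
print(len(parts))  # -> 3 (each line is 4 naive tokens, so 10 lines per chunk)
```

Keeping every fragment under a fixed token budget is what lets the downstream Pruner and Extractor run on small context windows.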
```bash
# Install from PyPI
uv pip install axetract

# Or install from source
git clone https://github.com/abdo-Mansour/axetract.git
cd axetract
uv sync
```

```python
from pydantic import BaseModel

from axetract.pipeline import AXEPipeline

# 1. Initialize the pipeline with default LoRA adapters
#    (automatically downloads adapters from the HuggingFace Hub)
pipeline = AXEPipeline.from_config(use_vllm=False)

# 2. Define your desired extraction schema
class Product(BaseModel):
    name: str
    price: str
    rating: float

# 3. Extract from a URL or raw HTML
url = "https://example.com/item/12345"
result = pipeline.extract(url, schema=Product)

# 4. Access your structured data
print(f"Status: {result.status}")
print(f"Prediction: {result.prediction}")
print(f"Source XPaths: {result.xpaths}")
```

AXEtract includes a built-in FastAPI server for high-throughput serving. After installing the package, start it with the installed CLI entry point:
```bash
axe-server
```

Or via `python -m` for development installs:

```bash
python -m axetract.server
```

Configuration is done via environment variables:
| Variable | Default | Description |
|---|---|---|
| `AXE_USE_VLLM` | `false` | Set to `true` to use the vLLM backend |
| `AXE_PORT` | `8000` | Port to listen on |
| `AXE_HOST` | `0.0.0.0` | Host to bind to |
| `AXE_LOG_FILE` | (stderr) | Optional path to a log file |
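For example, combining the variables above to run the server on a non-default port with the vLLM backend and file logging (the port and log path here are arbitrary illustrations):

```shell
# Serve with vLLM on port 9000, logging to a file instead of stderr
export AXE_USE_VLLM=true
export AXE_PORT=9000
export AXE_HOST=0.0.0.0
export AXE_LOG_FILE=/var/log/axetract.log
axe-server
```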
See `axe_server/client_example.py` for examples of interacting with the API via `requests`.
If you use AXEtract in your research, please cite our paper:
```bibtex
@misc{mansour2026axe,
  title={AXE: Low-Cost Cross-Domain Web Structured Information Extraction},
  author={Abdelrahman Mansour and Khaled W. Alshaer and Moataz Elsaban},
  year={2026},
  eprint={2602.01838},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.01838},
}
```

This project is licensed under the MIT License - see the LICENSE file for details.