A lightweight Flask-based API that accepts .doc (Word 97-2003) files, extracts the text using antiword, and returns it in a clean JSON format.
Includes Swagger UI documentation for ease of testing and integration.
- Upload a .doc file via POST request
- Extracts and beautifies plain text from the document
- Returns structured JSON
- Swagger UI (/apidocs) for testing and documentation
- Fully containerized with Docker
- Easy to extend for HTML/Markdown output or other formats
Tech | Purpose |
---|---|
Python 3.11 | Backend language |
Flask | Web framework |
Flasgger | Swagger/OpenAPI documentation for Flask |
antiword | CLI tool to extract text from .doc files |
Docker | Containerization |
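The README does not show how antiword is invoked or how the text is "beautified", so the following is only a minimal sketch of one way the extraction step could work. The function names `beautify_text` and `extract_text` and the specific whitespace rules are assumptions for illustration, not the project's actual code. The only external behavior relied on is that `antiword <file>` prints the document's text to standard output.

```python
import re
import subprocess


def beautify_text(raw: str) -> str:
    """Normalize whitespace in extracted text (one plausible cleanup, assumed)."""
    # Normalize Windows line endings and drop trailing whitespace on each line.
    lines = [line.rstrip() for line in raw.replace("\r\n", "\n").split("\n")]
    text = "\n".join(lines)
    # Collapse runs of three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Collapse runs of two or more spaces/tabs into one space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()


def extract_text(doc_path: str) -> str:
    """Run antiword on doc_path and return the cleaned text."""
    # check=True raises CalledProcessError if antiword exits with an error,
    # for example when the file is not a supported Word document.
    result = subprocess.run(
        ["antiword", doc_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return beautify_text(result.stdout)
```

Keeping the cleanup in its own function makes it easy to swap in different formatting rules later without touching the antiword call.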
Clone the repository, build the image, and run the container:
git clone https://github.com/your-org/doc-text-api.git
cd doc-text-api
docker build -t doc-text-api .
docker run -p 5000:5000 doc-text-api
Now the API is available at: http://localhost:5000
Form field: file (must be a .doc file, the old Word format)
curl -X POST http://localhost:5000/extract-text \
-F "file=@/path/to/your/file.doc"
{
"text": "This is the extracted text from the document."
}
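For callers who prefer Python over curl, the same request can be sketched with the standard library alone. This is an illustrative client, not part of the project: the helper names `build_multipart` and `request_text` are hypothetical, and the `application/msword` content type is an assumption about what the server tolerates. The endpoint, the `file` form field, and the `text` key in the response come from the example above.

```python
import json
import os
import urllib.request
import uuid
from typing import Tuple


def build_multipart(field: str, filename: str, data: bytes) -> Tuple[bytes, str]:
    """Encode one file as multipart/form-data; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/msword\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def request_text(path: str, url: str = "http://localhost:5000/extract-text") -> str:
    """POST a .doc file to the API and return the extracted text."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart("file", os.path.basename(path), fh.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request) as response:
        return json.load(response)["text"]
```

With the container running, `request_text("/path/to/your/file.doc")` should return the same string that appears under `text` in the JSON response.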
Once running, go to: http://localhost:5000/apidocs
You’ll see an interactive Swagger interface to test and explore the API.
doc-text-api/
│
├── app.py # Main Flask application
├── requirements.txt # Python dependencies
├── Dockerfile # Container setup
└── README.md # You're here!
- Install antiword:
  sudo apt install antiword # Debian/Ubuntu
- Set up a Python virtual environment:
  python3 -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
- Run the app:
  python app.py
The Swagger documentation is generated by flasgger from in-code Python docstrings. To customize the Swagger UI or add global info, pass a template in app.py:
swagger = Swagger(app, template={
"info": {
"title": "DOC Text Extractor API",
"description": "API to extract text from .doc (Word 97-2003) files using antiword.",
"version": "1.0.0"
}
})
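For the per-endpoint documentation, flasgger reads a docstring whose first line is the summary and whose part after a `---` line is a Swagger 2.0 spec in YAML. The sketch below shows what such a docstring could look like for the upload endpoint. The function name `extract_text_view`, the tag, and the descriptions are assumptions; the actual docstring in app.py may differ. The route decorator is left as a comment so the snippet stands alone.

```python
# In app.py this function would sit under a Flask route decorator such as
# @app.route("/extract-text", methods=["POST"]); it is omitted here (assumed).
def extract_text_view():
    """Extract plain text from an uploaded .doc file.
    ---
    tags:
      - extraction
    consumes:
      - multipart/form-data
    parameters:
      - name: file
        in: formData
        type: file
        required: true
        description: A Word 97-2003 (.doc) file.
    responses:
      200:
        description: The extracted text.
        schema:
          type: object
          properties:
            text:
              type: string
    """
    # Handler body intentionally omitted; only the docstring shape matters here.
    raise NotImplementedError
```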
- Only accepts .doc files; does not process .docx.
- No authentication is enabled by default.
- Consider wrapping this API behind a gateway or firewall in production environments.
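The first limitation can be enforced before antiword is ever called. The check below is a hypothetical helper (`check_upload` is not a name from the project) that looks at the file extension and the first bytes of the upload. It relies on two well-known signatures: Word 97-2003 files are stored in the OLE2 Compound File container, and .docx files are ZIP archives. Treat it as a rough pre-filter under those assumptions. The OLE2 signature is shared by other legacy Office formats such as .xls, and very old pre-OLE Word files may be rejected by it.

```python
from typing import Optional

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # Compound File Binary signature
ZIP_MAGIC = b"PK\x03\x04"  # .docx files are ZIP archives


def check_upload(filename: str, head: bytes) -> Optional[str]:
    """Return an error message, or None if the upload looks like a .doc file.

    `head` is the first few bytes of the uploaded file (at least 8).
    """
    if not filename.lower().endswith(".doc"):
        return "Only .doc files are accepted."
    if head.startswith(ZIP_MAGIC):
        return "This looks like a .docx file renamed to .doc."
    if not head.startswith(OLE2_MAGIC):
        return "File does not look like a Word 97-2003 document."
    return None
```

Returning a message instead of raising keeps the helper easy to map onto an HTTP 400 response in whatever way the application prefers.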
MIT License — free for personal and commercial use.
Pull requests are welcome! For major changes, open an issue first to discuss what you’d like to change or add.
This tool uses antiword, which only supports old Word binary formats (.doc). For .docx, consider python-docx or pandoc.
Right now, it returns clean plain text. Let us know if you want to support formatting or downloadable outputs!