This project provides a robust REST API, built with FastAPI and Docker, for managing and interacting with Google's Gemma Translate AI models for on-device string translation.
- Translation services
- Support for multiple models
- Automatic API Docs: Interactive API documentation powered by Swagger UI and ReDoc.
- FastAPI for the core web framework.
- Uvicorn as the ASGI server.
- Docker for containerization and easy deployment.
- Pydantic for data validation and settings management.
- Docker Desktop
- Conda (or another Python environment manager)
- Python 3.10+
Create and activate a Conda environment:

```shell
conda create -n translate python=3.11
conda activate translate
```

Install the dependencies, including the `hf` tool (from `huggingface_hub`) used to download the models:

```shell
pip install "fastapi[standard]" "uvicorn[standard]" httpx llama-cpp-python huggingface_hub
```
We're going to fetch the GGUF model files from these repositories:
- https://huggingface.co/mradermacher/translategemma-27b-it-GGUF
- https://huggingface.co/mradermacher/translategemma-12b-it-GGUF
- https://huggingface.co/mradermacher/translategemma-4b-it-GGUF
Download one of the following Gemma Translate models:
Gemma 4B:

```shell
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.IQ4_XS.gguf --local-dir app/models/translategemma-4b-it.IQ4_XS
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q2_K.gguf --local-dir app/models/translategemma-4b-it.Q2_K
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q3_K_L.gguf --local-dir app/models/translategemma-4b-it.Q3_K_L
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q3_K_M.gguf --local-dir app/models/translategemma-4b-it.Q3_K_M
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q3_K_S.gguf --local-dir app/models/translategemma-4b-it.Q3_K_S
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q4_K_M.gguf --local-dir app/models/translategemma-4b-it.Q4_K_M
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q4_K_S.gguf --local-dir app/models/translategemma-4b-it.Q4_K_S
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q5_K_M.gguf --local-dir app/models/translategemma-4b-it.Q5_K_M
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q5_K_S.gguf --local-dir app/models/translategemma-4b-it.Q5_K_S
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q6_K.gguf --local-dir app/models/translategemma-4b-it.Q6_K
hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q8_0.gguf --local-dir app/models/translategemma-4b-it.Q8_0
```
Gemma 12B:

```shell
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.IQ4_XS.gguf --local-dir app/models/translategemma-12b-it.IQ4_XS
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q2_K.gguf --local-dir app/models/translategemma-12b-it.Q2_K
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q3_K_L.gguf --local-dir app/models/translategemma-12b-it.Q3_K_L
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q3_K_M.gguf --local-dir app/models/translategemma-12b-it.Q3_K_M
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q3_K_S.gguf --local-dir app/models/translategemma-12b-it.Q3_K_S
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q4_K_M.gguf --local-dir app/models/translategemma-12b-it.Q4_K_M
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q4_K_S.gguf --local-dir app/models/translategemma-12b-it.Q4_K_S
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q5_K_M.gguf --local-dir app/models/translategemma-12b-it.Q5_K_M
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q5_K_S.gguf --local-dir app/models/translategemma-12b-it.Q5_K_S
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q6_K.gguf --local-dir app/models/translategemma-12b-it.Q6_K
hf download mradermacher/translategemma-12b-it-GGUF translategemma-12b-it.Q8_0.gguf --local-dir app/models/translategemma-12b-it.Q8_0
```
Gemma 27B:

```shell
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.IQ4_XS.gguf --local-dir app/models/translategemma-27b-it.IQ4_XS
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q2_K.gguf --local-dir app/models/translategemma-27b-it.Q2_K
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q3_K_L.gguf --local-dir app/models/translategemma-27b-it.Q3_K_L
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q3_K_M.gguf --local-dir app/models/translategemma-27b-it.Q3_K_M
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q3_K_S.gguf --local-dir app/models/translategemma-27b-it.Q3_K_S
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q4_K_M.gguf --local-dir app/models/translategemma-27b-it.Q4_K_M
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q4_K_S.gguf --local-dir app/models/translategemma-27b-it.Q4_K_S
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q5_K_M.gguf --local-dir app/models/translategemma-27b-it.Q5_K_M
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q5_K_S.gguf --local-dir app/models/translategemma-27b-it.Q5_K_S
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q6_K.gguf --local-dir app/models/translategemma-27b-it.Q6_K
hf download mradermacher/translategemma-27b-it-GGUF translategemma-27b-it.Q8_0.gguf --local-dir app/models/translategemma-27b-it.Q8_0
```
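All of the download commands above follow one naming pattern. As a convenience, here is a small Python sketch that builds the `hf download` command for a given model size and quantization; the `hf_download_command` helper and its defaults are hypothetical, not part of this project:

```python
# Hypothetical helper: build the `hf download` command for one GGUF
# quantization, mirroring the commands listed above.
REPO_TEMPLATE = "mradermacher/translategemma-{size}-it-GGUF"
FILE_TEMPLATE = "translategemma-{size}-it.{quant}.gguf"


def hf_download_command(size: str, quant: str, models_dir: str = "app/models") -> str:
    """Return the hf CLI command that fetches one quantization of one model size."""
    repo = REPO_TEMPLATE.format(size=size)
    filename = FILE_TEMPLATE.format(size=size, quant=quant)
    local_dir = f"{models_dir}/translategemma-{size}-it.{quant}"
    return f"hf download {repo} {filename} --local-dir {local_dir}"


print(hf_download_command("4b", "Q8_0"))
# hf download mradermacher/translategemma-4b-it-GGUF translategemma-4b-it.Q8_0.gguf --local-dir app/models/translategemma-4b-it.Q8_0
```

This keeps each quantization in its own folder under `app/models`, which is what the `--local-dir` arguments above do.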
Running the application with Docker is the easiest and recommended approach.
Build the Docker image:

```shell
docker build -t fastapi_gemma_translate .
docker build -t grctest/fastapi_gemma_translate .
```

Run the Docker container. This command runs the container in detached mode (`-d`) and maps port 8080 on your host to port 8080 in the container:

```shell
docker run -d --name ai_container_cpu -p 127.0.0.1:8080:8080 -v C:/Users/username/Desktop/git/fastapi-gemma-translate/_models:/code/models fastapi_gemma_translate
```
Alternatively, you can pull and run my Docker image:

```shell
docker image pull grctest/fastapi_gemma_translate
```

Then run the container, again in detached mode (`-d`) with port 8080 mapped to the host:

```shell
docker run -d --name ai_container_cpu -p 127.0.0.1:8080:8080 -v C:/Users/username/Desktop/git/fastapi-gemma-translate/_models:/code/models grctest/fastapi_gemma_translate
```
The above commands are for CPU-only mode. If you want Nvidia CUDA GPU support, you'll need to use one of the CUDA Dockerfiles, either by building the Docker image:

```shell
docker build -t fastapi_gemma_translate_cuda:legacy -f LegacyCudaDockerfile .
docker build -t fastapi_gemma_translate_cuda:mainstream -f MainstreamCudaDockerfile .
docker build -t fastapi_gemma_translate_cuda:future -f FutureCudaDockerfile .
docker build -t grctest/fastapi_gemma_translate_cuda:legacy -f LegacyCudaDockerfile .
docker build -t grctest/fastapi_gemma_translate_cuda:mainstream -f MainstreamCudaDockerfile .
docker build -t grctest/fastapi_gemma_translate_cuda:future -f FutureCudaDockerfile .
```

Or by pulling the Docker image:

```shell
docker image pull grctest/fastapi_gemma_translate_cuda:legacy
```

Then you need to run the container with the `--gpus` flag:

```shell
docker run --gpus all -d --name ai_container_gpu -p 127.0.0.1:8080:8080 -v C:/Users/username/Desktop/git/fastapi-gemma-translate/_models:/code/models -e LLAMA_N_GPU_LAYERS=-1 fastapi_gemma_translate_cuda:legacy
docker run --gpus all -d --name ai_container_cuda -p 127.0.0.1:8080:8080 -v C:/Users/username/Desktop/git/fastapi-gemma-translate/_models:/code/models -e LLAMA_N_GPU_LAYERS=-1 grctest/fastapi_gemma_translate_cuda:legacy
```
Note: Replace C:/Users/username/Desktop/git/fastapi-gemma-translate/_models with the path where you downloaded the GGUF folders and files.
Container GPU variants:

- Legacy: Pascal / 10xx Nvidia cards
- Mainstream: Turing to Ada / 20xx, 30xx, 40xx, A100 Nvidia cards
- Future: Blackwell / 50xx Nvidia cards
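If you script your deployment, the variant choice above can be encoded as a lookup. The architecture-name mapping below is an assumption drawn only from the list above, and `cuda_image_tag` is a hypothetical helper, not part of this project:

```python
# Assumed mapping from GPU architecture family to CUDA image variant,
# based on the "Container GPU variants" list above.
VARIANT_BY_ARCH = {
    "pascal": "legacy",       # 10xx cards
    "turing": "mainstream",   # 20xx cards
    "ampere": "mainstream",   # 30xx, A100 cards
    "ada": "mainstream",      # 40xx cards
    "blackwell": "future",    # 50xx cards
}


def cuda_image_tag(arch: str, repo: str = "grctest/fastapi_gemma_translate_cuda"):
    """Return the CUDA image tag for an architecture, or None for the CPU image."""
    variant = VARIANT_BY_ARCH.get(arch.lower())
    return f"{repo}:{variant}" if variant else None


print(cuda_image_tag("Pascal"))  # grctest/fastapi_gemma_translate_cuda:legacy
```

Anything not in the mapping falls back to `None`, signalling that the CPU image should be used.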
For development, you can run the application directly with Uvicorn, which enables auto-reloading:

```shell
uvicorn app.main:app --host 0.0.0.0 --port 8080 --reload
```

llama-cpp-python GGUF inference can become unstable under heavy parallel requests on some CUDA setups.
This service includes a configurable inference gate shared by /translate, /experimental_translation, and /translate_image.
```shell
# safest default: serialize inference
set LLAMA_MAX_CONCURRENT_INFERENCES=1
# max wait in seconds before returning HTTP 503
set LLAMA_INFERENCE_ACQUIRE_TIMEOUT_SECONDS=45
```

Notes:

- Keep `LLAMA_MAX_CONCURRENT_INFERENCES=1` for 12B/27B models unless higher values are validated on your hardware.
- If the queue wait exceeds the timeout, the API returns `503` instead of allowing requests to pile up indefinitely.
- These can be passed alongside existing runtime env vars like `LLAMA_N_GPU_LAYERS`.
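The gate described above can be sketched in Python as a semaphore with a bounded acquire. This is an illustrative model of the behaviour, not the project's actual implementation; `run_gated` and `demo` are hypothetical names:

```python
import asyncio
import os

# Illustrative sketch of the inference gate: a semaphore sized by
# LLAMA_MAX_CONCURRENT_INFERENCES, with a bounded wait that maps a
# queue timeout to an HTTP 503 response.
MAX_CONCURRENT = int(os.environ.get("LLAMA_MAX_CONCURRENT_INFERENCES", "1"))
ACQUIRE_TIMEOUT = float(os.environ.get("LLAMA_INFERENCE_ACQUIRE_TIMEOUT_SECONDS", "45"))


async def run_gated(gate: asyncio.Semaphore, timeout: float, job):
    """Run one inference job behind the gate; return (503, ...) if the wait times out."""
    try:
        await asyncio.wait_for(gate.acquire(), timeout=timeout)
    except asyncio.TimeoutError:
        return 503, "inference queue timeout"
    try:
        return 200, await job()
    finally:
        gate.release()


async def demo():
    gate = asyncio.Semaphore(MAX_CONCURRENT)

    async def job():
        await asyncio.sleep(0.01)  # stand-in for llama-cpp-python inference
        return "ok"

    return await run_gated(gate, ACQUIRE_TIMEOUT, job)


print(asyncio.run(demo()))  # (200, 'ok')
```

With `MAX_CONCURRENT=1`, any request arriving while the gate is held waits up to the timeout and then receives 503, which is the fail-fast behaviour described in the notes above.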
Example:
```shell
docker run --gpus all -d --name ai_container_cuda -p 127.0.0.1:8080:8080 \
  -v C:/Users/username/Desktop/git/fastapi-gemma-translate/_models:/code/models \
  -e LLAMA_N_GPU_LAYERS=-1 \
  -e LLAMA_MAX_CONCURRENT_INFERENCES=1 \
  -e LLAMA_INFERENCE_ACQUIRE_TIMEOUT_SECONDS=45 \
  grctest/fastapi_gemma_translate_cuda:legacy
```

Once the server is running, you can access the interactive API documentation:
- Swagger UI: http://127.0.0.1:8080/docs
- ReDoc: http://127.0.0.1:8080/redoc
Model loading is explicit. Translation endpoints will reject requests unless the requested model is already loaded.
- Load a model:

```shell
curl -X POST "http://127.0.0.1:8080/model/load" \
  -H "Content-Type: application/json" \
  -d '{"model":"translategemma-4b-it-Q8_0"}'
```

- Load a model with vision support (mmproj can be relative to the model folder or absolute):

```shell
curl -X POST "http://127.0.0.1:8080/model/load" \
  -H "Content-Type: application/json" \
  -d '{"model":"translategemma-4b-it-Q8_0","mmproj":"translategemma-4b-it.mmproj-f16.gguf"}'
```

- Check model status:

```shell
curl "http://127.0.0.1:8080/model/status"
curl "http://127.0.0.1:8080/model/status?model=translategemma-4b-it-Q8_0"
```

`/model/status` includes:
- `loaded`: whether any model is loaded
- `loading`: whether a load is currently in progress
- `loaded_model`: currently loaded model name
- `vision_enabled`: whether the currently loaded model can process images
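A client can use those fields to decide whether it is safe to send a translation request. The `model_ready` helper below is a hypothetical client-side sketch over the status payload, not part of this project's API:

```python
# Hypothetical client-side check over a /model/status payload, using the
# field names documented above.
def model_ready(status: dict, model: str, need_vision: bool = False) -> bool:
    """True only if the requested model is fully loaded (and vision-capable if needed)."""
    if status.get("loading") or not status.get("loaded"):
        return False
    if status.get("loaded_model") != model:
        return False
    if need_vision and not status.get("vision_enabled"):
        return False
    return True


status = {
    "loaded": True,
    "loading": False,
    "loaded_model": "translategemma-4b-it-Q8_0",
    "vision_enabled": False,
}
print(model_ready(status, "translategemma-4b-it-Q8_0"))        # True
print(model_ready(status, "translategemma-4b-it-Q8_0", True))  # False (no mmproj loaded)
```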
- `POST /translate` (stable locale list)
- `POST /experimental_translation` (stable + experimental locale list)
Both endpoints reject if:
- no model is loaded
- a model is still loading
- the requested `model` does not match the loaded model
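The rejection rules can be sketched as a simple guard. This is illustrative only; the specific status codes chosen below are assumptions (the document says only that requests are rejected), and `check_request` is a hypothetical name:

```python
# Illustrative guard implementing the rejection rules above. The status
# codes (503/409) are assumptions for the sketch, not the API's documented codes.
def check_request(requested_model, loaded_model, loading: bool):
    """Return an HTTP-style (status, detail) pair for a translation request."""
    if loading:
        return 503, "a model is still loading"
    if loaded_model is None:
        return 503, "no model is loaded"
    if requested_model != loaded_model:
        return 409, "requested model does not match the loaded model"
    return 200, "ok"


print(check_request("translategemma-4b-it-Q8_0", None, False))
# (503, 'no model is loaded')
```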
This API supports image translation via multipart upload using llama-cpp-python vision chat formatting.
- `POST /translate_image` (stable locale list)
Notes:
- Upload images with `multipart/form-data` as field `file`.
- The image stays local to the server process and is sent to the model as a Base64 data URI.
- The loaded model must be vision-enabled.
- Vision is enabled only when `/model/load` is called with an `mmproj` value.
- Image translation requests are rejected if the currently loaded model was not loaded with `mmproj`.
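The "Base64 data URI" note above can be illustrated with a short sketch. `image_to_data_uri` is a hypothetical helper showing the general encoding, not this project's actual code:

```python
import base64

# Illustrative sketch: wrap raw image bytes as a Base64 data URI, the form
# in which the note above says the image is handed to the vision model.
def image_to_data_uri(data: bytes, mime: str = "image/jpeg") -> str:
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# First bytes of a JPEG file (FF D8 FF) encode to "/9j/".
print(image_to_data_uri(b"\xff\xd8\xff"))  # data:image/jpeg;base64,/9j/
```

Because the bytes are embedded in the request to the local model rather than uploaded anywhere else, the image never leaves the server process.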
Example (stable image route):
```shell
curl -X POST "http://127.0.0.1:8080/translate_image" \
  -F "file=@C:/path/to/image.jpg" \
  -F "model=translategemma-4b-it-Q8_0" \
  -F "source_lang_code=en" \
  -F "target_lang_code=es" \
  -F "max_new_tokens=200"
```

This FastAPI Gemma Translate Docker container is used by the Metalglot software translation tool!
This project is licensed under the MIT License. See the LICENSE file for details.