This repository provides examples of using the ALCF Inference Endpoints, an OpenAI-API-compatible service for running Large Language Model (LLM) inference, built using Globus Compute.
Currently, our endpoints are running on two clusters, with more to come:
- Sophia - https://data-portal-dev.cels.anl.gov/resource_server/sophia
- Polaris - https://data-portal-dev.cels.anl.gov/resource_server/polaris
Note: Endpoints are restricted by Globus groups and policy. To access the Inference Endpoints, you need to authenticate with Globus using your Argonne or ALCF credentials, and you must be on Argonne's network. You can run from systems within the Argonne network, or use a VPN, Dash, or an SSH tunnel if working remotely.
- Supported Frameworks
- API Endpoints
- Available Models
- Accessing Endpoints
- Inference Execution
- Prerequisites
- Usage
- Troubleshooting/FAQs
- Contact us/Support
- vLLM - https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm
- llama.cpp (under testing) - https://data-portal-dev.cels.anl.gov/resource_server/sophia/llama-cpp
The OpenAI API chat completions and completions endpoints are available, with batch
processing for non-interactive use cases coming soon.
- chat completions - https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/chat/completions
- completions - https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/completions
- To access the Inference endpoints, you need to authenticate with Globus using your Argonne or ALCF credentials.
- Qwen/Qwen2.5-14B-Instruct
- Qwen/Qwen2.5-7B-Instruct
- meta-llama/Meta-Llama-3-70B-Instruct
- meta-llama/Meta-Llama-3-8B-Instruct
- meta-llama/Meta-Llama-3.1-70B-Instruct
- meta-llama/Meta-Llama-3.1-8B-Instruct
- mistralai/Mistral-7B-Instruct-v0.3
- mistralai/Mistral-Large-Instruct-2407
- mistralai/Mixtral-8x22B-Instruct-v0.1
Note: To request new models/endpoints, please add the HF-compatible model to the path /eagle/argonne_tpc/model_weights/ and contact Aditya Tanikanti, raise an issue in this repository, or reach out via Slack, and we will add it promptly.
The models are currently run as part of a 24-hour job on Sophia. The endpoints are dynamically acquired and activated when the first query is performed by any group member, and subsequent queries by group members will reuse the running job/endpoint. Persistent endpoints are not yet available; we are internally collecting usage metrics and will add a persistent endpoint service shortly.
On Polaris, the models are currently run as part of a debug job with a 1-hour duration.
- A Python environment with globus_sdk installed:
conda create -n globus_env python=3.11.9 -y
conda activate globus_env
pip install globus_sdk
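If conda is not available, a standard Python virtual environment works as well (a minimal sketch, assuming a recent Python 3):
# Create and activate a virtual environment, then install the SDK
python3 -m venv globus_env
source globus_env/bin/activate
pip install globus_sdk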
- Create an access token. This script creates the access_token.txt file:
python3 generate_auth_token.py
access_token=$(cat access_token.txt)
Note: Once an access_token is created, it will be active for 24 hours.
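Because the token expires after 24 hours, long-running or scheduled workflows may want to refresh it automatically. A minimal bash sketch, assuming generate_auth_token.py writes access_token.txt to the current directory:
# Regenerate the token if it is missing or older than 24 hours (1440 minutes)
if [ ! -f access_token.txt ] || [ -n "$(find access_token.txt -mmin +1440)" ]; then
    python3 generate_auth_token.py
fi
access_token=$(cat access_token.txt)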
- Access to the endpoints is restricted to systems on the Argonne network. Use a VPN, Dash, or an SSH tunnel from ALCF compute systems if working remotely.
You can use curl or Python to interact with the Inference API.
After running generate_auth_token.py, you can list all available endpoints:
python3 generate_auth_token.py
access_token=$(cat access_token.txt)
curl -X GET "https://data-portal-dev.cels.anl.gov/resource_server/list-endpoints" \
-H "Authorization: Bearer ${access_token}"
#!/bin/bash
# Define the access token
access_token=$(cat access_token.txt)
# Define the base URL
base_url="https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/chat/completions"
# Define the model and parameters
model="mistralai/Mistral-7B-Instruct-v0.3"
temperature=0.2
max_tokens=150
# Define an array of messages
messages=(
"List all proteins that interact with RAD51"
"What are the symptoms of diabetes?"
"How does photosynthesis work?"
)
# Loop through the messages and send a POST request for each
for message in "${messages[@]}"; do
curl -X POST "$base_url" \
-H "Authorization: Bearer ${access_token}" \
-H "Content-Type: application/json" \
-d '{
"model": "'$model'",
"temperature": '$temperature',
"max_tokens": '$max_tokens',
"messages":[{"role": "user", "content": "'"$message"'"}]
}'
done
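Since the service is OpenAI-compatible, responses should follow the standard chat completion schema. If jq is installed, you can extract just the generated text; a sketch reusing the variables defined in the script above:
# Extract only the model's reply; assumes the standard OpenAI response schema
curl -s -X POST "$base_url" \
  -H "Authorization: Bearer ${access_token}" \
  -H "Content-Type: application/json" \
  -d '{"model": "'$model'", "temperature": '$temperature', "max_tokens": '$max_tokens', "messages": [{"role": "user", "content": "How does photosynthesis work?"}]}' \
  | jq -r '.choices[0].message.content'
For the completions endpoint, the analogous field is .choices[0].text. The following script sends completion requests the same way: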
#!/bin/bash
# Define the access token
access_token=$(cat access_token.txt)
# Define the base URL
base_url="https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/completions"
# Define the model and parameters
model="meta-llama/Meta-Llama-3.1-8B-Instruct"
temperature=0.2
max_tokens=150
# Define an array of prompts
prompts=(
"List all proteins that interact with RAD51"
"What are the symptoms of diabetes?"
"How does photosynthesis work?"
)
# Loop through the prompts and send a POST request for each
for prompt in "${prompts[@]}"; do
echo "'"$prompt"'"
curl -X POST "$base_url" \
-H "Authorization: Bearer ${access_token}" \
-H "Content-Type: application/json" \
-d '{
"model": "'$model'",
"temperature": '$temperature',
"max_tokens": '$max_tokens',
"prompt":"'"$prompt"'"
}'
done
For more examples, see curl-requests.sh.
First, ensure you have generated the authentication token by running generate_auth_token.py.
import requests
import json
# Define the access token
with open('access_token.txt', 'r') as file:
access_token = file.read().strip()
# Define the base URL
base_url = "https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/chat/completions"
# Define the model and parameters
model = "mistralai/Mistral-7B-Instruct-v0.3"
temperature = 0.2
max_tokens = 150
# Define an array of messages
messages = [
"List all proteins that interact with RAD51",
"What are the symptoms of diabetes?",
"How does photosynthesis work?"
]
# Function to send POST requests
def send_request(message):
headers = {
'Authorization': f'Bearer {access_token}',
'Content-Type': 'application/json'
}
data = {
"model": model,
"temperature": temperature,
"max_tokens": max_tokens,
"messages": [{"role": "user", "content": message}]
}
response = requests.post(base_url, headers=headers, data=json.dumps(data))
return response.json()
# Loop through the messages and send a POST request for each
for message in messages:
response = send_request(message)
print(response)
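The full JSON response can be verbose. Assuming it follows the standard OpenAI chat completion schema, the generated text can be pulled out directly:
# Print only the generated text; assumes the standard OpenAI response schema
for message in messages:
    response = send_request(message)
    print(response["choices"][0]["message"]["content"])
For the completions endpoint, the analogous field is response["choices"][0]["text"]. The completions script follows the same pattern: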
import requests
import json
# Define the access token
with open('access_token.txt', 'r') as file:
access_token = file.read().strip()
# Define the base URL
base_url = "https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/completions"
# Define the model and parameters
model = "meta-llama/Meta-Llama-3.1-8B-Instruct"
temperature = 0.2
max_tokens = 150
# Define an array of prompts
prompts = [
"List all proteins that interact with RAD51",
"What are the symptoms of diabetes?",
"How does photosynthesis work?"
]
# Function to send POST requests
def send_request(prompt):
headers = {
'Authorization': f'Bearer {access_token}',
'Content-Type': 'application/json'
}
data = {
"model": model,
"temperature": temperature,
"max_tokens": max_tokens,
"prompt": prompt
}
response = requests.post(base_url, headers=headers, data=json.dumps(data))
return response.json()
# Loop through the prompts and send a POST request for each
for prompt in prompts:
response = send_request(prompt)
print(response)
To run these Python scripts, save each script to a file (e.g., chat_completions.py and completions.py), then execute them using Python:
python3 chat_completions.py
python3 completions.py
First, ensure you have generated the authentication token by running generate_auth_token.py.
Install the OpenAI package if you haven't already:
pip install openai
from openai import OpenAI
# Define the access token
with open('access_token.txt', 'r') as file:
access_token = file.read().strip()
# Define the base URL (the client appends /chat/completions itself)
base_url = "https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1"
# Define the model and parameters
model = "mistralai/Mistral-7B-Instruct-v0.3"
temperature = 0.2
max_tokens = 150
# Define an array of messages
messages = [
"List all proteins that interact with RAD51",
"What are the symptoms of diabetes?",
"How does photosynthesis work?"
]
# Create the client, passing the access token as the API key
client = OpenAI(
api_key=access_token,
base_url=base_url,
)
# Function to send POST requests
def send_request(message):
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": message}],
temperature=temperature,
max_tokens=max_tokens
)
return response
# Loop through the messages and send a POST request for each
for message in messages:
response = send_request(message)
print(response)
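The OpenAI client returns a typed response object rather than raw JSON; assuming the standard SDK response shape, you can print only the reply text:
# Print only the generated text from the typed response object
print(response.choices[0].message.content)
The completions client below exposes the analogous attribute response.choices[0].text.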
from openai import OpenAI
# Define the access token
with open('access_token.txt', 'r') as file:
access_token = file.read().strip()
# Define the base URL (the client appends /completions itself)
base_url = "https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1"
# Define the model and parameters
model = "meta-llama/Meta-Llama-3.1-8B-Instruct"
temperature = 0.2
max_tokens = 150
# Define an array of prompts
prompts = [
"List all proteins that interact with RAD51",
"What are the symptoms of diabetes?",
"How does photosynthesis work?"
]
# Create the client, passing the access token as the API key
client = OpenAI(
api_key=access_token,
base_url=base_url,
)
# Function to send POST requests
def send_request(prompt):
response = client.completions.create(
model=model,
prompt=prompt,
temperature=temperature,
max_tokens=max_tokens
)
return response
# Loop through the prompts and send a POST request for each
for prompt in prompts:
response = send_request(prompt)
print(response)
To run these Python scripts, save each script to a file (e.g., chat_completions_openai.py and completions_openai.py), then execute them using Python:
python3 chat_completions_openai.py
python3 completions_openai.py
Refer to remote_inference_gateway.ipynb for more detailed examples.
- If you see this error:
requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='data-portal-dev.cels.anl.gov', port=443): Max retries exceeded with url: /resource_server/sophia/vllm/v1/chat/completions (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x1496ce979550>, 'Connection to data-portal-dev.cels.anl.gov timed out. (connect timeout=None)'))
- Check if the access token is expired and regenerate it. If you're using environment variables for the access token, ensure it is correctly set.
- Check if you are accessing the API from within the Argonne network.
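A connect timeout usually means the host is unreachable from your network. Rather than letting requests hang, you can set an explicit timeout and catch the exception; a minimal sketch, reusing the variables (access_token, base_url, model, temperature, max_tokens) from the Python examples above:
import requests

# Assumes access_token, base_url, model, temperature, and max_tokens
# are defined as in the Python examples above
headers = {"Authorization": f"Bearer {access_token}", "Content-Type": "application/json"}
data = {
    "model": model,
    "temperature": temperature,
    "max_tokens": max_tokens,
    "messages": [{"role": "user", "content": "How does photosynthesis work?"}],
}
try:
    # Fail fast instead of hanging indefinitely on an unreachable host
    response = requests.post(base_url, headers=headers, json=data, timeout=60)
    response.raise_for_status()
    print(response.json())
except requests.exceptions.ConnectTimeout:
    print("Connection timed out: are you on the Argonne network (VPN/Dash/SSH tunnel)?")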
Contact Benoit Cote, Aditya Tanikanti, or ALCF Support.