πŸ€– ALCF Inference Endpoints

Unlock Powerful Large Language Model Inference at Argonne Leadership Computing Facility (ALCF)

Table of Contents

  • Overview
  • Available Clusters
  • Supported Frameworks
  • API Endpoints
    • Chat Completions
    • Completions
    • Embeddings
  • Available Models
    • Chat Language Models
    • Deepseek Family
    • Allenai Family
    • Vision Language Models
    • Embedding Models
  • Inference Execution
    • Performance and Wait Times
    • Cluster-Specific Details
  • Prerequisites
    • Python SDK Setup
    • Authentication
  • Usage Examples
    • Curl Request Examples
    • Python Implementations
  • Troubleshooting
  • Contact Us

🌐 Overview

The ALCF Inference Endpoints provide a robust API for running Large Language Model (LLM) inference using Globus Compute on ALCF HPC Clusters.

πŸ–₯️ Available Clusters

Cluster | Endpoint
------- | --------
Sophia  | https://data-portal-dev.cels.anl.gov/resource_server/sophia

πŸ”’ Access Note:

  • Endpoints are restricted. You must be on Argonne's network (use a VPN, Dash, or SSH into an ANL machine).
  • You will need to authenticate with Argonne or ALCF SSO (Single Sign On) using your credentials. See Authentication.

🧩 Supported Frameworks

  • vLLM (chat and text completions)
  • Infinity (embeddings)

πŸš€ API Endpoints

Chat Completions

https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/chat/completions

Completions

https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/completions

Embeddings

https://data-portal-dev.cels.anl.gov/resource_server/sophia/infinity/v1/embeddings

πŸ“ Note Currently embeddings are only supported by the infinity framework. See usage and/or refer to OpenAI API docs for examples

πŸ“š Available Models

πŸ’¬ Chat Language Models

Qwen Family

  • Qwen/Qwen2.5-14B-Instruct
  • Qwen/Qwen2.5-7B-Instruct
  • Qwen/QwQ-32B-Preview

Meta Llama Family

  • meta-llama/Meta-Llama-3-70B-Instruct
  • meta-llama/Meta-Llama-3-8B-Instruct
  • meta-llama/Meta-Llama-3.1-70B-Instruct
  • meta-llama/Meta-Llama-3.1-8B-Instruct
  • meta-llama/Meta-Llama-3.1-405B-Instruct
  • meta-llama/Llama-3.3-70B-Instruct

Mistral Family

  • mistralai/Mistral-7B-Instruct-v0.3
  • mistralai/Mistral-Large-Instruct-2407
  • mistralai/Mixtral-8x22B-Instruct-v0.1

Nvidia Nemotron Family

  • mgoin/Nemotron-4-340B-Instruct-hf

Aurora GPT Family

  • argonne-private/AuroraGPT-7B (previously called auroragpt/auroragpt-0.1-chkpt-7B-Base)
  • argonne-private/AuroraGPT-IT-v4-0125 (previously called auroragpt/auroragpt-0.1-chkpt-7B-IT)
  • argonne-private/AuroraGPT-Tulu3-SFT-0125
  • argonne-private/AuroraGPT-KTO-1902 (previously called auroragpt/auroragpt-0.1-chkpt-7B-KTO)
  • argonne-private/AuroraGPT-DPO-1902 (previously called auroragpt/auroragpt-0.1-chkpt-7B-DPO)
  • argonne-private/AuroraGPT-SFT-190

Deepseek Family

  • deepseek-ai/DeepSeek-R1 (Not supported natively on A100 GPUs. Under Testing)
  • deepseek-ai/DeepSeek-V3 (Not supported natively on A100 GPUs. Under Testing)

Allenai Family

  • allenai/Llama-3.1-Tulu-3-405B

πŸ‘οΈ Vision Language Models

Qwen Family

  • Qwen/Qwen2-VL-72B-Instruct

Meta Llama Family

  • meta-llama/Llama-3.2-90B-Vision-Instruct

🧲 Embedding Models

Nvidia Family

  • nvidia/NV-Embed-v2

πŸ“ Want to add a model? Add the HF-compatible, framework-supported model weights to /eagle/argonne_tpc/model_weights/ and contact Aditya Tanikanti

🧩 Inference Execution

Performance and Wait Times

When interacting with the inference endpoints, it's crucial to understand the system's operational characteristics:

  1. Initial Model Loading

    • The first query for a "cold" model takes approximately 10-15 minutes
    • Loading time depends on the specific model's size
    • A node must first be acquired and the model loaded into memory
  2. Cluster Resource Constraints

    • These endpoints run on a High-Performance Computing (HPC) cluster as PBS jobs
    • The cluster is used for multiple tasks beyond inference
    • During high-demand periods, your job might be queued
    • You may need to wait until computational resources become available
  3. Job and Model Running Status

    • To view currently running jobs, along with the models served on the cluster, run curl -X GET "https://data-portal-dev.cels.anl.gov/resource_server/sophia/jobs" -H "Authorization: Bearer ${access_token}". See Authentication for how to obtain the access_token. A Python equivalent is sketched below.
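
A minimal Python sketch of the same status query, assuming inference_auth_token.py (see Authentication) is in your working directory:

import requests
from inference_auth_token import get_access_token

# Retrieve a valid access token (refreshed automatically if expired)
access_token = get_access_token()

# List running jobs and the models they are serving
response = requests.get(
    "https://data-portal-dev.cels.anl.gov/resource_server/sophia/jobs",
    headers={"Authorization": f"Bearer {access_token}"},
)
response.raise_for_status()
print(response.json())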

🚧 Future Improvements:

  • The team is actively working on implementing a node reservation system to mitigate wait times and improve user experience.
  • If you’re interested in extended model runtimes, reservations, or private model deployments, please get in touch with us.

Cluster-Specific Details

Sophia Cluster

The models currently run as part of a 24-hour job on Sophia. Here's how the endpoint activation works:

  • The first query by a user dynamically acquires and activates the endpoints (approximately 10-15 minutes).
  • Subsequent queries by users will re-use the running job/endpoint.
  • Running endpoints that are idle for more than 2 hours will be terminated in order to re-allocate resources to other HPC jobs.

πŸ› οΈ Prerequisites

Python SDK Setup

# Create a new Conda environment
conda create -n globus_env python=3.11.9 -y
conda activate globus_env

# Install Globus SDK (version 3.46.0 or later is required)
pip install 'globus_sdk>=3.46.0'

# Install optional OpenAI client package
pip install openai
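
To confirm the installed SDK meets the minimum version, a quick sanity check from the activated environment (a sketch, not part of the official setup):

# Print the installed Globus SDK version; it should be 3.46.0 or later
import globus_sdk
print(globus_sdk.__version__)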

Authentication

Download the script to manage access tokens:

wget https://raw.githubusercontent.com/argonne-lcf/inference-endpoints/refs/heads/main/inference_auth_token.py

Authenticate with your Globus account:

python inference_auth_token.py authenticate

The above command will generate an access token and a refresh token, and store them in your home directory.

If you need to re-authenticate from scratch in order to (1) change your Globus account or (2) resolve a "Permission denied from internal policies" error, first log out of your account by visiting https://app.globus.org/logout, then run the following command:

python inference_auth_token.py authenticate --force

View your access token:

python inference_auth_token.py get_access_token

If your current access token is expired, the above command will automatically generate a new token without human intervention.

⏰ Token Validity: All access tokens are valid for 48 hours, but the refresh token allows you to acquire new access tokens programmatically without re-authenticating. Refresh tokens do not expire unless they are left unused for 6 months or more. However, an internal policy forces users to re-authenticate every 7 days.
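
Because access tokens expire after 48 hours, long-running scripts should fetch a token at request time rather than caching one at startup. A minimal sketch, assuming inference_auth_token.py is importable from the working directory:

import requests
from inference_auth_token import get_access_token

def send_chat(message):
    # Fetch the token per request; get_access_token() transparently
    # refreshes an expired access token using the stored refresh token
    token = get_access_token()
    response = requests.post(
        "https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/chat/completions",
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        json={
            "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
            "messages": [{"role": "user", "content": message}],
        },
    )
    response.raise_for_status()
    return response.json()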

πŸ”’ Access Note:

  • Endpoints are restricted. You must be on Argonne's network (use a VPN, Dash, or SSH into an ANL machine).
  • You will need to authenticate with Argonne or ALCF SSO (Single Sign On) using your credentials.

πŸ’‘ Usage Examples

🌟 Curl Request Examples

List the status of running jobs/endpoints on the cluster
#!/bin/bash

# Get your access token
access_token=$(python inference_auth_token.py get_access_token)

curl -X GET "https://data-portal-dev.cels.anl.gov/resource_server/sophia/jobs" \
 -H "Authorization: Bearer ${access_token}"
List all available endpoints
#!/bin/bash

# Get your access token
access_token=$(python inference_auth_token.py get_access_token)


curl -X GET "https://data-portal-dev.cels.anl.gov/resource_server/list-endpoints" \
 -H "Authorization: Bearer ${access_token}"
Chat Completions Curl Example
#!/bin/bash

# Get your access token
access_token=$(python inference_auth_token.py get_access_token)

# Define the base URL
base_url="https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/chat/completions"

# Define the model and parameters
model="meta-llama/Meta-Llama-3.1-8B-Instruct"
temperature=0.2
max_tokens=150

# Define an array of messages
messages=(
  "List all proteins that interact with RAD51"
  "What are the symptoms of diabetes?"
  "How does photosynthesis work?"
)

# Loop through the messages and send a POST request for each
for message in "${messages[@]}"; do
  curl -X POST "$base_url" \
       -H "Authorization: Bearer ${access_token}" \
       -H "Content-Type: application/json" \
       -d '{
              "model": "'$model'",
              "temperature": '$temperature',
              "max_tokens": '$max_tokens',
              "messages":[{"role": "user", "content": "'"$message"'"}]
           }'
done
Completions Curl Example
#!/bin/bash

# Get your access token
access_token=$(python inference_auth_token.py get_access_token)

# Define the base URL
base_url="https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/completions"

# Define the model and parameters
model="meta-llama/Meta-Llama-3.1-8B-Instruct"
temperature=0.2
max_tokens=150

# Define an array of prompts
prompts=(
  "List all proteins that interact with RAD51"
  "What are the symptoms of diabetes?"
  "How does photosynthesis work?"
)

# Loop through the prompts and send a POST request for each
for prompt in "${prompts[@]}"; do
  echo "$prompt"
  curl -X POST "$base_url" \
       -H "Authorization: Bearer ${access_token}" \
       -H "Content-Type: application/json" \
       -d '{
              "model": "'$model'",
              "temperature": '$temperature',
              "max_tokens": '$max_tokens',
              "prompt":"'"$prompt"'"
           }'
done

🐍 Python Implementations

Using Requests
import requests
import json
from inference_auth_token import get_access_token

# Get your access token
access_token = get_access_token()

# Chat Completions Example
def send_chat_request(message):
    url = "https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/chat/completions"
    headers = {
        'Authorization': f'Bearer {access_token}',
        'Content-Type': 'application/json'
    }
    data = {
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": message}]
    }
    response = requests.post(url, headers=headers, data=json.dumps(data))
    return response.json()

output = send_chat_request("What is the purpose of life?")
print(output)
Using OpenAI Package
from openai import OpenAI
from inference_auth_token import get_access_token

# Get your access token
access_token = get_access_token()

client = OpenAI(
    api_key=access_token,
    base_url="https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1"
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

print(response)
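
The above prints the full response object; to extract only the generated text, read the first choice:

print(response.choices[0].message.content)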
Using Vision Model
from openai import OpenAI
import base64
from inference_auth_token import get_access_token

# Get your access token
access_token = get_access_token()
    
# Initialize the client
client = OpenAI(
    api_key=access_token,
    base_url="https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1"
)

# Function to encode image to base64
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Prepare the image
image_path = "scientific_diagram.png"
base64_image = encode_image(image_path)

# Create vision model request
response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the key components in this scientific diagram"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}}
            ]
        }
    ],
    max_tokens=300
)

# Print the model's analysis of the image
print(response.choices[0].message.content)
Using Embedding Model
from openai import OpenAI
from inference_auth_token import get_access_token

# Get your access token
access_token = get_access_token()
 
# Initialize the client
client = OpenAI(
    api_key=access_token,
    base_url="https://data-portal-dev.cels.anl.gov/resource_server/sophia/infinity/v1"
)

# Create Embeddings
completion = client.embeddings.create(
  model="nvidia/NV-Embed-v2",
  input="The food was delicious and the waiter...",
  encoding_format="float"
)

# Print the embedding response
print(completion)
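
The embedding vectors are returned in completion.data. As an illustration (not part of the service API), cosine similarity between two embeddings can be computed with standard Python, reusing the client defined above:

import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Embed two sentences in one request and compare them
pair = client.embeddings.create(
    model="nvidia/NV-Embed-v2",
    input=["The food was delicious.", "The meal tasted great."],
    encoding_format="float",
)
print(cosine_similarity(pair.data[0].embedding, pair.data[1].embedding))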

🚨 Troubleshooting

  • Connection Timeout?
    • Regenerate your access token (see Authentication)
    • Verify that you are on Argonne's network
    • Your job may be queued; during high-demand periods the cluster can have many pending jobs ahead of yours

πŸ“ž Contact Us

For model additions, extended runtimes, reservations, or private model deployments, contact Aditya Tanikanti.
