Evaluating XPULink Models with OpenBench¶
This guide will help you use the OpenBench framework to evaluate and test AI models hosted on the XPULink platform. OpenBench is a model evaluation tool that supports multiple standard benchmarks and evaluation metrics.
What is OpenBench?¶
OpenBench is an open-source AI model evaluation framework that supports:

- Multiple standard evaluation benchmarks (MMLU, GSM8K, HellaSwag, etc.)
- Custom evaluation tasks
- An OpenAI-compatible API interface
- Detailed performance reports and analysis
Requirements¶
- Python 3.8+
- XPULink API Key (obtain from www.xpulink.ai)
- OpenBench framework
Installation Steps¶
1. Install OpenBench¶
```bash
# Install OpenBench using pip
pip install openbench

# Or install from source
git clone https://github.com/OpenBMB/OpenBench.git
cd OpenBench
pip install -e .
```
2. Configure Environment Variables¶
Create a .env file or set the following environment variables in your system:
```bash
# XPULink API Key
export XPU_API_KEY=your_api_key_here

# XPULink API Base URL
export OPENAI_API_BASE=https://www.xpulink.ai/v1
```
Or create a .env file in your project directory:
```
XPU_API_KEY=your_api_key_here
OPENAI_API_BASE=https://www.xpulink.ai/v1
```
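Before running any evaluation, it can help to confirm the variables are actually visible to Python. A minimal sanity check, assuming python-dotenv is installed (it is used throughout this guide):

```python
import os

from dotenv import load_dotenv

# Load variables from .env into the process environment
load_dotenv()

# Fail fast if the key is missing
assert os.getenv("XPU_API_KEY"), "XPU_API_KEY is not set"
print("XPU_API_KEY loaded, base URL:", os.getenv("OPENAI_API_BASE"))
```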
Using OpenBench to Test XPULink Models¶
Basic Configuration Example¶
Create a configuration file xpulink_config.yaml:
```yaml
# XPULink model configuration
model:
  type: openai                          # Use the OpenAI-compatible interface
  name: qwen3-32b                       # Model name on XPULink
  api_key: ${XPU_API_KEY}               # Read from environment variable
  base_url: https://www.xpulink.ai/v1   # XPULink API base URL

# Evaluation configuration
evaluation:
  benchmarks:
    - mmlu       # Multi-task language understanding
    - gsm8k      # Mathematical reasoning
    - hellaswag  # Common sense reasoning

# Generation parameters
generation:
  temperature: 0.0  # Deterministic output
  max_tokens: 2048
  top_p: 1.0
```
Python Code Example¶
```python
import os

from dotenv import load_dotenv
from openai import OpenAI

# Load environment variables
load_dotenv()

# Configure an OpenAI-compatible client pointed at XPULink
client = OpenAI(
    api_key=os.getenv("XPU_API_KEY"),
    base_url="https://www.xpulink.ai/v1",
)


def test_xpulink_model():
    """Test whether the XPULink model is accessible."""
    try:
        response = client.chat.completions.create(
            model="qwen3-32b",
            messages=[
                {"role": "user", "content": "Please explain artificial intelligence in one sentence."}
            ],
            max_tokens=100,
            temperature=0.7,
        )
        print("Model responded successfully!")
        print("Response content:", response.choices[0].message.content)
        return True
    except Exception as e:
        print(f"Connection failed: {e}")
        return False


if __name__ == "__main__":
    test_xpulink_model()
```
Running Evaluation¶
Method 1: Using Command Line¶
```bash
# Run a single benchmark
openbench evaluate \
  --model-type openai \
  --model-name qwen3-32b \
  --api-key $XPU_API_KEY \
  --base-url https://www.xpulink.ai/v1 \
  --benchmark mmlu

# Run multiple benchmarks
openbench evaluate \
  --config xpulink_config.yaml \
  --output results/xpulink_evaluation.json
```
Method 2: Using Python Script¶
Create run_evaluation.py:
```python
import os

from dotenv import load_dotenv
from openbench import Evaluator

# Load environment variables
load_dotenv()

# Configure the evaluator
evaluator = Evaluator(
    model_type="openai",
    model_name="qwen3-32b",
    api_key=os.getenv("XPU_API_KEY"),
    base_url="https://www.xpulink.ai/v1",
)

# Run the evaluation
results = evaluator.run_benchmarks([
    "mmlu",       # Multi-task language understanding
    "gsm8k",      # Mathematical reasoning
    "hellaswag",  # Common sense reasoning
])

# Save the results
evaluator.save_results(results, "results/xpulink_evaluation.json")

# Print a summary
print("\nEvaluation Results Summary:")
for benchmark, scores in results.items():
    print(f"{benchmark}: {scores['accuracy']:.2%}")
```
Run the script:
```bash
python run_evaluation.py
```
Advanced Configuration¶
Custom Evaluation Tasks¶
Create custom_evaluation.py:
```python
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

# Configure an OpenAI-compatible client pointed at XPULink
client = OpenAI(
    api_key=os.getenv("XPU_API_KEY"),
    base_url="https://www.xpulink.ai/v1",
)


def evaluate_custom_task(questions, model="qwen3-32b"):
    """Run a custom evaluation task.

    Args:
        questions: List of question dicts with "question" and optional "expected_answer".
        model: Model name.

    Returns:
        List of per-question evaluation results.
    """
    results = []
    for q in questions:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a professional AI assistant."},
                {"role": "user", "content": q["question"]},
            ],
            max_tokens=512,
            temperature=0.0,
        )
        answer = response.choices[0].message.content
        results.append({
            "question": q["question"],
            "model_answer": answer,
            "expected_answer": q.get("expected_answer"),
            "correct": answer.strip() == q.get("expected_answer", "").strip(),
        })
    return results


# Sample question set
questions = [
    {
        "question": "What is machine learning?",
        "expected_answer": "Machine learning is a branch of artificial intelligence that allows computers to automatically improve performance through data and experience.",
    },
    {
        "question": "What type of programming language is Python?",
        "expected_answer": "Python is a high-level, interpreted, object-oriented programming language.",
    },
]

# Run the evaluation
results = evaluate_custom_task(questions)

# Calculate accuracy
accuracy = sum(1 for r in results if r["correct"]) / len(results)
print(f"Accuracy: {accuracy:.2%}")
Batch Testing Multiple Models¶
```python
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

# Configure an OpenAI-compatible client pointed at XPULink
client = OpenAI(
    api_key=os.getenv("XPU_API_KEY"),
    base_url="https://www.xpulink.ai/v1",
)

# Models to test
models_to_test = [
    "qwen3-32b",
    "qwen3-14b",
    "llama3-70b",
]


def benchmark_models(models, test_prompt):
    """Compare the output of multiple models on the same prompt."""
    results = {}
    for model in models:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": test_prompt}],
                max_tokens=100,
                temperature=0.0,
            )
            results[model] = {
                "success": True,
                "response": response.choices[0].message.content,
                "tokens": response.usage.total_tokens,
            }
        except Exception as e:
            results[model] = {
                "success": False,
                "error": str(e),
            }
    return results


# Run the comparison
test_prompt = "Please explain what deep learning is, in no more than 50 words."
comparison = benchmark_models(models_to_test, test_prompt)

# Print the results
for model, result in comparison.items():
    print(f"\nModel: {model}")
    if result["success"]:
        print(f"Response: {result['response']}")
        print(f"Token Usage: {result['tokens']}")
    else:
        print(f"Error: {result['error']}")
```
Supported Evaluation Benchmarks¶
OpenBench supports the following standard evaluation benchmarks for testing XPULink models:
| Benchmark | Description | Evaluates |
|---|---|---|
| MMLU | Massive Multitask Language Understanding | Knowledge breadth, domain expertise |
| GSM8K | Grade School Math | Mathematical reasoning, problem-solving |
| HellaSwag | Common sense reasoning | Common sense understanding, context completion |
| TruthfulQA | Truthful question answering | Factual accuracy, honesty |
| HumanEval | Code generation | Programming capability, code understanding |
| MBPP | Python programming benchmark | Basic programming skills |
Result Analysis¶
After evaluation completes, OpenBench generates detailed reports including:
- Accuracy: Proportion of correctly answered questions
- F1 Score: Harmonic mean of precision and recall
- Inference Time: Average response time per task
- Token Usage: Token consumption of API calls
- Cost Estimation: Cost estimates based on token usage
Example of loading and inspecting the results:

```python
import json

# Load the evaluation results
with open("results/xpulink_evaluation.json", "r") as f:
    results = json.load(f)

# Print the full results
print(json.dumps(results, indent=2, ensure_ascii=False))
```
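The cost estimate in the report is derived from token usage, and the same arithmetic is easy to reproduce yourself. A minimal sketch with a placeholder price (substitute XPULink's actual per-token pricing):

```python
def estimate_cost(total_tokens: int, price_per_million_tokens: float) -> float:
    """Rough cost estimate from token usage; the price argument is a placeholder."""
    return total_tokens / 1_000_000 * price_per_million_tokens


# Example: 250,000 tokens at a hypothetical price of 2.0 (currency units) per million tokens
print(f"Estimated cost: {estimate_cost(250_000, 2.0):.2f}")
```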
FAQ¶
Q: How to obtain XPU_API_KEY?¶
A: Visit www.xpulink.ai to register an account, then create and obtain your API Key from the API key management page in the console.
Q: What if API timeout occurs during evaluation?¶
A: You can increase the client's timeout (in seconds) when constructing it:

```python
# Wait up to 60 seconds per request
client = OpenAI(api_key=os.getenv("XPU_API_KEY"), base_url="https://www.xpulink.ai/v1", timeout=60)
```
Q: How to view detailed evaluation logs?¶
A: Enable verbose logging mode:
```bash
openbench evaluate --config config.yaml --verbose --log-file evaluation.log
```
Q: Which XPULink models are supported?¶
A: OpenBench supports all models on the XPULink platform that are compatible with the OpenAI API format. Common ones include:

- qwen3-32b
- qwen3-14b
- llama3-70b
- deepseek-chat

Please visit the XPULink official documentation for the complete model list.
Q: How to control evaluation costs?¶
A: We recommend the following measures:
1. Test on a small dataset first
2. Use max_tokens to limit generation length
3. Set temperature=0.0 for deterministic output, so runs don't need to be repeated
4. Use a caching mechanism to avoid duplicate calls (see the sketch below)
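A minimal in-memory cache keyed on the model and prompt might look like this; it is a sketch rather than an OpenBench feature and reuses the `client` object configured in the earlier scripts:

```python
# Simple in-memory cache keyed on (model, prompt); it lives only for the process lifetime
_cache = {}


def cached_completion(model, prompt, **kwargs):
    """Return a cached answer when the same model/prompt pair has been seen before."""
    key = (model, prompt)
    if key not in _cache:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            **kwargs,
        )
        _cache[key] = response.choices[0].message.content
    return _cache[key]


# The second call with the same arguments is answered from the cache, not the API
print(cached_completion("qwen3-32b", "What is machine learning?", max_tokens=100))
print(cached_completion("qwen3-32b", "What is machine learning?", max_tokens=100))
```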
Best Practices¶
- API Key Security:
  - Always use environment variables to store API Keys
  - Add `.env` files to `.gitignore`
  - Don't hardcode keys in code
- Evaluation Strategy:
  - Start testing with small-scale datasets
  - Gradually increase evaluation task complexity
  - Save intermediate results regularly
- Error Handling:
  - Implement retry mechanisms to handle network fluctuations (see the retry sketch after this list)
  - Log failed test cases
  - Monitor API quota usage
- Result Comparison:
  - Save historical evaluation results for comparison
  - Use the same random seed to ensure reproducibility
  - Record the model version and configuration used during evaluation (see the metadata sketch after this list)
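For the retry recommendation above, a minimal backoff wrapper might look like the sketch below. It reuses the `client` object configured in the earlier scripts and is not an OpenBench feature:

```python
import time


def call_with_retries(fn, max_attempts=3, base_delay=2.0):
    """Retry a flaky API call with exponential backoff; re-raise after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as e:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({e}); retrying in {delay:.0f}s")
            time.sleep(delay)


# Example: wrap a chat completion call
answer = call_with_retries(
    lambda: client.chat.completions.create(
        model="qwen3-32b",
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=10,
    ).choices[0].message.content
)
print(answer)
```

Similarly, recording the model version and configuration can be as simple as writing a small metadata file next to the results; the field names below are only an illustration:

```python
import json
from datetime import datetime, timezone

# Save the run configuration alongside the results so later runs can be compared fairly
run_metadata = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "model": "qwen3-32b",
    "temperature": 0.0,
    "max_tokens": 2048,
    "benchmarks": ["mmlu", "gsm8k", "hellaswag"],
}

with open("results/run_metadata.json", "w") as f:
    json.dump(run_metadata, f, indent=2)
```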
Example Project Structure¶
```
Evaluation/
├── README.md                    # This document
├── config/
│   ├── xpulink_config.yaml      # XPULink configuration
│   └── benchmarks.yaml          # Benchmark test configuration
├── scripts/
│   ├── test_connection.py       # Connection test script
│   ├── run_evaluation.py        # Evaluation runner
│   └── custom_evaluation.py     # Custom evaluation
└── results/
    └── xpulink_evaluation.json  # Evaluation results
```
Related Resources¶
Technical Support¶
For questions or suggestions, please:

1. Visit the XPULink Official Website
2. Check the OpenBench project's Issue page
3. Submit an Issue in this project
Note: Use your API quota responsibly to avoid unnecessary costs. It is recommended to estimate costs before running large-scale evaluations.