Evaluating XPULink Models with OpenBench¶
This guide will help you use the OpenBench framework to evaluate and test AI models hosted on the XPULink platform. OpenBench is a model evaluation tool that supports multiple standard benchmarks and evaluation metrics.
What is OpenBench?¶
OpenBench is an open-source AI model evaluation framework that supports:

- Multiple standard evaluation benchmarks (MMLU, GSM8K, HellaSwag, etc.)
- Custom evaluation tasks
- An OpenAI-compatible API interface
- Detailed performance reports and analysis
Requirements¶
- Python 3.8+
- XPULink API Key (obtain from www.xpulink.ai)
- OpenBench framework
Installation Steps¶
1. Install OpenBench¶
```bash
# Install OpenBench using pip
pip install openbench

# Or install from source
git clone https://github.com/OpenBMB/OpenBench.git
cd OpenBench
pip install -e .
```
2. Configure Environment Variables¶
Create a .env file or set the following environment variables in your system:
```bash
# XPULink API Key
export XPU_API_KEY=your_api_key_here

# XPULink API Base URL
export OPENAI_API_BASE=https://www.xpulink.ai/v1
```
Or create a .env file in your project directory:
```
XPU_API_KEY=your_api_key_here
OPENAI_API_BASE=https://www.xpulink.ai/v1
```
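Before running any evaluation, it can help to confirm the variables are actually visible to Python. A minimal sanity check, assuming python-dotenv is installed (it is used throughout this guide):

```python
import os

from dotenv import load_dotenv

# Load variables from .env into the process environment
load_dotenv()

# Fail fast if the key is missing
assert os.getenv("XPU_API_KEY"), "XPU_API_KEY is not set"
print("XPU_API_KEY loaded, base URL:", os.getenv("OPENAI_API_BASE"))
```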
Using OpenBench to Test XPULink Models¶
Basic Configuration Example¶
Create a configuration file xpulink_config.yaml:
```yaml
# XPULink model configuration
model:
  type: openai                          # Use the OpenAI-compatible interface
  name: qwen3-32b                       # Model name on XPULink
  api_key: ${XPU_API_KEY}               # Read from environment variable
  base_url: https://www.xpulink.ai/v1   # XPULink API base URL

# Evaluation configuration
evaluation:
  benchmarks:
    - mmlu       # Multi-task language understanding
    - gsm8k      # Mathematical reasoning
    - hellaswag  # Common sense reasoning

# Generation parameters
generation:
  temperature: 0.0  # Deterministic output
  max_tokens: 2048
  top_p: 1.0
```
Python Code Example¶
```python
import os

from dotenv import load_dotenv
from openai import OpenAI

# Load environment variables
load_dotenv()

# Configure an OpenAI-compatible client pointed at XPULink
client = OpenAI(
    api_key=os.getenv("XPU_API_KEY"),
    base_url="https://www.xpulink.ai/v1",
)


def test_xpulink_model():
    """Test whether the XPULink model is accessible."""
    try:
        response = client.chat.completions.create(
            model="qwen3-32b",
            messages=[
                {"role": "user", "content": "Please explain artificial intelligence in one sentence."}
            ],
            max_tokens=100,
            temperature=0.7,
        )
        print("Model responded successfully!")
        print("Response content:", response.choices[0].message.content)
        return True
    except Exception as e:
        print(f"Connection failed: {e}")
        return False


if __name__ == "__main__":
    test_xpulink_model()
```
Running Evaluation¶
Method 1: Using Command Line¶
```bash
# Run a single benchmark
openbench evaluate \
  --model-type openai \
  --model-name qwen3-32b \
  --api-key $XPU_API_KEY \
  --base-url https://www.xpulink.ai/v1 \
  --benchmark mmlu

# Run multiple benchmarks
openbench evaluate \
  --config xpulink_config.yaml \
  --output results/xpulink_evaluation.json
```
Method 2: Using Python Script¶
Create run_evaluation.py:
```python
import os

from dotenv import load_dotenv
from openbench import Evaluator

# Load environment variables
load_dotenv()

# Configure the evaluator
evaluator = Evaluator(
    model_type="openai",
    model_name="qwen3-32b",
    api_key=os.getenv("XPU_API_KEY"),
    base_url="https://www.xpulink.ai/v1",
)

# Run the evaluation
results = evaluator.run_benchmarks([
    "mmlu",       # Multi-task language understanding
    "gsm8k",      # Mathematical reasoning
    "hellaswag",  # Common sense reasoning
])

# Save the results
evaluator.save_results(results, "results/xpulink_evaluation.json")

# Print a summary
print("\nEvaluation Results Summary:")
for benchmark, scores in results.items():
    print(f"{benchmark}: {scores['accuracy']:.2%}")
```
Run the script:
```bash
python run_evaluation.py
```
Advanced Configuration¶
Custom Evaluation Tasks¶
Create custom_evaluation.py:
```python
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

# Configure an OpenAI-compatible client pointed at XPULink
client = OpenAI(
    api_key=os.getenv("XPU_API_KEY"),
    base_url="https://www.xpulink.ai/v1",
)


def evaluate_custom_task(questions, model="qwen3-32b"):
    """Run a custom evaluation task.

    Args:
        questions: List of question dicts with "question" and optional "expected_answer".
        model: Model name.

    Returns:
        List of per-question evaluation results.
    """
    results = []
    for q in questions:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a professional AI assistant."},
                {"role": "user", "content": q["question"]},
            ],
            max_tokens=512,
            temperature=0.0,
        )
        answer = response.choices[0].message.content
        results.append({
            "question": q["question"],
            "model_answer": answer,
            "expected_answer": q.get("expected_answer"),
            "correct": answer.strip() == q.get("expected_answer", "").strip(),
        })
    return results


# Sample question set
questions = [
    {
        "question": "What is machine learning?",
        "expected_answer": "Machine learning is a branch of artificial intelligence that allows computers to automatically improve performance through data and experience.",
    },
    {
        "question": "What type of programming language is Python?",
        "expected_answer": "Python is a high-level, interpreted, object-oriented programming language.",
    },
]

# Run the evaluation
results = evaluate_custom_task(questions)

# Calculate accuracy
accuracy = sum(1 for r in results if r["correct"]) / len(results)
print(f"Accuracy: {accuracy:.2%}")
Batch Testing Multiple Models¶
```python
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

# Configure an OpenAI-compatible client pointed at XPULink
client = OpenAI(
    api_key=os.getenv("XPU_API_KEY"),
    base_url="https://www.xpulink.ai/v1",
)

# Models to test
models_to_test = [
    "qwen3-32b",
    "qwen3-14b",
    "llama3-70b",
]


def benchmark_models(models, test_prompt):
    """Compare the output of multiple models on the same prompt."""
    results = {}
    for model in models:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": test_prompt}],
                max_tokens=100,
                temperature=0.0,
            )
            results[model] = {
                "success": True,
                "response": response.choices[0].message.content,
                "tokens": response.usage.total_tokens,
            }
        except Exception as e:
            results[model] = {
                "success": False,
                "error": str(e),
            }
    return results


# Run the comparison
test_prompt = "Please explain what deep learning is, in no more than 50 words."
comparison = benchmark_models(models_to_test, test_prompt)

# Print the results
for model, result in comparison.items():
    print(f"\nModel: {model}")
    if result["success"]:
        print(f"Response: {result['response']}")
        print(f"Token Usage: {result['tokens']}")
    else:
        print(f"Error: {result['error']}")
```
Supported Evaluation Benchmarks¶
OpenBench supports the following standard evaluation benchmarks for testing XPULink models:
| Benchmark | Description | Evaluates |
|---|---|---|
| MMLU | Massive Multitask Language Understanding | Knowledge breadth, domain expertise |
| GSM8K | Grade School Math | Mathematical reasoning, problem-solving |
| HellaSwag | Common sense reasoning | Common sense understanding, context completion |
| TruthfulQA | Truthful question answering | Factual accuracy, honesty |
| HumanEval | Code generation | Programming capability, code understanding |
| MBPP | Python programming benchmark | Basic programming skills |
Result Analysis¶
After evaluation completes, OpenBench generates detailed reports including:
- Accuracy: Proportion of correctly answered questions
- F1 Score: Harmonic mean of precision and recall
- Inference Time: Average response time per task
- Token Usage: Token consumption of API calls
- Cost Estimation: Cost estimates based on token usage
Example of loading and inspecting the results:

```python
import json

# Load the evaluation results
with open("results/xpulink_evaluation.json", "r") as f:
    results = json.load(f)

# Print the full results
print(json.dumps(results, indent=2, ensure_ascii=False))
```
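The cost estimate in the report is derived from token usage, and the same arithmetic is easy to reproduce yourself. A minimal sketch with a placeholder price (substitute XPULink's actual per-token pricing):

```python
def estimate_cost(total_tokens: int, price_per_million_tokens: float) -> float:
    """Rough cost estimate from token usage; the price argument is a placeholder."""
    return total_tokens / 1_000_000 * price_per_million_tokens


# Example: 250,000 tokens at a hypothetical price of 2.0 (currency units) per million tokens
print(f"Estimated cost: {estimate_cost(250_000, 2.0):.2f}")
```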
FAQ¶
Q: How to obtain XPU_API_KEY?¶
A: Visit www.xpulink.ai to register an account, then create and obtain your API Key from the API key management page in the console.
Q: What if API timeout occurs during evaluation?¶
A: You can increase the client's timeout (in seconds) when constructing it:

```python
# Wait up to 60 seconds per request
client = OpenAI(api_key=os.getenv("XPU_API_KEY"), base_url="https://www.xpulink.ai/v1", timeout=60)
```
Q: How to view detailed evaluation logs?¶
A: Enable verbose logging mode:
```bash
openbench evaluate --config config.yaml --verbose --log-file evaluation.log
```
Q: Which XPULink models are supported?¶
A: OpenBench supports all models on the XPULink platform that are compatible with the OpenAI API format. Common ones include:

- qwen3-32b
- qwen3-14b
- llama3-70b
- deepseek-chat

Please visit the XPULink official documentation for the complete model list.
Q: How to control evaluation costs?¶
A: We recommend the following measures:
1. Test on a small dataset first
2. Use max_tokens to limit generation length
3. Set temperature=0.0 for deterministic output, so runs don't need to be repeated
4. Use a caching mechanism to avoid duplicate calls (see the sketch below)
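A minimal in-memory cache keyed on the model and prompt might look like this; it is a sketch rather than an OpenBench feature and reuses the `client` object configured in the earlier scripts:

```python
# Simple in-memory cache keyed on (model, prompt); it lives only for the process lifetime
_cache = {}


def cached_completion(model, prompt, **kwargs):
    """Return a cached answer when the same model/prompt pair has been seen before."""
    key = (model, prompt)
    if key not in _cache:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            **kwargs,
        )
        _cache[key] = response.choices[0].message.content
    return _cache[key]


# The second call with the same arguments is answered from the cache, not the API
print(cached_completion("qwen3-32b", "What is machine learning?", max_tokens=100))
print(cached_completion("qwen3-32b", "What is machine learning?", max_tokens=100))
```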
Best Practices¶
- API Key Security:
  - Always use environment variables to store API Keys
  - Add `.env` files to `.gitignore`
  - Don't hardcode keys in code
- Evaluation Strategy:
  - Start testing with small-scale datasets
  - Gradually increase evaluation task complexity
  - Save intermediate results regularly
- Error Handling:
  - Implement retry mechanisms to handle network fluctuations (see the retry sketch after this list)
  - Log failed test cases
  - Monitor API quota usage
- Result Comparison:
  - Save historical evaluation results for comparison
  - Use the same random seed to ensure reproducibility
  - Record the model version and configuration used during evaluation (see the metadata sketch after this list)
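For the retry recommendation above, a minimal backoff wrapper might look like the sketch below. It reuses the `client` object configured in the earlier scripts and is not an OpenBench feature:

```python
import time


def call_with_retries(fn, max_attempts=3, base_delay=2.0):
    """Retry a flaky API call with exponential backoff; re-raise after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as e:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({e}); retrying in {delay:.0f}s")
            time.sleep(delay)


# Example: wrap a chat completion call
answer = call_with_retries(
    lambda: client.chat.completions.create(
        model="qwen3-32b",
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=10,
    ).choices[0].message.content
)
print(answer)
```

Similarly, recording the model version and configuration can be as simple as writing a small metadata file next to the results; the field names below are only an illustration:

```python
import json
from datetime import datetime, timezone

# Save the run configuration alongside the results so later runs can be compared fairly
run_metadata = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "model": "qwen3-32b",
    "temperature": 0.0,
    "max_tokens": 2048,
    "benchmarks": ["mmlu", "gsm8k", "hellaswag"],
}

with open("results/run_metadata.json", "w") as f:
    json.dump(run_metadata, f, indent=2)
```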
Example Project Structure¶
```
Evaluation/
├── README.md                    # This document
├── config/
│   ├── xpulink_config.yaml      # XPULink configuration
│   └── benchmarks.yaml          # Benchmark test configuration
├── scripts/
│   ├── test_connection.py       # Connection test script
│   ├── run_evaluation.py        # Evaluation runner
│   └── custom_evaluation.py     # Custom evaluation
└── results/
    └── xpulink_evaluation.json  # Evaluation results
```
Related Resources¶
Technical Support¶
For questions or suggestions, please:

1. Visit the XPULink Official Website
2. Check the OpenBench project's Issue page
3. Submit an Issue in this project
Note: Use your API quota responsibly to avoid unnecessary costs. It is recommended to estimate costs before running large-scale evaluations.