AI Models
Configure the AI models used for chat, embeddings, and other features in ZenSearch.
Overview
ZenSearch provides centralized AI model management with unified usage tracking, per-team cost attribution, and rate limiting across all providers.
The AI Models settings allow you to:
- View available models
- Add new model configurations
- Set default models
- Monitor model usage and costs per team
Accessing Model Settings
- Click Settings in the sidebar
- Select the AI Models tab
Available Models
Model Types
| Type | Purpose |
|---|---|
| Chat | Conversational AI responses |
| Embedding | Document vectorization |
| Reranker | Result reranking |
Supported Providers
| Provider | Models |
|---|---|
| OpenAI | GPT-5 / GPT-5.2 / GPT-5.4 series, GPT-4.1, GPT-4o, GPT-4o-mini |
| Anthropic | Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5, Claude 4.5 and 4.1 snapshots |
| Groq | Llama 3.1, gpt-oss-120b, gpt-oss-20b (fast, low-cost inference) |
| OpenRouter | Unified gateway to 100+ models from OpenAI, Anthropic, Meta, Google, Mistral and more — all behind a single API key. Model IDs use the openrouter/vendor/name form (e.g. openrouter/openai/gpt-4o-mini). No embeddings. |
| Azure AI Foundry | OpenAI-compatible v1 surface hosted on your own Azure tenant. Access to GPT-4o / 4.1 / 5, Llama 3.1 family, Phi-4, Mistral, DeepSeek, Cohere through one endpoint. Model IDs use the azure/<deployment-name> form (e.g. azure/gpt-4o-mini). Supports embeddings natively. |
| Amazon Bedrock | Unified Converse API on your own AWS account: Claude 4.5 / 4 / 3.5 Sonnet + Haiku + Opus, Llama 3.1 / 3.3 / 4, Amazon Nova (Micro / Lite / Pro / Premier), Mistral Large, Cohere Command R/R+, DeepSeek. Native Bedrock model IDs (anthropic.claude-sonnet-4-5-20250929-v1:0) or cross-region inference profiles (us. / eu. / apac. / global. prefix). Supports embeddings (Titan v2, Cohere Embed v3/v4). |
| Cohere | Command-R, Command-R+, rerank-v3 |
| Embeddings | OpenAI text-embedding-3-*, Cohere, Jina, Mixedbread, Qwen3, Azure AI Foundry, Amazon Titan v1/v2, Cohere on Bedrock |
| Self-hosted / Custom | Ollama, LM Studio, vLLM, any OpenAI-compatible endpoint |
ZenSearch Model Aliases
Instead of hardcoding provider-specific names, configure agents with the stable ZenSearch aliases — swap the underlying provider any time without touching your agent configs.
| Alias | Purpose | Default (OpenAI) | Default (Groq) | Default (OpenRouter) | Default (Azure) | Default (Bedrock) |
|---|---|---|---|---|---|---|
| zen-mini | Fast, cheap classification / grading / simple chat | gpt-5.4-mini | llama-3.1-8b-instant | openrouter/openai/gpt-5.4-mini | azure/gpt-4o-mini | us.anthropic.claude-haiku-4-5-20251001-v1:0 |
| zen-agent | Tool-using agent workloads | gpt-5.4 | openai/gpt-oss-120b | openrouter/openai/gpt-5.4 | azure/gpt-4o | us.anthropic.claude-sonnet-4-5-20250929-v1:0 |
| zen-agent-pro | Complex reasoning and synthesis | gpt-5.4 | openai/gpt-oss-120b | openrouter/anthropic/claude-sonnet-4.6 | azure/gpt-4.1 | global.anthropic.claude-opus-4-5-20251101-v1:0 |
| zen-embed | Default embedding model | text-embedding-3-small | n/a (Groq has no embeddings) | n/a (OpenRouter has no embeddings) | azure/text-embedding-3-small | amazon.titan-embed-text-v2:0 |
Switch the active provider at any time by setting ZEN_MODELS_PROVIDER to openai, groq, openrouter, azure, or bedrock. Embeddings are configured independently via ZEN_EMBED_PROVIDER and must use a provider that offers embedding models: OpenAI, Cohere, Jina, Mixedbread, Azure, or Bedrock (Groq and OpenRouter are rejected at startup).
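For example, to run chat and agents on Bedrock while keeping embeddings on OpenAI (a minimal sketch; the variable names are the ones documented above, the values are illustrative):

```
# services/core-api/.env
ZEN_MODELS_PROVIDER=bedrock
ZEN_EMBED_PROVIDER=openai
```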
Using Azure AI Foundry
Azure AI Foundry exposes OpenAI, Llama, Phi, Mistral, DeepSeek, and Cohere models through a single per-tenant OpenAI-compatible v1 surface. To use it:
- In the Azure portal, open your Foundry resource and create deployments for the models you want (e.g. gpt-4o-mini, gpt-4o, text-embedding-3-small). The deployment names become the model IDs you use in ZenSearch.
- Copy the resource URL (https://<resource>.openai.azure.com) and an API key from the Keys and Endpoint page.
- Set AZURE_API_KEY and AZURE_BASE_URL=https://<resource>.openai.azure.com/openai/v1 in services/model-gw/.env.
- In services/core-api/.env, set ZEN_MODELS_PROVIDER=azure and adjust ZEN_MINI_MODEL_AZURE / ZEN_AGENT_MODEL_AZURE / ZEN_AGENT_PRO_MODEL_AZURE if your deployment names don't match the defaults. Example: ZEN_MINI_MODEL_AZURE=azure/my-gpt-4o-mini-deployment.
- Restart core-api and model-gw.
Partner models (Meta Llama, Mistral, Cohere, Anthropic) require an active Azure Marketplace SaaS subscription in addition to the Foundry resource. OpenAI and Microsoft first-party models work without Marketplace.
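Putting the steps together, a minimal sketch of the two env files (resource and deployment names are placeholders):

```
# services/model-gw/.env
AZURE_API_KEY=<key-from-the-portal>
AZURE_BASE_URL=https://<resource>.openai.azure.com/openai/v1

# services/core-api/.env
ZEN_MODELS_PROVIDER=azure
ZEN_MINI_MODEL_AZURE=azure/my-gpt-4o-mini-deployment
```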
Using Amazon Bedrock
Amazon Bedrock runs Claude, Llama, Amazon Nova, Mistral, and Cohere models on your own AWS account through a unified Converse API. To use it:
- In the AWS Bedrock console → Model access, enable each model you plan to use (e.g. Claude Sonnet 4.5, Claude Haiku 4.5, Claude Opus 4.5, Titan Embed v2). This is a one-time per-account opt-in. Partner models (Meta, Mistral, Cohere) require Marketplace agreement on top of enablement.
- Option A — Bedrock API key (simpler onboarding, GA July 2025): In the Bedrock console → API keys, generate a short-term or long-term key. Set AWS_BEARER_TOKEN_BEDROCK=<key> in services/model-gw/.env. Long-term keys default to 30-day expiry; short-term keys inherit the creating principal's IAM permissions and are recommended for production.
- Option B — AWS credentials chain: Leave AWS_BEARER_TOKEN_BEDROCK empty and configure IAM credentials the standard way — env vars (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY), shared credentials file, IAM role, IRSA, or EC2 instance metadata. The SDK discovers all of these automatically.
- Set AWS_REGION in services/model-gw/.env — mandatory, since Bedrock is strictly region-scoped. Common choices: us-east-1, us-west-2, eu-central-1, ap-northeast-1.
- In services/core-api/.env, set ZEN_MODELS_PROVIDER=bedrock. The defaults use cross-region inference profiles (us. / global. prefixed IDs) for the Claude 4.5 family because base IDs are only enabled in a subset of regions. Override with ZEN_MINI_MODEL_BEDROCK / ZEN_AGENT_MODEL_BEDROCK / ZEN_AGENT_PRO_MODEL_BEDROCK if you need a specific base ID or a different model family.
- Restart core-api and model-gw.
Gotchas:
- AccessDeniedException usually means the model isn't enabled for your account in the region you're hitting. Open the Bedrock console → Model access and check.
- Newer Claude models (4.5 Sonnet/Haiku/Opus) are only addressable via cross-region inference profiles — there are no base model IDs. If you paste an anthropic.claude-sonnet-4-5... ID without a us. / global. / etc. prefix you'll get a ValidationException.
- Embeddings cannot use inference profiles — always use a base model ID like amazon.titan-embed-text-v2:0. The embedding client rejects us. / eu. / etc. prefixes at construction time with a clear error.
- Per-request cost tracking is disabled by default for Bedrock (the rate card varies by model, region, and inference-profile geo). Set CostPerInputToken / CostPerOutputToken on the model DB row if you need accurate billing attribution.
Context Windows
Each model has a context window — the maximum amount of information it can process at once. This affects how much conversation history and source documents can be included.
| Model | Context Window | Best For |
|---|---|---|
| Claude Opus 4.6 | 200,000 tokens | Most intelligent, complex agents and coding |
| Claude Sonnet 4.6 | 200,000 tokens | Best speed/intelligence balance |
| Claude Haiku 4.5 | 200,000 tokens | Fastest Claude, near-frontier intelligence |
| GPT-5.4 / GPT-5 | 400,000 tokens | Frontier reasoning and long-context analysis |
| GPT-4.1 | 1,000,000 tokens | Very long documents and codebases |
| GPT-4o | 128,000 tokens | General-purpose chat and agents |
| GPT-4o-mini | 128,000 tokens | Fast, low-cost classification and grading |
| Llama 3.1 (Groq) | 128,000 tokens | Low-latency zen-mini workloads |
| Command-R+ | 128,000 tokens | RAG-optimized responses |
Tip: Choose models with larger context windows when working with:
- Long documents
- Extended conversations
- Multiple source documents
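Token counts are hard to eyeball. One way to estimate whether a document fits is to count tokens locally, e.g. with tiktoken (a sketch; o200k_base is the tokenizer used by recent OpenAI models, and other providers tokenize differently, so treat the number as an estimate):

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o-era tokenizer
text = open("report.txt").read()           # hypothetical input document
tokens = len(enc.encode(text))
print(f"{tokens} tokens; fits GPT-4o's 128K window: {tokens < 128_000}")
```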
Adding Models
Add a New Model
- Click Add Model
- Select the provider
- Choose the model
- Enter API key (if required)
- Click Add
Configuration Fields
| Field | Description |
|---|---|
| Provider | Model provider (OpenAI, Anthropic, etc.) |
| Model | Specific model name |
| API Key | Provider API key |
| Endpoint | Custom endpoint URL (if applicable) |
Default Models
Setting Defaults
Set default models for each use case:
- Find the model in the list
- Click Set as Default
- Select the use case (Chat, Embedding)
Default Assignment
| Use Case | Recommendation |
|---|---|
| Chat | zen-agent (GPT-5.4 or Claude Sonnet 4.6 via OpenRouter) |
| Fast chat / grading | zen-mini (GPT-5.4-mini or Llama 3.1 via Groq) |
| Complex reasoning | zen-agent-pro (GPT-5.4 or Claude Opus 4.5 via Bedrock) |
| Embedding | zen-embed (text-embedding-3-small) |
| Reranker | Cohere rerank-v3 or the bundled cross-encoder reranker service |
Reliability & Cost Controls
The Model Gateway sits between ZenSearch services and every provider, adding production-grade reliability features on top of the raw APIs.
Provider Fallback & Circuit Breaker
When a primary provider starts returning errors or timing out, the gateway automatically routes subsequent calls to a configured fallback provider. A circuit breaker tracks consecutive failures per provider — once a threshold is exceeded, the provider is marked unhealthy and skipped until it recovers. This prevents cascading failures from knocking chat and agents offline when a single upstream has an incident.
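The thresholds live in the gateway configuration; as an illustration of the pattern (not the gateway's actual code), a minimal per-provider breaker looks like this:

```python
import time

class CircuitBreaker:
    """Sketch: trip after N consecutive failures, probe again after a cooldown."""

    def __init__(self, threshold: int = 5, cooldown_s: float = 30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # set when the breaker trips

    def available(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: allow a probe request once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures, self.opened_at = 0, None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()
```

Conceptually, the gateway keeps one such breaker per provider and routes each request to the first provider whose breaker reports available().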
Smart Model Routing
The gateway inspects incoming requests destined for zen-agent or zen-agent-pro and downgrades simple prompts (short, single-turn, no tool use) to zen-mini. Complex queries continue to use the stronger model. This cuts token costs on workloads where a fraction of traffic is trivial, without forcing developers to make per-request routing decisions.
Enabled by default. Disable with SMART_ROUTING_ENABLED=false on the Model Gateway if you need consistent model selection (e.g. for benchmarking or reproducibility).
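The criteria are the ones named above (short, single-turn, no tool use); a sketch of the decision, with illustrative thresholds:

```python
def route(alias: str, messages: list[dict], tools: list | None) -> str:
    """Sketch: downgrade trivial agent requests to zen-mini (thresholds illustrative)."""
    if alias not in ("zen-agent", "zen-agent-pro"):
        return alias
    trivial = (
        not tools                                      # no tool use
        and len(messages) == 1                         # single-turn
        and len(messages[0].get("content", "")) < 500  # short prompt
    )
    return "zen-mini" if trivial else alias
```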
Auto-Retry with Countdown UI
When a provider returns a transient "unavailable" error, the chat UI surfaces a retry countdown rather than a hard error. The request is retried automatically with exponential backoff, and rich error metadata is streamed back so operators can see which provider failed and why.
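The retry loop is a standard exponential-backoff pattern; a sketch (TransientProviderError and notify_ui_countdown are hypothetical stand-ins for the gateway's error type and the event that drives the countdown UI):

```python
import random
import time

def call_with_retry(fn, max_attempts: int = 4):
    """Sketch: exponential backoff with jitter on transient provider errors."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientProviderError as err:       # hypothetical error type
            if attempt == max_attempts - 1:
                raise
            delay = 2 ** attempt + random.random()  # 1s, 2s, 4s... plus jitter
            notify_ui_countdown(delay, err)         # hypothetical hook: drives the countdown
            time.sleep(delay)
```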
Structured Output Self-Correction
When an agent asks for JSON output and the response fails schema validation, the gateway automatically feeds the validation error back to the model and asks it to repair the output. This runs up to AGENT_STRUCTURED_OUTPUT_MAX_RETRIES times (default: 2) before giving up. Invisible to the caller — you just get valid JSON or a final error.
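The repair loop is simple in shape; a sketch assuming a Pydantic-style schema (Verdict and llm_call are hypothetical, and the real retry count comes from AGENT_STRUCTURED_OUTPUT_MAX_RETRIES):

```python
from pydantic import BaseModel, ValidationError

class Verdict(BaseModel):  # hypothetical example schema
    label: str
    confidence: float

def get_structured(llm_call, prompt: str, max_retries: int = 2) -> Verdict:
    """Sketch: feed validation errors back to the model until the JSON parses."""
    for _ in range(max_retries + 1):
        raw = llm_call(prompt)
        try:
            return Verdict.model_validate_json(raw)
        except ValidationError as err:
            prompt = (
                f"{prompt}\n\nYour previous output failed validation:\n{err}\n"
                "Return corrected JSON only."
            )
    raise ValueError("structured output still invalid after retries")
```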
Prompt Caching
System prompts and tool definitions are cached per provider — Anthropic via explicit cache_control markers (90% discount on cached reads), OpenAI and Groq via automatic prefix caching (50% discount). Cache usage is tracked per team and per model so you can see how much of your spend is cached.
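For reference, this is roughly what the explicit Anthropic marker looks like in a raw API call; the gateway adds it for you (LONG_SYSTEM_PROMPT is a placeholder, and the model ID is illustrative):

```python
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model ID
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,              # the stable prefix worth caching
        "cache_control": {"type": "ephemeral"},  # mark the prefix as cacheable
    }],
    messages=[{"role": "user", "content": "What changed in Q3?"}],
)
print(response.usage.cache_read_input_tokens)  # nonzero on cache hits
```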
Model Usage
Viewing Usage
Navigate to the Model Usage tab to see:
- Tokens consumed per model
- Cost breakdown
- Usage over time
- Per-team breakdown
Usage Metrics
| Metric | Description |
|---|---|
| Input Tokens | Tokens sent to model |
| Output Tokens | Tokens received from model |
| Total Cost | Estimated cost |
| Request Count | Number of API calls |
Testing Models
Test Connection
Before saving, test the model:
- Click Test Connection
- Wait for verification
- Check for errors
Test Results
| Result | Meaning |
|---|---|
| Success | Model is accessible |
| Auth Error | API key is invalid |
| Network Error | Cannot reach endpoint |
| Model Error | Model not available |
Custom Endpoints
OpenAI-Compatible APIs
For local or self-hosted models:
Provider: Custom
Endpoint: http://localhost:8000/v1
Model: local-llama
API Key: (optional)
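A quick way to sanity-check such an endpoint outside ZenSearch is the standard openai Python client pointed at the local base URL (a sketch; most local servers ignore the API key, but the client requires a non-empty string):

```python
from openai import OpenAI

# Point the stock OpenAI client at the self-hosted endpoint configured above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="local-llama",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)
```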
Supported Endpoints
- Ollama
- LM Studio
- vLLM
- Text Generation Inference
Recommended Local Models (Ollama)
For the Developer Edition installer's "Local Setup" option, ZenSearch picks chat and embedding models sized against both your available RAM and GPU VRAM, so the Ollama runtime leaves room for Docker and the ZenSearch stack. The April 2026 default chat family is qwen3.5, which has confirmed tools, thinking, and vision support across the full size ladder. The installer also creates a custom zensearch-chat Ollama tag that wraps the picked base model with a tier-appropriate num_ctx (8K / 16K / 32K), bypassing Ollama's 4096-token default, which would otherwise truncate the agent's tool definitions and history.
GPU-first ladder (when ≥ 8 GB VRAM detected):
| GPU VRAM | Chat | Context | Embedding |
|---|---|---|---|
| ≥ 48 GB | qwen3.5:35b | 32K | mxbai-embed-large |
| ≥ 24 GB | qwen3.5:27b | 16K | mxbai-embed-large |
| ≥ 16 GB | qwen3.5:9b | 32K | mxbai-embed-large |
| ≥ 12 GB | qwen3.5:9b | 16K | mxbai-embed-large |
| ≥ 8 GB | qwen3.5:4b | 16K | nomic-embed-text |
RAM-only ladder (no GPU / Apple Silicon):
| Total RAM | Chat | Context | Embedding |
|---|---|---|---|
| ≥ 64 GB | qwen3.5:27b | 16K | mxbai-embed-large |
| 32 – 64 GB | qwen3.5:9b | 16K | mxbai-embed-large |
| 16 – 32 GB | qwen3.5:4b | 16K | nomic-embed-text |
| 8 – 16 GB | qwen3.5:4b | 8K | nomic-embed-text |
| < 8 GB | qwen3.5:2b | 8K | granite-embedding:30m |
These tiers assume the full ZenSearch stack is running on the same host. If you're pointing ZenSearch at a dedicated Ollama box you can safely run a larger model — set LLM_CHAT_MODEL / LLM_EMBED_MODEL in .env to override. See the self-hosting guide for the full sizing rationale and notes on the zensearch-chat tag.
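Conceptually, the zensearch-chat tag is just a Modelfile that pins num_ctx on the picked base model; a hand-rolled equivalent for the 9b / 16K tier would look like this (a sketch, not the installer's exact output):

```
# Modelfile: wrap the base model with a tier-appropriate context window
FROM qwen3.5:9b
PARAMETER num_ctx 16384
```

Build it with ollama create zensearch-chat -f Modelfile.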
Best Practices
Model Selection
- Use GPT-4o or Claude for complex queries
- Use faster models for simple tasks
- Consider cost vs. quality tradeoffs
- Test models before production use
API Key Security
- Never share API keys
- Rotate keys periodically
- Use separate keys per environment
- Monitor for unauthorized usage
Troubleshooting
Model Not Responding
- Verify API key is valid
- Check provider status page
- Test connection in settings
- Review rate limits
High Costs
- Review model usage dashboard
- Consider using smaller models
- Optimize query complexity
- Set usage limits
Next Steps
- Guardrails - Configure safety features
- API Keys - Manage API access