AI Models
Configure the AI models used for chat, embeddings, and other features in ZenSearch.
Overview
ZenSearch provides centralized AI model management with unified usage tracking, per-team cost attribution, and rate limiting across all providers.
The AI Models settings allow you to:
- View available models
- Add new model configurations
- Set default models
- Monitor model usage and costs per team
Accessing Model Settings
- Click Settings in the sidebar
- Select the AI Models tab
Available Models
Model Types
| Type | Purpose |
|---|---|
| Chat | Conversational AI responses |
| Embedding | Document vectorization |
| Reranker | Result reranking |
Supported Providers
| Provider | Models |
|---|---|
| OpenAI | GPT-5 / GPT-5.2 / GPT-5.4 series, GPT-4.1, GPT-4o, GPT-4o-mini |
| Anthropic | Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5, Claude 4.5 and 4.1 snapshots |
| Groq | Llama 3.1, gpt-oss-120b, gpt-oss-20b (fast, low-cost inference) |
| OpenRouter | Unified gateway to 100+ models from OpenAI, Anthropic, Meta, Google, Mistral and more — all behind a single API key. Model IDs use the openrouter/vendor/name form (e.g. openrouter/openai/gpt-4o-mini). No embeddings. |
| Azure AI Foundry | OpenAI-compatible v1 surface hosted on your own Azure tenant. Access to GPT-4o / 4.1 / 5, Llama 3.1 family, Phi-4, Mistral, DeepSeek, Cohere through one endpoint. Model IDs use the azure/<deployment-name> form (e.g. azure/gpt-4o-mini). Supports embeddings natively. |
| Amazon Bedrock | Unified Converse API on your own AWS account: Claude 4.5 / 4 / 3.5 Sonnet + Haiku + Opus, Llama 3.1 / 3.3 / 4, Amazon Nova (Micro / Lite / Pro / Premier), Mistral Large, Cohere Command R/R+, DeepSeek. Native Bedrock model IDs (anthropic.claude-sonnet-4-5-20250929-v1:0) or cross-region inference profiles (us. / eu. / apac. / global. prefix). Supports embeddings (Titan v2, Cohere Embed v3/v4). |
| Cohere | Command-R, Command-R+, rerank-v3 |
| Embeddings | OpenAI text-embedding-3-*, Cohere, Jina, Mixedbread, Qwen3, Azure AI Foundry, Amazon Titan v1/v2, Cohere on Bedrock |
| Self-hosted / Custom | Ollama, LM Studio, vLLM, any OpenAI-compatible endpoint |
ZenSearch Model Aliases
Instead of hardcoding provider-specific names, configure agents with the stable ZenSearch aliases — swap the underlying provider any time without touching your agent configs.
| Alias | Purpose | Default (OpenAI) | Default (Groq) | Default (OpenRouter) | Default (Azure) | Default (Bedrock) |
|---|---|---|---|---|---|---|
| zen-mini | Fast, cheap classification / grading / simple chat | gpt-5.4-mini | llama-3.1-8b-instant | openrouter/openai/gpt-5.4-mini | azure/gpt-4o-mini | us.anthropic.claude-haiku-4-5-20251001-v1:0 |
| zen-agent | Tool-using agent workloads | gpt-5.4 | openai/gpt-oss-120b | openrouter/openai/gpt-5.4 | azure/gpt-4o | us.anthropic.claude-sonnet-4-5-20250929-v1:0 |
| zen-agent-pro | Complex reasoning and synthesis | gpt-5.4 | openai/gpt-oss-120b | openrouter/anthropic/claude-sonnet-4.6 | azure/gpt-4.1 | global.anthropic.claude-opus-4-5-20251101-v1:0 |
| zen-embed | Default embedding model | text-embedding-3-small | n/a (Groq has no embeddings) | n/a (OpenRouter has no embeddings) | azure/text-embedding-3-small | amazon.titan-embed-text-v2:0 |
Switch the active provider at any time by setting ZEN_MODELS_PROVIDER to openai, groq, openrouter, azure, or bedrock. Embeddings are configured independently via ZEN_EMBED_PROVIDER and must use a provider that offers embedding models: OpenAI, Cohere, Jina, Mixedbread, Azure, or Bedrock (Groq and OpenRouter are rejected at startup).
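For example, to run chat and agents on Bedrock while keeping embeddings on OpenAI (a minimal sketch; the variable names are the ones documented above, the values are illustrative):

```
# services/core-api/.env
ZEN_MODELS_PROVIDER=bedrock
ZEN_EMBED_PROVIDER=openai
```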
Using Azure AI Foundry
Azure AI Foundry exposes OpenAI, Llama, Phi, Mistral, DeepSeek, and Cohere models through a single per-tenant OpenAI-compatible v1 surface. To use it:
- In the Azure portal, open your Foundry resource and create deployments for the models you want (e.g. gpt-4o-mini, gpt-4o, text-embedding-3-small). The deployment names become the model IDs you use in ZenSearch.
- Copy the resource URL (https://<resource>.openai.azure.com) and an API key from the Keys and Endpoint page.
- Set AZURE_API_KEY and AZURE_BASE_URL=https://<resource>.openai.azure.com/openai/v1 in services/model-gw/.env.
- In services/core-api/.env, set ZEN_MODELS_PROVIDER=azure and adjust ZEN_MINI_MODEL_AZURE / ZEN_AGENT_MODEL_AZURE / ZEN_AGENT_PRO_MODEL_AZURE if your deployment names don't match the defaults. Example: ZEN_MINI_MODEL_AZURE=azure/my-gpt-4o-mini-deployment.
- Restart core-api and model-gw.
Partner models (Meta Llama, Mistral, Cohere, Anthropic) require an active Azure Marketplace SaaS subscription in addition to the Foundry resource. OpenAI and Microsoft first-party models work without Marketplace.
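Putting the steps together, a minimal sketch of the two env files (resource and deployment names are placeholders):

```
# services/model-gw/.env
AZURE_API_KEY=<key-from-the-portal>
AZURE_BASE_URL=https://<resource>.openai.azure.com/openai/v1

# services/core-api/.env
ZEN_MODELS_PROVIDER=azure
ZEN_MINI_MODEL_AZURE=azure/my-gpt-4o-mini-deployment
```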
Using Amazon Bedrock
Amazon Bedrock runs Claude, Llama, Amazon Nova, Mistral, and Cohere models on your own AWS account through a unified Converse API. To use it:
- In the AWS Bedrock console → Model access, enable each model you plan to use (e.g. Claude Sonnet 4.5, Claude Haiku 4.5, Claude Opus 4.5, Titan Embed v2). This is a one-time per-account opt-in. Partner models (Meta, Mistral, Cohere) require Marketplace agreement on top of enablement.
- Option A — Bedrock API key (simpler onboarding, GA July 2025): In the Bedrock console → API keys, generate a short-term or long-term key. Set AWS_BEARER_TOKEN_BEDROCK=<key> in services/model-gw/.env. Long-term keys default to 30-day expiry; short-term keys inherit the creating principal's IAM permissions and are recommended for production.
- Option B — AWS credentials chain: Leave AWS_BEARER_TOKEN_BEDROCK empty and configure IAM credentials the standard way — env vars (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY), shared credentials file, IAM role, IRSA, or EC2 instance metadata. The SDK discovers all of these automatically.
- Set AWS_REGION in services/model-gw/.env — mandatory, since Bedrock is strictly region-scoped. Common choices: us-east-1, us-west-2, eu-central-1, ap-northeast-1.
- In services/core-api/.env, set ZEN_MODELS_PROVIDER=bedrock. The defaults use cross-region inference profiles (us. / global. prefixed IDs) for the Claude 4.5 family because base IDs are only enabled in a subset of regions. Override with ZEN_MINI_MODEL_BEDROCK / ZEN_AGENT_MODEL_BEDROCK / ZEN_AGENT_PRO_MODEL_BEDROCK if you need a specific base ID or a different model family.
- Restart core-api and model-gw.
Gotchas:
- AccessDeniedException usually means the model isn't enabled for your account in the region you're hitting. Open the Bedrock console → Model access and check.
- Newer Claude models (4.5 Sonnet/Haiku/Opus) are only addressable via cross-region inference profiles — there are no base model IDs. If you paste an anthropic.claude-sonnet-4-5... ID without a us. / global. / etc. prefix you'll get a ValidationException.
- Embeddings cannot use inference profiles — always use a base model ID like amazon.titan-embed-text-v2:0. The embedding client rejects us. / eu. / etc. prefixes at construction time with a clear error.
- Per-request cost tracking is disabled by default for Bedrock (the rate card varies by model, region, and inference-profile geo). Set CostPerInputToken / CostPerOutputToken on the model DB row if you need accurate billing attribution.
Context Windows
Each model has a context window — the maximum amount of information it can process at once. This affects how much conversation history and source documents can be included.
| Model | Context Window | Best For |
|---|---|---|
| Claude Opus 4.6 | 200,000 tokens | Most intelligent, complex agents and coding |
| Claude Sonnet 4.6 | 200,000 tokens | Best speed/intelligence balance |
| Claude Haiku 4.5 | 200,000 tokens | Fastest Claude, near-frontier intelligence |
| GPT-5.4 / GPT-5 | 400,000 tokens | Frontier reasoning and long-context analysis |
| GPT-4.1 | 1,000,000 tokens | Very long documents and codebases |
| GPT-4o | 128,000 tokens | General-purpose chat and agents |
| GPT-4o-mini | 128,000 tokens | Fast, low-cost classification and grading |
| Llama 3.1 (Groq) | 128,000 tokens | Low-latency zen-mini workloads |
| Command-R+ | 128,000 tokens | RAG-optimized responses |
Tip: Choose models with larger context windows when working with:
- Long documents
- Extended conversations
- Multiple source documents
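Token counts are hard to eyeball. One way to estimate whether a document fits is to count tokens locally, e.g. with tiktoken (a sketch; o200k_base is the tokenizer used by recent OpenAI models, and other providers tokenize differently, so treat the number as an estimate):

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o-era tokenizer
text = open("report.txt").read()           # hypothetical input document
tokens = len(enc.encode(text))
print(f"{tokens} tokens; fits GPT-4o's 128K window: {tokens < 128_000}")
```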
Adding Models
Add a New Model
- Click Add Model
- Select the provider
- Choose the model
- Enter API key (if required)
- Click Add
Configuration Fields
| Field | Description |
|---|---|
| Provider | Model provider (OpenAI, Anthropic, etc.) |
| Model | Specific model name |
| API Key | Provider API key |
| Endpoint | Custom endpoint URL (if applicable) |
Default Models
Setting Defaults
Set default models for each use case:
- Find the model in the list
- Click Set as Default
- Select the use case (Chat, Embedding)
Default Assignment
| Use Case | Recommendation |
|---|---|
| Chat | zen-agent (GPT-5.4 or Claude Sonnet 4.6 via OpenRouter) |
| Fast chat / grading | zen-mini (GPT-5.4-mini or Llama 3.1 via Groq) |
| Complex reasoning | zen-agent-pro (GPT-5.4 or Claude Opus 4.5 via Bedrock) |
| Embedding | zen-embed (text-embedding-3-small) |
| Reranker | Cohere rerank-v3 or the bundled cross-encoder reranker service |
Reliability & Cost Controls
The Model Gateway sits between ZenSearch services and every provider, adding production-grade reliability features on top of the raw APIs.
Provider Fallback & Circuit Breaker
When a primary provider starts returning errors or timing out, the gateway automatically routes subsequent calls to a configured fallback provider. A circuit breaker tracks consecutive failures per provider — once a threshold is exceeded, the provider is marked unhealthy and skipped until it recovers. This prevents cascading failures from knocking chat and agents offline when a single upstream has an incident.
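The thresholds live in the gateway configuration; as an illustration of the pattern (not the gateway's actual code), a minimal per-provider breaker looks like this:

```python
import time

class CircuitBreaker:
    """Sketch: trip after N consecutive failures, probe again after a cooldown."""

    def __init__(self, threshold: int = 5, cooldown_s: float = 30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # set when the breaker trips

    def available(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: allow a probe request once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures, self.opened_at = 0, None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()
```

Conceptually, the gateway keeps one such breaker per provider and routes each request to the first provider whose breaker reports available().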
Smart Model Routing
The gateway inspects incoming requests destined for zen-agent or zen-agent-pro and downgrades simple prompts (short, single-turn, no tool use) to zen-mini. Complex queries continue to use the stronger model. This cuts token costs on workloads where a fraction of traffic is trivial, without forcing developers to make per-request routing decisions.
Enabled by default. Disable with SMART_ROUTING_ENABLED=false on the Model Gateway if you need consistent model selection (e.g. for benchmarking or reproducibility).
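The criteria are the ones named above (short, single-turn, no tool use); a sketch of the decision, with illustrative thresholds:

```python
def route(alias: str, messages: list[dict], tools: list | None) -> str:
    """Sketch: downgrade trivial agent requests to zen-mini (thresholds illustrative)."""
    if alias not in ("zen-agent", "zen-agent-pro"):
        return alias
    trivial = (
        not tools                                      # no tool use
        and len(messages) == 1                         # single-turn
        and len(messages[0].get("content", "")) < 500  # short prompt
    )
    return "zen-mini" if trivial else alias
```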
Auto-Retry with Countdown UI
When a provider returns a transient "unavailable" error, the chat UI surfaces a retry countdown rather than a hard error. The request is retried automatically with exponential backoff, and rich error metadata is streamed back so operators can see which provider failed and why.
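The retry loop is a standard exponential-backoff pattern; a sketch (TransientProviderError and notify_ui_countdown are hypothetical stand-ins for the gateway's error type and the event that drives the countdown UI):

```python
import random
import time

def call_with_retry(fn, max_attempts: int = 4):
    """Sketch: exponential backoff with jitter on transient provider errors."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientProviderError as err:       # hypothetical error type
            if attempt == max_attempts - 1:
                raise
            delay = 2 ** attempt + random.random()  # 1s, 2s, 4s... plus jitter
            notify_ui_countdown(delay, err)         # hypothetical hook: drives the countdown
            time.sleep(delay)
```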
Structured Output Self-Correction
When an agent asks for JSON output and the response fails schema validation, the gateway automatically feeds the validation error back to the model and asks it to repair the output. This runs up to AGENT_STRUCTURED_OUTPUT_MAX_RETRIES times (default: 2) before giving up. Invisible to the caller — you just get valid JSON or a final error.
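The repair loop is simple in shape; a sketch assuming a Pydantic-style schema (Verdict and llm_call are hypothetical, and the real retry count comes from AGENT_STRUCTURED_OUTPUT_MAX_RETRIES):

```python
from pydantic import BaseModel, ValidationError

class Verdict(BaseModel):  # hypothetical example schema
    label: str
    confidence: float

def get_structured(llm_call, prompt: str, max_retries: int = 2) -> Verdict:
    """Sketch: feed validation errors back to the model until the JSON parses."""
    for _ in range(max_retries + 1):
        raw = llm_call(prompt)
        try:
            return Verdict.model_validate_json(raw)
        except ValidationError as err:
            prompt = (
                f"{prompt}\n\nYour previous output failed validation:\n{err}\n"
                "Return corrected JSON only."
            )
    raise ValueError("structured output still invalid after retries")
```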
Prompt Caching
System prompts and tool definitions are cached per provider — Anthropic via explicit cache_control markers (90% discount on cached reads), OpenAI and Groq via automatic prefix caching (50% discount). Cache usage is tracked per team and per model so you can see how much of your spend is cached.
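For reference, this is roughly what the explicit Anthropic marker looks like in a raw API call; the gateway adds it for you (LONG_SYSTEM_PROMPT is a placeholder, and the model ID is illustrative):

```python
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model ID
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,              # the stable prefix worth caching
        "cache_control": {"type": "ephemeral"},  # mark the prefix as cacheable
    }],
    messages=[{"role": "user", "content": "What changed in Q3?"}],
)
print(response.usage.cache_read_input_tokens)  # nonzero on cache hits
```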
Model Usage
Viewing Usage
Navigate to the Model Usage tab to see:
- Tokens consumed per model
- Cost breakdown
- Usage over time
- Per-team breakdown
Usage Metrics
| Metric | Description |
|---|---|
| Input Tokens | Tokens sent to model |
| Output Tokens | Tokens received from model |
| Total Cost | Estimated cost |
| Request Count | Number of API calls |
Testing Models
Test Connection
Before saving, test the model:
- Click Test Connection
- Wait for verification
- Check for errors
Test Results
| Result | Meaning |
|---|---|
| Success | Model is accessible |
| Auth Error | API key is invalid |
| Network Error | Cannot reach endpoint |
| Model Error | Model not available |
Custom Endpoints
OpenAI-Compatible APIs
For local or self-hosted models:
Provider: Custom
Endpoint: http://localhost:8000/v1
Model: local-llama
API Key: (optional)
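A quick way to sanity-check such an endpoint outside ZenSearch is the standard openai Python client pointed at the local base URL (a sketch; most local servers ignore the API key, but the client requires a non-empty string):

```python
from openai import OpenAI

# Point the stock OpenAI client at the self-hosted endpoint configured above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="local-llama",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)
```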
Supported Endpoints
- Ollama
- LM Studio
- vLLM
- Text Generation Inference
Recommended Local Models (Ollama)
For the Developer Edition installer's "Local Setup" option, ZenSearch picks chat and embedding models sized against both your available RAM and GPU VRAM, so the Ollama runtime leaves room for Docker and the ZenSearch stack. The April 2026 default chat family is qwen3.5, which has confirmed tools, thinking, and vision support across the full size ladder. The installer also creates a custom zensearch-chat Ollama tag that wraps the picked base model with a tier-appropriate num_ctx (8K / 16K / 32K), bypassing Ollama's 4096-token default, which would otherwise truncate the agent's tool definitions and history.
GPU-first ladder (when ≥ 8 GB VRAM detected):
| GPU VRAM | Chat | Context | Embedding |
|---|---|---|---|
| ≥ 48 GB | qwen3.5:35b | 32K | mxbai-embed-large |
| ≥ 24 GB | qwen3.5:27b | 16K | mxbai-embed-large |
| ≥ 16 GB | qwen3.5:9b | 32K | mxbai-embed-large |
| ≥ 12 GB | qwen3.5:9b | 16K | mxbai-embed-large |
| ≥ 8 GB | qwen3.5:4b | 16K | nomic-embed-text |
RAM-only ladder (no GPU / Apple Silicon):
| Total RAM | Chat | Context | Embedding |
|---|---|---|---|
| ≥ 64 GB | qwen3.5:27b | 16K | mxbai-embed-large |
| 32 – 64 GB | qwen3.5:9b | 16K | mxbai-embed-large |
| 16 – 32 GB | qwen3.5:4b | 16K | nomic-embed-text |
| 8 – 16 GB | qwen3.5:4b | 8K | nomic-embed-text |
| < 8 GB | qwen3.5:2b | 8K | granite-embedding:30m |
These tiers assume the full ZenSearch stack is running on the same host. If you're pointing ZenSearch at a dedicated Ollama box you can safely run a larger model — set LLM_CHAT_MODEL / LLM_EMBED_MODEL in .env to override. See the self-hosting guide for the full sizing rationale and notes on the zensearch-chat tag.
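Conceptually, the zensearch-chat tag is just a Modelfile that pins num_ctx on the picked base model; a hand-rolled equivalent for the 9b / 16K tier would look like this (a sketch, not the installer's exact output):

```
# Modelfile: wrap the base model with a tier-appropriate context window
FROM qwen3.5:9b
PARAMETER num_ctx 16384
```

Build it with ollama create zensearch-chat -f Modelfile.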
Best Practices
Model Selection
- Use GPT-4o or Claude for complex queries
- Use faster models for simple tasks
- Consider cost vs. quality tradeoffs
- Test models before production use
API Key Security
- Never share API keys
- Rotate keys periodically
- Use separate keys per environment
- Monitor for unauthorized usage
Troubleshooting
Model Not Responding
- Verify API key is valid
- Check provider status page
- Test connection in settings
- Review rate limits
High Costs
- Review model usage dashboard
- Consider using smaller models
- Optimize query complexity
- Set usage limits
Next Steps
- Guardrails - Configure safety features
- API Keys - Manage API access