
AI Models

Configure the AI models used for chat, embeddings, and other features in ZenSearch.

Overview

ZenSearch provides centralized AI model management with unified usage tracking, per-team cost attribution, and rate limiting across all providers.

The AI Models settings allow you to:

  • View available models
  • Add new model configurations
  • Set default models
  • Monitor model usage and costs per team

Accessing Model Settings

  1. Click Settings in the sidebar
  2. Select the AI Models tab

Available Models

Model Types

Type | Purpose
Chat | Conversational AI responses
Embedding | Document vectorization
Reranker | Result reranking

Supported Providers

Provider | Models
OpenAI | GPT-5 / GPT-5.2 / GPT-5.4 series, GPT-4.1, GPT-4o, GPT-4o-mini
Anthropic | Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5, Claude 4.5 and 4.1 snapshots
Groq | Llama 3.1, gpt-oss-120b, gpt-oss-20b (fast, low-cost inference)
OpenRouter | Unified gateway to 100+ models from OpenAI, Anthropic, Meta, Google, Mistral, and more, all behind a single API key. Model IDs use the openrouter/vendor/name form (e.g. openrouter/openai/gpt-4o-mini). No embeddings.
Azure AI Foundry | OpenAI-compatible v1 surface hosted on your own Azure tenant. Access to GPT-4o / 4.1 / 5, the Llama 3.1 family, Phi-4, Mistral, DeepSeek, and Cohere through one endpoint. Model IDs use the azure/<deployment-name> form (e.g. azure/gpt-4o-mini). Supports embeddings natively.
Amazon Bedrock | Unified Converse API on your own AWS account: Claude 4.5 / 4 / 3.5 (Sonnet, Haiku, Opus), Llama 3.1 / 3.3 / 4, Amazon Nova (Micro / Lite / Pro / Premier), Mistral Large, Cohere Command R/R+, DeepSeek. Use native Bedrock model IDs (anthropic.claude-sonnet-4-5-20250929-v1:0) or cross-region inference profiles (us. / eu. / apac. / global. prefix). Supports embeddings (Titan v2, Cohere Embed v3/v4).
Cohere | Command-R, Command-R+, rerank-v3
Embeddings | OpenAI text-embedding-3-*, Cohere, Jina, Mixedbread, Qwen3, Azure AI Foundry, Amazon Titan v1/v2, Cohere on Bedrock
Self-hosted / Custom | Ollama, LM Studio, vLLM, any OpenAI-compatible endpoint

ZenSearch Model Aliases

Instead of hardcoding provider-specific names, configure agents with the stable ZenSearch aliases — swap the underlying provider any time without touching your agent configs.

Alias | Purpose | Default (OpenAI) | Default (Groq) | Default (OpenRouter) | Default (Azure) | Default (Bedrock)
zen-mini | Fast, cheap classification / grading / simple chat | gpt-5.4-mini | llama-3.1-8b-instant | openrouter/openai/gpt-5.4-mini | azure/gpt-4o-mini | us.anthropic.claude-haiku-4-5-20251001-v1:0
zen-agent | Tool-using agent workloads | gpt-5.4 | openai/gpt-oss-120b | openrouter/openai/gpt-5.4 | azure/gpt-4o | us.anthropic.claude-sonnet-4-5-20250929-v1:0
zen-agent-pro | Complex reasoning and synthesis | gpt-5.4 | openai/gpt-oss-120b | openrouter/anthropic/claude-sonnet-4.6 | azure/gpt-4.1 | global.anthropic.claude-opus-4-5-20251101-v1:0
zen-embed | Default embedding model | text-embedding-3-small | n/a (Groq has no embeddings) | n/a (OpenRouter has no embeddings) | azure/text-embedding-3-small | amazon.titan-embed-text-v2:0

Switch the active provider at any time by setting ZEN_MODELS_PROVIDER to openai, groq, openrouter, azure, or bedrock. Embeddings are configured independently via ZEN_EMBED_PROVIDER and must use a provider that offers embedding models: OpenAI, Cohere, Jina, Mixedbread, Azure, or Bedrock (Groq and OpenRouter are rejected at startup).
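
For example, a minimal sketch that serves chat from Groq while keeping OpenAI embeddings (both variables are assumed to live in services/core-api/.env, as in the provider walkthroughs below):

# services/core-api/.env
ZEN_MODELS_PROVIDER=groq
ZEN_EMBED_PROVIDER=openai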

Using Azure AI Foundry

Azure AI Foundry exposes OpenAI, Llama, Phi, Mistral, DeepSeek, and Cohere models through a single per-tenant OpenAI-compatible v1 surface. To use it:

  1. In the Azure portal, open your Foundry resource and create deployments for the models you want (e.g. gpt-4o-mini, gpt-4o, text-embedding-3-small). The deployment names become the model IDs you use in ZenSearch.
  2. Copy the resource URL (https://<resource>.openai.azure.com) and an API key from the Keys and Endpoint page.
  3. Set AZURE_API_KEY and AZURE_BASE_URL=https://<resource>.openai.azure.com/openai/v1 in services/model-gw/.env.
  4. In services/core-api/.env, set ZEN_MODELS_PROVIDER=azure and adjust ZEN_MINI_MODEL_AZURE / ZEN_AGENT_MODEL_AZURE / ZEN_AGENT_PRO_MODEL_AZURE if your deployment names don't match the defaults (see the combined example after these steps). Example: ZEN_MINI_MODEL_AZURE=azure/my-gpt-4o-mini-deployment.
  5. Restart core-api and model-gw.
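
Putting steps 3 and 4 together, the two env files end up looking roughly like this (resource and deployment names are placeholders):

# services/model-gw/.env
AZURE_API_KEY=<your-key>
AZURE_BASE_URL=https://<resource>.openai.azure.com/openai/v1

# services/core-api/.env
ZEN_MODELS_PROVIDER=azure
# Optional: only needed if your deployment names differ from the defaults
ZEN_MINI_MODEL_AZURE=azure/my-gpt-4o-mini-deployment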

Partner models (Meta Llama, Mistral, Cohere, Anthropic) require an active Azure Marketplace SaaS subscription in addition to the Foundry resource. OpenAI and Microsoft first-party models work without Marketplace.

Using Amazon Bedrock

Amazon Bedrock runs Claude, Llama, Amazon Nova, Mistral, and Cohere models on your own AWS account through a unified Converse API. To use it:

  1. In the AWS Bedrock console → Model access, enable each model you plan to use (e.g. Claude Sonnet 4.5, Claude Haiku 4.5, Claude Opus 4.5, Titan Embed v2). This is a one-time per-account opt-in. Partner models (Meta, Mistral, Cohere) require Marketplace agreement on top of enablement.
  2. Option A — Bedrock API key (simpler onboarding, GA July 2025): In the Bedrock console → API keys, generate a short-term or long-term key. Set AWS_BEARER_TOKEN_BEDROCK=<key> in services/model-gw/.env. Long-term keys default to 30-day expiry; short-term keys inherit the creating principal's IAM permissions and are recommended for production.
  3. Option B — AWS credentials chain: Leave AWS_BEARER_TOKEN_BEDROCK empty and configure IAM credentials the standard way — env vars (AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY), shared credentials file, IAM role, IRSA, or EC2 instance metadata. The SDK discovers all of these automatically.
  4. Set AWS_REGION in services/model-gw/.env (mandatory: Bedrock is strictly region-scoped). Common choices: us-east-1, us-west-2, eu-central-1, ap-northeast-1.
  5. In services/core-api/.env, set ZEN_MODELS_PROVIDER=bedrock. The defaults use cross-region inference profiles (us. / global. prefixed IDs) for the Claude 4.5 family because base IDs are only enabled in a subset of regions. Override with ZEN_MINI_MODEL_BEDROCK / ZEN_AGENT_MODEL_BEDROCK / ZEN_AGENT_PRO_MODEL_BEDROCK if you need a specific base ID or a different model family (see the combined example after these steps).
  6. Restart core-api and model-gw.
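
Putting steps 2 through 5 together with the API-key option, a minimal sketch (key and region are placeholders; the model override is optional):

# services/model-gw/.env
AWS_BEARER_TOKEN_BEDROCK=<your-bedrock-api-key>
AWS_REGION=us-east-1

# services/core-api/.env
ZEN_MODELS_PROVIDER=bedrock
# Optional: pin a specific model instead of the cross-region default
ZEN_AGENT_MODEL_BEDROCK=us.anthropic.claude-sonnet-4-5-20250929-v1:0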

Gotchas:

  • AccessDeniedException usually means the model isn't enabled for your account in the region you're hitting. Open the Bedrock console → Model access and check.
  • Newer Claude models (4.5 Sonnet/Haiku/Opus) are only addressable via cross-region inference profiles — there are no base model IDs. If you paste an anthropic.claude-sonnet-4-5... ID without a us. / global. / etc. prefix you'll get a ValidationException.
  • Embeddings cannot use inference profiles — always use a base model ID like amazon.titan-embed-text-v2:0. The embedding client rejects us./eu./etc. prefixes at construction time with a clear error.
  • Per-request cost tracking is disabled by default for Bedrock (rate card varies by model, region, and inference-profile geo). Set CostPerInputToken / CostPerOutputToken on the model DB row if you need accurate billing attribution.

Context Windows

Each model has a context window — the maximum amount of information it can process at once. This affects how much conversation history and source documents can be included.

Model | Context Window | Best For
Claude Opus 4.6 | 200,000 tokens | Most intelligent; complex agents and coding
Claude Sonnet 4.6 | 200,000 tokens | Best speed/intelligence balance
Claude Haiku 4.5 | 200,000 tokens | Fastest Claude, near-frontier intelligence
GPT-5.4 / GPT-5 | 400,000 tokens | Frontier reasoning and long-context analysis
GPT-4.1 | 1,000,000 tokens | Very long documents and codebases
GPT-4o | 128,000 tokens | General-purpose chat and agents
GPT-4o-mini | 128,000 tokens | Fast, low-cost classification and grading
Llama 3.1 (Groq) | 128,000 tokens | Low-latency zen-mini workloads
Command-R+ | 128,000 tokens | RAG-optimized responses

Tip: Choose models with larger context windows when working with:

  • Long documents
  • Extended conversations
  • Multiple source documents

Adding Models

Add a New Model

  1. Click Add Model
  2. Select the provider
  3. Choose the model
  4. Enter API key (if required)
  5. Click Add

Configuration Fields

Field | Description
Provider | Model provider (OpenAI, Anthropic, etc.)
Model | Specific model name
API Key | Provider API key
Endpoint | Custom endpoint URL (if applicable)

Default Models

Setting Defaults

Set default models for each use case:

  1. Find the model in the list
  2. Click Set as Default
  3. Select the use case (Chat, Embedding)

Default Assignment

Use Case | Recommendation
Chat | zen-agent (GPT-5.4 or Claude Sonnet 4.6 via OpenRouter)
Fast chat / grading | zen-mini (GPT-5.4-mini or Llama 3.1 via Groq)
Complex reasoning | zen-agent-pro (GPT-5.4 or Claude Opus 4.5 via Bedrock)
Embedding | zen-embed (text-embedding-3-small)
Reranker | Cohere rerank-v3 or the bundled cross-encoder reranker service

Reliability & Cost Controls

The Model Gateway sits between ZenSearch services and every provider, adding production-grade reliability features on top of the raw APIs.

Provider Fallback & Circuit Breaker

When a primary provider starts returning errors or timing out, the gateway automatically routes subsequent calls to a configured fallback provider. A circuit breaker tracks consecutive failures per provider — once a threshold is exceeded, the provider is marked unhealthy and skipped until it recovers. This prevents cascading failures from knocking chat and agents offline when a single upstream has an incident.

Smart Model Routing

The gateway inspects incoming requests destined for zen-agent or zen-agent-pro and downgrades simple prompts (short, single-turn, no tool use) to zen-mini. Complex queries continue to use the stronger model. This cuts token costs on workloads where a fraction of traffic is trivial without forcing developers to make per-request routing decisions.

Enabled by default. Disable with SMART_ROUTING_ENABLED=false on the Model Gateway if you need consistent model selection (e.g. for benchmarking or reproducibility).
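
A sketch, assuming the flag is set in services/model-gw/.env alongside the other gateway settings:

# services/model-gw/.env (assumed location)
SMART_ROUTING_ENABLED=false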

Auto-Retry with Countdown UI

When a provider returns a transient "unavailable" error, the chat UI surfaces a retry countdown rather than a hard error. The request is retried automatically with exponential backoff and rich error metadata is streamed back so operators can see which provider failed and why.

Structured Output Self-Correction

When an agent asks for JSON output and the response fails schema validation, the gateway automatically feeds the validation error back to the model and asks it to repair the output. This runs up to AGENT_STRUCTURED_OUTPUT_MAX_RETRIES times (default: 2) before giving up. Invisible to the caller — you just get valid JSON or a final error.
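
To budget an extra repair attempt, raise the limit (a sketch; that this variable is read from services/model-gw/.env is an assumption):

# services/model-gw/.env (assumed location)
AGENT_STRUCTURED_OUTPUT_MAX_RETRIES=3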

Prompt Caching

System prompts and tool definitions are cached per provider — Anthropic via explicit cache_control markers (90% discount on cached reads), OpenAI and Groq via automatic prefix caching (50% discount). Cache usage is tracked per team and per model so you can see how much of your spend is cached.

Model Usage

Viewing Usage

Navigate to the Model Usage tab to see:

  • Tokens consumed per model
  • Cost breakdown
  • Usage over time
  • Per-team breakdown

Usage Metrics

Metric | Description
Input Tokens | Tokens sent to the model
Output Tokens | Tokens received from the model
Total Cost | Estimated cost
Request Count | Number of API calls

Testing Models

Test Connection

Before saving, test the model:

  1. Click Test Connection
  2. Wait for verification
  3. Check for errors

Test Results

Result | Meaning
Success | Model is accessible
Auth Error | API key is invalid
Network Error | Cannot reach endpoint
Model Error | Model not available

Custom Endpoints

OpenAI-Compatible APIs

For local or self-hosted models:

Provider: Custom
Endpoint: http://localhost:8000/v1
Model: local-llama
API Key: (optional)

Supported Endpoints

  • Ollama
  • LM Studio
  • vLLM
  • Text Generation Inference
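
For example, a local Ollama instance, which serves an OpenAI-compatible API on port 11434, would typically be configured as follows (the model name is whatever you've pulled; llama3.1 here is illustrative):

Provider: Custom
Endpoint: http://localhost:11434/v1
Model: llama3.1
API Key: (optional)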

For the Developer Edition installer's "Local Setup" option, ZenSearch picks chat and embedding models sized against both your available RAM and GPU VRAM, so the Ollama runtime leaves room for Docker and the ZenSearch stack. As of April 2026, the default chat family is qwen3.5, which has confirmed tools, thinking, and vision support across the full size ladder. The installer also creates a custom zensearch-chat Ollama tag that wraps the picked base model with a tier-appropriate num_ctx (8K / 16K / 32K), bypassing Ollama's 4096-token default, which would otherwise truncate the agent's tool definitions and conversation history.
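
The generated tag is roughly equivalent to an Ollama Modelfile like the following sketch (shown for a 16K tier; the base model and num_ctx vary by tier), built with ollama create zensearch-chat -f Modelfile:

# Modelfile (illustrative)
FROM qwen3.5:9b
PARAMETER num_ctx 16384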

GPU-first ladder (when ≥ 8 GB VRAM detected):

GPU VRAM | Chat | Context | Embedding
≥ 48 GB | qwen3.5:35b | 32K | mxbai-embed-large
≥ 24 GB | qwen3.5:27b | 16K | mxbai-embed-large
≥ 16 GB | qwen3.5:9b | 32K | mxbai-embed-large
≥ 12 GB | qwen3.5:9b | 16K | mxbai-embed-large
≥ 8 GB | qwen3.5:4b | 16K | nomic-embed-text

RAM-only ladder (no GPU / Apple Silicon):

Total RAM | Chat | Context | Embedding
≥ 64 GB | qwen3.5:27b | 16K | mxbai-embed-large
32–64 GB | qwen3.5:9b | 16K | mxbai-embed-large
16–32 GB | qwen3.5:4b | 16K | nomic-embed-text
8–16 GB | qwen3.5:4b | 8K | nomic-embed-text
< 8 GB | qwen3.5:2b | 8K | granite-embedding:30m

These tiers assume the full ZenSearch stack is running on the same host. If you're pointing ZenSearch at a dedicated Ollama box you can safely run a larger model — set LLM_CHAT_MODEL / LLM_EMBED_MODEL in .env to override. See the self-hosting guide for the full sizing rationale and notes on the zensearch-chat tag.
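
For example, to run a larger tier against a dedicated Ollama box (model names taken from the tables above):

LLM_CHAT_MODEL=qwen3.5:27b
LLM_EMBED_MODEL=mxbai-embed-large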

Best Practices

Model Selection

  1. Use zen-agent-pro (GPT-5.4 or Claude Opus) for complex queries
  2. Use faster models for simple tasks
  3. Consider cost vs. quality tradeoffs
  4. Test models before production use

API Key Security

  1. Never share API keys
  2. Rotate keys periodically
  3. Use separate keys per environment
  4. Monitor for unauthorized usage

Troubleshooting

Model Not Responding

  1. Verify API key is valid
  2. Check provider status page
  3. Test connection in settings
  4. Review rate limits

High Costs

  1. Review model usage dashboard
  2. Consider using smaller models
  3. Optimize query complexity
  4. Set usage limits
