Models & Providers
ScalyClaw routes LLM calls through a configurable provider stack. You can register multiple models from different providers, assign priorities and weights for load balancing, and define fallback chains so that a single provider outage never stops the system. All model configuration lives in Redis — no config files on disk — and changes take effect immediately without a restart.
Model Configuration
Models are added and managed in the dashboard under Settings → Models. Each entry in the model list describes a single provider/model combination. You can have as many entries as you need; the orchestrator selects the right one at call time based on priority, weight, and the capabilities required by that particular request.
Model Properties
| Property | Type | Description |
|---|---|---|
| `id` | string | Unique identifier for this model entry within ScalyClaw. Used internally and in logs — choose something readable like `gpt4o-primary` or `claude-opus-main`. |
| `name` | string | The model name as the provider expects it — e.g. `gpt-4o`, `claude-sonnet-4-20250514`, `gemini-1.5-pro`. For local/Ollama, this is the tag pulled in Ollama. |
| `provider` | string | Key into the `providers` map — e.g. `openai`, `anthropic`, `google`. Determines which configured provider credentials and base URL are used. |
| `enabled` | boolean | Toggle this model on or off without removing it from the config. Disabled models are never selected by the orchestrator. |
| `priority` | integer | Lower number = higher priority. The orchestrator always tries the lowest-numbered priority group first. If every model in that group fails, it falls back to the next group. Models with `priority: 1` are tried before `priority: 2`. |
| `weight` | integer | Relative weight (0–100) for load balancing among models sharing the same priority. A model with `weight: 3` receives three times as many requests as one with `weight: 1`. Useful for spreading load across multiple keys for the same provider. |
| `temperature` | number | Sampling temperature for this model. Range 0.0–2.0 depending on provider; 0.7 is a reasonable default for conversational use. |
| `maxTokens` | integer | Maximum number of tokens the model may generate in a single response. |
| `contextWindow` | integer | Total context window size for this model in tokens, including both input and output. Used to guard against prompts that would exceed the model's limit. |
| `toolEnabled` | boolean | Whether this model supports tool/function calling. The orchestrator only selects this model for tool-enabled requests when this is `true`. |
| `imageEnabled` | boolean | Whether this model can process image inputs. |
| `audioEnabled` | boolean | Whether this model can process audio inputs. |
| `videoEnabled` | boolean | Whether this model can process video inputs. |
| `documentEnabled` | boolean | Whether this model can process document inputs (e.g. PDFs). |
| `reasoningEnabled` | boolean | Whether this model supports extended thinking / reasoning mode. |
| `inputPricePerMillion` | number | Cost in USD per one million input tokens. Used for budget tracking and spend estimates. |
| `outputPricePerMillion` | number | Cost in USD per one million output tokens. Used for budget tracking and spend estimates. |
Fallback Chain
When a model call fails — due to a network error, rate-limit, timeout, or provider outage — ScalyClaw automatically tries the next available model. The fallback order is determined entirely by the priority field:
- All models with the lowest priority value are candidates for the first attempt. If there are multiple, they share load according to their `weight`.
- If every model in that priority group fails (or none has the required capabilities), the orchestrator moves to the next priority group.
- This continues until either a call succeeds or all models are exhausted, in which case an error is returned to the channel.
Weights within the same priority level are used for weighted-random selection, not strict round-robin. Each candidate's probability of being chosen equals its weight divided by the sum of all weights in that group. A weight of 0 disables a model without removing it from the config.
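The selection logic described above can be sketched in a few lines of Python. This is an illustrative model, not ScalyClaw's actual implementation; the `MODELS` list, `fallback_order`, and `pick` names are hypothetical, and only the fields relevant to selection are shown.

```python
import random

# Hypothetical model entries, reduced to the fields used for selection.
MODELS = [
    {"id": "claude-primary",  "enabled": True, "priority": 1, "weight": 75,  "toolEnabled": True},
    {"id": "gpt4o-secondary", "enabled": True, "priority": 1, "weight": 25,  "toolEnabled": True},
    {"id": "local-fallback",  "enabled": True, "priority": 2, "weight": 100, "toolEnabled": False},
]

def fallback_order(models, need_tools=False):
    """Yield priority groups in ascending priority order.

    Disabled models, zero-weight models, and models lacking a required
    capability are filtered out before grouping.
    """
    candidates = [m for m in models
                  if m["enabled"] and m["weight"] > 0
                  and (not need_tools or m["toolEnabled"])]
    for prio in sorted({m["priority"] for m in candidates}):
        yield [m for m in candidates if m["priority"] == prio]

def pick(group):
    """Weighted-random choice: P(model) = weight / sum of group weights."""
    return random.choices(group, weights=[m["weight"] for m in group], k=1)[0]
```

A caller would walk the groups in order, `pick` a model, attempt the call, retry with the remaining peers in the group on failure, and only then advance to the next group.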
Supported Providers
Providers are registered under config.models.providers as a map from a provider key to an object with an optional apiKey and optional baseUrl. Each model entry's provider field references one of these keys.
| Provider | `provider` key | Notes |
|---|---|---|
| OpenAI | `openai` | GPT-4o, GPT-4o-mini, o1, o3, and future models. Set `apiKey` to your OpenAI API key. |
| Anthropic | `anthropic` | Claude Opus, Sonnet, and Haiku family. Extended thinking supported where available. Set `apiKey` to your Anthropic API key. |
| Google AI | `google` | Gemini 1.5 Pro, Flash, and 2.0 series. Uses the Gemini API (not Vertex AI). Set `apiKey` to your Google AI API key. |
| Azure OpenAI | `azure` | Set `apiKey` to your Azure API key and `baseUrl` to your Azure endpoint. Use the deployment name as the model's `name` field. |
| Local / Ollama | `local` | Any OpenAI-compatible local server (Ollama, LM Studio, llama.cpp). Set `baseUrl` to the local address (e.g. `http://localhost:11434/v1`). `apiKey` can be omitted or left empty. |
Example Configuration
The following shows a realistic multi-provider setup: Anthropic as the primary model, OpenAI as a same-priority peer with lower weight, and a local Ollama model as a last-resort fallback for basic requests.
```json
{
  "models": {
    "providers": {
      "anthropic": { "apiKey": "sk-ant-..." },
      "openai": { "apiKey": "sk-..." },
      "local": { "baseUrl": "http://localhost:11434/v1" }
    },
    "models": [
      {
        "id": "claude-primary",
        "name": "claude-opus-4-6",
        "provider": "anthropic",
        "enabled": true,
        "priority": 1,
        "weight": 75,
        "temperature": 0.7,
        "maxTokens": 8192,
        "contextWindow": 200000,
        "toolEnabled": true,
        "imageEnabled": true,
        "audioEnabled": false,
        "videoEnabled": false,
        "documentEnabled": true,
        "reasoningEnabled": false,
        "inputPricePerMillion": 15.00,
        "outputPricePerMillion": 75.00
      },
      {
        "id": "gpt4o-secondary",
        "name": "gpt-4o",
        "provider": "openai",
        "enabled": true,
        "priority": 1,
        "weight": 25,
        "temperature": 0.7,
        "maxTokens": 4096,
        "contextWindow": 128000,
        "toolEnabled": true,
        "imageEnabled": true,
        "audioEnabled": false,
        "videoEnabled": false,
        "documentEnabled": false,
        "reasoningEnabled": false,
        "inputPricePerMillion": 2.50,
        "outputPricePerMillion": 10.00
      },
      {
        "id": "local-fallback",
        "name": "llama3.2",
        "provider": "local",
        "enabled": true,
        "priority": 2,
        "weight": 100,
        "temperature": 0.7,
        "maxTokens": 2048,
        "contextWindow": 8192,
        "toolEnabled": false,
        "imageEnabled": false,
        "audioEnabled": false,
        "videoEnabled": false,
        "documentEnabled": false,
        "reasoningEnabled": false,
        "inputPricePerMillion": 0,
        "outputPricePerMillion": 0
      }
    ],
    "embeddingModels": []
  }
}
```
With this config, 75% of priority-1 requests go to Claude (weight 75 out of 100) and 25% go to GPT-4o (weight 25 out of 100). If both fail, the local Llama 3.2 instance handles the request — but only for requests that do not require tools or multimodal inputs, since those capability flags are false for the fallback.
If a request requires tool use and no model in any priority group has toolEnabled: true, the call fails immediately rather than falling through to an incapable model and producing a broken response. Always ensure at least one enabled model has the required capability flags set for each feature you rely on.
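A pre-flight check like the following catches this misconfiguration early. It is a hedged sketch, not ScalyClaw code; the `has_capable_model` name and the entry shape are assumptions for illustration.

```python
def has_capable_model(models, required_flags):
    """Return True if at least one enabled model sets every required
    capability flag (e.g. "toolEnabled", "imageEnabled")."""
    return any(
        m["enabled"] and all(m.get(flag, False) for flag in required_flags)
        for m in models
    )

# Hypothetical entries mirroring the example configuration above.
models = [
    {"id": "claude-primary",  "enabled": True, "toolEnabled": True,  "imageEnabled": True},
    {"id": "local-fallback",  "enabled": True, "toolEnabled": False, "imageEnabled": False},
]
```

Running this check at startup for each feature you rely on (tools, images, documents) turns a confusing runtime failure into an obvious configuration error.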
Embedding Models
ScalyClaw's memory system stores every saved memory entry alongside a high-dimensional vector embedding. When the orchestrator retrieves relevant context before an LLM call, it runs a cosine-similarity search using sqlite-vec against those stored vectors. The accuracy of that search depends entirely on the quality of the embedding model you choose.
How Embeddings Are Generated
When a memory entry is saved — either automatically by the orchestrator or explicitly via the save_memory tool — ScalyClaw calls the configured embedding model to convert the text into a float32 vector. That vector is stored in the SQLite database alongside the entry. At retrieval time, the query text is embedded on the fly using the same model, and sqlite-vec finds the nearest stored vectors by cosine distance.
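The distance computation that sqlite-vec performs can be illustrated in plain Python. This sketch shows only the math; the actual embedding call to the provider API and the SQLite storage layer are omitted, and the `nearest` and `cosine_distance` names are illustrative, not ScalyClaw's API.

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two vectors: 1 - cos(angle). Lower = more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def nearest(query_vec, stored, k=2):
    """Return the k stored entries whose vectors are closest to the query."""
    return sorted(stored, key=lambda e: cosine_distance(query_vec, e["vec"]))[:k]
```

At retrieval time the query text is embedded into `query_vec` with the same model that produced the stored vectors, and the lowest-distance entries are injected into the LLM's context.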
All stored vectors must come from the same model. Switching to a different embedding model produces incompatible vectors — semantic search will return nonsense results. If you need to switch models, re-embed all existing memories first using Settings → Memory → Re-embed all in the dashboard.
Recommended Models
| Model | Provider | Dimensions | Recommended for |
|---|---|---|---|
| `text-embedding-3-small` | OpenAI | 1536 | Best cost-to-quality ratio for most deployments. Good multilingual support. Default recommendation. |
| `text-embedding-3-large` | OpenAI | 3072 | Higher accuracy for large, diverse memory stores. Higher cost and storage per entry. |
| `text-embedding-ada-002` | OpenAI | 1536 | Legacy model. Use `text-embedding-3-small` instead for new deployments. |
| `nomic-embed-text` | Local / Ollama | 768 | Fully local, no API cost. Good quality for English-primary content. Pull with `ollama pull nomic-embed-text`. |
| `mxbai-embed-large` | Local / Ollama | 1024 | Higher-quality local embedding. Slightly larger and slower than `nomic-embed-text`, but better recall. |
Embedding Configuration
Embedding models live in config.models.embeddingModels, the same config section as chat models but in a separate array. You can use a different provider for embeddings than for chat — for example, use Anthropic for chat but OpenAI's cheaper embedding API for memory. Each entry shares the same providers map as chat models.
```json
{
  "models": {
    "providers": {
      "openai": { "apiKey": "sk-..." }
    },
    "models": [ /* ... chat models ... */ ],
    "embeddingModels": [
      {
        "id": "openai-embed",
        "name": "text-embedding-3-small",
        "provider": "openai",
        "enabled": true,
        "priority": 1,
        "weight": 100,
        "dimensions": 1536,
        "inputPricePerMillion": 0.02,
        "outputPricePerMillion": 0
      }
    ]
  }
}
```
For a fully local setup with Ollama, add the local provider to the providers map and point the embedding model at it:
```json
{
  "models": {
    "providers": {
      "local": { "baseUrl": "http://localhost:11434/v1" }
    },
    "models": [ /* ... chat models ... */ ],
    "embeddingModels": [
      {
        "id": "local-embed",
        "name": "nomic-embed-text",
        "provider": "local",
        "enabled": true,
        "priority": 1,
        "weight": 100,
        "dimensions": 768,
        "inputPricePerMillion": 0,
        "outputPricePerMillion": 0
      }
    ]
  }
}
```
The dimensions field must exactly match what the model actually produces. If you set it wrong, sqlite-vec will reject the insert. Check your model's documentation for the exact output dimension before setting this value.
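A simple guard makes the mismatch visible before the insert is attempted. This is an illustrative sketch, not part of ScalyClaw; the `check_vector` name is hypothetical, and the actual rejection is raised by sqlite-vec itself.

```python
def check_vector(vec, dimensions):
    """Raise if the embedding's length does not match the configured
    dimensions value, mirroring the insert-time failure sqlite-vec raises."""
    if len(vec) != dimensions:
        raise ValueError(f"expected {dimensions} dimensions, got {len(vec)}")
    return vec
```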
Budget Control
LLM API calls cost money. ScalyClaw tracks token usage per model and per day, accumulates spending estimates based on the inputPricePerMillion and outputPricePerMillion values you configure on each model, and enforces configurable global daily and monthly limits. Budget is a single global config block — there are no per-model budget caps. You can choose between hard enforcement (block all calls when the limit is reached) or soft enforcement (warn but continue).
Enforcement Modes
| Mode | Behavior when limit is reached | Use when |
|---|---|---|
| Hard stop | All LLM calls are blocked immediately. The system returns an error message to the channel explaining the budget limit has been reached. No calls go out until the limit resets (midnight UTC for daily, first of month for monthly). | Production deployments with strict cost controls, shared installations, or when you want to guarantee a monthly maximum spend. |
| Soft warn | LLM calls continue normally. A warning is emitted to the dashboard logs and, optionally, to a configured alert channel. The system does not stop; it only signals that the threshold has been crossed. | Personal deployments where uninterrupted service matters more than strict spend enforcement, or when you want visibility without interruption. |
Per-Model Cost Tracking
Every LLM call records the number of input tokens, output tokens, and the estimated cost in USD using the pricing table ScalyClaw maintains for each known model. Costs are stored in Redis and aggregated by day and by month. The dashboard usage page displays:
- Daily and monthly spend broken down by model
- Token usage histograms per model per day
- Budget consumption as a percentage of configured limits
- A list of the most expensive individual requests
For models with custom or unknown pricing (e.g., local models or new provider releases), set inputPricePerMillion and outputPricePerMillion directly on the model entry in config.models.models. ScalyClaw uses those figures for all cost tracking and budget accounting for that model. Set both to 0 for free local models.
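The per-call estimate follows directly from those two configured prices. The function below is a sketch of that arithmetic under the pricing model described above; the name `estimate_cost_usd` is illustrative, not ScalyClaw's API.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_million, output_price_per_million):
    """Estimated USD cost of one LLM call, using the per-million-token
    prices configured on the model entry."""
    return (input_tokens / 1_000_000 * input_price_per_million
            + output_tokens / 1_000_000 * output_price_per_million)
```

For example, a call that consumes 12,000 input tokens and produces 2,000 output tokens against the `gpt4o-secondary` entry above (2.50 / 10.00 per million) costs an estimated 0.03 + 0.02 = 0.05 USD.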
Budget Configuration
{
"budget": {
"monthlyLimit": 150,
"dailyLimit": 10,
"hardLimit": true,
"alertThresholds": [50, 80, 90]
}
}
The fields are:

- `monthlyLimit` — maximum USD spend per calendar month. Set to `0` for unlimited.
- `dailyLimit` — maximum USD spend per day (resets at midnight UTC). Set to `0` for unlimited.
- `hardLimit` — when `true`, all LLM calls are blocked once a limit is reached. When `false`, the system continues but emits warnings.
- `alertThresholds` — array of percentage values (e.g. `[50, 80, 90]`). A warning is emitted to dashboard logs and any configured alert channels each time cumulative spend crosses one of these thresholds, giving you advance notice before a hard stop occurs.
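The threshold-crossing logic can be sketched as follows. This is an assumption-laden illustration, not ScalyClaw's implementation; the `crossed_thresholds` name is hypothetical, and it simply compares the percentage of the limit consumed before and after a call.

```python
def crossed_thresholds(limit, previous_spend, new_spend, thresholds=(50, 80, 90)):
    """Return the alert thresholds (percentages of the limit) crossed
    between the previous cumulative spend and the new cumulative spend."""
    if limit <= 0:
        return []  # a limit of 0 means unlimited: nothing to alert on
    prev_pct = previous_spend / limit * 100
    new_pct = new_spend / limit * 100
    return [t for t in thresholds if prev_pct < t <= new_pct]
```

Because the check uses the previous total as the lower bound, each threshold fires exactly once per reset period, even when a single expensive call jumps past several thresholds at once.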
Custom Pricing
Set inputPricePerMillion and outputPricePerMillion directly on any model entry. ScalyClaw uses those values for all cost tracking for that model:
```json
{
  "id": "my-azure-deployment",
  "name": "my-gpt4o-deployment",
  "provider": "azure",
  "enabled": true,
  "priority": 1,
  "weight": 100,
  "temperature": 0.7,
  "maxTokens": 4096,
  "contextWindow": 128000,
  "toolEnabled": true,
  "imageEnabled": true,
  "audioEnabled": false,
  "videoEnabled": false,
  "documentEnabled": false,
  "reasoningEnabled": false,
  "inputPricePerMillion": 2.50,
  "outputPricePerMillion": 10.00
}
```
Set monthlyLimit and dailyLimit conservatively with hardLimit: true in production. The alertThresholds array lets you get warnings at e.g. 50%, 80%, and 90% of the limit so you can react before the system blocks calls entirely.