Models & Providers
ScalyClaw routes LLM calls through a configurable provider stack. You can register multiple models from different providers, assign priorities and weights for load balancing, and define fallback chains so that a single provider outage never stops the system. All model configuration lives in Redis — no config files on disk — and changes take effect immediately without a restart.
Model Configuration
Models are added and managed in the dashboard under Settings → Models. Each entry in the model list describes a single provider/model combination. You can have as many entries as you need; the orchestrator selects the right one at call time based on priority, weight, and the capabilities required by that particular request.
Model Properties
| Property | Type | Description |
|---|---|---|
| `id` | string | Unique identifier for this model entry within ScalyClaw. Used internally and in logs — choose something readable like `gpt4o-primary` or `claude-opus-main`. |
| `name` | string | The model name as the provider expects it — e.g. `gpt-4o`, `claude-sonnet-4-20250514`, `gemini-1.5-pro`. For local/Ollama, this is the tag pulled in Ollama. |
| `provider` | string | Key into the `providers` map — e.g. `openai`, `anthropic`, `google`. Determines which configured provider credentials and base URL are used. |
| `enabled` | boolean | Toggle this model on or off without removing it from the config. Disabled models are never selected by the orchestrator. |
| `priority` | integer | Lower number = higher priority. The orchestrator always tries the lowest-numbered priority group first. If every model in that group fails, it falls back to the next group. Models with `priority: 1` are tried before `priority: 2`. |
| `weight` | integer | Relative weight (0–100) for load balancing among models sharing the same priority. A model with `weight: 3` receives three times as many requests as one with `weight: 1`. Useful for spreading load across multiple keys for the same provider. |
| `temperature` | number | Sampling temperature for this model. Range 0.0–2.0 depending on provider; 0.7 is a reasonable default for conversational use. |
| `maxTokens` | integer | Maximum number of tokens the model may generate in a single response. |
| `contextWindow` | integer | Total context window size for this model in tokens, including both input and output. Used to guard against prompts that would exceed the model's limit. |
| `toolEnabled` | boolean | Whether this model supports tool/function calling. The orchestrator only selects this model for tool-enabled requests when this is `true`. |
| `imageEnabled` | boolean | Whether this model can process image inputs. |
| `audioEnabled` | boolean | Whether this model can process audio inputs. |
| `videoEnabled` | boolean | Whether this model can process video inputs. |
| `documentEnabled` | boolean | Whether this model can process document inputs (e.g. PDFs). |
| `reasoningEnabled` | boolean | Whether this model supports extended thinking / reasoning mode. |
| `inputPricePerMillion` | number | Cost in USD per one million input tokens. Used for budget tracking and spend estimates. |
| `outputPricePerMillion` | number | Cost in USD per one million output tokens. Used for budget tracking and spend estimates. |
Fallback Chain
When a model call fails — due to a network error, rate-limit, timeout, or provider outage — ScalyClaw automatically tries the next available model. The fallback order is determined entirely by the priority field:
- All models with the lowest priority value are candidates for the first attempt. If there are multiple, they share load according to their `weight`.
- If every model in that priority group fails (or none has the required capabilities), the orchestrator moves to the next priority group.
- This continues until either a call succeeds or all models are exhausted, in which case an error is returned to the channel.
Weights within the same priority level are used for weighted-random selection, not strict round-robin. Each candidate's probability of being chosen equals its weight divided by the sum of all weights in that group. A weight of 0 disables a model without removing it from the config.
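The selection logic described above can be sketched in a few lines of Python. This is an illustrative model, not ScalyClaw's actual implementation; the `MODELS` list, `fallback_order`, and `pick` names are hypothetical, and only the fields relevant to selection are shown.

```python
import random

# Hypothetical model entries, reduced to the fields used for selection.
MODELS = [
    {"id": "claude-primary",  "enabled": True, "priority": 1, "weight": 75,  "toolEnabled": True},
    {"id": "gpt4o-secondary", "enabled": True, "priority": 1, "weight": 25,  "toolEnabled": True},
    {"id": "local-fallback",  "enabled": True, "priority": 2, "weight": 100, "toolEnabled": False},
]

def fallback_order(models, need_tools=False):
    """Yield priority groups in ascending priority order.

    Disabled models, zero-weight models, and models lacking a required
    capability are filtered out before grouping.
    """
    candidates = [m for m in models
                  if m["enabled"] and m["weight"] > 0
                  and (not need_tools or m["toolEnabled"])]
    for prio in sorted({m["priority"] for m in candidates}):
        yield [m for m in candidates if m["priority"] == prio]

def pick(group):
    """Weighted-random choice: P(model) = weight / sum of group weights."""
    return random.choices(group, weights=[m["weight"] for m in group], k=1)[0]
```

A caller would walk the groups in order, `pick` a model, attempt the call, retry with the remaining peers in the group on failure, and only then advance to the next group.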
Supported Providers
Providers are registered under config.models.providers as a map from a provider key to an object with an optional apiKey and optional baseUrl. Each model entry's provider field references one of these keys.
| Provider | `provider` key | Notes |
|---|---|---|
| OpenAI | `openai` | GPT-4o, GPT-4o-mini, o1, o3, and future models. Set `apiKey` to your OpenAI API key. |
| Anthropic | `anthropic` | Claude Opus, Sonnet, and Haiku family. Extended thinking supported where available. Set `apiKey` to your Anthropic API key. |
| Google AI | `google` | Gemini 1.5 Pro, Flash, and 2.0 series. Uses the Gemini API (not Vertex AI). Set `apiKey` to your Google AI API key. |
| Azure OpenAI | `azure` | Set `apiKey` to your Azure API key and `baseUrl` to your Azure endpoint. Use the deployment name as the model's `name` field. |
| Local / Ollama | `local` | Any OpenAI-compatible local server (Ollama, LM Studio, llama.cpp). Set `baseUrl` to the local address (e.g. `http://localhost:11434/v1`). `apiKey` can be omitted or left empty. |
Example Configuration
The following shows a realistic multi-provider setup: Anthropic as the primary model, OpenAI as a same-priority peer with lower weight, and a local Ollama model as a last-resort fallback for basic requests.
```json
{
  "models": {
    "providers": {
      "anthropic": { "apiKey": "sk-ant-..." },
      "openai": { "apiKey": "sk-..." },
      "local": { "baseUrl": "http://localhost:11434/v1" }
    },
    "models": [
      {
        "id": "claude-primary",
        "name": "claude-opus-4-6",
        "provider": "anthropic",
        "enabled": true,
        "priority": 1,
        "weight": 75,
        "temperature": 0.7,
        "maxTokens": 8192,
        "contextWindow": 200000,
        "toolEnabled": true,
        "imageEnabled": true,
        "audioEnabled": false,
        "videoEnabled": false,
        "documentEnabled": true,
        "reasoningEnabled": false,
        "inputPricePerMillion": 15.00,
        "outputPricePerMillion": 75.00
      },
      {
        "id": "gpt4o-secondary",
        "name": "gpt-4o",
        "provider": "openai",
        "enabled": true,
        "priority": 1,
        "weight": 25,
        "temperature": 0.7,
        "maxTokens": 4096,
        "contextWindow": 128000,
        "toolEnabled": true,
        "imageEnabled": true,
        "audioEnabled": false,
        "videoEnabled": false,
        "documentEnabled": false,
        "reasoningEnabled": false,
        "inputPricePerMillion": 2.50,
        "outputPricePerMillion": 10.00
      },
      {
        "id": "local-fallback",
        "name": "llama3.2",
        "provider": "local",
        "enabled": true,
        "priority": 2,
        "weight": 100,
        "temperature": 0.7,
        "maxTokens": 2048,
        "contextWindow": 8192,
        "toolEnabled": false,
        "imageEnabled": false,
        "audioEnabled": false,
        "videoEnabled": false,
        "documentEnabled": false,
        "reasoningEnabled": false,
        "inputPricePerMillion": 0,
        "outputPricePerMillion": 0
      }
    ],
    "embeddingModels": []
  }
}
```
With this config, 75% of priority-1 requests go to Claude (weight 75 out of 100) and 25% go to GPT-4o (weight 25 out of 100). If both fail, the local Llama 3.2 instance handles the request — but only for requests that do not require tools or multimodal inputs, since those capability flags are false for the fallback.
If a request requires tool use and no model in any priority group has toolEnabled: true, the call fails immediately rather than falling through to an incapable model and producing a broken response. Always ensure at least one enabled model has the required capability flags set for each feature you rely on.
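A pre-flight check like the following catches this misconfiguration early. It is a hedged sketch, not ScalyClaw code; the `has_capable_model` name and the entry shape are assumptions for illustration.

```python
def has_capable_model(models, required_flags):
    """Return True if at least one enabled model sets every required
    capability flag (e.g. "toolEnabled", "imageEnabled")."""
    return any(
        m["enabled"] and all(m.get(flag, False) for flag in required_flags)
        for m in models
    )

# Hypothetical entries mirroring the example configuration above.
models = [
    {"id": "claude-primary",  "enabled": True, "toolEnabled": True,  "imageEnabled": True},
    {"id": "local-fallback",  "enabled": True, "toolEnabled": False, "imageEnabled": False},
]
```

Running this check at startup for each feature you rely on (tools, images, documents) turns a confusing runtime failure into an obvious configuration error.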
Embedding Models
ScalyClaw's memory system stores every saved memory entry alongside a high-dimensional vector embedding. When the orchestrator retrieves relevant context before an LLM call, it runs a cosine-similarity search using sqlite-vec against those stored vectors. The accuracy of that search depends entirely on the quality of the embedding model you choose.
How Embeddings Are Generated
When a memory entry is saved — either automatically by the orchestrator or explicitly via the save_memory tool — ScalyClaw calls the configured embedding model to convert the text into a float32 vector. That vector is stored in the SQLite database alongside the entry. At retrieval time, the query text is embedded on the fly using the same model, and sqlite-vec finds the nearest stored vectors by cosine distance.
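The distance computation that sqlite-vec performs can be illustrated in plain Python. This sketch shows only the math; the actual embedding call to the provider API and the SQLite storage layer are omitted, and the `nearest` and `cosine_distance` names are illustrative, not ScalyClaw's API.

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two vectors: 1 - cos(angle). Lower = more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def nearest(query_vec, stored, k=2):
    """Return the k stored entries whose vectors are closest to the query."""
    return sorted(stored, key=lambda e: cosine_distance(query_vec, e["vec"]))[:k]
```

At retrieval time the query text is embedded into `query_vec` with the same model that produced the stored vectors, and the lowest-distance entries are injected into the LLM's context.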
All stored vectors must come from the same model. Switching to a different embedding model produces incompatible vectors — semantic search will return nonsense results. If you need to switch models, re-embed all existing memories first using Settings → Memory → Re-embed all in the dashboard.
Recommended Models
| Model | Provider | Dimensions | Recommended for |
|---|---|---|---|
| `text-embedding-3-small` | OpenAI | 1536 | Best cost-to-quality ratio for most deployments. Good multilingual support. Default recommendation. |
| `text-embedding-3-large` | OpenAI | 3072 | Higher accuracy for large, diverse memory stores. Higher cost and storage per entry. |
| `text-embedding-ada-002` | OpenAI | 1536 | Legacy model. Use `text-embedding-3-small` instead for new deployments. |
| `nomic-embed-text` | Local / Ollama | 768 | Fully local, no API cost. Good quality for English-primary content. Pull with `ollama pull nomic-embed-text`. |
| `mxbai-embed-large` | Local / Ollama | 1024 | Higher-quality local embedding. Slightly larger and slower than `nomic-embed-text`, but better recall. |
Embedding Configuration
Embedding models live in config.models.embeddingModels, the same config section as chat models but in a separate array. You can use a different provider for embeddings than for chat — for example, use Anthropic for chat but OpenAI's cheaper embedding API for memory. Each entry shares the same providers map as chat models.
```json
{
  "models": {
    "providers": {
      "openai": { "apiKey": "sk-..." }
    },
    "models": [ /* ... chat models ... */ ],
    "embeddingModels": [
      {
        "id": "openai-embed",
        "name": "text-embedding-3-small",
        "provider": "openai",
        "enabled": true,
        "priority": 1,
        "weight": 100,
        "dimensions": 1536,
        "inputPricePerMillion": 0.02,
        "outputPricePerMillion": 0
      }
    ]
  }
}
```
For a fully local setup with Ollama, add the local provider to the providers map and point the embedding model at it:
```json
{
  "models": {
    "providers": {
      "local": { "baseUrl": "http://localhost:11434/v1" }
    },
    "models": [ /* ... chat models ... */ ],
    "embeddingModels": [
      {
        "id": "local-embed",
        "name": "nomic-embed-text",
        "provider": "local",
        "enabled": true,
        "priority": 1,
        "weight": 100,
        "dimensions": 768,
        "inputPricePerMillion": 0,
        "outputPricePerMillion": 0
      }
    ]
  }
}
```
The dimensions field must exactly match what the model actually produces. If you set it wrong, sqlite-vec will reject the insert. Check your model's documentation for the exact output dimension before setting this value.
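A simple guard makes the mismatch visible before the insert is attempted. This is an illustrative sketch, not part of ScalyClaw; the `check_vector` name is hypothetical, and the actual rejection is raised by sqlite-vec itself.

```python
def check_vector(vec, dimensions):
    """Raise if the embedding's length does not match the configured
    dimensions value, mirroring the insert-time failure sqlite-vec raises."""
    if len(vec) != dimensions:
        raise ValueError(f"expected {dimensions} dimensions, got {len(vec)}")
    return vec
```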
Budget Control
LLM API calls cost money. ScalyClaw tracks token usage per model and per day, accumulates spending estimates based on the inputPricePerMillion and outputPricePerMillion values you configure on each model, and enforces configurable global daily and monthly limits. Budget is a single global config block — there are no per-model budget caps. You can choose between hard enforcement (block all calls when the limit is reached) or soft enforcement (warn but continue).
Enforcement Modes
| Mode | Behavior when limit is reached | Use when |
|---|---|---|
| Hard stop | All LLM calls are blocked immediately. The system returns an error message to the channel explaining the budget limit has been reached. No calls go out until the limit resets (midnight UTC for daily, first of month for monthly). | Production deployments with strict cost controls, shared installations, or when you want to guarantee a monthly maximum spend. |
| Soft warn | LLM calls continue normally. A warning is emitted to the dashboard logs and, optionally, to a configured alert channel. The system does not stop; it only signals that the threshold has been crossed. | Personal deployments where uninterrupted service matters more than strict spend enforcement, or when you want visibility without interruption. |
Per-Model Cost Tracking
Every LLM call records the number of input tokens, output tokens, and the estimated cost in USD using the pricing table ScalyClaw maintains for each known model. Costs are stored in Redis and aggregated by day and by month. The dashboard usage page displays:
- Daily and monthly spend broken down by model
- Token usage histograms per model per day
- Budget consumption as a percentage of configured limits
- A list of the most expensive individual requests
For models with custom or unknown pricing (e.g., local models or new provider releases), set inputPricePerMillion and outputPricePerMillion directly on the model entry in config.models.models. ScalyClaw uses those figures for all cost tracking and budget accounting for that model. Set both to 0 for free local models.
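The per-call estimate follows directly from those two configured prices. The function below is a sketch of that arithmetic under the pricing model described above; the name `estimate_cost_usd` is illustrative, not ScalyClaw's API.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_million, output_price_per_million):
    """Estimated USD cost of one LLM call, using the per-million-token
    prices configured on the model entry."""
    return (input_tokens / 1_000_000 * input_price_per_million
            + output_tokens / 1_000_000 * output_price_per_million)
```

For example, a call that consumes 12,000 input tokens and produces 2,000 output tokens against the `gpt4o-secondary` entry above (2.50 / 10.00 per million) costs an estimated 0.03 + 0.02 = 0.05 USD.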
Budget Configuration
{
"budget": {
"monthlyLimit": 150,
"dailyLimit": 10,
"hardLimit": true,
"alertThresholds": [50, 80, 90]
}
}
The fields are:

- `monthlyLimit` — maximum USD spend per calendar month. Set to `0` for unlimited.
- `dailyLimit` — maximum USD spend per day (resets at midnight UTC). Set to `0` for unlimited.
- `hardLimit` — when `true`, all LLM calls are blocked once a limit is reached. When `false`, the system continues but emits warnings.
- `alertThresholds` — array of percentage values (e.g. `[50, 80, 90]`). A warning is emitted to dashboard logs and any configured alert channels each time cumulative spend crosses one of these thresholds, giving you advance notice before a hard stop occurs.
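The threshold-crossing logic can be sketched as follows. This is an assumption-laden illustration, not ScalyClaw's implementation; the `crossed_thresholds` name is hypothetical, and it simply compares the percentage of the limit consumed before and after a call.

```python
def crossed_thresholds(limit, previous_spend, new_spend, thresholds=(50, 80, 90)):
    """Return the alert thresholds (percentages of the limit) crossed
    between the previous cumulative spend and the new cumulative spend."""
    if limit <= 0:
        return []  # a limit of 0 means unlimited: nothing to alert on
    prev_pct = previous_spend / limit * 100
    new_pct = new_spend / limit * 100
    return [t for t in thresholds if prev_pct < t <= new_pct]
```

Because the check uses the previous total as the lower bound, each threshold fires exactly once per reset period, even when a single expensive call jumps past several thresholds at once.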
Custom Pricing
Set inputPricePerMillion and outputPricePerMillion directly on any model entry. ScalyClaw uses those values for all cost tracking for that model:
```json
{
  "id": "my-azure-deployment",
  "name": "my-gpt4o-deployment",
  "provider": "azure",
  "enabled": true,
  "priority": 1,
  "weight": 100,
  "temperature": 0.7,
  "maxTokens": 4096,
  "contextWindow": 128000,
  "toolEnabled": true,
  "imageEnabled": true,
  "audioEnabled": false,
  "videoEnabled": false,
  "documentEnabled": false,
  "reasoningEnabled": false,
  "inputPricePerMillion": 2.50,
  "outputPricePerMillion": 10.00
}
```
Set monthlyLimit and dailyLimit conservatively with hardLimit: true in production. The alertThresholds array lets you get warnings at e.g. 50%, 80%, and 90% of the limit so you can react before the system blocks calls entirely.