Troubleshooting¶
Common issues and solutions when running Loom.
Setup & Configuration¶
loom setup can't detect Ollama¶
Symptom: Setup wizard reports "Ollama not detected" even though Ollama is running.
Fix:
- Check Ollama is running: `curl http://localhost:11434/api/tags`
- If Ollama is on a different port or host, enter the URL when prompted
- Check firewall isn't blocking port 11434
- If using Docker Ollama: `docker run -p 11434:11434 ollama/ollama`
loom setup Anthropic key validation fails¶
Symptom: Setup reports "Key validation failed" after entering an API key.
Fix:
- Double-check the key starts with `sk-ant-`
- Verify network connectivity to `api.anthropic.com`
- The key is saved anyway — validation is best-effort
- Test manually: `curl -H "x-api-key: sk-ant-..." https://api.anthropic.com/v1/models`
Config file not picked up¶
Symptom: Settings from ~/.loom/config.yaml don't take effect.
Fix:
- Check the file exists: `cat ~/.loom/config.yaml`
- Env vars override config file values — check for conflicting `OLLAMA_URL` and `ANTHROPIC_API_KEY`
- Priority: CLI flags > env vars > config.yaml > defaults
- See Configuration for the full priority chain
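The precedence chain above amounts to a first-match lookup across sources. Here is an illustrative sketch, not Loom's actual loader; the function and argument names are hypothetical:

```python
# Hypothetical first-match resolver for: CLI flags > env vars > config.yaml > defaults.
def resolve_setting(key, cli_flags, env, config_file, defaults):
    """Walk sources from highest to lowest priority and return the first value set."""
    for source in (cli_flags, env, config_file, defaults):
        if source.get(key) is not None:
            return source[key]
    return None

# An env var overrides the same key in config.yaml:
value = resolve_setting(
    "ollama_url",
    cli_flags={},
    env={"ollama_url": "http://remote:11434"},
    config_file={"ollama_url": "http://localhost:11434"},
    defaults={"ollama_url": "http://localhost:11434"},
)
# value == "http://remote:11434" — the env var wins over config.yaml
```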
RAG Pipeline¶
loom rag ingest fails with "No valid exports found"¶
Symptom: Ingest exits immediately without processing any files.
Fix:
- Verify files exist: `ls /path/to/exports/result*.json`
- Telegram exports must be in JSON format (not HTML)
- Use Telegram Desktop → Export Chat → JSON format
- File paths are passed as arguments: `loom rag ingest file1.json file2.json`
Embedding fails during loom rag ingest¶
Symptom: Ingest hangs or errors at the "Storing chunks" step.
Fix:
- Check Ollama is running: `curl http://localhost:11434/api/tags`
- Check the embedding model is installed: `ollama list | grep nomic-embed-text`
- Pull the model: `ollama pull nomic-embed-text`
- Use `--no-embed` to skip embeddings: `loom rag ingest --no-embed files...`
- Check the Ollama URL in config: `grep ollama_url ~/.loom/config.yaml`
loom rag search returns no results¶
Symptom: Search returns "No results found" even after ingesting data.
Fix:
- Check the store has data: `loom rag stats`
- If you ingested with `--no-embed`, search won't work (embeddings are required)
- Re-ingest with embeddings: `loom rag ingest --embed files...`
- Lower the score threshold: `loom rag search "query" --min-score 0.0`
- Check you're using the same store path: `loom rag --db-path /path/to/store.duckdb search "query"`
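To see why lowering `--min-score` surfaces more results, here is a hedged sketch of similarity scoring and threshold filtering; the scoring function and data shapes are assumptions, not Loom's implementation:

```python
import math

def cosine_similarity(a, b):
    """Standard cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def filter_hits(hits, min_score):
    """Drop results scoring below the threshold; 0.0 keeps everything."""
    return [(doc, score) for doc, score in hits if score >= min_score]

hits = [("chunk-a", 0.82), ("chunk-b", 0.31), ("chunk-c", 0.05)]
# With a 0.5 cutoff only chunk-a survives; with 0.0 all three are returned.
```

A high default cutoff can hide weak-but-relevant matches, so `--min-score 0.0` is a useful diagnostic step before tuning the threshold back up.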
LanceDB import errors¶
Symptom: `ImportError: No module named 'lancedb'` when using `--store lancedb`.
Fix:
NATS Connection¶
Cannot connect to NATS¶
Symptom: Actor exits immediately with `bus.connected` never appearing in logs, or the error `Could not connect to server`.
Fix:
# Check if NATS is running
nats-server --version # Should print version
curl -s http://localhost:8222/varz | head -5 # NATS monitoring endpoint
# Start NATS via Docker (quickest)
docker run -d --name nats -p 4222:4222 -p 8222:8222 nats:latest
# Or via Homebrew (macOS)
brew install nats-server
nats-server &
# Or via Docker Compose (full stack)
docker compose up -d
NATS connection drops intermittently¶
Symptom: Log shows `bus.disconnected` followed by `bus.reconnected` (or actor crash after 60s of retries).
Fix:
- Check `nats-server` resource usage (memory, disk, connections)
- Increase the NATS max payload if sending large messages: `nats-server --max_payload 4MB`
- If behind a load balancer, ensure the idle timeout exceeds the NATS ping interval (default 2 min)
- Check network stability between the client and the NATS server
Messages silently dropped¶
Symptom: Tasks published but no worker picks them up. No errors in logs.
Cause: NATS uses at-most-once delivery. If no subscriber is listening when a message is published, it is silently dropped.
Fix:
- Ensure workers are running before publishing tasks
- Start actors in the right order: workers → router → orchestrator/pipeline
- Check that `worker_type` in the task matches the worker's subscription (case-sensitive)
- Check `loom.tasks.dead_letter` for unroutable tasks: `loom dead-letter monitor`
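The at-most-once behavior described above can be demonstrated with a minimal in-memory analogue (illustration only, not the NATS client API):

```python
# A message published with no matching subscriber is simply lost — the core
# at-most-once semantics of plain NATS pub/sub.
class Bus:
    def __init__(self):
        self.subscribers = {}   # subject -> list of callbacks
        self.delivered = []

    def subscribe(self, subject, callback):
        self.subscribers.setdefault(subject, []).append(callback)

    def publish(self, subject, msg):
        # No subscriber for the subject -> the loop body never runs: silent drop.
        for cb in self.subscribers.get(subject, []):
            cb(msg)

bus = Bus()
bus.publish("loom.tasks.summarize", {"goal": "g1"})   # dropped: no worker listening yet
bus.subscribe("loom.tasks.summarize", lambda m: bus.delivered.append(m))
bus.publish("loom.tasks.summarize", {"goal": "g2"})   # delivered
# bus.delivered == [{"goal": "g2"}] — g1 was silently lost
```

This is why start order matters: workers must be subscribed before anything publishes tasks.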
Workers¶
Worker produces empty or invalid output¶
Symptom: Worker completes but the output doesn't match `output_schema`. Downstream stages fail with validation errors.
Fix:
- Check the worker's system prompt — it must instruct the LLM to output valid JSON matching the schema
- Use the Workshop test bench to test the worker in isolation: `loom workshop --port 8080`
- Enable trace logging to see full I/O: `LOOM_TRACE=1 loom worker --config ...`
- Verify the LLM backend is responding correctly (try a direct API call)
ANTHROPIC_API_KEY not set¶
Symptom: Workers using the `standard` or `frontier` tier fail with authentication errors.
Fix:
export ANTHROPIC_API_KEY=sk-ant-...
# Or add to shell profile:
echo 'export ANTHROPIC_API_KEY=sk-ant-...' >> ~/.zshrc
OLLAMA_URL not set / Ollama not running¶
Symptom: Workers using the `local` tier fail to connect.
Fix:
# Install and start Ollama
brew install ollama # macOS
ollama serve &
# Set URL (default is http://localhost:11434)
export OLLAMA_URL=http://localhost:11434
# Pull a model
ollama pull llama3.2
Worker hangs or times out¶
Symptom: Worker never completes. Pipeline shows `PipelineTimeoutError`.
Fix:
- Check LLM backend is responsive (try a direct API call)
- Increase `timeout_seconds` in the stage config if the task is legitimately slow
- For Ollama, check if the model is still loading (`ollama ps`)
- Check if the worker is stuck in a tool-use loop (max 10 rounds by default)
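The tool-use loop guard mentioned above can be sketched as a bounded loop. Only the 10-round cap mirrors the stated default; the call and response shapes here are hypothetical:

```python
def run_tool_loop(call_llm, run_tool, max_rounds=10):
    """Call the model until it stops requesting tools, or give up after max_rounds."""
    history = []
    for _ in range(max_rounds):
        reply = call_llm(history)      # hypothetical shape: {"tool": name_or_None, "content": str}
        if reply.get("tool") is None:
            return reply["content"]    # final answer, loop ends normally
        history.append(run_tool(reply["tool"]))
    raise TimeoutError(f"tool-use loop exceeded {max_rounds} rounds")
```

A model that requests a tool on every round hits the cap and raises instead of hanging the worker indefinitely.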
Pipelines¶
PipelineMappingError: key not found in context¶
Symptom: `Stage 'X' mapping error: Path 'Y.output.Z': key 'Z' not found in context`
Cause: A stage's `input_mapping` references a field that the previous stage didn't produce.
Fix:
- Check the upstream stage's `output_schema` — does it include the field?
- Test the upstream worker in Workshop to see its actual output
- If the field is optional, add a `condition` to skip the downstream stage when it's missing
PipelineValidationError: input/output validation failed¶
Symptom: Stage fails before or after execution with schema validation errors.
Fix:
- Check `input_schema`/`output_schema` in the stage config
- Use the Workshop test bench to verify the worker's actual output format
- Common issue: the schema says `"type": "integer"` but the worker outputs a string number
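The integer-vs-string mismatch can be caught and coerced before validation. `check_integer_field` below is a hypothetical helper to show the failure mode, not Loom's validator:

```python
def check_integer_field(output, field):
    """Reject non-integer values the way a JSON Schema "integer" type would."""
    value = output.get(field)
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"{field!r}: expected integer, got {type(value).__name__}: {value!r}")
    return value

worker_output = {"count": "3"}     # the LLM emitted a string number
try:
    check_integer_field(worker_output, "count")
except TypeError:
    worker_output["count"] = int(worker_output["count"])   # coerce, then re-validate
assert check_integer_field(worker_output, "count") == 3
```

Tightening the worker's system prompt (e.g. "output numbers as bare integers, not strings") usually fixes this at the source.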
Circular dependency detected¶
Symptom: Pipeline fails to start with `ValueError: Circular dependency detected among stages`.
Fix:
- Check `input_mapping` paths — stage A referencing stage B and B referencing A creates a cycle
- Use `depends_on` to override automatic dependency inference if needed
- Visualize the dependency graph in Workshop's pipeline editor
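The check itself amounts to depth-first search over the inferred stage dependency graph. A minimal sketch (not Loom's code):

```python
def find_cycle(deps):
    """deps maps stage -> list of stages it reads from. Returns True if a cycle exists."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on current DFS path / done
    color = {s: WHITE for s in deps}

    def visit(s):
        color[s] = GRAY
        for d in deps.get(s, []):
            # Revisiting a GRAY node means we walked back into our own path: a cycle.
            if color.get(d) == GRAY or (color.get(d, BLACK) == WHITE and visit(d)):
                return True
        color[s] = BLACK
        return False

    return any(visit(s) for s in deps if color[s] == WHITE)

assert find_cycle({"a": ["b"], "b": ["a"]})                 # a <-> b is circular
assert not find_cycle({"a": [], "b": ["a"], "c": ["a", "b"]})
```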
Router¶
Tasks going to dead letter¶
Symptom: Tasks appear in `loom.tasks.dead_letter` instead of reaching workers.
Cause: Router can't find a matching route for the `worker_type` + `model_tier` combination.
Fix:
- Check `configs/router_rules.yaml` for tier overrides
- Verify the `worker_type` in the task matches a running worker's config `name`
- Check rate limits — rate-limited tasks may be dead-lettered
- Monitor dead letters: `loom dead-letter monitor --nats-url nats://localhost:4222`
Workshop¶
Workshop won't start¶
Symptom: `loom workshop` fails with import errors.
Fix:
App deployment fails¶
Symptom: ZIP upload returns error during app deployment.
Fix:
- Verify the ZIP contains `manifest.yaml` at the root (not in a subdirectory)
- Check manifest fields: `name`, `version`, and `description` are required
- Ensure all config files referenced in `entry_configs` exist in the ZIP
- The ZIP must not contain symlinks or paths with `..`
- Build the ZIP using the app's `scripts/build-app.sh` for the correct structure
Docker / Kubernetes¶
Container can't reach NATS¶
Symptom: Containers fail to connect to nats://nats:4222.
Fix:
- In Docker Compose: services use the service name as the hostname (`nats`)
- Standalone Docker: use `--network host` or link containers
- In Kubernetes: verify the NATS service is in the same namespace
- Check DNS resolution: `docker exec <container> nslookup nats`
Workshop not accessible from host¶
Symptom: Workshop runs but browser can't reach it.
Fix:
- Bind to `0.0.0.0`, not `127.0.0.1`: `loom workshop --host 0.0.0.0 --port 8080`
- Docker: expose the port with `-p 8080:8080`
- Kubernetes: use a NodePort (30080) or port-forward: `kubectl port-forward svc/loom-workshop 8080:8080`
macOS Service (launchd)¶
Services not starting after install¶
Fix:
# Check service status
launchctl list | grep loom
# Check logs
cat ~/Library/Logs/loom/workshop.err
cat ~/Library/Logs/loom/router.err
# Reload services
launchctl unload ~/Library/LaunchAgents/com.loom.workshop.plist
launchctl load ~/Library/LaunchAgents/com.loom.workshop.plist
Permission denied¶
Fix:
- launchd user agents don't need sudo — run as your user
- If the `loom` binary is in a restricted path, move it or adjust the plist
Windows Service (NSSM)¶
Services not starting¶
Fix:
# Check service status
nssm status LoomWorkshop
nssm status LoomRouter
# Check logs
Get-Content "$env:LOCALAPPDATA\loom\logs\workshop.err"
# Restart
nssm restart LoomWorkshop
NSSM not found¶
Fix:
Performance¶
Pipeline is slow¶
Fix:
- Design stages with independent dependencies so they run in parallel
- Scale workers horizontally via NATS queue groups (run multiple instances)
- Set `max_concurrent_goals` in the pipeline config for concurrent goal processing
- Check token usage logs (`worker.llm_usage`) for expensive stages
High memory usage¶
Fix:
- Workers are stateless and `reset()` between tasks — check for leaked references
- DuckDB stores can grow large — monitor disk usage
- The dead-letter consumer has a bounded store (default 1000 entries) — adjust `max_size` if needed
- Valkey checkpoint store: check TTL settings for expired entries
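A bounded store like the dead-letter consumer's behaves as a fixed-size ring buffer: once full, the oldest entry is evicted and memory stays flat. Python's `collections.deque` with `maxlen` shows the behavior (sketch only, with a small cap for demonstration):

```python
from collections import deque

store = deque(maxlen=3)        # analogous to max_size; Loom's default is 1000
for task_id in ["t1", "t2", "t3", "t4"]:
    store.append(task_id)
# "t1" has been evicted; only the newest three entries remain
assert list(store) == ["t2", "t3", "t4"]
```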
Workshop Evaluation¶
LLM judge gives inconsistent scores¶
Symptom: Same test case produces different scores across eval runs.
Fix:
- Set `temperature=0.0` for the judge backend (this is the default)
- Use a more capable model for judging (the Workshop prefers the `standard` tier)
- Provide a more specific `judge_prompt` tailored to your domain
- Check `score_details.reasoning` in eval results to understand the scoring rationale
Baseline comparison shows no data¶
Symptom: Eval detail page shows no baseline comparison.
Fix:
- Promote a successful eval run as baseline first: click "Promote as Baseline" on the eval detail page
- Each worker has at most one baseline; promoting a new one replaces the previous
- Baseline comparison is only shown when viewing a run that is not the baseline itself
Dead-letter replay keeps failing¶
Symptom: Replayed tasks end up back in the dead-letter queue.
Fix:
- Check the original reason in the replay log — if "unroutable", ensure a worker for that `worker_type` + `tier` is running
- If "rate_limited", wait for the rate limiter bucket to refill before replaying
- If "malformed", the task data itself is invalid and needs to be fixed at the source
TUI Dashboard¶
TUI won't start¶
Symptom: `loom ui` fails with import errors.
Fix:
TUI shows "NATS connection failed"¶
Symptom: Dashboard starts but shows a red "disconnected" status and an error in the Events log.
Fix:
- Check that NATS is running: `nats-server --version` or `docker ps | grep nats`
- Verify the URL: `loom ui --nats-url nats://localhost:4222` (the default)
- Check for firewall rules blocking port 4222
- The TUI needs NATS running — it subscribes to `loom.>` to observe traffic
TUI shows no events¶
Symptom: Dashboard is connected (green status) but no goals, tasks, or events appear.
Fix:
- The TUI is a passive observer — it only shows traffic that occurs while it's running
- Submit a goal or run a pipeline to generate traffic
- Check that actors (router, workers, orchestrator/pipeline) are running
- The TUI subscribes to `loom.>`, which catches all Loom NATS subjects
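The `>` wildcard matches one or more trailing subject tokens, which is why `loom.>` observes every Loom subject. A hedged sketch of the matching rule (not the NATS server's implementation):

```python
def matches(pattern, subject):
    """NATS-style subject matching: '*' matches one token, '>' matches the rest."""
    p, s = pattern.split("."), subject.split(".")
    for i, tok in enumerate(p):
        if tok == ">":
            return len(s) > i          # '>' must match at least one trailing token
        if i >= len(s) or (tok != "*" and tok != s[i]):
            return False
    return len(p) == len(s)

assert matches("loom.>", "loom.tasks.summarize")
assert matches("loom.>", "loom.results.g1")
assert not matches("loom.>", "loom")   # '>' needs at least one more token
```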
Pipeline stages not appearing¶
Symptom: Goals and tasks appear but the Pipeline tab is empty.
Fix:
- Pipeline stage data comes from `_timeline` in the result output — only pipeline orchestrators produce this
- Dynamic orchestrators (`OrchestratorActor`) don't produce timeline data; use pipeline orchestrators for stage visibility
- Check that the pipeline is producing results (look in the Events tab for `loom.results.*` messages)
Distributed Tracing (OpenTelemetry)¶
Tracing not producing spans¶
Symptom: No spans appear in your tracing backend (Jaeger, Zipkin, Tempo).
Fix:
# Install OTel dependencies
uv sync --extra otel
# Verify installation
python -c "from opentelemetry import trace; print('OTel available')"
Then initialize tracing at startup:
Or set the standard OTel environment variables (e.g. `OTEL_EXPORTER_OTLP_ENDPOINT`, `OTEL_SERVICE_NAME`).
Spans not linking across actors¶
Symptom: You see separate root spans for each actor instead of a connected trace.
Fix:
- Loom propagates trace context via a `_trace_context` key in NATS messages (W3C traceparent format)
- Both the sender and receiver must have OTel installed for propagation to work
- Check that `inject_trace_context()` and `extract_trace_context()` are being called (they are in `BaseActor._process_one()` and all publish methods)
- If using custom actors, ensure you call `inject_trace_context(data)` before publishing and `extract_trace_context(data)` when receiving
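Trace propagation over a message payload can be sketched as follows. The `_trace_context` key matches the doc, but these helpers are illustrative stand-ins for `inject_trace_context()`/`extract_trace_context()`:

```python
def inject_ctx(data, trace_id, span_id):
    """Attach a W3C traceparent under the _trace_context key before publishing."""
    data["_trace_context"] = {"traceparent": f"00-{trace_id}-{span_id}-01"}
    return data

def extract_ctx(data):
    """Read the traceparent back out on the receiving side."""
    tp = data.get("_trace_context", {}).get("traceparent")
    if tp is None:
        return None                    # no context: the receiver starts a new root span
    _, trace_id, span_id, _ = tp.split("-")
    return {"trace_id": trace_id, "parent_span_id": span_id}

msg = inject_ctx({"task": "summarize"}, "0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331")
ctx = extract_ctx(msg)
# The receiver sees the sender's trace_id, so its span joins the same trace.
```

If either side skips its half of this handshake, each actor opens its own root span and the trace fragments, which is exactly the symptom described above.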
OTel not installed but code imports it¶
Symptom: Worried about import errors when OTel is not installed.
Fix:
- This is handled automatically. The `loom.tracing` module uses runtime feature detection — if the OTel SDK is not installed, all functions become no-ops. No code changes needed. See Design Invariant #9.
LLM spans missing prompt/completion text¶
Symptom: LLM call spans appear in your tracing backend but contain no prompt or completion content.
Fix:
Set the `LOOM_TRACE_CONTENT` environment variable to enable prompt and completion recording.
With this enabled, LLM call spans include two span events:
- `gen_ai.content.prompt` — the full prompt sent to the model
- `gen_ai.content.completion` — the model's response text
This is disabled by default to avoid storing sensitive data in your tracing backend.
Note: Even without `LOOM_TRACE_CONTENT`, all LLM call spans include these `gen_ai.*` attributes per the OpenTelemetry GenAI semantic conventions:
| Attribute | Description |
|---|---|
| `gen_ai.system` | LLM provider (e.g., `anthropic`, `ollama`) |
| `gen_ai.request.model` | Model requested (e.g., `claude-sonnet-4-20250514`) |
| `gen_ai.response.model` | Model that served the request |
| `gen_ai.usage.input_tokens` | Prompt token count |
| `gen_ai.usage.output_tokens` | Completion token count |
| `gen_ai.request.temperature` | Sampling temperature |
| `gen_ai.request.max_tokens` | Max output tokens requested |