Skip to content

Troubleshooting

Common issues and solutions when running Loom.


Setup & Configuration

loom setup can't detect Ollama

Symptom: Setup wizard reports "Ollama not detected" even though Ollama is running.

Fix:

  • Check Ollama is running: curl http://localhost:11434/api/tags
  • If Ollama is on a different port or host, enter the URL when prompted
  • Check firewall isn't blocking port 11434
  • If using Docker Ollama: docker run -p 11434:11434 ollama/ollama

loom setup Anthropic key validation fails

Symptom: Setup reports "Key validation failed" after entering an API key.

Fix:

  • Double-check the key starts with sk-ant-
  • Verify network connectivity to api.anthropic.com
  • The key is saved anyway — validation is best-effort
  • Test manually: curl -H "x-api-key: sk-ant-..." https://api.anthropic.com/v1/models

Config file not picked up

Symptom: Settings from ~/.loom/config.yaml don't take effect.

Fix:

  • Check the file exists: cat ~/.loom/config.yaml
  • Env vars override config file values — check for conflicting OLLAMA_URL, ANTHROPIC_API_KEY
  • Priority: CLI flags > env vars > config.yaml > defaults
  • See Configuration for the full priority chain

RAG Pipeline

loom rag ingest fails with "No valid exports found"

Symptom: Ingest exits immediately without processing any files.

Fix:

  • Verify files exist: ls /path/to/exports/result*.json
  • Telegram exports must be JSON format (not HTML)
  • Use Telegram Desktop → Export Chat → JSON format
  • File paths are passed as arguments: loom rag ingest file1.json file2.json

Embedding fails during loom rag ingest

Symptom: Ingest hangs or errors at the "Storing chunks" step.

Fix:

  • Check Ollama is running: curl http://localhost:11434/api/tags
  • Check embedding model is installed: ollama list | grep nomic-embed-text
  • Pull the model: ollama pull nomic-embed-text
  • Use --no-embed to skip embeddings: loom rag ingest --no-embed files...
  • Check Ollama URL in config: cat ~/.loom/config.yaml | grep ollama_url

loom rag search returns no results

Symptom: Search returns "No results found" even after ingesting data.

Fix:

  • Check the store has data: loom rag stats
  • If you ingested with --no-embed, search won't work (embeddings required)
  • Re-ingest with embeddings: loom rag ingest --embed files...
  • Lower the score threshold: loom rag search "query" --min-score 0.0
  • Check you're using the same store path: loom rag --db-path /path/to/store.duckdb search "query"

LanceDB import errors

Symptom: ImportError: No module named 'lancedb' when using --store lancedb.

Fix:

uv sync --extra lancedb

NATS Connection

Cannot connect to NATS

Symptom: Actor exits immediately with bus.connected never appearing in logs, or error Could not connect to server.

Fix:

# Check if NATS is running
nats-server --version  # Should print version
curl -s http://localhost:8222/varz | head -5  # NATS monitoring endpoint

# Start NATS via Docker (quickest)
docker run -d --name nats -p 4222:4222 -p 8222:8222 nats:latest

# Or via Homebrew (macOS)
brew install nats-server
nats-server &

# Or via Docker Compose (full stack)
docker compose up -d

NATS connection drops intermittently

Symptom: Log shows bus.disconnected followed by bus.reconnected (or actor crash after 60s of retries).

Fix:

  • Check NATS server resource usage (nats-server memory, disk, connections)
  • Increase NATS max payload if sending large messages: nats-server --max_payload 4MB
  • If behind a load balancer, ensure idle timeout exceeds NATS ping interval (default 2 min)
  • Check network stability between client and NATS server

Messages silently dropped

Symptom: Tasks published but no worker picks them up. No errors in logs.

Cause: NATS uses at-most-once delivery. If no subscriber is listening when a message is published, it is silently dropped.

Fix:

  • Ensure workers are running before publishing tasks
  • Start actors in the right order: workers → router → orchestrator/pipeline
  • Check that worker_type in the task matches the worker's subscription (case-sensitive)
  • Check loom.tasks.dead_letter for unroutable tasks: loom dead-letter monitor

Workers

Worker produces empty or invalid output

Symptom: Worker completes but output doesn't match output_schema. Downstream stages fail with validation errors.

Fix:

  • Check the worker's system prompt — it must instruct the LLM to output valid JSON matching the schema
  • Use the Workshop test bench to test the worker in isolation: loom workshop --port 8080
  • Enable trace logging to see full I/O: LOOM_TRACE=1 loom worker --config ...
  • Verify the LLM backend is responding correctly (try a direct API call)

ANTHROPIC_API_KEY not set

Symptom: Workers using standard or frontier tier fail with authentication errors.

Fix:

export ANTHROPIC_API_KEY=sk-ant-...
# Or add to shell profile:
echo 'export ANTHROPIC_API_KEY=sk-ant-...' >> ~/.zshrc

OLLAMA_URL not set / Ollama not running

Symptom: Workers using local tier fail to connect.

Fix:

# Install and start Ollama
brew install ollama  # macOS
ollama serve &

# Set URL (default is http://localhost:11434)
export OLLAMA_URL=http://localhost:11434

# Pull a model
ollama pull llama3.2

Worker hangs or times out

Symptom: Worker never completes. Pipeline shows PipelineTimeoutError.

Fix:

  • Check LLM backend is responsive (try a direct API call)
  • Increase timeout_seconds in the stage config if the task is legitimately slow
  • For Ollama, check if the model is still loading (ollama ps)
  • Check if the worker is stuck in a tool-use loop (max 10 rounds by default)

Pipelines

PipelineMappingError: key not found in context

Symptom: Stage 'X' mapping error: Path 'Y.output.Z': key 'Z' not found in context

Cause: A stage's input_mapping references a field that the previous stage didn't produce.

Fix:

  • Check the upstream stage's output_schema — does it include the field?
  • Test the upstream worker in Workshop to see its actual output
  • If the field is optional, add a condition to skip the downstream stage when it's missing

PipelineValidationError: input/output validation failed

Symptom: Stage fails before or after execution with schema validation errors.

Fix:

  • Check input_schema / output_schema in the stage config
  • Use Workshop test bench to verify the worker's actual output format
  • Common issue: schema says "type": "integer" but worker outputs a string number

Circular dependency detected

Symptom: Pipeline fails to start with ValueError: Circular dependency detected among stages.

Fix:

  • Check input_mapping paths — stage A referencing stage B and B referencing A creates a cycle
  • Use depends_on to override automatic dependency inference if needed
  • Visualize the dependency graph in Workshop's pipeline editor

Router

Tasks going to dead letter

Symptom: Tasks appear in loom.tasks.dead_letter instead of reaching workers.

Cause: Router can't find a matching route for the worker_type + model_tier combination.

Fix:

  • Check configs/router_rules.yaml for tier overrides
  • Verify the worker_type in the task matches a running worker's config name
  • Check rate limits — rate-limited tasks may be dead-lettered
  • Monitor dead letters: loom dead-letter monitor --nats-url nats://localhost:4222

Workshop

Workshop won't start

Symptom: loom workshop fails with import errors.

Fix:

# Install workshop dependencies
uv sync --extra workshop

# Or all extras
uv sync --all-extras

App deployment fails

Symptom: ZIP upload returns error during app deployment.

Fix:

  • Verify ZIP contains manifest.yaml at the root (not in a subdirectory)
  • Check manifest fields: name, version, description are required
  • Ensure all config files referenced in entry_configs exist in the ZIP
  • ZIP must not contain symlinks or paths with ..
  • Build the ZIP using the app's scripts/build-app.sh for correct structure

Docker / Kubernetes

Container can't reach NATS

Symptom: Containers fail to connect to nats://nats:4222.

Fix:

  • In Docker Compose: services use the service name as hostname (nats)
  • Standalone Docker: use --network host or link containers
  • In Kubernetes: verify the NATS service is in the same namespace
  • Check: docker exec <container> nslookup nats

Workshop not accessible from host

Symptom: Workshop runs but browser can't reach it.

Fix:

  • Bind to 0.0.0.0 not 127.0.0.1: loom workshop --host 0.0.0.0 --port 8080
  • Docker: expose the port: -p 8080:8080
  • Kubernetes: use NodePort (30080) or port-forward: kubectl port-forward svc/loom-workshop 8080:8080

macOS Service (launchd)

Services not starting after install

Fix:

# Check service status
launchctl list | grep loom

# Check logs
cat ~/Library/Logs/loom/workshop.err
cat ~/Library/Logs/loom/router.err

# Reload services
launchctl unload ~/Library/LaunchAgents/com.loom.workshop.plist
launchctl load ~/Library/LaunchAgents/com.loom.workshop.plist

Permission denied

Fix:

  • launchd user agents don't need sudo — run as your user
  • If loom binary is in a restricted path, move it or adjust the plist

Windows Service (NSSM)

Services not starting

Fix:

# Check service status
nssm status LoomWorkshop
nssm status LoomRouter

# Check logs
Get-Content "$env:LOCALAPPDATA\loom\logs\workshop.err"

# Restart
nssm restart LoomWorkshop

NSSM not found

Fix:

# Install via Chocolatey
choco install nssm

# Or download from https://nssm.cc/download

Performance

Pipeline is slow

Fix:

  • Design stages with independent dependencies so they run in parallel
  • Scale workers horizontally via NATS queue groups (run multiple instances)
  • Set max_concurrent_goals in pipeline config for concurrent goal processing
  • Check token usage logs (worker.llm_usage) for expensive stages

High memory usage

Fix:

  • Workers are stateless and reset() between tasks — check for leaked references
  • DuckDB stores can grow large — monitor disk usage
  • Dead-letter consumer has a bounded store (default 1000 entries) — adjust max_size if needed
  • Valkey checkpoint store: check TTL settings for expired entries

Workshop Evaluation

LLM judge gives inconsistent scores

Symptom: Same test case produces different scores across eval runs.

Fix:

  • Set temperature=0.0 for the judge backend (this is the default)
  • Use a more capable model for judging (the Workshop prefers the standard tier)
  • Provide a more specific judge_prompt tailored to your domain
  • Check score_details.reasoning in eval results to understand scoring rationale

Baseline comparison shows no data

Symptom: Eval detail page shows no baseline comparison.

Fix:

  • Promote a successful eval run as baseline first: click "Promote as Baseline" on the eval detail page
  • Each worker has at most one baseline; promoting a new one replaces the previous
  • Baseline comparison is only shown when viewing a run that is not the baseline itself

Dead-letter replay keeps failing

Symptom: Replayed tasks end up back in the dead-letter queue.

Fix:

  • Check the original reason in the replay log — if "unroutable", ensure a worker for that worker_type + tier is running
  • If "rate_limited", wait for the rate limiter bucket to refill before replaying
  • If "malformed", the task data itself is invalid and needs to be fixed at the source

TUI Dashboard

TUI won't start

Symptom: loom ui fails with import errors.

Fix:

# Install TUI dependencies
uv sync --extra tui

# Or all extras
uv sync --all-extras

TUI shows "NATS connection failed"

Symptom: Dashboard starts but shows a red "disconnected" status and an error in the Events log.

Fix:

  • Check that NATS is running: nats-server --version or docker ps | grep nats
  • Verify the URL: loom ui --nats-url nats://localhost:4222 (default)
  • Check for firewall rules blocking port 4222
  • The TUI needs NATS running — it subscribes to loom.> to observe traffic

TUI shows no events

Symptom: Dashboard is connected (green status) but no goals, tasks, or events appear.

Fix:

  • The TUI is a passive observer — it only shows traffic that occurs while it's running
  • Submit a goal or run a pipeline to generate traffic
  • Check that actors (router, workers, orchestrator/pipeline) are running
  • The TUI subscribes to loom.> which catches all Loom NATS subjects

Pipeline stages not appearing

Symptom: Goals and tasks appear but the Pipeline tab is empty.

Fix:

  • Pipeline stage data comes from _timeline in result output — only pipeline orchestrators produce this
  • Dynamic orchestrators (OrchestratorActor) don't produce timeline data; use pipeline orchestrators for stage visibility
  • Check that the pipeline is producing results (look in the Events tab for loom.results.* messages)

Distributed Tracing (OpenTelemetry)

Tracing not producing spans

Symptom: No spans appear in your tracing backend (Jaeger, Zipkin, Tempo).

Fix:

# Install OTel dependencies
uv sync --extra otel

# Verify installation
python -c "from opentelemetry import trace; print('OTel available')"

Then initialize tracing at startup:

from loom.tracing import init_tracing
init_tracing(service_name="loom")

Or set the standard OTel environment variables:

export OTEL_SERVICE_NAME=loom
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

Spans not linking across actors

Symptom: You see separate root spans for each actor instead of a connected trace.

Fix:

  • Loom propagates trace context via a _trace_context key in NATS messages (W3C traceparent format)
  • Both the sender and receiver must have OTel installed for propagation to work
  • Check that inject_trace_context() and extract_trace_context() are being called (they are in BaseActor._process_one() and all publish methods)
  • If using custom actors, ensure you call inject_trace_context(data) before publishing and extract_trace_context(data) when receiving

OTel not installed but code imports it

Symptom: Worried about import errors when OTel is not installed.

Fix:

  • This is handled automatically. The loom.tracing module uses runtime feature detection — if OTel SDK is not installed, all functions become no-ops. No code changes needed. See Design Invariant #9.

LLM spans missing prompt/completion text

Symptom: LLM call spans appear in your tracing backend but contain no prompt or completion content.

Fix:

Set the LOOM_TRACE_CONTENT environment variable to enable prompt and completion recording:

export LOOM_TRACE_CONTENT=1

With this enabled, LLM call spans include two span events:

  • gen_ai.content.prompt — the full prompt sent to the model
  • gen_ai.content.completion — the model's response text

This is disabled by default to avoid storing sensitive data in your tracing backend.

Note: Even without LOOM_TRACE_CONTENT, all LLM call spans include these gen_ai.* attributes per the OpenTelemetry GenAI semantic conventions:

Attribute Description
gen_ai.system LLM provider (e.g., anthropic, ollama)
gen_ai.request.model Model requested (e.g., claude-sonnet-4-20250514)
gen_ai.response.model Model that served the request
gen_ai.usage.input_tokens Prompt token count
gen_ai.usage.output_tokens Completion token count
gen_ai.request.temperature Sampling temperature
gen_ai.request.max_tokens Max output tokens requested