Decart Cookbook
October 17, 2025

Sidekick Real-Time AI Companion

Authors
Roeh from Decart
Code on GitHub

Project overview

Sidekick is a real-time video call companion that lip-syncs prerecorded footage to live LLM conversations. The server stitches together Groq's Llama 3.3 70B for dialogue, ElevenLabs for speech synthesis, Whisper (optionally MLX-accelerated) for speech-to-text, and Decart's Lipsync service to align audio with character video. A lightweight WebRTC client (client.html) handles signaling and playback, so developers can meet Cleopatra, V1X3N, or their own persona with minimal latency.

By the end of this cookbook you will:

  • Stand up the backend on localhost and connect from the in-repo web client.
  • Author new characters by mixing YAML prompts, ElevenLabs voices, and MP4 clips.
  • Understand how the Pipecat pipeline moves frames between STT, LLM, TTS, and Lipsync processors.
  • Extend the project with custom processors or deployment tweaks.

Join the Decart Discord if you get stuck or want to showcase your character builds.

Architecture tour

At a high level the backend is a Pipecat pipeline wrapped in a WebRTC signaling server:

[Architecture diagram: browser client ⇄ SidekickWebRTCServer (signaling) ⇄ Pipecat pipeline: STT → LLM → TTS → Lipsync → WebRTC transport]

Key building blocks:

  • SidekickWebRTCServer accepts offers, sets up SmallWebRTCConnection, then spins up the pipeline (sidekick.py:62).
  • Pipecat Pipeline assembles STT → context aggregation → LLM → TTS → video streamer → lipsync → WebRTC transport (sidekick.py:135); a condensed sketch follows this list.
  • Decart Lipsync service streams encoded frames to Decart's API and reads back synchronized media (processors/lipsync_service.py:20).
  • Browser client sends audio via WebRTC, receives muxed video/audio, and renders in <video> (client.html:200).
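To make that flow concrete, here is a condensed sketch of how a pipeline with this shape is assembled. It mirrors the processor order above but elides constructor configuration, so treat it as illustrative rather than the repo's exact code:

# Condensed pipeline assembly (illustrative; see sidekick.py:135 for the real wiring)
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.task import PipelineTask

def build_pipeline(transport, stt, context_aggregator, llm, tts, video_streamer, lipsync):
    pipeline = Pipeline([
        transport.input(),               # mic audio arriving from the browser
        stt,                             # Whisper speech-to-text
        context_aggregator.user(),       # append the user's turn to the conversation
        llm,                             # Groq Llama 3.3 70B generates the reply
        tts,                             # ElevenLabs synthesizes speech
        video_streamer,                  # looping character footage
        lipsync,                         # Decart aligns mouth movement to the audio
        transport.output(),              # muxed audio/video back over WebRTC
        context_aggregator.assistant(),  # record the assistant's turn
    ])
    return PipelineTask(pipeline)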

Local environment setup

  1. Clone and create a virtualenv:

git clone https://github.com/DecartAI/sidekick.git
cd sidekick
python -m venv .venv
source .venv/bin/activate

  2. Install dependencies (Pipecat, aiortc, Whisper MLX, ElevenLabs SDK, etc.):

pip install -r requirements.txt

  3. Configure secrets using the template; you need Groq, ElevenLabs, and Decart API keys (a sample .env follows this list):

cp .env.example .env  # edit .env with your keys

  4. Place your character footage in videos/. MP4 with a static talking head works best; keep resolution modest (720p) to avoid bandwidth spikes.

  5. Optional MLX acceleration: use the --mlx flag to accelerate local WhisperSTT inference on Apple Silicon (sidekick.py:78).
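The finished .env looks roughly like this; the key names here are illustrative, so confirm the exact variable names against .env.example:

# .env (key names illustrative; check .env.example for the exact ones)
GROQ_API_KEY=gsk_...
ELEVENLABS_API_KEY=...
DECART_API_KEY=...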

Crafting character configs

Character behavior lives in YAML at the repo root. Here's the essential structure from cleopatra.yaml:

name: Cleopatra
voice_id: elevenlabs_voice_id
video_path: videos/cleopatra.mp4
greeting: "The Nile is calm today. How may I assist you?"
system_prompt: |
  You are Cleopatra VII, charismatic and witty. Answer in 2-3 sentences
  and reference Alexandria when fitting.

Tips:

  • Match voice_id to a voice in your ElevenLabs studio.
  • Keep system_prompt descriptive but concise; the same text seeds the Pipecat context aggregator (sidekick.py:100).
  • Store large or alternate takes in a separate directory and symlink into videos/ so git stays slim.
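Loading a config like this takes only a few lines. The sketch below is illustrative (the repo's actual loader may differ) but shows how the YAML fields map onto a runtime character:

# Illustrative character-config loader; the repo's own loader may differ.
from dataclasses import dataclass
import yaml

@dataclass
class Character:
    name: str
    voice_id: str
    video_path: str
    greeting: str
    system_prompt: str

def load_character(path: str) -> Character:
    with open(path) as f:
        return Character(**yaml.safe_load(f))  # raises TypeError if a field is missing

cleo = load_character("cleopatra.yaml")
print(cleo.greeting)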

Backend pipeline walkthrough

SidekickWebRTCServer is the heart of the backend.

  1. Signaling & offer handling: handle_websocket_message parses incoming JSON, upgrades the WebRTC connection with STUN (sidekick.py:50).
  2. Pipeline creation: create_sidekick_pipeline wires together each service (STT, LLM, TTS, video, lipsync, transport) and captures the resulting PipelineTask (sidekick.py:75-157).
  3. Context aggregation: Groq's LLM service provides a context-aware aggregator so Sidekick remembers the conversation via LLMAssistantAggregatorParams (sidekick.py:105).
  4. Event hooks: WebRTC events trigger greetings and cleanup. When a browser connects, we queue a TTSSpeakFrame using the character's greeting (sidekick.py:160-169).
  5. Runner lifecycle: PipelineRunner executes async tasks until cancellation, allowing clean teardown when the peer disconnects (sidekick.py:180-199).

Keep pipeline modifications localized: if you add a new processor, insert it before lipsync for audio-only transformations or before the transport output for display effects. A minimal pass-through example follows.
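If you do write one, Pipecat's FrameProcessor is the extension point. This minimal pass-through (a sketch, not code from the repo) logs every frame type and forwards it unchanged, so you can drop it anywhere in the chain to observe traffic:

# Minimal pass-through processor sketch; adapt process_frame for real transformations.
from pipecat.frames.frames import Frame
from pipecat.processors.frame_processor import FrameProcessor, FrameDirection

class FrameLogger(FrameProcessor):
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)  # base class handles lifecycle frames
        print(f"{direction.name}: {type(frame).__name__}")
        await self.push_frame(frame, direction)        # forward unchanged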

Lipsync processor internals

DecartLipsyncService wraps the async websockets client to Decart's cloud endpoint.

  • Connection lifecycle: setup connects and launches a consumer coroutine that pulls synchronized results (processors/lipsync_service.py:33-38).
  • Frame routing: Incoming TTSAudioRawFrame objects are diverted to Decart rather than forwarded downstream, ensuring only lip-synced audio/video reach the client (processors/lipsync_service.py:55-74); a simplified sketch follows this list.
  • Interrupt handling: When Pipecat detects speech overlap, an InterruptionFrame triggers interrupt_audio() so the lipsync queue flushes quietly (processors/lipsync_service.py:62-66).
  • Output loop: _consume_lipsynced_media reads back decoded frames, pushes voiced audio as new TTS frames, sends fallback silent frames to maintain cadence, and forwards RGB24 stills to the transport (processors/lipsync_service.py:86-114).
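Stripped of the networking details, the routing described above reduces to a type switch. This simplified sketch uses the frame names cited in this article (check your Pipecat version for the exact classes); _send_to_decart is a hypothetical stand-in for the real websocket send:

# Simplified view of the lipsync routing; not the file's verbatim code.
from pipecat.frames.frames import Frame, TTSAudioRawFrame, InterruptionFrame
from pipecat.processors.frame_processor import FrameProcessor, FrameDirection

class LipsyncRoutingSketch(FrameProcessor):
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TTSAudioRawFrame):
            await self._send_to_decart(frame)        # divert TTS audio; never forwarded raw
        elif isinstance(frame, InterruptionFrame):
            await self.interrupt_audio()             # flush queued lipsync work on overlap
            await self.push_frame(frame, direction)
        else:
            await self.push_frame(frame, direction)  # all other frames pass through

    async def _send_to_decart(self, frame):  # hypothetical stand-in for the websocket send
        ...

    async def interrupt_audio(self):         # named per processors/lipsync_service.py
        ...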

Need latency tuning? Pass sync_latency to the constructor and expose it via CLI to compensate for slower networks.
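For example, a hypothetical --sync-latency flag wired through argparse (other DecartLipsyncService constructor arguments elided; the repo may expose tuning differently):

# Hypothetical CLI flag for latency tuning.
import argparse
from processors.lipsync_service import DecartLipsyncService

parser = argparse.ArgumentParser()
parser.add_argument("--sync-latency", type=float, default=0.0,
                    help="extra buffering (seconds) to absorb network jitter")
args = parser.parse_args()

lipsync = DecartLipsyncService(sync_latency=args.sync_latency)  # other required args elided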

Streaming video to the browser

Two pieces collaborate to deliver visuals:

  • VideoFileStreamer preloads JPEG-encoded frames from disk, loops them at 25 FPS, and injects them into the pipeline (processors/video_file_streamer.py:20-113). Any video with a different frame rate will be throttled, so re-encode clips to 25 FPS (one recipe follows this list) to stay in sync with Decart's expectations.
  • Client WebRTC app establishes the data channel, publishes microphone audio, and renders the remote stream. The connect() routine in client.html:200 adds a recvonly video transceiver so the server can push lip-synced frames without negotiating additional codecs. Users toggle mute locally, which simply disables the outgoing track (client.html:286).
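If your source clip runs at a different frame rate, one way to re-encode it to 25 FPS (assuming ffmpeg is installed; file names here are illustrative):

ffmpeg -i raw_take.mp4 -r 25 videos/cleopatra.mp4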

If you're embedding in another app, reuse the same offer/answer flow and add a canvas overlay if you want subtitles or UI chrome.

Run and test the experience

  1. Start the backend:

python sidekick.py --character cleopatra.yaml --host 0.0.0.0 --port 8080 --mlx

  2. Serve the client over localhost (a secure origin is required for microphone access):

python -m http.server 5173  # or: npx http-server -p 5173

Then open http://localhost:5173/client.html and approve the mic prompt.

  3. Click Connect. You should hear the character's greeting within two seconds and see the video loop animate with synchronized speech.

  4. Speak into your mic. Watch the console logs for VAD trigger confirmations and ensure the bot interrupts gracefully when you talk over it (processors/lipsync_service.py:63).

  5. Disconnect and verify cleanup: seeing Pipeline runner started followed by Cleanup completed in the terminal confirms all tasks cancelled cleanly.

For more repeatable QA, capture WebRTC internals via chrome://webrtc-internals and export stats when debugging jitter or bitrate issues.

Extending Sidekick

  • New processors: Insert emotion analysis, sentiment logging, or telemetry between STT and LLM to steer responses dynamically. Consider pipelining Moondream to equip the character with vision understanding.
  • Multiple characters: Launch separate processes with different YAML configs, or add a lobby server that assigns characters per room.
  • Deployment: Containerize with GPU-enabled Whisper builds, then terminate TLS at a reverse proxy that proxies both HTTPS (for static client) and WSS (for signaling). Add TURN servers for NAT-challenged clients.
  • Media swapping: Replace the static loop with a live-driven avatar using MediaPipe or blend shapes; feed those frames into the same lip-sync channel as long as you respect 25 FPS.

Troubleshooting checklist

  • No greeting audio: Confirm ElevenLabs API key and voice ID; verbose logs in sidekick.py:87 will reveal HTTP failures.
  • Video freezes after a minute: Ensure your MP4 loops cleanly and that the streamer loads frames (processors/video_file_streamer.py:57).
  • High latency: Drop optimize_streaming_latency to a lower tier or shorten the sync_latency buffer. Also inspect local CPU usage when running Whisper without MLX acceleration.
  • WebRTC negotiation fails: Verify the STUN server is reachable and that firewalls allow UDP 3478 and ephemeral ranges. Use a TURN service when deploying beyond localhost.

You now have a complete mental model of how Sidekick stitches LLM dialogue, speech synthesis, and lip-syncing into a convincing AI on video calls. Customize the prompts, swap the footage, and share your characters with the community!
