Sidekick Real-Time AI Companion
Project overview
Sidekick is a real-time video call companion that lip-syncs prerecorded footage to live LLM conversations. The server stitches together Groq's Llama 3.3 70B for dialogue, ElevenLabs for speech synthesis, Whisper (optionally MLX-accelerated) for speech-to-text, and Decart's Lipsync service to align audio with character video. A lightweight WebRTC client (`client.html`) handles signaling and playback, so developers can meet Cleopatra, V1X3N, or their own persona with near-instant latency.
By the end of this cookbook you will:
- Stand up the backend on localhost and connect from the in-repo web client.
- Author new characters by mixing YAML prompts, ElevenLabs voices, and MP4 clips.
- Understand how the Pipecat pipeline moves frames between STT, LLM, TTS, and Lipsync processors.
- Extend the project with custom processors or deployment tweaks.
Join the Decart Discord if you get stuck or want to showcase your character builds.
Architecture tour
At a high level the backend is a Pipecat pipeline wrapped in a WebRTC signaling server:
(Diagram: browser audio flows over WebRTC into the pipeline — STT → context aggregation → LLM → TTS → video streamer → lipsync — and the lip-synced media returns to the browser over the same WebRTC transport.)
Key building blocks:
- `SidekickWebRTCServer` accepts offers, sets up `SmallWebRTCConnection`, then spins up the pipeline (sidekick.py:62). The signaling shape is sketched below.
- Pipecat `Pipeline` assembles STT → context aggregation → LLM → TTS → video streamer → lipsync → WebRTC transport (sidekick.py:135).
- Decart Lipsync service streams encoded frames to Decart's API and reads back synchronized media (processors/lipsync_service.py:20).
- Browser client sends audio via WebRTC, receives muxed video/audio, and renders it in `<video>` (client.html:200).
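To make the signaling flow concrete, here is a minimal sketch: the browser sends an SDP offer over a websocket and the server replies with an answer. The message schema (`"type"`, `"sdp"`) and the `negotiate()` helper are assumptions for illustration, not the repo's exact code; the real handler is `handle_websocket_message` in sidekick.py.

```python
# Sketch of the offer/answer signaling shape. The JSON fields and
# negotiate() are assumptions; see handle_websocket_message in sidekick.py.
import asyncio
import json

import websockets


async def negotiate(offer_sdp: str) -> str:
    # Placeholder: in Sidekick this wraps SmallWebRTCConnection, applies the
    # remote offer, gathers ICE candidates via STUN, and returns the answer.
    raise NotImplementedError


async def handle_signaling(ws):
    async for raw in ws:
        msg = json.loads(raw)
        if msg.get("type") == "offer":
            answer_sdp = await negotiate(msg["sdp"])
            await ws.send(json.dumps({"type": "answer", "sdp": answer_sdp}))


async def main():
    async with websockets.serve(handle_signaling, "0.0.0.0", 8080):
        await asyncio.Future()  # serve until cancelled

if __name__ == "__main__":
    asyncio.run(main())
```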
Local environment setup
- Clone the repo and create a virtualenv:

  ```bash
  git clone https://github.com/DecartAI/sidekick.git
  cd sidekick
  python -m venv .venv
  source .venv/bin/activate
  ```

- Install dependencies (Pipecat, aiortc, Whisper MLX, ElevenLabs SDK, etc.):

  ```bash
  pip install -r requirements.txt
  ```

- Configure secrets using the template; you need Groq, ElevenLabs, and Decart API keys (a quick sanity check follows this list):

  ```bash
  cp .env.example .env
  # edit .env with your keys
  ```

- Place your character footage in `videos/`. MP4 with a static talking head works best; keep resolution modest (720p) to avoid bandwidth spikes.
- Optional MLX acceleration: pass the `--mlx` flag to accelerate local Whisper STT inference on Apple Silicon (sidekick.py:78).
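Before launching, it is worth confirming the keys actually load. A minimal sketch, assuming the variable names `GROQ_API_KEY`, `ELEVENLABS_API_KEY`, and `DECART_API_KEY`; check `.env.example` for the real ones:

```python
# check_env.py: fail fast if any API key is missing from .env.
# Variable names here are assumptions; mirror whatever .env.example defines.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()
missing = [k for k in ("GROQ_API_KEY", "ELEVENLABS_API_KEY", "DECART_API_KEY")
           if not os.getenv(k)]
if missing:
    raise SystemExit(f"Missing keys in .env: {', '.join(missing)}")
print("All API keys loaded.")
```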
Crafting character configs
Character behavior lives in YAML at the repo root. Here's the essential structure from `cleopatra.yaml`:

```yaml
name: Cleopatra
voice_id: elevenlabs_voice_id
video_path: videos/cleopatra.mp4
greeting: "The Nile is calm today. How may I assist you?"
system_prompt: |
  You are Cleopatra VII, charismatic and witty.
  Answer in 2-3 sentences and reference Alexandria when fitting.
```
Tips:
- Match `voice_id` to a voice in your ElevenLabs studio.
- Keep `system_prompt` descriptive but concise; the same text seeds the Pipecat context aggregator (sidekick.py:100).
- Store large or alternate takes in a separate directory and symlink into `videos/` so git stays slim.
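The config above maps cleanly onto a small typed object. A minimal loading sketch, with field names matching the YAML above (the repo's actual loader may differ):

```python
# Sketch of a character config loader; not the repo's exact code.
from dataclasses import dataclass

import yaml  # pip install pyyaml


@dataclass
class Character:
    name: str
    voice_id: str
    video_path: str
    greeting: str
    system_prompt: str


def load_character(path: str) -> Character:
    with open(path) as f:
        return Character(**yaml.safe_load(f))


cleo = load_character("cleopatra.yaml")
print(cleo.name, "->", cleo.video_path)
```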
Backend pipeline walkthrough
`SidekickWebRTCServer` is the heart of the backend.
- Signaling & offer handling: `handle_websocket_message` parses incoming JSON and upgrades the WebRTC connection with STUN (sidekick.py:50).
- Pipeline creation: `create_sidekick_pipeline` wires together each service (STT, LLM, TTS, video, lipsync, transport) and captures the resulting `PipelineTask` (sidekick.py:75-157); see the sketch after this list.
- Context aggregation: Groq's LLM service provides a context-aware aggregator so Sidekick remembers the conversation via `LLMAssistantAggregatorParams` (sidekick.py:105).
- Event hooks: WebRTC events trigger greetings and cleanup. When a browser connects, we queue a `TTSSpeakFrame` using the character's greeting (sidekick.py:160-169).
- Runner lifecycle: `PipelineRunner` executes async tasks until cancellation, allowing clean teardown when the peer disconnects (sidekick.py:180-199).
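A minimal sketch of the processor order that `create_sidekick_pipeline` assembles. Service construction is elided, and import paths follow pipecat-ai's usual layout but may vary by version; treat this as an illustration rather than the repo's exact code:

```python
# Processor order per sidekick.py:135; a sketch, not the repo's exact code.
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.task import PipelineTask


def build_task(transport, stt, context_aggregator, llm, tts,
               video_streamer, lipsync):
    pipeline = Pipeline([
        transport.input(),               # mic audio arriving over WebRTC
        stt,                             # speech -> text (Whisper, optionally MLX)
        context_aggregator.user(),       # append the user turn to the context
        llm,                             # Groq Llama 3.3 70B writes the reply
        tts,                             # ElevenLabs synthesizes audio frames
        video_streamer,                  # loops character footage at 25 FPS
        lipsync,                         # Decart aligns mouth movement to audio
        transport.output(),              # muxed audio/video back to the browser
        context_aggregator.assistant(),  # record the bot's reply in the context
    ])
    return PipelineTask(pipeline)
```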
Keep pipeline modifications localized—if you add a new processor, insert it before lipsync for audio-only transformations or before the transport output for display effects.
Lipsync processor internals
`DecartLipsyncService` wraps the async websockets client to Decart's cloud endpoint.
- Connection lifecycle: `setup` connects and launches a consumer coroutine that pulls synchronized results (processors/lipsync_service.py:33-38).
- Frame routing: incoming `TTSAudioRawFrame` objects are diverted to Decart rather than forwarded downstream, ensuring only lip-synced audio/video reach the client (processors/lipsync_service.py:55-74). The routing pattern is sketched after this list.
- Interrupt handling: when Pipecat detects speech overlap, an `InterruptionFrame` triggers `interrupt_audio()` so the lipsync queue flushes quietly (processors/lipsync_service.py:62-66).
- Output loop: `_consume_lipsynced_media` reads back decoded frames, pushes voiced audio as new TTS frames, sends fallback silent frames to maintain cadence, and forwards RGB24 stills to the transport (processors/lipsync_service.py:86-114).
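A minimal sketch of the frame-routing and interrupt pattern described above. The frame names follow this doc (your Pipecat version may spell the interruption frame differently), and the Decart client object is a hypothetical stand-in; see processors/lipsync_service.py for the real implementation:

```python
# Frame-routing sketch, not the repo's code. InterruptionFrame is named per
# this doc; some Pipecat versions use StartInterruptionFrame instead.
from pipecat.frames.frames import Frame, InterruptionFrame, TTSAudioRawFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class LipsyncRouter(FrameProcessor):
    def __init__(self, decart_client):
        super().__init__()
        self._client = decart_client  # hypothetical async Decart websocket client

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TTSAudioRawFrame):
            # Divert raw TTS audio to Decart instead of pushing it downstream,
            # so only lip-synced media ever reaches the client.
            await self._client.send_audio(frame.audio)
        elif isinstance(frame, InterruptionFrame):
            # User talked over the bot: flush queued lipsync audio quietly.
            await self._client.interrupt_audio()
            await self.push_frame(frame, direction)
        else:
            await self.push_frame(frame, direction)
```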
Need latency tuning? Pass `sync_latency` to the constructor and expose it via the CLI to compensate for slower networks, as in the sketch below.
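One way to wire that up; the flag name, default, and the bare constructor call are assumptions (other constructor arguments are omitted here):

```python
# Expose the lipsync buffer on the CLI. Flag name and default are assumptions.
import argparse

from processors.lipsync_service import DecartLipsyncService

parser = argparse.ArgumentParser()
parser.add_argument("--sync-latency", type=float, default=0.3,
                    help="seconds of buffering before emitting lip-synced media")
args = parser.parse_args()

# Other required constructor args (API key, endpoint) omitted for brevity.
lipsync = DecartLipsyncService(sync_latency=args.sync_latency)
```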
Streaming video to the browser
Two pieces collaborate to deliver visuals:
- `VideoFileStreamer` preloads JPEG-encoded frames from disk, loops them at 25 FPS, and injects them into the pipeline (processors/video_file_streamer.py:20-113). Any video with a different frame rate will be throttled, so re-encode clips to 25 FPS to stay in sync with Decart's expectations. The pacing pattern is sketched after this list.
- Client WebRTC app establishes the data channel, publishes microphone audio, and renders the remote stream. The `connect()` routine in client.html:200 adds a recvonly video transceiver so the server can push lip-synced frames without negotiating additional codecs. Users toggle mute locally, which simply disables the outgoing track (client.html:286).
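The heart of the streamer is a fixed-rate loop. A simplified sketch, assuming OpenCV for decoding (the repo preloads JPEG-encoded frames instead):

```python
# Fixed-rate looping frame source in the spirit of VideoFileStreamer.
# cv2 decoding is an assumption; the real class preloads JPEG frames.
import asyncio
import time

import cv2


async def stream_frames(path: str, fps: int = 25):
    cap = cv2.VideoCapture(path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # RGB24 for transport
        ok, frame = cap.read()
    cap.release()

    interval = 1.0 / fps
    deadline = time.monotonic()
    i = 0
    while True:
        yield frames[i % len(frames)]  # loop the clip forever
        i += 1
        deadline += interval
        await asyncio.sleep(max(0.0, deadline - time.monotonic()))
```

Pacing against a monotonic deadline, rather than sleeping a fixed interval each iteration, keeps long-run drift near zero, which matters when Decart expects a steady 25 FPS.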
If you're embedding in another app, reuse the same offer/answer flow and add a canvas overlay if you want subtitles or UI chrome.
Run and test the experience
- Start the backend:

  ```bash
  python sidekick.py --character cleopatra.yaml --host 0.0.0.0 --port 8080 --mlx
  ```

- Serve the client over localhost (a secure origin is required for microphone access):

  ```bash
  python -m http.server 5173
  # or: npx http-server -p 5173
  ```

  Then open http://localhost:5173/client.html and approve the mic prompt.

- Click Connect. You should hear the character's greeting within two seconds and see the video loop animate with synchronized speech.
- Speak into your mic. Watch the console logs for VAD trigger confirmations and ensure the bot interrupts gracefully when you talk over it (processors/lipsync_service.py:63).
- Disconnect and verify cleanup: `Pipeline runner started` followed by `Cleanup completed` in the terminal confirms all tasks cancelled cleanly.
For more repeatable QA, capture WebRTC internals via chrome://webrtc-internals and export stats when debugging jitter or bitrate issues.
Extending Sidekick
- New processors: Insert emotion analysis, sentiment logging, or telemetry between STT and LLM to steer responses dynamically (a logging sketch follows this list). Consider pipelining Moondream to equip the character with vision understanding.
- Multiple characters: Launch separate processes with different YAML configs, or add a lobby server that assigns characters per room.
- Deployment: Containerize with GPU-enabled Whisper builds, then terminate TLS at a reverse proxy that proxies both HTTPS (for static client) and WSS (for signaling). Add TURN servers for NAT-challenged clients.
- Media swapping: Replace the static loop with a live-driven avatar using MediaPipe or blend shapes; feed those frames into the same lip-sync channel as long as you respect 25 FPS.
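For the first bullet, a minimal sketch of a logging processor that could sit between STT and the LLM. `TranscriptionFrame` is Pipecat's STT output frame; the `print` is a stand-in for whatever telemetry sink you use:

```python
# Telemetry sketch: logs each user utterance, then passes every frame along
# unchanged so the rest of the pipeline is unaffected.
from pipecat.frames.frames import Frame, TranscriptionFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class TranscriptLogger(FrameProcessor):
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TranscriptionFrame):
            print(f"[telemetry] user said: {frame.text!r}")  # swap for a real sink
        await self.push_frame(frame, direction)
```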
Troubleshooting checklist
- No greeting audio: confirm the ElevenLabs API key and voice ID; verbose logs in sidekick.py:87 will reveal HTTP failures.
- Video freezes after a minute: ensure your MP4 loops cleanly and that the streamer loads frames (processors/video_file_streamer.py:57).
- High latency: drop `optimize_streaming_latency` to a lower tier or shorten the `sync_latency` buffer. Also inspect local CPU usage when running Whisper without MLX acceleration.
- WebRTC negotiation fails: verify the STUN server is reachable and that firewalls allow UDP 3478 and ephemeral ranges. Use a TURN service when deploying beyond localhost.
You now have a complete mental model of how Sidekick stitches LLM dialogue, speech synthesis, and lip-syncing into a convincing AI on video calls. Customize the prompts, swap the footage, and share your characters with the community!