Realtime Storyteller with Vision Agents & Mirage 2
In this example we will build a realtime interactive storyteller using Vision Agents and Decart Mirage 2.
The Realtime Storyteller is an interactive agent that listens to your voice, generates a story based on your input, and dynamically restyles your video feed to match the narrative. It leverages Mirage 2 for ultra-low-latency video style transfer and Vision Agents to orchestrate the AI components.
By the end of this tutorial, you will learn how to:
- Set up a Vision Agent with voice and video capabilities.
- Integrate Decart Mirage 2 for realtime video restyling.
- Use an LLM to drive the story and control the visual style.
- Run the application on Stream's low-latency video edge network.
Architecture
The application pipeline combines several powerful AI services, all running in real time:
- Vision Agents: Orchestrates and manages all AI components and handles the video and audio transport using Stream's global edge network.
- Deepgram: Transcribes user speech to text.
- OpenAI: Generates the story and decides when to change the video style.
- ElevenLabs: Synthesizes expressive speech for the storyteller.
- Decart Mirage 2: Transforms the video feed in real-time based on the current story context.
If you already have an app running with a different LLM, TTS, or STT provider, you can use it instead of the ones in this example. Vision Agents ships with over 16 out-of-the-box integrations across all major LLM and speech providers.
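Because each capability is a constructor argument on the Agent (shown in full in the walkthrough below), swapping a provider is usually a one-line change. A minimal sketch of the pattern; module names for other providers are not shown here and should be taken from the Vision Agents docs:

```python
# Each capability is built independently and passed to the Agent.
# Swap any one of these without touching the rest of the pipeline.
llm = openai.LLM(model="gpt-4o-mini")            # any LLM plugin that supports tool calling
tts = elevenlabs.TTS(voice_id="your_voice_id")   # or another supported TTS plugin
stt = deepgram.STT()                             # or another supported STT plugin
```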
Prerequisites
Before you begin, ensure you have the following:
- Python 3.10+ installed.
- API Keys for:
- OpenAI (LLM)
- Decart (Video Restyling)
- ElevenLabs (TTS)
- Deepgram (STT)
- Stream (Video/Audio Infrastructure)
Setup & Installation
1. Clone the Repository

```bash
git clone https://github.com/GetStream/Vision-Agents.git
cd Vision-Agents/plugins/decart/example
```

2. Install Dependencies

We use uv for fast dependency management.

```bash
uv sync
```

3. Configure Environment Variables

Create a .env file in the example directory:

```
OPENAI_API_KEY=your_openai_key
DECART_API_KEY=your_decart_key
ELEVENLABS_API_KEY=your_11labs_key
DEEPGRAM_API_KEY=your_deepgram_key
STREAM_API_KEY=your_stream_key
STREAM_API_SECRET=your_stream_secret
```
Code Walkthrough
The core logic lives in decart_example.py. Let's break it down.
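The snippets below rely on the imports at the top of decart_example.py. The paths here are an assumption based on the Vision Agents package layout, so verify them against the file in the repo:

```python
# Assumed imports for the snippets below -- check decart_example.py for the exact lines.
from vision_agents.core import Agent, AgentLauncher, User, cli
from vision_agents.plugins import decart, deepgram, elevenlabs, getstream, openai
```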
1. Initialize the Restyling Processor
First, we set up the Decart processor. This component handles the video transformation.
```python
processor = decart.RestylingProcessor(
    initial_prompt="A cute animated movie with vibrant colours",
    model="mirage_v2",
)
```
- initial_prompt: Sets the starting visual style.
- model: We use mirage_v2 for its speed and quality.
2. Define the Agent
Next, we create the LLM and the Agent that ties all the capabilities together. Defining the LLM as its own variable lets us register a tool on it in the next step.

```python
llm = openai.LLM(model="gpt-4o-mini")  # you can use any LLM that supports tool calling

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Story teller", id="agent"),
    instructions=(
        "You are a story teller. You will tell a short story to the user. "
        "You will use the Decart processor to change the style of the video "
        "and user's background. You can embed audio tags in your responses "
        "for added effect. "
        "Emotional tone: [EXCITED], [NERVOUS], [FRUSTRATED], [TIRED] "
        "Reactions: [GASP], [SIGH], [LAUGHS], [GULPS] "
        "Volume & energy: [WHISPERING], [SHOUTING], [QUIETLY], [LOUDLY] "
        "Pacing & rhythm: [PAUSES], [STAMMERS], [RUSHED]"
    ),
    llm=llm,
    tts=elevenlabs.TTS(voice_id="N2lVS1w4EtoT3dr4eOWO"),
    stt=deepgram.STT(),
    processors=[processor],
)
```
- instructions: Defines the agent's persona and behavior, including emotional cues.
- processors: We pass our processor here so the agent can route video frames through it.
3. Dynamic Style Switching
This is the magic part. We register a tool that allows the LLM to change the video style programmatically.
```python
@llm.register_function(
    description=(
        "This function changes the prompt of the Decart processor which in turn "
        "changes the style of the video and user's background"
    )
)
async def change_prompt(prompt: str) -> str:
    await processor.update_prompt(prompt)
    return f"Prompt changed to {prompt}"
```
When the story shifts (e.g., "suddenly, a storm approached"), the LLM calls change_prompt("dark stormy night"), and Mirage 2 instantly updates the video feed.
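The same coroutine can also be awaited directly, which is handy for previewing styles without going through the LLM. A small sketch, assuming you call it from somewhere with a running event loop:

```python
# Manually trigger a restyle while developing -- bypasses the LLM tool call.
await processor.update_prompt("a dark stormy night, heavy rain, dramatic lighting")
```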
4. Run the Agent
Finally, we launch the agent.
```python
if __name__ == "__main__":
    cli(AgentLauncher(create_agent=create_agent, join_call=join_call))
```
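cli and AgentLauncher expect a create_agent factory (and a join_call callback that connects the agent to a Stream call). Roughly, create_agent wraps steps 1–3; this is a sketch only, so check decart_example.py for the real signatures:

```python
# Sketch of the factory AgentLauncher calls. The launcher may pass launch
# options; we accept and ignore them here. join_call (not shown) joins the
# agent to the Stream call and is also defined in decart_example.py.
async def create_agent(**kwargs) -> Agent:
    processor = decart.RestylingProcessor(
        initial_prompt="A cute animated movie with vibrant colours",
        model="mirage_v2",
    )
    llm = openai.LLM(model="gpt-4o-mini")

    @llm.register_function(
        description="Changes the Decart prompt, restyling the video and background"
    )
    async def change_prompt(prompt: str) -> str:
        await processor.update_prompt(prompt)
        return f"Prompt changed to {prompt}"

    return Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Story teller", id="agent"),
        instructions="You are a story teller...",  # full persona string from step 2
        llm=llm,
        tts=elevenlabs.TTS(voice_id="N2lVS1w4EtoT3dr4eOWO"),
        stt=deepgram.STT(),
        processors=[processor],
    )
```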
Run it with:
```bash
uv run decart_example.py
```
Customization
Changing the Persona
Modify the instructions string to change how the storyteller behaves. You can add instructions for emotional tone, pacing, or specific character traits.
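For example, a noir-detective persona might look like this (the string is purely illustrative; keep the audio-tag guidance if your TTS supports it):

```python
instructions=(
    "You are a hard-boiled noir detective narrating the user's story. "
    "Keep sentences short and moody. Use the Decart processor to restyle "
    "the video to match each scene, e.g. rain-soaked streets or smoky bars. "
    "You can embed audio tags such as [WHISPERING], [SIGH] or [PAUSES]."
)
```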
Using a Different Voice
Swap the voice_id in the elevenlabs.TTS configuration to match your character, as shown below. Vision Agents also supports other TTS services that handle audio tags; see the Vision Agents documentation for the full list.
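For instance (the voice_id below is a placeholder, not a real ID):

```python
tts=elevenlabs.TTS(voice_id="your_voice_id_here")  # any voice from your ElevenLabs library
```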
Adjusting Visual Styles
Experiment with different initial_prompt values or guide the LLM to use specific visual descriptors in its change_prompt calls.
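A few starting points to try; the prompts are purely illustrative, since Mirage 2 accepts free-form text:

```python
processor = decart.RestylingProcessor(
    initial_prompt="A hand-drawn watercolour fairy tale",
    # Other ideas:
    # initial_prompt="A gritty cyberpunk city at night, neon reflections",
    # initial_prompt="A 1950s black-and-white film with soft grain",
    model="mirage_v2",
)
```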