Realtime Storyteller with Vision Agents & Mirage 2
In this example we will build a realtime interactive storyteller using Vision Agents and Decart Mirage 2.
The Realtime Storyteller is an interactive agent that listens to your voice, generates a story based on your input, and dynamically restyles your video feed to match the narrative. It leverages Mirage 2 for ultra-low-latency video style transfer and Vision Agents to orchestrate the AI components.
By the end of this tutorial, you will learn how to:
- Set up a Vision Agent with voice and video capabilities.
- Integrate Decart Mirage 2 for realtime video restyling.
- Use an LLM to drive the story and control the visual style.
- Run the application on Stream's low-latency video edge network.
Architecture
The application pipeline combines several powerful AI services, all running in real time:
- Vision Agents: Orchestrates and manages all AI components and handles the video and audio transport using Stream's global edge network.
- Deepgram: Transcribes user speech to text.
- OpenAI: Generates the story and decides when to change the video style.
- ElevenLabs: Synthesizes expressive speech for the storyteller.
- Decart Mirage 2: Transforms the video feed in real-time based on the current story context.
If you already have an app running with a different LLM, TTS, or STT provider, you can use it instead of the ones in this example. Vision Agents ships with over 16 out-of-the-box integrations across all major LLM and speech providers.
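Because each capability is a constructor argument on the Agent (shown in full in the walkthrough below), swapping a provider is usually a one-line change. A minimal sketch of the pattern; module names for other providers are not shown here and should be taken from the Vision Agents docs:

```python
# Each capability is built independently and passed to the Agent.
# Swap any one of these without touching the rest of the pipeline.
llm = openai.LLM(model="gpt-4o-mini")            # any LLM plugin that supports tool calling
tts = elevenlabs.TTS(voice_id="your_voice_id")   # or another supported TTS plugin
stt = deepgram.STT()                             # or another supported STT plugin
```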
Prerequisites
Before you begin, ensure you have the following:
- Python 3.10+ installed.
- API Keys for:
- OpenAI (LLM)
- Decart (Video Restyling)
- ElevenLabs (TTS)
- Deepgram (STT)
- Stream (Video/Audio Infrastructure)
Setup & Installation
1. Clone the Repository

```bash
git clone https://github.com/GetStream/Vision-Agents.git
cd Vision-Agents/plugins/decart/example
```

2. Install Dependencies

We use uv for fast dependency management.

```bash
uv sync
```

3. Configure Environment Variables

Create a .env file in the example directory:

```
OPENAI_API_KEY=your_openai_key
DECART_API_KEY=your_decart_key
ELEVENLABS_API_KEY=your_11labs_key
DEEPGRAM_API_KEY=your_deepgram_key
STREAM_API_KEY=your_stream_key
STREAM_API_SECRET=your_stream_secret
```
Code Walkthrough
The core logic lives in decart_example.py. Let's break it down.
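The snippets below rely on the imports at the top of decart_example.py. The paths here are an assumption based on the Vision Agents package layout, so verify them against the file in the repo:

```python
# Assumed imports for the snippets below -- check decart_example.py for the exact lines.
from vision_agents.core import Agent, AgentLauncher, User, cli
from vision_agents.plugins import decart, deepgram, elevenlabs, getstream, openai
```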
1. Initialize the Restyling Processor
First, we set up the Decart processor. This component handles the video transformation.
```python
processor = decart.RestylingProcessor(
    initial_prompt="A cute animated movie with vibrant colours",
    model="mirage_v2",
)
```
- initial_prompt: Sets the starting visual style.
- model: We use mirage_v2 for its speed and quality.
2. Define the Agent
Next, we create the LLM and the Agent that ties all the capabilities together. Defining the LLM as its own variable lets us register a tool on it in the next step.

```python
llm = openai.LLM(model="gpt-4o-mini")  # you can use any LLM that supports tool calling

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Story teller", id="agent"),
    instructions=(
        "You are a story teller. You will tell a short story to the user. "
        "You will use the Decart processor to change the style of the video "
        "and user's background. You can embed audio tags in your responses "
        "for added effect. "
        "Emotional tone: [EXCITED], [NERVOUS], [FRUSTRATED], [TIRED] "
        "Reactions: [GASP], [SIGH], [LAUGHS], [GULPS] "
        "Volume & energy: [WHISPERING], [SHOUTING], [QUIETLY], [LOUDLY] "
        "Pacing & rhythm: [PAUSES], [STAMMERS], [RUSHED]"
    ),
    llm=llm,
    tts=elevenlabs.TTS(voice_id="N2lVS1w4EtoT3dr4eOWO"),
    stt=deepgram.STT(),
    processors=[processor],
)
```
- instructions: Defines the agent's persona and behavior, including emotional cues.
- processors: We pass our processor here so the agent can route video frames through it.
3. Dynamic Style Switching
This is the magic part. We register a tool that allows the LLM to change the video style programmatically.
```python
@llm.register_function(
    description=(
        "This function changes the prompt of the Decart processor which in turn "
        "changes the style of the video and user's background"
    )
)
async def change_prompt(prompt: str) -> str:
    await processor.update_prompt(prompt)
    return f"Prompt changed to {prompt}"
```
When the story shifts (e.g., "suddenly, a storm approached"), the LLM calls change_prompt("dark stormy night"), and Mirage 2 instantly updates the video feed.
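The same coroutine can also be awaited directly, which is handy for previewing styles without going through the LLM. A small sketch, assuming you call it from somewhere with a running event loop:

```python
# Manually trigger a restyle while developing -- bypasses the LLM tool call.
await processor.update_prompt("a dark stormy night, heavy rain, dramatic lighting")
```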
4. Run the Agent
Finally, we launch the agent.
```python
if __name__ == "__main__":
    cli(AgentLauncher(create_agent=create_agent, join_call=join_call))
```
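cli and AgentLauncher expect a create_agent factory (and a join_call callback that connects the agent to a Stream call). Roughly, create_agent wraps steps 1–3; this is a sketch only, so check decart_example.py for the real signatures:

```python
# Sketch of the factory AgentLauncher calls. The launcher may pass launch
# options; we accept and ignore them here. join_call (not shown) joins the
# agent to the Stream call and is also defined in decart_example.py.
async def create_agent(**kwargs) -> Agent:
    processor = decart.RestylingProcessor(
        initial_prompt="A cute animated movie with vibrant colours",
        model="mirage_v2",
    )
    llm = openai.LLM(model="gpt-4o-mini")

    @llm.register_function(
        description="Changes the Decart prompt, restyling the video and background"
    )
    async def change_prompt(prompt: str) -> str:
        await processor.update_prompt(prompt)
        return f"Prompt changed to {prompt}"

    return Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Story teller", id="agent"),
        instructions="You are a story teller...",  # full persona string from step 2
        llm=llm,
        tts=elevenlabs.TTS(voice_id="N2lVS1w4EtoT3dr4eOWO"),
        stt=deepgram.STT(),
        processors=[processor],
    )
```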
Run it with:
```bash
uv run decart_example.py
```
Customization
Changing the Persona
Modify the instructions string to change how the storyteller behaves. You can add instructions for emotional tone, pacing, or specific character traits.
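For example, a noir-detective persona might look like this (the string is purely illustrative; keep the audio-tag guidance if your TTS supports it):

```python
instructions=(
    "You are a hard-boiled noir detective narrating the user's story. "
    "Keep sentences short and moody. Use the Decart processor to restyle "
    "the video to match each scene, e.g. rain-soaked streets or smoky bars. "
    "You can embed audio tags such as [WHISPERING], [SIGH] or [PAUSES]."
)
```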
Using a Different Voice
Swap the voice_id in the elevenlabs.TTS configuration to match your character, as shown below. Vision Agents also supports other TTS services that handle audio tags; see the Vision Agents documentation for the full list.
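For instance (the voice_id below is a placeholder, not a real ID):

```python
tts=elevenlabs.TTS(voice_id="your_voice_id_here")  # any voice from your ElevenLabs library
```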
Adjusting Visual Styles
Experiment with different initial_prompt values or guide the LLM to use specific visual descriptors in its change_prompt calls.
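A few starting points to try; the prompts are purely illustrative, since Mirage 2 accepts free-form text:

```python
processor = decart.RestylingProcessor(
    initial_prompt="A hand-drawn watercolour fairy tale",
    # Other ideas:
    # initial_prompt="A gritty cyberpunk city at night, neon reflections",
    # initial_prompt="A 1950s black-and-white film with soft grain",
    model="mirage_v2",
)
```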