DecartXR - AI ❤️ XR
"This is the Oasis DecartXR"
Ever since we launched MirageLSD one theme kept echoing through the comments — VR. “Put this on Quest”, “Anyone said Ready Player One?” and “This would be amazing on a headset!” So, we decided to make it real. We grabbed a Meta Quest 3, figured out if it’s even possible to build a camera-based app in VR (yes, in that order), and got to work.
What We Built: an AI World Transformation App
The concept is deceptively simple: capture live video from Quest's passthrough cameras, stream it to our AI service, apply style transformation, and display the processed result back in VR. In practice, this meant building a pipeline that looks like this:
```
Quest Camera → Unity WebRTC → Decart AI Service → Processed Video → Quest Display
     ↑               ↑                 ↑                  ↑               ↑
Camera Access   VP8 Encoding     Style Transfer     VP8 Decoding    UI Rendering
& Permissions   @30fps/1-4Mbps   40ms latency       Real-time       Real-time
```
Let's walk through each step of this pipeline and the components that make it work:
Step 1: Quest Camera Access & Permissions
The first challenge was getting access to Quest's passthrough cameras. We discovered this requires two specific components working together:
- PassthroughCameraPermissions.cs - Handles the Android permission dance
- PassthroughCameraUtils.cs - Discovers and configures Quest cameras using Android Camera2 API
- WebCamTextureManager.cs - Creates Unity WebCamTextures that actually work with Quest hardware
We figured out that Quest cameras aren't regular webcams — they're Android Camera2 devices with Meta-specific metadata that requires special discovery logic.
Step 2: Unity WebRTC Integration
Once we have camera access, three core components handle the WebRTC connection to our AI service:
- WebRTCConnection.cs - Manages the Unity lifecycle, video streaming setup, and model selection
- WebRTCManager.cs - Contains the core WebRTC logic, signaling protocol, and dual AI prompt banks (61 Mirage + 15 Lucy)
- WebRTCController.cs - Handles user input (model selection at startup, then A/B buttons cycle through prompts) and manages video display
The WebRTC stack handles VP8 encoding at 30fps with adaptive bitrate control, WebSocket signaling for peer connection setup, and ICE candidate exchange for NAT traversal.
Step 3: Decart AI Service Integration
This is where the real transformation happens. The WebRTCConnection component selects between two AI models:
```csharp
[SerializeField] private string MirageWebSocket = "wss://api3.decart.ai/v1/stream-trial?model=mirage";
[SerializeField] private string LucyWebSocket = "wss://api3.decart.ai/v1/stream-trial?model=lucy_v2v_720p_rt";

// User selects model at startup (A button = Mirage, B button = Lucy)
string selectedEndpoint = UseLucyModel ? LucyWebSocket : MirageWebSocket;
webRTCManager.Connect(selectedEndpoint, IsVideoAudioSender, IsVideoAudioReceiver);
```
When the WebSocket opens, the system immediately sets up the WebRTC peer connection:
```csharp
ws.OnOpen += () =>
{
    IsWebSocketConnected = true;
    CreateNewPeerVideoAudioReceivingResources(sessionId);
    SetupPeerConnection();
};
```
The two models serve different purposes: Mirage transforms entire environments (61 prompts like Cyberpunk, Frozen, Lego), while Lucy transforms people (15 prompts like Spiderman, Medieval Knight, Anime Character). Both are optimized for real-time VR applications with sub-200ms latency.
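We won't reproduce the full prompt banks here, but to give a feel for the cycling flow, here's a minimal sketch of how a prompt bank plus A/B button cycling could be wired up. The class, field, and method names are illustrative, not the actual WebRTCManager/WebRTCController members, and the OVRInput calls assume the Meta XR SDK is in the project.

```csharp
// Illustrative sketch only: names are assumptions, not the real components.
using UnityEngine;

public class PromptCycler : MonoBehaviour
{
    // Two prompt banks, one per model (61 Mirage / 15 Lucy in the real app)
    [SerializeField] private string[] miragePrompts = { "Cyberpunk", "Frozen", "Lego" };
    [SerializeField] private string[] lucyPrompts = { "Spiderman", "Medieval Knight", "Anime Character" };
    [SerializeField] private bool useLucyModel;

    private int promptIndex;

    private void Update()
    {
        var prompts = useLucyModel ? lucyPrompts : miragePrompts;

        // A button steps forward, B button steps back through the active bank
        if (OVRInput.GetDown(OVRInput.Button.One))
            promptIndex = (promptIndex + 1) % prompts.Length;
        else if (OVRInput.GetDown(OVRInput.Button.Two))
            promptIndex = (promptIndex - 1 + prompts.Length) % prompts.Length;
        else
            return;

        SendPrompt(prompts[promptIndex]);
    }

    private void SendPrompt(string prompt)
    {
        // In the real project this goes through the WebRTC signaling channel
        Debug.Log($"Sending prompt: {prompt}");
    }
}
```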
Step 4: Processed Video Reception
When AI-processed video streams back, Unity WebRTC automatically creates video receiver components. Our WebRTCController discovers these dynamically created receivers and routes them to the display system.
Step 5: Quest Display & UI Rendering
The final step uses Unity's Canvas system to display both the original camera feed and AI-processed video streams. The PortalController and UIFader components create smooth VR transitions and effects.
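The UIFader itself isn't shown in this post; as a rough idea of the pattern, here's a minimal fade sketch built on a CanvasGroup. It's an illustrative stand-in, not the actual UIFader implementation.

```csharp
// Illustrative sketch, not the actual UIFader component:
// fades a canvas in or out by animating a CanvasGroup's alpha.
using System.Collections;
using UnityEngine;

[RequireComponent(typeof(CanvasGroup))]
public class CanvasFadeSketch : MonoBehaviour
{
    [SerializeField] private float fadeDuration = 0.5f;
    private CanvasGroup canvasGroup;

    private void Awake() => canvasGroup = GetComponent<CanvasGroup>();

    public Coroutine FadeTo(float targetAlpha) => StartCoroutine(Fade(targetAlpha));

    private IEnumerator Fade(float targetAlpha)
    {
        float startAlpha = canvasGroup.alpha;
        for (float t = 0f; t < fadeDuration; t += Time.deltaTime)
        {
            canvasGroup.alpha = Mathf.Lerp(startAlpha, targetAlpha, t / fadeDuration);
            yield return null;
        }
        canvasGroup.alpha = targetAlpha;
    }
}
```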
The end result? You put on the headset, select your AI model (Mirage for worlds, Lucy for people), see your live camera feed immediately, then watch as AI processing kicks in within 3-5 seconds. A/B buttons cycle through the model's prompts, or you can speak custom prompts. The whole experience runs at 1280×720 resolution, 30fps, with that satisfying sub-200ms latency.
Challenge 1: Quest Camera Access (Or: Android is Weird)
Getting access to Quest's cameras turned out to be way more involved than "just use Unity's WebCamTexture." Quest 3 runs a modified Android called Horizon OS, and we discovered that accessing the passthrough cameras requires specific permissions and some Android Camera2 API work.
First, we figured out that we need two permissions — not just one:
```csharp
public static readonly string[] CameraPermissions =
{
    "android.permission.CAMERA",           // Standard Android camera
    "horizonos.permission.HEADSET_CAMERA"  // Quest-specific (v74+)
};
```
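Requesting those permissions at runtime boils down to Unity's standard Android Permission API. Here's a minimal sketch; it's a simplified stand-in for PassthroughCameraPermissions.cs, not the actual implementation.

```csharp
// Minimal sketch: request both camera permissions at runtime.
#if UNITY_ANDROID
using UnityEngine;
using UnityEngine.Android;

public class CameraPermissionRequester : MonoBehaviour
{
    private static readonly string[] CameraPermissions =
    {
        "android.permission.CAMERA",           // Standard Android camera
        "horizonos.permission.HEADSET_CAMERA"  // Quest-specific (Horizon OS v74+)
    };

    private void Start()
    {
        foreach (var permission in CameraPermissions)
        {
            if (!Permission.HasUserAuthorizedPermission(permission))
            {
                // Ask for both in one request flow
                Permission.RequestUserPermissions(CameraPermissions);
                break;
            }
        }
    }
}
#endif
```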
Then comes camera discovery. Quest cameras aren't labeled "Left Camera" and "Right Camera" — they're just Android camera devices with Meta-specific metadata buried in their characteristics. We discovered we have to enumerate all cameras and look for special properties:
```csharp
// Look for Meta-specific camera metadata
if (string.Equals(keyName, "com.meta.extra_metadata.camera_source", StringComparison.OrdinalIgnoreCase))
{
    var cameraSourceArr = GetCameraValueByKey<sbyte[]>(cameraCharacteristics, key);
    if (cameraSourceArr?.Length == 1)
        cameraSource = (CameraSource)cameraSourceArr[0]; // 0 = Passthrough camera
}
else if (string.Equals(keyName, "com.meta.extra_metadata.position", StringComparison.OrdinalIgnoreCase))
{
    var cameraPositionArr = GetCameraValueByKey<sbyte[]>(cameraCharacteristics, key);
    if (cameraPositionArr?.Length == 1)
        cameraPosition = (CameraPosition)cameraPositionArr[0]; // 0 = Left, 1 = Right
}
```
The payoff is worth it though. Once working, we get direct access to Quest's high-resolution passthrough cameras with full Unity integration.
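To make that concrete, here's a simplified sketch of feeding a discovered camera into a WebCamTexture and previewing it on a RawImage. The device selection here is purely illustrative; the real WebCamTextureManager resolves the correct device from the Camera2 metadata shown above.

```csharp
// Simplified sketch: preview a passthrough camera on a RawImage.
// Device-name matching is illustrative, not the real discovery logic.
using UnityEngine;
using UnityEngine.UI;

public class PassthroughPreviewSketch : MonoBehaviour
{
    [SerializeField] private RawImage previewImage;
    [SerializeField] private Vector2Int requestedResolution = new Vector2Int(1280, 704);

    private WebCamTexture webCamTexture;

    private void Start()
    {
        // Assumes the passthrough camera is exposed as a WebCamTexture device
        var devices = WebCamTexture.devices;
        if (devices.Length == 0) return;

        webCamTexture = new WebCamTexture(
            devices[0].name, requestedResolution.x, requestedResolution.y, 30);
        webCamTexture.Play();
        previewImage.texture = webCamTexture;
    }
}
```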
Challenge 2: WebRTC Deep Dive (Or: The Internet is Weird Too)
Getting video from Unity to our AI service meant diving deep into WebRTC — a protocol that's simultaneously powerful and incredibly specific about how things need to be done. We're not just sending a video file; we're streaming live camera feed over UDP with real-time encoding, ICE negotiation for NAT traversal, and custom signaling to our AI servers.
WebSocket Connection & Immediate Setup
The integration starts in WebRTCManager.cs:179-231 with WebSocket creation:
```csharp
public async void Connect(string webSocketUrl, bool isVideoAudioSender, bool isVideoAudioReceiver)
{
    // webSocketUrl is either:
    //   "wss://api3.decart.ai/v1/stream-trial?model=mirage" or
    //   "wss://api3.decart.ai/v1/stream-trial?model=lucy_v2v_720p_rt"
    ws = new WebSocket(webSocketUrl);
    currentWebSocketUrl = webSocketUrl;

    ws.OnOpen += () =>
    {
        SimpleWebRTCLogger.Log("WebSocket connection opened!");
        IsWebSocketConnected = true;
        IsWebSocketConnectionInProgress = false;
        OnWebSocketConnection?.Invoke(WebSocketState.Open);

        // Immediately set up the WebRTC peer connection
        try
        {
            if (ws != null)
            {
                CreateNewPeerVideoAudioReceivingResources(sessionId);
                SetupPeerConnection();
                SimpleWebRTCLogger.Log($"NEWPEER: Created new peerconnection {localPeerId} on peer {localPeerId}");
            }
        }
        catch (Exception e)
        {
            SimpleWebRTCLogger.LogError("Error in connection setup: " + e.Message);
        }
    };
}
```
The API connection is streamlined - model selection is handled in the URL parameter, allowing the WebRTC peer connection to be set up immediately after the WebSocket opens. No separate initialization message is needed.
WebRTC Peer Connection Creation
Once the WebSocket is connected, we set up the WebRTC peer connection in WebRTCManager.cs:216-234:
```csharp
private void SetupPeerConnection()
{
    pc = CreateNewRTCPeerConnection();
    SetupEventHandlers();
}

private RTCPeerConnection CreateNewRTCPeerConnection()
{
    RTCConfiguration config = new RTCConfiguration
    {
        iceServers = new[]
        {
            new RTCIceServer { urls = new[] { "stun:stun.l.google.com:19302" } }
        },
        iceCandidatePoolSize = 10 // Pre-gather candidates for faster connection
    };
    return new RTCPeerConnection(ref config);
}
```
ICE Candidate Handling
The ICE candidate exchange is crucial for NAT traversal. Here's how we handle it in WebRTCManager.cs:236-248:
```csharp
pc.OnIceCandidate = candidate =>
{
    var candidateMessage = new outboundICECandidateMessage
    {
        type = "ice-candidate",
        candidate = new ICECandidateData
        {
            candidate = candidate.Candidate,
            sdpMid = candidate.SdpMid,
            sdpMLineIndex = candidate.SdpMLineIndex ?? 0
        }
    };
    ws.SendText(JsonUtility.ToJson(candidateMessage));
};
```
Video Stream Creation and Encoding
The real work happens when we create the video stream from our Unity camera, in WebRTCConnection.cs:328-367:
```csharp
private IEnumerator StartVideoTransmissionAsync()
{
    StreamingCamera.gameObject.SetActive(true); // Activate the Unity Camera
    yield return null;                          // Wait for camera initialization
    yield return null;

    // Create video stream track from the Unity Camera
    if (UseImmersiveSetup)
    {
        videoStreamTrack = new VideoStreamTrack(videoEquirect);
    }
    else
    {
        // Standard camera capture from the StreamingCamera GameObject
        videoStreamTrack = StreamingCamera.CaptureStreamTrack(VideoResolution.x, VideoResolution.y);
    }

    webRTCManager.AddVideoTrack(videoStreamTrack);
    StartCoroutine(CreateOfferWithWarmup());
}
```
VP8 Encoding Configuration
Here's where Unity WebRTC gets interesting. We configure VP8 encoding for optimal streaming in WebRTCConnection.cs:369-415:
```csharp
public IEnumerator CreateOffer()
{
    SimpleWebRTCLogger.Log("Creating offer");

    // Enforce VP8 codec
    var transceivers = webRTCManager.pc.GetTransceivers();
    foreach (var transceiver in transceivers)
    {
        if (transceiver.Sender != null && transceiver.Sender?.Track?.Kind == TrackKind.Video)
        {
            var vp8 = RTCRtpSender.GetCapabilities(TrackKind.Video).codecs
                .Where(c => c.mimeType == "video/VP8").ToArray();
            transceiver.SetCodecPreferences(vp8);

            // Manual encoding parameters and FPS control
            var parameters = transceiver.Sender.GetParameters();
            foreach (var encoding in parameters.encodings)
            {
                encoding.maxBitrate = 4000000UL;      // 4Mbps max
                encoding.minBitrate = 1000000UL;      // 1Mbps min
                encoding.maxFramerate = 30U;          // 30fps
                encoding.scaleResolutionDownBy = 1.0; // No downscaling
            }
            transceiver.Sender.SetParameters(parameters);
            SimpleWebRTCLogger.Log("Set manual encoding parameters");
        }
    }

    var offer = webRTCManager.pc.CreateOffer();
    yield return offer;

    if (!offer.IsError)
    {
        // Grab the generated session description from the async operation
        var offerSessionDesc = offer.Desc;
        var offerMessage = new outboundOfferMessage
        {
            type = "offer",
            sdp = offerSessionDesc.sdp
        };
        webRTCManager.ws.SendText(JsonUtility.ToJson(offerMessage));
    }
}
```
Signaling Message Handling
All incoming messages are processed in WebRTCManager.cs:331-401. Here's how we handle the answer:
```csharp
var answerMessage = JsonUtility.FromJson<inboundAnswerMessage>(data);
if (answerMessage.type == "answer")
{
    RTCSessionDescription answerSessionDesc = new RTCSessionDescription()
    {
        type = RTCSdpType.Answer,
        sdp = answerMessage.sdp
    };
    pc.SetRemoteDescription(ref answerSessionDesc);
}
```
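Inbound ICE candidates from the server go through the same message handler. That branch isn't shown above, but it looks roughly like this sketch, assuming the server's messages mirror the JSON shape of the outbound outboundICECandidateMessage shown earlier:

```csharp
// Sketch of the inbound counterpart to the outbound ICE message above.
// Assumes the server sends the same JSON shape we send out.
var iceMessage = JsonUtility.FromJson<outboundICECandidateMessage>(data);
if (iceMessage.type == "ice-candidate")
{
    var candidateInit = new RTCIceCandidateInit
    {
        candidate = iceMessage.candidate.candidate,
        sdpMid = iceMessage.candidate.sdpMid,
        sdpMLineIndex = iceMessage.candidate.sdpMLineIndex
    };
    pc.AddIceCandidate(new RTCIceCandidate(candidateInit));
}
```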
Received Video Frame Processing
When processed video comes back from our AI service, Unity WebRTC automatically creates video receivers. We find them in WebRTCController.cs:50-64:
```csharp
private IEnumerator FindReceivedVideo()
{
    yield return new WaitForSeconds(0.5f); // Allow time for video receiver creation

    // Search for automatically created video receivers
    var receivingObjects = GameObject.FindObjectsByType<RawImage>(FindObjectsSortMode.None);
    foreach (var rawImage in receivingObjects)
    {
        if (rawImage.name.Contains("Receiving-RawImage") && rawImage.texture != null)
        {
            Debug.Log($"Found received video: {rawImage.name}");
            if (receivedVideoImage != null)
            {
                receivedVideoImage.texture = rawImage.texture; // Display on our Canvas
            }
            break;
        }
    }
}
```
This whole dance — WebSocket signaling, WebRTC negotiation, VP8 encoding, and frame processing — happens in under 3 seconds. The streamlined connection flow with immediate peer setup makes the connection feel instantaneous.
Challenge 3: Quest Development Quirks (Or: Quest is Weird As Well)
Building for Quest isn't just about VR development — it's about navigating a maze of platform-specific APIs, Unity integration patterns, and hardware limitations that each have their own personality. Let's dive into what we discovered.
Voice Control: The Wit.ai Integration Adventure
Here's where things get really interesting. Instead of cycling through predefined styles, users can just speak what they want: "Make this look like a cyberpunk city at night" or "Transform this into a medieval castle." Sounds straightforward, right? Just add speech recognition!
Well, we discovered that to get basic speech-to-text functionality, we needed to:
- Create a Wit.ai account (Meta's speech platform)
- Set up a new app in their console (with comprehensive configuration options)
- Configure intents and entities (which we don't even use for basic transcription)
- Generate server access tokens (multiple types, naturally)
- Integrate their Meta Voice SDK
All of this... for simple speech-to-text transcription. It's a lot of setup for what we needed, but the end result works reliably.
The voice integration lives in VoiceIntentController.cs and connects their speech recognition to our WebRTC prompt system:
```csharp
[SerializeField] private AppVoiceExperience appVoiceExperience; // Meta's voice solution
[SerializeField] private WebRTCController webRTCController;     // Our WebRTC bridge

private void Start()
{
    // Connect voice events to our WebRTC system (the straightforward part)
    appVoiceExperience.VoiceEvents.OnFullTranscription.AddListener((transcription) =>
    {
        webRTCController.QueueCustomPrompt(transcription); // Send the text to AI
        fullTranscriptText.text = transcription;           // Show what we heard
    });

    // Live transcription for real-time feedback
    appVoiceExperience.VoiceEvents.OnPartialTranscription.AddListener((transcription) =>
    {
        partialTranscriptText.text = transcription; // "Look ma, I'm typing!"
    });
}
```
Users activate this speech system by holding the Quest controller's trigger button. While holding, they see live transcription in the UI. When released, the complete transcription gets sent to our AI service as a custom prompt.
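The hold-to-talk wiring itself isn't in the snippet above; roughly, it looks like this sketch, assuming the Meta Voice SDK's Activate()/Deactivate() calls and OVRInput for the controller trigger:

```csharp
// Sketch of the hold-to-talk wiring (assumed, not the actual VoiceIntentController code).
private void Update()
{
    // Start listening when the trigger is pressed...
    if (OVRInput.GetDown(OVRInput.Button.PrimaryIndexTrigger))
    {
        appVoiceExperience.Activate();
    }
    // ...and stop when it's released; OnFullTranscription then fires with the
    // final text, which gets queued as a custom prompt.
    else if (OVRInput.GetUp(OVRInput.Button.PrimaryIndexTrigger))
    {
        appVoiceExperience.Deactivate();
    }
}
```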
Meta's Camera Security Approach
Now for an interesting discovery: Meta doesn't allow access to both passthrough cameras at once, nor to the full immersive video feed. We can only access one camera (left or right eye). We figured out this is their security approach: apps are limited to a single feed of the same view users already see in passthrough mode, which keeps the attack surface small.
In WebCamTextureManager.cs, we configure our single-camera setup:
```csharp
[SerializeField] public PassthroughCameraEye Eye = PassthroughCameraEye.Left; // Pick your favorite
[SerializeField] public Vector2Int RequestedResolution;                       // Max resolution per camera
```
This limitation drove our canvas-based display approach — different presentation than originally envisioned, but it works well for the VR experience.
Unity Scene Configuration Requirements
The Unity scene setup in DecartAI-Main.unity required careful orchestration. Here's the GameObject hierarchy that actually works (and, we discovered, it's quite sensitive to changes):
```
MainCanvas (Active)
├─ receivedVideoImage (RawImage) - Displays AI-processed video
├─ promptNameText (TextMeshPro) - Shows current style
└─ ReceivingRawImagesParent - Container for auto-generated receivers

WebRTC Webcam Canvas (Initially Inactive, for good reasons)
└─ canvasRawImage (RawImage) - Local camera preview

Client-StreamingCamera (Initially Inactive, because Unity timing)
└─ Camera Component - Unity Camera for WebRTC video capture
```
The WebRTCConnection component needs all these references configured in the Unity Inspector with precision:
[Header("Camera Integration")] public Camera StreamingCamera; // References the Client-StreamingCamera GameObject public Vector2Int VideoResolution = new Vector2Int(1280, 704); // Target resolution [Header("UI References")] public Transform ReceivingRawImagesParent; // Container for dynamically created video receivers public Material streamMaterial; // Optional material for video rendering
The Classic WebCamTexture Play() Timing Issue
Unity has an interesting timing behavior where calling webCamTexture.Play() immediately after creation sometimes fails silently. Not "throws an error" fails — just silently fails, leaving you with a black screen and debugging confusion. Our solution in WebCamTextureManager.cs:437-438:
```csharp
// Important: Wait before calling Play() to avoid Unity's timing quirk
yield return null;
yield return new WaitForSeconds(1); // The patience solution
webCamTexture.Play();
```
Yes, we just wait a whole second. Is it elegant? No. Does it work reliably? Yes. Sometimes the simplest solutions are the least glamorous ones.
Meta XR Project Setup Tools: Interesting Recommendations
Here's where it gets fun. Meta provides helpful project setup tools that analyze your Unity project and give you recommendations. We discovered we have:
- 2 "Outstanding Issues" that if we "fix," the app breaks completely
- 2 "Recommended Fixes" that if we apply, surprise! The app also breaks
The solution? We learned to leave the "issues" alone. They're features now.
Custom Unity Player Settings Requirements
Through experimentation, we discovered these specific settings that must be configured exactly right:
Player Settings → Android Settings:
Graphics APIs: OpenGLES3 only (remove Vulkan if present)
Scripting Backend: IL2CPP
Architecture: ARM64 only (ARMv7 causes issues)
Target API Level: 34+ (Android 14)
Minimum API Level: 29+ (Android 10)
XR Plugin Management → Oculus:
✅ Initialize XR on Startup
❌ Low Overhead Mode (GLES) - MUST be DISABLED (counter-intuitive but necessary)
❌ Meta Quest: Occlusion - MUST be DISABLED
❌ Meta XR Subsampled Layout - MUST be DISABLED
We found out that enabling those disabled options causes the camera feed to disappear into the void.
The Dynamic Video Receiver Creation Pattern
Unity WebRTC automatically creates RawImage GameObjects for incoming video streams, but gives them generic names. We discovered we have to search for them manually:
```csharp
// From WebRTCConnection.cs:464-491 - The receiver discovery pattern
public void CreateVideoReceiverGameObject(string senderPeerId)
{
    var receivingRawImage = new GameObject().AddComponent<RawImage>();
    receivingRawImage.name = $"{senderPeerId}-Receiving-RawImage"; // Searchable name
    receivingRawImage.rectTransform.SetParent(ReceivingRawImagesParent, false);

    // Configure RawImage layout with precision
    receivingRawImage.rectTransform.localScale = Vector3.one;
    receivingRawImage.rectTransform.anchorMin = Vector2.zero;
    receivingRawImage.rectTransform.anchorMax = Vector2.one;

    webRTCManager.VideoReceiver = receivingRawImage;
}
```
The Camera-to-WebRTC Bridge
Getting the Quest camera feed into Unity's WebRTC system requires bridging multiple APIs. The flow in all its glory:
```csharp
// In WebRTCController.cs:19-24 - Wait for the stars to align
private IEnumerator Start()
{
    yield return new WaitUntil(() =>
        passthroughCameraManager.WebCamTexture != null &&
        passthroughCameraManager.WebCamTexture.isPlaying);

    _webcamTexture = passthroughCameraManager.WebCamTexture;
    canvasRawImage.texture = _webcamTexture; // Finally display something!
}
```
Then in WebRTCConnection.cs:347-355, we perform the bridging ceremony:
```csharp
// Create video stream track from the Unity Camera (not directly from WebCamTexture, for reasons)
if (UseImmersiveSetup)
{
    videoStreamTrack = new VideoStreamTrack(videoEquirect);
}
else
{
    // StreamingCamera captures the scene, including our camera texture
    videoStreamTrack = StreamingCamera.CaptureStreamTrack(VideoResolution.x, VideoResolution.y);
}
```
WebRTCController Inspector Configuration
The WebRTCController also needs its own set of precise Inspector assignments:
[Header("Camera Integration")] public WebCamTextureManager passthroughCameraManager; // The camera manager singleton [Header("UI Display Components")] public RawImage canvasRawImage; // Local camera preview (fileID: 557677174) public RawImage receivedVideoImage; // AI-processed video display public TMP_Text promptNameText; // Current prompt display
Miss one reference? Silent failure. Wrong reference? Silent failure. Reference something that used to exist but got renamed? You guessed it — silent failure with bonus debugging time.
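One cheap mitigation (not something in the repo, just a pattern we'd suggest) is a startup guard that turns those silent failures into loud ones:

```csharp
// Illustrative guard: log loudly at startup if an Inspector reference
// was left unassigned, instead of failing silently later.
private void Awake()
{
    if (passthroughCameraManager == null) Debug.LogError($"{name}: passthroughCameraManager not assigned", this);
    if (canvasRawImage == null)           Debug.LogError($"{name}: canvasRawImage not assigned", this);
    if (receivedVideoImage == null)       Debug.LogError($"{name}: receivedVideoImage not assigned", this);
    if (promptNameText == null)           Debug.LogError($"{name}: promptNameText not assigned", this);
}
```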
Why We Embraced the Canvas Approach
The single-camera limitation plus Unity's WebRTC patterns plus Meta's security considerations pushed us toward a canvas-based approach:
- Meta's Security: Only one passthrough camera accessible (by design)
- Unity WebRTC: Works smoothly with Camera.CaptureStreamTrack() from scene cameras
- VR Display: RawImage components on World Space canvases work reliably
- Reality: Sometimes you gotta work with what you've got
The result feels natural in VR — like having floating video screens that show your reality transformed by AI. It's not the full-immersion Matrix experience we dreamed of, but it's surprisingly engaging nonetheless. Plus, floating screens are easier to debug than reality-warping shaders, so there's that practical benefit.
What This Opens Up
The open-source release is already sparking ideas we never considered. Developers are talking about AR games where the environment transforms based on player actions, architectural visualization where you can instantly restyle spaces, and educational experiences where you can "visit" historical periods or artistic movements.
The technical foundation we built — real-time Quest camera access, WebRTC streaming to AI services, and the Unity integration patterns — could power a whole new generation of VR applications. We solved the challenging problems: permissions, camera discovery, encoding optimization, network reliability, and sub-200ms latency. Now anyone can build on top of this.
The possibilities are exciting: multiple camera angles, higher resolutions, audio processing integration, and maybe even real-time collaboration where multiple users can share AI-transformed spaces. All the building blocks are here, waiting for the next creative idea.
We're just grinning like kids who built a rocket ship in their garage and watched it actually fly. Sometimes the best projects start with someone saying "wouldn't it be cool if..." and end with you showing everyone that yes, it really would be cool.
The future of VR isn't just about better displays or higher refresh rates — it's about AI that can transform reality itself, in real-time, while you're living inside it. And honestly? We can't wait to see what you build next.
The complete source code and documentation for this Quest + MirageLSD + Lucy Edit integration is available on GitHub. Ready to start building? Load up Unity, grab a Quest 3, and dive in. Or, if you'd rather try it right away, just download the APK and sideload it with SideQuest.