Oppo's Multi-X Team has published X-OmniClaw, an open-source Android AI agent framework that keeps core logic on-device while calling cloud-based language models only for heavy reasoning tasks. Unlike most mobile AI systems, which run virtualized Android instances on cloud servers, X-OmniClaw executes directly on the user's physical device and retains access to the phone's camera, photos, and local files.
Architecture: Three Pillars of On-Device Intelligence
X-OmniClaw operates through three interconnected components that work as one continuous loop, according to Oppo's technical documentation.
Omni Perception combines camera feeds, screen content, and voice input into a single pipeline. A vision-language model interprets the scene before the agent takes action. For example, if a user points their camera at a product and asks for its price, the agent first identifies what it's viewing, then opens the relevant shopping app and begins searching without requiring manual input.
Omni Memory distinguishes X-OmniClaw from one-shot chatbots by maintaining context across tasks, app switches, and sessions. The agent builds long-term semantic memory from the user's photo gallery, converting raw images into structured notes about objects, scenes, and events. According to the report, "runtime continuity is what lets X-OmniClaw operate as an ongoing device agent rather than a one-shot response system."
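As a rough illustration of what "structured notes" over a photo gallery could look like, the sketch below stores one note per image and answers simple keyword queries. The `PhotoNote` schema, the `SemanticMemory` class, and the file names are all assumptions made for this example; the report does not describe X-OmniClaw's actual data model, and a real system would presumably use a vision-language model and embedding search rather than exact term matching.

```python
from dataclasses import dataclass, field


@dataclass
class PhotoNote:
    """Structured note derived from one gallery image (hypothetical schema)."""
    path: str
    objects: set[str] = field(default_factory=set)
    scene: str = ""
    event: str = ""


class SemanticMemory:
    """Toy long-term store; a stand-in for the agent's persistent memory."""

    def __init__(self) -> None:
        self._notes: list[PhotoNote] = []

    def add(self, note: PhotoNote) -> None:
        self._notes.append(note)

    def search(self, query: str) -> list[str]:
        """Return paths of photos whose note contains every query term."""
        terms = query.lower().split()
        hits = []
        for note in self._notes:
            # Flatten objects, scene, and event into one bag of words.
            words = {obj.lower() for obj in note.objects}
            words |= set(note.scene.lower().split())
            words |= set(note.event.lower().split())
            if all(term in words for term in terms):
                hits.append(note.path)
        return hits


# Hypothetical gallery contents, as a vision model might have described them.
memory = SemanticMemory()
memory.add(PhotoNote("IMG_001.jpg", {"parrot", "cage"}, "living room"))
memory.add(PhotoNote("IMG_002.jpg", {"dog"}, "park", "birthday picnic"))
memory.add(PhotoNote("IMG_003.jpg", {"parrot"}, "garden"))

parrot_photos = memory.search("parrot")
print(parrot_photos)  # ['IMG_001.jpg', 'IMG_003.jpg']
```

Because the notes are kept separately from any single conversation, a later task such as assembling a themed video can query them without rescanning the raw images.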
Omni Action handles execution by combining XML interface data with on-device visual models and optical character recognition (OCR) to determine exactly what to tap, even on cluttered screens. The framework includes a behavior cloning feature that allows users to record a navigation path once, then replay it instantly via Android deeplink shortcuts in future sessions, bypassing multi-step app navigation.
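The two ideas in this paragraph can be sketched in a few lines, under stated assumptions. `resolve_tap` looks for a target label among XML interface nodes first and falls back to OCR text boxes, returning the center point to tap; `ShortcutCache` stores a deeplink for a recorded task so that later runs can skip manual navigation. Every name here, including the `exampleapp://` URI, is hypothetical and chosen for illustration; the report does not specify how X-OmniClaw ranks candidates or stores recorded paths.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Box:
    left: int
    top: int
    right: int
    bottom: int

    def center(self) -> tuple[int, int]:
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


@dataclass
class Candidate:
    label: str   # text from an XML node or from OCR
    box: Box
    source: str  # "xml" or "ocr"


def resolve_tap(
    target: str, xml_nodes: list[Candidate], ocr_hits: list[Candidate]
) -> Optional[tuple[int, int]]:
    """Return tap coordinates for `target`, checking XML nodes before OCR text."""
    wanted = target.strip().lower()
    for candidate in xml_nodes + ocr_hits:  # XML candidates are tried first
        if wanted in candidate.label.lower():
            return candidate.box.center()
    return None  # nothing on screen matched


class ShortcutCache:
    """Maps a recorded task name to a deeplink URI for later replay."""

    def __init__(self) -> None:
        self._deeplinks: dict[str, str] = {}

    def record(self, task: str, deeplink: str) -> None:
        self._deeplinks[task] = deeplink

    def replay(self, task: str) -> Optional[str]:
        return self._deeplinks.get(task)


# Hypothetical screen state: one labeled XML node, one label found only by OCR.
xml_nodes = [Candidate("Search", Box(0, 0, 200, 100), "xml")]
ocr_hits = [Candidate("Add to cart", Box(100, 900, 500, 1000), "ocr")]

print(resolve_tap("add to cart", xml_nodes, ocr_hits))  # (300, 950)

cache = ShortcutCache()
cache.record("open video editor", "exampleapp://editor/new")  # made-up URI
print(cache.replay("open video editor"))  # exampleapp://editor/new
```

Trying structured XML data before OCR is a design choice made for this sketch: XML bounds are exact when present, while OCR covers labels that appear only as rendered pixels.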
Operational Examples
Oppo demonstrated several practical applications of X-OmniClaw:
- Product identification and pricing: The agent identifies a physical product via camera, opens Taobao, scrolls through results, and returns a price summary without requiring any typing.
- Educational assistance: A floating on-screen companion helps users work through math exercises step by step, autonomously reading screen content, processing each question, and advancing when complete.
- Video creation from gallery: When asked to assemble a highlight video from parrot-themed photos, the system scans the gallery using semantic memory to find matching images, opens CapCut's video editor via deeplink, batch-selects files, and generates the video. The report indicates this process, which previously required "a few minutes or longer," is reduced to a handful of automated steps.
Positioning Within the AI Agent Ecosystem
X-OmniClaw extends an architecture pioneered by OpenClaw, an open-source agent framework that reached over 373,000 GitHub stars and was eventually backed by OpenAI. Hermes Agent by Nous Research advanced the concept further with a self-improving learning loop that compounds capabilities over time. Both projects ran primarily on desktop hardware. X-OmniClaw adapts this architecture for smartphones: it builds on the open-source HermesApp codebase, draws on OpenClaw's structured skill model, and customizes both for the multimodal, always-on nature of mobile devices.
The code is available on GitHub, with Oppo committing to release all assets and continue updating the project as the system evolves.