
Perceptron shocks with Mk1, a high-performing video analysis AI model 80-90% cheaper than Anthropic, OpenAI and Google

May 12, 2026

AI that can see and understand what's happening in a video, especially a live feed, is understandably an attractive product for many enterprises and organizations. Beyond acting as a security "watchdog" over sites and facilities, such a model could clip out the most exciting parts of marketing videos and repurpose them for social media, identify inconsistencies and gaffes in videos and flag them for removal, and analyze the body language and actions of participants in controlled studies or of candidates applying for new roles.

While some AI models offer this type of functionality today, it is far from a mainstream capability. The two-year-old startup Perceptron Inc. is seeking to change that. Today, it announced the release of its flagship proprietary video analysis reasoning model, Mk1 (short for "Mark One"), at a cost of $0.15 per million input tokens and $1.50 per million output tokens through its application programming interface (API), roughly 80-90% less than leading proprietary rivals: Anthropic's Claude Sonnet 4.5, OpenAI's GPT-5, and Google's Gemini 3.1 Pro.

Led by Co-founder and CEO Armen Aghajanyan, formerly of Meta FAIR and Microsoft, the company spent 16 months developing a “multi-modal recipe” from the ground up to address the complexities of the physical world.

This launch signals a new era where models are expected to understand cause-and-effect, object dynamics, and the laws of physics with the same fluency they once applied to grammar.

Interested users and potential enterprise customers can try the model for themselves on Perceptron's public demo site.

Performance across spatial and video benchmarks

The model’s performance is backed by a suite of industry-standard benchmarks focused on grounded understanding.

In spatial reasoning (ER Benchmarks), Mk1 achieved a score of 85.1 on EmbSpatialBench, surpassing Google’s Robotics-ER 1.5 (78.4) and Alibaba’s Q3.5-27B (approx. 84.5).

In the specialized RefSpatialBench, Mk1’s score of 72.4 represents a massive leap over competitors like GPT-5m (9.0) and Sonnet 4.5 (2.2), highlighting a significant advantage in referring expression comprehension.

Video benchmarks show similar dominance; on the EgoSchema “Hard Subset”—where first-and-last-frame inference is insufficient—Mk1 scored 41.4, matching Alibaba’s Q3.5-27B and significantly beating Gemini 3.1 Flash-Lite (25.0).

On the VSI-Bench, Mk1 reached 88.5, the highest recorded score among the compared models, further validating its ability to handle actual temporal reasoning tasks.

Market positioning and the efficiency frontier

Perceptron has explicitly targeted the “Efficiency Frontier,” a metric that plots mean scores across video and embodied reasoning benchmarks against the blended cost per million tokens.

Benchmarking data reveals that Mk1 occupies a unique position: it matches or exceeds the performance of “frontier” models like GPT-5 and Gemini 3.1 Pro while maintaining a cost profile closer to “Lite” or “Flash” versions.

Specifically, Perceptron Mk1 is priced at $0.15 per million input tokens and $1.50 per million output tokens. In comparison, the “Efficiency Frontier” chart shows GPT-5 at a significantly higher blended cost (near $2.00) and Gemini 3.1 Pro at approximately $3.00, while Mk1 sits at the $0.30 blended cost mark with superior reasoning scores.
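A blended cost collapses the two prices into a single figure by weighting input and output tokens by an assumed usage ratio. The article does not state the ratio behind the $0.30 figure, but a roughly 8:1 input-to-output weighting reproduces it; here is a minimal sketch of the arithmetic in Python:

```python
# Back-of-the-envelope blended cost. Assumption: the blended figure
# weights input to output tokens at about 8:1; that ratio is not stated
# in the article, but it reproduces the $0.30 mark cited for Mk1.
def blended_cost(input_price: float, output_price: float,
                 input_share: float = 8.0, output_share: float = 1.0) -> float:
    """Weighted-average price per million tokens across input and output."""
    total = input_share + output_share
    return (input_share * input_price + output_share * output_price) / total

print(f"${blended_cost(0.15, 1.50):.2f}")  # $0.30 for Mk1's pricing
```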

This aggressive pricing strategy is intended to make high-end physical AI accessible for large-scale industrial use rather than just experimental research.

Architecture and temporal continuity

The technical core of Perceptron Mk1 is its ability to process native video at up to 2 frames per second (FPS) across a significant 32K token context window.

Unlike traditional vision-language models (VLMs) that often treat video as a disjointed sequence of still images, Mk1 is designed for temporal continuity.

This architecture allows the model to “watch” extended streams and maintain object identity even through occlusions, a critical requirement for robotics and surveillance applications.

Developers can query the model for specific moments in a long stream and receive structured time codes in return, streamlining the process of video clipping and event detection.
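Perceptron's request schema is not reproduced in this article, so the sketch below is illustrative only: the endpoint URL, payload fields, and response shape are assumptions chosen to show how such a time-code query could work in practice.

```python
# Illustrative sketch: the URL, payload fields, and response format are
# assumptions for the sake of example, not Perceptron's published API.
import requests

resp = requests.post(
    "https://api.perceptron.example/v1/video/query",  # placeholder endpoint
    json={
        "model": "mk1",
        "video_url": "https://example.com/feeds/loading-dock.mp4",
        "prompt": "List every moment a forklift crosses the yellow line.",
        "fps": 2,  # Mk1 processes native video at up to 2 FPS
    },
    timeout=120,
)

# Hypothetical structured time codes, ready for automated clipping
for event in resp.json().get("events", []):
    print(f"{event['start']} -> {event['end']}: {event['description']}")
```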

Reasoning with the laws of physics

A primary differentiator for Mk1 is its “Physical Reasoning” capability. Perceptron defines this as a high-precision spatial awareness that allows the model to understand object dynamics and physical interactions in real-world settings.

For example, the model can analyze a scene to determine if a basketball shot was taken before or after a buzzer by jointly reasoning over the ball’s position in the air and the readout on a shot clock.

This requires more than just pattern recognition; it requires an understanding of how objects move through space and time.

The model is capable of “pixel-precise” pointing and counting into the hundreds within dense, complex scenes. It can also read analog gauges and clocks, which have historically been difficult for purely digital vision systems to interpret with high reliability.
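Perceptron has not published the output schema for pointing and counting in this article; purely as an illustration of what "pixel-precise" structured output could look like, a response might resemble the following (all field names assumed):

```python
# Hypothetical response shapes; the fields are assumptions, not
# documented output from the Mk1 API.
gauge_reading = {
    "question": "What does the pressure gauge read?",
    "answer": "approximately 62 psi",
    "points": [{"x": 412, "y": 288, "label": "needle tip"}],
}

count_result = {
    "question": "Point to every item of produce on the belt.",
    "count": 137,
    "points": [{"x": 101, "y": 54}, {"x": 118, "y": 61}],  # truncated
}
```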

It also appears to have strong general world and historical knowledge. In my brief test, I uploaded a vintage public domain film of skyscraper construction in New York City, dated 1906, from the U.S. Library of Congress. Mk1 not only correctly described the contents of the footage, including atypical sights such as workers suspended by ropes, but did so rapidly, and it even correctly identified the rough date (early 1900s) from the look of the footage alone.

A developer platform for physical AI

Accompanying the model release is an expanded developer platform designed to turn these high-level perception capabilities into functional applications with minimal code.

The Perceptron SDK, available via Python, introduces several specialized functions such as “Focus,” “Counting,” and “In-Context Learning”.

The Focus feature allows users to zoom and crop into specific regions of a frame automatically based on a natural language prompt, such as detecting and localizing personal protective equipment (PPE) on a construction site. The Counting function is optimized for dense scenes, such as identifying and pointing to every puppy in a group or individual items of produce.
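The article describes these features but not the SDK's exact signatures; the snippet below is a hedged sketch of how Focus and Counting calls might look, with the client and method names assumed for illustration.

```python
# Sketch only: the "perceptron" client, method names, and arguments are
# assumptions based on the features described, not the documented SDK.
from perceptron import Client  # hypothetical import path

client = Client(api_key="...")

# Focus: zoom and crop to regions matching a natural-language prompt
ppe = client.focus(
    image="site-cam-04.jpg",
    prompt="Detect and localize hard hats and high-visibility vests.",
)

# Counting: enumerate and point to each instance in a dense scene
puppies = client.count(image="litter.jpg", prompt="every puppy")
print(puppies.count, puppies.points[:3])
```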

Furthermore, the platform supports in-context learning, allowing developers to adapt Mk1 to specific tasks by providing just a few examples, such as showing an image of an apple and instructing the model to label every instance of Category 1 in a new scene.
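Continuing the same hypothetical sketch, a few-shot call might pass labeled example images alongside the new scene:

```python
# Hypothetical in-context learning call: one labeled example image
# teaches the model what "Category 1" means in the new scene. The
# method and fields are assumptions, as above.
result = client.label(
    examples=[{"image": "apple-example.jpg", "label": "Category 1"}],
    image="orchard-bin.jpg",
    prompt="Label every instance of Category 1.",
)
for detection in result.detections:
    print(detection.label, detection.bbox)
```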

Licensing strategies and the Isaac series

Perceptron is employing a dual-track strategy for its model weights and licensing. The flagship Perceptron Mk1 is a closed-source model accessed via API, designed for enterprise-grade performance and security.

However, the company is also maintaining its “Isaac” series, which kicked off with the launch of Isaac 0.1 in September 2025, as an open-weights alternative. Isaac 0.2-2b-preview, released in December 2025, is a 2-billion parameter vision-language model with reasoning capabilities that is available for edge and low-latency deployments.

While the weights for the Isaac models are openly available on the popular AI code-sharing community Hugging Face, Perceptron offers commercial licenses for companies that require maximum control or on-premise deployment of the weights.

This approach allows the company to support both the open-source community and specialized industrial partners who need proprietary flexibility. The documentation notes that Isaac 0.2 models are specifically optimized for sub-200ms time-to-first-token, making them ideal for real-time edge devices.
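Since the Isaac weights are open on Hugging Face, loading them should follow the standard transformers pattern; the repository ID below is a guess for illustration, and the model card should be consulted for the exact classes and usage.

```python
# Sketch of loading an open-weights Isaac checkpoint; the repo ID is a
# hypothetical placeholder, and the correct classes may differ, so check
# the Hugging Face model card before running.
from transformers import AutoProcessor, AutoModelForCausalLM

repo = "PerceptronAI/isaac-0.2-2b-preview"  # hypothetical repo ID

processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, device_map="auto"
)
```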

Background on Perceptron founding and focus

Perceptron AI is a Bellevue, Washington-based physical AI startup founded by Aghajanyan and Akshat Shrivastava, both former research scientists at Meta’s Facebook AI Research (FAIR) lab.

The company’s public materials date its founding to November 2024, while a Washington corporate filing record for Perceptron.ai Inc. shows an earlier foreign registration filing on October 9, 2024, listing Shrivastava and Aghajanyan as governors.

In founder launch posts from late 2024, Aghajanyan said he had left Meta after nearly six years and “joined forces” with Shrivastava to build AI for the physical world, while Shrivastava said the company grew out of his work on efficiency, multimodality and new model architectures.

The founding appears to have followed directly from the pair’s work on multimodal foundation models at Meta. In May 2024, Meta researchers published Chameleon, a family of early-fusion models designed to understand and generate mixed sequences of text and images, work that Perceptron later described as part of the lineage behind its own models.

A July 2024 follow-on paper, MoMa, explored more efficient early-fusion training for mixed-modal models and listed both Shrivastava and Aghajanyan among the authors. Perceptron’s stated thesis extends that research direction into “physical AI”: models that can process real-world video and other sensory streams for use cases such as robotics, manufacturing, geospatial analysis, security and content moderation.

Partner ecosystems and future outlook

The real-world impact of Mk1 is already being demonstrated through Perceptron’s partner network. Early adopters are using the model for diverse applications, such as auto-clipping highlights from live sports, which leverages the model’s temporal understanding to identify key plays without human intervention.

In the robotics sector, partners are curating teleoperation episodes into training data, effectively automating the process of labeling and cleaning data for robotic arms and mobile units.

Other use cases include multimodal quality control agents on manufacturing lines, which can detect defects and verify assembly steps in real-time, and wearable assistants on smart glasses that provide context-aware help to users.

Aghajanyan stated that these releases are the culmination of research intended to make AI function best in the physical world, moving toward a future where “physical AI” is as ubiquitous as digital AI.
