SenseTime’s new SenseNova U1 model eliminates the core VAE component used in nearly all major image generation systems, a fundamental architectural shift that could lower costs and reduce visual artifacts.

With the release of SenseNova U1, Chinese AI firm SenseTime (00020.HK) is challenging the foundational architecture of most modern image generation models. The company open-sourced a 2B parameter preview of the model, built on a NEO-Unify architecture that works directly on pixels and discards the variational autoencoder (VAE) used by systems from Stable Diffusion to Black Forest Labs' Flux. This approach could significantly reduce inference overhead and improve image fidelity by avoiding the VAE's compression step.
“We intend to charge future AI products based on problem-solving outcomes rather than token consumption,” SenseTime Chairman Xu Li said in March 2026, a philosophy that aligns with the cost-saving potential of this more efficient architecture.
The 2B preview model achieves a peak signal-to-noise ratio (PSNR) of 31.56 on image reconstruction, according to the model card, approaching the 32.65 PSNR of the much larger Flux model without requiring a separate VAE. The model was developed jointly with Nanyang Technological University's S-Lab and released on Hugging Face on April 26. An 8B base model has also been confirmed.
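For context, PSNR is a standard reconstruction metric: it compares an image against its reconstruction via mean squared error, on a logarithmic decibel scale where higher is better. A minimal sketch of the computation (standard definition, not SenseTime's evaluation code):

```python
import numpy as np

def psnr(original: np.ndarray, reconstructed: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means a closer reconstruction."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Example: a uniform error of 2 intensity levels on an 8-bit image
img = np.full((64, 64), 128, dtype=np.uint8)
noisy = img + 2
print(round(psnr(img, noisy), 2))  # → 42.11
```

Scores in the low 30s, like those reported here, correspond to small but visible reconstruction error; each extra dB represents a meaningful reduction in pixel-level distortion.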
For developers and enterprise users, the release signals a move toward simpler, more efficient AI stacks. Removing the VAE eliminates a major source of visual artifacts and a component that requires significant tuning. This could lower the barrier to entry for building high-quality image generation pipelines and reduce operational costs for production systems, directly threatening the API-based business models of Western vendors like Midjourney and OpenAI.
The variational autoencoder has long been a practical necessity, not a fundamental one. It compresses high-resolution images into a smaller, computationally manageable latent space where the diffusion process occurs. However, this compression is lossy, discarding fine details and introducing artifacts that developers spend considerable time engineering around. SenseNova’s NEO-Unify architecture bypasses this step entirely.
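To see why that compression is lossy, consider a toy stand-in for the encoder/decoder round trip. A real VAE learns its compression; here, simple 8x average pooling and nearest-neighbor upsampling (a hypothetical illustration, not any production model's architecture) show how fine detail is discarded on the way into and out of a latent space:

```python
import numpy as np

# Toy stand-ins for a VAE's encoder and decoder: 8x average pooling
# compresses the image into a small "latent", nearest-neighbor repetition
# expands it back. A learned VAE does far better, but is still lossy.
def encode(img: np.ndarray, factor: int = 8) -> np.ndarray:
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def decode(latent: np.ndarray, factor: int = 8) -> np.ndarray:
    return np.repeat(np.repeat(latent, factor, axis=0), factor, axis=1)

img = np.random.default_rng(0).integers(0, 256, (256, 256)).astype(np.float64)
latent = encode(img)          # 256x256 pixels -> 32x32 latent (64x fewer values)
recon = decode(latent)        # back to full resolution

print(latent.shape)           # → (32, 32)
print(np.allclose(recon, img))  # → False: high-frequency detail is gone
```

In latent diffusion systems, the denoising model operates on that small latent, so any detail the decoder cannot restore becomes a visible artifact. A pixel-space model like SenseNova U1 sidesteps that bottleneck at the cost of working on far more values per step.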
By treating visual and language data as deeply correlated from the start, the model learns to generate directly on pixels. A dual-stage training strategy allows the model to integrate language reasoning from a pre-trained large language model while building its visual perception from the ground up. This unified pathway for understanding and generation avoids the performance trade-offs that have often plagued multimodal model training, where gains in one domain can degrade capability in another.
SenseNova U1 is the latest in a series of competitive open-weight models emerging from China, joining notable releases from companies like DeepSeek, Alibaba’s Qwen, and the InternVL project. This pattern of rapid architectural experimentation combined with open-source releases is building a robust developer ecosystem that presents a meaningful alternative to the closed, US-centric models from OpenAI, Google, and Anthropic, or even the open-weight models from Western firms like Meta.
For enterprise buyers, particularly in markets where data sovereignty and local infrastructure are key, these models are becoming increasingly viable. SenseTime has already been deepening its integration with domestic chip providers, a move that insulates its development pipeline from US export controls affecting Nvidia’s GPU supply chains. The combination of architectural innovation, open-source strategy, and supply chain resilience strengthens the position of China's AI sector in the fragmenting global market.
This article is for informational purposes only and does not constitute investment advice.