OmniVinci introduces a new model architecture and data-curation pipeline for omni-modal large language models (LLMs), achieving state-of-the-art performance in understanding images, videos, audio, and text. Its key innovations are OmniAlignNet, Temporal Embedding Grouping, and Constrained Rotary Time Embedding, which improve cross-modal perception and reasoning while significantly reducing training-data requirements. These advantages carry over to downstream applications in robotics, medical AI, and smart factories.
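The summary names Constrained Rotary Time Embedding without detail. As a rough intuition only, a rotary (RoPE-style) embedding can be driven by real-valued timestamps rather than integer token positions, with the timestamps clamped to a bounded range. The sketch below is an illustrative assumption built on standard RoPE, not OmniVinci's actual formulation; the clamping range `t_max` and function name are hypothetical.

```python
import numpy as np

def rotary_time_embedding(x, t, max_period=10000.0, t_max=64.0):
    """RoPE-style rotation of token features, keyed by (clamped) timestamps.

    x : (n, d) array of token features, d even
    t : (n,) array of timestamps in seconds
    t_max : timestamps are clipped to [0, t_max]; this clamp is an
            illustrative guess at the "constrained" part, not the paper's spec.
    """
    n, d = x.shape
    assert d % 2 == 0, "feature dim must be even to form rotation pairs"
    t = np.clip(t, 0.0, t_max)
    # Standard RoPE geometric frequency schedule over half the feature dim.
    freqs = 1.0 / (max_period ** (np.arange(d // 2) / (d // 2)))
    angles = np.outer(t, freqs)                 # (n, d/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]             # split features into 2-D pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin          # rotate each pair by its angle
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each feature pair is only rotated, the per-token norm is preserved, so time information is injected without changing feature magnitudes.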