3D CoCa is a unified framework for 3D captioning that integrates contrastive vision-language learning with 3D caption generation. By leveraging a frozen CLIP backbone and a spatially-aware 3D scene encoder, it jointly optimizes contrastive and captioning objectives in a shared feature space, leading to improved spatial reasoning and semantic grounding. Extensive experiments show that 3D CoCa surpasses existing methods, achieving significant gains on standard 3D captioning benchmarks.
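
To make the joint objective concrete, the sketch below shows one plausible way to combine a symmetric contrastive (InfoNCE) loss over pooled scene/text embeddings with a token-level captioning loss, as the abstract describes. It is a minimal illustration, not the authors' implementation: the function name `joint_loss`, the weight `lambda_cap`, and the tensor shapes are all assumptions for demonstration.

```python
import torch
import torch.nn.functional as F

def joint_loss(scene_feats, text_feats, caption_logits, caption_targets,
               temperature=0.07, lambda_cap=1.0, pad_id=0):
    """Hypothetical joint objective: contrastive alignment in a shared
    feature space plus autoregressive captioning cross-entropy."""
    # L2-normalize pooled embeddings so dot products are cosine similarities.
    scene_emb = F.normalize(scene_feats, dim=-1)      # (B, D) from 3D scene encoder
    text_emb = F.normalize(text_feats, dim=-1)        # (B, D) from text encoder

    # Symmetric contrastive loss: matched scene-caption pairs are positives,
    # all other pairings within the batch serve as negatives.
    logits = scene_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_con = 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))

    # Captioning loss: next-token cross-entropy over the decoder's
    # vocabulary logits, ignoring padding positions.
    loss_cap = F.cross_entropy(caption_logits.flatten(0, 1),  # (B*T, V)
                               caption_targets.flatten(),     # (B*T,)
                               ignore_index=pad_id)

    # Both objectives are optimized jointly over the shared feature space;
    # lambda_cap is an assumed balancing weight.
    return loss_con + lambda_cap * loss_cap
```

In a setup like this, the scene embedding would come from the spatially-aware 3D scene encoder, the text embedding from the frozen CLIP text tower, and `caption_logits` from a decoder conditioned on the scene features; the exact pooling and weighting choices here are illustrative.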