KGMEL is a framework for multimodal entity linking that improves the alignment of mentions (text and images) with knowledge base entities by incorporating knowledge graph (KG) triples. It operates in three stages: generating candidate triples for each mention, learning joint mention-entity representations via contrastive learning to retrieve candidate entities, and refining those candidates with large language models. Experiments show that KGMEL outperforms existing methods in both accuracy and efficiency. A sketch of the contrastive retrieval stage follows.
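The sketch below illustrates the contrastive-alignment idea in PyTorch: a mention representation is enriched with pooled KG-triple embeddings and then matched against entity embeddings with a symmetric InfoNCE loss. The helper names (`fuse_mention`, `contrastive_loss`), the mean-pool fusion, and the 512-dimensional toy batch are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F


def fuse_mention(text_emb, image_emb, triple_embs):
    """Fold KG-triple context into a mention representation.

    Mean-pooling the triples and averaging across modalities is an
    illustrative fusion choice, not KGMEL's exact mechanism.
    """
    triple_ctx = triple_embs.mean(dim=0)  # (num_triples, d) -> (d,)
    return (text_emb + image_emb + triple_ctx) / 3.0


def contrastive_loss(mention_embs, entity_embs, tau=0.07):
    """Symmetric InfoNCE: row i of mentions matches row i of entities."""
    m = F.normalize(mention_embs, dim=-1)
    e = F.normalize(entity_embs, dim=-1)
    logits = m @ e.t() / tau
    targets = torch.arange(logits.size(0), device=logits.device)
    # average both retrieval directions: mention->entity and entity->mention
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2


# Toy usage: a batch of 8 mention/entity pairs in a 512-d space.
mentions = torch.stack([
    fuse_mention(torch.randn(512), torch.randn(512), torch.randn(4, 512))
    for _ in range(8)
])
entities = torch.randn(8, 512)
loss = contrastive_loss(mentions, entities)
```

Training with this objective pulls each mention toward its gold entity and pushes it away from the other entities in the batch, so nearest-neighbor search over entity embeddings can serve as the candidate-retrieval step before LLM refinement.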
3D CoCa is a unified framework for 3D captioning that combines contrastive vision-language learning with caption generation in a single architecture. Leveraging a frozen CLIP backbone together with a spatially aware 3D scene encoder, it jointly optimizes contrastive and captioning objectives in a shared feature space, which improves spatial reasoning and semantic grounding. Extensive experiments show that 3D CoCa surpasses prior methods by clear margins on standard 3D captioning benchmarks.
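A minimal sketch of the joint objective, assuming a point cloud with xyz+rgb features and teacher-forced caption tokens. The `Joint3DCaptioner` class, the `nn.Embedding` stand-in for the frozen CLIP text encoder, the small transformer scene encoder, and the `cap_weight` loss weighting are all assumptions for illustration; the actual model uses a frozen CLIP backbone and a spatially aware 3D encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Joint3DCaptioner(nn.Module):
    def __init__(self, d_model=512, vocab_size=10000):
        super().__init__()
        # Frozen text encoder: a plain embedding table stands in for CLIP.
        self.text_encoder = nn.Embedding(vocab_size, d_model)
        for p in self.text_encoder.parameters():
            p.requires_grad = False
        # Scene encoder: project per-point features (xyz + rgb) to tokens,
        # then contextualize them with self-attention.
        self.point_proj = nn.Linear(6, d_model)
        self.scene_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Caption decoder cross-attends to the scene tokens.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, points, caption_ids, tau=0.07, cap_weight=1.0):
        scene_tokens = self.scene_encoder(self.point_proj(points))  # (B,N,D)
        text_tokens = self.text_encoder(caption_ids)                # (B,T,D)
        # Contrastive objective: pooled scene vs. pooled caption embeddings
        # live in the same feature space.
        scene_emb = F.normalize(scene_tokens.mean(dim=1), dim=-1)
        text_emb = F.normalize(text_tokens.mean(dim=1), dim=-1)
        logits = scene_emb @ text_emb.t() / tau
        targets = torch.arange(logits.size(0), device=logits.device)
        con_loss = (F.cross_entropy(logits, targets)
                    + F.cross_entropy(logits.t(), targets)) / 2
        # Captioning objective: predict each token from its causal prefix.
        t = caption_ids.size(1) - 1
        causal = torch.triu(
            torch.full((t, t), float("-inf"), device=points.device), diagonal=1)
        dec = self.decoder(text_tokens[:, :-1], scene_tokens, tgt_mask=causal)
        cap_loss = F.cross_entropy(
            self.lm_head(dec).flatten(0, 1), caption_ids[:, 1:].flatten())
        return con_loss + cap_weight * cap_loss


# Toy usage: 2 scenes of 1024 points, captions of 12 token ids.
model = Joint3DCaptioner()
loss = model(torch.randn(2, 1024, 6), torch.randint(0, 10000, (2, 12)))
loss.backward()
```

The single returned loss couples the two objectives: the contrastive term grounds scene embeddings against caption embeddings in the shared space, while the captioning term trains the decoder to generate text conditioned on the scene tokens.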