Large language models adapted for speech modeling often lose naturalness due to a focus on linguistic aspects while neglecting prosodic features. This paper proposes an end-to-end variational approach that automatically encodes continuous speech attributes to enhance semantic tokens, eliminating the need for manual feature selection and resulting in more natural speech generation. The approach shows improved performance according to human evaluations.