It can take months to understand how effective a new AI model really is, because initial impressions often misrepresent its capabilities. Traditional evaluation methods are unreliable, and personal interactions yield only subjective assessments, so it is difficult to tell whether AI progress is stagnating or advancing.