AI and the long term

Andon Labs has built Vending-Bench, a vending machine eval (evaluation benchmark) for LLMs, and written a paper about it. They reached an interesting initial conclusion:

Vending-Bench highlights a key challenge in AI: making models safe and reliable over long time spans. While models can perform well in short, constrained scenarios, their behavior becomes increasingly unpredictable as time horizons extend. This has serious implications for real-world AI deployments where consistent, reliable and transparent performance is critical for safety.