  • That o3 does well on frontier math held-out set is impressive, no doubt

    I think there is still plenty of room for doubt. elliotglazer on Reddit writes:

    Epoch’s lead mathematician here. Yes, OAI funded this and has the dataset, which allowed them to evaluate o3 in-house. We haven’t yet independently verified their 25% claim. To do so, we’re currently developing a hold-out dataset and will be able to test their model without them having any prior exposure to these problems.

    My personal opinion is that OAI’s score is legit (i.e., they didn’t train on the dataset), and that they have no incentive to lie about internal benchmarking performances. However, we can’t vouch for them until our independent evaluation is complete.

    (emphasis mine). So there is good reason to doubt that the “held-out dataset” even exists.