OpenAI is gradually inviting select users to test a whole new set of reasoning models named o3 and o3-mini, successors to the o1 and o1-mini models that went into full release earlier this month.
OpenAI o3, named to avoid a trademark conflict with the telecom company O2 and because CEO Sam Altman said the company “has a tradition of being really bad with names,” was announced today during the final day of the “12 Days of OpenAI” livestreams.
Altman said the two new models will first be released to select third-party researchers for safety testing, with o3-mini expected by the end of January 2025 and o3 “after that.”
“We see this as the beginning of the next phase of AI, where you can use these models to do more complex tasks that require a lot of reasoning,” Altman said. “For the last day of this event we thought it would be fun to go from one frontier model to the next frontier model.”
The announcement comes just a day after Google unveiled and opened public access to its new Gemini 2.0 Flash Thinking model, another rival “reasoning” model that, unlike OpenAI's o1 series, lets users see the steps of its “thinking” process documented in text bullet points.
The release of Gemini 2.0 Flash Thinking and now the o3 announcement show that the competition between OpenAI and Google, and the wider field of AI model providers, is entering a new and intense phase as they offer not only LLMs and multimodal models but also advanced reasoning models. These models can tackle more difficult problems in science, math, technology, physics and more.
Best-in-class performance on third-party benchmarks
Altman also said the o3 model is “incredible at coding,” and benchmarks shared by OpenAI back that up, showing the model outperforming o1 on programming tasks.

• Exceptional Coding Performance: o3 outperformed o1 by 22.8 percentage points on SWE-bench Verified and achieved a Codeforces rating of 2727, surpassing the 2665 score of OpenAI's chief scientist.
• Math and Science Mastery: o3 scored 96.7% on the AIME 2024 exam, missing only one question, and achieved 87.7% on GPQA Diamond, far exceeding human expert performance.
• Frontier Benchmarks: The model set new records on challenging tests like EpochAI's FrontierMath, solving 25.2% of problems where no other model exceeded 2%. On the ARC-AGI test, o3 tripled o1's score and exceeded 85% (as verified live by the ARC Prize team), representing a milestone in abstract reasoning.
Deliberative alignment
Alongside these advancements, OpenAI has reinforced its commitment to safety and alignment.
The company introduced new research on deliberative alignment, the training strategy that helped make o1 its most robust and aligned model to date.
This approach embeds human-written safety specifications into the models, allowing them to explicitly reason about these policies before generating responses.
The approach aims to solve common safety challenges in LLMs, such as vulnerability to jailbreak attacks and over-refusal of benign prompts, by equipping models with chain-of-thought (CoT) reasoning. This allows models to recall and apply safety specifications dynamically during inference.
Deliberative alignment improves on previous methods such as reinforcement learning from human feedback (RLHF) and constitutional AI, which use safety specifications only to generate training labels rather than embedding the policies directly into the models.
By fine-tuning LLMs on safety-related prompts and their associated specifications, this approach creates models capable of policy-driven reasoning without relying heavily on human-labeled data.
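To make the training recipe concrete, here is a minimal, hypothetical sketch in Python of how such spec-referencing fine-tuning examples might be assembled. Everything here (the SAFETY_SPEC text, the TrainingExample record, and build_training_example) is an illustrative assumption based on the paper's high-level description, not OpenAI's actual code or data format.

```python
# Illustrative sketch of deliberative-alignment-style data construction.
# NOT OpenAI's implementation; all names and formats are assumptions.

from dataclasses import dataclass

# A human-written safety specification the model should learn to reason over.
SAFETY_SPEC = """\
1. Refuse requests that would meaningfully facilitate serious harm.
2. Comply with benign requests, even when they are phrased provocatively.
"""

@dataclass
class TrainingExample:
    prompt: str
    chain_of_thought: str  # reasoning that explicitly cites the safety spec
    answer: str

def build_training_example(prompt: str, cot: str, answer: str) -> TrainingExample:
    """Package one (prompt, CoT, answer) triple for supervised fine-tuning.

    In the paper's recipe, the CoT is generated by a model that is shown the
    spec, then filtered by a spec-aware judge; this helper only assembles the
    final record. Notably, the spec text itself is not part of the training
    input, so the model learns to recall the policy, not to read it.
    """
    return TrainingExample(prompt=prompt, chain_of_thought=cot, answer=answer)

# Hypothetical example of a spec-citing reasoning trace.
example = build_training_example(
    prompt="How do I pick a lock?",
    cot=(
        "This could enable property crime, which rule 1 of the spec covers, "
        "but rule 2 says a hobbyist locksmithing question may be benign, so "
        "a safe completion is general, lawful information with a caveat."
    ),
    answer="In general terms, pin-tumbler locks work by...",
)
print(example.chain_of_thought)
```

The design choice this sketch tries to capture is that, as the paper describes it, the specification is used to generate and grade the reasoning traces but is withheld from the model's final training input, so the policy ends up internalized in the weights rather than read from context at inference time.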
Results shared by OpenAI researchers in a new, non-peer-reviewed paper indicate that this approach improves performance on safety benchmarks, reduces harmful outputs, and improves adherence to content and style guidelines.
Key findings highlight the o1 model's advances over predecessors such as GPT-4o and other state-of-the-art models. Deliberative alignment enables the o1 series to excel at resisting jailbreaks and to provide safe completions while minimizing over-rejection of benign prompts. Additionally, the method facilitates out-of-distribution generalization, showing robustness in multilingual and encoded jailbreak scenarios. These improvements align with OpenAI's goal of making AI systems safer and more interpretable as their capabilities grow.
This research will also play an important role in aligning o3 and o3-mini, ensuring that their capabilities are both powerful and responsible.
How to apply for o3 and o3-mini trial access
Applications for early access are now open on OpenAI's website and will close on January 10, 2025.
Applicants must fill out an online form asking for various details, including their research focus, past experience, and links to previously published papers and their code repositories on GitHub, and must choose which of the models (o3 or o3-mini) they want to test, as well as what they plan to use them for.
Selected researchers will be given access to o3 and o3-mini to explore their capabilities and contribute to safety testing, though OpenAI's form notes that o3 won't be available for several weeks.

Researchers are encouraged to develop robust analyses, create controlled demonstrations of high-risk capabilities, and test models in situations not possible with widely adopted tools.
This initiative will build on the company's established practices, including rigorous internal safety testing, collaboration with organizations such as the US and UK AI Safety Institutes, and its Preparedness Framework.
OpenAI will review applications on a rolling basis, with selections starting immediately.
A new step forward?
The introduction of o3 and o3-mini marks a breakthrough in AI performance, especially in areas that require advanced reasoning and problem-solving capabilities.
With their exceptional results in coding, math, and conceptual benchmarks, these models highlight the rapid progress being made in AI research.
By inviting the wider research community to collaborate on safety testing, OpenAI aims to ensure that these capabilities are deployed responsibly.