A new, challenging AGI test stumps most AI models


The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher François Chollet, announced in a blog post on Monday that it has created a new, challenging test to measure the general intelligence of leading AI models.

So far, the new test, called ARC-AGI-2, has stumped most models.

“Reasoning” AI models such as OpenAI's o1-pro and DeepSeek's R1 score between 1% and 1.3% on ARC-AGI-2, according to the Arc Prize leaderboard. Powerful non-reasoning models, including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash, score around 1%.

The ARC-AGI tests consist of puzzle-like problems in which an AI must identify visual patterns from a collection of grids of different-colored squares and generate the correct “answer” grid. The problems are designed to force an AI to adapt to problems it has never seen before.
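For readers curious about what these puzzles look like under the hood, publicly released ARC-AGI tasks are distributed as JSON files: each task holds a few demonstration input/output grid pairs plus held-out test inputs, with grids encoded as small 2-D arrays of integers 0–9 (one integer per color). Below is a minimal sketch of loading and inspecting one such task; the file name is a hypothetical example.

```python
import json

# Minimal sketch of the public ARC-AGI task format: a JSON object with
# "train" (demonstration pairs) and "test" (held-out inputs).
# Grids are lists of lists of integers 0-9, each integer encoding a color.
# "arc_task_example.json" is a hypothetical file name for illustration.
with open("arc_task_example.json") as f:
    task = json.load(f)

# Each demonstration pair shows the hidden transformation once.
for pair in task["train"]:
    inp, out = pair["input"], pair["output"]
    print(f"demo: {len(inp)}x{len(inp[0])} grid -> {len(out)}x{len(out[0])} grid")

# A solver must infer the rule from the demos and produce the output
# grid for each test input.
for pair in task["test"]:
    grid = pair["input"]
    print(f"test input: {len(grid)}x{len(grid[0])} grid")
```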

The Arc Prize Foundation had more than 400 people take ARC-AGI-2 to establish a human baseline. On average, these “panels” of test-takers answered 60% of the questions correctly – far better than any of the models' scores.

An example question from ARC-AGI-2 (credit: Arc Prize).

In a post on X, Chollet claimed that ARC-AGI-2 is a better measure of an AI model's actual intelligence than the first iteration of the test, ARC-AGI-1. The Arc Prize Foundation's tests are aimed at evaluating whether an AI system can efficiently acquire new skills outside of the data it was trained on.

Chollet said that, unlike ARC-AGI-1, the new test prevents AI models from relying on “brute force” (extensive computing power) to find solutions, something Chollet previously identified as a major flaw of ARC-AGI-1.

To address the first test's flaws, ARC-AGI-2 introduces a new metric: efficiency. It also requires models to interpret patterns on the fly instead of relying on memorization.

“Intelligence is not merely defined by the ability to solve problems or achieve high scores,” Greg Kamradt, co-founder of the Arc Prize Foundation, wrote in a blog post. “The efficiency with which those capabilities are acquired and deployed is a crucial, defining component. The core question being asked is not just, ‘Can AI acquire [the] skill to solve a task?’ but also, ‘At what efficiency or cost?’”

OpenAI's advanced reasoning model, o3, outperformed all other AI models on ARC-AGI-1 and matched human performance on the test. However, as we noted at the time, o3's performance gains on ARC-AGI-1 came with a hefty price tag.

The version of OpenAI's o3 model that first reached new heights on ARC-AGI-1, o3 (low), scored 75.7% on that test but gets just 4% on ARC-AGI-2, using $200 worth of computing power per task.
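To put that per-task figure in perspective, here is a back-of-the-envelope sketch; the 100-task evaluation set size is an illustrative assumption, not an official figure.

```python
# Back-of-the-envelope cost arithmetic for o3 (low) on ARC-AGI-2.
# The 100-task evaluation set size is an assumption for illustration.
TASKS = 100
COST_PER_TASK_USD = 200   # per-task compute cost reported by Arc Prize
SCORE = 0.04              # o3 (low) score on ARC-AGI-2

total_cost = TASKS * COST_PER_TASK_USD
solved = TASKS * SCORE

print(f"total compute: ${total_cost:,}")                       # $20,000
print(f"~cost per solved task: ${total_cost / solved:,.0f}")   # $5,000
```

Under these assumptions, each correctly solved ARC-AGI-2 task costs on the order of thousands of dollars in compute, which is exactly the kind of inefficiency the new metric is designed to penalize.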

Comparison of frontier AI model performance on ARC-AGI-1 and ARC-AGI-2 (credit: Arc Prize).

ARC-AGI-2's arrival comes as many in the tech industry are calling for new, unsaturated benchmarks to measure AI progress. Hugging Face co-founder Thomas Wolf recently told TechCrunch that the AI industry lacks sufficient tests to measure the key traits of so-called artificial general intelligence, including creativity.

Alongside the new benchmark, the Arc Prize Foundation announced a new Arc Prize 2025 contest.
