Evaluation of OpenAI o1: Opportunities and Challenges of AGI

27 September 2024
Tianyang Zhong
Zhengliang Liu
Yi Pan
Yutong Zhang
Yifan Zhou
Shizhe Liang
Zihao Wu
Yanjun Lyu
Peng Shu
Xiaowei Yu
Chao-Yang Cao
Hanqi Jiang
Hanxu Chen
Yiwei Li
Junhao Chen
Huawen Hu
Yihen Liu
Huaqin Zhao
Shaochen Xu
Haixing Dai
Lin Zhao
Ruidong Zhang
Wei Zhao
Zhenyuan Yang
Jingyuan Chen
Peilong Wang
Wei Ruan
Hui Wang
Huan Zhao
Jing Zhang
Yiming Ren
Shihuan Qin
Tong Chen
Jiaxi Li
Arif Hassan Zidan
Afrar Jahin
Minheng Chen
Sichen Xia
J. Holmes
Yan Zhuang
Jiaqi Wang
Bochen Xu
Weiran Xia
Jichao Yu
Kaibo Tang
Yaxuan Yang
B. S.
Tao Yang
Guoyu Lu
Xianqiao Wang
Lilong Chai
He Li
Jin Lu
Lichao Sun
Xin Zhang
Bao Ge
Xintao Hu
Lian-Cheng Zhang
Hua Zhou
Lu Zhang
Shu Zhang
Ninghao Liu
Bei Jiang
Linglong Kong
Zhen Xiang
Yudan Ren
Jun Liu
Xi Jiang
Yu Bao
Wei Zhang
Xiang Li
Gang Li
Wei Liu
Dinggang Shen
Andrea Sikora
Xiaoming Zhai
Dajiang Zhu
Tianming Liu
Abstract

This comprehensive study evaluates the performance of OpenAI's o1-preview large language model across a diverse array of complex reasoning tasks spanning multiple domains, including computer science, mathematics, natural sciences, medicine, linguistics, and social sciences. Through rigorous testing, o1-preview demonstrated remarkable capabilities, often achieving human-level or superior performance in areas ranging from coding challenges to scientific reasoning and from language processing to creative problem-solving. Key findings include:

- An 83.3% success rate in solving complex competitive programming problems, surpassing many human experts.
- Superior ability in generating coherent and accurate radiology reports, outperforming other evaluated models.
- 100% accuracy in high school-level mathematical reasoning tasks, with detailed step-by-step solutions.
- Advanced natural language inference capabilities across general and specialized domains such as medicine.
- Impressive performance in chip design tasks, outperforming specialized models in areas such as EDA script generation and bug analysis.
- Remarkable proficiency in anthropology and geology, demonstrating deep understanding and reasoning in these specialized fields.
- Strong capabilities in quantitative investing, with comprehensive financial knowledge and statistical modeling skills.
- Effective performance in social media analysis, including sentiment analysis and emotion recognition.

The model excelled particularly in tasks requiring intricate reasoning and knowledge integration across various fields. While some limitations were observed, including occasional errors on simpler problems and challenges with certain highly specialized concepts, the overall results indicate significant progress towards artificial general intelligence.
