“PhD-level AI” seems to have become the latest buzzword among tech industry executives and AI enthusiasts online.
The term broadly refers to AI models supposedly capable of executing tasks that require PhD-level expertise. The hype around PhD-level AI comes a week after reports that OpenAI plans to roll out a handful of specialised AI agents, including a “PhD-level research” tool priced at $20,000 per month.
OpenAI also plans to launch a high-income knowledge worker agent at $2,000 a month, and a software developer agent at $10,000 a month, according to a report by The Information.
The claim is that a PhD-level AI agent will be able to tackle problems that typically require years of specialised academic training. Such agents are expected to conduct advanced research by analysing large datasets and generating comprehensive research reports.
However, some critics have dismissed the “PhD-level” label as a marketing term. Others have raised concerns over the accuracy and reliability of AI-generated research reports.
Can AI models reason like a PhD researcher?
OpenAI has claimed that its flagship o1 and o3 reasoning models use a technique called “private chain of thought” to mirror how human researchers work through problems.
Unlike traditional large language models (LLMs), reasoning AI models do not immediately provide responses to user prompts. Instead, they use machine learning techniques to run through an internal dialogue and iteratively work out the steps involved in solving complex problems.
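OpenAI has not disclosed how this internal dialogue is implemented. As a rough sketch of the general idea only, the Python snippet below simulates a two-pass “reason privately, answer publicly” loop using explicit chain-of-thought prompting against OpenAI’s standard Chat Completions API. The model name, prompts, and helper function are assumptions for illustration, not OpenAI’s actual mechanism.

```python
# Illustrative sketch only: OpenAI has not published how "private chain of
# thought" works. This approximates the idea by generating intermediate
# reasoning first and showing the user only the final answer.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def answer_with_hidden_reasoning(question: str) -> str:
    # Pass 1: ask the model to reason step by step (the "internal dialogue").
    reasoning = client.chat.completions.create(
        model="gpt-4o",  # model name is an assumption for illustration
        messages=[
            {"role": "system",
             "content": "Reason step by step. Do not give a final answer yet."},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content

    # Pass 2: condense the scratchpad into a final answer. The intermediate
    # steps are never shown to the user, mirroring the "private" aspect.
    final = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Using the scratchpad, state only the final answer."},
            {"role": "user",
             "content": f"Question: {question}\n\nScratchpad:\n{reasoning}"},
        ],
    ).choices[0].message.content
    return final


print(answer_with_hidden_reasoning("What is the 10th Fibonacci number?"))
```

In OpenAI’s reasoning models this iteration happens inside a single model rather than across separate API calls, but the trade-off is the same: more compute spent before responding in exchange for better performance on multi-step problems.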
PhD-level AI agents should ideally be able to perform complex tasks such as analysing medical research data, supporting climate modelling, and handling routine aspects of research work.
How well do existing AI models perform on key benchmarks?
In the past, OpenAI has claimed that its o1 model performed similarly to human PhD students on certain science, coding, and math tests.
The company further claimed that its o3 model achieved 87.5 per cent in high-compute testing on the ARC-AGI visual reasoning benchmark, surpassing the 85 per cent score by humans.
o3 scored 87.7 per cent on the GPQA Diamond benchmark, which contains graduate-level biology, physics, and chemistry questions, while it received 96.7 per cent on the 2024 American Invitational Mathematics Exam, missing just one question (the exam’s two papers have 30 questions in total), according to OpenAI.
Furthermore, o3 reportedly solved 25.2 per cent of problems in Frontier Math, a benchmark designed by EpochAI, with other models trailing at two per cent. To be sure, the non-profit revealed in December last year that OpenAI funded the creation of the Frontier Math benchmark for evaluating AI models.
What are the major concerns with PhD-level AI agents?
While the benchmark performances of simulated reasoning models might be considered impressive, experts have pointed out that these models can still generate plausible-sounding but factually inaccurate information.
The abilities of AI models to engage in creative thinking and intellectual scepticism have also been questioned. OpenAI has not confirmed the prices of its upcoming specialised AI agents, but users on social media opined that “most PhD students, including the brightest stars who can do way better work than any current LLMs—are not paid $20K / month.”
The buzz around OpenAI’s rumoured launch has also reached a fever pitch, with the company’s own AI researcher, Noam Brown, noting that there’s “lots of vague AI hype on social media these days.”
“There are good reasons to be optimistic about further progress, but plenty of unsolved research problems remain,” Brown said in a post on X.