According to a new study, the answer is “not yet.” GPT-4 Turbo couldn’t get most of the answers right: it had a balanced accuracy of 46%.
For the past decade, complexity scientist Peter Turchin has been working with collaborators to bring together the most current and structured body of knowledge about human history in one place: the Seshat Global History Databank. Over the past year, together with computer scientist Maria del Rio-Chanona, he has begun to wonder if artificial intelligence chatbots could help historians and archaeologists gather data and better understand the past. As a first step, they wanted to assess the A.I. tools' understanding of historical knowledge.
In collaboration with an international team of experts, they decided to evaluate the historical knowledge of advanced A.I. models such as GPT-4, Llama, and Gemini.
“Large language models (LLMs), such as ChatGPT, have been enormously successful in some fields—for example, they have largely succeeded in replacing paralegals. But when it comes to making judgments about the characteristics of past societies, especially those located outside North America and Western Europe, their ability to do so is much more limited,” says Turchin, who leads the Complexity Science Hub's (CSH) research group on social complexity and collapse.
One surprising finding, which emerged from this study, was just how bad these models were. This result shows that artificial ‘intelligence’ is quite domain-specific. LLMs do well in some contexts, but very poorly, compared to humans, in others.
Peter Turchin, from the Complexity Science Hub
The results of the study were presented recently at the NeurIPS conference, A.I.’s premier annual gathering, in Vancouver. GPT-4 Turbo, the best-performing model, scored 46% on a four-choice question test. According to Turchin and his team, although this result is an improvement over the 25% baseline expected from random guessing, it highlights the considerable gaps in A.I.'s understanding of historical knowledge.
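To make that comparison concrete, here is a minimal sketch, assuming scikit-learn's balanced_accuracy_score and simulated four-choice answers; the question data and the 46% correct-answer rate below are illustrative assumptions, not the study's actual benchmark items or evaluation code. Balanced accuracy averages per-class recall, so on a roughly balanced four-option test it sits near 25% for uniform random guessing.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)

# Illustrative only: 1,000 four-choice questions with answers labelled 0-3.
true_answers = rng.integers(0, 4, size=1000)

# A model guessing uniformly at random lands near the 25% chance baseline.
random_guesses = rng.integers(0, 4, size=1000)
print("random baseline:", balanced_accuracy_score(true_answers, random_guesses))

# A hypothetical model that answers ~46% of questions correctly (roughly the
# reported GPT-4 Turbo score) and picks one of the wrong options otherwise.
is_correct = rng.random(1000) < 0.46
wrong_option = (true_answers + rng.integers(1, 4, size=1000)) % 4
model_answers = np.where(is_correct, true_answers, wrong_option)
print("illustrative model:", balanced_accuracy_score(true_answers, model_answers))
```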
“I thought the A.I. chatbots would do a lot better,” says del Rio-Chanona, the study’s corresponding author. “History is often viewed as facts, but sometimes interpretation is necessary to make sense of it,” adds del Rio-Chanona, an external faculty member at CSH and an assistant professor at University College London.
Setting a Benchmark for LLMs
This new assessment, the first of its kind, challenged these A.I. systems to answer graduate- and expert-level questions similar to those answered in Seshat, with the researchers using the knowledge in Seshat to check the accuracy of the A.I. answers. Seshat is a vast, evidence-based resource that compiles historical knowledge across 600 societies worldwide, spanning more than 36,000 data points and over 2,700 scholarly references.
“We wanted to set a benchmark for assessing the ability of these LLMs to handle expert-level history knowledge,” explains first author Jakob Hauser, a resident scientist at CSH. “The Seshat Databank allows us to go beyond ‘general knowledge’ questions. A key component of our benchmark is that we not only test whether these LLMs can identify correct facts, but also explicitly ask whether a fact can be proven or inferred from indirect evidence.”
Disparities Across Time Periods and Geographic Regions
The benchmark also reveals other important insights into the ability of current chatbots—a total of seven models from the Gemini, OpenAI, and Llama families—to comprehend global history. For instance, they were most accurate in answering questions about ancient history, particularly from 8000 BCE to 3000 BCE. However, their accuracy dropped sharply for more recent periods, with the largest gaps in understanding events from 1500 CE to the present.
In addition, the results highlight the disparity in model performance across geographic regions. OpenAI’s models performed better for Latin America and the Caribbean, while Llama performed best for Northern America. Both OpenAI’s and Llama’s models performed worse for Sub-Saharan Africa, and Llama also performed poorly for Oceania. This suggests potential biases in the training data, which may overemphasize certain historical narratives while neglecting others, according to the study.
Better on Legal Systems, Worse on Discrimination
The benchmark also found differences in performance across categories. Models performed best on legal systems and social complexity. “But they struggled with topics such as discrimination and social mobility,” says del Rio-Chanona.
"The main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history. They're great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they're not yet up to the task,” adds del Rio-Chanona. According to the benchmark, the model that performed best was GPT-4 Turbo, with a balanced accuracy of 46%, while the weakest was Llama-3.1-8B with 33.6%.
Next Steps
Del Rio-Chanona and the other researchers—from CSH, the University of Oxford, and the Alan Turing Institute—are committed to expanding the dataset and improving the benchmark. They plan to include more data from underrepresented regions and incorporate more complex historical questions, according to Hauser.
"We plan to continue refining the benchmark by integrating additional data points from diverse regions, especially the Global South. We also look forward to testing more recent LLM models, such as o3, to see if they can bridge the gaps identified in this study," says Hauser.
The CSH scientist emphasizes that the benchmark's findings can be valuable to both historians and A.I. developers. For historians, archaeologists, and social scientists, knowing the strengths and limitations of A.I. chatbots can help guide their use in historical research. For A.I. developers, these results highlight areas for improvement, particularly in mitigating regional biases and enhancing the models' ability to handle complex, nuanced historical knowledge.
About the study
The paper “Large Language Models’ Expert-level Global History Knowledge Benchmark (HiST-LLM),” by Jakob Hauser, Daniel Kondor, Jenny Reddish, Majid Benam, Enrico Cioni, Federica Villa, James S. Bennett, Daniel Hoyer, Pieter François, Peter Turchin, and R. Maria del Rio-Chanona, was presented at the NeurIPS conference, in Vancouver, in December.
The study was funded by Clariah-AT, the University of Oxford, and the Alan Turing Institute.
About CSH
The Complexity Science Hub (CSH) is Europe’s research center for the study of complex systems. We derive meaning from data from a range of disciplines—economics, medicine, ecology, and the social sciences—as a basis for actionable solutions for a better world. Established in 2015, we have grown to over 70 researchers, driven by the increasing demand to gain a genuine understanding of the networks that underlie society, from healthcare to supply chains. Through our complexity science approaches linking physics, mathematics, and computational modeling with data and network science, we develop the capacity to address today’s and tomorrow’s challenges.