Developing artificial intelligence (AI) for vertical domains requires a solid data foundation for both training and evaluation. In this work, we introduce TrialPanorama, a large-scale, structured database comprising 1,657,476 clinical trial records aggregated from 15 global sources. The database captures key aspects of trial design and execution, including trial setups, interventions, conditions, biomarkers, and outcomes, and links them to standard biomedical ontologies such as DrugBank and MedDRA. This structured, ontology-grounded design enables TrialPanorama to serve as a unified, extensible resource for a wide range of clinical trial tasks, including trial planning, design, and summarization. To demonstrate its utility, we derive a suite of benchmark tasks directly from the TrialPanorama database. The benchmark spans eight tasks across two categories: three for systematic review (study search, study screening, and evidence summarization) and five for trial design (arm design, eligibility criteria, endpoint selection, sample size estimation, and trial completion assessment). Experiments with five state-of-the-art large language models (LLMs) show that while general-purpose LLMs exhibit some zero-shot capability, their performance is still inadequate for high-stakes clinical trial workflows. We release the TrialPanorama database and benchmark to facilitate further research on AI for clinical trials.
The TrialPanorama database comprises ten main tables, including studies, conditions, drugs, endpoints, biomarkers, outcomes, adverse events, results, and relations.
These tables are organized into four conceptual clusters.
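As a minimal illustration of how these tables might be joined in practice, the sketch below loads a few of them with pandas and links studies to their conditions through the relations table. The file paths and column names (`study_id`, `entity_id`, `entity_type`, `condition_id`, `condition_name`) are assumptions for illustration only; the actual distribution format and schema may differ.

```python
import pandas as pd

# Hypothetical file layout; the released TrialPanorama format may differ.
studies = pd.read_parquet("trialpanorama/studies.parquet")
relations = pd.read_parquet("trialpanorama/relations.parquet")
conditions = pd.read_parquet("trialpanorama/conditions.parquet")

# Join studies to their conditions through the relations table, assuming
# `relations` links entities by (study_id, entity_id, entity_type).
study_conditions = (
    relations[relations["entity_type"] == "condition"]
    .merge(studies, on="study_id")
    .merge(conditions, left_on="entity_id", right_on="condition_id")
)
print(study_conditions[["study_id", "condition_name"]].head())
```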
The TrialPanorama benchmark contains eight tasks across two categories: Systematic Review Tasks and Trial Design Tasks.
Each task is designed to evaluate AI capabilities in supporting different aspects of clinical trial research and design.
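To make the task layout concrete, here is a small Python registry mapping each benchmark task to a primary metric. The task names follow the paper; the metrics echo those quoted in the results below, except for eligibility criteria and trial completion assessment, whose metrics are assumptions here.

```python
# Task registry for the TrialPanorama benchmark. Task names follow the paper;
# metrics marked "assumed" are illustrative guesses, the rest echo the
# results reported below.
BENCHMARK_TASKS = {
    "systematic_review": {
        "study_search": "recall@100",
        "study_screening": "accuracy",
        "evidence_summarization": "macro_f1",
    },
    "trial_design": {
        "arm_design": "accuracy",
        "eligibility_criteria": "accuracy",         # assumed
        "endpoint_selection": "accuracy",
        "sample_size_estimation": "accuracy",
        "trial_completion_assessment": "accuracy",  # assumed
    },
}

# Sanity check: eight tasks across the two categories.
assert sum(len(tasks) for tasks in BENCHMARK_TASKS.values()) == 8
```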
Across the systematic review tasks, LLM performance was task-dependent: o3-mini excelled at study search (Recall@100 = 27.6) and study screening (accuracy = 71.6), while LLaMA-70B unexpectedly led in evidence summarization (Macro-F1 = 76.7). The modest performance across all models, particularly in study search (best Recall@100 of only 27.6), shows that current LLMs still have significant room for improvement before they can be relied on for high-stakes clinical trial applications.
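For reference, Recall@K, the search metric quoted above, is the fraction of truly relevant studies that appear among the top K retrieved candidates. A minimal implementation:

```python
from collections.abc import Sequence

def recall_at_k(retrieved: Sequence[str], relevant: set[str], k: int = 100) -> float:
    """Fraction of relevant study IDs found in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

# Example: 2 of 4 relevant studies appear in the top 100 -> Recall@100 = 0.5
print(recall_at_k(["s1", "s9", "s3"], {"s1", "s3", "s5", "s7"}, k=100))
```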
While LLMs performed reasonably well on arm design (o3-mini reached 85.9% accuracy), they struggled with the more complex statistical and predictive tasks: endpoint selection (best accuracy 69.1%), sample size estimation (no model exceeded 26% accuracy), and trial completion assessment (near-random performance). These results indicate that current general-purpose LLMs, despite handling explicit design elements well, cannot yet be reliably deployed for clinical trial tasks requiring statistical reasoning, feasibility forecasting, or context-aware endpoint selection.
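To illustrate the kind of statistical reasoning the sample size estimation task demands, the sketch below applies the standard normal-approximation formula for a two-sided, two-sample proportion test. This is a textbook calculation, not the benchmark's scoring procedure, and the example effect size is hypothetical.

```python
import math
from scipy.stats import norm

def sample_size_two_proportions(p1: float, p2: float,
                                alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-sided two-sample proportion test,
    via the normal approximation: n = (z_{1-a/2} + z_{1-b})^2 * var / delta^2."""
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)           # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Hypothetical trial: detect a 60% vs. 45% response rate at 80% power.
print(sample_size_two_proportions(0.60, 0.45))  # 171 participants per arm
```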
@article{wang2025trialpanorama,
  title  = {TrialPanorama: Database and Benchmark for Systematic Review and Design of Clinical Trials},
  author = {Wang, Zifeng and Jin, Qiao and Lin, Jiacheng and Gao, Junyi and Pradeepkumar, Jathurshan and Jiang, Pengcheng and Danek, Benjamin and Lu, Zhiyong and Sun, Jimeng},
  year   = {2025},
}