TrialPanorama: Database and Benchmark for Systematic Review and Design of Clinical Trials

Zifeng Wang1, Qiao Jin2, Jiacheng Lin1, Junyi Gao3,4, Jathurshan Pradeepkumar1, Pengcheng Jiang1, Benjamin Danek1, Zhiyong Lu2, Jimeng Sun1
1University of Illinois Urbana-Champaign
2National Institutes of Health
3University of Edinburgh
4Health Data Research UK

Abstract

Developing artificial intelligence (AI) for vertical domains requires a solid data foundation for both training and evaluation. In this work, we introduce TrialPanorama, a large-scale, structured database comprising 1,657,476 clinical trial records aggregated from 15 global sources. The database captures key aspects of trial design and execution, including trial setups, interventions, conditions, biomarkers, and outcomes, and links them to standard biomedical ontologies such as DrugBank and MedDRA. This structured and ontology-grounded design enables TrialPanorama to serve as a unified, extensible resource for a wide range of clinical trial tasks, including trial planning, design, and summarization. To demonstrate its utility, we derive a suite of benchmark tasks directly from the TrialPanorama database. The benchmark spans eight tasks across two categories: three for systematic review (study search, study screening, and evidence summarization) and five for trial design (arm design, eligibility criteria, endpoint selection, sample size estimation, and trial completion assessment). Experiments with five state-of-the-art large language models (LLMs) show that while general-purpose LLMs exhibit some zero-shot capability, their performance remains inadequate for high-stakes clinical trial workflows. We release the TrialPanorama database and benchmark to facilitate further research on AI for clinical trials.

Figure 1: TrialPanorama contains two parts: the database and the benchmark. The database is a collection of 1,657,476 clinical trial records aggregated from 15 global sources. The benchmark is a collection of 8 tasks across two categories: systematic review and trial design.

Database

The TrialPanorama database comprises ten main tables, including studies, conditions, drugs, endpoints, biomarkers, outcomes, adverse events, results, and relations.

These tables are organized into four conceptual clusters (a minimal loading sketch follows the list):

  • Trial attributes: Captures core metadata such as study title, brief summary, sponsor, start year, and recruitment status.
  • Trial protocols: Describes the setup and design of the study, including tested drugs, targeted conditions, and patient group allocations.
  • Trial results: Encompasses reported findings, including the trial outcomes, adverse events, and efficacy results.
  • Study-level links: Encodes relationships across studies, such as mappings between registry records and corresponding publications, or reviews aggregating multiple studies on a common clinical topic.
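
For concreteness, here is a minimal sketch of how such a table layout could be explored with pandas, assuming the tables ship as Parquet files keyed by a shared study identifier. The file paths and column names (study_id, condition_name, etc.) are illustrative assumptions, not the released schema.

import pandas as pd

# Load two of the main tables (hypothetical file layout).
studies = pd.read_parquet("trialpanorama/studies.parquet")
conditions = pd.read_parquet("trialpanorama/conditions.parquet")

# Trial attributes: core metadata such as title, sponsor, and status.
print(studies[["study_id", "title", "sponsor", "start_year", "status"]].head())

# Trial protocols: link each study to its targeted conditions via a
# shared study identifier, then count distinct studies per condition.
study_conditions = studies.merge(conditions, on="study_id", how="inner")
print(study_conditions.groupby("condition_name")["study_id"].nunique().head())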

Benchmark tasks

The TrialPanorama benchmark contains eight tasks across two categories: systematic review (three tasks) and trial design (five tasks).

  • Study Search: Retrieve relevant clinical trials based on a research question
  • Study Screening: Determine if a study meets specific inclusion criteria
  • Evidence Summarization: Synthesize findings across multiple related studies
  • Arm Design: Design intervention and control arms with appropriate dosages
  • Eligibility Criteria: Define inclusion and exclusion criteria for trial participants
  • Endpoint Selection: Determine appropriate primary and secondary endpoints
  • Sample Size Estimation: Calculate required participant numbers for statistical power
  • Trial Completion Assessment: Predict likelihood of successful trial completion

Each task is designed to evaluate AI capabilities in supporting different aspects of clinical trial research and design.
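
As an illustration of how a benchmark instance could be posed to a model, the sketch below frames study screening as binary classification. The prompt wording and the call_llm helper are hypothetical; the benchmark's actual input format may differ.

def build_screening_prompt(criteria: str, abstract: str) -> str:
    # Frame the decision as a strict binary choice to ease answer parsing.
    return (
        "You are screening studies for a systematic review.\n"
        f"Inclusion criteria:\n{criteria}\n\n"
        f"Candidate study abstract:\n{abstract}\n\n"
        "Answer with exactly one word: INCLUDE or EXCLUDE."
    )

def screen_study(criteria: str, abstract: str, call_llm) -> bool:
    # `call_llm` is any callable mapping a prompt string to a completion
    # string (hypothetical; substitute your model client of choice).
    answer = call_llm(build_screening_prompt(criteria, abstract))
    return answer.strip().upper().startswith("INCLUDE")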

Benchmark results: Systematic review tasks

Across the systematic review tasks, LLM performance was task-dependent: o3-mini led on study search (Recall@100 = 27.6) and study screening (accuracy = 71.6), while LLaMA-70B unexpectedly led on evidence summarization (Macro-F1 = 76.7). Performance was modest overall, particularly on study search, where the best Recall@100 was only 27.6, indicating that current LLMs still have significant room for improvement before they can be relied on in high-stakes clinical trial applications.
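
For reference, the study search metric can be computed as below. This follows the standard definition of Recall@K, the fraction of relevant studies that appear among the top K retrieved results, though the benchmark's exact averaging across queries may differ; the IDs here are illustrative.

def recall_at_k(retrieved: list, relevant: set, k: int = 100) -> float:
    # Fraction of the relevant study IDs that appear in the top-k results.
    if not relevant:
        return 0.0
    hits = sum(1 for study_id in retrieved[:k] if study_id in relevant)
    return hits / len(relevant)

# Example: 2 of 4 relevant studies retrieved within the top 100 -> 0.5.
print(recall_at_k(["nct1", "nct2", "nct3"], {"nct1", "nct3", "nct8", "nct9"}))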

Benchmark results: Trial design tasks

While LLMs performed reasonably well on arm design (o3-mini reached 85.9% accuracy), they struggled with tasks requiring statistical and predictive reasoning: endpoint selection (best accuracy 69.1%), sample size estimation (no model exceeded 26% accuracy), and trial completion assessment (near-random performance). These results indicate that current general-purpose LLMs, despite handling explicit design elements well, cannot yet be reliably deployed for clinical trial tasks that demand statistical reasoning, feasibility forecasting, or context-aware endpoint selection.
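
To illustrate the kind of statistical reasoning that sample size estimation demands, the sketch below computes the classic per-arm sample size for comparing two proportions under a normal approximation with unpooled variance. The input rates are illustrative, not drawn from the benchmark.

import math
from scipy.stats import norm

def sample_size_two_proportions(p1: float, p2: float,
                                alpha: float = 0.05, power: float = 0.8) -> int:
    # Participants needed per arm to detect a difference between p1 and p2
    # with a two-sided test at significance level alpha and the given power.
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return math.ceil(n)

# Detecting a 60% vs. 50% response rate at 5% significance and 80% power:
print(sample_size_two_proportions(0.6, 0.5))  # ~385 per arm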

Reference

Please cite our paper if you use our code or results:
@article{wang2025trialpanorama,
  title  = {TrialPanorama: Database and Benchmark for Systematic Review and Design of Clinical Trials},
  author = {Wang, Zifeng and Jin, Qiao and Lin, Jiacheng and Gao, Junyi and Pradeepkumar, Jathurshan and Jiang, Pengcheng and Danek, Benjamin and Lu, Zhiyong and Sun, Jimeng},
  year   = {2025},
}