Can artificial intelligence predict clinical trial outcomes?
This study evaluates the performance of large language models (LLMs) and the HINT model in predicting clinical trial outcomes, focusing on metrics including Balanced Accuracy, Matthews Correlation Coefficient (MCC), Recall, and Specificity. Results show that GPT-4o achieves superior overall performance among LLMs but, like its counterparts (GPT-3.5, GPT-4mini, Llama3), struggles with identifying negative outcomes. In contrast, HINT excels in negative sample recognition and demonstrates resilience to external factors (e.g., recruitment challenges) but underperforms in oncology trials, a major component of the dataset. LLMs exhibit strengths in early-phase trials and simpler endpoints like Overall Survival (OS), while HINT shows consistency across trial phases and excels in complex endpoints (e.g., Objective Response Rate). Trial duration analysis reveals improved model performance for medium- to long-term trials, with GPT-4o and HINT displaying stability and enhanced specificity, respectively. We underscore the complementary potential of LLMs (e.g., GPT-4o, Llama3) and HINT, advocating for hybrid approaches to leverage GPT-4o's predictive power and HINT's specificity in clinical trial outcome forecasting.
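For readers less familiar with the four metrics named above, the following is a minimal sketch of how they are conventionally computed from a binary confusion matrix, treating a successful trial as the positive class. The function name `binary_metrics` and the example counts are illustrative assumptions, not values or code from the study.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute Recall, Specificity, Balanced Accuracy, and MCC.

    tp/fp/tn/fn are confusion-matrix counts with "trial succeeded"
    as the positive class (an assumption made for this sketch).
    """
    # Recall: fraction of truly positive trials predicted positive.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of truly negative trials predicted negative.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Balanced Accuracy: mean of recall and specificity.
    balanced_accuracy = (recall + specificity) / 2
    # MCC: correlation between predicted and true labels, in [-1, 1].
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "recall": recall,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for a model that labels most trials as successes.
example = binary_metrics(tp=90, fp=80, tn=20, fn=10)
print(example)
```

With these made-up counts, recall is 0.9 but specificity is only 0.2, so Balanced Accuracy sits near 0.55 and MCC stays low. This is the pattern the abstract describes for models that struggle to identify negative outcomes, and it is why plain accuracy alone would be misleading on such data.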