A novel language model for predicting serious adverse event results in clinical trials from their prospective registrations

Objectives: With accurate estimates of expected safety results, clinical trials could be better designed and monitored. We evaluated methods for predicting serious adverse event (SAE) results in clinical trials using information only from their registrations prior to the trial. Material and Methods: We analyzed 22,107 two-arm parallel interventional clinical trials fromthis http URLwith structured summary results. Two prediction models were developed: a classifier predicting whether a greater proportion of participants in an experimental arm would have SAEs (area under the receiver operating characteristic curve; AUC) compared to the control arm, and a regression model to predict the proportion of participants with SAEs in the control arms (root mean squared error; RMSE). A transfer learning approach using pretrained language models (e.g., ClinicalT5, BioBERT) was used for feature extraction, combined with a downstream model for prediction. To maintain semantic representation in long trial texts exceeding localized language model input limits, a sliding window method was developed for embedding extraction. Results: The best model (ClinicalT5+Transformer+MLP) had 77.6% AUC when predicting which trial arm had a higher proportion of SAEs. When predicting SAE proportion in the control arm, the same model achieved RMSE of 18.6%. The sliding window approach consistently outperformed direct comparisons. Across 12 classifiers, the average absolute AUC increase was 2.00%, and absolute RMSE reduction was 1.58% across 12 regressors. Discussion: Summary results data fromthis http URLremains underutilized. Predicted results of publicly reported trials provides an opportunity to identify discrepancies between expected and reported safety results.
View on arXiv