Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models

Abstract

Predicting phenotypes with complex genetic bases from a small, interpretable set of variant features remains a challenging task. Conventionally, data-driven approaches are used for this task, yet the high dimensionality of genotype data makes analysis and prediction difficult. Motivated by the extensive knowledge encoded in pre-trained LLMs and their success in processing complex biomedical concepts, we set out to examine the ability of LLMs to perform feature selection and engineering for tabular genotype data, using a novel knowledge-driven framework. We develop FREEFORM (Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling), designed with chain-of-thought and ensembling principles, to select and engineer features using the intrinsic knowledge of LLMs. Evaluated on two distinct genotype-phenotype datasets, genetic ancestry and hereditary hearing loss, this framework outperforms several data-driven methods, particularly in low-shot regimes. FREEFORM is available as an open-source framework on GitHub: this https URL.

@article{lee2025_2410.01795,
  title={Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models},
  author={Joseph Lee and Shu Yang and Jae Young Baik and Xiaoxi Liu and Zhen Tan and Dawei Li and Zixuan Wen and Bojian Hou and Duy Duong-Tran and Tianlong Chen and Li Shen},
  journal={arXiv preprint arXiv:2410.01795},
  year={2025}
}