Automated Classification of EQ-5D Literature in PubMed Using Multi-Phase Learning and LLM-Assisted Co-Training

The EQ-5D is a widely used instrument for measuring health-related quality of life (HRQoL) in support of clinical, economic, and policy decision making. Manually classifying the growing volume of literature that reports EQ-5D data for systematic literature reviews is inefficient and labor-intensive. To address this, we propose a classification framework that uses pre-trained language models (PLMs), including BERT, SciBERT, BioBERT, PubMedBERT, and BioLinkBERT, to categorize PubMed records by whether the article reports EQ-5D data, based on article metadata (titles, abstracts, and keywords). We examine three learning approaches: supervised learning, semi-supervised learning with pseudo-labeling, and co-training with and without large language model (LLM) assistance (GPT and Claude) in pseudo-label generation. We introduce confidence-based ensembling within the co-training framework to improve classification reliability and robustness. The study provides a systematic multi-phase evaluation of the supervised, semi-supervised, and co-training paradigms on PubMed records under different input configurations. Model performance is investigated in stages, using 200 labeled samples and an expanded unlabeled dataset with iterative pseudo-labeling, and is benchmarked across models. Performance is reported using multi-seed evaluation with mean ± standard deviation and 95% confidence intervals. The co-training approach achieved the highest performance, with an F1-score of up to 0.85. LLM-assisted co-training improves weaker model pairs, but it may degrade performance for already strong model combinations and may reduce the number of high-confidence pseudo-labeled samples because of the confidence thresholds. Combined with ensembling and semi-supervised learning, LLMs provide an effective framework for EQ-5D literature screening when labeled data are limited.
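As an illustration of how confidence-based pseudo-label selection in a two-model co-training setup might look, the following is a minimal sketch, not the method used in the study. The function name `select_pseudo_labels`, the averaging rule, the agreement requirement, and the default threshold of 0.9 are all assumptions made for the example; the abstract does not specify these details.

```python
from typing import List, Tuple


def select_pseudo_labels(
    probs_a: List[float],
    probs_b: List[float],
    threshold: float = 0.9,  # assumed value, not taken from the paper
) -> List[Tuple[int, int, float]]:
    """Pick unlabeled samples to pseudo-label from two models' outputs.

    probs_a and probs_b hold each model's predicted probability that a
    record reports EQ-5D data (positive class). Returns a list of
    (sample_index, pseudo_label, ensemble_confidence) tuples.
    """
    selected = []
    for i, (pa, pb) in enumerate(zip(probs_a, probs_b)):
        # Assumed ensembling rule: average the positive-class probabilities.
        p_pos = (pa + pb) / 2.0
        label = 1 if p_pos >= 0.5 else 0
        # Confidence is the ensemble probability of the predicted class.
        confidence = p_pos if label == 1 else 1.0 - p_pos
        # Assumed extra filter: both models must predict the same class.
        agree = (pa >= 0.5) == (pb >= 0.5)
        if agree and confidence >= threshold:
            selected.append((i, label, confidence))
    return selected


# Hypothetical probabilities for four unlabeled records.
model_a = [0.98, 0.60, 0.05, 0.90]
model_b = [0.94, 0.40, 0.03, 0.45]
picked = select_pseudo_labels(model_a, model_b)
print([(i, label) for i, label, _ in picked])
```

In this sketch, raising `threshold` shrinks the set of accepted pseudo-labels, which is consistent with the abstract's observation that confidence thresholds can reduce the number of high-confidence pseudo-labeled samples.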