A Comparative Evaluation of AI Approaches to Large-Scale Scientific Subject Classification

Background: The Hungarian Science Bibliography applies the OECD Frascati Fields of Science and Technology taxonomy for subject classification; however, approximately 80% of its records lack assigned categories. Automated large-scale classification could support retrospective completion and improve the quality of bibliographic data.
Methods: We evaluated multiple artificial intelligence approaches to classifying publications into level 4 Frascati categories using only titles and keywords. Training datasets were compiled from bibliographic records and subjected to heuristic and large-language-model-based filtering to reduce noise and ambiguity. The approaches tested included statistical methods, classical machine learning classifiers, fine-tuned SciBERT models, zero-shot prompting with large language models, and a Mixture-of-Experts architecture.
Results: Data quality had a stronger impact on performance than model complexity. Large-language-model-based filtering substantially improved classification results. The best-performing model, a Support Vector Classifier, achieved a weighted F1 score of 0.83, which is an outstanding result relative to state-of-the-art approaches from the literature.
Conclusions: Our findings contribute new insights into classification research and may assist others in selecting appropriate solutions for real-world, large-scale bibliographic classification tasks.
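
Note on the metric: the weighted F1 score reported above is conventionally the mean of the per-class F1 scores, with each class weighted by its share of the true labels. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `weighted_f1` and the labels `cat_a`, `cat_b`, and `cat_c` are hypothetical placeholders rather than Frascati categories.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (standard definition)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    support = Counter(y_true)      # number of true records per class
    predicted = Counter(y_pred)    # number of predictions per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n / total) * f1  # weight each class by its share of records
    return score


# Hypothetical placeholder labels, for illustration only.
y_true = ["cat_a", "cat_a", "cat_b", "cat_c"]
y_pred = ["cat_a", "cat_b", "cat_b", "cat_c"]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.75
```

Because each class is weighted by its support, frequent categories dominate this score; in a taxonomy with many sparsely populated categories it can differ noticeably from an unweighted (macro) average.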
