Large Language Models' Accuracy in Emulating Human Experts' Evaluation of Public Sentiments about Heated Tobacco Products on Social Media

Sentiment analysis of alternative tobacco products on social media is important for tobacco control research. Large Language Models (LLMs) can help streamline the labor-intensive human sentiment analysis process. This study examined the accuracy of LLMs in replicating human sentiment evaluation of social media messages about heated tobacco products (HTPs).The research used GPT-3.5 and GPT-4 Turbo to classify 500 Facebook and 500 Twitter messages, including anti-HTPs, pro-HTPs, and neutral messages. The models evaluated each message up to 20 times, and their majority label was compared to human evaluators.Results showed that GPT-3.5 accurately replicated human sentiment 61.2% of the time for Facebook messages and 57.0% for Twitter messages. GPT-4 Turbo performed better, with 81.7% accuracy for Facebook and 77.0% for Twitter. Using three response instances, GPT-4 Turbo achieved 99% of the accuracy of twenty instances. GPT-4 Turbo also had higher accuracy for anti- and pro-HTPs messages compared to neutral ones. Misclassifications by GPT-3.5 often involved anti- or pro-HTPs messages being labeled as neutral or irrelevant, while GPT-4 Turbo showed improvements across all categories.In conclusion, LLMs can be used for sentiment analysis of HTP-related social media messages, with GPT-4 Turbo reaching around 80% accuracy compared to human experts. However, there's a risk of misrepresenting overall sentiment due to differences in accuracy across sentiment categories.
View on arXiv@article{kim2025_2502.01658, title={ Large Language Models' Accuracy in Emulating Human Experts' Evaluation of Public Sentiments about Heated Tobacco Products on Social Media }, author={ Kwanho Kim and Soojong Kim }, journal={arXiv preprint arXiv:2502.01658}, year={ 2025 } }