A Comparative Study of Sentiment Analysis for Mental Health Related Posts at Reddit & Twitter Using Machine Learning and Pre-Trained Models
DOI:
https://doi.org/10.56536/jicet.v4i2.128
Keywords:
Mental Health, Anxiety, Reddit, Machine learning, Social Media, Twitter
Abstract
The rapid growth of social media platforms and their use has caused a massive surge in online posting. People have been observed to express themselves more easily in a virtual environment than in a real one. Of the many platforms where people write about their feelings, Twitter and Reddit are among the most used for expressing thoughts on mental health. This raises pertinent questions about how effective each platform is for detecting anxiety disorders and depression, i.e., how effectively social media can identify different anxiety disorders and depression, and how anxiety relates to depression. The main purpose of this paper is to present a comparative study of two platforms for mental health discussion, Twitter and Reddit. The analysis focuses on two distinct datasets: a Reddit dataset of anxiety disorder-related posts that were manually scraped using the Python library PRAW, and a Twitter dataset of depression-related posts downloaded from the online repository Kaggle. The study also examines the linguistic similarities between depression and anxiety disorders and highlights how the proposed models function under cross-platform analysis. An array of contemporary pre-trained models and machine learning models was applied to the datasets, including BERT, SVM, Decision Tree, Naive Bayes, CatBoost, Gradient Boosting, XGBoost, XLNet, AACN, and other prominent algorithms. Data cleaning and feature selection were applied first to highlight the commonalities between the two platforms' datasets. The models were evaluated with an 80:20 hold-out split, using 80% of the data for training and the remaining 20% for testing.
The results provide insight into platform-specific behavior. On the Reddit dataset, the highest accuracy was 72% with CatBoost, followed by 70% with XLNet and AACN; XLNet achieved the highest recall, 0.98, outperforming the other models on that metric. On the Twitter dataset, CatBoost obtained 85% accuracy and logistic regression 82%, with CatBoost also giving the highest recall, 0.90. These findings are expected to give better insight into mental health-related discussion on social media and could help in detecting platform preferences. For future work, applying deep learning models to larger datasets is suggested.
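
As an illustration of the evaluation protocol described in the abstract, the following is a minimal sketch of an 80:20 hold-out split together with the accuracy and recall metrics that are reported. It is not the authors' code: the function names, the fixed random seed, the toy posts, and the choice of label 1 as the positive class are all assumptions made for the example.

```python
import random


def split_80_20(items, seed=0):
    """Shuffle a copy of items and return (train, test) in an 80:20 ratio.

    The seed is a hypothetical choice that only makes the toy split repeatable.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """True positives divided by all actual positives (TP / (TP + FN))."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy labelled posts (text, label); assumed here: 1 = positive class, 0 = other.
posts = [(f"post {i}", i % 2) for i in range(10)]
train, test = split_80_20(posts)

# Hypothetical true labels and predictions for a four-post test set.
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
print(len(train), len(test))
print(accuracy(y_true, y_pred), recall(y_true, y_pred))
```

Accuracy and recall can diverge, as in the Reddit results above: a model that labels nearly every post as positive can reach a recall close to 1 while its accuracy stays modest.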