Publication:
An empirical evaluation of text representation schemes to filter the social media stream

dc.contributor.affiliation: DA-IICT, Gandhinagar
dc.contributor.author: Modha, Sandip
dc.contributor.author: Majumder, Prasenjit
dc.contributor.author: Mandl, Thomas
dc.contributor.researcher: Modha, Sandip (201221001)
dc.date.accessioned: 2025-08-01T13:09:15Z
dc.date.issued: 01-05-2022
dc.description.abstract: Representing text numerically is a primary step for any downstream Natural Language Processing task, such as text classification. This paper studies the effectiveness of text representation schemes on a text classification task: aggressive text detection in social media, a special case of hate speech detection. Aggression levels are categorized into three predefined classes: "Non-aggressive" (NAG), "Overtly Aggressive" (OAG), and "Covertly Aggressive" (CAG). Text representation schemes based on bag-of-words (BoW) techniques, word embeddings, contextual word embeddings, and sentence embeddings are compared using traditional classifiers and deep neural models. The weighted F1-score is the primary evaluation metric. On the English dataset, text representation using Google's Universal Sentence Encoder (USE) performs better than word embedding and BoW techniques on traditional classifiers such as SVM, while pre-trained word embedding models perform better on classifiers based on deep neural models. Recent pre-trained transfer learning models such as ELMo, ULMFiT, and BERT are fine-tuned for the aggression classification task; however, their results are not on par with the pre-trained word embedding models. Overall, word embedding using pre-trained fastText vectors produces a better weighted F1-score than Word2Vec and GloVe. On the Hindi dataset, BoW techniques perform better than word embeddings on traditional classifiers such as SVM, whereas pre-trained word embedding models perform better on classifiers based on deep neural networks. Statistical significance tests are employed to confirm the significance of the classification results. Deep neural models are more robust against the bias induced by the training dataset: they perform substantially better than traditional classifiers, such as SVM, logistic regression, and Naive Bayes, on the Twitter test dataset.
dc.format.extent: 499-525
dc.identifier.citation: Sandip Modha, Prasenjit Majumder, and Thomas Mandl, "An empirical evaluation of text representation schemes to filter the social media stream," Journal of Experimental & Theoretical Artificial Intelligence, Taylor & Francis, ISSN: 1362-3079, vol. 34, no. 3, May-Jun. 2022, pp. 499-525, doi: 10.1080/0952813X.2021.1907792. [Published Date: 24 Apr 2021]
dc.identifier.doi: 10.1080/0952813X.2021.1907792
dc.identifier.issn: 1362-3079
dc.identifier.uri: https://ir.daiict.ac.in/handle/dau.ir/1779
dc.identifier.wos: WOS:000643843800001
dc.language.iso: en
dc.publisher: Taylor & Francis
dc.relation.ispartofseries: Vol. 34; No. 3
dc.source: Journal of Experimental & Theoretical Artificial Intelligence
dc.source.uri: https://www.tandfonline.com/doi/abs/10.1080/0952813X.2021.1907792
dc.title: An empirical evaluation of text representation schemes to filter the social media stream
dspace.entity.type: Publication
relation.isAuthorOfPublication: 2157d717-1c67-4d71-b314-ed3eddebf251
relation.isAuthorOfPublication: 7667cdf4-69b1-435a-9156-2401dbedea9a
relation.isAuthorOfPublication: 2157d717-1c67-4d71-b314-ed3eddebf251
relation.isAuthorOfPublication.latestForDiscovery: 2157d717-1c67-4d71-b314-ed3eddebf251
