TY - GEN
T1 - Unsupervised classification of health content on Reddit
AU - Barros, Joana M.
AU - Duggan, Jim
AU - Buitelaar, Paul
AU - Rebholz-Schuhmann, Dietrich
N1 - Publisher Copyright: © 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2019/11/20
Y1 - 2019/11/20
N2 - Online forums are easily accessible to the public and useful for acquiring and disseminating health information; however, advanced methods have to be applied to correctly interpret the content. For this reason, we propose the application of an unsupervised embedding-based approach for health content classification. Specifically, we utilise word embeddings and a clustering method to create content-sensitive word clusters; we then align the health content with the clusters, classifying it into illnesses/medication/disease agents. The results suggest that a cosine similarity of 0.70 is preferred for the creation of informative clusters as well as for the automatic generation of synonyms, acronyms, abbreviations and common misspellings. Our approach demonstrates the potential of discussion forums, in particular Reddit, not only for unsupervised content classification but also for dictionary building from informal health content.
AB - Online forums are easily accessible to the public and useful for acquiring and disseminating health information; however, advanced methods have to be applied to correctly interpret the content. For this reason, we propose the application of an unsupervised embedding-based approach for health content classification. Specifically, we utilise word embeddings and a clustering method to create content-sensitive word clusters; we then align the health content with the clusters, classifying it into illnesses/medication/disease agents. The results suggest that a cosine similarity of 0.70 is preferred for the creation of informative clusters as well as for the automatic generation of synonyms, acronyms, abbreviations and common misspellings. Our approach demonstrates the potential of discussion forums, in particular Reddit, not only for unsupervised content classification but also for dictionary building from informal health content.
KW - Clustering
KW - Discussion forum
KW - Health informatics
KW - Unsupervised learning
KW - Vocabulary building
KW - Word embeddings
UR - https://www.scopus.com/pages/publications/85076590311
U2 - 10.1145/3357729.3357745
DO - 10.1145/3357729.3357745
M3 - Conference Publication
T3 - ACM International Conference Proceeding Series
SP - 85
EP - 89
BT - DPH 2019 - Proceedings of the 9th International Conference on Digital Public Health
PB - Association for Computing Machinery
T2 - 9th International Conference on Digital Public Health, DPH 2019
Y2 - 20 November 2019 through 23 November 2019
ER -