Unsupervised classification of health content on reddit

Research output: Chapter in Book or Conference Publication/Proceeding › Conference Publication › peer-review

3 Citations (Scopus)

Abstract

Online forums are easily accessible to the public and useful for acquiring and disseminating health information; however, advanced methods must be applied to interpret the content correctly. For this reason, we propose an unsupervised embedding-based approach to health content classification. Specifically, we use word embeddings and a clustering method to create content-sensitive word clusters; we then align the health content with the clusters, classifying it into illnesses, medication, and disease agents. The results suggest that a cosine similarity of 0.70 is preferred for creating informative clusters as well as for automatically generating synonyms, acronyms, abbreviations and common misspellings. Our approach demonstrates the potential of discussion forums, in particular Reddit, not only for unsupervised content classification but also for dictionary building from informal health content.
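The pipeline described in the abstract (embed words, group them by cosine similarity, then map text onto the groups) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not name the clustering algorithm, so a simple greedy seed-based grouping stands in for it; the three-dimensional vectors are made-up toy values, not trained embeddings; and the cluster labels are assigned by hand. Only the 0.70 threshold and the three categories come from the abstract.

```python
import math
from collections import Counter

# Hypothetical toy embeddings (3-d, invented for illustration only).
EMBEDDINGS = {
    "ibuprofen": [0.90, 0.10, 0.00],
    "ibuprofin": [0.88, 0.15, 0.02],  # common misspelling
    "advil":     [0.80, 0.30, 0.10],
    "flu":       [0.10, 0.90, 0.10],
    "influenza": [0.12, 0.92, 0.08],
    "virus":     [0.00, 0.20, 0.95],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_words(embeddings, threshold=0.70):
    """Greedy grouping (an assumed stand-in for the paper's method):
    a word joins the first cluster whose seed vector it matches with
    cosine similarity >= threshold, otherwise it seeds a new cluster."""
    clusters = []  # list of (seed_vector, member_words)
    for word, vec in embeddings.items():
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(word)
                break
        else:
            clusters.append((vec, [word]))
    return [members for _, members in clusters]

def classify(text, clusters, labels):
    """Label a text by the category whose cluster words occur most often."""
    word_to_label = {
        word: labels[i] for i, members in enumerate(clusters) for word in members
    }
    counts = Counter(
        word_to_label[tok] for tok in text.lower().split() if tok in word_to_label
    )
    return counts.most_common(1)[0][0] if counts else None

clusters = cluster_words(EMBEDDINGS, threshold=0.70)
# Labels are assigned manually here; this step is an assumption.
LABELS = ["medication", "illness", "disease agent"]

print(clusters)
print(classify("took ibuprofin and advil for my flu", clusters, LABELS))
```

In this toy setting the misspelling "ibuprofin" lands in the same cluster as "ibuprofen", which mirrors the dictionary-building use the abstract mentions. With real embeddings the choice of threshold trades cluster purity against coverage.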

Original language: English
Title of host publication: DPH 2019 - Proceedings of the 9th International Conference on Digital Public Health
Publisher: Association for Computing Machinery
Pages: 85-89
Number of pages: 5
ISBN (Electronic): 9781450372084
Publication status: Published - 20 Nov 2019
Event: 9th International Conference on Digital Public Health, DPH 2019 - Marseille, France
Duration: 20 Nov 2019 to 23 Nov 2019

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 9th International Conference on Digital Public Health, DPH 2019
Country/Territory: France
City: Marseille
Period: 20/11/19 to 23/11/19

Keywords

  • Clustering
  • Discussion forum
  • Health informatics
  • Unsupervised learning
  • Vocabulary building
  • Word embeddings

Authors

  • Barros, JM; Buitelaar, P; Duggan, J; Rebholz-Schuhmann, D
