Evaluating better document representation in clustering with varying complexity

Stephen Bradshaw, Colm O’Riordan

Research output: Chapter in Book or Conference Publication/Proceeding, Conference Publication, peer-review

Abstract

Microblogging has become a very popular activity, and the posts made by users can be a valuable source of information. Classifying this content accurately can be challenging because comments are typically short and, on their own, may lack context. Reddit is a very popular microblogging site whose popularity has seen a huge and consistent increase over the years. In this paper we propose using alternative but related Reddit threads to build language models that can be used to disambiguate the intended meaning of terms in a post. A related thread is one which is similar in content, often consisting of the same frequently occurring terms or phrases. We posit that threads of a similar nature use similar language and that the identification of related threads can be used as a source to add context to a post, enabling more accurate classification. In this paper, graphs are used to model the frequency and co-occurrence of terms. The terms of a document are mapped to nodes, and the co-occurrence of two terms is recorded as an edge weight. To show the robustness of our approach, we compare the performance of using related Reddit threads with that of using an external ontology, WordNet. We apply a number of evaluation metrics to the clusters created and show that, in every instance, the use of alternative threads to improve document representations is better than the use of WordNet or standard augmented vector models. We apply this approach to increasingly difficult environments to test its robustness. A tougher environment is one where the classifying algorithm has more than two categories to choose from when selecting the appropriate class.

Original language: English
Title of host publication: IC3K 2018 - Proceedings of the 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management
Editors: Ana Fred, Joaquim Filipe
Publisher: SCITEPRESS
Pages: 194-202
Number of pages: 9
ISBN (Electronic): 9789897583308
Publication status: Published - 2018
Event: 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K 2018 - Seville, Spain
Duration: 18 Sep 2018 to 20 Sep 2018

Publication series

Name: IC3K 2018 - Proceedings of the 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management
Volume: 1

Conference

Conference: 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K 2018
Country/Territory: Spain
City: Seville
Period: 18/09/18 to 20/09/18

Keywords

  • Classification methods
  • Clustering
  • Context discovery
  • Mining text
  • Semi-structured data
