Evolving general term-weighting schemes for Information Retrieval: Tests on larger collections

Ronan Cummins, Colm O'Riordan

Research output: Contribution to a Journal (Peer & Non Peer); Article; peer-review

15 Citations (Scopus)

Abstract

Term-weighting schemes are vital to the performance of Information Retrieval models that use term frequency characteristics to determine the relevance of a document. The vector space model is one such model, in which the weights assigned to the document terms are of crucial importance to the accuracy of the retrieval system. This paper describes a genetic programming framework used to automatically determine term-weighting schemes that achieve a high average precision. These schemes are tested on standard test collections and are shown to perform as well as, and often better than, the modern BM25 weighting scheme. We present an analysis of the evolved schemes to explain the increase in performance. Furthermore, we show that the global (collection-wide) part of the evolved weighting schemes also increases average precision over idf on larger TREC data. These global weighting schemes are shown to adhere to Luhn's resolving power, as middle-frequency terms are assigned the highest weight. However, the complete weighting schemes evolved on small collections do not perform as well on large collections. We conclude that improved local (within-document) weighting schemes must themselves be evolved on large collections.
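For readers unfamiliar with the two baselines named in the abstract, the following is a minimal sketch of idf as a global (collection-wide) weight and of a BM25 term weight that combines a global part with a local (within-document) part. It uses the commonly cited Okapi BM25 form; the function names, the parameter defaults k1 = 1.2 and b = 0.75, and the toy numbers are assumptions made for illustration and are not taken from the paper. It does not reproduce any of the evolved schemes.

```python
import math


def idf(n_docs, doc_freq):
    """Global weight: classic inverse document frequency, log(N / df)."""
    return math.log(n_docs / doc_freq)


def bm25_weight(tf, doc_freq, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Weight of one term in one document under the common Okapi BM25 form.

    k1 and b are assumed defaults, not values reported in the paper.
    """
    # Global part: Robertson/Sparck Jones style idf.
    global_part = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Local part: saturating term frequency with document-length normalisation.
    length_norm = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    local_part = tf * (k1 + 1) / (tf + length_norm)
    return global_part * local_part


if __name__ == "__main__":
    # Toy collection statistics (hypothetical): 1000 documents, term in 10 of them.
    print(idf(1000, 10))
    print(bm25_weight(tf=3, doc_freq=10, n_docs=1000, doc_len=120, avg_doc_len=100))
```

In this form the global and local parts are separate factors, which mirrors the abstract's distinction between the global part of a scheme (compared against idf) and the local part.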

Original language: English
Pages (from-to): 277-299
Number of pages: 23
Journal: Artificial Intelligence Review
Volume: 24
Issue number: 3-4
Publication status: Published - Nov 2005

Keywords

  • Genetic programming
  • Information retrieval
  • Term-weighting schemes
