TY - GEN
T1 - Developing a Dataset for Technology Structure Mining
AU - Qasemizadeh, Behrang
AU - Buitelaar, Paul
AU - Monaghan, Fergal
PY - 2010/9/1
Y1 - 2010/9/1
N2 - This paper describes steps that have been taken to construct a development dataset for the task of Technology Structure Mining. We have defined the proposed task as the process of mapping a scientific corpus into a labeled digraph named a Technology Structure Graph as described in the paper. The generated graph expresses the domain semantics in terms of interdependencies between pairs of technologies that are named (introduced) in the target scientific corpus. The dataset comprises a set of sentences extracted from the ACL Anthology Corpus. Each sentence is annotated with at least two technologies in the domain of Human Language Technology and the interdependence between them. The annotations - technology mark-up and their interdependencies - are expressed at two layers: lexical and termino-conceptual. Lexical representation of technologies comprises varying lexicalizations of a technology. However, at the termino-conceptual layer all these lexical variations refer to the same concept. We have adopted the same approach for representing Semantic Relations: at the lexical layer a semantic relation is a predicate, i.e. it is defined based on the sentence surface structure; however, at the termino-conceptual layer semantic relations are classified into conceptual relations, either taxonomic or non-taxonomic. Moreover, the contexts that interdependencies are extracted from are classified into five groups based on the linguistic criteria and syntactic structure that are identified by the human annotators. The dataset initially comprises 482 sentences. We hope this effort results in a benchmark that can be used for the technology structure mining task as defined in the paper.
AB - This paper describes steps that have been taken to construct a development dataset for the task of Technology Structure Mining. We have defined the proposed task as the process of mapping a scientific corpus into a labeled digraph named a Technology Structure Graph as described in the paper. The generated graph expresses the domain semantics in terms of interdependencies between pairs of technologies that are named (introduced) in the target scientific corpus. The dataset comprises a set of sentences extracted from the ACL Anthology Corpus. Each sentence is annotated with at least two technologies in the domain of Human Language Technology and the interdependence between them. The annotations - technology mark-up and their interdependencies - are expressed at two layers: lexical and termino-conceptual. Lexical representation of technologies comprises varying lexicalizations of a technology. However, at the termino-conceptual layer all these lexical variations refer to the same concept. We have adopted the same approach for representing Semantic Relations: at the lexical layer a semantic relation is a predicate, i.e. it is defined based on the sentence surface structure; however, at the termino-conceptual layer semantic relations are classified into conceptual relations, either taxonomic or non-taxonomic. Moreover, the contexts that interdependencies are extracted from are classified into five groups based on the linguistic criteria and syntactic structure that are identified by the human annotators. The dataset initially comprises 482 sentences. We hope this effort results in a benchmark that can be used for the technology structure mining task as defined in the paper.
KW - NLP
KW - Technology structure mining
KW - Text mining
UR - http://hdl.handle.net/10379/4514
UR - https://www.scopus.com/pages/publications/79952062900
U2 - 10.13025/20979
DO - 10.13025/20979
M3 - Conference Publication
SN - 9780769541549
T3 - Proceedings - 2010 IEEE 4th International Conference on Semantic Computing, ICSC 2010
SP - 32
EP - 39
BT - Proceedings of the IEEE International Conference on Semantic Computing
PB - IEEE
T2 - 4th IEEE International Conference on Semantic Computing, ICSC 2010
Y2 - 22 September 2010 through 24 September 2010
ER -