TY - GEN
T1 - Using Information Retrieval Techniques to Automatically Repurpose Existing Dialogue Datasets for Safe Chatbot Development
AU - Ajayi, Tunde Oluwaseyi
AU - Negi, Gaurav
AU - Arcan, Mihael
AU - Buitelaar, Paul
N1 - Publisher Copyright: © 2024 ELRA Language Resource Association.
PY - 2024
Y1 - 2024
AB - There has been notable progress in the development of open-domain dialogue systems (chatbots), especially with the rapid advancement of the capabilities of Large Language Models. Chatbots excel at holding conversations in a manner that keeps a user interested and engaged. However, their responses can be unsafe, as they can respond in an offensive manner or offer harmful professional advice. To mitigate this issue, recent work crowdsources datasets with exemplary responses or annotates dialogue safety datasets, which are relatively scarce compared to casual dialogues. Although crowdsourcing yields high-quality data, it can be expensive and time-consuming. This work proposes an effective pipeline that uses information retrieval to automatically repurpose existing dialogue datasets for safe chatbot development, as a way to address these challenges. We select an existing dialogue dataset and revise its unsafe responses to obtain a dataset with safer responses to unsafe user inputs. We then fine-tune dialogue models on the original and revised datasets and generate responses to evaluate the safety of the models.
KW - chatbots
KW - dataset
KW - dialogue safety
KW - generation
KW - information retrieval
KW - toxicity
UR - http://hdl.handle.net/10379/18388
UR - https://www.scopus.com/pages/publications/85195412913
U2 - 10.13025/29182
DO - 10.13025/29182
M3 - Conference Publication
T3 - 3rd Workshop on Safety for Conversational AI, Safety4ConvAI 2024 at LREC-COLING 2024 - Workshop Proceedings
SP - 16
EP - 27
BT - 3rd Workshop on Safety for Conversational AI, Safety4ConvAI 2024 at LREC-COLING 2024 - Workshop Proceedings
A2 - Dinkar, Tanvi
A2 - Attanasio, Giuseppe
A2 - Curry, Amanda Cercas
A2 - Konstas, Ioannis
A2 - Hovy, Dirk
A2 - Rieser, Verena
PB - European Language Resources Association (ELRA)
T2 - 3rd Workshop on Safety for Conversational AI, Safety4ConvAI 2024
Y2 - 21 May 2024
ER -