Corpus for the seq2seq bot


#1

I stumbled on the seq2seq_bot project and was wondering how the corpus (I mean Q1.csv and Q2.csv) was created.


#2

Hi @bachir, the corpus was handwritten with some basic characteristics in mind, like answering in a mean, grumpy, sarcastic way whenever possible. Does that answer your question?


#3

Thanks @asir, can you be more specific? If I want to build a similar bot, how could I proceed?
I would appreciate it if you could provide some practical hints :slight_smile:


#4

Hi, thanks for your interest! This was a manually built corpus in an experimental setup; a proper methodology has yet to be developed. This is a very new field of research (personality-based/social/affective chatbots) in NLU/NLG, so there are not a lot of resources available, but I will keep you updated.


#5

Thanks for the reply @MLT. If you have written (or will write) a research paper on this, I would love to read it.


#6

In general (not limited to seq2seq_bot), dialogue corpus curation can be done as follows:

  1. Decide the purpose
    • chit-chat: collect free-form conversations by crowd-sourcing, with or without topics
    • task-oriented: collect formulaic question-answering dialogs by crowd-sourcing
  2. Design annotations
    • speech acts: more theoretical and general
      • e.g. types of human communication
        • optional: whether a dialogue has been broken (changed, deviated, nonsense, etc.)
    • dialog acts: more practical and specific
      • e.g. just annotate question-related terms and types
  3. Annotate
    • by trained annotators
    • by crowd
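
To make the steps above a bit more concrete, here is a minimal Python sketch of how annotated dialogues could be stored and how labels from several annotators could be merged. Everything in it is an assumption for illustration only: the `Turn` and `Dialogue` classes, the label names, and the `majority_label` helper are hypothetical and are not the format used by seq2seq_bot.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    speaker: str
    text: str
    # Hypothetical dialog-act label, e.g. "question" or "answer".
    dialog_act: Optional[str] = None
    # Optional annotation: has the dialogue broken down at this turn?
    breakdown: bool = False


@dataclass
class Dialogue:
    # Step 1: the purpose, e.g. "chit-chat" or "task-oriented".
    purpose: str
    turns: List[Turn] = field(default_factory=list)


def majority_label(labels: List[str]) -> Optional[str]:
    """Merge labels from several annotators by majority vote.

    Returns None when there are no labels or the top two labels tie,
    so that the turn can be sent back for another round of annotation.
    """
    counts = Counter(labels).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


# Step 3: three (hypothetical) annotators label each turn.
dialogue = Dialogue(purpose="chit-chat")
dialogue.turns.append(
    Turn("user", "How are you?",
         dialog_act=majority_label(["question", "question", "greeting"]))
)
dialogue.turns.append(
    Turn("bot", "Why do you care?",
         dialog_act=majority_label(["answer", "question"]))
)
```

Majority voting is just one simple way to reconcile crowd annotations; trained annotators would more likely follow written guidelines and resolve disagreements by discussion.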

#7

FYI: There are a few dialogue corpora available for English, here


#8

Hey Mike, do you have any references (academic papers) by any chance for the points you mentioned?


#9

Sure, will do it later tonight.
Sorry for the late reply.


#10

For starters, much of the work on task-oriented dialogue systems is done by http://dialogue.mi.eng.cam.ac.uk/

Before I elaborate further on my points, please note that they are merely how I categorize the related research fields. Better summaries may be out there in the literature. An umbrella term is “conversational agent.” See https://web.stanford.edu/class/cs124/lec/chatbot.pdf by Prof. Jurafsky to get the general idea. My first step is based on the widely adopted classification:

  1. Decide the purpose
    • chit-chat: collect free-form conversations by crowd-sourcing, with or without topics
    • task-oriented: collect formulaic question-answering dialogs by crowd-sourcing

#11

Specifically, most people use the “Wizard-of-Oz” approach to collect dialogs. Some apply bootstrapping, too.

See

for examples. Please note that the above papers are not the origins of the dialog collection methods. I list them because, in my opinion, they make the ideas easier for readers to grasp.