Intent Processor - Cuneiform

Introduction

The intent processor takes the intent information provided by the developer and preprocesses it into a form the intent classifier can use. The intent information is stored in a JSON file. For example, an intent in the ChessHelper sample is saved as shown below.


{
    "name" : "OverviewIntent",
    "initiative": "user",
    "sample_utterances" : [
        "How do i play",
        "How do i play chess",
        "give me an overview of chess",
        "give me an overview",
        "What are the rules",
        "What are the rules for chess",
        "What are the rules to chess"
    ],
    "slots" : []
}

Text Classification

The classifier uses the Multinomial Naive Bayes algorithm. It is called "naive" because each word in an utterance is treated as independent of the other words in that utterance.

For example, in the utterance "She sells sea shells on the sea shore", the words "she", "sells", "shells", etc. have no relationship with each other in the eyes of the classifier. This assumption is a strong simplification, but it is not a problem here, as we are trying to classify utterances, not understand them.
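
To make the independence assumption concrete, here is a minimal sketch in plain Python. It is not the project's code: bag_of_words is a hypothetical helper, and a whitespace split stands in for a real tokenizer. It reduces an utterance to word counts, so two utterances containing the same words in a different order become indistinguishable.

```python
from collections import Counter


def bag_of_words(utterance):
    """Return word counts for an utterance, discarding word order."""
    # Hypothetical helper: a plain whitespace split stands in for a real tokenizer.
    return Counter(utterance.lower().split())


original = bag_of_words("She sells sea shells on the sea shore")
shuffled = bag_of_words("shore sea the on shells sea sells she")

# Both utterances produce the same counts, so a classifier that only
# sees these counts cannot tell them apart.
print(original == shuffled)
print(original["sea"])
```

Only the counts survive, which is all a word-count based classifier of this kind needs.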

Tools

The Natural Language Toolkit (NLTK) is the main library we use. It serves two purposes.

Tokenization - breaking a sentence into words: "Have a nice day" tokenizes into a list of individual words: "Have", "a", "nice", "day".
Stemming - reducing words to their stem: “have” stems to “hav”, which allows it to be matched with “having” (same stem).

Classification

The next step is to look at the training data, which the developer provides as "sample utterances". All intents are stored in a .json file as an array of intent objects, as shown below.


[
    {
        "name" : "intentA",
        "sample_utterances" : [
            ...
        ]
    },
    {
        "name" : "intentB",
        "sample_utterances" : [
            ...
        ]
    },
    ...
]
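
As a rough illustration of reading this structure, the sketch below parses a JSON array of intent objects and maps each intent name to its sample utterances. The names parse_intent_array and utterances_by_intent are hypothetical, the input is a made-up miniature example, and the sketch assumes the top level of the file is the array itself; the project's own get_intents() may wrap or return the data differently.

```python
import json


def parse_intent_array(text):
    """Parse JSON text that is expected to hold an array of intent objects."""
    intents = json.loads(text)
    if not isinstance(intents, list):
        raise ValueError("expected a JSON array of intent objects")
    return intents


def utterances_by_intent(intents):
    """Map each intent name to its list of sample utterances."""
    return {
        intent["name"]: intent.get("sample_utterances", [])
        for intent in intents
    }


# Made-up miniature input in the shape shown above (an assumption, not real data).
example_text = """
[
    {"name": "intentA", "sample_utterances": ["first example"]},
    {"name": "intentB", "sample_utterances": []}
]
"""

training_data = utterances_by_intent(parse_intent_array(example_text))
print(training_data)
```

To use it with a file, pass the file's contents as text to parse_intent_array.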

The next step is to organize the sample utterances in structures that can be worked with algorithmically.


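from nltk.stem.lancaster import LancasterStemmer
from nltk.tokenize import word_tokenize

corpus_words = {}  # stemmed word -> number of occurrences across all intents
intent_words = {}  # intent name -> list of stemmed words from its utterances
stemmer = LancasterStemmer()

# get_intents() is defined elsewhere in the project; it reads the intent .json file
intents = get_intents()
intents_array = intents['intents']
for intent in intents_array:
    name = intent['name']
    intent_words[name] = []

    sample_utterances = intent['sample_utterances']
    for utterance in sample_utterances:
        slot_detected = False
        for word in word_tokenize(utterance):
            # Slots are written as {slotName}; skip the braces and everything between them
            if word == '{':
                slot_detected = True
            elif word == '}':
                slot_detected = False
                continue

            if not slot_detected:
                stemmed_word = stemmer.stem(word.lower())
                corpus_words[stemmed_word] = corpus_words.get(stemmed_word, 0) + 1
                intent_words[name].append(stemmed_word)

# Build the result once, after every intent has been processed
utterance_data = {'corpus_words': corpus_words, 'intent_words': intent_words}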

The stemmer used here is the Lancaster stemmer. The get_intents() function reads the intent data from the .json file and returns the intents together with their sample utterances; the code then reads the list of intents from its 'intents' key. We iterate through each intent and, within each intent, through all of its sample utterances.

Each utterance is broken up into words using the tokenizer. Slots are indicated within curly braces ({slotName}) and are ignored at this stage. Each remaining word is stemmed and added to the word corpus, and is also added to the list of words for its intent. In NLP, a corpus is a collection of text; our corpus is the collection of all stemmed words from the sample utterances, with a count for each. Shown below is an example of the processed training data in the ChessHelper sample.


{
    "corpus_words": {"hello": 1, "hi": 1, "welcom": 1, "good": 3, "morn": 1, "afternoon": 1, "ev": 1, "what": 6, "'s": 1, "up": 6, "?": 1, "yo": 1, "howdy": 1, "how": 11, "do": 11, "i": 12, "play": 2, "chess": 13, "giv": 3, "me": 4, "an": 2, "overview": 2, "of": 4, "ar": 3, "the": 5, "rul": 3, "for": 1, "to": 1, "set": 5, "a": 7, "gam": 4, "board": 1, "win": 4, "is": 2, "goal": 2, "can": 1, "hav": 1, "tip": 2, "pleas": 1, "tel": 1, "about": 1},
    "intent_words": {
        "WelcomeIntent": ["hello", "hi", "welcom", "good", "morn", "good", "afternoon", "good", "ev", "what", "'s", "up", "?", "yo", "howdy"],
        "OverviewIntent": ["how", "do", "i", "play", "how", "do", "i", "play", "chess", "giv", "me", "an", "overview", "of", "chess", "giv", "me", "an", "overview", "what", "ar", "the", "rul", "what", "ar", "the", "rul", "for", "chess", "what", "ar", "the", "rul", "to", "chess"],
        "SetupIntent": ["how", "do", "i", "set", "up", "a", "chess", "gam", "how", "do", "i", "set", "up", "chess", "how", "do", "i", "set", "up", "a", "chess", "board", "how", "do", "i", "set", "up", "a", "gam", "of", "chess", "how", "do", "i", "set", "up"],
        "GoalIntent": ["how", "do", "i", "win", "a", "gam", "how", "do", "i", "win", "a", "gam", "of", "chess", "how", "do", "i", "win", "chess", "how", "do", "i", "win", "what", "is", "the", "goal", "of", "chess", "what", "is", "the", "goal"],
        "TipIntent": ["can", "i", "hav", "a", "chess", "tip", "pleas", "giv", "me", "a", "chess", "tip"],
        "PieceIntent": ["tel", "me", "about"]
    }
}
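
The PieceIntent entry above contains only "tel", "me", "about", which is what one would expect if the rest of its sample utterance sits inside a slot placeholder. The sketch below isolates that slot-skipping step. It is a simplified illustration, not the project's code: drop_slot_tokens is a hypothetical helper, and the token list (including the slot name "piece") is an assumed example of what a tokenizer that splits braces into separate tokens might produce.

```python
def drop_slot_tokens(tokens):
    """Return the tokens that lie outside {slot} placeholders."""
    kept = []
    inside_slot = False
    for token in tokens:
        if token == "{":
            inside_slot = True       # start skipping at the opening brace
        elif token == "}":
            inside_slot = False      # resume after the closing brace
        elif not inside_slot:
            kept.append(token)
    return kept


# Assumed token list for an utterance such as "tell me about {piece}".
tokens = ["tell", "me", "about", "{", "piece", "}"]
print(drop_slot_tokens(tokens))
```

Stemming is left out here so that the example needs no external library; in the processor, each kept word would be stemmed before being counted.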

Now we have organized our data into dictionaries.

corpus_words - each stemmed word and the number of occurrences.
intent_words - each intent, and the list of stemmed words within it.

This processed data is then stored in a .json file, which the Intent Classifier uses for classification.
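
A minimal sketch of that storage step, using Python's standard json module, might look as follows. The function names, the file name processed_intents.json, and the miniature sample data are all assumptions made for illustration; the project's actual file layout is not shown in this document.

```python
import json
import os
import tempfile


def save_processed_data(utterance_data, path):
    """Write the processed dictionaries to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(utterance_data, handle, indent=4)


def load_processed_data(path):
    """Read the processed dictionaries back from a JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Made-up miniature data in the same shape as the output shown earlier.
sample = {
    "corpus_words": {"chess": 2, "tip": 1},
    "intent_words": {"TipIntent": ["chess", "tip", "chess"]},
}

# Hypothetical file name, written to a temporary directory for this demo.
path = os.path.join(tempfile.mkdtemp(), "processed_intents.json")
save_processed_data(sample, path)
restored = load_processed_data(path)
print(restored == sample)
```

Because both dictionaries use only strings, integers, and lists, they survive the round trip through JSON unchanged.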

The intent processor handles slot values and their synonyms in a similar way: it stems the words and stores them in a .json file for classification purposes.

What's next?

The next section discusses the implementation details of the intent classifier.