For example, if someone asks to pay money off their credit card, the system should interact with the banking backend to make the payment, while also playing a TTS response that confirms to the user what is happening.
The first step in any dialogue system is Automatic Speech Recognition (ASR), which converts the audio of someone’s utterance into text. ASR usually outputs in ‘spoken form’, i.e. numbers and other expressions are in words, not digits (“three hundred and twenty one” and not “321”), but automatic formatting can optionally be applied.
Diatheke uses Cobalt’s ASR engine, Cubic, to process audio and supply transcripts.
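To make the difference between spoken form and formatted output concrete, here is a minimal Python sketch that turns spoken-form number words into digits. It is only an illustration and is not Cubic's formatter; the function name `spoken_number_to_int` and the word tables are hypothetical, and it only handles whole numbers below one million.

```python
# Hypothetical illustration of spoken-form to digit formatting.
# This is NOT Cubic's implementation; names and scope are assumptions.
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def spoken_number_to_int(text: str) -> int:
    """Convert e.g. 'three hundred and twenty one' to 321."""
    total = 0    # value of completed thousands groups
    current = 0  # value of the group being built
    for word in text.lower().replace("-", " ").split():
        if word == "and":
            continue  # filler word in spoken numbers
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word == "thousand":
            total += max(current, 1) * 1000
            current = 0
        else:
            raise ValueError(f"not a number word: {word!r}")
    return total + current


print(spoken_number_to_int("three hundred and twenty one"))
```

A real formatter would also need to handle currencies, dates, ordinals and mixed text, which this sketch deliberately leaves out.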
A command is a specific type of action that executes some set of tasks in response to a recognized intent. For example, if the utterance “Play some music” was recognized as an intent, the command might handle the actual playback of an audio file.
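One common way for client code to carry out commands is a lookup table from a command identifier to a handler function. The sketch below assumes this design; the identifiers `play_music` and `pay_credit_card`, the parameter names, and the `run_command` helper are all hypothetical and are not part of the Diatheke API.

```python
from typing import Callable, Dict

# A handler receives the command's parameters and returns a status message.
CommandHandler = Callable[[Dict[str, str]], str]


def play_music(params: Dict[str, str]) -> str:
    # Hypothetical handler: a real one would start audio playback.
    song = params.get("SongName", "some music")
    return f"Playing {song}"


def pay_credit_card(params: Dict[str, str]) -> str:
    # Hypothetical handler: a real one would call the banking backend.
    return f"Paying {params['Amount']} off your credit card"


# Assumed registry of command identifiers to handlers.
COMMANDS: Dict[str, CommandHandler] = {
    "play_music": play_music,
    "pay_credit_card": pay_credit_card,
}


def run_command(command_id: str, params: Dict[str, str]) -> str:
    """Look up the handler for a command and execute it."""
    handler = COMMANDS.get(command_id)
    if handler is None:
        raise KeyError(f"no handler registered for {command_id!r}")
    return handler(params)


print(run_command("play_music", {"SongName": "New York"}))
```

Keeping handlers in a table means new commands can be added without changing the dispatch logic.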
An entity (a.k.a. a slot) appears in an utterance. It represents a value in an intent that can change between utterances or even be omitted entirely in some cases. Entities are defined in the Diatheke model.
“Book me three plane tickets from London to New York, please.”
“Play the song New York”
“Pay three pounds off my credit card.”
The first two examples have entities that are cities - the first has a single City, while the second might have a SourceCity and DestinationCity.
The second and fourth examples have entities that are numbers, though they are interpreted differently in the two utterances. One is a quantity, and the other is an amount of money.
The same words can mean different entities in different contexts. E.g. ‘New York’ could be either a City or a SongName, depending on the context.
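The context dependence described above can be illustrated with a toy interpreter that matches each utterance against one pattern per intent and labels the captured words with that intent's entity names. This is a deliberately naive regular-expression sketch, not how Diatheke's NLU works; the intent names, the `Quantity`, `Amount` and `SongName` entity names, and the `interpret` function are assumptions made for the example.

```python
import re
from typing import Dict, Optional, Tuple

# One hypothetical pattern per intent; named groups are the entities.
PATTERNS = [
    ("book_flight", re.compile(
        r"book me (?P<Quantity>\w+) plane tickets? "
        r"from (?P<SourceCity>.+?) to (?P<DestinationCity>.+?)(?:,| please|$)",
        re.IGNORECASE)),
    ("play_song", re.compile(
        r"play the song (?P<SongName>.+)$", re.IGNORECASE)),
    ("pay_credit_card", re.compile(
        r"pay (?P<Amount>.+?) off my credit card", re.IGNORECASE)),
]


def interpret(utterance: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (intent, entities) for the first matching pattern, else None."""
    text = utterance.strip().rstrip(".")
    for intent, pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return intent, match.groupdict()
    return None


for example in (
    "Book me three plane tickets from London to New York, please.",
    "Play the song New York",
    "Pay three pounds off my credit card.",
):
    print(interpret(example))
```

Here the words "New York" come back as a `DestinationCity` in the booking utterance and as a `SongName` in the music utterance, and "three" is a `Quantity` in one case and part of an `Amount` in the other.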
Natural Language Understanding (NLU) is a field within artificial intelligence that deals with machine reading comprehension. It attempts to discern the meaning of (sometimes incomplete) sentences. In Diatheke, these interpretations are the intent with its associated entities.
A session represents all the components (ASR, TTS, NLU, etc.) necessary to carry on a conversation with Diatheke. A single session keeps track of a dialog’s current state, with all possible states being defined by a Diatheke model.
In a Diatheke model, a story groups together several related actions that help accomplish a particular goal. This is where the desired dialog flow is defined as sequences of actions, such as waiting for user input, executing commands, and sending the user replies. A Diatheke model includes at least a “main” story that defines where a session begins, and can have any number of additional stories that are used during a session.
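As a rough mental model of how a session steps through a story, the sketch below represents a story as an ordered list of actions and a session as the state those actions read and update. It does not reflect the actual Diatheke model format or API; `Session`, `wait_for_input`, `reply`, `MAIN_STORY` and `run_story` are hypothetical names, and user input is supplied up front as a list of already-transcribed strings to keep the example self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Session:
    """Hypothetical session state: values remembered so far and replies sent."""
    memory: Dict[str, str] = field(default_factory=dict)
    replies: List[str] = field(default_factory=list)


# An action receives the session and the queue of pending user inputs.
Action = Callable[[Session, List[str]], None]


def wait_for_input(key: str) -> Action:
    """Action that consumes the next user input and stores it under `key`."""
    def action(session: Session, inputs: List[str]) -> None:
        session.memory[key] = inputs.pop(0)
    return action


def reply(template: str) -> Action:
    """Action that sends a reply, filling placeholders from session memory."""
    def action(session: Session, inputs: List[str]) -> None:
        session.replies.append(template.format(**session.memory))
    return action


# The "main" story: the sequence of actions where a session begins.
MAIN_STORY: List[Action] = [
    reply("How much would you like to pay?"),
    wait_for_input("amount"),
    reply("Paying {amount} off your credit card."),
]


def run_story(story: List[Action], inputs: List[str]) -> List[str]:
    """Run every action in order and return the replies that were produced."""
    session = Session()
    pending = list(inputs)  # copy so the caller's list is not modified
    for action in story:
        action(session, pending)
    return session.replies


print(run_story(MAIN_STORY, ["three pounds"]))
```

A fuller model would let actions branch into other stories and execute commands; a linear list is used here only to keep the flow easy to follow.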
Text To Speech (TTS) synthesizes audio that replicates human speech from written text.
Diatheke uses Cobalt’s TTS engine, Luna, to synthesize speech audio from text.
An utterance is something that someone says. It isn’t necessarily a full sentence, as people don’t always speak grammatically.