In practice, the two vectors – the context encoding and the encoding of a single candidate output – are combined via a dot product to arrive at a score for the given candidate. The goal of this network is to maximize the score for true candidates, i.e. candidates that actually appeared as responses in the training set, and to minimize the score for false candidates.
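As a rough illustration, here is a minimal sketch of that scoring step in Python. The encoder below is a throwaway bag-of-random-embeddings stand-in, used purely so the example runs end to end; the real system uses trained neural context and candidate encoders.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64
_EMB: dict[str, np.ndarray] = {}

def _embed(text: str) -> np.ndarray:
    """Throwaway stand-in encoder: sums random per-token vectors.
    A real system would use separately trained context and candidate encoders."""
    vec = np.zeros(DIM)
    tokens = text.lower().split()
    for tok in tokens:
        if tok not in _EMB:
            _EMB[tok] = rng.standard_normal(DIM)
        vec += _EMB[tok]
    return vec / max(len(tokens), 1)

def score(context: str, candidate: str) -> float:
    """Dot product of the context encoding and a candidate encoding."""
    return float(np.dot(_embed(context), _embed(candidate)))

print(score("my laptop will not power on",
            "Have you tried holding the power button for ten seconds?"))
```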
Choosing how to sample negatives greatly affects model training. Below are some strategies that can be employed:
Using positive labels from other training examples in the batch (see the in-batch sketch after this list).
Drawing randomly from a set of common messages. This assumes the random sample correctly reflects the empirical probability of each message.
Using messages from context.
Generating negatives from another model.
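Of these, the first strategy (in-batch negatives) is the easiest to sketch: each context in a batch treats every other context's true reply as a negative. Below is a minimal numpy version of that loss; the random vectors stand in for encoder outputs, and a real training loop would of course backpropagate through learned encoders.

```python
import numpy as np

def in_batch_softmax_loss(context_vecs: np.ndarray, reply_vecs: np.ndarray) -> float:
    """Softmax loss with in-batch negatives.

    context_vecs, reply_vecs: (batch, dim) encodings, where row i of
    reply_vecs is the true reply for row i of context_vecs; every other
    row in the batch serves as a negative for that context.
    """
    logits = context_vecs @ reply_vecs.T                    # (batch, batch) score matrix
    logits -= logits.max(axis=1, keepdims=True)             # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))              # maximize the diagonal (true pairs)

# Toy example with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
print(in_batch_softmax_loss(rng.standard_normal((8, 64)), rng.standard_normal((8, 64))))
```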
Because this approach scores a fixed list of candidates, their encodings can be precomputed and stored. Each time a prediction is needed, only the context encoding has to be computed and then multiplied by the matrix of candidate embeddings. This reduces both the serving latency relative to a beam-search approach and the inherent bias towards shorter responses.
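A sketch of that serving path, with random vectors standing in for the precomputed candidate matrix and the context encoder output (the real encodings come from the trained model):

```python
import numpy as np

def top_suggestions(context_vec: np.ndarray,       # context encoder output, shape (dim,)
                    candidate_matrix: np.ndarray,   # precomputed candidate encodings, shape (n, dim)
                    candidate_texts: list[str],
                    k: int = 3) -> list[str]:
    """Serving path: encode the context once, one matrix-vector product, then take the top k."""
    scores = candidate_matrix @ context_vec
    best = np.argsort(scores)[::-1][:k]
    return [candidate_texts[i] for i in best]

# Toy example with random stand-ins for the learned encodings.
rng = np.random.default_rng(0)
cands = ["Have you tried restarting?", "Could you share a screenshot?", "Glad I could help!"]
print(top_suggestions(rng.standard_normal(64), rng.standard_normal((len(cands), 64)), cands, k=2))
```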
Dialogue Attributes
Conversations are more than simple text modeling. The overall flow of the conversation between participants provides important information, changing the attributes of each message. Context such as who said what to whom, and when, offers useful input for the model when making a prediction. To that end the model uses the following attributes during prediction (a sketch of how they might be featurized follows the list):
Local user IDs – we set a finite number of participants for a given conversation to represent the turn-taking between messages, assigning an ID to each participant. Most support sessions have two participants, requiring only IDs 0 and 1.
Replies vs continuations – initial modeling focused only on replies. In practice, however, conversations also include instances where a participant follows up on their own previous message. Given this, the model is trained for both same-user suggestions and “other”-user suggestions.
Timestamps – gaps in a conversation can indicate a number of different things. From a support perspective, a gap may indicate that the user has disconnected. The model focuses on the time elapsed between messages and provides different predictions based on those values.
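One plausible way to featurize these attributes is sketched below; the bucket boundaries and field names are illustrative assumptions, not the production scheme.

```python
from dataclasses import dataclass

# Illustrative bucket edges in seconds; the model's real thresholds aren't published here.
TIME_GAP_BUCKETS = [10, 60, 300, 3600]

def bucket_time_gap(gap_seconds: float) -> int:
    """Map the time elapsed since the previous message to a small discrete bucket."""
    for i, edge in enumerate(TIME_GAP_BUCKETS):
        if gap_seconds < edge:
            return i
    return len(TIME_GAP_BUCKETS)

@dataclass
class MessageFeatures:
    local_user_id: int      # e.g. 0 = tech, 1 = customer in a two-party support chat
    is_continuation: bool   # True when the suggestion follows the same user's last message
    time_gap_bucket: int    # discretized gap since the previous message

def featurize(sender_id: int, previous_sender_id: int, gap_seconds: float) -> MessageFeatures:
    return MessageFeatures(
        local_user_id=sender_id,
        is_continuation=(sender_id == previous_sender_id),
        time_gap_bucket=bucket_time_gap(gap_seconds),
    )

print(featurize(sender_id=0, previous_sender_id=0, gap_seconds=45.0))
```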
Post processing
Suggestions can then be manipulated to get a more desirable final ranking. Such post-processing includes the following (a sketch combining these heuristics appears after the list):
Preferring longer suggestions by adding a length bonus proportional to the number of tokens in the candidate.
Demoting suggestions with a high level of overlap with previously sent messages.
Promoting more diverse suggestions based on embedding distance similarities.
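A rough re-ranking sketch that combines the three heuristics; the weights, the token-overlap measure, and the greedy diversity pass are assumptions for illustration, not the production logic.

```python
import numpy as np

def rerank(candidates: list[str],
           scores: list[float],
           embeddings: np.ndarray,          # (num_candidates, dim) candidate encodings
           sent_messages: list[str],
           length_weight: float = 0.02,
           overlap_penalty: float = 0.5,
           diversity_penalty: float = 0.3) -> list[str]:
    """Adjust raw model scores with the heuristics listed above (illustrative weights)."""
    sent_tokens = {tok for msg in sent_messages for tok in msg.lower().split()}
    adjusted = []
    for cand, score in zip(candidates, scores):
        tokens = cand.lower().split()
        score += length_weight * len(tokens)                           # prefer longer suggestions
        overlap = len(set(tokens) & sent_tokens) / max(len(tokens), 1)
        score -= overlap_penalty * overlap                             # demote repeats of earlier messages
        adjusted.append(score)

    # Greedy selection that demotes candidates too similar (by cosine) to ones already picked.
    order, remaining = [], list(range(len(candidates)))
    while remaining:
        best = max(remaining, key=lambda i: adjusted[i])
        order.append(best)
        remaining.remove(best)
        for i in remaining:
            sim = float(embeddings[best] @ embeddings[i]) / (
                np.linalg.norm(embeddings[best]) * np.linalg.norm(embeddings[i]) + 1e-9)
            adjusted[i] -= diversity_penalty * sim                     # promote diverse suggestions
    return [candidates[i] for i in order]
```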
To help us tune and focus on the best responses, the team created a priority list. This gives us the ability to influence the model’s output, ensuring that incorrect responses can be de-prioritized. Abstractly, it can be thought of as a filter that can be calibrated to best suit the client’s needs.
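The exact form of the priority list isn't described above, but one minimal way to realize such a calibratable filter is a per-response score adjustment applied after re-ranking; the entries and weights below are purely hypothetical.

```python
# Purely hypothetical priority list: positive values boost a canned response,
# negative values demote responses known to be wrong for a given client.
PRIORITY_ADJUSTMENTS = {
    "Have you tried turning it off and on again?": 1.0,
    "Please clear your browser cache.": -2.0,
}

def apply_priority_list(candidates: list[str], scores: list[float]) -> list[tuple[str, float]]:
    """Calibratable filter applied after model scoring and re-ranking."""
    adjusted = [(c, s + PRIORITY_ADJUSTMENTS.get(c, 0.0)) for c, s in zip(candidates, scores)]
    return sorted(adjusted, key=lambda cs: cs[1], reverse=True)
```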
Getting suggestions to agents
With our model ready, we now needed to get it into the hands of our techs. We wanted our solution to be as agnostic to our chat platform as possible, allowing us to stay agile in the face of tooling changes and to deploy other efficiency features more quickly. To this end we wanted an API that we could query either via gRPC or via HTTPS. We designed a Google Cloud API responsible for logging usage as well as acting as a bridge between our model and the Chrome Extension we would use as a frontend.
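The HTTPS path might be exercised from the frontend roughly like this; the endpoint URL, payload shape, and response format below are assumptions, since the real API surface isn't documented here.

```python
import requests

# Hypothetical endpoint and payload shape; the real API surface is not documented here.
API_URL = "https://example-suggest-api.googleapis.com/v1/suggestions"

def fetch_suggestions(conversation: list[dict], auth_token: str) -> list[str]:
    """Query the suggestion service over HTTPS, roughly as a Chrome Extension frontend might."""
    response = requests.post(
        API_URL,
        json={"messages": conversation},   # e.g. [{"sender": 0, "text": "Hi, my laptop won't boot", "ts": 1700000000}]
        headers={"Authorization": f"Bearer {auth_token}"},
        timeout=2.0,                       # suggestions are only useful if they arrive quickly
    )
    response.raise_for_status()
    return response.json().get("suggestions", [])
```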
The hidden step: measurement
Once we had our model, infrastructure, and extension in place, we were left with the big question for any IT project: what was our impact? One of the great things about working in IT at Google is that it’s never dull; there are constant changes, planned or unplanned. However, this complicates measuring the success of a deployment like this. Did we improve our service, or was it just a quiet month?
To be confident in our results we conducted an A/B experiment, with some of our techs using the extension and the others not. The groups were chosen at random from across our global team, covering a mix of experience levels ranging from 3 to 26 months.
Our primary goal was to measure tech support efficiency when using the tool. We looked at two key metrics as proxies for tech efficiency:
The overall length of the chat.
The number of messages sent by the tech.
Evaluating our experiment
To evaluate our data we used a two-sample permutation test. The null hypothesis was that techs using the extension would neither have a lower time-to-resolution nor send more messages than those without the extension. The alternative hypothesis was that techs using the extension would resolve sessions more quickly, or send more messages in approximately the same time.
We took the mid-mean of our data, using pandas to trim outliers more than 3 standard deviations from the mean. As the distribution of our chat lengths is not normal, with significant right skew caused by a long tail of longer issues, we opted to measure the difference in means, relying on the central limit theorem (CLT) for our significance values. Any result with a p-value above our 0.05 significance threshold would be rejected.
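A minimal sketch of the trimming step and a two-sample permutation test on the difference in means, assuming chat lengths are given in seconds; the permutation count here is a conventional default, not necessarily what the team used, and the variable names in the usage comment are hypothetical.

```python
import numpy as np
import pandas as pd

def trim_outliers(s: pd.Series, z: float = 3.0) -> pd.Series:
    """Drop observations more than z standard deviations from the mean."""
    return s[(s - s.mean()).abs() <= z * s.std()]

def permutation_test(treatment: np.ndarray, control: np.ndarray,
                     n_permutations: int = 10_000, seed: int = 0) -> tuple[float, float]:
    """Two-sample permutation test on the difference in means.

    Returns (observed difference, one-sided p-value), where the p-value is the
    probability of seeing a difference at least as negative by chance alone.
    """
    rng = np.random.default_rng(seed)
    observed = treatment.mean() - control.mean()
    pooled = np.concatenate([treatment, control])
    n_t = len(treatment)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = pooled[:n_t].mean() - pooled[n_t:].mean()
        if diff <= observed:
            count += 1
    p_value = (count + 1) / (n_permutations + 1)
    return float(observed), float(p_value)

# Usage with chat lengths (in seconds) for the extension and control groups:
# diff, p = permutation_test(trim_outliers(ext_chats).to_numpy(),
#                            trim_outliers(ctl_chats).to_numpy())
```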
Across the entire pool we saw a decrease in chat lengths of 36 seconds.