How do you measure how well your designs are performing?

Just as the UX design discipline conducts audits to identify opportunities for optimization or refactoring of user experiences through heuristic analysis, how does this methodology translate to the field of voice? Example: how would you run an assessment for timers and alarms to identify opportunities to either fix or optimize them?


Many times, new features or products are launched and then…forgotten.

The next big thing comes along, and in a rush to chase it, we simply assume things are “probably working fine” and we don’t need to do the work of actually verifying this. Because we’d know if there was a problem, right?

Measuring how well a conversational experience is doing is a big, important topic, and there’s more to it than I can cover in an answer here. But I do want to talk about the basics. (For more, check out Chapter 6 of my book.)

First, checking on the performance of a new feature or flow is critical, and even more so when voice/natural language is involved. Why? Because in a standard GUI (Graphical User Interface), at least it’s clear when someone has tapped a button on the screen, or which menu option they’ve chosen. That doesn’t mean the menu options are perfect, or that the buttons aren’t confusing, but at least you know what happened.

With natural language, it’s trickier. Let’s look at your example of alarms and timers. There are dozens of ways to set an alarm via natural language. And sometimes, people don’t even say things “correctly”! For example, I might say “Set an alarm for ten minutes”. Technically, I mean “set a timer”. But you, a human, know this, right? Your bot should know too.
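To make that alarm-vs-timer point concrete, here’s a minimal sketch of how a bot might route that kind of “incorrect” phrasing. The patterns and intent names are illustrative assumptions, not anyone’s production grammar; a real system would typically use a trained NLU model rather than hand-written regexes.

```python
import re

# Minimal sketch of an intent router that tolerates "incorrect" phrasing.
# The regexes and intent names are illustrative assumptions; a production
# system would normally rely on a trained NLU model.

DURATION = re.compile(r"\bfor\s+(\d+|ten|twenty|thirty)\s+(seconds?|minutes?|hours?)\b", re.I)
CLOCK_TIME = re.compile(r"\b\d{1,2}(:\d{2})?\s*(am|pm|o'?clock)\b", re.I)

def route(utterance: str) -> str:
    text = utterance.lower()
    # "Set an alarm for ten minutes" mentions "alarm" but asks for a duration,
    # so treat it as a timer request, the way a human listener would.
    if "timer" in text or ("alarm" in text and DURATION.search(text)):
        return "set_timer"
    if "alarm" in text and CLOCK_TIME.search(text):
        return "set_alarm"
    return "fallback"

print(route("Set an alarm for ten minutes"))  # -> set_timer
print(route("Set my alarm for 7 am"))         # -> set_alarm
```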

When you launch something in which people are going to speak or type, it’s not going to be perfect out of the box. And that’s ok! That’s what measuring/analyzing/tuning is all about. You launch, you get some data, you check out where you went wrong and fix it.

There are a few phases to keep in mind.

  • Pre-launch: before you let it out into the wild, do some internal testing. Take your design specifications and run through each flow/branch in the conversation. Are the correct prompts played? Will different phrasings of the same intent work? Is the error behavior working as designed? (A sketch of this kind of check follows this list.)

  • Post-launch: if possible, run a pilot with a limited number of users. Run it for a week or two, then look at the data from real users. Look for (a) did the system correctly interpret what they said/typed? and (b) are they asking for intents you did not expect? (Note: this will require manually reviewing a selection of fully anonymized transcriptions.)

  • Intermittent checks: you should still run tuning from time to time, even on a system that’s been running for a while. User behavior and expectations may change, or something may have broken. This is especially true with new hardware or tech: for example, the way someone uses a smartwatch right after they get it may be quite different from how they use it once they’re accustomed to it.
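As an illustration of the pre-launch check above, here’s a rough sketch of a design-QA harness that replays paraphrases from your design spec and verifies the intent and prompt that come back. The bot_respond hook, intent names, and test cases are hypothetical; wire it up to whatever test interface your platform actually provides.

```python
# Rough sketch of a pre-launch design-QA harness. `bot_respond` is a
# hypothetical hook into your system (a test endpoint, a local simulator,
# etc.); the utterances and expected prompts are illustrative, the kind of
# table you would build from your design spec.

TEST_CASES = [
    # (utterance, expected intent, fragment that must appear in the prompt)
    ("set a timer for ten minutes", "set_timer", "Timer set"),
    ("ten minute timer, please",    "set_timer", "Timer set"),
    ("wake me up at 7 am",          "set_alarm", "Alarm set"),
    ("blargh fizzbuzz",             "fallback",  "didn't catch that"),
]

def run_design_qa(bot_respond):
    """bot_respond(utterance) is assumed to return (intent, prompt_text)."""
    failures = []
    for utterance, expected_intent, prompt_fragment in TEST_CASES:
        intent, prompt = bot_respond(utterance)
        if intent != expected_intent or prompt_fragment not in prompt:
            failures.append((utterance, expected_intent, intent, prompt))
    print(f"{len(TEST_CASES) - len(failures)}/{len(TEST_CASES)} cases passed")
    for utterance, expected, got, prompt in failures:
        print(f"  FAIL {utterance!r}: expected {expected}, got {got} ({prompt!r})")
```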

In an ideal world, you’ll have a dashboard that can measure some of these things (like “OOG,” or Out of Grammar, rates) so you don’t have to run every test manually. Other things to measure include the following (a rough sketch of computing a few of them from logs follows the list):

  • Task completion rate: how often do users start a task, but fail to finish it?

  • Transfers to agent, if your system includes this. (Note: transfers are not inherently bad, but they’re worth examining to see if there are particularly frustrating places in your flow, or places where you could get a user to an agent earlier.)

  • Unsupported requests: perhaps your users are asking for a feature you don’t have and don’t plan to support. Rather than giving the standard error, it’s better to acknowledge what they said (e.g. “Sorry, but I’m afraid I can’t check past invoices for you. Would you like me to transfer you to an agent?”)

  • CSAT (customer satisfaction): you may have a high task completion rate but low customer satisfaction. This could be for a number of reasons, including redundant prompts, giving the user unnecessary options or information, or asking them for information that’s not really needed. This is a more subjective measure, but it can still help pinpoint problem areas.
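To show what those dashboard numbers might look like in practice, here’s a rough sketch that computes OOG rate, task completion rate, and transfer rate from a list of anonymized log records. The field names (intent, task_started, task_completed, transferred_to_agent) are assumptions; map them to whatever your logging pipeline actually captures.

```python
from collections import Counter

# Rough sketch of the dashboard metrics above, computed from anonymized log
# records. The record fields used here are assumptions; adapt them to your
# own logging schema.

def summarize(records):
    totals = Counter()
    for r in records:
        totals["turns"] += 1
        if r.get("intent") in ("fallback", "out_of_grammar"):
            totals["oog"] += 1  # the user said something we couldn't handle
        totals["tasks_started"] += bool(r.get("task_started"))
        totals["tasks_completed"] += bool(r.get("task_completed"))
        totals["transfers"] += bool(r.get("transferred_to_agent"))

    def rate(numerator, denominator):
        return f"{100 * numerator / denominator:.1f}%" if denominator else "n/a"

    return {
        "OOG rate": rate(totals["oog"], totals["turns"]),
        "Task completion rate": rate(totals["tasks_completed"], totals["tasks_started"]),
        "Transfer rate": rate(totals["transfers"], totals["turns"]),
    }

# Example: summarize([{"intent": "set_timer", "task_started": True, "task_completed": True}])
```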

In summary: analysis of conversational systems is crucial. Ideally, run design QA before launch, as well as shortly after. Set up dashboards to look for high OOG (Out of Grammar) rates, errors, task completion, and so on. Know up front that your users will ask for things, and phrase things, in ways you did not anticipate. Don’t despair, just prepare!

Cathy Pearl