
Building a Voice Experience for People with Type 2 Diabetes

Recently we were finalists in the Merck-sponsored Alexa Diabetes Challenge, where we built a voice-powered interactive care plan, complemented by a voice-enabled IoT scale and diabetic foot scanner. The scanner uses the existing routine of weighing in to scan the person’s feet for foot ulcers, a serious but usually preventable complication of diabetes.

We blogged about our experience testing the device in clinic here.

We blogged about feedback from patients here.

This was a fun and productive challenge for our team, so we wanted to share some of our lessons learned from implementing the voice interface. This may be of interest to developers working on similar problems.

Voice Experience Design

As a team, we sat down and brainstormed a long list of things we thought a person with diabetes might want to ask, whether they were interacting with the scale and scanner device, or a standalone Amazon Echo. We tried not to be constrained by things we knew our system could do. This was a long list that we then categorized into intents and prioritized.

  • ~60% of the utterances we knew we’d be able to handle well (“My blood sugar is 85.”, “Send a message to my care team”, “Is it ok to drink soda?”)
  • ~20% of the utterances we couldn’t handle fully, but could reasonably redirect (“How many calories are in 8oz of chicken and a half cup of rice?”)
  • ~20% of the utterances we didn’t think we’d be able to get to, but were interesting for our backlog (“I feel like smoking.”)
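
To make the first bucket concrete, each of those utterances became a sample utterance on an intent. A fragment of the interaction model we pasted into the Skill Builder’s JSON editor looked roughly like this (BloodSugarTaskIntent is one of our real intents; the BloodSugarValue slot name is just illustrative):

{
  "name": "BloodSugarTaskIntent",
  "slots": [{ "name": "BloodSugarValue", "type": "AMAZON.NUMBER" }],
  "samples": [
    "my blood sugar is {BloodSugarValue}",
    "record a blood sugar of {BloodSugarValue}"
  ]
}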

After some “wizard of oz” testing of our planned voice interactions, we decided that we needed to support both quick-hit interactions, where a user quickly records their blood sugar or weight for example, and guided interactions where we guide the patient through a few tasks on their care plan. The guided interactions were particularly important for our voice-powered scale and foot scanner so that we could harness an existing habit (weighing oneself) and capture additional information at the same time. This allows the interaction to fit seamlessly into someone’s day.

 

Challenge 1: We wanted to integrate the speech hardware into our scale / foot scanner device using the Alexa Voice Service, rather than using an off-the-shelf Echo device.

The Alexa Voice Service is a client SDK and a set of interface standards for building Echo-like capabilities into other hardware products. We decided early on to prototype our device around a Raspberry Pi 3 board to have sufficient processing power to:

  • Handle voice interactions (including wake-word detection)
  • Drive the sensors (camera array, thermal imaging, load sensors)
  • Run an image classifier on the device
  • Drive on-device illumination to assist the imaging devices
  • Securely perform network operations both for device control and for sending images to our cloud service

Raspberry Pi in Sugarpod

The device needed built-in illumination in order to capture usable photos of people’s feet to look for ulcers and abnormalities. Since the lighting had to be there anyway, one of our team members came up with the idea of dual-purposing the LEDs as a speech status indicator. In the same way that the Amazon Echo uses blinking cyan and blue on its LED ring to show status, our entire scale bed could do the same.
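
As a rough sketch of the idea (not our actual firmware), the mapping from Alexa’s attention states to the scale-bed lighting might look something like this, where ledRing stands in for a hypothetical driver for the LEDs:

// Hypothetical sketch: map Alexa attention states to the scale-bed LEDs.
// ledRing is a stand-in for whatever driver talks to the LED hardware.
const LED_PATTERNS = {
  IDLE:      { color: 'off',  animation: 'none'  },
  LISTENING: { color: 'cyan', animation: 'solid' },
  THINKING:  { color: 'cyan', animation: 'pulse' },
  SPEAKING:  { color: 'blue', animation: 'pulse' }
};

function onAttentionStateChange(state) {
  const pattern = LED_PATTERNS[state] || LED_PATTERNS.IDLE;
  ledRing.setColor(pattern.color);
  ledRing.setAnimation(pattern.animation);
}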

As we started prototyping with basic microphones and speakers, we quickly discovered how important audio pre-processing is for our application. In our testing there were many cases of poor transcriptions or unrecognizable utterances, especially when the user was standing any distance from the device. Our physical chassis designs put the microphone about 2’ off the ground, far from the average user’s mouth, and real-world deployments would often be in echo-filled bathrooms. Clearly, we needed a proper far-field mic array. We considered a mic array dev kit, but decided it was too expensive and added too much complexity for the challenge. We also spent a couple of hours investigating whether we could hack an Echo Dot to use its audio hardware.

Eventually we decided that it made the most sense to stick with an off-the-shelf Echo for our prototype. Thus, in addition to being a foot scanner and connected scale, the device is also the world’s most elegant long-armed Echo Dot holder! It was easy to physically incorporate the Echo Dot into our design: we figured out where the speaker was and adjusted our 3D models to include sound holes, and since the mic array and cue lights are on top, we made sure that part of the device remained exposed.

We will be revisiting the voice hardware design as we look at moving the prototype towards commercial viability.

 

Challenge 2: We wanted both quick-hit and guided interactions to use the same handlers for clean code organization, but we hit some speed bumps enabling intent-to-intent handoff

Our guided workflows are built from stacks and queues of tasks that hand off between various handlers in our skill, but we found this hard to do when we moved to the Alexa Skill Builder. The Skill Builder (currently in beta) lets skill developers customize the speech model for each intent and provides better support for common multi-turn interactions like filling slots and confirming intents. This was a big improvement, but it also forced us to rework some things.

For example, we wanted the same blood sugar handler to run whether you initiated a conversation with “Alexa, tell Sugarpod my blood sugar is 85”, or if the handler was invoked as part of a guided workflow where Alexa asks “You haven’t told me your blood sugar for today. Have you measured it recently?”

We tried a number of ways to have our guided workflow handlers switch intents, but this didn’t seem to be possible with the Alexa API. As a workaround, we allowed all of our handlers to run in the context of both the quick-hit entry-point intent (like BloodSugarTaskIntent) and the guided workflow intent (like RunTasksIntent), and then expanded the guided workflow intent’s slots to include the union of all slots needed by any handler that might run in that workflow.
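
As a sketch of the resulting structure (in the alexa-sdk v1 style used elsewhere in our skill; recordBloodSugar, taskQueue, and the slot name are illustrative), the same function backs both entry points:

// The same logic runs whether the request arrived as a quick-hit intent or as
// the next step of a guided workflow. recordBloodSugar and taskQueue are
// illustrative; the intent names are the ones described above.
function handleBloodSugar(context) {
  const slots = context.event.request.intent.slots;
  recordBloodSugar(context.attributes.userId, slots.BloodSugarValue.value);

  const next = (context.attributes.taskQueue || []).shift();
  if (next) {
    context.emitWithState(next);  // e.g. hand off to the next task's handler
  } else {
    context.emit(':tell', 'Thanks, I have recorded your blood sugar.');
  }
}

const handlers = {
  'BloodSugarTaskIntent': function () { handleBloodSugar(this); },  // quick hit
  'RunTasksIntent':       function () { handleBloodSugar(this); }   // guided workflow
};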

Another challenge was that we wanted to use the standard AMAZON.YesIntent or AMAZON.NoIntent in our skill; however, the Alexa Skill Builder does not allow this, presumably because it reserves these intents for slot and intent confirmation. Our workaround was to add a fictitious slot (we called ours “ConfirmationSlot”) to essentially all of our intents and ask Alexa to confirm that slot whenever we wanted a yes/no answer. We factored this into a helper library that is used throughout our skill codebase.

// From our confirmation helper: translate a confirmed or denied ConfirmationSlot
// into the equivalent Yes/No intent, resetting the slot so it can be reused.
if (confirmationSlot.confirmationStatus === 'CONFIRMED') {
  confirmationSlot.confirmationStatus = 'NONE';
  handler.emitWithState('AMAZON.YesIntent');
  return true;
} else if (confirmationSlot.confirmationStatus === 'DENIED') {
  confirmationSlot.confirmationStatus = 'NONE';
  handler.emitWithState('AMAZON.NoIntent');
  return true;
}
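
In a handler, asking a yes/no question (like the guided blood sugar prompt above) then looks roughly like this, assuming the snippet above is factored into a helper we’ll call checkConfirmationSlot(handler, slot) and using the alexa-sdk v1 ':confirmSlot' dialog helper:

// Hypothetical usage inside a handler. checkConfirmationSlot wraps the snippet
// above; ':confirmSlot' asks Alexa to confirm the fictitious slot.
const slots = this.event.request.intent.slots;
if (checkConfirmationSlot(this, slots.ConfirmationSlot)) {
  return;  // the answer was re-emitted as AMAZON.YesIntent / AMAZON.NoIntent
}
this.emit(':confirmSlot', 'ConfirmationSlot',
  'You have not told me your blood sugar for today. Have you measured it recently?',
  'Have you measured your blood sugar recently?');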

Challenge 3: The Alexa speech recognizer did not always reliably recognize complicated and often-mispronounced pharmaceutical names

One of our intents allows the user to report medication usage, by saying something like “Alexa, tell Sugarpod I took my Metformin.” People are not always able to pronounce drug names clearly, so we wanted to allow for mispronunciation. For the Alexa Diabetes Challenge, we curated a list of the ~200 most common medications associated with diabetes and its frequent co-morbidities, including over-the-counter and prescription medication. These were bound to a custom slot type.
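
A custom slot type in the Skill Builder is just a named list of values, so a small script along these lines (a sketch; medications.json and the MedicationName type name are hypothetical) can generate the slot-type fragment from a curated list:

// Sketch: turn a curated medication list into a custom slot type fragment
// for the Skill Builder's JSON editor. medications.json is hypothetical.
const fs = require('fs');

const medications = JSON.parse(fs.readFileSync('medications.json', 'utf8'));

const slotType = {
  name: 'MedicationName',
  values: medications.map(name => ({ name: { value: name } }))
};

console.log(JSON.stringify(slotType, null, 2));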

When we tested this, however, we found that Alexa’s speech recognizer sometimes struggled to identify medication from our list. This was especially true when the speaker mispronounced the name of the medication (understandable with an utterance like “Alexa, tell Sugarpod I took my Thiazolidinedion”). We observed this even with reasonably good pronunciation. We particularly liked “Mad foreman” as a transcription when one of us asked about “metformin.” Complex pronunciations are a well-known problem in medical and other specialized vocabularies, so this is a real-world problem that we wanted to invest some time in.

Ideally, we would have taken an empirical approach: collect a set of common pronunciations (and mispronunciations) of the medications and train a new model. It might also be interesting to use disfluencies such as hesitation and repetition in the recorded utterance as features in a model. However, neither is currently possible with the Alexa Skills Kit.

Our fallback was to use some basic algorithmic methods to find better matches. We had reasonable success with simple phonetic matching schemes like Soundex and NYSIIS, which gave us a good improvement over the raw Alexa ASR results. We also started evaluating whether an edit-distance approach would work better (for example, comparing phonetic representations of the search term against the corpus of expected pharmaceutical names using Levenshtein edit distance), but we eventually decided that a Fuzzy Soundex match was sufficient for the purposes of the challenge (Fuzzy Soundex: David Holmes & M. Catherine McCabe, http://ieeexplore.ieee.org/document/1000354/).
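
To show the flavor of the approach, here is a minimal sketch using plain Soundex (our actual helper used Fuzzy Soundex and additional validation; matchMedication is illustrative):

// Minimal sketch of Soundex-based matching (our real helper used Fuzzy Soundex).
function soundex(name) {
  const codes = { b: '1', f: '1', p: '1', v: '1',
                  c: '2', g: '2', j: '2', k: '2', q: '2', s: '2', x: '2', z: '2',
                  d: '3', t: '3', l: '4', m: '5', n: '5', r: '6' };
  const letters = name.toLowerCase().replace(/[^a-z]/g, '');
  if (!letters) return '';
  let code = letters[0].toUpperCase();
  let prev = codes[letters[0]] || '';
  for (const ch of letters.slice(1)) {
    const digit = codes[ch] || '';
    if (digit && digit !== prev) code += digit;
    if (ch !== 'h' && ch !== 'w') prev = digit;  // h/w don't separate duplicate codes
    if (code.length === 4) break;
  }
  return code.padEnd(4, '0');
}

// Return the first medication whose Soundex code matches the ASR transcription.
function matchMedication(transcription, medicationList) {
  const target = soundex(transcription);
  return medicationList.find(med => soundex(med) === target) || null;
}

// matchMedication('mad foreman', ['metformin', 'glipizide', 'lisinopril'])  => 'metformin'

Phonetic codes collapse many spelling and transcription variations onto the same key, which is why a garbled transcription like “mad foreman” can still resolve to “metformin.”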

Even though our current implementation provides good performance, this remains an area for further investigation, particularly as we continue to work on larger lists of pharmaceuticals associated with other disease or intervention types.

 

Conclusion

This challenge helped us stretch our thinking about the voice experience and gave us the opportunity to solve some important problems along the way. The work we’ve done benefits not just our diabetes care plans but all of our other care plans as well.

While the Alexa voice pipeline is not yet a HIPAA-eligible service, we’re looking forward to being able to use our voice experience with patients as soon as it is!

