I am interested in digging deeper into the experience side of conversation design. Halloween is close, so I decided to make a talking pumpkin who is not so fond of Halloween. When trick-or-treaters ask Dorcy the pumpkin for candy, he will reply in an annoyed voice and turn them away. I am going to build this chatbot with Watson and play with its built-in SSML (Speech Synthesis Markup Language) support.
IBM Watson Text-to-Speech service demo.
I was surprised that SSML, expressive SSML, and Voice Transformation SSML don’t apply to all voices. The most advanced and customizable voice appears to be Allison’s, so I will begin with hers. I later found in the documentation that Michael’s and Lisa’s voices also work with the following experiments.
Plain Text:
[sourcecode language=”XML” wraplines=”TRUE” light=”TRUE”]
I have been assigned to handle your order status request. I am sorry to inform you that the items you requested are back-ordered. We apologize for the inconvenience. We don’t know when those items will become available. Maybe next week but we are not sure at this time. Because we want you to be a happy customer, management has decided to give you a 50% discount!
[/sourcecode]
Expressive SSML:
[sourcecode language=”XML” wraplines=”TRUE” light=”TRUE”]
<speak>I have been assigned to handle your order status request.<express-as type="Apology"> I am sorry to inform you that the items you requested are back-ordered. We apologize for the inconvenience.</express-as><express-as type="Uncertainty"> We don’t know when those items will become available. Maybe next week but we are not sure at this time.</express-as><express-as type="GoodNews">Because we want you to be a happy customer, management has decided to give you a 50% discount! </express-as></speak>
[/sourcecode]
express-as syntax documentation
GoodNews: expresses a positive, upbeat message.
Apology: expresses a message of regret.
Uncertainty: conveys an uncertain, interrogative message.
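Typing these tags by hand gets error-prone quickly, so here is a minimal Python sketch of how the three types above could be wrapped programmatically. The helper names `express_as` and `speak` are my own inventions for illustration, not part of any Watson SDK, and the sketch only builds the SSML string; it does not call the service.

```python
from xml.sax.saxutils import escape

# The three expressive types listed above.
EXPRESSIVE_TYPES = {"GoodNews", "Apology", "Uncertainty"}


def express_as(text: str, kind: str) -> str:
    """Wrap text in an <express-as> element (hypothetical helper)."""
    if kind not in EXPRESSIVE_TYPES:
        raise ValueError(f"unsupported express-as type: {kind!r}")
    # escape() protects characters such as & and < inside the spoken text.
    return f'<express-as type="{kind}">{escape(text)}</express-as>'


def speak(*parts: str) -> str:
    """Join already marked-up fragments under a single <speak> root."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(
    "I have been assigned to handle your order status request.",
    express_as(" We apologize for the inconvenience.", "Apology"),
    express_as(" Management has decided to give you a 50% discount!", "GoodNews"),
)
print(ssml)
```

Rejecting unknown types up front is a design choice on my part: a typo such as `type="Goodnews"` then fails in the script rather than somewhere inside the speech service.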
Voice Transformation SSML:
[sourcecode language=”XML” wraplines=”true” light=”TRUE”]
<voice-transformation type="Custom" glottal_tension="-80%">I have been assigned to handle your order status request.</voice-transformation> <voice-transformation type="Young" strength="80%">I am sorry to inform you that the items you requested are back-ordered. </voice-transformation><voice-transformation type="Custom" breathiness="90%">We apologize for the inconvenience.</voice-transformation> <voice-transformation type="Custom" glottal_tension="40%" breathiness="40%">We don’t know when those items will become available.</voice-transformation><voice-transformation type="Custom" timbre="Breeze" timbre_extent="60%"> Maybe next week but we are not sure at this time. Because we want you to be a happy customer, </voice-transformation><voice-transformation type="Custom" pitch="-30%" pitch_range="80%" rate="60%" glottal_tension="-80%" timbre="Sunrise">management has decided to give you a 50% discount!</voice-transformation>
[/sourcecode]
Voice Transformation SSML only works with the following English voices:
en-US_AllisonVoice
en-US_LisaVoice
en-US_MichaelVoice
(see more detailed reference below)
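Since only these three voices accept the transformation tags, a small guard can catch a mismatch before anything is sent off for synthesis. This is a rough sketch under my own assumptions: the function names are hypothetical, the voice list is simply the one above, and `en-GB_KateVoice` stands in as an example of a voice that is not on it.

```python
# Voices this post lists as supporting Voice Transformation SSML.
TRANSFORMABLE_VOICES = frozenset({
    "en-US_AllisonVoice",
    "en-US_LisaVoice",
    "en-US_MichaelVoice",
})


def supports_voice_transformation(voice: str) -> bool:
    """Return True if the voice is on the list above (exact, case-sensitive)."""
    return voice in TRANSFORMABLE_VOICES


def check_ssml_for_voice(ssml: str, voice: str) -> str:
    """Return the SSML unchanged, or raise if it needs an unsupported voice."""
    if "<voice-transformation" in ssml and not supports_voice_transformation(voice):
        raise ValueError(f"{voice} is not listed as supporting voice transformation")
    return ssml


print(supports_voice_transformation("en-US_LisaVoice"))  # True
print(supports_voice_transformation("en-GB_KateVoice"))  # False
```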
Remix Expressive and Voice Transformation SSML 1:
[sourcecode language=”XML” light=”true”]
<voice-transformation type="Custom" pitch="-30%" pitch_range="80%" rate="60%" glottal_tension="-80%" timbre="Sunrise"> <speak>I have been assigned to handle your order status request.<express-as type="Apology"> I am sorry to inform you that the items you requested are back-ordered. We apologize for the inconvenience.</express-as><express-as type="Uncertainty"> We don’t know when those items will become available. Maybe next week but we are not sure at this time.</express-as><express-as type="GoodNews">Because we want you to be a happy customer, management has decided to give you a 50% discount! </express-as></speak></voice-transformation>
[/sourcecode]
Remix Expressive and Voice Transformation SSML 2:
[sourcecode language=”XML” light=”true”]
<voice-transformation type="Custom" pitch="-30%" pitch_range="50%" glottal_tension="-80%" timbre="Sunrise"><speak version="1.0"><express-as type="GoodNews"><prosody rate="+10%">I say tometo, you say <phoneme alphabet="ipa" ph="təˈmɑːtoʊ">tomato.</phoneme></prosody></express-as></speak></voice-transformation>
[/sourcecode]
Remix Expressive and Voice Transformation SSML 3:
[sourcecode language=”XML” light=”true”]
<voice-transformation type="Custom" pitch="20%" pitch_range="80%" rate="60%" glottal_tension="-80%" timbre="Sunrise">You say tometo, I say <phoneme alphabet="ipa" ph="təmaɾoʊ">tomato.</phoneme> </voice-transformation>
[/sourcecode]
Transformation and Expressive SSML have many overlapping functions (pitch, rate, …), and based on these experiments, they can be used together. The third remix example uses Lisa’s voice instead of Allison’s, and it sounds better to me.
Allison’s voice without any modification:
Attribute | Range | Description |
---|---|---|
pitch | [-100%, 100%], [x-low, low, default, high, x-high] | Normalized relative change of the average pitch contour level within safe limits. The attribute controls the perceived average tone level. It is borrowed from the pitch attribute of the SSML <prosody> tag. It contributes to changing perceived speaker identity. |
pitch_range | [-100%, 100%], [x-narrow, narrow, default, wide, x-wide] | Normalized relative change of the pitch contour dynamic range within safe limits. Increasing or decreasing the pitch range makes the speech style more or less expressive. The attribute is borrowed from the range attribute of the SSML <prosody> tag. |
glottal_tension | [-100%, 100%], [x-low, low, default, high, x-high] | Normalized relative change of the glottal tension within safe limits. Increasing or decreasing the glottal tension is perceived as a more tense or lax speech quality. A positive value might produce buzzing sounds, which you can alleviate by increasing the value of the breathiness attribute. A negative value is perceived as more breathy and generally more pleasant. |
breathiness | [-100%, 100%], [x-low, low, default, high, x-high] | Normalized relative change of the perceived level of the aspiration noise within safe limits. Extreme values might produce either noisy speech (for positive breathiness) or a buzzing sound (for negative breathiness). Use this attribute to compensate for buzz or extra noise produced as side effects of other attributes. |
rate | [-100%, 100%], [x-slow, slow, default, fast, x-fast] | Normalized relative change of the speech rate within safe limits. Increasing or decreasing the rate makes speech faster or slower. A positive (faster) rate makes the perceived pitch range wider, and a negative (slower) rate perceptually narrows the pitch range. The attribute is borrowed from the rate attribute of the SSML <prosody> tag. |
timbre | [Sunrise, Breeze] | The case-sensitive name of one of the built-in vocal-tract transformations: Sunrise or Breeze. The names are symbolic; experiment with the timbres to learn how they impact voice transformation. The attribute contributes to changing perceived speaker identity. |
timbre_extent | [0%, 100%] | The extent of the timbre vocal-tract transformation: 0% cancels the transformation; 100% represents full application of the transformation. The attribute quantifies the difference between the transformed and original voices, enabling blending of the selected timbre with that of the original voice. Even at moderate timbre extent values, the timbre attribute contributes to changing perceived speaker identity. |
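The ranges in the table are easy to get wrong when tweaking values by ear, so here is a minimal Python sketch that builds a `Custom` transformation tag and rejects out-of-range numbers. The function `custom_transformation` is a hypothetical helper of mine, it handles only the percentage form of each attribute (not keywords such as `x-low`), and it only produces a string.

```python
from xml.sax.saxutils import escape, quoteattr

# Attributes the table gives a [-100%, 100%] range.
PERCENT_ATTRS = {"pitch", "pitch_range", "glottal_tension", "breathiness", "rate"}
# Case-sensitive timbre names from the table.
TIMBRES = {"Sunrise", "Breeze"}


def _percent(name: str, value: int, low: int, high: int) -> str:
    """Format value as a percentage after checking it against the table's range."""
    if not low <= value <= high:
        raise ValueError(f"{name}={value}% is outside [{low}%, {high}%]")
    return f"{value}%"


def custom_transformation(text, timbre=None, timbre_extent=None, **percent_attrs):
    """Build a <voice-transformation type="Custom"> tag (hypothetical helper)."""
    attrs = {"type": "Custom"}
    for name, value in percent_attrs.items():
        if name not in PERCENT_ATTRS:
            raise ValueError(f"unknown attribute: {name}")
        attrs[name] = _percent(name, value, -100, 100)
    if timbre is not None:
        if timbre not in TIMBRES:
            raise ValueError(f"unknown timbre: {timbre!r}")
        attrs["timbre"] = timbre
    if timbre_extent is not None:
        attrs["timbre_extent"] = _percent("timbre_extent", timbre_extent, 0, 100)
    rendered = " ".join(f"{key}={quoteattr(val)}" for key, val in attrs.items())
    return f"<voice-transformation {rendered}>{escape(text)}</voice-transformation>"


print(custom_transformation("We apologize for the inconvenience.", breathiness=90))
```

Passing `timbre="Breeze", timbre_extent=60` should reproduce the attribute pair used in the Voice Transformation example earlier in the post.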
This definitely added another layer of fun to the experience design, but maybe the technology is not quite there yet with SSML. Compared to other text-to-speech (TTS) services online, the default TTS service did a poor job with punctuation, especially the intended pauses between sentences. I was able to hack around it in Expressive SSML with the break syntax and extra spaces. There should be a way to open up the sound design and production of TTS to the design-driven computation community. A bigger and more diverse community would take this far in a short time. TTS-bending is going to be fun in the near future.
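For the pause workaround mentioned above, a rough Python sketch might look like the following. It assumes the standard SSML `<break time="…"/>` element, a naive sentence split on `.`, `!` and `?`, and a 500 ms default that I picked arbitrarily; the function name and Dorcy's lines are made up for illustration.

```python
import re
from xml.sax.saxutils import escape


def add_sentence_breaks(text: str, pause_ms: int = 500) -> str:
    """Insert an explicit SSML pause between sentences (hypothetical helper)."""
    # Naive split: after ., ! or ? followed by whitespace.
    # Abbreviations such as "e.g. " would be split too.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    pause = f'<break time="{pause_ms}ms"/>'
    return "<speak>" + pause.join(escape(s) for s in sentences) + "</speak>"


print(add_sentence_breaks("Go away. No candy here!", pause_ms=700))
# <speak>Go away.<break time="700ms"/>No candy here!</speak>
```

The pause length is something to tune by ear for each voice; the point is only that the pause becomes explicit instead of depending on how the service reads punctuation.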