An Introduction to VoiceXML


VoiceXML (also widely known as VXML) is the World Wide Web Consortium’s (W3C) standard XML (Extensible Markup Language) format for specifying interactive voice dialogues, typically delivered via an Interactive Voice Response (IVR) system.
The landscape for providers of traditional phone services has shifted: the rapid growth of the Internet has created a new set of users who access information and services through the Web, and providers are finding it easier to develop new services that exploit the power of Web technology.

VoiceXML provides the best of both worlds:

  • By using VXML, providers can open up their new Web services to customers using phone-based voice interfaces.
  • Organisations can now build automated voice services using the same technology they use to create visual Web sites, significantly cutting the cost of building and delivering new capabilities for the traditional phone customer.

Just as HTML (HyperText Markup Language) is used to represent visual (web-based) applications, VXML enables voice applications to be developed and then interpreted by a VoiceXML browser. A VXML browser is a web browser that presents an interactive voice user interface to the caller and provides an interface to the public switched telephone network (PSTN) or a Private Branch Exchange (PBX). It presents information aurally using text-to-speech software or pre-recorded audio file playback, and obtains information from the caller through keypad entry (DTMF detection) or Automatic Speech Recognition (ASR).
As with any XML-based language, VXML has ‘tags’ that instruct the VXML browser how to interact with the caller or communicate with other interfaces supported by the standard. VXML pages are served using HTTP as the transport protocol. Application execution environments like Vicorp’s xMP generate and serve dynamic VoiceXML pages, allowing more flexible implementations than would be possible with static VoiceXML pages.
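
To make the idea of ‘tags’ concrete, the following is a minimal sketch, in Python, of how a server-side program might build a small VoiceXML 2.0 page on the fly. It is not how xMP works internally. The function name build_page, the prompt wording and the "route" target are assumptions made for illustration, and support for the built-in "digits" field type can vary between platforms.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"
ET.register_namespace("", VXML_NS)  # serialise with a default xmlns


def q(tag: str) -> str:
    """Qualify a tag name with the VoiceXML namespace."""
    return f"{{{VXML_NS}}}{tag}"


def build_page(caller_name: str) -> str:
    """Return a small VoiceXML page as a string (illustrative only)."""
    vxml = ET.Element(q("vxml"), {"version": "2.0"})
    form = ET.SubElement(vxml, q("form"), {"id": "main"})

    # <field> collects one piece of input from the caller,
    # here via keypad (DTMF) or spoken digits.
    field = ET.SubElement(form, q("field"),
                          {"name": "choice", "type": "digits"})

    # <prompt> is rendered to the caller using text-to-speech.
    prompt = ET.SubElement(field, q("prompt"))
    prompt.text = (f"Hello {caller_name}. "
                   "Press 1 for sales or 2 for support.")

    # <filled> runs once the field has a value; <submit> sends it
    # back to the server over HTTP. "route" is a hypothetical URL.
    filled = ET.SubElement(field, q("filled"))
    ET.SubElement(filled, q("submit"),
                  {"next": "route", "namelist": "choice"})

    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    print(build_page("Alice"))
```

Because the page is produced per request, the prompt text can vary from call to call, which is the flexibility that dynamic pages offer over static ones.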


Service Creation Environments (SCEs) such as xMP Director enable non-technical users to design call-flows using an easy-to-follow graphical user interface (GUI). Dynamic pages of VoiceXML are generated from the call-flow when calls are received, allowing for flexible implementations.

Non-technical users can design call-flows using a graphical user interface (GUI).
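
The step from a designed call-flow to generated VoiceXML can be sketched roughly as follows. The CALL_FLOW dictionary is a hypothetical stand-in for whatever a design tool stores; it is not the xMP Director format, and the function name render_menu and the target file names are likewise assumptions.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"
ET.register_namespace("", VXML_NS)  # serialise with a default xmlns

# Hypothetical call-flow node, as a design tool might store it.
CALL_FLOW = {
    "id": "main_menu",
    "prompt": "Welcome. For sales press 1. For support press 2.",
    "options": [
        {"dtmf": "1", "label": "sales", "next": "sales.vxml"},
        {"dtmf": "2", "label": "support", "next": "support.vxml"},
    ],
}


def render_menu(flow: dict) -> str:
    """Render one call-flow node as a VoiceXML <menu> page."""
    vxml = ET.Element(f"{{{VXML_NS}}}vxml", {"version": "2.0"})
    menu = ET.SubElement(vxml, f"{{{VXML_NS}}}menu", {"id": flow["id"]})

    prompt = ET.SubElement(menu, f"{{{VXML_NS}}}prompt")
    prompt.text = flow["prompt"]

    # Each option becomes a <choice> that can be selected by keypad
    # (dtmf) or by speaking the label, and that links to another page.
    for option in flow["options"]:
        choice = ET.SubElement(
            menu, f"{{{VXML_NS}}}choice",
            {"dtmf": option["dtmf"], "next": option["next"]},
        )
        choice.text = option["label"]

    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    print(render_menu(CALL_FLOW))
```

Keeping the call-flow as data and rendering it only when a call arrives means the design can change without anyone editing VoiceXML by hand.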

Just as many sites have an HTML presence on the web for visual browsing, many large companies also have a VXML presence for telephone-based voice browsing. The normal capabilities of the web apply, including web services, linking, markup and cross-browser support.

VoiceXML Forum

The VoiceXML Forum is an industry group promoting the use of the standard. It was formed in 1999 by IBM, AT&T, Lucent and Motorola to develop a standard markup language for specifying voice dialogues and published VoiceXML 1.0 in March 2000. It also provides a testing process that enables vendors' implementations to be certified.
VoiceXML platform vendors originally implemented the standard in different ways or added proprietary features (custom tags). The VoiceXML 2.0 standard, adopted as a W3C Recommendation in 2004, clarified the areas of difference and most vendors now comply with this version.

CCXML (Call Control eXtensible Markup Language)

CCXML is a W3C standard that is complementary to VXML. A CCXML interpreter is used to handle the initial call setup between the caller and the voice browser and to provide telephony services like call transfer and disconnect.
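
As a rough illustration of this division of labour, the sketch below embeds a small CCXML document in Python and lists the telephony events it reacts to. The document answers an incoming call, starts a VoiceXML dialogue and exits when the dialogue ends. The file name hello.vxml and the helper handled_events are assumptions for illustration, and a production script would normally handle more events (errors and disconnects, for example).

```python
import xml.etree.ElementTree as ET

CCXML_NS = "http://www.w3.org/2002/09/ccxml"

CCXML_DOC = f"""<ccxml version="1.0" xmlns="{CCXML_NS}">
  <eventprocessor>
    <!-- An incoming call is ringing: answer it. -->
    <transition event="connection.alerting">
      <accept/>
    </transition>
    <!-- The call is connected: hand the caller to a VoiceXML dialogue.
         src is an expression, hence the inner quotes. -->
    <transition event="connection.connected">
      <dialogstart src="'hello.vxml'"/>
    </transition>
    <!-- The dialogue has finished: end the CCXML session. -->
    <transition event="dialog.exit">
      <exit/>
    </transition>
  </eventprocessor>
</ccxml>
"""


def handled_events(ccxml: str) -> list:
    """Return the event names the document reacts to, in document order."""
    root = ET.fromstring(ccxml)
    transitions = root.iter(f"{{{CCXML_NS}}}transition")
    return [t.get("event") for t in transitions]


if __name__ == "__main__":
    for name in handled_events(CCXML_DOC):
        print(name)
```

The CCXML script decides when a dialogue runs; what is said to the caller remains in the VoiceXML page it starts.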

Associated Web Standards

There is also a suite of independent standards that are supported alongside VoiceXML. When used together these standards enable developers to create powerful applications.
These web standards include:

  • Speech Recognition Grammar Specification (SRGS)
    a document language used by developers to specify the words and patterns of words to be listened for by an Automatic Speech Recognition (ASR) engine;
  • Semantic Interpretation for Speech Recognition (SISR)
    a document format that represents annotations to grammar rules for extracting semantic results from recognition;
  • Pronunciation Lexicon Specification (PLS)
    a representation of phonetic information for use in speech recognition and synthesis;
  • Speech Synthesis Markup Language (SSML)
    a markup language for rendering a combination of prerecorded speech, synthetic speech, and music;
  • State Chart XML (SCXML)
    a markup language to simply and precisely represent the semantics of state machines.
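
To show how two of these standards fit together, the sketch below holds a small SRGS grammar in its XML form, with SISR annotations inside the tag elements, and uses Python to list each phrase alongside its annotation. This is only an inspection of the grammar, not a speech recogniser. The rule name "answer", the chosen phrases and the helper phrases_with_semantics are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"

# A yes/no grammar. The <tag> contents are SISR annotations that set
# the semantic result, so "yes" and "yeah" both yield the value "yes".
GRAMMAR = f"""<grammar version="1.0" xmlns="{SRGS_NS}" xml:lang="en-GB"
         mode="voice" root="answer" tag-format="semantics/1.0">
  <rule id="answer" scope="public">
    <one-of>
      <item>yes <tag>out = "yes";</tag></item>
      <item>yeah <tag>out = "yes";</tag></item>
      <item>no <tag>out = "no";</tag></item>
    </one-of>
  </rule>
</grammar>
"""


def phrases_with_semantics(grammar_xml: str) -> dict:
    """Map each phrase in the grammar to its SISR annotation text."""
    root = ET.fromstring(grammar_xml)
    result = {}
    for item in root.iter(f"{{{SRGS_NS}}}item"):
        tag = item.find(f"{{{SRGS_NS}}}tag")
        # item.text is the phrase that precedes the <tag> child.
        result[item.text.strip()] = tag.text.strip()
    return result


if __name__ == "__main__":
    for phrase, script in phrases_with_semantics(GRAMMAR).items():
        print(f"{phrase!r} -> {script}")
```

The grammar tells the recogniser what to listen for (SRGS), while the annotations tell the application what the caller meant (SISR), so the dialogue only has to test for "yes" or "no".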

For more information please refer to www.w3.org/Voice/


© 2013 Vicorp Services Limited. Registered in UK No: 05038031 | Registered Office: 3 Shaftsbury Court, Chalvey Park, Slough, Berkshire, SL1 2ER, UK

UK VAT registered number 174 0054 35