Planning for the Deployment of Speaker Verification

The most significant risk in building speaker authentication into voice applications isn’t the core technology; it’s how the technology is implemented.

In presentations from voice biometric engine developers, many stress the similarities between speaker verification technology and speech recognition. They explain how the core algorithms used in speech recognition and text-dependent speaker verification are essentially the same, and how the bulk of the process for determining the start and end of an utterance is exactly the same for recognition and verification.

Though these statements are factually correct, they can also be horribly misleading. Verification and recognition share a number of technological components, but implementing a successful speech recognition application has very little in common with implementing a secure speaker verification system.

Implementing speech recognition – giving control to the voice application developer

Installing speech recognition technology in an IVR (Interactive Voice Response) system is relatively straightforward: take the speech recognition engine, integrate it with the platform through either MRCP (Media Resource Control Protocol) or a direct API (Application Programming Interface), and make the technology available to the voice application developer. In this case, one could say that the speech engine becomes a resource of the IVR.

The goal, when building speech recognition-enabled voice applications, is to make as much control as possible available to the voice application developer, since he or she will need to constantly tweak aspects such as recognition thresholds and grammars. When the speech recognition engine processes a request, it returns a confidence score that the voice application then uses to determine whether it recognized what the caller said or failed to.

Implementing speaker verification – giving control to IT security

When we look at the way that many vendors recommend, and many enterprises deploy, speaker verification, they apply a model similar to that of speech recognition: integrate the speaker verification engine with the IVR platform, expose a command, control or process to the voice application developer, and let him or her start building applications.

The result of this model is that each voice application developer ends up being responsible for how speaker verification gets implemented within the voice application. Since the speaker verification engine returns a score (like the speech recognition engine does), the voice application developer can set logic within the voice application to determine if the score is high enough to be considered a ‘pass’ or ‘fail’.

Though this works at a purely technical level, it is an inherently insecure way of building authentication into voice applications.

When building basic verification – such as using an account number and PIN – into an enterprise voice application, the voice application typically has a loose connection (e.g. via a web service) to an authentication subsystem. Sometimes this system is integrated into a CRM database and sometimes it’s part of a larger authentication, authorization and fraud management system. The IVR, as the channel, passes the account number and PIN to the authentication subsystem, and that system responds with a pass, fail or retry (sometimes just pass/fail). It’s the same way that enterprise web applications work: the username and password are passed to the authentication subsystem, which then responds with a pass/fail/lockout. All the application needs to do is respect the response coming back from the authentication subsystem.
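The division of labour described above can be sketched in a few lines. This is only an illustration under stated assumptions: `AuthSubsystem`, `AuthResult`, `ivr_login` and the three-attempt limit are hypothetical names and values, not part of any particular product, and an in-memory dictionary stands in for the real credential store and web service call.

```python
from dataclasses import dataclass, field
from enum import Enum
import hmac


class AuthResult(Enum):
    PASS = "pass"
    FAIL = "fail"
    RETRY = "retry"


@dataclass
class AuthSubsystem:
    """Hypothetical central service: owns the credentials and the retry policy."""
    pins: dict[str, str]
    max_attempts: int = 3  # assumed policy value, set by IT security
    _failures: dict[str, int] = field(default_factory=dict)

    def verify(self, account: str, pin: str) -> AuthResult:
        failures = self._failures.get(account, 0)
        if failures >= self.max_attempts:
            return AuthResult.FAIL  # account is locked out
        expected = self.pins.get(account)
        # Constant-time comparison of the stored and supplied PIN.
        if expected is not None and hmac.compare_digest(
            expected.encode(), pin.encode()
        ):
            self._failures[account] = 0
            return AuthResult.PASS
        failures += 1
        self._failures[account] = failures
        if failures < self.max_attempts:
            return AuthResult.RETRY
        return AuthResult.FAIL


def ivr_login(auth: AuthSubsystem, account: str, pin: str) -> str:
    """The IVR only relays the credentials and respects the response."""
    result = auth.verify(account, pin)
    if result is AuthResult.PASS:
        return "main_menu"
    if result is AuthResult.RETRY:
        return "reprompt"
    return "transfer_to_agent"
```

Note that the voice application never sees the stored PIN or the attempt counter; it only maps the subsystem's answer to the next dialogue state.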

Understanding enterprise authentication processes

There’s a reason for this distributed architecture. By separating the authentication process from the application, security specialists within the enterprise can control access from a central point. This sort of policy management is critical for reducing fraud: blocking access from certain geographic regions on a per-user or enterprise-wide level, flagging usage patterns, etc. The application developer (and user interface specialist) can focus on providing content for users, and the security specialist can focus on authentication and authorization. Just as a security specialist has no business changing a voice application (and should have no access to it), the application developer has no business determining acceptable IT security policies.

From a speaker verification perspective, there’s more risk involved than in a simple binary check of whether a password or PIN matches what is in a CRM field. Since speaker verification engines return scores instead of a conclusive ‘pass’ or ‘fail’, the system needs to take the raw score, apply a level of business logic and determine whether the score really represents a pass or a fail.

Integrating speaker verification at the enterprise level

In the speech-centric model, the score would be passed back to the voice application, which would then compare the score against a pre-determined threshold. Each application would need to implement this process. This means that ensuring every application uses the same process to determine when a user should be considered authentic depends on human procedure rather than on the system itself.
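To make the weakness concrete, here is a minimal sketch of what the speech-centric model tends to look like in practice. The application names and cutoff values are invented for illustration; the point is only that each check lives in a different code base and nothing forces them to agree.

```python
# Hypothetical per-application checks, each written by a different developer.
# Scores are assumed to be percentages returned by the verification engine.

def banking_app_accepts(score: int) -> bool:
    return score >= 80  # cutoff chosen by the banking application's developer


def info_line_accepts(score: int) -> bool:
    return score > 70  # a different cutoff, and a different comparison


def loyalty_app_accepts(score: int) -> bool:
    return score >= 75  # a third value, set independently
```

With these definitions the same caller, producing the same score of 72, is rejected by two applications and accepted by the third, and only a manual review of all three code bases would reveal it.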

Compare this to the benefit of handling the speaker authentication process within the authentication subsystem. In this case, a security architect can create applicable policies, which are then applied centrally. For example, the security architect can say that the engine must return a confidence score of at least 80%, no matter which voice application made the request, before access is granted. Conversely, the security architect could decide that for certain voice applications, such as an information-only application, a lower score can represent a pass. The authentication subsystem could even change thresholds based on other information (is the user calling from a number in the CRM system, is the user calling in from a blacklisted region, is the user calling outside his or her normal usage patterns, has the user been flagged for possible fraud). Regardless of how the policies are set up, the fact is that they are set by the appropriate personnel: those who have ultimate responsibility for ensuring the overall security of end-user-facing applications.
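A centrally managed policy of this kind might be sketched as follows. Everything specific here is an assumption made for illustration: the names `CallContext`, `required_score` and `decide`, the 60% threshold for the information-only application, the 5-point adjustments, and the choice to refuse flagged or blacklisted callers outright. Only the 80% default comes from the example above.

```python
from dataclasses import dataclass
from typing import Optional

# Policy values owned by the security architect (percentages).
DEFAULT_THRESHOLD = 80
APP_THRESHOLDS = {"info_line": 60}  # assumed lower bar for information-only use


@dataclass(frozen=True)
class CallContext:
    application: str
    number_in_crm: bool = True
    blacklisted_region: bool = False
    unusual_usage: bool = False
    fraud_flag: bool = False


def required_score(ctx: CallContext) -> Optional[int]:
    """Return the score needed for a pass, or None if voice alone is not enough."""
    if ctx.fraud_flag or ctx.blacklisted_region:
        return None
    threshold = APP_THRESHOLDS.get(ctx.application, DEFAULT_THRESHOLD)
    if not ctx.number_in_crm:
        threshold += 5  # assumed adjustment for an unknown calling number
    if ctx.unusual_usage:
        threshold += 5  # assumed adjustment for atypical usage
    return min(threshold, 99)


def decide(score: float, ctx: CallContext) -> str:
    """Turn the engine's raw score into the only answer the IVR ever sees."""
    threshold = required_score(ctx)
    if threshold is None or score < threshold:
        return "fail"
    return "pass"
```

For instance, `decide(82, CallContext("banking"))` yields a pass, while the same score from a number that is not in the CRM system yields a fail, and neither outcome required a change to any voice application.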

Therefore, placing any aspect of voice authentication, be it user tables, the core speaker verification engines or the logic that decides what is a ‘pass’ or ‘fail’, in the hands of the voice application developer or under the control of the voice application just doesn’t make sense.

 

© 2013 Vicorp Services Limited. Registered in UK No: 05038031 | Registered Office: 3 Shaftsbury Court, Chalvey Park, Slough, Berkshire, SL1 2ER, UK
