Minimizing the use of digits with voice biometrics. Maximizing security

When choosing a token for enrolling and verifying a biometric credential, behavioral biometrics present a challenge that doesn’t exist for physical biometrics. For physical biometrics such as fingerprint, palm or iris, the choice is simply which finger(s), hand or eye to use. However, with behavioral biometrics, such as voice or handwriting, the words that are chosen as the token are as critical to the overall process as the underlying biometric technology.

Voice biometric options

In the case of voice biometrics (also known as speaker verification), there are three categories of tokens that can be used as a credential: digits, words and phrases.

If all three categories worked equally well and raised similar, if not identical, security concerns, users would clearly prefer digits. People are used to remembering phone numbers and short digit strings, so being asked to say either a known digit string (e.g. phone number, account number) or a randomly generated series of four to six digits is not obtrusive. However, from an overall security and technology perspective, digits should not be used for primary biometric verification.

A historical view of the use of digits

To understand the technical concerns about the use of digits, it is important to understand how most biometric engines implement digit-based verification. A distinctive quality of digits, unlike words, is that regardless of their order in a string, most users tend to say each digit clearly and with even inter-digit spacing (pauses). For example, the “2” in the series 1-2-4-5 and 1-3-5-2 would be said the same way.

When enrolling, the biometric engine separates each of the digits and enrolls each one independently. Then, during verification, the engine (or the surrounding business logic) generates a random digit string, which can then be used as the token. The idea is that by combining digits with randomization, you add in a factor of ‘liveness testing’ - ensuring that you are dealing with a human, not a recording.

It is key to note, however, that the engine is not actually verifying the digits as a single utterance. Instead, the engine is separating the digits, verifying each one independently, then ensuring that the order of the digits is correct.

Words or phrases, on the other hand, are verified as a single utterance.
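
The per-digit flow described in this section can be sketched as follows. This is a minimal illustration, not any vendor's implementation: `score_digit`, `make_challenge` and `verify_digits` are hypothetical names, a voiceprint is reduced to a single number, and the 0.8 threshold is arbitrary.

```python
import random

def score_digit(enrolled_model, spoken_sample):
    """Toy similarity in [0, 1]; a real engine compares acoustic features."""
    return max(0.0, 1.0 - abs(enrolled_model - spoken_sample))

def make_challenge(length=6, rng=None):
    """Generate the random digit string the caller is asked to repeat."""
    rng = rng or random.Random()
    return [rng.randrange(10) for _ in range(length)]

def verify_digits(enrolled, challenge, spoken, threshold=0.8):
    """Verify a response given as (recognized_digit, sample) pairs.

    Each digit segment is scored against its own enrolled model, and the
    recognized digits must match the challenge in the same order.
    """
    recognized = [digit for digit, _ in spoken]
    if recognized != challenge:  # wrong digits or wrong order
        return False
    return all(score_digit(enrolled[digit], sample) >= threshold
               for digit, sample in spoken)
```

Nothing in `verify_digits` treats the response as one continuous utterance, which is the contrast with word or phrase verification drawn above.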

Increasing Accuracy

When performing a voice biometric authentication, it is generally accepted among researchers that the platform performs better when longer utterances are used as a token. The reason is simple: the more data points available to the engine during a verification, the lower the chance that missing any individual data point (due to changes in how one speaks, a missed audio segment due to line issues, etc.) will impact the overall scoring process. With digits, each number is recognized and evaluated on its own, rather than as part of a continuous segment of audio (as with words or phrases). Because the individual audio segments are significantly shorter, fewer data points are available for validation, which means that even without any channel issues (such as enrolling on a landline and verifying on a mobile), the engine will find it harder to trap impostor and digital replay attacks.

One point to consider is that because so many engines first implemented digits, the technology providers generally have more comprehensive background models when it comes to numbers. This means that though there are fewer data points available to the engine, it knows which data points are more indicative of a particular user’s voice and which are generally found in the calling population. In some cases, the benefit of the more comprehensive background model is significant, as an engine that has been ‘tuned’ to perform well using digits may start to exhibit reduced accuracy when switching to phrases. Typically, vendors can supply updated background models or can assist with the recalibration of a voice biometric engine if changing to phrases causes any impact.

Minimizing Cross Channel Effect

Digits are prevalent in internal voice biometric applications, especially ones targeting services such as password reset. The reason is clear: these applications were designed for internal use, so users typically enroll and verify from their desk phone. Because people use the same phone (channel) for enrolling and verifying, that consistency, combined with other security factors (such as the user having to be at the office to call for the password reset), mitigated the accuracy issues that the inherently shorter duration of digits would otherwise cause, as described in the previous section.

However, in the real world, people call in from all kinds of devices, including VoIP, mobile (GSM) and cordless phones, all of which introduce elements of noise reduction and compression. This reduces the number of matchable data points, which in turn increases the incidence of false rejects. Combine the inherent data loss introduced by cross channel factors with the fact that digits provide fewer data points to match on, and the effect can be a tenfold increase in false rejects when combining digits and compressed audio.

To combat the rise in false rejects, vendors set the security thresholds to be more ‘forgiving’ of lower match rates, which brings down the false rejects at the cost of increasing false accepts. In other words, security is compromised.
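
The trade-off can be illustrated with a small calculation. The scores below are made up purely for illustration and are not measurements from any engine; `error_rates` is a hypothetical helper that assumes a caller is accepted when the match score is at or above the threshold.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false_reject_rate, false_accept_rate) at a given threshold."""
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    return (false_rejects / len(genuine_scores),
            false_accepts / len(impostor_scores))

# Invented match scores: genuine callers score lower than usual because of
# short digit segments and compressed audio (illustrative values only).
genuine = [0.55, 0.62, 0.70, 0.78, 0.85, 0.91]
impostor = [0.20, 0.35, 0.48, 0.58, 0.66, 0.72]

strict = error_rates(genuine, impostor, threshold=0.75)     # (0.5, 0.0)
forgiving = error_rates(genuine, impostor, threshold=0.50)  # (0.0, 0.5)
```

With these invented numbers, lowering the threshold removes every false reject but lets half of the impostor attempts through.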

Using 1.5-2.5 seconds of continuous audio as a token (either word(s) or a phrase) combats the cross channel effect, although it does not eliminate it. Giving the engine more data points reduces false rejects without increasing the number of false accepts.

Maximizing Potential Randomness

It is generally accepted that there are 44 phonemes in the English language, and digits use only a small subset of them. It is also accepted that in all security processes, maximizing the amount of potential randomness, or entropy, between tokens increases a system’s overall security. By restricting the number of phonemes used in a population’s tokens (e.g. all enrolling with the exact same digits), you effectively reduce the maximum amount of security a solution can provide. It is recommended as a best practice not only to use phonetically rich word(s) or phrases, but also to use randomization to make sure that only a percentage of the population enrolls and verifies with a specific word or phrase.
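
To give a rough sense of how few phonemes the digits cover, the sketch below counts the distinct sounds in the ten spoken digits. The transcriptions are approximate ARPAbet-style renderings supplied here as an assumption; they vary by speaker and dialect, and the symbol set is coarser than the 44-phoneme inventory, so the resulting ratio is indicative only.

```python
# Approximate ARPAbet-style transcriptions (an assumption for illustration;
# actual pronunciations vary by speaker and dialect).
DIGIT_PHONEMES = {
    "zero": ["Z", "IY", "R", "OW"],
    "one": ["W", "AH", "N"],
    "two": ["T", "UW"],
    "three": ["TH", "R", "IY"],
    "four": ["F", "AO", "R"],
    "five": ["F", "AY", "V"],
    "six": ["S", "IH", "K", "S"],
    "seven": ["S", "EH", "V", "AH", "N"],
    "eight": ["EY", "T"],
    "nine": ["N", "AY", "N"],
}

ENGLISH_PHONEME_COUNT = 44  # figure cited in the text

def distinct_phonemes(words):
    """Collect the set of phoneme symbols used by the given digit words."""
    return {p for word in words for p in DIGIT_PHONEMES[word]}

covered = distinct_phonemes(DIGIT_PHONEMES)
coverage = len(covered) / ENGLISH_PHONEME_COUNT
```

Under this transcription the ten digits use 19 distinct symbols, well under half of the 44 phonemes cited above.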

Reducing Violation of the Identification/Verification Process

Security analysts consider it paramount that different tokens are used for identification and for verification, and that publicly available information is never used for verification.

For example, account numbers and telephone numbers are never considered ‘personal’ or even ‘semi-private’. Account numbers appear on mailed bank statements and on the bottom of cheques, and are routinely provided to other entities to facilitate EFT/ACH transactions. Phone numbers, likewise, are easily found by calling directory enquiries. Such public information should only be used to establish a claimed, unverified identity, which then needs to be authenticated before access is provided.

From a security perspective, the use of shared secrets is always recommended - be it a password, passphrase, PIN or an answer to a security question (however, not one based on biographical information, such as mother’s maiden name which can be gleaned from public databases). For voice biometrics, these shared secrets can be unprompted (e.g. please say your password now) or prompted (e.g. please repeat “Boston is the capital of Massachusetts”), since these phrases are used only for accessing that specific system. From a pure security perspective, an enterprise can implement a secret numeric PIN and still comply with the shared secret model, though it does run against the entropy rule and increases the cross channel effect.

Implementing digits as a primary token tends to lead to a slippery slope that encourages the combination of identification and verification in the name of ‘customer experience’ and needs to be watched carefully.

Reducing Digital Replay Attack

At government trials in the United States, it was shown to be possible to capture all ten digits from a user within a single day. By simply placing a recorder on a phone line or putting a microphone on a user’s desk, it was possible to collect each digit quickly as the person said times, dates, phone numbers and credit card numbers. A researcher then built a simple ‘calculator’ using PowerPoint. The system would ask the user to repeat a randomized six-digit code and the impostor would simply click the appropriate buttons on the ‘calculator’, which would speak back the digits with an alarmingly high acceptance rate.
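
The arithmetic behind this weakness is simple. Assuming each position of the challenge is drawn uniformly and independently from the ten digits (an assumption; the text does not say how codes are generated), an impostor holding recordings of k distinct digits can answer a fraction (k/10)^n of all n-digit challenges. The helper names below are hypothetical.

```python
def replayable_fraction(captured_digits, code_length=6):
    """Fraction of random codes answerable from recordings of these digits.

    Assumes every position is chosen uniformly and independently from 0-9.
    """
    k = len(set(captured_digits))
    return (k / 10) ** code_length

def can_replay(captured_digits, challenge):
    """True if every digit in the challenge has already been recorded."""
    return set(challenge) <= set(captured_digits)

all_ten = replayable_fraction(range(10))      # every challenge is answerable
seven_of_ten = replayable_fraction(range(7))  # roughly one challenge in nine
```

Once all ten digits are recorded, randomizing the challenge no longer offers any protection against replay.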

When using phrase-based authentication, the chance of a fraudster being able to capture a user providing their verification credential is significantly reduced, especially if you add randomization to the process (e.g. enroll three phrases and verify with one on each call). In this case, the fraudster would need to capture an actual enrollment or verification in order to mount a replay attack.
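
The randomization mentioned here (enroll three phrases, verify with one on each call) might look like the following minimal sketch. The store is a plain dictionary and `enroll_phrases` and `pick_prompt` are hypothetical names; a real deployment would hold voice templates rather than phrase text.

```python
import random

def enroll_phrases(store, user_id, phrases, required=3):
    """Record the phrases a user has enrolled; insist on distinct phrases."""
    unique = list(dict.fromkeys(phrases))  # drop duplicates, keep order
    if len(unique) < required:
        raise ValueError(f"need at least {required} distinct phrases")
    store[user_id] = unique

def pick_prompt(store, user_id, rng=None):
    """Choose one enrolled phrase at random for this call's verification."""
    rng = rng or random.Random()
    return rng.choice(store[user_id])
```

A fraudster who has captured one verification call then holds only one of the user's phrases, and cannot know in advance which phrase the next call will ask for.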

Revocation of Credentials

From a security perspective, being able to revoke a credential, such as a password or PIN, when fraud is detected is considered paramount.

In a traditional security sense, if a user forgets his or her password, loses a smart card or destroys a secure ID token, the security administrator can simply invalidate that credential and issue a new one.

This is one of the problems inherent in the use of physical biometrics: since they are, by their nature, unchangeable, they are not revocable. However, in most cases, physical biometrics are not implemented in an unmanaged environment. Fingerprint and iris/retina scanners used for border control and physical access are typically manned, so an attempt to present a false finger or eye would be noticed and treated as suspicious. For remote physical biometrics, such as using a fingerprint scanner to gain access to a secure website, another factor, such as a PIN, is typically introduced.

Voice biometrics is a behavioral biometric, which by its nature can produce a revocable credential. Since text-dependent voice biometric templates combine who you are with what you are saying, if a specific credential, such as a word or phrase, is compromised, a security officer can simply invalidate that token and have the user enroll an additional phrase or word(s).
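
As a minimal sketch of what revocation means for phrase credentials, the hypothetical `VoiceCredentials` class below tracks active and revoked phrases for one user. It models only the bookkeeping; the voice templates themselves are out of scope.

```python
from dataclasses import dataclass, field

@dataclass
class VoiceCredentials:
    """Bookkeeping for one user's enrolled phrases (hypothetical model)."""
    active: set = field(default_factory=set)
    revoked: set = field(default_factory=set)

    def enroll(self, phrase):
        """Add a new phrase; a compromised phrase may not be reused."""
        if phrase in self.revoked:
            raise ValueError("a revoked phrase cannot be re-enrolled")
        self.active.add(phrase)

    def revoke(self, phrase):
        """Invalidate a compromised phrase so it can no longer verify."""
        self.active.discard(phrase)
        self.revoked.add(phrase)

    def is_usable(self, phrase):
        return phrase in self.active
```

The pool of possible replacement phrases is effectively unlimited, so a user can always be moved to a fresh credential. With only ten digits to choose from, there is nothing new to enroll.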

However, if you use digits as the primary token, whether in combination with randomization, public information (date of birth, phone number, account number) or the identifier, and a fraudster captures the user speaking all ten digits, there is no recourse for revoking the credential.

Capturing digits is a very simple process for identity thieves and other fraudsters. Simply calling a target, pretending that the call was to the wrong number and asking what number he or she dialled is enough to get a large subset of the digits. Simple social engineering in follow-up calls can easily fill in the gaps.

In the case of randomized digits, once a fraudster has a recording of them, changing the order will not defeat the attack. Similarly, changing from one public piece of information to another will not re-establish security for that user. Since digits, unlike phrases or words, cannot simply be re-enrolled, that user is now permanently compromised. Using digits with a secret PIN is slightly more secure, but no more so than a DTMF PIN would be in a similar situation.

Closing the case

Though selecting an appropriate series of phrases, one that is relatively time agnostic and phonetically rich, can be a challenge, the benefits in both biometric engine performance and security more than outweigh customer preference for speaking digits rather than phrases.


© 2013 Vicorp Services Limited. Registered in UK No: 05038031 | Registered Office: 3 Shaftsbury Court, Chalvey Park, Slough, Berkshire, SL1 2ER, UK