Contactless technology is proving invaluable during the current coronavirus pandemic, avoiding the need to exchange cash or press buttons on a chip-and-pin machine. But for many businesses, such as fast-food restaurants, the only option for taking a customer’s initial order without face-to-face contact is a touchscreen, and a screen touched by every customer actually risks spreading infection more widely.
Antibacterial touchscreens exist, but they miss an important point: the kiosk has to be perceived as safe and hygienic by customers, as well as being genuinely safe and hygienic. An anti-COVID coating is unlikely to cut it with the general public, as many people won't trust it, no matter how good it really is. Even before the current health crisis, there were reports of gut and faecal bacteria being found on every fast-food touchscreen swabbed in a study involving London Metropolitan University.
So could the latest audio technology provide a contactless solution?
If we could just ask for a burger and chips rather than having to go through a complicated on-screen menu system, we could remain socially isolated: talk to the kiosk, use contactless payment and pick up our order from a service counter. Physical contact would be reduced to a single one-way handover from the short-order chef to the customer. And it's much quicker to say "two burgers with chips, one with cola" than to go through the laborious process of finding the right buttons to press on a touchscreen.
However, there are two drawbacks with this scenario – the first is general background noise and the second is other people at neighbouring kiosks who will be making additional noise as they place their own orders.
Automatic speech recognition (ASR) has been around for decades and reliability is on an ever-upwards trend, although it is most reliable on limited vocabulary systems. But that's fine as most kiosks only need a limited vocabulary anyway. However, if you add in a noisy environment with lots of other people around, things are not so great for ASR. And, of course, touchscreen kiosks are often in high street shops with a high customer footfall – and therefore far from being quiet.
There are also likely to be several kiosks, so how can you ensure that your order for burger and chips doesn’t get mixed up with that of the person at the neighbouring kiosk ordering chicken nuggets? And what if your daughter tries to tack on a sneaky cola or ice cream?
It's tempting to think that noise suppression is the answer. After all, there have been some amazing advances in the last few years, fuelled by the artificial intelligence revolution. The results can be great if the signal is already clearly intelligible. But if you listen to what a microphone picks up in a noisy shopping mall, it is hard for a human being to understand the raw speech, let alone an ASR system. And that’s exactly where noise suppression falls down: it can’t pull out the speech cleanly from such high noise levels.
You might think beamforming holds the key – but this is surprisingly difficult to do well, unless you are willing to invest in a large number of expensive calibrated microphones. In general, unlike noise suppression, beamforming can provide some improvement in intelligibility – but nowhere near enough for a general high-volume kiosk.
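To put a rough number on "some improvement", here is a minimal sketch of the simplest beamformer, delay-and-sum, in Python with NumPy. Everything in it is an illustrative assumption rather than a description of any real kiosk: a pure tone stands in for the customer's voice, the noise is independent at each of four microphones, and the arrival delays (the hypothetical `delays` list) are known exactly and are whole numbers of samples. Knowing those delays precisely is where the calibration burden comes from in practice.

```python
import numpy as np


def delay_and_sum(mic_signals, delays):
    """Undo each microphone's known integer delay, then average the channels."""
    aligned = [np.roll(sig, -d) for sig, d in zip(mic_signals, delays)]
    return np.mean(aligned, axis=0)


def snr_db(clean, noisy):
    """Signal-to-noise ratio in dB, given the clean reference signal."""
    noise = noisy - clean
    return 10 * np.log10(np.mean(clean**2) / np.mean(noise**2))


rng = np.random.default_rng(1)
fs = 16_000                              # sample rate in Hz (assumed)
t = np.arange(fs) / fs                   # one second of audio
speech = np.sin(2 * np.pi * 220 * t)     # tone standing in for the talker

# Hypothetical arrival delays in samples at each of four microphones.
delays = [0, 2, 4, 6]

# Each microphone hears a delayed copy of the talker plus its own noise.
mics = np.array(
    [np.roll(speech, d) + rng.normal(scale=1.0, size=fs) for d in delays]
)

output = delay_and_sum(mics, delays)
single_mic_snr = snr_db(speech, mics[0])
beamformed_snr = snr_db(speech, output)
gain_db = beamformed_snr - single_mic_snr

print(f"single microphone SNR: {single_mic_snr:.1f} dB")
print(f"delay-and-sum SNR:     {beamformed_snr:.1f} dB")
print(f"gain:                  {gain_db:.1f} dB")
```

Averaging N microphones with uncorrelated noise improves the signal-to-noise ratio by about 10·log10(N) dB, so roughly 6 dB for four microphones, and only when the delays are right. Real interference, such as a neighbouring customer's voice, is not uncorrelated between microphones, so this idealised case is if anything optimistic.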
This is where blind source separation (BSS) comes into its own. It simply needs between four and eight off-the-shelf microphones of the sort found in a mobile phone, with no calibration required. The array geometry is flexible: anything between 5cm and 30cm across. Ideally there would be a clear 'line of sight' to the customer and, if space allows, a 2D array is preferable, but a linear array also works.
BSS can separate the incoming audio back into its constituent sources automatically. So not only is the customer's voice brought out of the background noise, it is clearly distinguished from the voice of the person at the neighbouring kiosk. It can even separate you from your daughter trying to add that sneaky ice cream.
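The idea of separating sources "blind" can be illustrated with a toy example. The sketch below uses independent component analysis (ICA), one classic family of BSS methods. It is not necessarily the algorithm a commercial kiosk would use, and it makes strong simplifying assumptions: only two sources and two microphones, an instantaneous mix with no echoes or delays, and random noise with speech-like statistics standing in for real voices. The mixing matrix and all function names are hypothetical.

```python
import numpy as np


def whiten(x):
    """Centre the mixtures, then decorrelate them and scale to unit variance."""
    x = x - x.mean(axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eigh(np.cov(x))
    return (eigvecs / np.sqrt(eigvals)).T @ x


def excess_kurtosis(y):
    """Per-row excess kurtosis: zero for Gaussian data, non-zero otherwise."""
    return np.mean(y**4, axis=1) / np.mean(y**2, axis=1) ** 2 - 3.0


def separate_two_sources(mixtures, n_angles=180):
    """Toy two-channel ICA: after whitening, search for the rotation
    that makes the two outputs as non-Gaussian as possible."""
    z = whiten(mixtures)
    best_score, best = -np.inf, z
    for theta in np.linspace(0.0, np.pi / 2, n_angles, endpoint=False):
        c, s = np.cos(theta), np.sin(theta)
        y = np.array([[c, s], [-s, c]]) @ z
        score = np.sum(np.abs(excess_kurtosis(y)))
        if score > best_score:
            best_score, best = score, y
    return best


rng = np.random.default_rng(0)
n = 20_000
customer = rng.laplace(size=n)           # spiky stand-in for the customer
neighbour = rng.uniform(-1, 1, size=n)   # stand-in for a competing source
sources = np.vstack([customer, neighbour])

# Hypothetical mixing: each microphone hears both sources at different levels.
mixing = np.array([[1.0, 0.6],
                   [0.5, 1.0]])
mixtures = mixing @ sources

estimates = separate_two_sources(mixtures)

# Rows are true sources, columns are recovered signals. ICA cannot know the
# original order, scale or sign, so compare absolute correlations.
corr = np.abs(np.corrcoef(np.vstack([sources, estimates]))[:2, 2:])
print(np.round(corr, 2))
```

The algorithm is never told the mixing matrix or the microphone positions; it relies only on the sources being statistically independent and non-Gaussian. A real room adds reverberation and more sources than this toy handles, which is one reason a practical system uses a larger array and more sophisticated methods.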
This is all done with data-driven machine learning. The system is continuously analysing the sound field and can pick out the speech of the person in front of the kiosk – adapting automatically to the lunchtime rush or the quiet of a 2am motorway pitstop. Just like a human – but no social distance required.
Perhaps it’s time to put a 'Do not touch' sign on our touchscreens…