Q&A with Dr. Hamid Nawab
Tell us about yourself — what is your background in and how did you end up in your current position?
I was born in England, and my parents were from Pakistan, so I had a mixture of education in England and Pakistan, moving around frequently as a child. I went to college and graduate school at MIT in Boston and I’ve been here ever since. I love Boston — it is the first place where somebody can stop and ask me how to get somewhere, and I know all the streets.
I worked for three years at MIT, and then moved over to Boston University, where I’m currently a professor. A few years back, I was approached by Ken Sutton regarding an idea for a company. There were some broadcast studio methodologies that he and a friend of his had developed, and they wanted to explore whether those could actually be automated. Their idea required a combination of artificial intelligence and signal processing, and that’s my academic specialty.
I came up with the original automation by combining artificial intelligence and signal processing to replace the human element in audio processing. One thing led to another, and I became a co-founder and chief scientist at Yobe, and it has been a wonderful partnership.
What prompted your shift from surface electromyography research to voice tech?
Well, I don’t look at it as a switch. If you look at my research, there’s one common element in that, and that’s superposition. I have always been intrigued, since my college days, with the idea of superposition — and actually more precisely, undoing superposition. [laughs]
To an engineer, superposition means a state in which two or more things are mixed together. And so you can have a superposition of voices, or a superposition of muscle signals, such as the signals from the brain that control our muscles. Whether you have a microphone with a number of people speaking, or an EMG needle poked into somebody’s skin near a muscle, or a surface EMG, the sources are all, if you will, talking simultaneously. The muscle responds to what the neurons are telling it, and the problem is that a lot of them are talking simultaneously. You need to figure out what each one is saying in order to properly decode the message.
The analogy in speech processing is known as the “cocktail party problem,” where you have several people talking, having their own conversations, and maybe there’s music and other kinds of everyday sounds in the environment. You and I could have a conversation in this environment without missing a word, but for voice processing, all of those sounds get superimposed, and that has to be undone to get to the one signal you’re after.
In a sense, that’s what Yobe is all about — when you have multiple sounds and you’re trying to listen to somebody specifically. When you want your device to listen to you when you talk to it, while the TV is on, while the blender is on, while there are other noise sources in the environment. That is a problem of superposition that has to be undone. Yobe is in the business of undoing superposition in the context of everyday acoustic environments.
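To make the idea concrete in code, here is a minimal sketch of superposition as a microphone experiences it (purely illustrative, not Yobe’s algorithm; the sample rate, tone frequencies, and amplitudes are arbitrary stand-ins).

```python
import numpy as np

fs = 16000                                # sample rate in Hz (illustrative)
t = np.arange(fs) / fs                    # one second of time samples

s1 = 0.6 * np.sin(2 * np.pi * 220 * t)    # stand-in for the voice we want
s2 = 0.4 * np.sin(2 * np.pi * 330 * t)    # a competing talker
s3 = 0.2 * np.random.randn(len(t))        # broadband background (TV, blender, ...)

# The microphone records the superposition: a simple sum of every active source.
x = s1 + s2 + s3

# "Undoing the superposition" means recovering s1 when all you observe is x
# (and perhaps the signals from a few more microphones).
print(f"mixture RMS: {np.sqrt(np.mean(x**2)):.3f}")
```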
How is Yobe applied to all these different types of signals? And how does it answer that cocktail party problem?
I would say the most recent pre-Yobe approaches to addressing the cocktail party problem are machine learning approaches that try to train devices to tell apart, for example, the sound of a person speaking versus the background sound. Those kinds of machine learning solutions require a lot of training.
As opposed to that, our approach at Yobe is based on the idea of integrated processing and understanding of signals, which is really just a fancy way of saying we combine artificial intelligence with signal processing. Signal processing refers to the calculations that a device has to do to figure out the individual features of the incoming sound, or of multiple sounds that may be arriving simultaneously.
What’s unique about our approach is that we bring inferential artificial intelligence into the picture. Instead of focusing on machine learning per se, we focus on inference: that is, giving the device a set of rules that it can use to infer what is going on in the auditory environment. That, combined with the signal processing, makes the technology much more adaptive than the traditional approach.
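As a toy illustration of what combining hand-written inference rules with signal-processing features can look like (a hypothetical sketch, not Yobe’s actual rules, with made-up thresholds), the code below computes two classic features per audio frame and lets simple if-then rules, rather than a trained model, infer what the frame probably contains.

```python
import numpy as np

def frame_features(x, frame_len=400):
    """Split a 1-D signal into frames and compute two classic signal-processing
    features per frame: short-time energy and zero-crossing rate."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return energy, zcr

def infer_frame_label(energy, zcr, energy_thresh=1e-3, zcr_thresh=0.25):
    """Hand-written inference rules (no training data): voiced speech tends to
    have high energy and a low zero-crossing rate; broadband noise tends to have
    a high zero-crossing rate. The thresholds here are illustrative only."""
    if energy > energy_thresh and zcr < zcr_thresh:
        return "likely voiced speech"
    if energy > energy_thresh:
        return "energetic but noise-like (fricative or background?)"
    return "silence or low-level background"
```

Real inferential systems use far richer features and rules, but the structure is the same: the signal processing supplies the evidence, and the rules supply the reasoning.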
We use all the modern ways of doing signal processing, including techniques like blind source separation or beamforming, which specifically takes advantage of the fact that you can tell apart sounds coming from different directions by filtering spatially. This works to a certain extent, but the problem is that real life doesn’t always cooperate. In practical, real-world environments sound sources move and fluctuate — you cannot just assume that everything will be fixed.
Making it more complicated is the fact that typically you’re sitting in a room, and the sounds are bouncing around off the walls and the ceiling. So when I speak and you hear me, you are not just hearing me — you’re also hearing the echoes bouncing off the walls and the ceiling and the floor. And so, if you beamform just in my direction, you still get a lot of the echoes, which come from different directions.
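To show what “filtering spatially” means in practice, here is a minimal sketch of textbook delay-and-sum beamforming (a generic technique, not Yobe’s method). Each microphone channel is time-aligned toward a chosen look direction and the channels are averaged, so sound from that direction adds coherently while sound from other directions tends to cancel; as noted above, it does nothing about the reflections that arrive from elsewhere.

```python
import numpy as np

def delay_and_sum(mic_signals, mic_positions, look_direction, fs, c=343.0):
    """Textbook delay-and-sum beamformer (illustrative sketch).

    mic_signals:    array of shape (num_mics, num_samples)
    mic_positions:  array of shape (num_mics, 3), in metres
    look_direction: unit 3-vector pointing from the array toward the talker
    """
    num_mics, num_samples = mic_signals.shape
    # A plane wave from look_direction reaches microphones closer to the talker
    # earlier; delay those channels so that all channels line up in time.
    delays = mic_positions @ look_direction / c      # seconds, per microphone
    delays -= delays.min()                           # make all delays non-negative
    output = np.zeros(num_samples)
    for m in range(num_mics):
        shift = int(round(delays[m] * fs))           # coarse integer-sample alignment
        output += np.roll(mic_signals[m], shift)     # np.roll wraps around; fine for a sketch
    return output / num_mics
```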
That makes it very important that we recognize not just that there is a voice: our technology also looks at what are called the biometrics of the voice. The biometrics help us identify the uniqueness of the voice, the biological and the linguistic markers that are contained in the sound of any particular person speaking. Preserving those markers means that you can then enable an automatic speech recognition system to figure out what you’re saying, despite the fact that you may be speaking in a noisy environment.
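As one concrete, deliberately oversimplified example of a biological marker, the sketch below estimates a speaker’s fundamental frequency (pitch) from a short audio frame using autocorrelation. Real voice biometrics combine many more markers than this, and nothing here should be read as Yobe’s actual feature set.

```python
import numpy as np

def estimate_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Rough fundamental-frequency estimate for a voiced frame via autocorrelation.
    The frame should be at least fs / fmin samples long."""
    frame = frame - np.mean(frame)
    # Autocorrelation at non-negative lags.
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)          # search plausible pitch lags
    lag = lo + int(np.argmax(corr[lo:hi]))
    return fs / lag                                  # pitch estimate in Hz
```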
We bring, in my opinion, a capability for dealing with the cocktail party problem in practical, real-world environments that is orders of magnitude better than what was previously possible.
What are some of the current hurdles for voice tech that people might not be aware of?
Voice is the most natural interface we have as humans. One of the nicest things about speaking is that we can talk and do other things at the same time. We cannot generally see and do other things at the same time, because we can only attend to whatever we are seeing. If I’m driving a car, I have to focus on what’s going on in front of me. If I look down to send or read a text, I have to divert my attention from the road. However, I can be looking at the road and having a conversation with somebody simultaneously — my attention to the road is not affected. At some level, it’s surprising that voice has not developed as an interface for machines right from the start.
However, as a scientist, I can say that it is not surprising at all. Figuring out what’s being said in an everyday kind of environment is an extremely difficult task for a machine, which has to cut through the background noise and home in on a single signal. The technology is still struggling to catch up. Only recently has Amazon, to its credit, come up with Alexa and gotten people used to the fact that maybe you can talk to your devices. We now really are at the threshold of a new age, which is the age of the voice interface, and I think Yobe is very much at the forefront of that.
Our ideal is to really have that conversational capability, so that there should not be too much of a difference between my talking to you and talking to a device. If there’s noise, it shouldn’t suddenly misinterpret what I’m saying, but it also should understand my emotions; it should know whether I’m telling a joke or I’m serious.
What complexities have you faced when trying to train neural networks for voice recognition?
At Yobe, we really do not use neural networks, by design: we believe that that type of training is very, very brittle. It’s not very intelligent; it’s a way of giving something a lot of memory, but the neural net only remembers the things you have shown it before. However, when you consider an everyday acoustic environment, the range of possible scenarios that we run into, or that our devices would run into, is so huge that covering it is close to impossible. The amount of training that would be needed is exponentially large and unrealistic.
There are two things you need in order for voice tech to operate in noisy environments. First of all, inferential intelligence, to be able to infer things quickly, which we as humans do very well. We have a wealth of knowledge to draw from, and we can make inferences after just one experience. The other part of it is unsupervised learning, which means no training is necessary. The device has to be able to learn quickly what an environment is like so that it knows how to listen better in that environment. Environments are dynamic: it’s not just that the physical structures are different — you know, a different number of windows, different materials with different reflection properties — but also that, within each environment, things move and sound sources move.
That’s why the algorithms need the ability to adapt to a particular environment, and beyond that, the device has to learn by itself. With Yobe, a device doesn’t have to remember whether it has seen environments like that before; it just has to figure out what that particular environment is like by analyzing the sounds that are emanating from it.
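In its simplest form, “learning the environment with no training data” can look like the generic sketch below: a running, per-frequency-band estimate of the background-noise floor that adapts as the room changes. This is a common unsupervised idea offered purely as an illustration, not Yobe’s algorithm, and the smoothing factor is an arbitrary choice.

```python
import numpy as np

def track_noise_floor(frames_power, alpha=0.95):
    """Unsupervised background-noise tracking (illustrative, not Yobe's method).

    frames_power: array of shape (num_frames, num_bins) holding the power
    spectrum of successive audio frames. No labelled training data is used.
    """
    noise = frames_power[0].copy()
    estimates = []
    for p in frames_power:
        # Follow decreases immediately (quiet moments reveal the true floor),
        # but follow increases only slowly (sudden jumps are usually speech).
        noise = np.where(p < noise, p, alpha * noise + (1 - alpha) * p)
        estimates.append(noise.copy())
    return np.array(estimates)
```

A device running something like this keeps a fresh picture of what the room sounds like, even as appliances switch on and people move around.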
Is there a specific event or point in your life that ignited your interest in science?
I think I happened to fall into science because of a variety of factors. Partly because I ended up at MIT for college and I did not really think I wanted to be a scientist or a technically-oriented person. When I got out of high school I didn’t really know what I wanted to do. I knew I was good at science and math, but I wasn’t passionate about it, and there were other things I was passionate about. Through a series of accidents — people I ran into, classes I took, et cetera — I ended up in something which I’m truly passionate about.
What I would say is the beginning of my awakening was in third grade. I had a third grade teacher whose name was Miss Helda. My family had just moved to town, and it was the first day of school. My mother had dropped me off, in the rain, and I stood outside, not knowing what to do. Miss Helda came over, brought me inside, and she happened to be my teacher for the year. I had some great teachers after that, but she was the greatest — she really gave me a certain confidence that no other teacher ever gave me. I came into my own and realized that I could learn, that I could master things, whether it was math or history or whatever. It just made me motivated. I decided I wanted to be just like her; I wanted to be a teacher. And that developed a sort of teacher mindset in me, to explore things and to figure things out and to learn, that ultimately landed me in science and engineering.