Clever math could enable a high-quality 3-D camera for cellphones

January 6, 2012 by Larry Hardesty
Depth-sensing cameras can produce 'depth maps' like this one, in which distances are depicted as shades on a gray-scale spectrum (lighter objects are closer, darker ones farther away). Image: flickr/Dominic

When Microsoft’s Kinect — a device that lets Xbox users control games with physical gestures — hit the market, computer scientists immediately began hacking it. A black plastic bar about 11 inches wide with an infrared rangefinder and a camera built in, the Kinect produces a visual map of the scene before it, with information about the distance to individual objects. At MIT alone, researchers have used the Kinect to create a “Minority Report”-style computer interface, a navigation system for miniature robotic helicopters and a holographic-video transmitter, among other things.

Now imagine a device that provides more-accurate depth information than the Kinect, has a greater range and works under all lighting conditions — but is so small, cheap and power-efficient that it could be incorporated into a cellphone at very little extra cost. That’s the promise of recent work by Vivek Goyal, the Esther and Harold E. Edgerton Associate Professor of Electrical Engineering, and his group at MIT’s Research Lab of Electronics.

“3-D acquisition has become a really hot topic,” Goyal says. “In consumer electronics, people are very interested in 3-D for immersive communication, but then they’re also interested in 3-D for human-computer interaction.”

Andrea Colaco, a graduate student at MIT’s Media Lab and one of Goyal’s co-authors on a paper that will be presented at the IEEE’s International Conference on Acoustics, Speech, and Signal Processing in March, points out that gestural interfaces make it much easier for multiple people to interact with a computer at once — as in the dance games the Kinect has popularized.

“When you’re talking about a single person and a machine, we’ve sort of optimized the way we do it,” Colaco says. “But when it’s a group, there’s less flexibility.”

Ahmed Kirmani, a graduate student in the Department of Electrical Engineering and Computer Science and another of the paper’s authors, adds, “3-D displays are way ahead in terms of technology as compared to 3-D cameras. You have these very high-resolution 3-D displays that are available that run at real-time frame rates.

“Sensing is always hard,” he says, “and rendering it is easy.”

Clocking in

Like other sophisticated depth-sensing devices, the MIT researchers’ system uses the “time of flight” of light particles to gauge depth: A pulse of infrared laser light is fired at a scene, and the camera measures the time it takes the light to return from objects at different distances.
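The time-of-flight principle can be sketched in a few lines. This is a minimal illustration of the physics, not the researchers' hardware; the 10-nanosecond round-trip time below is a made-up example value.

```python
C = 299_792_458.0  # speed of light in m/s

def distance_from_round_trip(t_seconds: float) -> float:
    """Convert a measured round-trip time into a one-way distance in meters."""
    # The pulse travels to the object and back, so halve the path length.
    return C * t_seconds / 2.0

# A pulse returning after 10 nanoseconds implies an object about 1.5 m away.
d = distance_from_round_trip(10e-9)
print(round(d, 3))  # ~1.499
```

The halving is the step Goyal refers to later in the article when he divides by two "because light has to go back and forth."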

Traditional time-of-flight systems use one of two approaches to build up a “depth map” of a scene. LIDAR (for light detection and ranging) uses a scanning laser beam that fires a series of pulses, each corresponding to a point in a grid, and separately measures their time of return. But that makes data acquisition slower, and it requires a mechanical system to continually redirect the laser. The alternative, employed by so-called time-of-flight cameras, is to illuminate the whole scene with laser pulses and use a bank of sensors to register the returned light. But sensors able to distinguish small groups of light particles — photons — are expensive: A typical time-of-flight camera costs thousands of dollars.

The MIT researchers’ system, by contrast, uses only a single light detector — a one-pixel camera. But by using some clever mathematical tricks, it can get away with firing the laser a limited number of times.

The first trick is a common one in the field of compressed sensing: The light emitted by the laser passes through a series of randomly generated patterns of light and dark squares, like irregular checkerboards. Remarkably, this provides enough information that algorithms can reconstruct a two-dimensional visual image from the light intensities measured by a single pixel.

In experiments, the researchers found that the number of laser flashes — and, roughly, the number of checkerboard patterns — that they needed to build an adequate depth map was about 5 percent of the number of pixels in the final image. A LIDAR system, by contrast, would need to send out a separate laser pulse for every pixel.
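The single-pixel measurement model can be illustrated with a toy compressed-sensing recovery. The sketch below is generic, not the researchers' algorithm: it uses random ±1 masks in place of the checkerboard patterns (a 0/1 mask and its complement are equivalent after centering), a hypothetical sparse scene, and a standard greedy reconstruction method (orthogonal matching pursuit). All sizes and the sparsity level are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100   # "pixels" in the reconstructed image
m = 30    # single-pixel measurements: far fewer than n
k = 4     # assumed number of bright spots in the scene

# Random +/-1 masks stand in for the checkerboard patterns.
A = rng.choice([-1.0, 1.0], size=(m, n))

# Hypothetical sparse scene with k bright spots.
x = np.zeros(n)
x[rng.choice(n, size=k, replace=False)] = rng.uniform(1, 2, size=k)

y = A @ x  # one total-intensity reading per mask

def omp(A, y, k):
    """Orthogonal matching pursuit: greedy sparse reconstruction."""
    residual = y.copy()
    support = []
    for _ in range(k):
        # Pick the mask pattern most correlated with the unexplained signal.
        scores = np.abs(A.T @ residual)
        scores[support] = 0.0
        support.append(int(np.argmax(scores)))
        # Re-fit all chosen coefficients by least squares.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x_hat = np.zeros(A.shape[1])
    x_hat[support] = coef
    return x_hat

x_hat = omp(A, y, k)
print("relative error:", np.linalg.norm(x_hat - x) / np.linalg.norm(x))
```

The point of the demo is the ratio m/n: a scene with simple structure can be recovered from far fewer scalar readings than it has pixels, which is what lets the researchers fire the laser only a fraction as many times as a LIDAR system would.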

To add the crucial third dimension to the depth map, the researchers use another technique, called parametric signal processing. Essentially, they assume that all of the surfaces in the scene, however they’re oriented toward the camera, are flat planes. Although that’s not strictly true, the mathematics of light bouncing off flat planes is much simpler than that of light bouncing off curved surfaces. The researchers’ parametric algorithm fits the information about returning light to the flat-plane model that best fits it, creating a very accurate depth map from a minimum of visual information.
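As a toy illustration of the flat-plane assumption (not the researchers' parametric algorithm), a planar patch of scene can be recovered from a handful of noisy depth samples by ordinary least squares. The plane parameters `a_true`, `b_true`, `c_true` and the noise level are made-up values for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical planar patch: depth z = a*x + b*y + c.
a_true, b_true, c_true = 0.2, -0.1, 2.0

# A small number of depth samples on the patch, with slight sensor noise.
xy = rng.uniform(-1, 1, size=(30, 2))
z = a_true * xy[:, 0] + b_true * xy[:, 1] + c_true
z += rng.normal(scale=0.001, size=30)

# Least-squares fit of the three plane parameters.
design = np.column_stack([xy, np.ones(len(xy))])
(a, b, c), *_ = np.linalg.lstsq(design, z, rcond=None)
print(round(a, 2), round(b, 2), round(c, 2))  # recovers roughly 0.2 -0.1 2.0
```

Because a plane has only three unknowns, far fewer samples pin it down than would be needed for an arbitrary curved surface — which is why the assumption buys so much data efficiency.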

On the cheap

Indeed, the algorithm lets the researchers get away with relatively crude hardware. Their system measures the time of flight of photons using a cheap photodetector and an ordinary analog-to-digital converter — an off-the-shelf component already found in all cellphones. The sensor takes about 0.7 nanoseconds to register a change to its input.

That’s enough time for light to travel 21 centimeters, Goyal says. “So for an interval of depth of 10 and a half centimeters — I’m dividing by two because light has to go back and forth — all the information is getting blurred together,” he says. Because of the parametric algorithm, however, the researchers’ system can distinguish objects that are only two millimeters apart in depth. “It doesn’t look like you could possibly get so much information out of this signal when it’s blurred together,” Goyal says.
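Goyal's arithmetic checks out directly, using the rounded speed of light:

```python
C = 3.0e8            # speed of light, m/s (rounded, as in the article's figures)
t_response = 0.7e-9  # detector response time from the article, in seconds

path = C * t_response    # light covers ~0.21 m in 0.7 ns
depth_blur = path / 2    # halved for the round trip: ~0.105 m
print(round(path * 100, 1), round(depth_blur * 100, 2))  # 21.0 10.5
```

The contrast the article draws is between this 10.5-centimeter blur interval, set by the hardware, and the 2-millimeter depth resolution the parametric algorithm extracts from it.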

The researchers’ algorithm is also simple enough to run on the type of processor ordinarily found in a smartphone. To interpret the data provided by the Kinect, by contrast, the Xbox requires the extra processing power of a graphics-processing unit, or GPU, a powerful special-purpose piece of hardware.

“This is a brand-new way of acquiring depth information,” says Yue M. Lu, an assistant professor of electrical engineering at Harvard University. “It’s a very clever way of getting this information.” One obstacle to deployment of the system in a handheld device, Lu speculates, could be the difficulty of emitting light pulses of adequate intensity without draining the battery.

But the light intensity required to get accurate depth readings is proportional to the distance of the objects in the scene, Goyal explains, and the applications most likely to be useful on a portable device — such as gestural interfaces — deal with nearby objects. Moreover, he explains, the researchers’ system makes an initial estimate of objects’ distance and adjusts the intensity of subsequent light pulses accordingly.

The telecom giant Qualcomm, at any rate, sees enough promise in the technology that it selected a team consisting of Kirmani and Colaco as one of eight winners — out of 146 applicants from a select group of universities — of a $100,000 grant through its 2011 Innovation Fellowship program.

Comments

1 / 5 (3) Jan 06, 2012
Before reading the article, I was all like wow, I bet this will be clever and interesting and cool.

After reading the article, I'm extremely skeptical that this system is more practical than using 2 nominally passive CCD sensors.

For imaging, the lower power and processing requirements of passive sensors render this new technology moot.

For gesture recognition you have the power requirement, questionable practicality for a phone/mobile device at all, and low resolution requirements that allow other technologies to take its place.

The technology is clever and may have a place somewhere, but I think in the mobile arena, it is a nonstarter.
1 / 5 (4) Jan 06, 2012
It's funny that we still haven't perfected technological "depth perception" and yet as humans, we fully evolved our depth perception while we are still infants. Score 1 point for the biological.
4.5 / 5 (6) Jan 06, 2012
It's funny that we still haven't perfected technological "depth perception" and yet as humans, we fully evolved our depth perception while we are still infants. Score 1 point for the biological.

Depth perception in biological organisms developed LONG before humans or even apes existed.
2.8 / 5 (4) Jan 06, 2012
Also, there is nothing new about depth maps, computer graphics applications use multiple image maps to reproduce what you see on the screen. The only one most people are familiar with is the color map, which is basically the image, but there are also depth maps (bump maps), alpha maps, texture maps, normal maps, parallax maps, mip maps, cube maps, etc etc... all of which are the same physical dimension and can be thought of as "layered" onto a 3D object to produce the final visual result on the screen.

These techniques are decades old.
4.7 / 5 (3) Jan 06, 2012
This technique allows for significant cost reduction in imaging technology. It moves the cost bottleneck off of light sensors and into a spatial light modulator (not gonna fit in an iphone any time soon). However, in a laptop or specialized 3d-imaging device, this is a much cheaper alternative.
It has much higher resolution than the Kinect.
They will have to expand their work to include curved surfaces before the algorithm is really complete. But this is a promising start on a new concept.
1.5 / 5 (2) Jan 06, 2012
What would also be cool is if they could make a motion capture application for the home user. Then I could make a movie at home with only myself. You know, like I do all the acting and convert my movements into a wireframe, and then overlay an arbitrary texture, such as a man with a handsome face, or a girl with a beautiful face. Then those can be overlaid on top of a background, such as pictures of some place I went to on my travels, or a computer-generated landscape. If the computer can match lighting conditions between all elements, film production costs could be cut significantly, and maybe producers could focus on more worthwhile things like narrative and such.
5 / 5 (3) Jan 07, 2012
Too many people focus on 3D images, and gesture control, or creating 3D models of physical objects. While all worthwhile areas of development, and surely needed, the real fun is with Augmented Reality. Some applications are becoming available now, but wait until HD transparent displays become available that people can wear in eye-glasses. Hooked up to your wirelessly interconnected mobile device. You perceive the world around you with your digital avatars and environments superimposed on top. Films, homes, games will all become the theatre of entertainment and interaction. Playing a game of 'shoot the zombies' in your own house would be great with a group of friends. Or watching 3D films where the actors are moving around outside of the screen.. sat to your right.. the possibilities are endless.
not rated yet Jan 11, 2012
Seems like an excellent candidate to become a sensor in my future home robots, self-driving cars, laser diode-based bug zapper, and home awareness system.

Those might not be today's mobile devices, but similar, high-volume devices will probably be enhancing our mobile and home experiences in less than a decade.

This is really great systems engineering. I hope that the extensions to handle curved surfaces don't trip up the algorithm, so that my future robotic servants and caretakers will be affordable.

(Btw, I had no idea that the spatial light modulator chipset had gotten so cheap! It's about $6 or $7 ea/qty 100 for 768x1024 pixels.)
