The following post is more for my own benefit. Sometimes, by describing a problem to someone else, it becomes clearer.
For my thesis, I’m working on a homemade low-cost 3D scanner. The idea is that you should be able to create a watertight 3D model, ready for 3D printing, directly from a 3D scan. For example, here’s a really rough scan of a Kleenex box I did last week:
Again, this is some very rough-looking work. The scanner isn’t fully calibrated, and I’m just dumping a lot of raw geometry out to see what things look like. This particular scan came from 8 different captures of depth data from a Microsoft Kinect. The data wasn’t filtered in any way. Here’s a slightly better calibrated single scan:
The issue that I’m currently having is performance. To process a single capture of depth data takes about 8 seconds. This seems quite slow to me, so last week, I did some digging around to see if I could ramp up the performance. Before I can talk too much about performance, I’ll need to describe a few of the data structures that I’m using.
At the core of the technique I’m using, I treat small cubic regions of space as being either solid or empty, and I assume they are empty until I find out otherwise. I’m currently using a Microsoft Kinect, a depth-sensing camera, to capture my depth data. I’ve built a crude wooden platform that turns using a stepper motor. I grab a set of depth data from the Kinect, then tell the platform to turn by, say, 45 degrees. Once the platform has finished turning, I grab another set of depth data. I repeat this until I’ve scanned all sides of the object. The internal workings are a little more complex than that, but that’s the basic idea. Here’s a picture of what the setup looks like, complete with the Kleenex box I’ve been using to test the system:
Once I have all the depth data captured, I begin the processing. This is where the magic happens, and where things get more interesting.
Internally, I’m using a cool data structure called an octree (technically an octrie, but whatever). An octree starts with a root node that represents a giant cube of 3D space. It subdivides that cube into 8 smaller, equally sized cubes, then divides each of those into 8 equally sized cubes… and so on, until the cubes hit a minimum size. The size I’ve been playing with is about 2 mm. That’s about the maximum depth resolution of the Kinect, so there’s probably not much point in going smaller.
There’s also no point in allocating memory for regions of the 3D space that are empty. Because of this, I only start to fill in the tree where I have data that I’ve scanned. So rather than forcing each node in the tree to always have 8 children, I arranged it so that each node in the tree can have up to 8 children. In a simple test, this brought the memory consumption down by 30%.
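To make that concrete, here’s a minimal sketch of such a sparse node (hypothetical C++, not my actual code: the field names, the 1.024 m root cube in the note below, and the 2 mm minimum are illustrative):

```cpp
#include <array>
#include <memory>

// Sketch of a sparse octree node: children are allocated lazily,
// so empty regions of space cost no memory at all.
struct OctreeNode {
    float cx, cy, cz;  // center of this cube, in meters
    float half;        // half of this cube's edge length
    std::array<std::unique_ptr<OctreeNode>, 8> kids;  // all null until needed

    // Get (or lazily create) the child octant containing point (x, y, z).
    OctreeNode* child(float x, float y, float z) {
        int i = (x >= cx) | ((y >= cy) << 1) | ((z >= cz) << 2);
        if (!kids[i]) {
            float h = half * 0.5f;
            kids[i].reset(new OctreeNode{cx + (x >= cx ? h : -h),
                                         cy + (y >= cy ? h : -h),
                                         cz + (z >= cz ? h : -h),
                                         h, {}});
        }
        return kids[i].get();
    }
};

// Descend from the root, creating nodes only along the path to the point,
// until the leaf edge length reaches the ~2 mm minimum (half = 1 mm).
OctreeNode* insertPoint(OctreeNode* node, float x, float y, float z,
                        float minHalf = 0.001f) {
    while (node->half > minHalf)
        node = node->child(x, y, z);
    return node;
}
```

With a 1.024 m root cube, reaching a 2 mm leaf takes 9 subdivisions, and only the nodes along that one path ever get allocated.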
Anyway, what I’m doing is this: the depth frames from the Kinect come in as basically a 640×480 image, but rather than a color for each pixel, I get a depth value. I make the assumption that if two adjacent pixels in the image are valid, and aren’t too different in depth value, then the object being scanned must be solid between those two points. Extrapolating this: if I have three points that are all similar in depth value and adjacent to each other in the image, then I can form a triangle between them, and that triangle corresponds to some portion of the surface of the object being scanned.
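A minimal sketch of that adjacency test (hypothetical C++: the zero-means-invalid convention and the maxGap threshold are my assumptions, and the projection from pixel coordinates out to real 3D points is omitted):

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

// A triangle's three corners, in image (pixel) coordinates.
struct Tri2D { int ax, ay, bx, by, cx, cy; };

// Walk the depth image in 2x2 pixel quads; when all four samples are
// valid (nonzero) and no edge jumps in depth by more than maxGap,
// split the quad into two triangles.
std::vector<Tri2D> triangulateDepth(const std::vector<uint16_t>& depth,
                                    int w, int h, int maxGap /* mm */) {
    std::vector<Tri2D> tris;
    auto valid = [&](int x, int y) { return depth[y * w + x] != 0; };
    auto close = [&](int x0, int y0, int x1, int y1) {
        return std::abs(int(depth[y0 * w + x0]) - int(depth[y1 * w + x1]))
               <= maxGap;
    };
    for (int y = 0; y + 1 < h; ++y)
        for (int x = 0; x + 1 < w; ++x)
            if (valid(x, y) && valid(x + 1, y) &&
                valid(x, y + 1) && valid(x + 1, y + 1) &&
                close(x, y, x + 1, y) && close(x, y, x, y + 1) &&
                close(x + 1, y + 1, x + 1, y) &&
                close(x + 1, y + 1, x, y + 1)) {
                tris.push_back({x, y, x + 1, y, x, y + 1});
                tris.push_back({x + 1, y, x + 1, y + 1, x, y + 1});
            }
    return tris;
}
```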
So, from a single set of depth data, the end result is a bunch of triangles that fall somewhere on the surface of the object. Great. Next, what I do is run those triangles into the octree structure, and figure out which of the leaf nodes in the tree intersect each triangle. That’s where the octree comes into play. By starting at the root node, it’s really quite efficient at figuring out which subsequent nodes in the tree intersect the triangle. When I do find the child nodes that intersect a triangle, I mark them as being solid.
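The descent looks roughly like this (hypothetical C++; I’ve substituted a conservative triangle-bounding-box overlap for a real triangle-AABB separating-axis test, such as Akenine-Möller’s, just to show the traversal shape):

```cpp
#include <algorithm>
#include <vector>

struct AABB { float min[3], max[3]; };
struct Tri3 { float v[3][3]; };  // three vertices, xyz each

// Placeholder for a proper triangle-AABB test: overlap the triangle's
// bounding box with the node's box. Conservative: may report overlap
// for near misses, but never misses a true intersection.
bool overlaps(const Tri3& t, const AABB& b) {
    for (int a = 0; a < 3; ++a) {
        float lo = std::min({t.v[0][a], t.v[1][a], t.v[2][a]});
        float hi = std::max({t.v[0][a], t.v[1][a], t.v[2][a]});
        if (hi < b.min[a] || lo > b.max[a]) return false;
    }
    return true;
}

// Starting from the root box, recurse only into octants that intersect
// the triangle; leaves that survive down to the minimum edge size are solid.
void markSolid(const Tri3& t, const AABB& box, float minEdge,
               std::vector<AABB>& solid) {
    if (!overlaps(t, box)) return;
    if (box.max[0] - box.min[0] <= minEdge) {
        solid.push_back(box);  // a tiny leaf the triangle passes through
        return;
    }
    float mid[3] = {(box.min[0] + box.max[0]) * 0.5f,
                    (box.min[1] + box.max[1]) * 0.5f,
                    (box.min[2] + box.max[2]) * 0.5f};
    for (int i = 0; i < 8; ++i) {  // visit all 8 octants of this box
        AABB c;
        for (int a = 0; a < 3; ++a) {
            bool hiHalf = (i >> a) & 1;
            c.min[a] = hiHalf ? mid[a] : box.min[a];
            c.max[a] = hiHalf ? box.max[a] : mid[a];
        }
        markSolid(t, c, minEdge, solid);
    }
}
```

The pruning is the whole point: octants that fail the overlap test are never subdivided, so a tiny triangle only ever walks one narrow path down the tree.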
(OK, so technically, I don’t mark them as solid. I assume that if the recursive function has generated a child node of the minimum size, it must have been the result of a passing intersection test, and therefore must represent a solid region.)
The end result is a tree with a bunch of tiny child nodes that represent regions of 3D space in the scan that are solid. I make another call into the tree that gets me all of these tiny child nodes. I then generate geometry for each child node based on its neighbors: if it has no neighbor on a given side, I create two triangles to cover that side of the tiny cube; if there is a neighbor on that side, I don’t generate any geometry. (This is really just a huge, crude approximation, just so I can start seeing some results. Eventually it will be replaced with something like marching cubes.)
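The neighbor check is easy to sketch (hypothetical C++, counting exposed faces over a set of solid cells indexed by integer grid coordinates; each exposed face would get two triangles):

```cpp
#include <set>
#include <tuple>

// A solid 2 mm cell, addressed by integer grid indices.
using Cell = std::tuple<int, int, int>;

// For each solid cell, a face is "exposed" (and needs two triangles)
// exactly when the neighboring cell on that side is empty.
int countExposedFaces(const std::set<Cell>& solid) {
    static const int dirs[6][3] = {{1, 0, 0}, {-1, 0, 0}, {0, 1, 0},
                                   {0, -1, 0}, {0, 0, 1}, {0, 0, -1}};
    int faces = 0;
    for (const auto& [x, y, z] : solid)
        for (const auto& d : dirs)
            if (!solid.count({x + d[0], y + d[1], z + d[2]}))
                ++faces;  // surface face: emit a quad (two triangles) here
    return faces;
}
```

An isolated cell exposes all 6 faces; two touching cells hide the pair of faces between them, leaving 10.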
The end result is a ton of geometry that shows an approximate surface contour of the object. It’s crude, but it works.
The problem I ran into last week was performance. The performance isn’t bad, but I figured there must be some way of speeding things up. I did a bunch of reading about various voxel engines, and one interesting technique I discovered is this: rather than have each leaf in the tree store only a single solid region, why not have each leaf store a bunch of regions? Widening and flattening the tree this way reduces a lot of the pointer chasing needed to reach a given child node. Apparently heavy pointer chasing is a bad thing, as it results in cache misses, and cache misses can kill performance (especially on platforms like the Xbox 360). So, in theory, flattening the tree should reduce cache misses and increase performance, right?
Not in my case. As I mentioned, it was taking about 8 seconds to process a single frame of depth data. Not great, but not unbearable. When I ran a test with the flattened tree, performance was a LOT worse. Like 10x worse. I was a bit surprised at this, but I think the reason for it makes sense.
When I flattened the tree, I tried making each final child node contain a 4×4×4 chunk of 3D space. For every triangle that I generate from the depth data, I need to check it against the child nodes to figure out which ones intersect it. The triangle-AABB (axis-aligned bounding box) test that I’m using seems pretty quick, but despite this, calling it a million times or more results in a performance hit. When I was using the regular octree, reaching a child node took at most about 9 triangle-AABB tests. So supposing I had a tiny triangle that only spanned a single leaf node, it would take about 9 of these tests to figure out exactly which leaf node the triangle intersected. The flattened tree was much, much worse: supposing each leaf node contained a grid of 4×4×4 children, it would take at least 4×4×4 (=64) triangle-AABB intersection tests, plus the few extra tests to get to that leaf node. So yes, I was saving on cache misses, but I was doing a LOT more triangle-AABB tests. I still haven’t fully profiled things to confirm this, but it does seem to be the most likely culprit.
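A toy cost model makes the arithmetic concrete (all numbers here are assumptions: a tiny triangle landing in exactly one leaf, roughly one triangle-AABB test per level, and a root cube 512 leaves across):

```cpp
// Number of subdivision levels between the root cube and the leaves.
int octreeLevels(double rootEdge, double leafEdge) {
    int levels = 0;
    while (rootEdge > leafEdge) { rootEdge *= 0.5; ++levels; }
    return levels;
}

// Pure octree: a tiny triangle costs roughly one triangle-AABB test
// per level on the way down to its single leaf.
int octreeTests(double rootEdge, double leafEdge) {
    return octreeLevels(rootEdge, leafEdge);
}

// Flattened tree: stop enough levels early that one leaf covers a
// block^3 chunk, then brute-force every cell in that chunk.
int flatTests(double rootEdge, double leafEdge, int block) {
    int skipped = 0;
    for (int b = block; b > 1; b /= 2) ++skipped;  // log2(block) levels saved
    return octreeLevels(rootEdge, leafEdge) - skipped + block * block * block;
}
```

With a root cube 512 leaves across and 4×4×4 blocks, that’s 9 tests versus 71 per tiny triangle, which lines up with intersection tests, not cache misses, dominating the slowdown.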
To make a long story short, I learned the following:
- It really doesn’t take long to prototype things.
- Prototyping can be worthwhile. Even if you only learn what doesn’t work, you’ve still learned something valuable.
- Sometimes the performance gain you think you’ll get will be offset by a loss somewhere else.
- Looking for performance gains can be fun.
Now if you’ll excuse me, I’ve got some triangle-AABB tests to run…