I don't have experience generating a custom disparity map from the D415, but I can offer some input.
1. First of all, what I understand is that your main target is to improve the depth map, and that the 3D map will improve once you have a better depth map. Is that correct? Based on this understanding, I think we should compare the following to find the root cause of the problem:
(i) depth ground truth
(ii) SDK-generated depth map
(iii) your own depth map
You posted a 3D map, which illustrates the problem well. But the more software layers we involve in testing, the more confusion there can be about which layer is causing the problem. If you can share a depth map comparison image, we can be sure that the depth map is indeed the problem [and not the depth-to-3D conversion]. Comparing your custom depth map with the ground truth and the D415 depth map will also give more insight into the root cause. So, if possible, can you provide the above three images for comparison?
2. Coming back to depth map quality: note that although depth from stereo is theoretically a sound method of computing depth, in practice it has not yielded the best results so far. Prior to the Kinect, there was no good-quality depth camera. The Kinect revolutionized the depth camera market through its use of active IR. You can't expect to match the performance of a Kinect with stereo imaging alone. I'm sure you already know how the Kinect computes depth with active IR, but I'm sharing this link anyway.
There is also plenty of other material available on the technology the Kinect uses.
3. Since the Kinect relies on active IR, it only works indoors. RealSense cameras, on the other hand, have been designed to work both indoors and outdoors, even under water. The D415 uses a mix of active IR and stereo depth. You can optionally turn active IR on or off, or experiment with various parameters like laser power. While the Kinect is designed for very specific scene conditions, the D400 cameras target a broader range of conditions, so users have to play around with the settings and parameters to find what works best for them. One way to improve the quality of the D415-generated depth map is therefore to experiment with the various parameters and settings available in the RealSense SDK. The other way, of course, is to generate your own custom depth map.
4. Regarding your own custom depth map, one potential concern is that you will be relying solely on stereo imaging. Since you won't be using active IR, you may never be able to match the performance of the Kinect. In any case, it will be interesting to debug the issue you reported with your custom depth. To look further into this, I would appreciate it if you could post the three depth map images for comparison, i.e. the ground truth, the D415 depth map, and your custom depth map. Then I can look further at the parameters you use.
The SDK-generated depth map is fine - that is to say, I have generated 3D data from the depth map, used it to build full 3D reconstructions, and taken some measurements. The SDK gives me results similar to the ground truth. When you say you want to compare the "depth maps", I assume you mean converting the floating-point map of Z values into something normalized to fit an 8-bit image? I'm not really clear on what you're looking for there. I am fairly convinced that the disparity map is the issue, not the depth map.
As far as the rest of your comments about active IR and stereo imaging -
The Kinect works by calibrating a single camera with the dot pattern from the projector. This allows for generation of a "stereo" image, rectified from the projector's point of view to the camera's point of view. This allows for the generation of a disparity map created from the stereo image pair.
When it comes to the RealSense, as far as I know, the stereo cameras do not get calibrated with the projector dot pattern at all. There are 2 IR cameras as opposed to the 1 in the Kinect, so the second camera gets calibrated with the first camera which allows for rectifying the images. Without the projected dot pattern, reconstruction will still work as long as there are sufficient details in the scene. The projected pattern (aka, "active IR") is there simply to help guide the stereo reconstruction algorithm in areas where there are few details (a completely white wall with no distinguishing features, for example). There might be more going on behind the scenes with the D400-series, but that's my general understanding of how the technology works. Ultimately, I would expect to get better results than the Kinect with the D400-series since camera-to-camera calibration should be more accurate than projector-to-camera.
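For anyone following along, the rectified-stereo geometry described above reduces to Z = f x B / d: depth is focal length (in pixels) times baseline, divided by disparity. A minimal numpy sketch, with made-up numbers rather than actual D415 calibration values:

```python
import numpy as np

# Illustrative intrinsics only - NOT actual D415 calibration values.
focal_px = 1004.5   # focal length in pixels
baseline_mm = 50.0  # distance between the two IR cameras

# A toy disparity map (in pixels); a real one comes from stereo matching.
disparity = np.array([[10.0, 20.0],
                      [40.0,  0.0]])  # 0 marks "no match found"

# Depth from disparity for rectified stereo: Z = f * B / d.
# Guard against division by zero where matching failed.
with np.errstate(divide="ignore"):
    depth_mm = np.where(disparity > 0,
                        focal_px * baseline_mm / disparity,
                        0.0)

print(depth_mm)  # larger disparity -> nearer point
```

Note how small disparity errors translate into large depth errors at distance, which is why subpixel accuracy in the disparity map matters so much.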
For reference, here are the original IR images I used to generate the disparity map:
In case anyone's wondering, the images are dark to human eyes, but that's not a problem for the disparity generation code. At least, any issues from that would not manifest in the way I am seeing right now.
I'd be pretty interested to hear from someone at Intel for some clarification on the issue I'm having with disparity map generation. Is this the right forum for that, or is the GitHub "Issues" page more appropriate?
We currently do not support this use case. If you tell us what you are trying to achieve we may be able to tell you how to best setup the camera.
Intel Customer Support
I am trying to get more accurate results when it comes to capturing 3D data in the near range (ideally 0.2-0.6m, ignoring device-specific limits). I have been able to modify the advanced settings for the D415 in order to get semi-decent near data (see scan in earlier post), although the results are still not quite as good as I was expecting. Generating a custom disparity map was an attempt to mitigate some of the artifacts in the RealSense-generated data. The bumps I am seeing are oddly persistent - if it were purely noise, then the temporal filter would smooth out the bumps, but that is not the case. My thought was that better results could potentially be obtained by employing a more thorough algorithm for the disparity generation stage, but the fact that I do not seem to be able to use the IR images for this purpose is a bit of a bummer.
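For illustration of what that disparity-generation stage does, here is the simplest possible version: naive SAD block matching on a rectified pair. This is only a sketch of the idea, not the code I'm actually using - real algorithms (OpenCV's StereoSGBM, for example) add subpixel refinement, smoothness costs, and left-right consistency checks:

```python
import numpy as np

def block_match_disparity(left, right, block=3, max_disp=8):
    """Naive sum-of-absolute-differences block matching.

    Assumes `left` and `right` are rectified grayscale images, so
    corresponding points lie on the same row; for a point at left
    column x, the match in the right image is at column x - d.
    """
    h, w = left.shape
    half = block // 2
    disp = np.zeros((h, w), dtype=np.float32)
    for y in range(half, h - half):
        for x in range(half + max_disp, w - half):
            patch = left[y-half:y+half+1, x-half:x+half+1].astype(np.int32)
            best, best_d = None, 0
            for d in range(max_disp):
                cand = right[y-half:y+half+1,
                             x-d-half:x-d+half+1].astype(np.int32)
                sad = np.abs(patch - cand).sum()
                if best is None or sad < best:
                    best, best_d = sad, d
            disp[y, x] = best_d
    return disp

# Synthetic demo: a textured image shifted by a uniform 4-pixel disparity.
left = np.random.default_rng(1).integers(0, 256, size=(10, 24)).astype(np.uint8)
right = np.roll(left, -4, axis=1)
print(block_match_disparity(left, right)[5, 12])  # expect ~4
```

Integer-pixel matching like this is exactly what produces "staircase" depth; the subpixel interpolation step that fixes it is also where persistent bump artifacts can creep in.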
Ultimately, I may consider calibrating the device using OpenCV, retrieving the uncalibrated/unrectified 1080p IR images, and going from there. This would take most of the RealSense-based processing out of the picture; I am aware that this somewhat defeats the purpose of the device. In any case, "we currently do not support this use case" is a perfectly acceptable answer (even if it's not what I was hoping for), so I'm marking this topic "answered".
Thanks for your time.
Our team wants to help you further. Can you send an RGB image of what you are trying to scan? The distance is between .2m - .6m, correct?
Intel Customer Support
Here's a photo of the object with the D415 for size reference:
Ultimately, I am looking to scan body parts (heads, feet, hands, etc, but not full-body scans). I realize the D415 near range supposedly maxes out at 0.3m, but I am waiting on a D435 (ordered already, but unsure of when it will ship) which I believe is supposed to work closer to that 0.2m distance.
Thanks for raising the questions. When I look at the stereo images, I see vertical offsets in the projected IR pattern of up to 10 pixels (look in the area of the chin). But when I look at the marks and writing that you have made on the model, there don't seem to be any vertical offsets in your marks. The model looks like it is made of Styrofoam, and I am wondering if the IR isn't passing through the surface and reflecting from beneath in some areas, depending on the angle.
Try doing the analysis without the IR projector. I have my own rat-model approach to stereo analysis and would be interested in your pictures and results.
Also... the disparity map above... is that yours or from the camera?
I assume it is from your work... the 3D reconstruction sure looks like there is a math error somewhere... so a programming error?
I am waiting on my D435 and haven't done much poking around yet. But I was immediately confused about disparity values... I would think that if you have an error, it is in the process of going from pixel offsets to calculated disparity?
If you could post an image that has your found pixel offsets set to gray, I would like to try to put it through my process.
I would recommend further exploring and optimizing the settings of the RealSense camera.
It was very helpful that you sent an image. We can now reproduce this and show you what we see.
Here is the result with the D415 at ~47cm distance, at 1280x720 resolution, 30fps.
Here is the depth map, down-sampled by 3x in x-y.
And here it is with a small amount of post-processing, using our post-processing exponential running average with alpha=0.3 and delta=20, and an edge-preserving filter with alpha=0.6 and delta=8.
Also here is with texture applied:
And if you want to remove the IR pattern from the left RGB channel, you can do that too by using a proper color correction matrix:
Please check out this link for more on how to optimize the camera for your usage.
Thank you for this comprehensive feedback. That document is extremely useful, and it is not something I had come across until now. Is it linked anywhere? In any case, it is full of great info. The colour correction option for the D415 is definitely interesting.
While I am using the included tripod and a static object for some of my testing right now, ultimately we will want to move around the body parts we are scanning, likely in a handheld form factor. This does mean that a temporal filter would not work for us, but the edge-preserving smoothing could definitely be useful.
Since we are looking at potentially using this sensor as a handheld device, I am a bit leery of the rolling shutter cameras in the D415 due to motion artifacts. I will have to do some more testing with regards to this caveat. In terms of which one will be better for our purposes (D415 vs D435), it is probably going to come down to whether the better RMS results of the D415 (according to the document) outweigh the artifacts caused by movement and the rolling shutter. It's a shame that OV doesn't appear to offer the global shutter equivalent of the D415 cameras.
I should have some time later this week to attempt reproducing the results you have posted. Thanks again - extremely impressed with the support for these devices thus far.
Thank you for the kind feedback. The tuning whitepaper can be found at:
Intel Customer Support
Thanks for sharing this very informative document. I have some questions regarding the Depth Tuning whitepaper. Can you please respond?
(1) Under Section-1, the optimal depth resolution for the D435 is given as 848x480, while for the D415 it is given as 1280x720, despite the fact that both devices have a max depth resolution of 1280x720. What makes 848x480 the optimal depth resolution for the D435, whereas for the D415 the optimal resolution is 1280x720? Is it related to the fact that the IR sensor on the D435 is 1280x800, while the IR sensor on the D415 is 1920x1080?
(2) Under Section-3.e, what is HFOV - is it the horizontal FOV of the IR sensors? Under Section-3.e on page-3, the HFOV of the D435 is given as 90 degrees and the HFOV of the D415 as 65 degrees. But under Section-11.b on page-10, the HFOV of the D415 is instead given as 64 degrees and that of the D435 as 86 degrees. A table on the same page has yet another set of figures: 69.4 degrees for the D415 and 91.2 degrees for the D435. Which values are the correct ones to use?
(3) under Section-3.e, Theoretical limit on RMS error is given as
RMS error (mm) = Distance(mm)^2 x Subpixel / (focal-length(pixels) x Baseline(mm))
And there are plots taken with subpixel = 0.08. What exactly does subpixel refer to in the above equation, and what factors does its value depend on? i.e., what value should we use to calculate the RMS error for our setup?
1. Yes, the optimal resolution is related to the resolution of the input sensors. The Intel ASIC rectifies the images, aligns them, and then scales the output. A rule of thumb is that you lose >20% of the input resolution in creating the overlapping rectified images, due to manufacturing tolerances and lens distortion. The ASIC can scale the output to many different resolutions, but at some point the up-scaling yields zero net improvement. The D415 has 2MP input sensors, and the D435 has 1MP input sensors.
2. The depth output is nominally 90 degrees HFOV. We can look at being more consistent, but variations occur due to manufacturing tolerances.
3. The subpixel number is somewhat empirical. It is affected by the structure in the scene, the depth algorithm, the calibration quality, and the quality of the input images (e.g. in focus, well lit, high contrast). For a passive textured target (i.e. normal light illumination, no laser) we have measured less than 0.05, but in general expect <0.1 for a "good" unit. Note that areas with no texture (like a white wall) will have a much higher subpixel number, unless you project a texture onto them.
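As a worked example, plugging a subpixel figure of 0.08 into the whitepaper's formula quoted above, using the D415 numbers cited earlier in this thread (HFOV 65 degrees, 1280-pixel width, 50mm baseline - these are the thread's figures, not ones I am asserting independently):

```python
import math

def rms_error_mm(distance_mm, subpixel, focal_px, baseline_mm):
    """Theoretical RMS depth error: Distance^2 * Subpixel / (f * B)."""
    return distance_mm ** 2 * subpixel / (focal_px * baseline_mm)

# Focal length in pixels from HFOV: f = Xres / (2 * tan(HFOV/2))
focal_px = 1280 / (2 * math.tan(math.radians(65 / 2)))  # ~1004.6 px

print(rms_error_mm(500, 0.08, focal_px, 50))   # ~0.4 mm at half a metre
print(rms_error_mm(1000, 0.08, focal_px, 50))  # ~1.6 mm at one metre
```

Since distance enters squared, the error quadruples each time the distance doubles, which is why near-range capture is so much more forgiving than far-range.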
RealSense Group CTO, Intel
Thanks for clarifying the above. I have one last query on the whitepaper contents, where the computed values of MinZ do not match the values specified in the whitepaper. The computed MinZ values differ from the specified values by about 5.3% for the D435, and by about 11.35% for the D415. I am curious to know what contributes to this difference. Is it due to a different formula being used, different HFOV and baseline values, or the use of empirical rather than theoretical values? For details of the computed vs. specified MinZ values, please see below.
(1) As per the white paper
focal-length(pixels) = Xres(pixels) / (2 x tan(HFOV/2))
MinZ(mm) = focal-length(pixels) x Baseline(mm) / 126
(a) considering the D435 case, where
- HFOV is 90 degree
- Baseline is 55 mm
- and considering Xres 848
focal-length comes out to 424.35
and MinZ comes out to be 18.52 cm
This matches reasonably well with the value of 19.5 cm specified in the whitepaper (about 5.3% difference from the specified value).
(b) Also, considering the D415 case, where
- HFOV is 65 degree
- Baseline is 50 mm
- and considering Xres 1280
focal-length comes out to 1005.268
and MinZ comes out to be 39.89 cm
This is in fact quite different from the value of about 45 cm mentioned in the whitepaper (about 11.35% difference from the specified value).
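For convenience, here is the computation above as a short script, using the HFOV and baseline figures quoted in this thread (small discrepancies with my hand-computed focal lengths above presumably come from rounding in the HFOV values used):

```python
import math

def focal_px(xres, hfov_deg):
    """Focal length in pixels from horizontal FOV: Xres / (2 * tan(HFOV/2))."""
    return xres / (2 * math.tan(math.radians(hfov_deg) / 2))

def min_z_mm(f_px, baseline_mm, max_disp=126):
    """MinZ per the whitepaper; 126 is the maximum disparity searched."""
    return f_px * baseline_mm / max_disp

# D435: HFOV 90 deg, 848-pixel width, 55 mm baseline (thread figures)
f435 = focal_px(848, 90)        # ~424 px
print(min_z_mm(f435, 55) / 10)  # ~18.5 cm vs 19.5 cm in the whitepaper

# D415: HFOV 65 deg, 1280-pixel width, 50 mm baseline (thread figures)
f415 = focal_px(1280, 65)       # ~1004.6 px
print(min_z_mm(f415, 50) / 10)  # ~39.9 cm vs ~45 cm in the whitepaper
```

This makes it easy to re-run the check with the alternative HFOV figures from Section-11.b (64/86 degrees) or the table (69.4/91.2 degrees) to see which set best reproduces the specified MinZ values.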