People Detection and Tracking using Depth Camera

1 INTRODUCTION

Depth video cameras are starting to play a major role in videogames and novel Natural User Interface systems, which represent an interesting challenge to the computer vision field.

Sensors of this kind typically use the reflected signal of an emitted IR light to measure the distance from the object to the camera. In some parts of the scene the distance cannot be estimated and, therefore, the depth image is severely affected by noise.

These failures occur most frequently on surfaces that do not reflect the IR light, and the signal quality deteriorates as the distance increases.

Related studies try to minimize noise problems by smoothing [Lee], by filling the image gaps [Edeler], or by assuming distance restrictions in the acquired data. Despite this, noise inevitably interferes with the processing of depth images, especially at the object segmentation step. To perform image segmentation and object detection in depth images, other works also use auxiliary data from color video cameras [Holz] or background subtraction techniques [Leens] to find moving objects. These methods present many limitations for use in uncontrolled environments, for example when the light available to the color video camera is weak or its intensity varies over time. Furthermore, if moving objects and persons are already present in the scene when the background image is first acquired, these techniques produce worse results.

In the following sections, we present a new fast method to find and track people, or other kinds of predefined objects, using only the data from a depth camera.

2 IMAGE SEGMENTATION

Our algorithm starts with image segmentation. This step locates the boundary lines of the objects present in the depth image. For this purpose, we perform edge detection based on the gradient values.

The gradient value of each pixel is obtained by convolving the Prewitt kernels with the original image, to calculate approximations of the first-order derivatives.

The segmentation is expressed in a binary image (Fig. 1.b) that contains just the main edges of the depth image, after applying a threshold to the gradient values.

Figure 1. a) The original depth image. b) The resulting binary image with the main edges.
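
As an illustration of the edge detection step described in this section, the following minimal Python sketch applies the two 3x3 Prewitt kernels to a depth image stored as a list of rows and thresholds the result. The function name `prewitt_edges`, the use of the Euclidean gradient magnitude, and the handling of border pixels are assumptions made for this example; the text only states that Prewitt kernels and a threshold are used.

```python
def prewitt_edges(depth, threshold):
    """Return a binary edge map (1 = edge) for a 2-D list of depth values."""
    rows, cols = len(depth), len(depth[0])
    edges = [[0] * cols for _ in range(rows)]
    # Border pixels stay 0 because the 3x3 window does not fit there (assumption).
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            # Horizontal Prewitt response: right column minus left column.
            gx = sum(depth[y + d][x + 1] - depth[y + d][x - 1] for d in (-1, 0, 1))
            # Vertical Prewitt response: bottom row minus top row.
            gy = sum(depth[y + 1][x + d] - depth[y - 1][x + d] for d in (-1, 0, 1))
            # Euclidean gradient magnitude (assumed; |gx| + |gy| is a common alternative).
            if (gx * gx + gy * gy) ** 0.5 > threshold:
                edges[y][x] = 1
    return edges


# Synthetic example: a vertical depth step between columns 2 and 3.
depth = [[1000, 1000, 1000, 2000, 2000, 2000] for _ in range(5)]
edges = prewitt_edges(depth, threshold=500)
```

In this synthetic example only the two columns adjacent to the depth step are marked as edges; the threshold value of 500 is arbitrary and would have to be tuned for a real sensor.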

Rui Silva

[email protected]

EXVA Technologies, Avepark - Zona Industrial de Grandra, Sala 104 S. Cláudio de Barco, 4805-015 Guimarães, Portugal

Duarte Duque

[email protected]

IPCA - Instituto Politécnico do Cávado e do Ave, 4750-333 Barcelos, Portugal

ABSTRACT: In this paper we present a method for real-time detection and tracking of people in video captured by a depth camera. For each object to be assessed, an ordered sequence of values that represents the distances from its center of mass to the boundary points is calculated. The recognition is based on the analysis of the total distance value between this sequence and some predefined human poses, after applying Dynamic Time Warping. This similarity approach showed robust results in people detection.

3 DISTINCTION AND FILTERING OF OBJECTS

After segmentation, we label the connected components present in the binary edge image. A unique ID number is assigned to each connected region enclosed by a closed edge. This allows us to distinguish the various objects.

We count the total number of pixels of each object, and measure the width and height of its bounding box.

Since we are searching for people, we speed up the processing by ignoring all the regions that do not match the morphological characteristics of the human body.

Therefore, the object is deleted if:

- The total number of pixels is smaller than a predefined minimum;

- The total number of pixels is higher than a predefined maximum;

- The ratio width / height is higher than 0.8.
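
The labeling and the three rejection rules above can be sketched as follows. This is only an illustrative Python version under stated assumptions: regions are the 4-connected groups of non-edge pixels, the names `label_and_filter`, `min_pixels` and `max_pixels` are hypothetical, and the pixel limits in the example are arbitrary. Only the 0.8 width / height ratio comes from the text.

```python
from collections import deque


def label_and_filter(edges, min_pixels, max_pixels, max_ratio=0.8):
    """Label 4-connected non-edge regions; keep those with person-like size and shape."""
    rows, cols = len(edges), len(edges[0])
    labels = [[0] * cols for _ in range(rows)]  # 0 = edge pixel or not yet labelled
    kept = {}
    next_id = 1
    for sy in range(rows):
        for sx in range(cols):
            if edges[sy][sx] or labels[sy][sx]:
                continue  # edge pixel, or already part of a labelled region
            labels[sy][sx] = next_id
            queue = deque([(sy, sx)])
            count = 0
            y0 = y1 = sy
            x0 = x1 = sx
            while queue:  # breadth-first flood fill of one region
                y, x = queue.popleft()
                count += 1
                y0, y1 = min(y0, y), max(y1, y)
                x0, x1 = min(x0, x), max(x1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not edges[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = next_id
                        queue.append((ny, nx))
            width, height = x1 - x0 + 1, y1 - y0 + 1
            # Keep the region only if none of the three rejection rules applies.
            if min_pixels <= count <= max_pixels and width / height <= max_ratio:
                kept[next_id] = {"pixels": count, "bbox": (x0, y0, x1, y1)}
            next_id += 1
    return labels, kept


# Synthetic example: one tall region enclosed by a closed rectangular edge.
edges = [[0] * 9 for _ in range(8)]
for r in range(1, 7):
    for c in range(1, 5):
        if r in (1, 6) or c in (1, 4):
            edges[r][c] = 1
labels, kept = label_and_filter(edges, min_pixels=4, max_pixels=40)
```

Here the background region is discarded because it is both too large and too wide, while the enclosed 2 x 4 region passes all three tests.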

4 FEATURE EXTRACTION

The shape of an object is one of the most expressive features for distinguishing and comparing the objects present in depth images.

We apply the Chain Code algorithm [Freeman] to read the position of the boundary points of the object in an orderly manner. The starting point for the Chain Code algorithm is defined by the boundary point located at the maximum value of the x-axis frequency histogram of the top half of the object.
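
One possible reading of this starting-point rule is sketched below in Python. It is an interpretation, not a confirmed implementation: the histogram is taken as the number of object pixels per column within the top half of the bounding box, ties are resolved in favor of the leftmost column, and the topmost pixel of the winning column is returned. The function name `chain_code_start` and the toy shape are hypothetical.

```python
def chain_code_start(pixels):
    """Pick a start point from a collection of (x, y) object pixels (y grows downwards)."""
    ys = [y for _, y in pixels]
    mid = (min(ys) + max(ys)) / 2  # horizontal line splitting the bounding box in half
    hist = {}
    for x, y in pixels:
        if y <= mid:  # only the top half contributes to the histogram
            hist[x] = hist.get(x, 0) + 1
    # Column with the highest count; sorting first makes ties resolve to the leftmost column.
    best_x = max(sorted(hist), key=hist.get)
    # Topmost pixel of that column, which necessarily lies on the boundary.
    start_y = min(y for x, y in pixels if x == best_x)
    return best_x, start_y


# Toy silhouette: a narrow head above wider shoulders and two legs.
shape = {(2, 0), (2, 1), (0, 2), (1, 2), (2, 2), (3, 2), (4, 2),
         (1, 3), (2, 3), (3, 3), (1, 4), (3, 4), (1, 5), (3, 5)}
start = chain_code_start(shape)
```

For an upright person this choice tends to place the start on top of the head, which would give every descriptor a comparable origin.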

Then, the distances between the center of mass of the object and its boundary points are used as the main feature for identifying the object (Fig. 2.d).
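
A minimal Python sketch of this descriptor is given below. It assumes that the boundary points are already ordered (for instance by a chain code traversal), that the center of mass is the mean position of all object pixels, and that the distance is Euclidean; the function name `centroid_distance_signature` is hypothetical.

```python
import math


def centroid_distance_signature(boundary, pixels):
    """Distances from the center of mass of `pixels` to each ordered boundary point."""
    cx = sum(x for x, _ in pixels) / len(pixels)
    cy = sum(y for _, y in pixels) / len(pixels)
    return [math.hypot(x - cx, y - cy) for x, y in boundary]


# Toy example: a filled 3x3 square whose boundary is traversed clockwise.
pixels = [(x, y) for x in range(3) for y in range(3)]
boundary = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
signature = centroid_distance_signature(boundary, pixels)
```

The resulting sequence alternates between the corner distance and the side distance of the square, which shows how the descriptor encodes the outline as a one-dimensional signal.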

This step was performed beforehand for the poses in the predefined set used in the comparison process, and the resulting arrays were stored in files for later use.
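
The file format used to store the pose descriptors is not specified. Purely as an illustration, the Python sketch below assumes one JSON file that maps a pose name to its array of distances; the helper names `save_poses` and `load_poses` and the pose name in the usage note are hypothetical.

```python
import json


def save_poses(path, poses):
    """Write a {pose name: list of distances} mapping to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(poses, f)


def load_poses(path):
    """Read the mapping back; the arrays are returned as lists of floats."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

A call such as `save_poses("poses.json", {"standing": [41.0, 40.2, 39.5]})` followed by `load_poses("poses.json")` would return the same mapping.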

5 CLASSIFICATION OF OBJECTS

Finally, the descriptor generated in the previous step is compared with the predefined human body poses.

To check the similarity of the object being assessed with each of the predefined poses, both arrays are processed with the Dynamic Time Warping (DTW) algorithm [Myers].
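
For readers unfamiliar with DTW, the textbook dynamic-programming formulation is sketched below in Python. The local cost (absolute difference), the unconstrained warping path and the function name `dtw_distance` are assumptions of this example; the exact variant of [Myers] used in this work is not detailed here.

```python
def dtw_distance(a, b):
    """Total DTW distance between two sequences of numbers."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(cost[i - 1][j],      # a advances, b repeats
                                     cost[i][j - 1],      # b advances, a repeats
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

Because a sample of one sequence may be matched to several samples of the other, two descriptors of different lengths can still be compared: `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero, since the repeated value is absorbed by the warping.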

If the total distance between the two arrays after the DTW is below a preset threshold, the object is classified as a person (Fig. 2).

We have defined different threshold values depending on the pose being compared. Increasing a threshold increases the number of poses accepted for the corresponding predefined descriptor. This can be an advantage when classifying different human movements with a small number of predefined poses, but it also raises the number of false positives. To avoid this, the threshold should be low when the boundary line is not very expressive, which happens in poses where the legs and arms are close together.
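
The per-pose thresholding can be sketched in Python as follows. The function name `classify`, the pose names and the numeric thresholds are hypothetical, and returning the name of the closest accepted pose (rather than a simple yes or no) is a choice made for this example.

```python
def classify(distances, thresholds):
    """Return the best-matching pose name, or None if the object is not accepted.

    distances:  pose name -> DTW total distance between the object and that pose
    thresholds: pose name -> maximum accepted distance for that pose
    """
    accepted = {pose: d for pose, d in distances.items() if d < thresholds[pose]}
    return min(accepted, key=accepted.get) if accepted else None


# A less expressive outline (limbs close to the body) gets a stricter threshold.
thresholds = {"arms_open": 120.0, "arms_closed": 40.0}
result = classify({"arms_open": 90.0, "arms_closed": 55.0}, thresholds)
```

In this example the object is accepted through the `arms_open` pose only; raising the `arms_closed` threshold above 55 would make it the closer accepted match, which illustrates the trade-off between coverage and false positives described above.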

Figure 2. Scheme of the proposed approach.

6 RESULTS

In this section we present some experiments made with our implementation of the proposed algorithm. Figure 3 shows that the application is capable of correctly classifying different human poses. We can also verify that the shapes of the accepted objects are related to those of the predefined set, but are not strictly equal to them. Figure 4 b) shows the detection of multiple persons, with good results even under partial occlusion.

The main problem with edge-based segmentation of depth images is that if two or more objects are at the same distance from the camera and touch each other at some point, no edge will be found in that zone. Consequently, the system will recognize them as a single object.

In the future we will define a ground truth set by manually segmenting the objects in some depth images. We can then use those human-marked boundaries to properly evaluate the results of our algorithm.

Figure 3. At the top, (a) and (b) represent two predefined human poses. Below, (c) to (j) are examples of correct classifications after calculating the DTW total distance to the above pose images.

Figure 4. a) The original depth image. b) Detection of multiple persons with robustness to partial occlusion.

7 CONCLUSION

In this work we presented a new approach to automatically detect people with depth cameras. The proposed solution consists of an edge detection algorithm, an object filter based on morphological characteristics, a shape-based feature extraction technique, and a similarity comparison using DTW.

The system seems to be sufficiently fast to detect and track multiple people, walking normally around the scene.

Despite the results obtained, this solution still lacks extensive validation. Therefore, our main concern in the near future is to extend the experiments to different real-world scenarios.

8 REFERENCES

Edeler, T., Ohliger, K., Hußmann, S. & Mertins, A. 2010. Time-of-Flight Depth Image Denoising using Prior Noise Information. ICSP.

Freeman, H. 1961. On the encoding of arbitrary geometric configurations. IRE Transactions on Electronic Computers: 260-268.

Holz, D., Holzer, S., Rusu, R. B., & Behnke, S. 2011. Real-Time Plane Segmentation using RGB-D Cameras. Proceedings of the Robocup Symposium.

Myers, C. S., & Rabiner, L. R. 1981. A comparative study of several dynamic time-warping algorithms for connected word recognition. The Bell System Technical Journal: 1389-1409.

Lee, P.-J. 2010. Nongeometric Distortion Smoothing Approach for Depth Map Preprocessing. BMSB.

Leens, J., Piérard, S., Barnich, O., Van Droogenbroeck, M., & Wagner, J.M. 2009. Combining Color, Depth, and Motion for Video Segmentation. ICVS09: 104-113.