
6.3 Training with Real-world Data

6.3.3 Training on Synthetic Dataset + Real-world Data

Training on real data alone enabled the network to estimate the pose of a person occluded by cloth to some extent. In this subsection, we examine whether the synthetic dataset can compensate for the limited posture variation in the real data. Fig. 36 shows the estimation results when the network was trained on the synthetic dataset of 59 postures together with 5 postures of real data. The output of the network trained only on the real data is not accurate enough: the orientation and size of the person are poorly estimated. Training the network on both the synthetic dataset and the real data resolved these problems, which suggests that the pose variation in the synthetic dataset compensated for the lack of pose variation in the real data. In addition, the estimation accuracy improved as the number of real postures added to the synthetic dataset increased (Fig. 37). In particular, when 20 postures of real data were added, the pose was estimated successfully even for body parts such as wrists and ankles, which are difficult to estimate because of their high degree of freedom of movement. At least approximately 20 postures of real data therefore appear to be necessary. Table 2 lists the RMSE and PCK values for each training dataset.
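As a minimal sketch of how such a mixed training set could be assembled, the following NumPy code concatenates a synthetic set with a small real set and shuffles the result. The array shapes, the image size, the number of joints, the assumption of one image per posture, and the helper name `build_mixed_training_set` are all illustrative and are not taken from this thesis.

```python
import numpy as np


def build_mixed_training_set(syn_images, syn_keypoints,
                             real_images, real_keypoints, seed=0):
    """Concatenate synthetic and real (depth image, keypoint) pairs and shuffle.

    Hypothetical helper. Assumed shapes: (N, H, W) for depth images and
    (N, J, 2) for joint keypoints in pixel coordinates.
    """
    images = np.concatenate([syn_images, real_images], axis=0)
    keypoints = np.concatenate([syn_keypoints, real_keypoints], axis=0)
    # Flag each sample so the real/synthetic ratio can be inspected later.
    is_real = np.concatenate([
        np.zeros(len(syn_images), dtype=bool),
        np.ones(len(real_images), dtype=bool),
    ])
    # Apply one shared permutation so images and labels stay aligned.
    order = np.random.default_rng(seed).permutation(len(images))
    return images[order], keypoints[order], is_real[order]


# Toy data: 59 synthetic and 5 real postures, one image each (assumed).
rng = np.random.default_rng(1)
syn_x, syn_y = rng.random((59, 64, 64)), rng.random((59, 14, 2)) * 64
real_x, real_y = rng.random((5, 64, 64)), rng.random((5, 14, 2)) * 64

x, y, is_real = build_mixed_training_set(syn_x, syn_y, real_x, real_y)
print(x.shape, y.shape, int(is_real.sum()))
```

Keeping a per-sample `is_real` flag is a design choice of this sketch: it makes it easy to check how strongly the small real subset is outnumbered by synthetic samples, or to oversample it if needed.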

When the network was trained on the 20 postures of real data only, the RMSE and PCK for the full body were 8.570 [px] and 0.789, respectively, indicating more accurate pose estimation than training on the synthetic dataset only (26.976 [px] and 0.167). This suggests that the network can extract features for estimating the human pose under the cloth from a single depth image. The lowest accuracy was obtained when the network was trained only on the synthetic dataset, and training on real data only gave better accuracy than training on the synthetic dataset only. The estimation accuracy also improved as more real-world data were added to the synthetic dataset. The highest accuracy was achieved when 20 postures of real data were added to the synthetic dataset, with an RMSE of 7.061 [px] and a PCK of 0.841. Using the synthetic dataset in addition to real data as training data is therefore effective for improving the accuracy of human pose estimation under cloth-like objects from a single depth image.

Table 2: RMSE and PCK for the full body. Test data: real data (30 poses).

Training dataset                            RMSE [px]   PCK
synthetic dataset only                      26.976      0.167
real data only (5 poses)                    15.773      0.508
real data only (10 poses)                   13.225      0.636
real data only (20 poses)                    8.570      0.789
real data (5 poses) + synthetic dataset      8.760      0.738
real data (10 poses) + synthetic dataset     7.729      0.800
real data (20 poses) + synthetic dataset     7.061      0.841
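For reference, the two metrics reported here can be sketched as follows. This is one common formulation and not necessarily the exact definition used in this thesis: RMSE is taken over per-joint Euclidean distances in pixels, and PCK counts a keypoint as correct when its error is within `alpha` times a per-image reference length. The function names, the reference length, and the value of `alpha` in the toy example are assumptions.

```python
import numpy as np


def rmse_px(pred, gt):
    """Root mean square of per-joint Euclidean errors, in pixels.

    pred, gt: arrays of shape (N, J, 2) in pixel coordinates.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)  # (N, J) per-joint distances
    return float(np.sqrt(np.mean(dist ** 2)))


def pck(pred, gt, ref_length, alpha):
    """Fraction of keypoints whose error is within alpha * ref_length.

    ref_length: array of shape (N,), a per-image reference length in pixels
    (for example a torso or head size; this choice is an assumption).
    """
    dist = np.linalg.norm(pred - gt, axis=-1)
    return float(np.mean(dist <= alpha * ref_length[:, None]))


# Toy example: two images, three joints each.
gt = np.zeros((2, 3, 2))
pred = gt.copy()
pred[0, 0] = [3.0, 4.0]   # error of 5 px
pred[1, 2] = [0.0, 10.0]  # error of 10 px
ref = np.array([20.0, 20.0])

print(rmse_px(pred, gt))
print(pck(pred, gt, ref, alpha=0.3))
```

In the toy example, five of the six keypoints fall within the 6 px tolerance, so the PCK is 5/6, while the RMSE is dominated by the single 10 px error. This illustrates why the two metrics are reported together: RMSE is sensitive to large outliers, whereas PCK only counts how many keypoints are close enough.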

Figure 36: Estimation results in a real scene. (a) Input. (b) Real data only (training dataset: 5 poses of real data with cloth). (c) Real data + synthetic (training dataset: 5 poses of real data with cloth + synthetic dataset with cloth). (d) Pseudo-ground truth. Test dataset in both cases: real data with cloth.

Figure 37: Estimation results in a real scene. (a) 5 poses, (b) 10 poses, and (c) 20 poses of real data with cloth, each combined with the synthetic dataset with cloth as the training dataset. (d) Pseudo-ground truth. Test dataset in all cases: real data with cloth.

7 Conclusions

7.1 Conclusions

This thesis described a method for estimating the pose of humans under cloth-like objects such as blankets. We use depth images as inputs to avoid sensitivity to illumination conditions and privacy concerns. We use a cloth deformation simulation to generate pairs of depth images of humans under cloth and the locations of joint keypoints in pixel coordinates. These depth image and keypoint pairs are then used to train a network. The performance evaluation on synthetic test data shows the potential of the proposed method for human pose estimation under cloth-like objects: even when the postures in the input data are unknown or the human body is covered with a cloth-like object, the network successfully estimates the human pose. The evaluation using RMSE and PCK showed high accuracy on the synthetic test datasets both with and without cloth. On the other hand, the method did not transfer well to real data: as mentioned above, the difference between synthetic and real data makes reliable estimation difficult. To reduce this reality gap, we added real data to the training data. When the network was trained on the synthetic dataset together with a small number of real-world depth images covering 20 postures, the RMSE and PCK were 7.061 [px] and 0.841, respectively. We expect that this accuracy can be further improved with more posture variation in the synthetic dataset.

From the document: Human Pose Estimation under Cloth-like Objects from Depth Images Using a Synthetic Image Dataset with Cloth Simulation (pages 40-44)