An efficient encoder–decoder model for portrait depth estimation from single images trained on pixel-accurate synthetic data

Research output: Contribution to a Journal (Peer & Non Peer), Article, peer-review

17 Citations (Scopus)

Abstract

Depth estimation from a single image frame is a fundamental challenge in computer vision, with applications such as augmented reality, action recognition, image understanding, and autonomous driving. Accurate depth estimation from a single image frame requires large and diverse training sets. Because dense ground-truth depth is difficult to obtain, a new 3D pipeline of 100 synthetic virtual human models is presented to generate multiple 2D facial images and corresponding ground-truth depth data, allowing complete control over image variations. To validate the synthetic facial depth data, we evaluate state-of-the-art single-image depth estimation algorithms on the generated synthetic dataset. Furthermore, an improved encoder–decoder based neural network is presented. This network is computationally efficient and performs better than the current state of the art when tested and evaluated across 4 public datasets. Our training methodology relies on synthetic data samples, which provide a more reliable ground truth for depth estimation. Additionally, combining appropriate loss functions improves performance over current state-of-the-art networks. Our approach outperforms competing methods across different test datasets, setting a new state of the art for facial depth estimation from synthetic data.
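
The abstract does not say which loss terms are combined, so the following is only an illustrative sketch of what a hybrid depth loss can look like, not the authors' formulation. It assumes two common terms: a pointwise L1 error and an L1 error on horizontal and vertical depth gradients. The function name `hybrid_depth_loss` and the weights `w_point` and `w_grad` are hypothetical.

```python
import numpy as np


def hybrid_depth_loss(pred, target, w_point=1.0, w_grad=1.0):
    """Illustrative hybrid loss for dense depth maps (assumed terms, not the paper's).

    pred, target: 2D arrays of shape (H, W) with H, W >= 2.
    Returns a weighted sum of a pointwise L1 term and a gradient L1 term.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if pred.shape != target.shape or pred.ndim != 2:
        raise ValueError("pred and target must be 2D arrays of the same shape")
    if min(pred.shape) < 2:
        raise ValueError("depth maps must be at least 2x2")

    # Pointwise term: mean absolute depth error per pixel.
    diff = pred - target
    point_term = np.mean(np.abs(diff))

    # Gradient term: finite differences are linear, so the difference of the
    # gradients equals the gradient of the difference map.
    grad_y = np.abs(np.diff(diff, axis=0))
    grad_x = np.abs(np.diff(diff, axis=1))
    grad_term = np.mean(grad_x) + np.mean(grad_y)

    return w_point * point_term + w_grad * grad_term


if __name__ == "__main__":
    target = np.zeros((4, 4))
    pred = np.tile(np.arange(4.0), (4, 1))  # depth error grows along the x axis
    print(hybrid_depth_loss(pred, target))
```

In a sketch like this, the gradient term penalizes errors at depth edges, which a pointwise term alone tends to blur; the relative weights would need tuning for any real model.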

Original language: English
Pages (from-to): 479-491
Number of pages: 13
Journal: Neural Networks
Volume: 142
Publication status: Published - Oct 2021

Keywords

  • 2.5D dataset
  • Convolution neural network
  • Depth estimation
  • Encoder–decoder architecture
  • Facial depth
  • Hybrid loss function
