This network has two parts : The model learns a representation of images, disentangled with scene structure and viewing transformations such as depth rotation and lighting variations. This generative model reconstructs the input and outputs a more complete and sufficient model (in image processing, transforming original input).