We study self-supervised structured representation learning with autoencoders for generative modeling. Unlike most methods, which rely on side information such as weak supervision or introduce new regularization objectives, we improve the representation through a novel decoder architecture and an improved sampling technique. Our structural decoder architecture learns a hierarchy of latent variables, akin to structural causal models, and learns a natural ordering of the latent mechanisms without any additional regularization. We also propose a novel framework for characterizing the quality of the learned representation: we apply interventions in the latent space and evaluate their effects, which gives insight into the causal structure learned by the model and lets us quantify how disentangled the representation is. We evaluate our architecture and sampling method on several challenging natural image datasets and compare them against several canonical baselines.