ECE 590SIP, Fall 2020


Self-expressing autoencoders for unsupervised feature learning

Saurabhchand Bhati, September 14, 2020, 4:00-5:00pm CDT

Generic features such as MFCCs carry information about many factors at once: phonetic content, speaker identity, emotion, etc. In supervised settings, manual labels guide a neural network toward task-specific representations; no such guidance is available in unsupervised settings. For good unsupervised performance, the representations themselves must highlight the task-specific information, e.g., emphasize phonetic properties for good segmentation.

We propose the Self-Expressing Autoencoder (SEA) to learn representations that emphasize the phonetic properties of a frame over other factors of variability. The model consists of a single encoder and two decoders with shared weights. The encoder projects the input features into a latent representation; one decoder reconstructs the input from that latent representation, and the other reconstructs it from the self-expressed version of it.
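As a rough illustration of that architecture, a minimal numpy forward pass is sketched below. The dimensions, the tanh nonlinearity, and the similarity-softmax form of the self-expression step are assumptions made for the sketch, not details from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the abstract)
n_frames, d_in, d_lat = 8, 39, 16   # e.g. 39-dim MFCC frames

# Shared weights: one encoder, one decoder reused by both branches
W_enc = rng.normal(scale=0.1, size=(d_in, d_lat))
W_dec = rng.normal(scale=0.1, size=(d_lat, d_in))

def encode(X):
    return np.tanh(X @ W_enc)

def decode(Z):
    return Z @ W_dec

def self_express(Z):
    # Rewrite each latent vector as a convex combination of the
    # *other* frames' latents, weighted by similarity. (This softmax
    # form is a common self-expression choice and an assumption here.)
    A = Z @ Z.T
    np.fill_diagonal(A, -np.inf)               # exclude the frame itself
    A = np.exp(A - A.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ Z

X = rng.normal(size=(n_frames, d_in))
Z = encode(X)
X_rec = decode(Z)                    # decoder branch 1: plain reconstruction
X_se = decode(self_express(Z))       # decoder branch 2: self-expressed reconstruction

# Training would minimize both reconstruction errors jointly
loss = np.mean((X - X_rec) ** 2) + np.mean((X - X_se) ** 2)
```

Because the two decoder branches share weights, the latent code is pushed toward information that also survives being re-expressed through similar frames, which is the intended bias toward phonetic content.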

We explore the usefulness of SEA in three scenarios:

- No supervision: SEA representations are used to segment and cluster speech for the unsupervised unit discovery task.
- Partial or distant supervision: SEA representations are used for audio-visual feature learning.
- Complete supervision: we analyze how much phonetic information the SEA representations capture by mapping them to phone labels through a single sigmoid layer.
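The complete-supervision probe in the last scenario can be sketched as a single affine layer followed by a sigmoid. The sizes and the random stand-in representations below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes (assumptions): latent dim, phone inventory, frames
d_lat, n_phones, n_frames = 16, 40, 5

# One affine layer + sigmoid, as in the probing setup described
W = rng.normal(scale=0.1, size=(d_lat, n_phones))
b = np.zeros(n_phones)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

Z = rng.normal(size=(n_frames, d_lat))   # stand-in for SEA representations
probs = sigmoid(Z @ W + b)               # per-phone scores in (0, 1)
pred = probs.argmax(axis=1)              # predicted phone index per frame
```

Because the probe has no hidden layers, its accuracy directly reflects how linearly accessible the phonetic information is in the learned representations.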

Link to the paper