RAVE Audio Models: Deep Learning as a Symbolic Form

This project was developed as the primary research component of my MA thesis, “Deep Learning as Symbolic Form,” which applied Erwin Panofsky’s theory of symbolic form to contemporary AI systems. The hands-on methodology reflects my broader approach to critical AI studies: examining computational systems through direct technical engagement rather than purely theoretical analysis.

The project examines how deep learning systems encode and reconstruct meaning through the training of three RAVE (Realtime Audio Variational autoEncoder) models on distinct audio datasets. Each model was trained for 1,166,800 steps using identical hyperparameters on Google Colab A100 GPUs.
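As a concrete reference, the sketch below shows how this kind of training run can be driven from Python on Colab using the RAVE command-line tools (acids-ircam/RAVE, installable as the acids-rave package). All paths, the model name, and the choice of the v2 config are illustrative assumptions; the section itself specifies only the step count and that hyperparameters were identical across the three models.

```python
import subprocess

# Hypothetical paths for one of the three datasets (the percussion set).
AUDIO_DIR = "/content/drums_audio"      # raw source recordings
DATASET_DIR = "/content/drums_dataset"  # preprocessed training data

# Step 1: resample and chunk the raw audio into RAVE's dataset format.
subprocess.run(
    ["rave", "preprocess",
     "--input_path", AUDIO_DIR,
     "--output_path", DATASET_DIR],
    check=True,
)

# Step 2: train. The "v2" config is an assumption; exact flags can vary
# between RAVE versions, so check `rave train --help` for your install.
subprocess.run(
    ["rave", "train",
     "--config", "v2",
     "--db_path", DATASET_DIR,
     "--name", "drum_model"],
    check=True,
)
```

The same two commands, pointed at each dataset in turn, yield the three models described below.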

Rather than treating these models as neutral tools for audio synthesis, the project investigates how RAVE’s architecture (its encoding process, latent space construction, and generative decoding) imposes a formal logic onto its training data. The featured audio samples are random 30-second generations from each model’s latent space. They reveal consistent aesthetic patterns of fragmentation and rhythmic incoherence across all three models despite their different source material, shared qualities that suggest each model’s latent space structures its outputs independently of dataset content.
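These random generations are straightforward to reproduce once a model has been exported to TorchScript: sample a latent trajectory from the standard normal prior and decode it. The sketch below assumes an exported model file, a 44.1 kHz sample rate, a 16-dimensional latent space, and a temporal compression ratio of 2048 samples per latent frame; all four values are illustrative and depend on the actual training configuration.

```python
import torch
import torchaudio

# Hypothetical filename for one of the three exported models.
model = torch.jit.load("drum_model.ts").eval()

SAMPLE_RATE = 44100   # assumed; set to the model's training sample rate
LATENT_DIM = 16       # assumed; depends on the trained configuration
COMPRESSION = 2048    # assumed samples per latent frame
DURATION_S = 30
n_frames = (SAMPLE_RATE * DURATION_S) // COMPRESSION

with torch.no_grad():
    # A random trajectory through the latent space: one frame per step,
    # drawn from the N(0, I) prior the VAE was regularized toward.
    z = torch.randn(1, LATENT_DIM, n_frames)
    audio = model.decode(z)  # -> (batch, channels, samples)

torchaudio.save("generation.wav", audio.squeeze(0), SAMPLE_RATE)
```

Because every sample is drawn from the same prior regardless of which model decodes it, differences between outputs isolate what each decoder has learned, while the shared fragmentation described above points back to the architecture itself.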

This work positions deep learning as a symbolic form in which computational architectures mediate reality through processes of statistical abstraction embedded within planetary-scale computational infrastructure. By making visible the interpretive decisions encoded in model training, the project reveals how deep learning functions not as transparent reproduction but as formalized aesthetic and political intervention.

DrumModel: Trained on 81 minutes of percussion recorded at 122 BPM in Los Angeles, 2024-2025. Random 30-second generation from the trained model.
VocalModel: Trained on 72 minutes of solo vocal performances in the key of C, recorded in Burbank, CA, 2025. Random 30-second generation from the trained model.
OperaModel: Trained on 123 minutes of digitized 1920s Cantonese opera vinyl recordings. Random 30-second generation from the trained model.