Moviescc |work| 〈4K〉
Movie scene classification, deep learning, content-based video analysis, film narrative, clustering, MovieSCC 1. Introduction Cinema is a complex multimodal art form. For decades, film analysis relied on manual annotation by scholars and archivists. With platforms like Netflix, Disney+, and YouTube hosting millions of hours of video, automated scene understanding has become critical for indexing, recommendation, and accessibility.
[6] McInnes, L., Healy, J., & Astels, S. (2017). HDBSCAN: Hierarchical density based clustering. JOSS . : Sample scene embeddings (t-SNE visualization) and confusion matrix are available in the supplementary material. End of paper. moviescc
[5] Devlin, J., et al. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL . With platforms like Netflix, Disney+, and YouTube hosting
Existing video classification models (e.g., VideoBERT, TimeSformer) often treat movies as generic video streams, ignoring unique cinematic structures such as shot transitions, pacing, and narrative tropes. fills this gap by focusing specifically on scene-level classification —the atomic narrative unit typically comprising multiple shots unified by time and location. HDBSCAN: Hierarchical density based clustering
Author: [Generated for academic purposes] Date: April 14, 2026 Abstract The exponential growth of streaming media and digital film archives has created an urgent need for automated, granular analysis of cinematic content. This paper introduces MovieSCC (Movie Scene Classification and Clustering) , a computational framework designed to classify movie scenes based on visual, auditory, and narrative features. Leveraging deep learning architectures—including convolutional neural networks (CNNs) for keyframe analysis, recurrent neural networks (RNNs) for dialogue sentiment, and graph-based clustering for narrative arcs—MovieSCC achieves 87.4% accuracy in identifying scene types (e.g., action, dialogue, suspense, romance) across a diverse dataset of 10,000 annotated scenes from 500 films. We discuss its architectural components, training methodology, applications in content recommendation, film editing, and accessibility (e.g., audio description generation), as well as limitations regarding cultural bias and computational cost. This paper provides a foundation for future research in automated cinematic understanding.