What Makes Videos Memorable?

29/01/2026

This paper presents a comprehensive study on video memorability prediction conducted by researchers from the Information Processing and Telecommunications Center (IPTC), a research center of the Universidad Politécnica de Madrid (UPM). The work investigates how semantic information, derived from both visual and textual content, can improve the prediction of how likely a video is to be remembered by viewers.

Video memorability is an important factor in human perception and has direct implications for content design, information retrieval, and user engagement. While previous approaches have shown that semantic cues play a key role, the specific benefits of multimodal models combining images and text have not been systematically analyzed. This study addresses that gap through a controlled comparison between unimodal models (using only images or only text) and multimodal models based on CLIP (Contrastive Language–Image Pre-training).

The authors introduce FCLIP, an adapted version of CLIP that is further trained using image–text pairs specifically related to memorability. Extensive experiments on the Memento10k dataset show that multimodal models significantly outperform unimodal baselines. Results demonstrate that contrastive learning and domain adaptation improve the ability of models to capture semantic information relevant to human memory.
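At the core of CLIP-style training is a symmetric contrastive (InfoNCE) objective that pulls matching image–text pairs together and pushes mismatched pairs apart. The sketch below illustrates that objective in plain numpy; it is a minimal illustration of the general CLIP loss, not the paper's FCLIP training code, and the function name and temperature value are our own choices.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss in the style of CLIP.

    Matching image-text pairs sit on the diagonal of the similarity
    matrix; the loss rewards high diagonal similarity and penalises
    high off-diagonal similarity.
    """
    # L2-normalise so dot products become cosine similarities
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature       # (N, N) similarity matrix
    labels = np.arange(len(img))             # correct pair = diagonal entry

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Symmetric: average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels) +
                  cross_entropy(logits.T, labels))
```

Domain adaptation of the kind the authors describe amounts to continuing to minimise a loss of this form on memorability-related image–text pairs, so the embedding space reorganises around the semantics of that domain.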

Beyond its scientific contribution, this research highlights the importance of semantic understanding in multimedia analysis and provides practical guidance on how to adapt large pre-trained models to specific cognitive tasks.

Potential applications of this research include the design of more effective advertising campaigns, improved educational and training videos, content recommendation systems, and tools to support media creation by predicting which videos are more likely to be remembered by audiences.

Bibliographic reference:

Martín-Fernández, I., Esteban-Romero, S., Gil-Martín, M., & Fernández-Martínez, F. (2026). A comprehensive study on contrastive pre-training and fine tuning of vision and text transformers for video memorability prediction. Multimedia Tools and Applications, 85, 30. https://doi.org/10.1007/s11042-026-21260-3

Iván Martín Fernández: GS / ORCID / LinkedIn

Sergio Esteban Romero: GS / ORCID / LinkedIn

Manuel Gil Martín: GS / ORCID / LinkedIn

Fernando Fernández Martínez: GS / ORCID / LinkedIn


LinkedIn: https://www.linkedin.com/company/iptc-upm/

For more information: www.iptc.upm.es
