This work explores 2D Gaussian splats as a new primitive for representing videos. We propose GSVC, an approach to learning a set of 2D Gaussian splats that can effectively represent and compress video frames.
GSVC incorporates the following techniques (a minimal sketch of the resulting per-frame loop follows below): (i) To exploit temporal redundancy among adjacent frames, which speeds up training and improves compression efficiency, we predict the Gaussian splats of each frame from those of the previous frame; (ii) To control the trade-off between file size and quality, we prune Gaussian splats that contribute little to video quality; (iii) To capture dynamics in the video, we randomly add Gaussian splats to fit content with large motion or newly appearing objects; (iv) To handle significant scene changes, we detect key frames based on loss differences during the learning process.
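To make the interplay of these four techniques concrete, here is a minimal, self-contained sketch of one possible per-frame optimization loop. It is illustrative only: the isotropic additive renderer, the energy-based pruning score, the thresholds, and the function names (`render`, `fit_frame`, `encode_video`) are our simplifying assumptions, not the actual GSVC implementation, which operates on full 2D Gaussian splats.

```python
# Illustrative sketch only: isotropic Gaussians, additive blending, and
# heuristic thresholds stand in for GSVC's actual renderer and schedules.
import torch

def render(mu, sigma, color, H, W):
    """Additively splat N isotropic 2D Gaussians onto an HxW RGB canvas.
    mu: (N, 2) centers in [0, 1]^2; sigma: (N,) scales; color: (N, 3).
    Naive O(N * H * W) renderer; intended for small test resolutions."""
    ys = torch.linspace(0, 1, H)
    xs = torch.linspace(0, 1, W)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    grid = torch.stack([gx, gy], dim=-1)                   # (H, W, 2)
    d2 = ((grid[None] - mu[:, None, None]) ** 2).sum(-1)   # (N, H, W)
    w = torch.exp(-0.5 * d2 / sigma[:, None, None] ** 2)   # Gaussian weights
    return torch.einsum("nhw,nc->hwc", w, color)           # (H, W, 3)

def fit_frame(frame, prev=None, steps=300, n_init=512, n_new=64,
              prune_thresh=1e-4):
    """Fit one frame; warm-start from the previous frame's splats if given."""
    H, W, _ = frame.shape
    if prev is None:
        # Key frame: start from randomly initialized splats.
        mu = torch.rand(n_init, 2)
        sigma = torch.full((n_init,), 0.02)
        color = torch.rand(n_init, 3)
    else:
        # Predictive frame: reuse the previous splats (temporal redundancy),
        # then add random splats to cover motion and newly appearing content.
        mu, sigma, color = (p.clone() for p in prev)
        mu = torch.cat([mu, torch.rand(n_new, 2)])
        sigma = torch.cat([sigma, torch.full((n_new,), 0.02)])
        color = torch.cat([color, torch.rand(n_new, 3)])
    params = [p.requires_grad_() for p in (mu, sigma, color)]
    opt = torch.optim.Adam(params, lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.mean((render(mu, sigma, color, H, W) - frame) ** 2)  # L2
        loss.backward()
        opt.step()
    with torch.no_grad():
        # Pruning: drop splats with low contribution to the rendered frame.
        # A splat's total energy (color magnitude x footprint) is a stand-in
        # for the paper's actual contribution measure.
        energy = color.abs().sum(-1) * sigma ** 2
        keep = energy > prune_thresh
        splats = tuple(p.detach()[keep] for p in params)
    return splats, loss.item()

def encode_video(frames, key_ratio=5.0):
    """Encode frames in order, re-initializing at detected key frames."""
    splats, prev_loss, encoded = None, None, []
    for frame in frames:
        splats, loss = fit_frame(frame, prev=splats)
        # Key-frame detection: a large loss jump relative to the previous
        # frame signals a scene change, so refit this frame from scratch.
        if prev_loss is not None and loss > key_ratio * prev_loss:
            splats, loss = fit_frame(frame, prev=None)
        encoded.append(splats)
        prev_loss = loss
    return encoded
```

Under this structure, predictive frames start close to a good solution and typically converge in far fewer steps than key frames, which is where the training speedup from technique (i) comes from.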
Experimental results show that GSVC achieves good rate-distortion trade-offs, comparable to state-of-the-art video codecs such as AV1 and HEVC, and renders at 1500 fps for 1920x1080 video.
GSVC achieves a significant improvement over the I-frame-only method (Ours-GI), which uses GaussianImage to encode each frame independently, demonstrating the contributions of predictive frames, pruning, augmentation, and key-frame detection to the representation.
Notice the floating dots in the upper-left video for the I-frame-only method.
The neural-based approaches achieve significant improvements over GSVC, and even over some state-of-the-art codecs such as VVC, in terms of MS-SSIM.
These neural-based approaches, however, are pre-trained on the UVG dataset, so their superior performance is not surprising. The standard MPEG/H.26x video codecs and GSVC do not require any pre-training and generalize well to any video.
GSVC achieves better or comparable performance in terms of MS-SSIM and VMAF (which account for image structure and human perception) against all codecs except VVC. GSVC yields a lower PSNR in certain scenarios, but note that we use \(L_2\) as the loss function; we can tune the convergence condition to further improve PSNR, without increasing \(N\), at the cost of longer encoding time.
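Since the \(L_2\) loss is exactly the per-pixel mean squared error, PSNR is a direct function of the converged loss. A quick sanity check (plain Python; no project code assumed):

```python
import math

def psnr(mse, max_val=1.0):
    """PSNR in dB from mean squared error (pixels normalized to [0, max_val]).
    Minimizing the L2 loss therefore directly maximizes PSNR."""
    return 10.0 * math.log10(max_val ** 2 / mse)

print(psnr(1e-3))   # 30.0 dB
print(psnr(5e-4))   # ~33.0 dB: halving the converged MSE buys ~3 dB
```

This is why tightening the convergence condition (running more optimization steps per frame) trades encoding time for PSNR without changing \(N\).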
@article{wang2025gsvc,
  title={{GSVC}: Efficient Video Representation and Compression Through {2D Gaussian} Splatting},
  author={Wang, Longan and Shi, Yuang and Ooi, Wei Tsang},
  journal={arXiv preprint arXiv:2501.12060},
  year={2025}
}