Deep Learning Video Prediction Theories and Their Architecture: A Review

Authors

Z. T. Al Mokhtar, S. A. Dawwd

DOI:

https://doi.org/10.31272/jeasd.2262

Keywords:

Convolutional neural network, Generative adversarial network, Recurrent model, Three-dimensional CNN layers, Video prediction

Abstract

In video-based AI systems, the most important factor in making a suitable decision is the ability to forecast future outputs from computer vision data. Deep-learning (DL) architectures are considered a promising direction in many fields of computer vision, and researchers have recently introduced several novel video prediction (VP) models that achieve high performance. Before building any prediction model, however, the basic principles of VP architectures and theories must be understood in order to select appropriate datasets and evaluation metrics. This study reviews 51 peer-reviewed papers that cover the major VP architectures, including CNN, RNN, Autoencoder, VAE, and GAN models. The comparative analysis shows that 3D-CNN and GAN-based architectures achieve superior performance, with SSIM = 0.97 and PSNR = 40.2 dB across standard datasets such as UCF101 and KITTI. The novelty of this work lies in providing a comprehensive quantitative comparison of architectures, metrics, and datasets, and in proposing a unified taxonomy that integrates spatial-temporal deep learning models, their evolution from 2D to 3D, and probabilistic approaches. The paper's main contribution is a structured classification of VP architectures and datasets that serves as a reference framework for researchers evaluating and designing novel video prediction systems.
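
As an illustration of the PSNR metric reported above, the following is a minimal sketch using the standard definition PSNR = 10 log10(MAX^2 / MSE). It is not the evaluation code used in the reviewed papers: the function name `psnr`, the flattened-frame input format, the 8-bit peak value of 255, and the toy pixel values are all assumptions made for this example.

```python
import math


def psnr(reference, predicted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames.

    Frames are passed as flat sequences of pixel intensities.
    max_value is the peak intensity (255 is assumed for 8-bit images).
    """
    if len(reference) != len(predicted) or len(reference) == 0:
        raise ValueError("frames must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((r - p) ** 2 for r, p in zip(reference, predicted)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 2x2 frames, flattened (hypothetical values for illustration only).
truth = [52, 55, 61, 59]
guess = [54, 55, 60, 57]
print(round(psnr(truth, guess), 2))  # prints 44.61
```

Higher PSNR means the predicted frame is closer to the ground truth. SSIM, by contrast, compares local luminance, contrast, and structure, and is usually computed with an image-processing library rather than by hand.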


Key Dates

Received

2023-09-10

Revised

2025-04-06

Accepted

2025-04-26

Published Online First

2025-10-27

Published

2025-11-01

How to Cite

Al Mokhtar, Z. T., & Dawwd, S. A. (2025). Deep Learning Video Prediction Theories and Their Architecture: A Review. Journal of Engineering and Sustainable Development, 29(6), 771-784. https://doi.org/10.31272/jeasd.2262
