Tu K, Meng M, Lee M W, et al. Joint video and text parsing for understanding events and answering queries[J]. IEEE MultiMedia, 2014, 21(2): 42-70.
Zhu L, Xu Z, Yang Y, et al. Uncovering the temporal context for video question answering[J]. International Journal of Computer Vision, 2017, 124(3): 409-421.
Tapaswi M, Zhu Y, Stiefelhagen R, et al. MovieQA: Understanding stories in movies through question-answering[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 4631-4640.
Maharaj T, Ballas N, Rohrbach A, et al. A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 6884-6893.
Jang Y, Song Y, Yu Y, et al. TGIF-QA: Toward spatio-temporal reasoning in visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 2758-2766.
Zeng K H, Chen T H, Chuang C Y, et al. Leveraging video descriptions to learn video question answering[J]. arXiv preprint arXiv:1611.04021, 2016.
Zhao Z, Yang Q, Cai D, et al. Video Question Answering via Hierarchical Spatio-Temporal Attention Networks[C]//IJCAI. 2017: 3518-3524.
Xu D, Zhao Z, Xiao J, et al. Video question answering via gradually refined attention over appearance and motion[C]//Proceedings of the 25th ACM international conference on Multimedia. 2017: 1645-1653.
Zhao Z, Lin J, Jiang X, et al. Video question answering via hierarchical dual-level attention network learning[C]//Proceedings of the 25th ACM international conference on Multimedia. 2017: 1050-1058.
Ye Y, Zhao Z, Li Y, et al. Video question answering via attribute-augmented attention network learning[C]//Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval. 2017: 829-832.
Gao J, Ge R, Chen K, et al. Motion-appearance co-memory networks for video question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6576-6585.
Song X, Shi Y, Chen X, et al. Explore multi-step reasoning in video question answering[C]//Proceedings of the 26th ACM international conference on Multimedia. 2018: 239-247.
Lei J, Yu L, Bansal M, et al. TVQA: Localized, compositional video question answering[J]. arXiv preprint arXiv:1809.01696, 2018.
Li X, Song J, Gao L, et al. Beyond RNNs: Positional self-attention with co-attention for video question answering[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33: 8658-8665.
Gao L, Zeng P, Song J, et al. Structured two-stream attention network for video question answering[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33: 6391-6398.
Fan C, Zhang X, Zhang S, et al. Heterogeneous memory enhanced multimodal attention model for video question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 1999-2007.
Kim J, Ma M, Kim K, et al. Progressive attention memory network for movie story question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 8337-8346.
Zhao Z, Zhang Z, Xiao S, et al. Open-Ended Long-form Video Question Answering via Adaptive Hierarchical Reinforced Networks[C]//IJCAI. 2018: 3683-3689.
Kim J, Ma M, Kim K, et al. Gaining extra supervision via multi-task learning for multi-modal video question answering[C]//2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 2019: 1-8.
Li X, Gao L, Wang X, et al. Learnable aggregating net with diversity learning for video question answering[C]//Proceedings of the 27th ACM International Conference on Multimedia. 2019: 1166-1174. DOI: https://doi.org/10.1145/3343031.3350971
Jin W, Zhao Z, Gu M, et al. Multi-interaction network with object relation for video question answering[C]//Proceedings of the 27th ACM International Conference on Multimedia. 2019: 1193-1201.
Yang T, Zha Z J, Xie H, et al. Question-aware tube-switch network for video question answering[C]//Proceedings of the 27th ACM International Conference on Multimedia. 2019: 1184-1192.
Yu T, Yu J, Yu Z, et al. Compositional attention networks with two-stream fusion for video question answering[J]. IEEE Transactions on Image Processing, 2019, 29: 1204-1218.
Wang A, Luu A T, Foo C S, et al. Holistic multi-modal memory network for movie question answering[J]. IEEE Transactions on Image Processing, 2019, 29: 489-499.
Garcia N, Nakashima Y. Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions[J]. arXiv preprint arXiv:2007.08751, 2020.
Zhao Z, Xiao S, Song Z, et al. Open-Ended Video Question Answering via Multi-Modal Conditional Adversarial Networks[J]. IEEE Transactions on Image Processing, 2020, 29: 3859-3870.
Yang Z, Garcia N, Chu C, et al. BERT Representations for Video Question Answering[C]//The IEEE Winter Conference on Applications of Computer Vision. 2020: 1556-1565.