Common CV and NLP Datasets and Corpora: A Resource Roundup
Summary of Common Deep Learning Datasets
- CV
- NLP
- Sentiment Analysis
- Text Classification
- Dialogue Generation
- Other
- Audio
- Multi-Modal
- Classification
- Search & Matching
- Image Captioning
- VisualQA
- Tri-Modal
- Other
CV
- ghcn
- climate_sphere
- ModelNet40
- Shrec17 data + label
- cosmo: spherical convergence maps dataset (Zenodo)
Classification
- Fashion-MNIST
- ImageNet
- CIFAR-10 + CIFAR-100
- CelebA Dataset
- MS-Celeb-1M
- The Street View House Numbers (SVHN) Dataset
- Open Images Dataset
NLP
Sentiment Analysis
- Large Movie Review Dataset (IMDB)
- Sentiment140 (STS)
Text Classification
- Twenty Newsgroups
Dialogue Generation
- Reddit-Thread Dataset
- SimpleQuestions (v2)
- Web data: Amazon reviews
- The WikiText Long Term Dependency Language Modeling Dataset
Other
- WordNet
- Yelp
Audio
- The Flickr Audio Caption Corpus
Multi-Modal
Classification
- Multi-Modal Sarcasm Detection in Twitter with Hierarchical Fusion Model (2019)
- MUStARD: Multimodal Sarcasm Detection Dataset (ACL, 2019)
- CMU-Multimodal SDK
- UR-FUNNY
- CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-grained Annotations of Modality (2020)
- IEMOCAP: Interactive Emotional Dyadic Motion Capture Database (2008)
- MM-IMDB
Search & Matching
- IAPR TC-12
- NUS-WIDE
- BriVL (2021)
Image Captioning
- Flickr8k Dataset
- Flickr30k Dataset
- COCO Dataset (2015)
- Conceptual Captions Dataset (2018)
VisualQA
- VisualQA
Tri-Modal
- How2: A Large-scale Dataset for Multimodal Language Understanding
Other
- SVLD: The Social Vision and Language Dataset
- https://dubbel.eecs.berkeley.edu/minio/login
- AI-NLP-ML GROUP
- https://dumps.wikimedia.org/backup-index-bydb.html
- Chinese corpora
Chinese NLP dataset search (named entity recognition, text classification, text summarization)
References
- How to use a data-annotation crowdsourcing platform gracefully: a guide to Amazon Mechanical Turk
- Datasets for Natural Language Processing
- nlp_chinese_corpus
- nlp-datasets
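- Free public data downloads for 10 major industries: e-commerce
- Dataset roundup: 25 open datasets for deep learning
- Open-source datasets for deep learning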