当前位置: 首页 > news >正文

datasets笔记:两种数据集对象

Datasets 提供两种数据集对象:Dataset✨ IterableDataset ✨

  • Dataset 提供快速随机访问数据集中的行,并支持内存映射,因此即使加载大型数据集也只需较少的内存。
  • IterableDataset 适用于超大数据集,甚至无法完全下载到磁盘或内存中。它允许在数据集完全下载之前就开始访问和使用数据集。

0 读取数据

from datasets import load_datasetdataset = load_dataset("rotten_tomatoes", split="train")
dataset
'''
Dataset({features: ['text', 'label'],num_rows: 8530
})
'''

1 Dataset

1.1 索引

dataset[0]
'''
{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .','label': 1}
'''dataset[-1]
'''
{'text': 'things really get weird , though not particularly scary : the movie is all portent and no content .','label': 0}
'''dataset[0]['text']
'''
'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'
'''dataset['text']

1.2 切片

dataset[:3]
'''
{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .','the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .','effective but too-tepid biopic'],'label': [1, 1, 1]}
'''

2 IterableDataset

当设置 streaming=True 时加载的数据集为 IterableDataset

IterableDataset 的行为与 Dataset 不同:

  • 无法随机访问。
  • 只能逐个迭代获取元素,例如使用 next(iter())for 循环。
from datasets import load_datasetiter_dataset = load_dataset("rotten_tomatoes", split="train",streaming=True)
iter_dataset
'''
IterableDataset({features: ['text', 'label'],n_shards: 1
})
'''
for i in iter_dataset:print(i)break
'''
{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}
'''

2.1 从现有 Dataset 创建 IterableDataset

iter_dataset2=dataset.to_iterable_dataset()
for i in iter_dataset2:print(i)break
'''
{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}
'''

2.2  获取指定数量的示例

list(iter_dataset2.take(3))
'''
[{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .','label': 1},{'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .','label': 1},{'text': 'effective but too-tepid biopic', 'label': 1}]
'''

http://www.lryc.cn/news/508201.html

相关文章:

  • 【ETCD】【Linearizable Read OR Serializable Read】ETCD 数据读取:强一致性 vs 高性能,选择最适合的读取模式
  • 【CSS in Depth 2 精译_089】15.2:CSS 过渡特效中的定时函数
  • 不常用命令指南
  • spring mvc | servlet :serviceImpl无法自动装配 UserMapper
  • STM32 HAL库之串口接收不定长字符
  • Pyqt6的tableWidget填充数据
  • ASP.NET Core - 依赖注入 自动批量注入
  • UVM 验证方法学之interface学习系列文章(十一)virtual interface 再续篇
  • 面试题整理5----进程、线程、协程区别及僵尸进程处理
  • OpenTK 中帧缓存的深度解析与应用实践
  • 第2节-Test Case如何调用Object Repository中的请求并关联参数
  • 【HarmonyOS NEXT】Web 组件的基础用法以及 H5 侧与原生侧的双向数据通讯
  • Android学习(六)-Kotlin编程语言-数据类与单例类
  • CV-OCR经典论文解读|An Empirical Study of Scaling Law for OCR/OCR 缩放定律的实证研究
  • 力扣274. H 指数
  • 挑战一个月基本掌握C++(第五天)了解运算符,循环,判断
  • Python的sklearn中的RandomForestRegressor使用详解
  • ReactPress 1.6.0:重塑博客体验,引领内容创新
  • 人脸生成3d模型 Era3D
  • kubeadm搭建k8s集群
  • centOS系统进程管理基础知识
  • STM32中ADC模数转换器
  • 初学stm32 --- 外部中断
  • wordpress调用指定分类ID下 相同标签的内容
  • SQL语法基础知识总结
  • css 实现呼吸灯效果
  • IMX6ULL开发板如何关掉自带的QT的GUI界面和poky的界面的方法
  • 几种广泛使用的 C++ 编译器
  • 《Vue进阶教程》第十六课:深入完善响应式系统之单例模式
  • C语言版解法力扣题:将整数按权重排序