
Benchmarking fireducks, duckdb, and polars on TPC-H at sf=0.1

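First, download the source package from https://github.1git.de/fireducks-dev/polars-tpch and extract it to the /par/fire directory.
Then change into that directory and run
SCALE_FACTOR=0.1 ./run-fireducks.sh. The script first installs the required packages and compiles the TPC-H data generator, then generates the tbl files at sf=0.1, converts them to Parquet, and finally runs the queries.
The output looks like this: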

root@DESKTOP-59T6U68:/par/fire# SCALE_FACTOR=0.1 ./run-fireducks.sh
Looking in indexes: https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
Requirement already satisfied: pyarrow in ./.venv/lib/python3.13/site-packages (20.0.0)
Requirement already satisfied: pydantic in ./.venv/lib/python3.13/site-packages (2.11.7)
Requirement already satisfied: pydantic_settings in ./.venv/lib/python3.13/site-packages (2.10.1)
Requirement already satisfied: linetimer in ./.venv/lib/python3.13/site-packages (0.1.5)
Requirement already satisfied: annotated-types>=0.6.0 in ./.venv/lib/python3.13/site-packages (from pydantic) (0.7.0)
Requirement already satisfied: pydantic-core==2.33.2 in ./.venv/lib/python3.13/site-packages (from pydantic) (2.33.2)
Requirement already satisfied: typing-extensions>=4.12.2 in ./.venv/lib/python3.13/site-packages (from pydantic) (4.14.0)
Requirement already satisfied: typing-inspection>=0.4.0 in ./.venv/lib/python3.13/site-packages (from pydantic) (0.4.1)
Requirement already satisfied: python-dotenv>=0.21.0 in ./.venv/lib/python3.13/site-packages (from pydantic_settings) (1.1.1)
make -C tpch-dbgen dbgen
make[1]: Entering directory '/par/fire/tpch-dbgen'
make[1]: 'dbgen' is up to date.
make[1]: Leaving directory '/par/fire/tpch-dbgen'
cd tpch-dbgen && ./dbgen -vf -s 0.1 && cd ..
TPC-H Population Generator (Version 2.17.2)
Copyright Transaction Processing Performance Council 1994 - 2010
Generating data for suppliers table/
Preloading text ... 100%
done.
Generating data for customers tabledone.
Generating data for orders/lineitem tablesdone.
Generating data for part/partsupplier tablesdone.
Generating data for nation tabledone.
Generating data for region tabledone.
mkdir -p "data/tables_pyarrow/scale-0.1"
mv tpch-dbgen/*.tbl data/tables_pyarrow/scale-0.1/
.venv/bin/python -m scripts.prepare_data_pyarrow
Processing table: customer
Processing table: lineitem
Processing table: nation
Processing table: orders
Processing table: part
Processing table: partsupp
Processing table: region
Processing table: supplier
rm -rf data/tables_pyarrow/scale-0.1/*.tbl
PATH_TABLES=data/tables_pyarrow .venv-fireducks/bin/python -m queries.fireducks
{"scale_factor":0.1,"large_string_comment":false,"paths":{"answers":"data/answers","tables":"data/tables_pyarrow","timings":"output/run","timings_filename":"timings.csv","plots":"output/plot"},"plot":{"show":false,"n_queries":7,"y_limit":null},"run":{"io_type":"skip","log_timings":true,"show_results":false,"check_results":false,"polars_show_plan":false,"polars_eager":false,"polars_streaming":false,"polars_new_streaming":false,"polars_gpu":false,"polars_gpu_device":0,"use_rmm_mr":"cuda-async","modin_memory":8000000000,"spark_driver_memory":"2g","spark_executor_memory":"1g","spark_log_level":"ERROR","include_io":false},"dataset_base_dir":"data/tables_pyarrow/scale-0.1"}
Code block 'Run fireducks query 1' took: 0.20121 s
Code block 'Run fireducks query 2' took: 0.52730 s
Code block 'Run fireducks query 3' took: 0.15594 s
Code block 'Run fireducks query 4' took: 0.15536 s
Code block 'Run fireducks query 5' took: 0.23419 s
Code block 'Run fireducks query 6' took: 0.11777 s
Code block 'Run fireducks query 7' took: 0.27936 s
Code block 'Run fireducks query 8' took: 0.22832 s
Code block 'Run fireducks query 9' took: 0.18384 s
Code block 'Run fireducks query 10' took: 0.33037 s
Code block 'Run fireducks query 11' took: 0.16605 s
Code block 'Run fireducks query 12' took: 0.16841 s
Code block 'Run fireducks query 13' took: 0.14314 s
Code block 'Run fireducks query 14' took: 0.13404 s
Code block 'Run fireducks query 15' took: 0.14402 s
Code block 'Run fireducks query 16' took: 0.20629 s
Code block 'Run fireducks query 17' took: 0.15346 s
Code block 'Run fireducks query 18' took: 0.19930 s
Code block 'Run fireducks query 19' took: 0.20121 s
Code block 'Run fireducks query 20' took: 0.27538 s
Code block 'Run fireducks query 21' took: 0.30119 s
Code block 'Run fireducks query 22' took: 0.22134 s
Code block 'Overall execution of ALL fireducks queries' took: 130.80006 s
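The per-query lines in the log above follow a fixed pattern, so they can be collected and summed with a few lines of Python. This is only a minimal sketch: the helper names `parse_timings` and `total_per_engine` are made up for illustration, and the regular expression assumes the log format stays exactly as printed here.

```python
import re

# Matches lines such as: Code block 'Run fireducks query 1' took: 0.20121 s
# The "Overall execution of ALL ..." line does not match and is skipped.
LINE_RE = re.compile(r"Code block 'Run (\w+) query (\d+)' took: ([\d.]+) s")


def parse_timings(log_text):
    """Return {engine: {query_number: seconds}} for every per-query line."""
    timings = {}
    for engine, number, seconds in LINE_RE.findall(log_text):
        timings.setdefault(engine, {})[int(number)] = float(seconds)
    return timings


def total_per_engine(timings):
    """Sum the per-query seconds for each engine."""
    return {engine: sum(per_query.values()) for engine, per_query in timings.items()}


# Three lines copied from the log above, used as sample input.
sample_log = """\
Code block 'Run fireducks query 1' took: 0.20121 s
Code block 'Run fireducks query 2' took: 0.52730 s
Code block 'Overall execution of ALL fireducks queries' took: 130.80006 s
"""

timings = parse_timings(sample_log)
totals = total_per_engine(timings)
print(totals)
```

Applied to the full fireducks log, the 22 per-query times add up to about 4.7 s, far below the reported overall time of 130.8 s, which suggests that the overall figure includes work outside the individual query timers.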

To compare against other tools, the queries directory also contains scripts for duckdb, polars, and others. They are invoked as follows:

PATH_TABLES=data/tables_pyarrow SCALE_FACTOR=0.1 .venv/bin/python -m queries.duckdb
Code block 'Run duckdb query 1' took: 2.36939 s
...
Code block 'Overall execution of ALL duckdb queries' took: 88.98257 s
PATH_TABLES=data/tables_pyarrow SCALE_FACTOR=0.1 .venv/bin/python -m queries.polars
Code block 'Run polars query 1' took: 0.34880 s
...
Code block 'Overall execution of ALL polars queries' took: 61.85478 s

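The fireducks script is forked from the polars one, and I don't know what was changed along the way. In these runs, individual duckdb queries are much slower than polars and fireducks, by a factor of roughly 10, which is hard to believe. Running query 1 directly with the following statements clearly takes less than 1 second: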

import time

import duckdb

# TPC-H query 1, reading the generated Parquet file directly
q1 = """
SELECT l_returnflag, l_linestatus, SUM(l_quantity) AS sum_qty, SUM(l_extendedprice) AS sum_base_price, SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price, SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge, AVG(l_quantity) AS avg_qty, AVG(l_extendedprice) AS avg_price, AVG(l_discount) AS avg_disc, COUNT(*) AS count_order
FROM 'data/tables_pyarrow/scale-0.1/lineitem.parquet' l
WHERE l_shipdate <= CAST('1998-09-02' AS date)
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;
"""

# Time query execution together with printing the result
t = time.time()
df = duckdb.sql(q1)
df.show()
print(time.time() - t)
 .venv/bin/python /par/duckdbq1.py
┌──────────────┬──────────────┬─────────┬────────────────────┬───────────────────┬────────────────────┬────────────────────┬───────────────────┬─────────────────────┬─────────────┐
│ l_returnflag │ l_linestatus │ sum_qty │   sum_base_price   │  sum_disc_price   │     sum_charge     │      avg_qty       │     avg_price     │      avg_disc       │ count_order │
│   varchar    │   varchar    │ int128  │       double       │      double       │       double       │       double       │      double       │       double        │    int64    │
├──────────────┼──────────────┼─────────┼────────────────────┼───────────────────┼────────────────────┼────────────────────┼───────────────────┼─────────────────────┼─────────────┤
│ A            │ F            │ 3774200 │   5320753880.68998 │ 5054096266.682835 │  5256751331.449267 │ 25.537587116854997 │   36002.123829014 │ 0.05014459706345448 │      147790 │
│ N            │ F            │   95257 │ 133737795.83999994 │    127132372.6512 │ 132286291.22944473 │  25.30066401062417 │ 35521.32691633465 │  0.0493944223107573 │        3765 │
│ N            │ O            │ 7459297 │  10512270008.89992 │ 9986238338.384766 │ 10385578376.585476 │ 25.545537671232875 │ 36000.92468801342 │ 0.05009595890418491 │      292000 │
│ R            │ F            │ 3785523 │ 5337950526.4698715 │ 5071818532.942101 │  5274405503.049366 │   25.5259438574251 │ 35994.02921403006 │ 0.04998927856189752 │      148301 │
└──────────────┴──────────────┴─────────┴────────────────────┴───────────────────┴────────────────────┴────────────────────┴───────────────────┴─────────────────────┴─────────────┘
0.6631364822387695
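A single wall-clock measurement like the one above mixes query execution with result printing and any one-off warm-up cost. One way to get a steadier number is to repeat the call and keep the best time. The helper below is only a sketch under that assumption; `time_call` is a hypothetical name and is not part of the benchmark repository.

```python
import time


def time_call(fn, repeats=3):
    """Call fn() `repeats` times and return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        # Keep the fastest run to reduce the influence of one-off overhead.
        best = min(best, time.perf_counter() - start)
    return best, result


# Stand-in workload so the sketch runs without any database installed.
best_seconds, value = time_call(lambda: sum(range(1000)))
print(f"best of 3: {best_seconds:.6f} s, result: {value}")
```

With the q1 string defined earlier, a call such as `time_call(lambda: duckdb.sql(q1).fetchall())` would time execution and materialization of the result without the table rendering done by `show()`.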