Reading Large Excel Files in Chunks with Python

news 2025/8/6 8:27:10

Contents

1. Core Features
2. Usage Example
3. Performance Tips
4. Use Cases
5. Related Articles

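Conventional approaches that load an entire workbook at once can run out of memory on large Excel files. The method below reads the file in chunks, which makes it suitable for Excel files of several hundred MB or even GB in size.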

1. Core Features

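Memory efficient: processes the file chunk by chunk instead of loading the whole file into memory at once
Configurable: chunk size, worksheet selection, and header row can all be customized
Generator based: returns data lazily through iteration, which suits stream-style processing
Automatic cleanup: closes the workbook when reading ends, so file resources are not leaked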
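The chunk-plus-generator idea behind these features can be illustrated without any Excel dependency. The following is only a minimal sketch of the pattern, not the reader itself; `iter_chunks` is a hypothetical helper name introduced here for illustration, and it works over any iterable of rows.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_chunks(rows: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most chunk_size items from rows.

    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    it = iter(rows)
    while True:
        # Pull at most chunk_size items; earlier chunks are no longer referenced here
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


# Collected into a list here only to show the result; in real use, iterate lazily.
chunks = list(iter_chunks(range(7), chunk_size=3))
```

Because the function is a generator, a consumer that loops over it directly holds only one chunk in memory at a time, which is the property the feature list relies on.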

2. Usage Example

import pandas as pd
from openpyxl import load_workbook
from typing import Generator, Optional


def read_excel_in_chunks(
    file_path: str,
    chunk_size: int = 1000,
    sheet_name: Optional[str] = None,
    header: Optional[int] = 0,
) -> Generator[pd.DataFrame, None, None]:
    """Read a large Excel file in chunks to avoid running out of memory.

    Args:
        file_path (str): Path to the Excel file.
        chunk_size (int): Number of rows per chunk (default 1000).
        sheet_name (str): Worksheet name (defaults to the active worksheet).
        header (int): Zero-based index of the header row (default 0; set to None if there is no header).

    Yields:
        pd.DataFrame: One DataFrame per chunk.
    """
    # Open the workbook in read-only mode
    wb = load_workbook(file_path, read_only=True)
    try:
        # Select the worksheet
        if sheet_name is not None:
            sheet = wb[sheet_name]
        else:
            sheet = wb.active  # the active worksheet (usually the first)

        # Read the header row, if there is one
        headers = None
        if header is not None:
            for row in sheet.iter_rows(min_row=header + 1, max_row=header + 1, values_only=True):
                headers = list(row)

        # Read the data in chunks
        chunk_data = []
        # openpyxl rows are 1-based, so data starts on the row after the header
        start_row = header + 2 if header is not None else 1
        for i, row in enumerate(sheet.iter_rows(min_row=start_row, values_only=True), start=1):
            chunk_data.append(row)
            # Emit a DataFrame every chunk_size rows
            if i % chunk_size == 0:
                yield pd.DataFrame(chunk_data, columns=headers)
                chunk_data = []

        # Emit the remaining rows (fewer than chunk_size)
        if chunk_data:
            yield pd.DataFrame(chunk_data, columns=headers)
    finally:
        # Close the workbook even if the caller stops iterating early
        wb.close()


for df_chunk in read_excel_in_chunks('big.xlsx', chunk_size=1000):
    # Each chunk can be converted to a list of dicts
    data = df_chunk.to_dict(orient='records')
    ...
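One possible way to consume the per-chunk record lists is to append them to an output file as they arrive, so that no more than one chunk is held at a time. The sketch below is an assumption about downstream use, not part of the original example; `write_record_chunks` is a hypothetical helper, and the sample data only stands in for what `df_chunk.to_dict(orient='records')` would return.

```python
import csv
import io
from typing import Dict, Iterable, List, TextIO


def write_record_chunks(chunks: Iterable[List[Dict[str, object]]], out: TextIO) -> int:
    """Append each chunk of records to out as CSV and return the row count.

    Hypothetical helper for illustration only.
    """
    writer = None
    total = 0
    for records in chunks:
        if not records:
            continue
        if writer is None:
            # Column order is taken from the first record seen
            writer = csv.DictWriter(out, fieldnames=list(records[0].keys()))
            writer.writeheader()
        writer.writerows(records)
        total += len(records)
    return total


# Stand-in data shaped like the output of df_chunk.to_dict(orient='records')
sample_chunks = [
    [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
    [{"id": 3, "name": "c"}],
]
buffer = io.StringIO()
row_count = write_record_chunks(sample_chunks, buffer)
```

With the reader above, the `chunks` argument could be a generator expression such as `(df.to_dict(orient='records') for df in read_excel_in_chunks('big.xlsx'))`, and `out` a file opened with `newline=''`, as the `csv` module recommends.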