
Hive small-file analysis

1. Fetch the fsimage file:
hdfs dfsadmin -fetchImage /data/xy/
2. Parse the binary fsimage into delimited text:
hdfs oiv -i /data/xy/fsimage_0000000019891608958 -t /data/xy/tmpdir -o /data/xy/out -p Delimited -delimiter ","
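Before loading the output into Hive, it can help to confirm that every line has the 12 comma-separated fields declared in the table of step 3. The following is a minimal sketch, not part of the original procedure: the function name `check_field_count` and the sample line are hypothetical, and it assumes that no path contains a comma.

```shell
# Hypothetical helper: print "<field count> <number of lines>" for a delimited file.
# Assumes the file is comma-delimited, as produced with -delimiter ",".
check_field_count() {
  awk -F',' '{ counts[NF]++ } END { for (n in counts) print n, counts[n] }' "$1" | sort -n
}

# Usage with a made-up sample line in the 12-column layout.
sample=$(mktemp)
printf '%s\n' '/user/a/f1,3,2024-01-01 10:00,2024-01-01 10:00,134217728,1,1024,0,0,-rw-r--r--,hive,hadoop' > "$sample"
check_field_count "$sample"
rm -f "$sample"
```

If more than one field count is reported for the real file (for example /data/xy/out), some paths probably contain the delimiter, and a different delimiter such as a tab may be a safer choice.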
3. Create the Hive table:
create database if not exists hdfsinfo;
use hdfsinfo;
CREATE TABLE fsimage_info_csv(
path string,
replication int,
modificationtime string,
accesstime string,
preferredblocksize bigint,
blockscount int,
filesize bigint,
nsquota string,
dsquota string,
permission string,
username string,
groupname string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim'=',', 'serialization.format'=',')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

4. Load the HDFS metadata into Hive:
hdfs dfs -put /data/xy/out /user/hive/warehouse/hdfsinfo.db/fsimage_info_csv/
hdfs dfs -ls /user/hive/warehouse/hdfsinfo.db/fsimage_info_csv/
In Hive:
MSCK REPAIR TABLE hdfsinfo.fsimage_info_csv;
select * from hdfsinfo.fsimage_info_csv limit 5;

5. Count the small files under each leaf directory (smaller than 4194304 bytes, i.e. < 4 MB):
SELECT
  dir_path,
  COUNT(*) AS small_file_num,
  modificationtime,
  accesstime
FROM (
  SELECT
    modificationtime,
    accesstime,
    relative_size,
    dir_path
  FROM (
    SELECT
      (CASE filesize < 4194304 WHEN TRUE THEN 'small' ELSE 'large' END) AS relative_size,
      modificationtime,
      accesstime,
      split(
        substr(
          concat_ws('/', split(PATH, '/')),
          1,
          length(concat_ws('/', split(PATH, '/'))) - length(last_element) - 1
        ),
        ',')[0] AS dir_path
    FROM (
      SELECT
        modificationtime,
        accesstime,
        filesize,
        PATH,
        split(PATH, '/')[size(split(PATH, '/')) - 1] AS last_element
      FROM hdfsinfo.fsimage_info_csv
    ) t0
  ) t1
  WHERE relative_size = 'small'
) t2
GROUP BY dir_path, modificationtime, accesstime
ORDER BY small_file_num DESC
LIMIT 500;

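6. Count the small files under each leaf directory, excluding directory entries (the query below uses a threshold of 41943040 bytes, i.e. < 40 MB):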
SELECT
  dir_path,
  COUNT(*) AS small_file_num
FROM (
  SELECT
    relative_size,
    dir_path
  FROM (
    SELECT
      (CASE filesize < 41943040 WHEN TRUE THEN 'small' ELSE 'large' END) AS relative_size,
      split(
        substr(
          concat_ws('/', split(PATH, '/')),
          1,
          length(concat_ws('/', split(PATH, '/'))) - length(last_element) - 1
        ),
        ',')[0] AS dir_path
    FROM (
      SELECT
        filesize,
        PATH,
        split(PATH, '/')[size(split(PATH, '/')) - 1] AS last_element
      FROM hdfsinfo.fsimage_info_csv
      WHERE permission NOT LIKE 'd%'
    ) t0
  ) t1
  WHERE relative_size = 'small'
) t2
GROUP BY dir_path
ORDER BY small_file_num DESC
LIMIT 50000;
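
As a rough cross-check of the Hive result on a small sample, the same grouping can be approximated locally with awk on the delimited file. This is only a sketch under assumptions: the helper name `count_small_files`, its threshold argument, and the sample rows are illustrative, and paths are assumed to contain no commas.

```shell
# Hypothetical helper: count files below a size threshold per parent directory.
# $1 = comma-delimited oiv output, $2 = size threshold in bytes.
# Column 1 is the path, column 7 the file size, column 10 the permission string.
count_small_files() {
  awk -F',' -v limit="$2" '
    $10 !~ /^d/ && ($7 + 0) < (limit + 0) {
      dir = $1
      sub("/[^/]*$", "", dir)   # strip the last path component
      n[dir]++
    }
    END { for (d in n) print n[d], d }
  ' "$1" | sort -k1,1nr -k2,2
}

# Usage with made-up rows: one directory entry, two small files, one large file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
/warehouse/t1,0,2024-01-01 10:00,1970-01-01 08:00,0,0,0,-1,-1,drwxr-xr-x,hive,hadoop
/warehouse/t1/a,3,2024-01-01 10:00,2024-01-01 10:00,134217728,1,1024,0,0,-rw-r--r--,hive,hadoop
/warehouse/t1/b,3,2024-01-01 10:00,2024-01-01 10:00,134217728,1,2048,0,0,-rw-r--r--,hive,hadoop
/warehouse/t1/c,3,2024-01-01 10:00,2024-01-01 10:00,134217728,2,268435456,0,0,-rw-r--r--,hive,hadoop
EOF
count_small_files "$sample" 4194304
rm -f "$sample"
```

Like the second query, the sketch skips rows whose permission string starts with `d`, so directory entries (which have a file size of 0) are not counted as small files.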
