
CarbonData connection count optimization

news 2025/8/4 8:51:14

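I. Background

  CarbonData ingestion is served through the CarbonData Thrift Server. Some ingested segments are abnormal yet still show a status of success, so the script described in another blog post runs every day to check them. That script started hitting connection timeouts and no longer ran normally. Investigation showed that it opens too many connections each day, because every daily run traverses all of the segments.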

II. Optimization strategies

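a. Strategy 1:
1. Add a Spark scheduler pool
In Spark, a scheduler pool assigns jobs to separate resource pools in order to control their execution priority. Configuring pools helps manage resource contention between jobs. To use them, you need to enable the Fair Scheduler and create the corresponding pool configuration file.
1-1 Set the scheduler pool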
spark.sql.hive.thriftServer.scheduler.pool=my-pool
1-2 Configure the scheduler pool file
cp fairscheduler.xml.template fairscheduler.xml

 <pool name="my-pool">
   <schedulingMode>FAIR</schedulingMode>
   <weight>1</weight>
   <minShare>3</minShare>
   <maxRunningApps>50</maxRunningApps>
   <maxResources>100g,50</maxResources>
   <minResources>4g,8</minResources>
   <fairSharePreemptionTimeout>300</fairSharePreemptionTimeout>
   <minSharePreemptionTimeout>120</minSharePreemptionTimeout>
   <fairSharePreemptionThreshold>0.5</fairSharePreemptionThreshold>
 </pool>
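Before restarting the Thrift Server it can help to confirm that the pool named in the configuration really exists in fairscheduler.xml. The following is a minimal Python sketch of such a check, not part of the original setup: the function name `inspect_pool` and the sample XML are hypothetical. Spark's job-scheduling documentation lists `schedulingMode`, `weight` and `minShare` as the per-pool properties, so the sketch also reports any other elements it finds, so that they can be double-checked.

```python
import xml.etree.ElementTree as ET

# Per-pool properties listed in Spark's job-scheduling documentation.
SPARK_POOL_PROPERTIES = {"schedulingMode", "weight", "minShare"}


def inspect_pool(xml_text, pool_name):
    """Return (settings, unrecognized) for pool_name, or None if it is absent.

    settings maps each child element of the pool to its text value;
    unrecognized lists the element names outside SPARK_POOL_PROPERTIES.
    """
    root = ET.fromstring(xml_text)
    for pool in root.iter("pool"):
        if pool.get("name") != pool_name:
            continue
        settings = {child.tag: (child.text or "").strip() for child in pool}
        unrecognized = sorted(set(settings) - SPARK_POOL_PROPERTIES)
        return settings, unrecognized
    return None


if __name__ == "__main__":
    # Hypothetical sample; in practice read the text of conf/fairscheduler.xml.
    sample = """<allocations>
      <pool name="my-pool">
        <schedulingMode>FAIR</schedulingMode>
        <weight>1</weight>
        <minShare>3</minShare>
        <maxRunningApps>50</maxRunningApps>
      </pool>
    </allocations>"""
    print(inspect_pool(sample, "my-pool"))
```

A `None` result means the pool name used in the Spark configuration has no matching entry in the allocation file.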

2. Enable asynchronous mode to improve concurrency: spark.sql.hive.thriftServer.async = true
3. Configure in spark-defaults.conf


```properties
spark.sql.hive.thriftServer.scheduler.pool=my-pool
spark.sql.hive.thriftServer.thrift.port=10000
spark.sql.hive.thriftServer.idleSessionTimeout=3600
spark.sql.hive.thriftServer.async=true
```

4. Launch command:
/bin/spark-submit --master yarn \
  --conf spark.driver.maxResultSize=20g \
  --conf spark.sql.hive.thriftServer.scheduler.pool=my-pool \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.scheduler.allocation.file=$SPARK_HOME/conf/fairscheduler.xml \
  --conf spark.sql.shuffle.partitions=50 \
  --driver-memory 25g --executor-cores 4 --executor-memory 5G --num-executors 10 \
  --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
  $SPARK_HOME/carbonlib/apache-carbondata-2.X-bin-sparkx-hadoop2.x.x.jar
The pool is selected by specifying spark.sql.hive.thriftServer.scheduler.pool.
5. Verify by checking whether the log contains "create pool" and "Removed from pool" entries
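The verification step can be scripted. Below is a rough Python sketch that counts pool-related lines in a driver log. The helper name `count_pool_events`, the regular expressions and the sample lines are all assumptions made for illustration; the exact wording of the log messages may differ between Spark versions, so the patterns should be adjusted to what the real log shows.

```python
import re

# Patterns derived from the two phrases in the verification step.
# Matching is case-insensitive because the exact log wording is an assumption.
PATTERNS = {
    "created": re.compile(r"created? pool", re.IGNORECASE),
    "removed": re.compile(r"removed .*from pool", re.IGNORECASE),
}


def count_pool_events(lines, pool_name):
    """Count log lines that mention pool_name and match each pattern."""
    counts = {key: 0 for key in PATTERNS}
    for line in lines:
        if pool_name not in line:
            continue  # ignore messages about other pools
        for key, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[key] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines, not copied from a real log.
    sample = [
        "INFO Created pool: my-pool, schedulingMode: FAIR",
        "INFO Removed TaskSet 1.0 from pool my-pool",
        "INFO Created pool: default",
    ]
    print(count_pool_events(sample, "my-pool"))
```

If both counts stay at zero after queries have run through the Thrift Server, the jobs are probably not being routed to the pool.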
b. Strategy 2: try load balancing through ZooKeeper (zk); this still needs to be tested.
