High disk I/O on a YARN cluster can lead to many undesirable symptoms: applications slow down dramatically, and HDFS operations show high latency even when the files being operated on are very small. What is happening? It is often a shuffle problem. For example, Spark is our main compute […]
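When diagnosing this kind of disk pressure, the shuffle files live under the NodeManager local directories (`yarn.nodemanager.local-dirs`). A minimal sketch for spotting oversized shuffle output on one node — the `shuffle_` file-name prefix follows Spark's default naming, and the root path is an assumption you should replace with your cluster's configured local dirs:

```python
import os

def find_large_shuffle_files(root, min_bytes=256 * 1024 * 1024):
    """Walk a NodeManager local dir and return (path, size) pairs for
    Spark shuffle files above a size threshold, largest first."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Spark shuffle output files are named shuffle_*.data / shuffle_*.index
            if name.startswith("shuffle_"):
                path = os.path.join(dirpath, name)
                size = os.path.getsize(path)
                if size >= min_bytes:
                    hits.append((path, size))
    return sorted(hits, key=lambda t: -t[1])
```

Running this against each local dir (e.g. `find_large_shuffle_files("/data/yarn/local")`, path hypothetical) quickly shows whether a few skewed shuffle partitions are responsible for the I/O spike.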
References:

- https://xie.infoq.cn/article/fad821a83e19c6478458e0b03
- https://cloud.tencent.com/developer/article/1791911
- https://aws.amazon.com/cn/blogs/china/application-and-actual-combat-of-new-features-of-apache-spark-3-0-in-freewheels-core-business-data-team/
- Trino CBO: https://trino.io/blog/2019/07/04/cbo-introduction.html
- Spark CBO: https://docs.databricks.com/spark/latest/spark-sql/cbo.html
- Spark CBO: http://www.jasongj.com/spark/cbo/

On the cost of upgrading: every company's big-data architecture is different, and a unified, centrally managed architecture adapts more easily to compute-engine upgrades. See the FreeWheel data platform in the article above — its design made the Spark upgrade relatively painless.

Whether to upgrade at all comes down to the company's cost and migration difficulty. Spark 3.0 ships many excellent features that raise cluster resource utilization while improving data-processing efficiency; for clusters running in the cloud, this can translate into substantial savings.

AQE solves a long-standing problem. So the old question, "Flink or Spark?", comes up again. After more than four years of streaming work, my conclusion is: for batch, Spark still wins, and Spark Streaming fully covers the vast majority of business scenarios, so there is usually no need to adopt Flink, which still demands too much of developers. How well a team uses Flink checkpoints determines whether a Lambda architecture still needs to be built; it is safe to say that most second-tier cities or second-tier companies do not yet have the capability for full stream-batch unification.
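For concreteness, the AQE behavior discussed above is switched on through a handful of SQL configs in Spark 3.0. A minimal `spark-defaults.conf` sketch — the property names are Spark's, but the threshold values here are illustrative assumptions to be tuned per workload:

```properties
# Enable Adaptive Query Execution (off by default in Spark 3.0)
spark.sql.adaptive.enabled                     true
# Coalesce small post-shuffle partitions at runtime
spark.sql.adaptive.coalescePartitions.enabled  true
# Split skewed shuffle partitions in sort-merge joins
spark.sql.adaptive.skewJoin.enabled            true
# Target post-shuffle partition size (illustrative value)
spark.sql.adaptive.advisoryPartitionSizeInBytes 64m
```

The same keys can be set per-session via `spark.conf.set(...)` instead of cluster-wide, which is often the safer first step during a migration.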