High disk I/O on a YARN cluster can lead to many undesirable symptoms: applications slow down dramatically, and HDFS operations show high latency even when the files being operated on are very small. What is happening? It is often a shuffle problem. For example, Spark is our main compute […]
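When diagnosing this kind of disk pressure, the shuffle files live under the NodeManager local directories (`yarn.nodemanager.local-dirs`). A minimal sketch for spotting oversized shuffle output on one node — the `shuffle_` file-name prefix follows Spark's default naming, and the root path is an assumption you should replace with your cluster's configured local dirs:

```python
import os

def find_large_shuffle_files(root, min_bytes=256 * 1024 * 1024):
    """Walk a NodeManager local dir and return (path, size) pairs for
    Spark shuffle files above a size threshold, largest first."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Spark shuffle output files are named shuffle_*.data / shuffle_*.index
            if name.startswith("shuffle_"):
                path = os.path.join(dirpath, name)
                size = os.path.getsize(path)
                if size >= min_bytes:
                    hits.append((path, size))
    return sorted(hits, key=lambda t: -t[1])
```

Running this against each local dir (e.g. `find_large_shuffle_files("/data/yarn/local")`, path hypothetical) quickly shows whether a few skewed shuffle partitions are responsible for the I/O spike.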
References:

- https://xie.infoq.cn/article/fad821a83e19c6478458e0b03
- https://cloud.tencent.com/developer/article/1791911
- https://aws.amazon.com/cn/blogs/china/application-and-actual-combat-of-new-features-of-apache-spark-3-0-in-freewheels-core-business-data-team/
- Trino CBO: https://trino.io/blog/2019/07/04/cbo-introduction.html
- Spark CBO: https://docs.databricks.com/spark/latest/spark-sql/cbo.html
- Spark CBO: http://www.jasongj.com/spark/cbo/

On the cost of upgrading: every company's big-data architecture is different, and a unified, centrally managed architecture adapts more easily to compute-engine upgrades. See the FreeWheel data platform in the article above — its design made the Spark upgrade relatively painless.

Whether to upgrade at all comes down to the company's cost and migration difficulty. Spark 3.0 ships many excellent features that raise cluster resource utilization while improving data-processing efficiency; for clusters running in the cloud, this can translate into substantial savings.

AQE solves a long-standing problem. So the old question, "Flink or Spark?", comes up again. After more than four years of streaming work, my conclusion is: for batch, Spark still wins, and Spark Streaming fully covers the vast majority of business scenarios, so there is usually no need to adopt Flink, which still demands too much of developers. How well a team uses Flink checkpoints determines whether a Lambda architecture still needs to be built; it is safe to say that most second-tier cities or second-tier companies do not yet have the capability for full stream-batch unification.
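For concreteness, the AQE behavior discussed above is switched on through a handful of SQL configs in Spark 3.0. A minimal `spark-defaults.conf` sketch — the property names are Spark's, but the threshold values here are illustrative assumptions to be tuned per workload:

```properties
# Enable Adaptive Query Execution (off by default in Spark 3.0)
spark.sql.adaptive.enabled                     true
# Coalesce small post-shuffle partitions at runtime
spark.sql.adaptive.coalescePartitions.enabled  true
# Split skewed shuffle partitions in sort-merge joins
spark.sql.adaptive.skewJoin.enabled            true
# Target post-shuffle partition size (illustrative value)
spark.sql.adaptive.advisoryPartitionSizeInBytes 64m
```

The same keys can be set per-session via `spark.conf.set(...)` instead of cluster-wide, which is often the safer first step during a migration.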