Spark: setting the number of partitions (parallelism) and saving one output file per partition
Code
package com.atguigu.bigdata.spark.core.rdd.builder

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Spark01_RDD_Memory_Par {
  def main(args: Array[String]): Unit = {
    // TODO Prepare the environment
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("RDD")
    sparkConf.set("spark.default.parallelism", "5")
    val sc = new SparkContext(sparkConf)

    // TODO Create the RDD
    // RDD parallelism & partitions
    // makeRDD accepts an optional second argument that specifies the number of partitions.
    // If it is omitted, makeRDD falls back to defaultParallelism (the default parallelism):
    //   scheduler.conf.getInt("spark.default.parallelism", totalCores)
    // That is, Spark first looks up spark.default.parallelism in the configuration object;
    // if the property is not set, it uses totalCores, the maximum number of cores
    // available to the application in the current runtime environment.
    //val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)
    val rdd = sc.makeRDD(List(1, 2, 3, 4))

    // Save the processed data as one file per partition
    rdd.saveAsTextFile("output")

    // Keep the application running (e.g. to inspect the partition files and the Spark web UI)
    while (true) {}

    // TODO Shut down the environment
    sc.stop()
  }
}
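For comparison, here is a minimal sketch (the object name Spark01_RDD_Memory_Par_Check and the output directory output2 are mine, not from the original article) that passes the partition count explicitly as the second argument of makeRDD and verifies it with getNumPartitions before writing the partition files:

package com.atguigu.bigdata.spark.core.rdd.builder

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical companion example: explicit partition count instead of the default parallelism
object Spark01_RDD_Memory_Par_Check {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("RDD")
    val sc = new SparkContext(sparkConf)

    // Pass the number of partitions explicitly as the second argument of makeRDD
    val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)

    // Should print 2: the explicit argument takes precedence over
    // spark.default.parallelism / totalCores
    println(rdd.getNumPartitions)

    // Writes one part-0000N file per partition, i.e. part-00000 and part-00001 here
    rdd.saveAsTextFile("output2")

    sc.stop()
  }
}

Note that saveAsTextFile fails if the target directory already exists, so delete output2 between runs.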
pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.zxl.bigdata</groupId>
    <artifactId>spark-core</artifactId>
    <version>1.0.0</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-yarn_2.12</artifactId>
            <version>3.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>3.0.0</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.27</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.12</artifactId>
            <version>3.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>1.2.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.12</artifactId>
            <version>3.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
            <version>3.0.0</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-core</artifactId>
            <version>2.10.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.alibaba/druid -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>druid</artifactId>
            <version>1.1.10</version>
        </dependency>
    </dependencies>
</project>
log4j.properties

log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Set the default spark-shell log level to ERROR. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=ERROR

# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=ERROR
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=ERROR
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=ERROR
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
Summary

When makeRDD is called without a partition count, Spark uses spark.default.parallelism from the SparkConf (set to 5 here, so saveAsTextFile("output") writes five partition files); if that property is not set, it falls back to the total number of cores available to the application. Passing the second argument to makeRDD overrides both.