Getting Started with Spark (8): WordCount
1. WordCount

Count the number of occurrences of each word in a text file and print the results.
2. Maven Configuration
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.mk</groupId>
    <artifactId>spark-test</artifactId>
    <version>1.0</version>
    <name>spark-test</name>
    <url>http://spark.mk.com</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <scala.version>2.11.1</scala.version>
        <spark.version>2.4.4</spark.version>
        <hadoop.version>2.6.0</hadoop.version>
    </properties>

    <dependencies>
        <!-- Scala dependency -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <!-- Spark dependencies -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.11</version>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <pluginManagement>
            <plugins>
                <plugin>
                    <artifactId>maven-clean-plugin</artifactId>
                    <version>3.1.0</version>
                </plugin>
                <plugin>
                    <artifactId>maven-resources-plugin</artifactId>
                    <version>3.0.2</version>
                </plugin>
                <plugin>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.8.0</version>
                </plugin>
                <plugin>
                    <artifactId>maven-surefire-plugin</artifactId>
                    <version>2.22.1</version>
                </plugin>
                <plugin>
                    <artifactId>maven-jar-plugin</artifactId>
                    <version>3.0.2</version>
                </plugin>
            </plugins>
        </pluginManagement>
    </build>
</project>
3. Code
import java.util.Arrays;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

import scala.Tuple2;

public class WordCountApp implements SparkConfInfo {
    public static void main(String[] args) {
        String filePath = "F:\\test\\log.txt";
        SparkSession sparkSession = new WordCountApp().getSparkConf("WordCount");
        Map<String, Integer> wordCountMap = sparkSession.sparkContext()
                .textFile(filePath, 4)
                .toJavaRDD()
                // Split on whitespace and punctuation, keep only alphabetic tokens
                .flatMap(v -> Arrays.asList(v.split("[(\\s+)(\r?\n),.。'’]")).iterator())
                .filter(v -> v.matches("[a-zA-Z-]+"))
                .map(String::toLowerCase)
                .mapToPair(v -> new Tuple2<>(v, 1))
                .reduceByKey(Integer::sum)
                .collectAsMap();
        wordCountMap.forEach((k, v) -> System.out.println(k + ":" + v));
        sparkSession.stop();
    }
}

public interface SparkConfInfo {
    default SparkSession getSparkConf(String appName) {
        SparkConf sparkConf = new SparkConf();
        if (System.getProperty("os.name").toLowerCase().contains("win")) {
            sparkConf.setMaster("local[4]");
            System.out.println("Running Spark in local mode");
        } else {
            sparkConf.setMaster("spark://hadoop01:7077,hadoop02:7077,hadoop03:7077");
            // The driver host must be a local IP reachable from the Spark cluster,
            // e.g. on the same LAN
            sparkConf.set("spark.driver.host", "192.168.150.1");
            // Path of the jar produced by the project build
            sparkConf.setJars(new String[] {".\\out\\artifacts\\spark_test\\spark-test.jar"});
        }
        return SparkSession.builder()
                .appName(appName)
                .config(sparkConf)
                .getOrCreate();
    }
}

File contents:
Spark Streaming is an extension of the core Spark API that enables scalable,high-throughput, fault-tolerant stream processing of live 。data streams. Data, can be ,ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems,Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.This guide shows you how to start writing Spark Streaming programs with DStreams. You can write Spark Streaming programs in Scala, Java or Python (introduced in Spark 1.2), all of which are presented in this guide. You will find tabs throughout this guide that let you choose between code snippets of different languages. databases, and live dashboards. In fact, you can apply Spark’s machine learning and graph processing algorithms on data streams.

Output:
created:1 continuous:1 high-level:3 either:1 reduce:1 many:1 writing:1 learning:1 sources:2 is:2 spark:7 can:6 high-throughput:1 filesystems:1 using:1 rdds:1 of:6 input:1 scala:1 you:5 operations:1 kinesis:2 fact:1 or:4 provides:1 pushed:1 how:1 will:1 join:1 databases:1 window:1 be:4 data:5 from:3 s:1 abstraction:1 languages:1 to:2 all:1 and:5 that:2 fault-tolerant:1 core:1 a:4 expressed:1 internally:1 streaming:4 on:2 dashboards:1 java:1 let:1 processed:2 with:2 write:1 by:1 between:1 in:4 live:2 like:2 represented:1 code:1 are:1 stream:3 algorithms:2 sequence:1 streams:3 graph:1 an:1 flume:2 apply:1 kafka:2 sockets:1 the:1 out:1 presented:1 snippets:1 extension:1 scalable:1 guide:3 dstream:2 choose:1 represents:1 dstreams:3 find:1 shows:1 programs:2 such:1 functions:1 called:1 tcp:1 machine:1 api:1 throughout:1 tabs:1 start:1 which:2 this:3 different:1 processing:2 applying:1 enables:1 complex:1 finally:1 introduced:1 python:1 other:1 discretized:1 map:1 as:2
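The tokenize/filter/count pipeline above can be sanity-checked without a Spark cluster: the same split regex, filter, and counting logic can be reproduced with plain Java streams. This is a minimal sketch (class name and sample input are made up for illustration; the regex and filter are copied verbatim from WordCountApp):

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java mirror of the Spark pipeline: split -> filter -> lowercase -> count.
public class WordCountCheck {
    public static Map<String, Long> countWords(String text) {
        return Arrays.stream(text.split("[(\\s+)(\r?\n),.。'’]"))
                // Same filter as the Spark job: drop empty strings and non-words
                .filter(v -> v.matches("[a-zA-Z-]+"))
                .map(String::toLowerCase)
                // groupingBy + counting plays the role of mapToPair + reduceByKey
                .collect(Collectors.groupingBy(v -> v, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = countWords("Spark is fast. Spark is scalable.");
        counts.forEach((k, v) -> System.out.println(k + ":" + v));
    }
}
```

Running the real job only differs in that the splitting and counting are distributed across the RDD's partitions, so the printed order of keys is not deterministic.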