IDE: IntelliJ IDEA 2024.2.3
JDK 17: in IDEA, go to Project Structure / Project Settings / Project / SDK and select corretto-17.
Maven: 3.9.9. In Settings / Build, Execution, Deployment / Build Tools / Maven, set "User settings file" to apache-maven-3.9.9\conf\settings.xml.
In that file, configure the Aliyun mirror:
<?xml version="1.0" encoding="UTF-8"?>
<settings xmlns="http://maven.apache.org/SETTINGS/1.2.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.2.0 https://maven.apache.org/xsd/settings-1.2.0.xsd">

  <!-- Local repository where Maven stores downloaded artifacts.
       Default: ${user.home}/.m2/repository -->
  <localRepository>D:\maven_store</localRepository>

  <pluginGroups/>
  <proxies/>
  <servers/>

  <mirrors>
    <!-- Block external repositories that still use plain HTTP -->
    <mirror>
      <id>maven-default-http-blocker</id>
      <mirrorOf>external:http:*</mirrorOf>
      <name>Pseudo repository to mirror external repositories initially using HTTP.</name>
      <url>http://0.0.0.0/</url>
      <blocked>true</blocked>
    </mirror>
    <!-- Aliyun mirror of Maven Central -->
    <mirror>
      <id>aliyun</id>
      <name>Aliyun Maven</name>
      <url>https://maven.aliyun.com/repository/public</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
    <mirror>
      <id>aspose</id>
      <name>aspose maven</name>
      <url>https://repository.aspose.com/repo/</url>
      <mirrorOf>repo</mirrorOf>
    </mirror>
  </mirrors>

  <profiles>
    <!-- Extra Aliyun repository used when building on JDK 21 -->
    <profile>
      <id>jdk-21</id>
      <activation>
        <jdk>21</jdk>
      </activation>
      <repositories>
        <repository>
          <id>aliyun-jdk21</id>
          <name>Aliyun Maven Repository for JDK 21 builds</name>
          <url>https://maven.aliyun.com/repository/public</url>
          <layout>default</layout>
        </repository>
      </repositories>
    </profile>
    <!-- Active by default: Aliyun central repository -->
    <profile>
      <id>default-profile</id>
      <activation>
        <activeByDefault>true</activeByDefault>
      </activation>
      <repositories>
        <repository>
          <id>aliyun-central</id>
          <name>Aliyun Maven Central Repository</name>
          <url>https://maven.aliyun.com/repository/central</url>
          <layout>default</layout>
        </repository>
      </repositories>
    </profile>
  </profiles>
</settings>
In Settings / Plugins, search for and install the Scala and Spark plugins:
Configure the Scala environment
Install the Scala plugin: search for "Scala" under Plugins and install it.
Configure the Spark environment
Install the Spark plugin: search for "Spark" under Plugins and install it.
Install SBT
SBT (Simple Build Tool) is a build tool for Scala and Java projects. It is similar to Maven and Gradle, but is designed specifically for Scala while still supporting Java projects well.
Download and install SBT: visit the official SBT website and download the installer for Windows.
After installation, make sure SBT's installation path has been added to the system PATH environment variable:
Open "System Properties" -> "Advanced" -> "Environment Variables".
Under "System variables", find Path and click "Edit".
Add SBT's installation path (e.g., C:\Program Files (x86)\sbt\bin).
Restart the command-line window, then run the following command to verify the installation:
sbt --version
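For reference, the Maven project built later in this guide has a straightforward SBT equivalent. A minimal build.sbt sketch (the project name is illustrative; the Scala and Spark versions match the ones used later in this guide):

```scala
// build.sbt — minimal illustrative sketch for a local Spark project
ThisBuild / scalaVersion := "2.13.11"

lazy val root = (project in file("."))
  .settings(
    name := "wordcount-sbt",
    libraryDependencies ++= Seq(
      // For cluster deployment these would usually be marked % "provided";
      // for running locally from the IDE they are kept on the compile classpath
      "org.apache.spark" %% "spark-core" % "3.5.2",
      "org.apache.spark" %% "spark-sql"  % "3.5.2"
    )
  )
```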
Installation and configuration suggestions
Step 1: Download Hadoop
Visit the official Apache Hadoop website and download the target version of Hadoop.
For example, download Hadoop 3.3.6.
Step 2: Set environment variables
Set the HADOOP_HOME environment variable to point to the Hadoop installation directory.
Add %HADOOP_HOME%\bin to the system PATH environment variable.
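Instead of clicking through the Environment Variables dialog, both variables can be persisted from a Windows command prompt with setx. The C:\hadoop path below is an assumption matching the winutils example that follows; adjust it to your actual install directory, and note that setx only affects newly opened windows:

```shell
:: Persist HADOOP_HOME and extend PATH (Windows cmd; open a new window afterwards)
setx HADOOP_HOME "C:\hadoop"
setx PATH "%PATH%;%HADOOP_HOME%\bin"
```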
When running on Windows, Hadoop needs winutils.exe to emulate Linux file-system operations. To obtain and configure winutils.exe:
(1) Download winutils.exe
Visit the "Hadoop for Windows" project on GitHub, which provides Hadoop tools built specifically for Windows.
Download the winutils.exe that matches your Hadoop version (e.g., 3.3.6).
(2) Place winutils.exe
Create a directory for winutils.exe, for example:
C:\hadoop\bin
Put the downloaded winutils.exe file into that directory.
(3) Configure HADOOP_HOME and PATH
Make sure HADOOP_HOME points to the Hadoop installation directory (e.g., C:\hadoop).
Make sure PATH contains %HADOOP_HOME%\bin.
(4) Verify winutils.exe
Open a command prompt and run winutils.exe with no arguments; if its usage help is printed, the tool is installed and on the PATH:
winutils.exe
Create a Maven project
Open IDEA and choose Create New Project.
Select Maven, check Create from archetype, and choose org.scala-tools.archetypes:scala-archetype-simple.
Enter the project name and location.
Configure project dependencies
Open the project's pom.xml file.
Add the Spark dependencies and the Scala plugin configuration:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>wordcount</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <!-- Target JDK 17 -->
        <maven.compiler.source>17</maven.compiler.source>
        <maven.compiler.target>17</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <!-- Scala version -->
        <scala.version>2.13.11</scala.version>
        <!-- Spark version -->
        <spark.version>3.5.2</spark.version>
    </properties>

    <dependencies>
        <!-- Log4j 2 backend plus its SLF4J 2 binding -->
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-slf4j2-impl</artifactId>
            <version>2.20.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <version>2.20.0</version>
        </dependency>
        <!-- Scala Library -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <!-- Scala Compiler (if needed) -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-compiler</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <!-- Scala Reflect -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-reflect</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <!-- Compiler Bridge -->
        <dependency>
            <groupId>org.scala-sbt</groupId>
            <artifactId>compiler-bridge_2.13</artifactId>
            <version>1.9.6</version>
        </dependency>
        <!-- Spark Core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.13</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- Spark SQL (needed for SparkSession) -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.13</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <!-- scala-maven-plugin compiles the Scala sources -->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>4.9.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <scalaVersion>${scala.version}</scalaVersion>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
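With this pom in place, the project can be compiled and packaged from a terminal in the project root; this uses only the standard Maven lifecycle, nothing project-specific:

```shell
# Compile Scala sources via scala-maven-plugin and build the jar into target/
mvn clean package
```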
Configure Log4j 2.x in the Spark project: create a log4j2.xml file under src/main/resources (Log4j 2 picks it up from the classpath automatically):
<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="WARN">
    <Appenders>
        <!-- Console output -->
        <Console name="Console" target="SYSTEM_OUT">
            <PatternLayout pattern="%d{yyyy-MM-dd HH:mm:ss} [%t] %-5level %logger{36} - %msg%n"/>
        </Console>
        <!-- File output -->
        <File name="LogFile" fileName="logs/app.log">
            <PatternLayout pattern="%d{yyyy-MM-dd HH:mm:ss} [%t] %-5level %logger{36} - %msg%n"/>
        </File>
    </Appenders>
    <Loggers>
        <!-- Root log level -->
        <Root level="info">
            <AppenderRef ref="Console"/>
            <AppenderRef ref="LogFile"/>
        </Root>
        <!-- Quieten Spark's own logging -->
        <Logger name="org.apache.spark" level="warn" additivity="false">
            <AppenderRef ref="Console"/>
        </Logger>
    </Loggers>
</Configuration>
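With the log4j-slf4j2-impl binding from the pom on the classpath, application code logs through the SLF4J API and the configuration above takes effect. A minimal sketch (the object and messages are illustrative):

```scala
import org.slf4j.LoggerFactory

object LoggingDemo {
  // SLF4J routes to the Log4j 2 backend via log4j-slf4j2-impl
  private val log = LoggerFactory.getLogger(getClass)

  def main(args: Array[String]): Unit = {
    log.info("application started")  // written to Console and logs/app.log per log4j2.xml
    log.debug("suppressed: Root level is info")
  }
}
```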
Write the WordCount program
Create a package under src/main/scala, for example com.example.spark.
Create a WordCount.scala file in that package with the following code:
package com.example.spark

import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import java.io.File

object WordCount {
  def main(args: Array[String]): Unit = {
    // SparkSession is the entry point in Spark 2.x and 3.x
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]") // local mode, using all available cores
      .getOrCreate()

    try {
      // Resolve input and output paths relative to the working directory
      val currentDir = System.getProperty("user.dir")
      val inputPath = s"$currentDir/data/input.txt"
      val outputPath = s"$currentDir/output"

      // Delete the output directory if it already exists
      // (saveAsTextFile fails when the target directory is present)
      val outputDir = new File(outputPath)
      if (outputDir.exists()) {
        deleteDirectory(outputDir)
      }

      // Read the text file
      val text: RDD[String] = spark.sparkContext.textFile(inputPath)

      // Word-count logic
      val counts: RDD[(String, Int)] = text
        .flatMap(line => line.split("\\s+")) // split on whitespace
        .map(word => (word, 1))
        .reduceByKey(_ + _)

      // Take the N most frequent words (e.g. top 10)
      val topN = 10
      val topWords = counts.takeOrdered(topN)(Ordering.by[(String, Int), Int] {
        case (_, count) => -count
      })

      println("Top N Words:")
      topWords.foreach { case (word, count) => println(s"$word -> $count") }

      // Save the full counts to the output directory
      counts.saveAsTextFile(outputPath)
    } finally {
      // Stop the SparkSession
      spark.stop()
    }
  }

  /**
   * Recursively delete a directory and its contents.
   *
   * @param file target file or directory
   */
  def deleteDirectory(file: File): Boolean = {
    if (file.isDirectory) {
      file.listFiles().foreach(child => deleteDirectory(child)) // recurse into children
    }
    file.delete() // delete the now-empty directory or file
  }
}
The java.lang.IllegalAccessError error
class org.apache.spark.storage.StorageUtils$ (in unnamed module @0x3e27ba32) cannot access class sun.nio.ch.DirectBuffer (in module java.base) because module java.base does not export sun.nio.ch to unnamed module @0x3e27ba32
Cause:
This is a Java module-system access restriction, and it typically occurs on Java 9 and later.
Spark uses the sun.nio.ch.DirectBuffer class, whose access is restricted in Java 9+.
If you must use Java 11+:
Add JVM arguments that open the restricted modules.
When launching Spark, add:
--add-opens java.base/sun.nio.ch=ALL-UNNAMED
In IntelliJ IDEA, add the same argument in the run configuration:
Open Run/Debug Configurations.
In VM options, add:
--add-opens java.base/sun.nio.ch=ALL-UNNAMED
In total, several such flags ended up being needed:
--add-opens java.base/java.io=ALL-UNNAMED --add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.nio=ALL-UNNAMED --add-opens java.base/java.util=ALL-UNNAMED --add-opens java.base/sun.nio.ch=ALL-UNNAMED --add-opens java.base/jdk.internal.ref=ALL-UNNAMED --add-opens java.base/java.lang.invoke=ALL-UNNAMED
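When launching from a terminal via spark-submit rather than from IDEA, the same opens can be passed to the driver JVM through --driver-java-options. A sketch, assuming the jar and main class from this guide's project (paths are illustrative; add the remaining --add-opens flags from the list above as needed):

```shell
spark-submit \
  --class com.example.spark.WordCount \
  --master "local[*]" \
  --driver-java-options "--add-opens java.base/sun.nio.ch=ALL-UNNAMED --add-opens java.base/java.nio=ALL-UNNAMED" \
  target/wordcount-1.0-SNAPSHOT.jar
```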