Apache Beam Java SDK Quickstart
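Set up your development environment
- Download and install the Java Development Kit (JDK) version 1.7 or later. Verify that the JAVA_HOME environment variable is set and points to your JDK installation.
- Download and install Apache Maven by following Maven's installation guide for your operating system.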
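The two prerequisites above can be checked from a terminal before going further. The following is a minimal sketch, not part of the original quickstart: the function names `have_cmd` and `check_prereqs` are hypothetical, and the script only checks that JAVA_HOME points to a directory and that `java` and `mvn` are on the PATH. It does not verify the JDK version.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the named command is available on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: check JAVA_HOME, java, and mvn.
# Prints one line per check and returns non-zero if anything is missing.
check_prereqs() {
  status=0
  if [ -z "${JAVA_HOME:-}" ] || [ ! -d "$JAVA_HOME" ]; then
    echo "JAVA_HOME is unset or does not point to a directory"
    status=1
  else
    echo "JAVA_HOME=$JAVA_HOME"
  fi
  for tool in java mvn; do
    if have_cmd "$tool"; then
      echo "found $tool: $(command -v "$tool")"
    else
      echo "missing $tool on PATH"
      status=1
    fi
  done
  return "$status"
}
```

After sourcing the file, running `check_prereqs` reports what was found. `java -version` and `mvn -version` then show the installed versions.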
Get the WordCount code
$ mvn archetype:generate \
      -DarchetypeGroupId=org.apache.beam \
      -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
      -DarchetypeVersion=2.1.0 \
      -DgroupId=org.example \
      -DartifactId=word-count-beam \
      -Dversion="0.1" \
      -Dpackage=org.apache.beam.examples \
      -DinteractiveMode=false

$ cd word-count-beam/

$ ls
pom.xml  src

$ ls src/main/java/org/apache/beam/examples/
DebuggingWordCount.java  WindowedWordCount.java  common
MinimalWordCount.java    WordCount.java
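Run WordCount
- Make sure the runner you have chosen is set up correctly.
- Build your command line:
  - Specify the runner with the --runner=<runner> option (defaults to DirectRunner).
  - Add any options that the runner requires.
  - Choose input files and an output location that the runner can access. (For example, a pipeline running on an external cluster cannot read local files.)
- Run your first WordCount pipeline.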
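The command-line steps above can be sketched as a small shell helper. This is an illustration only, not part of the quickstart: `wordcount_cmd` is a hypothetical name, it assumes the input and output paths used by the local examples (`pom.xml` and `counts`), and it only prints the Maven command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: print the mvn command line for a given runner.
#   $1 = runner name passed to --runner (e.g. DirectRunner)
#   $2 = Maven profile that pulls in that runner (e.g. direct-runner)
#   $3 = optional extra pipeline options required by the runner
wordcount_cmd() {
  runner=$1
  profile=$2
  extra=${3:-}
  # Assumed local input and output locations, as in the local examples.
  args="--runner=$runner --inputFile=pom.xml --output=counts"
  if [ -n "$extra" ]; then
    args="$args $extra"
  fi
  printf '%s\n' "mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args=\"$args\" -P$profile"
}
```

For example, `wordcount_cmd SparkRunner spark-runner` prints a single-line command with `--runner=SparkRunner` and `-Pspark-runner`. A runner on an external cluster needs different input and output paths, so the hard-coded local paths would have to change there.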
DirectRunner:
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--inputFile=pom.xml --output=counts" -Pdirect-runner

ApexRunner:
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--inputFile=pom.xml --output=counts --runner=ApexRunner" -Papex-runner

FlinkRunner (local):
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=FlinkRunner --inputFile=pom.xml --output=counts" -Pflink-runner

FlinkRunner (cluster):
$ mvn package exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=FlinkRunner --flinkMaster=<flink master> --filesToStage=target/word-count-beam-bundled-0.1.jar \
                  --inputFile=/path/to/quickstart/pom.xml --output=/tmp/counts" -Pflink-runner

You can monitor the running job by visiting the Flink dashboard at http://<flink master>:8081

SparkRunner:
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" -Pspark-runner

DataflowRunner:
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=DataflowRunner --project=<your-gcp-project> \
                  --gcpTempLocation=gs://<your-gcs-bucket>/tmp \
                  --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" \
     -Pdataflow-runner
Inspect the results
DirectRunner, ApexRunner, FlinkRunner (local), SparkRunner:
$ ls counts*

FlinkRunner (cluster):
$ ls /tmp/counts*

DataflowRunner:
$ gsutil ls gs://<your-gcs-bucket>/counts*
DirectRunner:
$ more counts*
api: 9
bundled: 1
old: 4
Apache: 2
The: 1
limitations: 1
Foundation: 1
...

ApexRunner:
$ cat counts*
BEAM: 1
have: 1
simple: 1
skip: 4
PAssert: 1
...

FlinkRunner (local):
$ more counts*
The: 1
api: 9
old: 4
Apache: 2
limitations: 1
bundled: 1
Foundation: 1
...

FlinkRunner (cluster):
$ more /tmp/counts*
The: 1
api: 9
old: 4
Apache: 2
limitations: 1
bundled: 1
Foundation: 1
...

SparkRunner:
$ more counts*
beam: 27
SF: 1
fat: 1
job: 1
limitations: 1
require: 1
of: 11
profile: 10
...

DataflowRunner:
$ gsutil cat gs://<your-gcs-bucket>/counts*
feature: 15
smother'st: 1
revelry: 1
bashfulness: 1
Bashful: 1
Below: 2
deserves: 32
barrenly: 1
...
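The output files list words in no particular order, so it can help to rank them by count. The following is a minimal sketch, not part of the quickstart: `top_words` is a hypothetical helper, and it assumes every line has the `word: count` form shown above with no colon inside the word.

```shell
#!/bin/sh
# Hypothetical helper: read "word: count" lines on stdin and print the
# N lines with the highest counts (default 10), highest first.
top_words() {
  n=${1:-10}
  # Split on ':' and sort numerically, descending, on the count field.
  sort -t: -k2,2nr | head -n "$n"
}
```

For the local runners this could be used as `cat counts* | top_words 5`. For Dataflow output, `gsutil cat gs://<your-gcs-bucket>/counts* | top_words 5` would do the same.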
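Next steps
- To learn more about the Beam SDK for Java, see the Java SDK API reference.
- Walk through these WordCount examples in the WordCount Example Walkthrough.
- Dive into our favorite articles and presentations.
- Join the users@ mailing list.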