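Flink Study Notes: DataSet API
A DataSet program in Flink implements transformations on data sets. A data set is usually created from a bounded, fixed source, such as a readable file or a local collection.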
Ref
https://ci.apache.org/projects/flink/flink-docs-release-1.12/zh/dev/batch/
Using the DataSet API requires the batch execution environment:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
The data sources supported by DataSet fall into three groups: file-based, collection-based, and generic.
1. File-based
readTextFile(path) / TextInputFormat - Reads files line-wise and returns them as Strings.
readTextFileWithValue(path) / TextValueInputFormat - Reads files line-wise and returns them as StringValues. StringValues are mutable strings.
readCsvFile(path) / CsvInputFormat - Parses files of comma (or another char) delimited fields. Returns a DataSet of tuples or POJOs. Supports the basic Java types and their Value counterparts as field types.
readFileOfPrimitives(path, Class) / PrimitiveInputFormat - Parses files of new-line (or another char sequence) delimited primitive data types such as String or Integer.
readFileOfPrimitives(path, delimiter, Class) / PrimitiveInputFormat - Parses files of new-line (or another char sequence) delimited primitive data types such as String or Integer using the given delimiter.
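As a rough illustration of the file-based sources listed above, the sketch below reads the same file once with readTextFile and once with readCsvFile. It assumes Flink 1.12 (flink-java and flink-clients) is on the classpath; the class name FileSourcesSketch, the helper methods, and the sample data are hypothetical and only serve the example.

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical example class, not part of Flink.
public class FileSourcesSketch {

    /** Reads a text file line-wise and returns the lines as Strings. */
    public static List<String> readLines(String path) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> lines = env.readTextFile(path);
        // collect() triggers execution and ships the result to the client.
        return lines.collect();
    }

    /** Parses a two-column CSV file (String, Integer) into Tuple2 records. */
    public static List<Tuple2<String, Integer>> readCsv(String path) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Tuple2<String, Integer>> rows = env
                .readCsvFile(path)
                .fieldDelimiter(",")
                .types(String.class, Integer.class);
        return rows.collect();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data written to a temporary file.
        Path csv = Files.createTempFile("people", ".csv");
        Files.write(csv, Arrays.asList("alice,30", "bob,25"), StandardCharsets.UTF_8);
        String uri = csv.toUri().toString();

        System.out.println(readLines(uri));
        System.out.println(readCsv(uri));
    }
}
```

The order of collected records is not guaranteed when the source runs with a parallelism greater than 1, so callers that need a stable order should sort the result.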
2. Collection-based
fromCollection(Collection) - Creates a data set from a java.util.Collection. All elements in the collection must be of the same type.
fromCollection(Iterator, Class) - Creates a data set from an iterator. The class specifies the data type of the elements returned by the iterator.
fromElements(T ...) - Creates a data set from the given sequence of objects. All objects must be of the same type.
fromParallelCollection(SplittableIterator, Class) - Creates a data set from an iterator, in parallel. The class specifies the data type of the elements returned by the iterator.
generateSequence(from, to) - Generates the sequence of numbers in the given interval, in parallel.
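The collection-based sources above are convenient for local experiments. A minimal sketch, assuming Flink 1.12 on the classpath (the class name CollectionSourcesSketch and the method countAll are hypothetical), might look like this:

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

import java.util.Arrays;

// Hypothetical example class, not part of Flink.
public class CollectionSourcesSketch {

    /** Builds three small data sets and returns the total number of elements. */
    public static long countAll() throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // fromElements: all objects must be of the same type.
        DataSet<String> words = env.fromElements("flink", "dataset", "api");

        // fromCollection: wraps an existing java.util.Collection.
        DataSet<Integer> numbers = env.fromCollection(Arrays.asList(1, 2, 3, 4));

        // generateSequence: both bounds are inclusive, so this yields 1..10.
        DataSet<Long> sequence = env.generateSequence(1, 10);

        // Each count() call triggers its own job execution.
        return words.count() + numbers.count() + sequence.count();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countAll());
    }
}
```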
3. Generic
readFile(inputFormat, path) / FileInputFormat - Accepts a file input format.
createInput(inputFormat) / InputFormat - Accepts a generic input format.
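To make the difference between the two generic entry points concrete, the following sketch reads the same file through readFile and through createInput, in both cases using Flink's TextInputFormat as the input format. It assumes Flink 1.12 on the classpath; the class name GenericSourcesSketch and the method countLines are hypothetical.

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;

import java.io.File;
import java.nio.file.Files;
import java.util.Arrays;

// Hypothetical example class, not part of Flink.
public class GenericSourcesSketch {

    /** Reads the file twice and returns the sum of both line counts. */
    public static long countLines(String path) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // readFile: a file input format plus the path to read.
        DataSet<String> viaReadFile =
                env.readFile(new TextInputFormat(new Path(path)), path);

        // createInput: any InputFormat; here the format already carries its path.
        DataSet<String> viaCreateInput =
                env.createInput(new TextInputFormat(new Path(path)));

        return viaReadFile.count() + viaCreateInput.count();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data written to a temporary file.
        File file = File.createTempFile("lines", ".txt");
        Files.write(file.toPath(), Arrays.asList("a", "b", "c"));
        System.out.println(countLines(file.toURI().toString()));
    }
}
```

createInput is the one to reach for when the source is not a file at all, since it accepts any InputFormat implementation.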
For the transformation operators supported by DataSet, see:
https://ci.apache.org/projects/flink/flink-docs-release-1.12/zh/dev/batch/dataset_transformations.html
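As one possible illustration of chaining transformations, the sketch below implements a small word count with flatMap, groupBy, and sum. It assumes Flink 1.12 on the classpath; the class name TransformationsSketch, the method wordCount, and the tokenization rule are choices made for this example rather than anything prescribed by the documentation.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

import java.util.HashMap;
import java.util.Map;

// Hypothetical example class, not part of Flink.
public class TransformationsSketch {

    /** Counts word occurrences across the given lines (at least one line required). */
    public static Map<String, Integer> wordCount(String... lines) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<String, Integer>> counts = env
                .fromElements(lines)
                // An anonymous class keeps the generic output type visible to Flink.
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                        for (String word : line.toLowerCase().split("\\W+")) {
                            if (!word.isEmpty()) {
                                out.collect(Tuple2.of(word, 1));
                            }
                        }
                    }
                })
                .groupBy(0)  // group by the word (tuple field 0)
                .sum(1);     // sum the counts (tuple field 1)

        Map<String, Integer> result = new HashMap<>();
        for (Tuple2<String, Integer> entry : counts.collect()) {
            result.put(entry.f0, entry.f1);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(wordCount("to be", "or not to be"));
    }
}
```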
DataSet supports the following data sinks:
writeAsText() / TextOutputFormat - Writes elements line-wise as Strings. The Strings are obtained by calling the toString() method of each element.
writeAsFormattedText() / TextOutputFormat - Writes elements line-wise as Strings. The Strings are obtained by calling a user-defined format() method for each element.
writeAsCsv(...) / CsvOutputFormat - Writes tuples as comma-separated value files. Row and field delimiters are configurable. The value for each field comes from the toString() method of the objects.
print() / printToErr() / print(String msg) / printToErr(String msg) - Prints the toString() value of each element on the standard out / standard error stream. Optionally, a prefix (msg) can be provided which is prepended to the output. This can help to distinguish between different calls to print. If the parallelism is greater than 1, the output will also be prepended with the identifier of the task which produced the output.
write() / FileOutputFormat - Method and base class for custom file outputs. Supports custom object-to-bytes conversion.
output() / OutputFormat - Most generic output method, for data sinks that are not file-based (such as storing the result in a database).
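A minimal sketch of a file sink, assuming Flink 1.12 on the classpath: it writes two tuples with writeAsCsv and reads the file back with plain Java I/O. The class name SinksSketch, the method writeCsv, and the sample records are hypothetical. Unlike print() and collect(), file sinks are lazy, so the job only runs once env.execute() is called.

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.fs.FileSystem.WriteMode;

import java.io.File;
import java.nio.file.Files;
import java.util.List;

// Hypothetical example class, not part of Flink.
public class SinksSketch {

    /** Writes two sample tuples to the target file as CSV and returns its lines. */
    public static List<String> writeCsv(File target) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<String, Integer>> data =
                env.fromElements(Tuple2.of("alice", 30), Tuple2.of("bob", 25));

        // Row delimiter "\n", field delimiter ",". A sink parallelism of 1
        // produces a single file instead of a directory of part files.
        data.writeAsCsv(target.toURI().toString(), "\n", ",", WriteMode.OVERWRITE)
            .setParallelism(1);

        // File sinks are lazy: nothing is written until execute() runs the job.
        env.execute("csv sink sketch");

        return Files.readAllLines(target.toPath());
    }

    public static void main(String[] args) throws Exception {
        File target = File.createTempFile("people", ".csv");
        System.out.println(writeCsv(target));
    }
}
```

For quick debugging, calling data.print() instead of attaching a file sink is usually enough, since print() triggers execution by itself.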
This article is published only on cnblogs (博客园) and tonglin0325's blog. Author: tonglin0325. When reposting, please cite the original link: https://www.cnblogs.com/tonglin0325/p/14121353.html