[Spark] Reading GBK-encoded files in Spark

import logging

def output_mapper(line):
    """The input file is GBK-encoded; after it is read with Spark's
    GBKFileInputFormat it is converted to UTF-8 automatically.
    Keys are the position in the file, and values are the line of text,
    converted to UTF-8 Text.

    Args:
        line: (position, "bidword \t sp \t tag_info")
    Returns:
        list: [bidword, sp, tag_info, theDate], or None for a malformed line
    """
    try:
        global theDate  # defined at module level on the driver; shipped with the closure
        value = line[1]
        bidword, sp, tag_info = value.strip().split('\t')
        return [bidword, sp, tag_info, theDate]
    except Exception as e:
        logging.error("output_mapper error: {}".format(e))
        return None
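theDate is assumed to be set at module level on the driver before the job runs, for example (a minimal sketch; the date format is only an example):

import datetime
theDate = datetime.date.today().strftime("%Y%m%d")  # e.g. "20210203"; captured by output_mapper's closure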

# hadoopFile(path, input format class, key class, value class);
# GBKFileInputFormat decodes each GBK line to UTF-8 before Spark sees it.
test_df = sc.hadoopFile(test_file,
                        "org.apache.spark.input.GBKFileInputFormat",
                        "org.apache.hadoop.io.LongWritable",
                        "org.apache.hadoop.io.Text")\
            .map(output_mapper)\
            .filter(lambda x: x is not None)\
            .toDF()
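If GBKFileInputFormat is not on your cluster's classpath, a pure-PySpark fallback is to read the raw bytes and decode them yourself. This is a sketch only: sc.binaryFiles loads each file whole and unsplit, so it assumes individual input files fit comfortably in executor memory.

# Fallback sketch: decode GBK manually instead of using GBKFileInputFormat.
raw = sc.binaryFiles(test_file)                      # (path, file contents as bytes)
lines = raw.flatMap(lambda kv: kv[1].decode("gbk").splitlines())
test_df = lines.map(lambda v: output_mapper((None, v)))\
               .filter(lambda x: x is not None)\
               .toDF()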

Reference:

https://www.wangt.cc/2019/11/feature%EF%BC%9Aspark%E6%94%AF%E6%8C%81gbk%E6%96%87%E4%BB%B6%E8%AF%BB%E5%8F%96%E5%8A%9F%E8%83%BD/

The class-level Javadoc of GBKFileInputFormat, quoted from the post above:

/**
 * FileInputFormat for GBK-encoded files. Files are broken into lines. Either linefeed
 * or carriage-return are used to signal end of line. Keys are the position in the file,
 * and values are the line of text and will be converted to UTF-8 Text.
 */
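To see the conversion this comment describes without a cluster, you can write a GBK file locally and decode it (a minimal sketch, independent of Spark; test_gbk.txt is just an example name):

# Simulate a GBK source file and the per-line decode the record reader performs.
text = u"竞价词\t1000\ttag_info"
with open("test_gbk.txt", "wb") as f:
    f.write(text.encode("gbk") + b"\n")          # bytes on disk are GBK
with open("test_gbk.txt", "rb") as f:
    decoded = f.read().decode("gbk").strip()     # decoded back to a unicode string
assert decoded == text
utf8_value = decoded.encode("utf-8")             # what Spark receives as UTF-8 Text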
