【译】用 Rust 实现 csv 解析-part5

Rust and CSV parsing 译文（用 Rust 实现 csv 解析-part5）

原文链接：https://blog.burntsushi.net/csv/

原文作者：BurntSushi

译文来自：https://github.com/suhanyujie/article-transfer-rs/

译者：suhanyujie

译者博客：suhanyujie

ps：水平有限，翻译不当之处，还请指正。

标签：Rust，csv

使用 Serde 处理非法数据

在本节中，我们将看到一个关于如何处理非正常数据的简单示例。为了完成这个练习，我们将使用前面一直使用的美国人口数据的调整版。这个版本的数据比之前要混乱一些。你可以下载它：

$ curl -LO 'https://raw.githubusercontent.com/BurntSushi/rust-csv/master/examples/data/uspop-null.csv'

我们基于上一节的示例程序，继续开发：

#[derive(Debug, Deserialize)]
 #[serde(rename_all = "PascalCase")]
struct Record {
    latitude: f64,
    longitude: f64,
    population: Option<u64>,
    city: String,
    state: String,
}

fn run() -> Result<(), Box<Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.deserialize() {
        let record: Record = result?;
        println!("{:?}", record);
    }
    Ok(())
}

编译并用我们的不整洁数据运行它：

$ cargo build
$ ./target/debug/csvtutor < uspop-null.csv
Record { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" }
Record { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" }
Record { latitude: 33.7133333, longitude: -87.3886111, population: None, city: "Oakman", state: "AL" }
# ... more records
CSV deserialize error: record 42 (line: 43, byte: 1710): field 2: invalid digit found in string

哇！发生什么了？程序打印了几条记录，但是当它反序列化时遇到问题时停止了。错误消息显示，它在第 43 行第 2 个字段（即 Population 字段）中发现了一个无效的数字。第 43 行是什么？

$ head -n 43 uspop-null.csv | tail -n1
Flint Springs,KY,NULL,37.3433333,-86.7136111

啊！第三个字段（索引为 2）应该要么是空的，要么是人口计数。但是，在这里，似乎值就是 NULL，作者可能是为了表明这里没有计数。

我们程序当前的问题是，它无法读取这个记录，因为它不知道如何将 NULL 反序列化为 Option<u64> 类型。也就是说，Option<u64> 要么对应一个空值。要么对应一个整数。

为了修复这个问题，我们需要让 Serde 在这个字段上支持将任意反序列化时的错误转换为一个 None 值，如下方示例所示：

#[derive(Debug, Deserialize)]
 #[serde(rename_all = "PascalCase")]
struct Record {
    latitude: f64,
    longitude: f64,
    #[serde(deserialize_with = "csv::invalid_option")]
    population: Option<u64>,
    city: String,
    state: String,
}

fn run() -> Result<(), Box<Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.deserialize() {
        let record: Record = result?;
        println!("{:?}", record);
    }
    Ok(())
}

如果你编译并运行这个示例，它能像其他示例一样执行完成：

$ cargo build
$ ./target/debug/csvtutor < uspop-null.csv
Record { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" }
Record { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" }
Record { latitude: 33.7133333, longitude: -87.3886111, population: None, city: "Oakman", state: "AL" }
# ... and more

这个示例中唯一改变的是向记录中 population 字段增加了这个属性：

#[serde(deserialize_with = "csv::invalid_option")]

invalid_option 函数是一个通用的辅助函数，它做了一件非常简单的事情：当其作用于字段时，它把任何反序列化错误转换为 None 值。当你需要处理混乱的 CSV 数据时，这非常好用。

写入 CSV 数据

在这一节中，我们会展示一些写 CSV 数据的示例。写入 CSV 数据往往比读取 CSV 数据更直接一些，因为你需要控制输出格式。

我们先从最基本的例子开始：写一些 CSV 数据记录到标准输出。

extern crate csv;

use std::error::Error;
use std::io;
use std::process;

fn run() -> Result<(), Box<Error>> {
    let mut wtr = csv::Writer::from_writer(io::stdout());
    // Since we're writing records manually, we must explicitly write our
    // header record. A header record is written the same way that other
    // records are written.
    wtr.write_record(&["City", "State", "Population", "Latitude", "Longitude"])?;
    wtr.write_record(&["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])?;
    wtr.write_record(&["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])?;
    wtr.write_record(&["Oakman", "AL", "", "33.7133333", "-87.3886111"])?;

    // A CSV writer maintains an internal buffer, so it's important
    // to flush the buffer when you're done.
    wtr.flush()?;
    Ok(())
}

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}

编译并运行这个示例，程序会将数据打印出来：

$ cargo build
$ ./target/debug/csvtutor
City,State,Population,Latitude,Longitude
Davidsons Landing,AK,,65.2419444,-165.2716667
Kenai,AK,7610,60.5544444,-151.2583333
Oakman,AL,,33.7133333,-87.3886111

在继续学习之前，有必要仔细研究一下 write_record 方法。在这个例子中，它看起来相当简单，但是如果你是 Rust 新手，可能会不太明白它的类型签名：

pub fn write_record<I, T>(&mut self, record: I) -> csv::Result<()>
    where I: IntoIterator<Item=T>, T: AsRef<[u8]>
{
    // 省略
}

为了搞懂这个类型签名，我们可以对其进行逐步分解。

1.这个方法有两个参数 self 和 record。
2.self 是一个特殊的参数，它对应于 Writer 本身。
3.record 是我们即将要写入的 CSV 记录。它的类型是 I，是一个泛型。
4.在方法的 where 子句中，I 类型由 IntoIterator<Item=T> bound 约束着。这意味着 I 必须满足 IntoIterator trait。如果你阅读 IntoIterator trait 文档，那么就能看到它描述了构建迭代器的类型。在这个例子中，我们需要一个迭代器，它生成另一个泛型 T，其中 T 是我们写入 CSV 数据的字段类型。
5.T 也出现在 where 字句中了，但它受 AsRef<[u8]> bound 约束。AsRef trait 是 Rust 中零成本抽象类型描述的一种方式。在本例中，AsRef<[u8]> 中的 [u8] 就是我们想要从 T 中借用的字节切片。CSV writer 将接受这些字节并将其作为单个字段的值写入。AsRef<[u8]> 绑定非常有用，因为像 String，&str，Vec<u8> 和 &[u8] 等类型都满足它的约束。
6.最终，方法返回 csv::Result<()>，这是 Result<(), csv::Error> 的简写。该类型意味着 write_record 要么成功时不返回任何值，要么失败了，返回 csv::Error。

现在，可以应用我们已经理解了的 write_record 的函数签名了。如果还记得，在我们之前的例子中，我们是这样使用的：

wtr.write_record(&["field 1", "field 2", "etc"])?;

这些类型是如何匹配的呢？好了，这个记录中的字段的类型是 &'static str（这是 Rust 中字符串字面量的类型）。因为我们将其放在一个切片文本中，所以我们的参数类型是 &'static [&'static str]，或者更简洁地写成没有生命周期注释的形式 &[&str]。因为切片满足 IntoIterator 绑定，字符串满足 AsRef<[u8]> 绑定，这将会是一个合法的调用。

这里是一些调用 write_record 的示例：

// A slice of byte strings.
wtr.write_record(&[b"a", b"b", b"c"]);
// A vector.
wtr.write_record(vec!["a", "b", "c"]);
// A string record.
wtr.write_record(&csv::StringRecord::from(vec!["a", "b", "c"]));
// A byte record.
wtr.write_record(&csv::ByteRecord::from(vec!["a", "b", "c"]));

最终，上面的例子可以很容易地修改成写入文件而非 stdout：

extern crate csv;

use std::env;
use std::error::Error;
use std::ffi::OsString;
use std::process;

fn run() -> Result<(), Box<Error>> {
    let file_path = get_first_arg()?;
    let mut wtr = csv::Writer::from_path(file_path)?;

    wtr.write_record(&["City", "State", "Population", "Latitude", "Longitude"])?;
    wtr.write_record(&["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])?;
    wtr.write_record(&["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])?;
    wtr.write_record(&["Oakman", "AL", "", "33.7133333", "-87.3886111"])?;

    wtr.flush()?;
    Ok(())
}

/// Returns the first positional argument sent to this process. If there are no
/// positional arguments, then this returns an error.
fn get_first_arg() -> Result<OsString, Box<Error>> {
    match env::args_os().nth(1) {
        None => Err(From::from("expected 1 argument, but got none")),
        Some(file_path) => Ok(file_path),
    }
}

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}

Writing tab separated values

写入 tab 分隔符
如何写入 tab 分隔符呢？可以自己思考一下，下一期翻译，揭晓答案。当然，你也可以尝试阅读英文原文。：）

posted @ 2020-12-22 10:10 suhanyujie 阅读(313) 评论(0) 编辑收藏举报

刷新页面返回顶部

知而不行，是为不知；行而不知，可以致知

【译】用 Rust 实现 csv 解析-part5

使用 Serde 处理非法数据

写入 CSV 数据

Writing tab separated values

公告