[Rust] Rust Deep Learning [2]: A Data Analysis and Mining Library (Polars)

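What is Polars?

Polars is a high-performance DataFrame library for working with structured data. It covers data cleaning and format conversion, analysis and statistics, reading and writing data, and merging and concatenating tables, which makes it roughly the Rust counterpart of Pandas. Plotting is not part of the Rust crate itself.

Polars can read and write the following:

  Common data files: csv, parquet, and json (json sits behind the crate's "json" feature flag; xlsx is not supported natively)

  Cloud storage: S3, Azure Blob, BigQuery

  Databases: PostgreSQL, MySQL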

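Why Polars?

Compared with Pandas, Polars has the following advantages:

  Polars has no index on its DataFrame. Without an index, rows are addressed by position only, which makes the data simpler to manipulate. (The index on a Pandas DataFrame is rarely worth its complexity.)

  Polars stores data as Apache Arrow arrays, whereas Pandas is backed by NumPy arrays. Arrow is more efficient in load time, memory usage, and computation.

  Polars parallelizes more operations than Pandas. Because it is written in Rust, it benefits from Rust's fearless concurrency.

  Polars supports lazy evaluation: it inspects a query as a whole and optimizes it to run faster or use less memory before executing it. Pandas supports only eager evaluation, where each operation is computed as soon as it is called.

  Polars was built to address the performance limits of Pandas. In many benchmarks it is 2-3 times faster than Pandas.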
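
The difference between eager and lazy evaluation can be illustrated without Polars at all. The sketch below is only an analogy built on standard-library iterators, not the Polars lazy API: the function names eager_even_squares and lazy_even_squares are made up for this illustration. Both return the first n even squares together with a count of how many elements were actually squared.

```rust
/// Eager style: every step runs to completion and materializes a full Vec.
/// Returns the first `n` even squares and how many elements were squared.
fn eager_even_squares(data: &[i64], n: usize) -> (Vec<i64>, usize) {
    let mut squared_count = 0;
    let squares: Vec<i64> = data
        .iter()
        .map(|x| {
            squared_count += 1;
            x * x
        })
        .collect(); // the whole intermediate column is computed here
    let evens: Vec<i64> = squares.into_iter().filter(|x| x % 2 == 0).collect();
    (evens.into_iter().take(n).collect(), squared_count)
}

/// Lazy style: the steps are only described until `collect` is called,
/// and work stops as soon as `n` results have been produced.
fn lazy_even_squares(data: &[i64], n: usize) -> (Vec<i64>, usize) {
    let mut squared_count = 0;
    let result: Vec<i64> = data
        .iter()
        .map(|x| {
            squared_count += 1;
            x * x
        })
        .filter(|x| x % 2 == 0)
        .take(n)
        .collect();
    (result, squared_count)
}
```

For the input [1, 2, 3, 4, 5, 6, 7, 8] and n = 2, both versions return [4, 16], but the eager one squares all 8 elements while the lazy one squares only 4. A query optimizer works at a much larger scale, but the principle of seeing the whole pipeline before doing any work is the same.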

Adding the dependency

[dependencies]
polars = "0.38.2"

Creating a DataFrame

Creating an empty DataFrame

use polars::frame::DataFrame;

fn main() {
    // Create an empty DataFrame
    let data = DataFrame::default();
    println!("{:?}", &data);
}
// Output
shape: (0, 0)
┌┐
╞╡
└┘

Creating from Series (columns)

use polars::prelude::*;

fn main() {
    // Create three named columns.
    // A Series can be built from a Vec, a slice, or an array, and carries a name.
    let s1 = Series::new("from vec", vec![1, 2, 3]);
    let s2 = Series::new("from slice", &[true, false, true]);
    let s3 = Series::new("from array", ["rookie", "long", "lin"]);
    // Combine the columns into a DataFrame
    let data: PolarsResult<DataFrame> = DataFrame::new(vec![s1, s2, s3]);
    println!("{:?}", &data.unwrap());
}
// Output
shape: (3, 3)
┌──────────┬────────────┬────────────┐
│ from vec ┆ from slice ┆ from array │
│ ---      ┆ ---        ┆ ---        │
│ i32      ┆ bool       ┆ str        │
╞══════════╪════════════╪════════════╡
│ 1        ┆ true       ┆ rookie     │
│ 2        ┆ false      ┆ long       │
│ 3        ┆ true       ┆ lin        │
└──────────┴────────────┴────────────┘

Creating with the df! macro

use polars::prelude::*;

fn main() {
    let df = df! [
        // column name => column data
        "Model" => ["iPhone XS", "iPhone 12", "iPhone 13", "iPhone 14", "Samsung S11", "Samsung S12", "Mi A1", "Mi A2"],
        "Company" => ["Apple", "Apple", "Apple", "Apple", "Samsung", "Samsung", "Xiao Mi", "Xiao Mi"],
        "Sales" => [80, 170, 130, 205, 400, 30, 14, 8],
        "Comment" => [None, None, Some("Sold Out"), Some("New Arrival"), None, Some("Sold Out"), None, None],
    ].unwrap();
    println!("{}", &df);
}
// Output
┌─────────────┬─────────┬───────┬─────────────┐
│ Model       ┆ Company ┆ Sales ┆ Comment     │
│ ---         ┆ ---     ┆ ---   ┆ ---         │
│ str         ┆ str     ┆ i32   ┆ str         │
╞═════════════╪═════════╪═══════╪═════════════╡
│ iPhone XS   ┆ Apple   ┆ 80    ┆ null        │
│ iPhone 12   ┆ Apple   ┆ 170   ┆ null        │
│ iPhone 13   ┆ Apple   ┆ 130   ┆ Sold Out    │
│ iPhone 14   ┆ Apple   ┆ 205   ┆ New Arrival │
│ Samsung S11 ┆ Samsung ┆ 400   ┆ null        │
│ Samsung S12 ┆ Samsung ┆ 30    ┆ Sold Out    │
│ Mi A1       ┆ Xiao Mi ┆ 14    ┆ null        │
│ Mi A2       ┆ Xiao Mi ┆ 8     ┆ null        │
└─────────────┴─────────┴───────┴─────────────┘

Creating from a csv file

use polars::prelude::{CsvReader, SerReader};

fn main() {
    // Read the csv file; the first line is treated as the header
    let data = CsvReader::from_path("number.csv").unwrap()
        .has_header(true)
        .finish().unwrap();
    // Print the DataFrame
    println!("{}", &data);
}
// Output
shape: (11, 4)
┌───────┬───────┬───────┬───────┐
│ 1     ┆ 2     ┆ 3     ┆ 4     │
│ ---   ┆ ---   ┆ ---   ┆ ---   │
│ i64   ┆ i64   ┆ i64   ┆ i64   │
╞═══════╪═══════╪═══════╪═══════╡
│ 10001 ┆ 20001 ┆ 30001 ┆ 40001 │
│ 10002 ┆ 20002 ┆ 30002 ┆ 40002 │
│ 10003 ┆ 20003 ┆ 30003 ┆ 40003 │
│ 10004 ┆ 20004 ┆ 30004 ┆ 40004 │
│ 10005 ┆ 20005 ┆ 30005 ┆ 40005 │
│ …     ┆ …     ┆ …     ┆ …     │
│ 10007 ┆ 20007 ┆ 30007 ┆ 40007 │
│ 10008 ┆ 20008 ┆ 30008 ┆ 40008 │
│ 10009 ┆ 20009 ┆ 30009 ┆ 40009 │
│ 10010 ┆ 20010 ┆ 30010 ┆ 40010 │
│ 10011 ┆ 20011 ┆ 30011 ┆ 40011 │
└───────┴───────┴───────┴───────┘

Inspecting a DataFrame

Viewing the shape

use polars::prelude::{CsvReader, SerReader};

fn main() {
    let data = CsvReader::from_path("number.csv").unwrap()
        .has_header(true)
        .finish().unwrap();
    // shape() returns (rows, columns)
    println!("{:?}", &data.shape());
}
// Output: (11, 4)

Viewing the data type of each column

use polars::prelude::{CsvReader, SerReader};

fn main() {
    let data = CsvReader::from_path("number.csv").unwrap()
        .has_header(true)
        .finish().unwrap();
    println!("{:?}", &data.dtypes());
}
// Output: [Int64, Int64, Int64, Int64]

Viewing the column names

use polars::prelude::{CsvReader, SerReader};

fn main() {
    let data = CsvReader::from_path("number.csv").unwrap()
        .has_header(true)
        .finish().unwrap();
    println!("{:?}", &data.get_column_names());
}
// Output: ["1", "2", "3", "4"]

Viewing the first rows

use polars::prelude::{CsvReader, SerReader};

fn main() {
    let data = CsvReader::from_path("number.csv").unwrap()
        .has_header(true)
        .finish().unwrap();
    // Show the first 5 rows
    println!("{:?}", &data.head(Some(5)));
}
// Output
shape: (5, 4)
┌───────┬───────┬───────┬───────┐
│ 1     ┆ 2     ┆ 3     ┆ 4     │
│ ---   ┆ ---   ┆ ---   ┆ ---   │
│ i64   ┆ i64   ┆ i64   ┆ i64   │
╞═══════╪═══════╪═══════╪═══════╡
│ 10001 ┆ 20001 ┆ 30001 ┆ 40001 │
│ 10002 ┆ 20002 ┆ 30002 ┆ 40002 │
│ 10003 ┆ 20003 ┆ 30003 ┆ 40003 │
│ 10004 ┆ 20004 ┆ 30004 ┆ 40004 │
│ 10005 ┆ 20005 ┆ 30005 ┆ 40005 │
└───────┴───────┴───────┴───────┘

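Viewing the last rows

use polars::prelude::{CsvReader, SerReader};

fn main() {
    let data = CsvReader::from_path("number.csv").unwrap()
        .has_header(true)
        .finish().unwrap();
    // Show the last 5 rows
    println!("{:?}", &data.tail(Some(5)));
}
// Output
shape: (5, 4)
┌───────┬───────┬───────┬───────┐
│ 1     ┆ 2     ┆ 3     ┆ 4     │
│ ---   ┆ ---   ┆ ---   ┆ ---   │
│ i64   ┆ i64   ┆ i64   ┆ i64   │
╞═══════╪═══════╪═══════╪═══════╡
│ 10007 ┆ 20007 ┆ 30007 ┆ 40007 │
│ 10008 ┆ 20008 ┆ 30008 ┆ 40008 │
│ 10009 ┆ 20009 ┆ 30009 ┆ 40009 │
│ 10010 ┆ 20010 ┆ 30010 ┆ 40010 │
│ 10011 ┆ 20011 ┆ 30011 ┆ 40011 │
└───────┴───────┴───────┴───────┘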

Viewing a single row and its types

use polars::prelude::{CsvReader, SerReader};

fn main() {
    let data = CsvReader::from_path("number.csv").unwrap()
        .has_header(true)
        .finish().unwrap();
    // Row indices start at 0, so this is the third row
    println!("{:?}", &data.get_row(2));
}
// Output
Ok(Row([Int64(10003), Int64(20003), Int64(30003), Int64(40003)]))

Aggregation and statistics

Transposing

use polars::prelude::*;

fn main() {
    // Mutable binding (a variable, not a constant)
    let mut df1: DataFrame = df![
        "D1" => &[1,2,3,4,5,6,7,78],
        "D2" => &[23,5,5,76,7,89,89,95]
    ].unwrap();
    // Transpose; this is an expensive operation
    let df2 = df1.transpose(None, None).unwrap();
    println!("{:?}", &df2);
}
// The shape goes from (8, 2) to (2, 8)
shape: (2, 8)
┌──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 ┆ column_3 ┆ column_4 ┆ column_5 ┆ column_6 ┆ column_7 │
│ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      │
│ i32      ┆ i32      ┆ i32      ┆ i32      ┆ i32      ┆ i32      ┆ i32      ┆ i32      │
╞══════════╪══════════╪══════════╪══════════╪══════════╪══════════╪══════════╪══════════╡
│ 1        ┆ 2        ┆ 3        ┆ 4        ┆ 5        ┆ 6        ┆ 7        ┆ 78       │
│ 23       ┆ 5        ┆ 5        ┆ 76       ┆ 7        ┆ 89       ┆ 89       ┆ 95       │
└──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
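
To see why transposing column-oriented data is costly, it helps to write the operation out by hand. The following is a minimal sketch in plain Rust, not what Polars does internally; the function name transpose_columns is chosen for this illustration. Every value has to be copied into a freshly allocated column, so the work grows with rows times columns.

```rust
/// Transpose column-major data: `columns[c][r]` becomes `result[r][c]`.
/// Each inner Vec of the result is one column of the transposed table.
fn transpose_columns(columns: &[Vec<i32>]) -> Vec<Vec<i32>> {
    let n_rows = columns.first().map_or(0, |c| c.len());
    // Every input column must have the same length
    assert!(columns.iter().all(|c| c.len() == n_rows));
    (0..n_rows)
        .map(|r| columns.iter().map(|c| c[r]).collect())
        .collect()
}
```

Given the two columns [1, 2, 3] and [23, 5, 5], the result is three columns: [1, 23], [2, 5], and [3, 5].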

Sorting

use polars::prelude::*;

fn main() {
    let df1: DataFrame = df![
        "D1" => &[1,2,3,4,5,6,7,78],
        "D2" => &[23,5,5,76,7,89,89,95]
    ].unwrap();
    // Sort by D1 in descending order (the true flag); whole rows move together, so D2 is reordered to match
    let df2 = df1.sort(["D1"], vec![true], false).unwrap();
    println!("{:?}", &df2);
}
// Output
shape: (8, 2)
┌─────┬─────┐
│ D1  ┆ D2  │
│ --- ┆ --- │
│ i32 ┆ i32 │
╞═════╪═════╡
│ 78  ┆ 95  │
│ 7   ┆ 89  │
│ 6   ┆ 89  │
│ 5   ┆ 7   │
│ 4   ┆ 76  │
│ 3   ┆ 5   │
│ 2   ┆ 5   │
│ 1   ┆ 23  │
└─────┴─────┘
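
The way the second column follows the sorted one can be sketched in plain Rust. This is an illustration under assumptions, not the Polars implementation, and sort_rows_desc is a name made up here: sort a list of row indices by the key column, then reorder every column with that same index list.

```rust
/// Sort row indices by `key` in descending order, then reorder both columns
/// with the same index list so that each row stays intact.
fn sort_rows_desc(key: &[i32], other: &[i32]) -> (Vec<i32>, Vec<i32>) {
    assert_eq!(key.len(), other.len());
    let mut order: Vec<usize> = (0..key.len()).collect();
    // `sort_by` is stable: rows with equal keys keep their original order
    order.sort_by(|&a, &b| key[b].cmp(&key[a]));
    let sorted_key = order.iter().map(|&i| key[i]).collect();
    let sorted_other = order.iter().map(|&i| other[i]).collect();
    (sorted_key, sorted_other)
}
```

Applied to the D1 and D2 values from the example, this yields the same row order as the table above.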

Joining

use polars::prelude::*;

fn main() {
    // Create a DataFrame that contains null values
    let df = df! [
        // column name => column data
        "Model" => ["iPhone XS", "iPhone 12", "iPhone 13", "iPhone 14", "Samsung S11", "Samsung S12", "Mi A1", "Mi A2"],
        "Company" => ["Apple", "Apple", "Apple", "Apple", "Samsung", "Samsung", "Xiao Mi", "Xiao Mi"],
        "Sales" => [80, 170, 130, 205, 400, 30, 14, 8],
        "Comment" => [None, None, Some("Sold Out"), Some("New Arrival"), None, Some("Sold Out"), None, None],
    ].unwrap();
    let df_price = df! [
        "Model" => ["iPhone XS", "iPhone 12", "iPhone 13", "iPhone 14", "Samsung S11", "Samsung S12", "Mi A1", "Mi A2"],
        "Price" => [2430, 3550, 5700, 8750, 2315, 3560, 980, 1420],
        "Discount" => [Some(0.85), Some(0.85), Some(0.8), None, Some(0.87), None, Some(0.66), Some(0.8)],
    ].unwrap();
    // Join
    // join() takes four arguments: the DataFrame to join with, the key column(s) of the left table, the key column(s) of the right table, and the join type
    let df_join = df.join(&df_price, ["Model"], ["Model"], JoinArgs::from(JoinType::Inner)).unwrap();
    println!("{:?}", &df_join);
}
// Output
shape: (8, 6)
┌─────────────┬─────────┬───────┬─────────────┬───────┬──────────┐
│ Model       ┆ Company ┆ Sales ┆ Comment     ┆ Price ┆ Discount │
│ ---         ┆ ---     ┆ ---   ┆ ---         ┆ ---   ┆ ---      │
│ str         ┆ str     ┆ i32   ┆ str         ┆ i32   ┆ f64      │
╞═════════════╪═════════╪═══════╪═════════════╪═══════╪══════════╡
│ iPhone XS   ┆ Apple   ┆ 80    ┆ null        ┆ 2430  ┆ 0.85     │
│ iPhone 12   ┆ Apple   ┆ 170   ┆ null        ┆ 3550  ┆ 0.85     │
│ iPhone 13   ┆ Apple   ┆ 130   ┆ Sold Out    ┆ 5700  ┆ 0.8      │
│ iPhone 14   ┆ Apple   ┆ 205   ┆ New Arrival ┆ 8750  ┆ null     │
│ Samsung S11 ┆ Samsung ┆ 400   ┆ null        ┆ 2315  ┆ 0.87     │
│ Samsung S12 ┆ Samsung ┆ 30    ┆ Sold Out    ┆ 3560  ┆ null     │
│ Mi A1       ┆ Xiao Mi ┆ 14    ┆ null        ┆ 980   ┆ 0.66     │
│ Mi A2       ┆ Xiao Mi ┆ 8     ┆ null        ┆ 1420  ┆ 0.8      │
└─────────────┴─────────┴───────┴─────────────┴───────┴──────────┘
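
What an inner join does can be shown with a small hash-based sketch in plain Rust. It is only an illustration, not the Polars join algorithm; the function inner_join and the tuple layout are assumptions made for this example, and it assumes each key appears at most once in the right table. Rows from the left table that have no matching key on the right are dropped.

```rust
use std::collections::HashMap;

/// Inner join of two small tables on the model name.
/// `left` holds (model, sales) rows, `right` holds (model, price) rows.
fn inner_join<'a>(
    left: &[(&'a str, i32)],
    right: &[(&'a str, i32)],
) -> Vec<(&'a str, i32, i32)> {
    // Build a lookup table from the right side: model -> price
    let lookup: HashMap<&str, i32> = right.iter().copied().collect();
    // Keep only the left rows whose key is present in the lookup table
    left.iter()
        .filter_map(|&(model, sales)| {
            lookup.get(model).map(|&price| (model, sales, price))
        })
        .collect()
}
```

With sales rows for iPhone 13, Samsung S11, and Mi A1 and price rows for only Mi A1 and iPhone 13, the result contains two rows, in the order of the left table, and the Samsung S11 row is left out.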

Working with data

Getting a column

use polars::prelude::*;

fn main() {
    let df1: DataFrame = df![
        "D1" => &[1,2,3,4,5,6,7,78],
        "D2" => &[23,5,5,76,7,89,89,95]
    ].unwrap();
    // Get column D1 by name
    println!("{:?}", &df1["D1"]);
    // Get the second column by position (index 1)
    println!("{:?}", &df1[1]);
}
// Output
shape: (8,)
Series: 'D1' [i32]
[
    1
    2
    3
    4
    5
    6
    7
    78
]
shape: (8,)
Series: 'D2' [i32]
[
    23
    5
    5
    76
    7
    89
    89
    95
]

Selecting columns

use polars::prelude::*;

fn main() {
    // Create a DataFrame that contains null values
    let df = df! [
        // column name => column data
        "Model" => ["iPhone XS", "iPhone 12", "iPhone 13", "iPhone 14", "Samsung S11", "Samsung S12", "Mi A1", "Mi A2"],
        "Company" => ["Apple", "Apple", "Apple", "Apple", "Samsung", "Samsung", "Xiao Mi", "Xiao Mi"],
        "Sales" => [80, 170, 130, 205, 400, 30, 14, 8],
        "Comment" => [None, None, Some("Sold Out"), Some("New Arrival"), None, Some("Sold Out"), None, None],
    ].unwrap();
    // Select one or more columns by name
    println!("{:?}", df.select(["Company"]).unwrap());
    // Alternative: select the first column by position
    // println!("{:?}", df.select_by_range(0..1));
    // Select the first and second columns
    // println!("{:?}", df.select_by_range(0..=1));
}
// Output
shape: (8, 1)
┌─────────┐
│ Company │
│ ---     │
│ str     │
╞═════════╡
│ Apple   │
│ Apple   │
│ Apple   │
│ Apple   │
│ Samsung │
│ Samsung │
│ Xiao Mi │
│ Xiao Mi │
└─────────┘

Slicing

use polars::prelude::*;

fn main() {
    let df1: DataFrame = df![
        "D1" => &[1,2,3,4,5,6,7,78],
        "D2" => &[23,5,5,76,7,89,89,95]
    ].unwrap();
    // Start at the third row (offset 2) and take 4 rows
    let temp = df1.slice(2, 4);
    println!("{:?}", temp);
}
// Output
┌─────┬─────┐
│ D1  ┆ D2  │
│ --- ┆ --- │
│ i32 ┆ i32 │
╞═════╪═════╡
│ 3   ┆ 5   │
│ 4   ┆ 76  │
│ 5   ┆ 7   │
│ 6   ┆ 89  │
└─────┴─────┘

Getting a single value

use polars::prelude::*;

fn main() {
    let df1: DataFrame = df![
        "D1" => &[1,2,3,4,5,6,7,78],
        "D2" => &[23,5,5,76,7,89,89,95]
    ].unwrap();
    // Get the fourth value (index 3) of the second column (index 1)
    println!("{:?}", df1[1].get(3).unwrap());
}
// Output
Int32(76)

Replacing a column

use polars::prelude::*;

fn main() {
    let mut df1: DataFrame = df![
        "D1" => &[1,2,3,4,5,6,7,78],
        "D2" => &[23,5,5,76,7,89,89,95]
    ].unwrap();
    // Create the replacement column
    let new_d1 = Series::new("", &[100,2,3,4,5,6,7,78]);
    // Overwrite column D1 in place; replace() returns a mutable reference to the same DataFrame, not a new one
    let df2 = df1.replace("D1", new_d1).unwrap();
    println!("{:?}", df2);
}
// Output
shape: (8, 2)
┌─────┬─────┐
│ D1  ┆ D2  │
│ --- ┆ --- │
│ i32 ┆ i32 │
╞═════╪═════╡
│ 100 ┆ 23  │
│ 2   ┆ 5   │
│ 3   ┆ 5   │
│ 4   ┆ 76  │
│ 5   ┆ 7   │
│ 6   ┆ 89  │
│ 7   ┆ 89  │
│ 78  ┆ 95  │
└─────┴─────┘

Dropping a column

use polars::prelude::*;

fn main() {
    // Create a DataFrame that contains null values
    let df = df! [
        // column name => column data
        "Model" => ["iPhone XS", "iPhone 12", "iPhone 13", "iPhone 14", "Samsung S11", "Samsung S12", "Mi A1", "Mi A2"],
        "Company" => ["Apple", "Apple", "Apple", "Apple", "Samsung", "Samsung", "Xiao Mi", "Xiao Mi"],
        "Sales" => [80, 170, 130, 205, 400, 30, 14, 8],
        "Comment" => [None, None, Some("Sold Out"), Some("New Arrival"), None, Some("Sold Out"), None, None],
    ].unwrap();
    // Drop the whole Company column
    let df2 = df.drop_many(&["Company"]);
    println!("{}", df2);
}

Data cleaning

Dropping rows with missing values

use polars::prelude::*;

fn main() {
    // Create a DataFrame that contains null values
    let df = df! [
        // column name => column data
        "Model" => ["iPhone XS", "iPhone 12", "iPhone 13", "iPhone 14", "Samsung S11", "Samsung S12", "Mi A1", "Mi A2"],
        "Company" => ["Apple", "Apple", "Apple", "Apple", "Samsung", "Samsung", "Xiao Mi", "Xiao Mi"],
        "Sales" => [80, 170, 130, 205, 400, 30, 14, 8],
        "Comment" => [None, None, Some("Sold Out"), Some("New Arrival"), None, Some("Sold Out"), None, None],
    ].unwrap();
    // Drop every row whose Comment value is null
    let df2 = df.drop_nulls(Some(&["Comment"])).unwrap();
    println!("{}", df2);
}
// Output
shape: (3, 4)
┌─────────────┬─────────┬───────┬─────────────┐
│ Model       ┆ Company ┆ Sales ┆ Comment     │
│ ---         ┆ ---     ┆ ---   ┆ ---         │
│ str         ┆ str     ┆ i32   ┆ str         │
╞═════════════╪═════════╪═══════╪═════════════╡
│ iPhone 13   ┆ Apple   ┆ 130   ┆ Sold Out    │
│ iPhone 14   ┆ Apple   ┆ 205   ┆ New Arrival │
│ Samsung S12 ┆ Samsung ┆ 30    ┆ Sold Out    │
└─────────────┴─────────┴───────┴─────────────┘

Filling missing values

use polars::prelude::*;

fn main() {
    // Create a DataFrame that contains null values
    let df = df! [
        // column name => column data
        "Model" => ["iPhone XS", "iPhone 12", "iPhone 13", "iPhone 14", "Samsung S11", "Samsung S12", "Mi A1", "Mi A2"],
        "Company" => ["Apple", "Apple", "Apple", "Apple", "Samsung", "Samsung", "Xiao Mi", "Xiao Mi"],
        "Sales" => [80, 170, 130, 205, 400, 30, 14, 8],
        "Comment" => [None, None, Some("Sold Out"), Some("New Arrival"), None, Some("Sold Out"), None, None],
    ].unwrap();
    // Forward fill: each null takes the most recent non-null value above it in the same column
    let df2 = df.fill_null(FillNullStrategy::Forward(None)).unwrap();
    // Forward(Option): fill downward with the last non-null value seen; the Option limits how many consecutive nulls are filled (None means no limit)
    // Backward(Option): fill upward with the next non-null value below; the Option is the same kind of limit
    // Mean: fill with the arithmetic mean
    // Min: fill with the minimum value
    // Max: fill with the maximum value
    // Zero: fill with 0
    // One: fill with 1
    // MaxBound: fill with the upper bound of the data type's range
    // MinBound: fill with the lower bound of the data type's range
    println!("{}", df2);
}
// Output (the first two Comment values stay null because no value precedes them)
shape: (8, 4)
┌─────────────┬─────────┬───────┬─────────────┐
│ Model       ┆ Company ┆ Sales ┆ Comment     │
│ ---         ┆ ---     ┆ ---   ┆ ---         │
│ str         ┆ str     ┆ i32   ┆ str         │
╞═════════════╪═════════╪═══════╪═════════════╡
│ iPhone XS   ┆ Apple   ┆ 80    ┆ null        │
│ iPhone 12   ┆ Apple   ┆ 170   ┆ null        │
│ iPhone 13   ┆ Apple   ┆ 130   ┆ Sold Out    │
│ iPhone 14   ┆ Apple   ┆ 205   ┆ New Arrival │
│ Samsung S11 ┆ Samsung ┆ 400   ┆ New Arrival │
│ Samsung S12 ┆ Samsung ┆ 30    ┆ Sold Out    │
│ Mi A1       ┆ Xiao Mi ┆ 14    ┆ Sold Out    │
│ Mi A2       ┆ Xiao Mi ┆ 8     ┆ Sold Out    │
└─────────────┴─────────┴───────┴─────────────┘
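
The forward-fill behaviour seen in the output can be reproduced with a short plain-Rust sketch. It is an illustration of the semantics rather than the Polars implementation; forward_fill is a name invented here, and its limit parameter is assumed to mean the maximum number of consecutive nulls that get filled after a value.

```rust
/// Forward fill: replace each `None` with the most recent preceding `Some`.
/// `limit` caps how many consecutive `None`s are filled; `None` means no cap.
fn forward_fill<T: Clone>(values: &[Option<T>], limit: Option<usize>) -> Vec<Option<T>> {
    let mut last: Option<T> = None; // most recent non-null value seen so far
    let mut run = 0usize; // length of the current run of nulls
    values
        .iter()
        .map(|v| match v {
            Some(x) => {
                last = Some(x.clone());
                run = 0;
                Some(x.clone())
            }
            None => {
                run += 1;
                if limit.map_or(true, |l| run <= l) {
                    last.clone() // still `None` if nothing precedes this null
                } else {
                    None
                }
            }
        })
        .collect()
}
```

Run on the Comment column with no limit, it produces the same column as the table above. With a limit of 1, the last value of the column stays empty because it is the second null in a row.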

Filtering data

Filtering by condition

use polars::prelude::*;

fn main() {
    // Create a DataFrame that contains null values
    let df = df! [
        // column name => column data
        "Model" => ["iPhone XS", "iPhone 12", "iPhone 13", "iPhone 14", "Samsung S11", "Samsung S12", "Mi A1", "Mi A2"],
        "Company" => ["Apple", "Apple", "Apple", "Apple", "Samsung", "Samsung", "Xiao Mi", "Xiao Mi"],
        "Sales" => [80, 170, 130, 205, 400, 30, 14, 8],
        "Comment" => [None, None, Some("Sold Out"), Some("New Arrival"), None, Some("Sold Out"), None, None],
    ].unwrap();
    // Build the filter mask: rows where the Company column equals Samsung
    let mask = df.column("Company").unwrap().equal("Samsung").unwrap();
    // Masks can be combined with logical operators (and &, or |, not !),
    // for example to keep Apple rows with sales above 100
    // (the ? operator requires main to return a PolarsResult)
    // let mask = df.column("Company")?.equal("Apple")? & df.column("Sales")?.gt(100)?;
    println!("{:?}", df.filter(&mask).unwrap());
}
// Output
shape: (2, 4)
┌─────────────┬─────────┬───────┬──────────┐
│ Model       ┆ Company ┆ Sales ┆ Comment  │
│ ---         ┆ ---     ┆ ---   ┆ ---      │
│ str         ┆ str     ┆ i32   ┆ str      │
╞═════════════╪═════════╪═══════╪══════════╡
│ Samsung S11 ┆ Samsung ┆ 400   ┆ null     │
│ Samsung S12 ┆ Samsung ┆ 30    ┆ Sold Out │
└─────────────┴─────────┴───────┴──────────┘
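
The mask-based filtering used above follows a simple pattern: compute one boolean per row, optionally combine several such masks, then keep the rows marked true. The sketch below shows that pattern in plain Rust; it is not Polars code, and the helper names mask_by, mask_and, and filter_rows are made up for this illustration.

```rust
/// Build a boolean mask by testing every element of a column.
fn mask_by<T>(column: &[T], pred: impl Fn(&T) -> bool) -> Vec<bool> {
    column.iter().map(pred).collect()
}

/// Element-wise AND of two masks of equal length.
fn mask_and(a: &[bool], b: &[bool]) -> Vec<bool> {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(&x, &y)| x && y).collect()
}

/// Keep only the rows whose mask entry is `true`.
fn filter_rows<T: Clone>(rows: &[T], mask: &[bool]) -> Vec<T> {
    assert_eq!(rows.len(), mask.len());
    rows.iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(row, _)| row.clone())
        .collect()
}
```

Combining a mask for Company equal to "Apple" with a mask for Sales greater than 100 and applying it to the Model column of the example keeps iPhone 12, iPhone 13, and iPhone 14.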

 

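Partly based on: https://jarod.blog.csdn.net/article/details/127896326

Author: 芋白🥕

Link to this article: https://www.cnblogs.com/-CO-/p/18152669

License: this work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 2.5 China Mainland license.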
