pyspark - Logistic Regression

I dug this up while tidying my files. It feels like code from a long time ago, but looking at it again, it still holds up - at least the comments are pretty clear. Back then I really was a total "just call the package" man....

# Logistic regression - the standard recipe

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
import pandas as pd

# A SparkSession is needed below to turn the pandas DataFrame into a Spark DataFrame
spark = SparkSession.builder.appName("lr-affairs-demo").getOrCreate()

# 1. Prepare the data - a small sample dataset
sample_dataset = [
    (0, "male", 37, 10, "no", 3, 18, 7, 4),
    (0, "female", 27, 4, "no", 4, 14, 6, 4),
    (0, "female", 32, 15, "yes", 1, 12, 1, 4),
    (0, "male", 57, 15, "yes", 5, 18, 6, 5),
    (0, "male", 22, 0.75, "no", 2, 17, 6, 3),
    (0, "female", 32, 1.5, "no", 2, 17, 5, 5),
    (0, "female", 22, 0.75, "no", 2, 12, 1, 3),
    (0, "male", 57, 15, "yes", 2, 14, 4, 4),
    (0, "female", 32, 15, "yes", 4, 16, 1, 2),
    (0, "male", 22, 1.5, "no", 4, 14, 4, 5),
    (0, "male", 37, 15, "yes", 2, 20, 7, 2),
    (0, "male", 27, 4, "yes", 4, 18, 6, 4),
    (0, "male", 47, 15, "yes", 5, 17, 6, 4),
    (0, "female", 22, 1.5, "no", 2, 17, 5, 4),
    (0, "female", 27, 4, "no", 4, 14, 5, 4),
    (0, "female", 37, 15, "yes", 1, 17, 5, 5),
    (0, "female", 37, 15, "yes", 2, 18, 4, 3),
    (0, "female", 22, 0.75, "no", 3, 16, 5, 4),
    (0, "female", 22, 1.5, "no", 2, 16, 5, 5),
    (0, "female", 27, 10, "yes", 2, 14, 1, 5),
    (1, "female", 32, 15, "yes", 3, 14, 3, 2),
    (1, "female", 27, 7, "yes", 4, 16, 1, 2),
    (1, "male", 42, 15, "yes", 3, 18, 6, 2),
    (1, "female", 42, 15, "yes", 2, 14, 3, 2),
    (1, "male", 27, 7, "yes", 2, 17, 5, 4),
    (1, "male", 32, 10, "yes", 4, 14, 4, 3),
    (1, "male", 47, 15, "yes", 3, 16, 4, 2),
    (0, "male", 37, 4, "yes", 2, 20, 6, 4)
]

columns = ["affairs", "gender", "age", "yearsmarried", "children", "religiousness", "education", "occupation", "rating"]

# Build the DataFrame via pandas first - convenient - then hand it over to Spark
pdf = pd.DataFrame(sample_dataset, columns=columns)
df = spark.createDataFrame(pdf)

# 2. Feature selection: "affairs" is the target, the rest are features - in real work this is
#    the most tedious part: many tables to join, lots of data cleaning
df2 = df.select("affairs", "age", "religiousness", "education", "occupation", "rating")

# 3. Assemble features - merge the feature columns into a single "features" column; categorical
#    columns would need one-hot encoding first (see the sketch below), which is fairly tedious
# 3.1 Columns that go into the feature vector
colArray2 = ["age", "religiousness", "education", "occupation", "rating"]
# 3.2 Compute the feature vector
df3 = VectorAssembler().setInputCols(colArray2).setOutputCol("features").transform(df2)
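
# The code above simply drops the categorical columns ("gender", "children"); if you wanted
# to keep one, here is a minimal sketch assuming the Spark 3.x OneHotEncoder API:
#
#   from pyspark.ml.feature import StringIndexer, OneHotEncoder
#   indexer = StringIndexer(inputCol="gender", outputCol="gender_idx")
#   encoder = OneHotEncoder(inputCols=["gender_idx"], outputCols=["gender_vec"])
#   tmp = indexer.fit(df).transform(df)    # string label -> numeric index
#   tmp = encoder.fit(tmp).transform(tmp)  # numeric index -> sparse one-hot vector
#   # "gender_vec" would then be appended to the VectorAssembler input columns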

# 4. Split into training and test sets (random; pass a seed to randomSplit for reproducibility)
trainDF, testDF = df3.randomSplit([0.8, 0.2])
# print("训练集:")
# trainDF.show(10)
# print("测试集:")
# testDF.show(10)

# 5. Train the model
from pyspark.ml.classification import LogisticRegression
# 5.1 Create the logistic regression estimator
lr = LogisticRegression()
# 5.2 Fit the model, pointing it at our label and feature columns
model = lr.setLabelCol("affairs").setFeaturesCol("features").fit(trainDF)
# 5.3 Predict on the test set
model.transform(testDF).show()
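
# transform() appends "rawPrediction", "probability" and "prediction" columns to the
# input; to inspect just the interesting ones:
#
#   model.transform(testDF).select("affairs", "probability", "prediction").show()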

# todo
# 6. Evaluation, cross-validation, saving, packaging.....
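
# For reference, a minimal sketch of what step 6 could look like - the column names match
# the code above, and the save path is just a placeholder:
#
#   from pyspark.ml.evaluation import BinaryClassificationEvaluator
#   evaluator = BinaryClassificationEvaluator(labelCol="affairs", metricName="areaUnderROC")
#   print("test AUC:", evaluator.evaluate(model.transform(testDF)))
#
#   # cross-validated grid search over the regularization strength
#   from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
#   grid = ParamGridBuilder().addGrid(lr.regParam, [0.0, 0.01, 0.1]).build()
#   cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
#                       evaluator=evaluator, numFolds=3)
#   cvModel = cv.fit(trainDF)
#
#   # persist the fitted model; LogisticRegressionModel.load() brings it back
#   model.write().overwrite().save("/tmp/lr_affairs_model")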

This is mainly kept as a historical note, but also as a cautionary example: if you just call packages without understanding the theory behind them, you'll find that ML is actually rather boring - at least that's how it looks from the code recipe above.
