How to split a Spark DataFrame into two parts by a given ratio

 

 

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession


def split2df(prod_df, ratio=0.8):
    # Number of rows that go into the first split
    length = int(prod_df.count() * ratio)

    # First split: the first `length` rows.
    # Note: without an explicit orderBy, limit() takes rows
    # in partition order, which is not guaranteed to be
    # stable across runs.
    temp_df = prod_df.limit(length)

    # Second split: everything that is not in the first one.
    # Note: subtract() also removes duplicate rows, so this
    # is only safe when all rows are distinct.
    temp_df2 = prod_df.subtract(temp_df)

    return temp_df, temp_df2
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
# Column names for the dataframe
columns = ["Brand", "Product"]
 
# Row data for the dataframe
data = [
    ("HP", "Laptop"),
    ("Lenovo", "Mouse"),
    ("Dell", "Keyboard"),
    ("Samsung", "Monitor"),
    ("MSI", "Graphics Card"),
    ("Asus", "Motherboard"),
    ("Gigabyte", "Motherboard"),
    ("Zebronics", "Cabinet"),
    ("Adata", "RAM"),
    ("Transcend", "SSD"),
    ("Kingston", "HDD"),
    ("Toshiba", "DVD Writer")
]
 
# Create the dataframe using the above values
prod_df = spark.createDataFrame(data=data,
                                schema=columns)
 
 
# View the dataframe
prod_df.show()
df1, df2 = split2df(prod_df)
df1.show(truncate=False)
df2.show(truncate=False)

  

Split result:

+---------+-------------+
|    Brand|      Product|
+---------+-------------+
|       HP|       Laptop|
|   Lenovo|        Mouse|
|     Dell|     Keyboard|
|  Samsung|      Monitor|
|      MSI|Graphics Card|
|     Asus|  Motherboard|
| Gigabyte|  Motherboard|
|Zebronics|      Cabinet|
|    Adata|          RAM|
|Transcend|          SSD|
| Kingston|          HDD|
|  Toshiba|   DVD Writer|
+---------+-------------+

+---------+-------------+
|Brand    |Product      |
+---------+-------------+
|HP       |Laptop       |
|Lenovo   |Mouse        |
|Dell     |Keyboard     |
|Samsung  |Monitor      |
|MSI      |Graphics Card|
|Asus     |Motherboard  |
|Gigabyte |Motherboard  |
|Zebronics|Cabinet      |
|Adata    |RAM          |
+---------+-------------+

+---------+----------+
|Brand    |Product   |
+---------+----------+
|Transcend|SSD       |
|Toshiba  |DVD Writer|
|Kingston |HDD       |
+---------+----------+

Note that subtract() does not preserve row order, which is why the rows of the second split (SSD, DVD Writer, HDD) come out in a different order than they appear in the input.

 

Reference:

https://www.geeksforgeeks.org/pyspark-split-dataframe-into-equal-number-of-rows/

posted @ bonelee  Views(598)  Comments(0)