How to split a Spark DataFrame into two parts by a given ratio
# Import SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession


def split2df(prod_df, ratio=0.8):
    # Number of rows that go into the first dataframe
    length = int(prod_df.count() * ratio)

    # Take the first `length` rows for the first split
    temp_df = prod_df.limit(length)

    # The second split is everything not in the first one.
    # Note: subtract() also deduplicates rows and does not preserve
    # the original row order, so this assumes all rows are distinct.
    temp_df2 = prod_df.subtract(temp_df)

    return temp_df, temp_df2


# Create a SparkSession and give the app a name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# Column names for the dataframe
columns = ["Brand", "Product"]

# Row data for the dataframe
data = [
    ("HP", "Laptop"),
    ("Lenovo", "Mouse"),
    ("Dell", "Keyboard"),
    ("Samsung", "Monitor"),
    ("MSI", "Graphics Card"),
    ("Asus", "Motherboard"),
    ("Gigabyte", "Motherboard"),
    ("Zebronics", "Cabinet"),
    ("Adata", "RAM"),
    ("Transcend", "SSD"),
    ("Kingston", "HDD"),
    ("Toshiba", "DVD Writer"),
]

# Create the dataframe from the values above
prod_df = spark.createDataFrame(data=data, schema=columns)

# View the dataframe
prod_df.show()

df1, df2 = split2df(prod_df)
df1.show(truncate=False)
df2.show(truncate=False)
Output (the original dataframe, then the two splits):
+---------+-------------+
| Brand| Product|
+---------+-------------+
| HP| Laptop|
| Lenovo| Mouse|
| Dell| Keyboard|
| Samsung| Monitor|
| MSI|Graphics Card|
| Asus| Motherboard|
| Gigabyte| Motherboard|
|Zebronics| Cabinet|
| Adata| RAM|
|Transcend| SSD|
| Kingston| HDD|
| Toshiba| DVD Writer|
+---------+-------------+
+---------+-------------+
|Brand |Product |
+---------+-------------+
|HP |Laptop |
|Lenovo |Mouse |
|Dell |Keyboard |
|Samsung |Monitor |
|MSI |Graphics Card|
|Asus |Motherboard |
|Gigabyte |Motherboard |
|Zebronics|Cabinet |
|Adata |RAM |
+---------+-------------+
+---------+----------+
|Brand |Product |
+---------+----------+
|Transcend|SSD |
|Toshiba |DVD Writer|
|Kingston |HDD |
+---------+----------+
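Note that subtract() deduplicates rows and makes no guarantee about row order, which is why the second split above comes back in a different order than the input; the limit()/subtract() approach only works cleanly when every row is distinct. When the split does not need exact row counts, PySpark's built-in randomSplit() is the usual tool for proportional splits. Below is a minimal sketch; the weights and seed value are illustrative, not part of the original article.

# Approximate 80/20 split using the built-in randomSplit().
# The weights are proportions, not exact row counts, so each
# run produces roughly, not exactly, 80% and 20% of the rows.
train_df, test_df = prod_df.randomSplit([0.8, 0.2], seed=42)

train_df.show(truncate=False)
test_df.show(truncate=False)

randomSplit() samples each row independently, so it scales to large dataframes without the full count() and subtract() passes that split2df() performs, at the cost of exact split sizes.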
Reference:
https://www.geeksforgeeks.org/pyspark-split-dataframe-into-equal-number-of-rows/