Python-MySQL抽取表中每个分组的前N个数据
在超大规模数据处理时,有时候需要按分组取出一部分数据,作为整体数据的参考。
例如,有表table_1,按column_1可以划分为200组,现在我们希望从这200组中每组各取出10条记录作为代表。
我们可以通过以下Python语句操作MySQL,来实现此功能:
import mysql.connector conn = mysql.connector.connect(host = '127.0.0.1', port = 3306, user = 'root', password = 'xxx', db = 'xxx' ) cursor = conn.cursor() for i in range(0,200): #注意limit i,j 时 i是从0开始索引的 sql = "INSERT INTO Sample_table_1 SELECT t.column_1, t.column_2 FROM table_1 t, (SELECT * FROM table_1 group by column1 limit %s,1) p where t.column1=p.column1 limit 10;"%i cursor.execute(sql) conn.commit() conn.close()
当然,由于这是个插入操作,我们也可以使用excutemany来实现:
import mysql.connector conn = mysql.connector.connect(host = '127.0.0.1', port = 3306, user = 'root', password = 'xxx', db = 'xxx' ) cursor = conn.cursor() cursor.executemany("INSERT INTO Sample_table_1 SELECT t.column_1, t.column_2 FROM table_1 t, (SELECT * FROM table_1 group by column1 limit %s,1) p where t.column1=p.column1 limit 10;", [(i,) for i in range(0,200)] ) #注意必须保留(i,)中的",",否则会报错 conn.commit() conn.close()
再或者,也可以使用MySQL存储过程。
import mysql.connector conn = mysql.connector.connect(host = '127.0.0.1', port = 3306, user = 'root', password = 'xxx', db = 'xxx' ) cursor = conn.cursor() sql1="drop procedure if exists test;" sql2="create procedure test() \ begin\ declare i bigint default 0; \ DECLARE var BIGINT DEFAULT 0; \ while i < 200 do \ INSERT INTO Sample_table_1 SELECT t.column_1, t.column_2 FROM table_1 t, (SELECT * FROM table_1 group by column1 limit i,1) p where t.column1=p.column1 limit 10;\ set i = i + 1;\ end while;\ end;" sql3="call test;" cursor.execute(sql1) cursor.execute(sql2) cursor.execute(sql3) conn.commit() conn.close()
就执行效率来说,excutemany较excute略高一些,存储过程的效率最低。