Spark repartition
repartitionByRange
repartitionByRange(numPartitions, *cols) is a method of pyspark.sql.dataframe.DataFrame.

It returns a new DataFrame partitioned by the given partitioning expressions; the resulting DataFrame is range partitioned.

numPartitions: can be an int to specify the target number of partitions, or a Column. If it is a Column, it will be used as the first partitioning column. If not specified, the default number of partitions is used.

At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed.
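A minimal sketch of the behavior, assuming a local SparkSession; the toy DataFrame and the 4-partition count are made up purely for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()

# Toy DataFrame with a single numeric column "id" (illustrative data only)
sample = spark.range(0, 1000)

# Split the id range into 4 contiguous buckets, sorted "ascending nulls first" by default
ranged = sample.repartitionByRange(4, "id")
print(ranged.rdd.getNumPartitions())  # 4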
import time

begin = time.time()
df = merge_data  # DataFrame produced by the earlier merge step

# Range-partition into 10 partitions by probeset_id, then append to the Delta table at path f
df.repartitionByRange(10, "probeset_id") \
  .write.format("delta") \
  .mode("append") \
  .save(f)

print(time.time() - begin)
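To sanity-check how rows are spread across the 10 range partitions before the write, spark_partition_id can be used to count rows per physical partition. A sketch, reusing the merge_data DataFrame from above:

from pyspark.sql.functions import spark_partition_id

(merge_data.repartitionByRange(10, "probeset_id")
    .withColumn("pid", spark_partition_id())  # partition each row landed in
    .groupBy("pid").count()
    .orderBy("pid")
    .show())

Heavily skewed counts here suggest the probeset_id values are unevenly distributed, which also shows up as uneven file sizes in the Delta output.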