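On CoreML needing randn generated externally
After conversion, coremltools fixes the randn result inside the model. To get a real randn, generate the noise outside the model and pass it in as an MLMultiArray input.
TL;DR
Use BNNS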
import Accelerate
import CoreML

@available(iOS 16.0, *)
extension MLMultiArray {
    /// Returns a new Float32 MLMultiArray filled with normally distributed samples.
    static func randnFP32(shape: consuming [NSNumber], mean: Float = 0, std: Float = 1) throws -> MLMultiArray {
        let arr = try MLMultiArray(shape: shape, dataType: .float32)
        let cnt = arr.count
        // Fill the array's backing storage in place, treating it as a flat vector.
        arr.withUnsafeMutableBufferPointer(ofType: Float.self) { ptr, strides in
            guard var des = BNNSNDArrayDescriptor(data: ptr, shape: .vector(cnt)),
                  let gen = BNNSCreateRandomGenerator(BNNSRandomGeneratorMethodAES_CTR, nil)
            else { fatalError() }
            BNNSRandomFillNormalFloat(gen, &des, mean, std)
            BNNSDestroyRandomGenerator(gen)
        }
        return arr
    }
}
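Once the noise exists as an MLMultiArray, it still has to reach the model as a named input. Below is a minimal sketch of that step. The input name "noise" and both helper functions are hypothetical; use whatever input name your converted model actually declares.

```swift
import CoreML

// Hypothetical helper: wraps a noise array as a named model input.
// "noise" is an assumed input name, not one taken from the post.
func makeNoiseInput(_ noise: MLMultiArray, inputName: String = "noise") throws -> MLDictionaryFeatureProvider {
    try MLDictionaryFeatureProvider(dictionary: [inputName: MLFeatureValue(multiArray: noise)])
}

// Hypothetical helper: runs one prediction with the noise as the only input.
func predict(with model: MLModel, noise: MLMultiArray) throws -> MLFeatureProvider {
    try model.prediction(from: makeNoiseInput(noise))
}
```

With the extension above, a call might look like `try predict(with: model, noise: .randnFP32(shape: [1, 4, 64, 64]))`, where the shape is only an example.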
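Tinkering
Taking FP32 as an example, there are roughly three ways (that I have tried) to implement randn on iOS.
- BoxMuller (manual)
torch has an implementation, so this simply copies it.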
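import Foundation

func boxMuller(count: Int) -> [Float32] {
    let mean: Float32 = 0, std: Float32 = 1
    // Pad up to a multiple of 16 so the buffer splits into blocks of 8 + 8 uniforms.
    var arr = (0..<(count / 16 + 1) * 16).map { _ in Float32.random(in: 0..<1) }
    for i in stride(from: 0, to: arr.count, by: 16) {
        for j in i..<(i + 8) {
            let u1 = 1 - arr[j]  // in (0, 1], so log(u1) stays finite
            let u2 = arr[j + 8]
            let radius = sqrt(-2 * log(u1))
            let theta = 2 * Float32.pi * u2
            arr[j] = radius * cos(theta) * std + mean
            arr[j + 8] = radius * sin(theta) * std + mean
        }
    }
    // Drop the padding so exactly `count` samples are returned.
    return Array(arr.prefix(count))
}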
Generated in a plain loop with no acceleration, so its performance degrades the most as the tensor grows.
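For reference, the transform that the loop applies to each pair of uniforms can be written as follows. This is the standard Box-Muller form; the symbols $\mu$ and $\sigma$ stand for `mean` and `std` in the code.

```latex
\begin{aligned}
u_1 &\in (0, 1], \quad u_2 \in [0, 1), \\
r &= \sqrt{-2 \ln u_1}, \qquad \theta = 2\pi u_2, \\
z_0 &= \mu + \sigma\, r \cos\theta, \qquad z_1 = \mu + \sigma\, r \sin\theta
\end{aligned}
```

Each pair $(u_1, u_2)$ yields two independent normal samples, which is why the code processes the buffer in blocks of 8 + 8.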
- MPS
BoxMuller is provided by MPSGraph.
import MetalPerformanceShadersGraph

@available(iOS 15.4, *)
func mps(count: Int, seed: Int = .random(in: 0..<Int.max)) -> [Float32] {
    guard let op = MPSGraphRandomOpDescriptor(distribution: .normal, dataType: .float32)
    else { fatalError() }
    op.samplingMethod = .boxMuller
    let graph = MPSGraph()
    let y = graph.randomTensor(withShape: [count as NSNumber], descriptor: consume op, seed: seed, name: nil)
    guard let yData = graph.run(
        feeds: [:],
        targetTensors: [y],
        targetOperations: nil
    )[consume y] else { fatalError() }
    var arr = [Float32](repeating: 0, count: count)
    yData.mpsndarray().readBytes(&arr, strideBytes: nil)
    return arr
}
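Getting the GPU involved is nice, but when most inputs are MLMultiArray, using MPS on its own has no real advantage. It only becomes worthwhile if the rest of the network is also moved into the compute graph and benefits from compilation optimizations. Treat this as a starting point only.
As everyone knows, IO is more likely than compute to hold things back.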
- BNNS
Love, from Accelerate.
import Accelerate

@available(iOS 16.0, *)
func bnns(count: Int) -> [Float32] {
    let mean: Float32 = 0, std: Float32 = 1
    let arr = [Float32](unsafeUninitializedCapacity: count) { buffer, initializedCount in
        guard var des = BNNSNDArrayDescriptor(data: buffer, shape: .vector(count)),
              let gen = BNNSCreateRandomGenerator(BNNSRandomGeneratorMethodAES_CTR, nil)
        else { fatalError() }
        BNNSRandomFillNormalFloat(gen, &des, mean, std)
        BNNSDestroyRandomGenerator(gen)
        initializedCount = count
    }
    return arr
}
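It is not a true BoxMuller, which will not satisfy the perfectionists, but under a Jarque–Bera test on 2k generated samples every method shows nothing beyond random fluctuation.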
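For readers who want to repeat that check, here is a minimal sketch of the Jarque–Bera statistic. The function name is hypothetical and this is not the code used for the post; it only applies the textbook formula with biased sample moments.

```swift
import Foundation

// Jarque–Bera statistic: JB = n/6 * (S^2 + (K - 3)^2 / 4),
// where S is the sample skewness and K the sample kurtosis.
func jarqueBera(_ samples: [Float]) -> Double {
    let n = Double(samples.count)
    let xs = samples.map { Double($0) }
    let mean = xs.reduce(0, +) / n
    // Central moments of order 2, 3 and 4 (biased, divided by n).
    var m2 = 0.0, m3 = 0.0, m4 = 0.0
    for x in xs {
        let d = x - mean
        m2 += d * d
        m3 += d * d * d
        m4 += d * d * d * d
    }
    m2 /= n; m3 /= n; m4 /= n
    let skewness = m3 / pow(m2, 1.5)
    let kurtosis = m4 / (m2 * m2)
    return n / 6 * (skewness * skewness + pow(kurtosis - 3, 2) / 4)
}
```

Under normality the statistic is asymptotically chi-squared with 2 degrees of freedom, so a value below roughly 5.99 does not reject normality at the 5% level.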
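MLTensor appears to be a wrapper around BNNS, but for some reason it performs worse; laziness kicked in and I did not dig further.
In the face of raw performance, all such quibbles are paper tigers.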
Benchmark
Of limited reference value. Each method generates [100, 1000, 10_000, 100_000] random numbers, each size in a single call.
Method | Time (ms) |
---|---|
BNNS | 2 |
MPS | 74 |
BoxMuller (manual) | 148 |
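The post does not show how these numbers were measured. A rough way to get comparable figures is sketched below, assuming the table reports total wall-clock time over the four sizes. The harness name and its parameters are hypothetical.

```swift
import Foundation

// Hypothetical harness: total wall-clock time in milliseconds for one
// generator called once per size. Not the author's actual benchmark code.
func totalMilliseconds(sizes: [Int] = [100, 1000, 10_000, 100_000],
                       generator: (Int) -> [Float]) -> Double {
    let start = DispatchTime.now().uptimeNanoseconds
    for size in sizes {
        // The result is discarded; only the elapsed time matters here.
        _ = generator(size)
    }
    let end = DispatchTime.now().uptimeNanoseconds
    return Double(end - start) / 1_000_000
}
```

For example, `totalMilliseconds { boxMuller(count: $0) }` times the manual version. A single run like this is noisy, so repeat it and build in release mode before reading much into the result.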