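On CoreML needing randn to be generated externally

coremltools freezes the randn result at conversion time, so a true randn has to be generated outside the model and passed in as an MLMultiArray input.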
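
To make "passed in as an MLMultiArray input" concrete, here is a minimal sketch of wrapping an externally generated tensor as a model input. The helper name `makeNoiseInput` and the input name "noise" are assumptions for illustration; the real name depends on what your converted model declares.

```swift
import CoreML

/// Wraps an externally generated noise tensor as a CoreML model input.
/// `inputName` is hypothetical: use the input name your converted model declares.
func makeNoiseInput(noise: MLMultiArray, inputName: String = "noise") throws -> MLFeatureProvider {
    try MLDictionaryFeatureProvider(dictionary: [inputName: MLFeatureValue(multiArray: noise)])
}
```

The resulting provider can then be handed to `MLModel.prediction(from:)`, with fresh noise generated before every call.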

TL;DR

Use BNNS

import Accelerate
import CoreML

@available(iOS 16.0, *)
extension MLMultiArray {
    
    /// Creates a Float32 MLMultiArray filled with samples from a normal distribution.
    static func randnFP32(shape: consuming [NSNumber], mean: Float = 0, std: Float = 1) throws -> MLMultiArray {
        let arr = try MLMultiArray(shape: shape, dataType: .float32)
        let cnt = arr.count
        
        // Fill the array's backing storage in place, treating it as a flat vector.
        arr.withUnsafeMutableBufferPointer(ofType: Float.self) { ptr, _ in
            guard var des = BNNSNDArrayDescriptor(data: ptr, shape: .vector(cnt)),
                  let gen = BNNSCreateRandomGenerator(BNNSRandomGeneratorMethodAES_CTR, nil)
            else { fatalError("Failed to create the BNNS descriptor or random generator") }
            BNNSRandomFillNormalFloat(gen, &des, mean, std)
            BNNSDestroyRandomGenerator(gen)
        }
        return arr
    }
}

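Messing around

Taking FP32 as the example, there are roughly three ways (that I have tried) to implement randn on iOS.

  1. BoxMuller (manual)

    torch has an implementation, so just copy it over.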

    import Foundation
    
    func boxMuller(count: Int) -> [Float32] {
        let mean: Float32 = 0, std: Float32 = 1
        // Round up to a multiple of 16 so every block holds 8 (u1, u2) pairs.
        var arr = (0..<(count/16 + 1) * 16).map { _ in Float32.random(in: 0..<1) }
        for i in stride(from: 0, to: arr.count, by: 16) {
            for j in i..<(i+8) {
                let u1 = 1 - arr[j]  // in (0, 1], so log(u1) stays finite
                let u2 = arr[j + 8]
                let radius = sqrt(-2 * log(u1))
                let theta = 2 * Float32.pi * u2
                
                arr[j] = radius * cos(theta) * std + mean
                arr[j+8] = radius * sin(theta) * std + mean
            }
        }
        // Drop the padding so exactly `count` samples are returned.
        return Array(arr.prefix(count))
    }
    

    Generated in a plain loop with no acceleration, so its performance degrades the most as the tensor grows.
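
For comparison, here is a rough sketch of how the same transform could be vectorized with Accelerate's vForce and vDSP. This is not the torch code or anything measured in this post: the function name `boxMullerVectorized` is made up, the uniform samples are still drawn in a scalar loop, and no timing claim is implied.

```swift
import Accelerate

/// Box-Muller with the log/sqrt/cos/sin steps done on whole arrays.
/// A sketch under the assumptions above, not a tuned implementation.
@available(iOS 13.0, macOS 10.15, *)
func boxMullerVectorized(count: Int, mean: Float = 0, std: Float = 1) -> [Float] {
    precondition(count >= 0, "count must be non-negative")
    let half = (count + 1) / 2  // each (u1, u2) pair yields two samples

    // u1 lies in (0, 1] so log(u1) is finite; u2 lies in [0, 1).
    let u1 = (0..<half).map { _ in 1 - Float.random(in: 0..<1) }
    let u2 = (0..<half).map { _ in Float.random(in: 0..<1) }

    // radius = std * sqrt(-2 * ln(u1))
    let radius = vDSP.multiply(std, vForce.sqrt(vDSP.multiply(-2, vForce.log(u1))))
    // theta = 2 * pi * u2
    let theta = vDSP.multiply(2 * Float.pi, u2)

    // First half uses cos(theta), second half uses sin(theta).
    var out = vDSP.add(mean, vDSP.multiply(radius, vForce.cos(theta)))
    out += vDSP.add(mean, vDSP.multiply(radius, vForce.sin(theta)))
    return Array(out.prefix(count))  // trim the extra sample when count is odd
}
```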

  2. MPS

    Here BoxMuller is provided by MPSGraph.

    import MetalPerformanceShadersGraph
    
    @available(iOS 15.4, *)
    func mps(count: Int, seed: Int = .random(in: 0..<Int.max)) -> [Float32] {
        guard let op = MPSGraphRandomOpDescriptor(distribution: .normal, dataType: .float32) else { fatalError() }
        op.samplingMethod = .boxMuller
        let graph = MPSGraph()
        
        let y = graph.randomTensor(withShape: [count as NSNumber], descriptor: consume op, seed: seed, name: nil)
        
        guard let yData = graph.run(
            feeds: [:],
            targetTensors: [y],
            targetOperations: nil
        )[consume y] else { fatalError() }
        
        var arr = [Float32](repeating: 0, count: count)
        yData.mpsndarray().readBytes(&arr, strideBytes: nil)
        return arr
    }
    

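    Getting the GPU involved is nice, but since the inputs are mostly MLMultiArray, using it on its own brings little advantage. It only becomes worthwhile if the rest of the network is moved into the compute graph as well and benefits from compile-time optimization. Treat this as a starting point only.

    As everyone knows, IO is more likely to hold things back than raw compute.

  3. BNNS

    With love, from Accelerate.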

    import Accelerate
    
    @available(iOS 16.0, *)
    func bnns(count: Int) -> [Float32] {
        let mean: Float32 = 0, std: Float32 = 1
        let arr = [Float32](unsafeUninitializedCapacity: count) { buffer, initializedCount in
            guard var des = BNNSNDArrayDescriptor(data: buffer, shape: .vector(count)),
                  let gen = BNNSCreateRandomGenerator(BNNSRandomGeneratorMethodAES_CTR, nil)
            else { fatalError() }
            
            BNNSRandomFillNormalFloat(gen, &des, mean, std)
            BNNSDestroyRandomGenerator(gen)
            initializedCount = count
        }
        return arr
    }
    

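    It is not true BoxMuller, so it will not satisfy a perfectionist's demand for rigor, but under a Jarque–Bera test on 2k generated random numbers every method shows nothing beyond random fluctuation.

    MLTensor looks like a wrapper around BNNS, yet for some reason its performance is worse; laziness kicked in and I gave up investigating.

    In the face of raw performance, every such worry is a paper tiger.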
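
For reference, the Jarque–Bera statistic mentioned above is JB = n/6 * (S^2 + (K - 3)^2 / 4), where S is the sample skewness and K the sample kurtosis. The sketch below is one straightforward way to compute it and is not necessarily the code used for this post; the function name `jarqueBera` is my own. Under normality JB is approximately chi-squared with 2 degrees of freedom, so values well below about 5.99 give no reason to reject normality at the 5% level.

```swift
import Foundation

/// Jarque-Bera statistic: n/6 * (S^2 + (K - 3)^2 / 4),
/// with S and K computed from biased central moments.
func jarqueBera(_ samples: [Float]) -> Double {
    precondition(samples.count > 1, "need at least two samples")
    let n = Double(samples.count)
    let mean = samples.reduce(0.0) { $0 + Double($1) } / n

    // Accumulate the 2nd, 3rd and 4th central moments in Double precision.
    var m2 = 0.0, m3 = 0.0, m4 = 0.0
    for x in samples {
        let d = Double(x) - mean
        let d2 = d * d
        m2 += d2
        m3 += d2 * d
        m4 += d2 * d2
    }
    m2 /= n
    m3 /= n
    m4 /= n
    precondition(m2 > 0, "samples must not all be equal")

    let skewness = m3 / pow(m2, 1.5)
    let kurtosis = m4 / (m2 * m2)
    return n / 6 * (skewness * skewness + pow(kurtosis - 3, 2) / 4)
}
```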

Performance test

Of limited reference value: each method generates [100, 1000, 10_000, 100_000] random numbers in one go.

Method              Time (ms)
BNNS                2
MPS                 74
BoxMuller (manual)  148
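
The post does not show how these numbers were measured. A timing harness along the following lines would be one way to reproduce the comparison; the helper name `measureMilliseconds` and the choice of a single untimed warm-up call are assumptions, not the original setup.

```swift
import Foundation

/// Times one pass of `generate` over all `sizes` and returns the elapsed milliseconds.
/// Hypothetical helper: the original measurement code is not shown in the post.
func measureMilliseconds(
    sizes: [Int] = [100, 1000, 10_000, 100_000],
    _ generate: (Int) -> [Float]
) -> Double {
    // Untimed warm-up so one-time setup cost is not counted.
    _ = generate(sizes.first ?? 1)

    let start = DispatchTime.now().uptimeNanoseconds
    for n in sizes {
        _ = generate(n)
    }
    let end = DispatchTime.now().uptimeNanoseconds
    return Double(end - start) / 1_000_000
}
```

For example, `measureMilliseconds { boxMuller(count: $0) }` would time the manual version above; the other generators plug in the same way.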
posted @ 2024-10-02 19:16  Simon_X