How to Read Tens of Millions of SQLite Rows with Limited Memory

When an SQLite file is large enough, reading it into MATLAB in one shot tends to fail with an out-of-memory (OOM) error. My first attempt at reading the data in pieces used SQL's LIMIT/OFFSET parameters, but this turned out not to reduce the memory footprint. My guess is that when MATLAB skips rows via OFFSET, the skipped data still ends up in memory in some form; in other words, the further into the table you page, the more rows are skipped and the higher the memory usage climbs.
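For reference, the failed first attempt looked like the following LIMIT/OFFSET paging loop, sketched here in Python's sqlite3 module (the in-memory database, the `path_utd` table, and its columns are illustrative stand-ins for my real data). Note that, independently of what MATLAB buffers, SQLite itself must step over and discard every skipped row to satisfy an OFFSET, so each page gets more expensive the deeper you go:

```python
import sqlite3

# Illustrative setup: a tiny stand-in for the real path_utd table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE path_utd (path_utd_id INTEGER PRIMARY KEY, received_power REAL)"
)
conn.executemany(
    "INSERT INTO path_utd VALUES (?, ?)", [(i, float(i)) for i in range(1, 101)]
)

chunk = 25
offset = 0
rows = []
while True:
    # OFFSET forces SQLite to walk past (and throw away) all skipped rows,
    # so the cost of fetching each page grows with its offset.
    page = conn.execute(
        "SELECT * FROM path_utd LIMIT ? OFFSET ?", (chunk, offset)
    ).fetchall()
    if not page:
        break
    rows.extend(page)
    offset += chunk

conn.close()
```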

A better approach is to query with BETWEEN ... AND on the primary key. Because the primary key is unique and indexed, each range lookup is fast, and memory usage depends only on the number of rows returned per chunk. It works very well.

My code is as follows:

function sqlite_data = read_large_sqlite(sqlite_path, splite_num)
% Read a large SQLite table in splite_num chunks using BETWEEN on the
% primary key, so only one chunk is held in memory at a time.
% Assumes path_utd_id runs contiguously from 1 to the total row count.

conn = sqlite(sqlite_path, 'readonly');
len_query = 'SELECT COUNT(*) FROM path_utd';
total_len = table2array(fetch(conn, len_query));

sqlite_data = zeros(total_len, 4);                 % preallocate the result
segments = divide_sequence(total_len, splite_num); % chunk boundaries

for i = 1:splite_num
    query = sprintf('SELECT * FROM path_utd WHERE path_utd_id BETWEEN %d AND %d', ...
                    segments{i}(1), segments{i}(2));
    data = fetch(conn, query);
    sqlite_data(segments{i}(1):segments{i}(2), :) = [cast(data.path_id, 'double'), ...
                                                     cast(data.utd_instance_id, 'double'), ...
                                                     data.received_power, data.cir_phs];
    fprintf('read finished %d/%d\n', i, splite_num);
end
conn.close();
end
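For readers without MATLAB's Database Toolbox, the same BETWEEN-on-primary-key pattern can be sketched in Python's sqlite3 module (the schema and names below are illustrative, mirroring the MATLAB code above, including its assumption that the key runs contiguously from 1):

```python
import sqlite3

# Illustrative setup: a tiny stand-in for the real path_utd table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE path_utd (path_utd_id INTEGER PRIMARY KEY, received_power REAL)"
)
conn.executemany(
    "INSERT INTO path_utd VALUES (?, ?)", [(i, float(i)) for i in range(1, 101)]
)

(total_len,) = conn.execute("SELECT COUNT(*) FROM path_utd").fetchone()

split_num = 4
# Evenly spaced chunk boundaries over 1..total_len+1, as in divide_sequence.
bounds = [round(1 + k * total_len / split_num) for k in range(split_num + 1)]

rows = []
for lo, hi in zip(bounds[:-1], bounds[1:]):
    # The primary-key index turns this into a range scan; memory use is
    # bounded by the chunk size, not by how far into the table we are.
    rows.extend(conn.execute(
        "SELECT * FROM path_utd WHERE path_utd_id BETWEEN ? AND ?", (lo, hi - 1)
    ).fetchall())

conn.close()
```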

The helper function divide_sequence used above is as follows:

function segments = divide_sequence(N, splite_num)
    N = cast(N, 'double');
    indices = round(linspace(1, N+1, splite_num+1));  % evenly spaced boundary indices

    segments = cell(1, splite_num);                   % preallocate the segment list
    for i = 1:splite_num
        segments{i} = [indices(i), (indices(i+1)-1)]; % inclusive index range of segment i
    end
end
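The boundary arithmetic can be checked with a quick Python equivalent (illustrative only; Python's banker's rounding can differ from MATLAB's round in half-way cases, which does not matter for this sketch):

```python
def divide_sequence(n, split_num):
    # Evenly spaced boundaries over 1..n+1, mirroring MATLAB's
    # round(linspace(1, N+1, split_num+1)).
    indices = [round(1 + k * n / split_num) for k in range(split_num + 1)]
    # Each segment is an inclusive (start, end) index range; consecutive
    # segments share a boundary, so together they tile 1..n without gaps.
    return [(indices[i], indices[i + 1] - 1) for i in range(split_num)]
```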

posted @ 2024-09-20 17:01  sq-Feng