Input: the data file to be resampled (LIBSVM format), e.g. heart_scale
    +1 1:0.708333 2:1 3:1 4:-0.320755 5:-0.105023 6:-1 7:1 8:-0.419847 9:-1 10:-0.225806 12:1 13:-1
    -1 1:0.583333 2:-1 3:0.333333 4:-0.603774 5:1 6:-1 7:1 8:0.358779 9:-1 10:-0.483871 12:-1 13:1
    ....
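For readers unfamiliar with the LIBSVM format: each line holds a class label followed by sparse `index:value` pairs, and an index that does not appear (such as 11 in the first sample line) is implicitly zero. The following is a minimal pure-Python sketch of reading one such line. `parse_libsvm_line` is a hypothetical helper written only for illustration; it is not part of LIBSVM or scikit-learn.

```python
def parse_libsvm_line(line):
    """Split one LIBSVM-format line into (label, {feature_index: value})."""
    tokens = line.split()
    label = float(tokens[0])  # "+1" parses to 1.0, "-1" to -1.0
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features


label, features = parse_libsvm_line(
    "+1 1:0.708333 2:1 3:1 4:-0.320755 5:-0.105023 6:-1 7:1 "
    "8:-0.419847 9:-1 10:-0.225806 12:1 13:-1"
)
print(label)                   # 1.0
print(features.get(11, 0.0))   # 0.0, because index 11 is absent from the line
```

In practice `load_svmlight_file` does this parsing and returns the features as a sparse matrix, which is why the resampling code works on sparse rows.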
Output: the file the resampled data is saved to (LIBSVM format), here heart_scale_balance.txt
Python code:
    import numpy as np
    from scipy.sparse import vstack
    from sklearn.datasets import dump_svmlight_file, load_svmlight_file
    from sklearn.utils import check_random_state


    def fit_sample(X, y):
        """Randomly over-sample each minority class up to the majority class size."""
        # Count the samples in each class and find the majority class
        stats_c_ = {}
        maj_n = 0
        for label in np.unique(y):
            nk = int(np.sum(y == label))
            stats_c_[label] = nk
            if nk > maj_n:
                maj_n = nk
                maj_c_ = label

        # Keep the samples from the majority class
        X_resampled = X[y == maj_c_]
        y_resampled = y[y == maj_c_]

        # Loop over the other classes, over-sampling each at random
        for key in stats_c_:
            # If this is the majority class, skip it
            if key == maj_c_:
                continue

            # Number of extra samples needed to match the majority class
            num_samples = stats_c_[maj_c_] - stats_c_[key]

            # Pick that many row indices at random, with replacement
            random_state = check_random_state(42)
            indx = random_state.randint(low=0, high=stats_c_[key], size=num_samples)

            # Append the original and the duplicated samples of this class
            X_key, y_key = X[y == key], y[y == key]
            X_resampled = vstack([X_resampled, X_key, X_key[indx]], format="csr")
            y_resampled = np.concatenate([y_resampled, y_key, y_key[indx]])

        return X_resampled, y_resampled


    X_train, y_train = load_svmlight_file("heart_scale")

    # Apply the random over-sampling
    X_train, y_train = fit_sample(X_train, y_train)
    dump_svmlight_file(X_train, y_train, "heart_scale_balance.txt", zero_based=False)
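One way to confirm that the over-sampling worked is to count the class labels in the file before and after. The sketch below does this with the standard library only. `count_labels` is a hypothetical helper, and the three in-memory lines are made-up stand-ins for a real file so that the example runs without `heart_scale` on disk.

```python
from collections import Counter


def count_labels(lines):
    """Count how often each class label occurs in LIBSVM-format lines."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        # Skip blank lines and "#" comment lines
        if not tokens or tokens[0].startswith("#"):
            continue
        counts[float(tokens[0])] += 1
    return counts


# Toy stand-in for a file. With real data, pass an open file object instead,
# e.g. count_labels(open("heart_scale_balance.txt")).
sample = [
    "+1 1:0.5 2:1",
    "-1 1:-0.2 3:1",
    "-1 2:0.7",
]
counts = count_labels(sample)
print(dict(counts))  # {1.0: 1, -1.0: 2}
```

After a successful run, every label in `heart_scale_balance.txt` should have the same count as the majority class in `heart_scale`, because each minority class is topped up by exactly the difference.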
Tags: python, Oversample