scikit-learn LabelEncoder with pandas: what to do when the test set contains values that never appeared in the training set (fixing "y contains previously unseen labels")
import joblib
import numpy as np

# categorical_ix: list of categorical column names; df_test: the test DataFrame
for i in categorical_ix:
    le = joblib.load(f"./LabelEncoder/{i}_LabelEncoder.model")
    # The test set may contain labels that never appeared in the training set,
    # so every such label is mapped to '<unk>'.
    unseen = ~df_test[i].isin(le.classes_)
    if unseen.any():
        print("***Warning***: y contains previously unseen labels")
        print("Column:", i)
        print("Mapping newly seen values to <unk>")
        df_test[i] = df_test[i].where(~unseen, '<unk>')
        # Register '<unk>' as a class exactly once.
        # Big pitfall: this only works when the labels are strings.
        # With numeric labels the encoding goes wrong.
        if '<unk>' not in le.classes_:
            le.classes_ = np.append(le.classes_, '<unk>')
    df_test[i] = le.transform(df_test[i])
Reference: https://blog.csdn.net/qq_41185868/article/details/109408387#1%E3%80%81%E5%9C%A8%E6%95%B0%E6%8D%AE%E7%BC%BA%E5%A4%B1%E5%92%8Ctest%E6%95%B0%E6%8D%AE%E5%86%85%E5%AD%98%E5%9C%A8%E6%96%B0%E5%80%BC%28train%E6%95%B0%E6%8D%AE%E6%9C%AA%E5%87%BA%E7%8E%B0%E8%BF%87%29%E7%8E%AF%E5%A2%83%E4%B8%8B%E7%9A%84%E6%95%B0%E6%8D%AELabelEncoder%E5%8C%96
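One way to sidestep the numeric-label pitfall mentioned in the code comment is to cast the column to strings before both fitting and transforming. The following is only a minimal sketch on toy data, not the referenced article's method: the helper name `transform_with_unk` and the sample values are made up for illustration, and it assumes the encoder was fitted on string values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

UNK = "<unk>"  # placeholder token, same spelling as in the loop above


def transform_with_unk(le, values):
    """Hypothetical helper: encode values, mapping unseen ones to UNK."""
    # Cast to str so numeric labels compare against string classes.
    values = pd.Series(values).astype(str)
    known = set(le.classes_)
    # Keep known values, replace everything else with UNK.
    values = values.where(values.isin(known), UNK)
    if UNK not in known:
        # Appended last, so UNK receives the next free integer code.
        le.classes_ = np.append(le.classes_, UNK)
    return le.transform(values)


# Toy data: numeric labels are cast to str before fitting.
train = pd.Series([10, 20, 30]).astype(str)
test = pd.Series([20, 40, 10, 50])  # 40 and 50 never appear in train

le = LabelEncoder().fit(train)
codes = transform_with_unk(le, test)
print(codes)
```

Here both unseen values share the single code assigned to `<unk>`, so a downstream model sees them as one "other" category.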
If you are using OneHotEncoder instead, the recommended approach is enc = OneHotEncoder(handle_unknown='ignore').
Reference: https://blog.csdn.net/lizz2276/article/details/106281697
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder
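As a small illustration of the OneHotEncoder suggestion, the sketch below fits on three made-up colour values and then transforms a test array containing one unseen value. The data are assumptions for demonstration only. With handle_unknown='ignore', an unseen category is encoded as an all-zero row rather than raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data: one categorical column.
X_train = np.array([["red"], ["green"], ["blue"]])
X_test = np.array([["green"], ["purple"]])  # "purple" is unseen

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X_train)

# transform returns a sparse matrix by default; densify for display.
encoded = enc.transform(X_test).toarray()
print(enc.categories_[0])  # learned categories, in sorted order
print(encoded)             # the "purple" row contains only zeros
```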