Medical Data Preprocessing
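Preface
Task
Process the two indicators pO2 and pCO2: for each patient and each hospital stay, collect all pO2 and pCO2 values in chronological order of charting time.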
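Dataset
The data comes from:
https://physionet.org/content/mimiciii-demo/1.4/
specifically the two files CHARTEVENTS.csv and LABEVENTS.csv. Each file has more than 70,000 rows, but only a little over 1,000 rows contain the indicators needed for this task.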
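Processing
Preprocessing
First, extract the rows that contain pO2 or pCO2. A row whose ITEMID is one of [490, 3785, 3837, 50821] is a pO2 measurement; a row whose ITEMID is one of [3784, 3835, 50818] is a pCO2 measurement. The files also contain many other fields, all of which are dropped during preprocessing. The result is saved to my.csv.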
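The extraction step can be sketched with the standard library alone. This is a minimal illustration, not the script used in this post: the column names (`subject_id`, `itemid`, `charttime`, `value`) and the three sample rows are assumptions made up for the example, and only the ITEMID codes come from the text above.

```python
import csv
import io

# ITEMID codes taken from the text above.
PO2_ITEMIDS = {"490", "3785", "3837", "50821"}
PCO2_ITEMIDS = {"3784", "3835", "50818"}


def extract_rows(lines, id_col, time_col, value_col):
    """Return [subject_id, charttime, pO2, pCO2] rows for the two indicators."""
    rows = []
    for rec in csv.DictReader(lines):
        if rec[id_col] in PO2_ITEMIDS:
            rows.append([rec["subject_id"], rec[time_col], rec[value_col], ""])
        elif rec[id_col] in PCO2_ITEMIDS:
            rows.append([rec["subject_id"], rec[time_col], "", rec[value_col]])
    return rows


# Hypothetical three-row sample standing in for one of the CSV files.
sample = io.StringIO(
    "subject_id,itemid,charttime,value\n"
    "10027,50821,2190-07-13 10:00:00,95\n"
    "10027,50818,2190-07-13 10:00:00,40\n"
    "10027,50912,2190-07-13 10:00:00,1.1\n"
)
extracted = extract_rows(sample, "itemid", "charttime", "value")
print(extracted)
```

Looking columns up by name rather than by position keeps the same function usable for both files, whose column orders differ.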
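The original CHARTEVENTS.csv has the following format:
The original LABEVENTS.csv has the following format:
After extraction, my.csv has the following format:
As can be seen, because pO2 and pCO2 are handled separately during extraction, a pO2 value and a pCO2 value measured at the same time end up in two separate rows. The missing value in each such row is filled in first; the resulting duplicates are removed later.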
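One way to picture the fill step is to key every row by (subject_id, charttime) and let each row contribute whichever value it carries. The sketch below is an assumption-laden alternative to the nested loop in the script: it produces one row per key directly, and when a timestamp has several values for the same indicator it keeps the last one. The sample rows are hypothetical.

```python
def merge_by_time(rows):
    """Combine pO2-only and pCO2-only rows that share (subject_id, charttime)."""
    merged = {}
    for subject_id, charttime, po2, pco2 in rows:
        slot = merged.setdefault((subject_id, charttime), ["", ""])
        if po2 != "":
            slot[0] = po2
        if pco2 != "":
            slot[1] = pco2
    return [[sid, t, v[0], v[1]] for (sid, t), v in merged.items()]


# Hypothetical rows in the subject_id / charttime / pO2 / pCO2 layout.
rows = [
    ["10027", "2190-07-13 10:00:00", "95", ""],
    ["10027", "2190-07-13 10:00:00", "", "40"],
    ["10027", "2190-07-13 12:00:00", "88", ""],
]
merged_rows = merge_by_time(rows)
print(merged_rows)
```

A row whose partner measurement does not exist keeps its empty field, which is what the later missing-value step is meant to catch.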
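After filling, the data looks as follows:
Deduplication: as noted above, the filled data contains duplicate rows, so these are removed first.
Removing missing values: first check whether any values are null, then drop every row that contains one. In this case the check reports that there are no nulls.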
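The two cleaning steps amount to keeping the first copy of each row and discarding rows with an empty field. The script does this with pandas (`drop_duplicates()` followed by `dropna()`); the dependency-free sketch below shows the same idea on hypothetical rows and is only an illustration.

```python
def drop_duplicates_and_missing(rows):
    """Keep the first copy of each row and discard rows with an empty field."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(row)
        if key in seen or any(field == "" for field in row):
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Hypothetical rows: one exact duplicate and one row with a missing pCO2.
rows = [
    ["10027", "2190-07-13 10:00:00", "95", "40"],
    ["10027", "2190-07-13 10:00:00", "95", "40"],
    ["10027", "2190-07-13 12:00:00", "88", ""],
]
cleaned_rows = drop_duplicates_and_missing(rows)
print(cleaned_rows)
```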
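Removing outliers
Outliers in pO2 and pCO2 are detected by drawing box plots.
The box plots are shown below:
Taking pO2 as an example, its outliers are: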
261.20001220703125 253.19999694824216 526.4000244140625 388.7999877929688
359.0 382.0 371.0 386.0 444.0 382.0 448.0 383.0 334.0 399.0 340.0 411.0
261.0 475.0 560.0 431.0 279.0 269.0 348.0 282.0 286.0 442.0 257.0 457.0
300.0 432.0 371.0 313.0 319.0 301.0 326.0 524.0 543.0 415.0 244.0 363.0
299.0 472.0 256.0 259.0 534.0 404.0 293.0 261.0 374.0 277.0 258.0 276.0
305.0 248.0 287.0 291.0 405.0 353.0 249.0 273.0 514.0 479.0 414.0 319.0
255.0 341.0 332.0
The outliers are removed by replacing each of them with the mean of the remaining (non-outlier) values.
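A box plot flags points that fall outside the whiskers. Assuming the common default of whiskers at 1.5 times the interquartile range, the detection and the mean replacement can be sketched as follows. The function name and the sample values are hypothetical, and the quartiles use linear interpolation (`method="inclusive"`), which is an assumption about how the plotting library computes them.

```python
import statistics


def replace_outliers_with_mean(values, whis=1.5):
    """Replace values outside the whisker range with the mean of the rest."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    low = q1 - whis * (q3 - q1)
    high = q3 + whis * (q3 - q1)
    inliers = [v for v in values if low <= v <= high]
    mean_inliers = sum(inliers) / len(inliers)
    return [v if low <= v <= high else mean_inliers for v in values]


# Hypothetical pO2 readings with one obvious outlier at the end.
po2 = [80.0, 90.0, 95.0, 100.0, 105.0, 110.0, 526.0]
cleaned = replace_outliers_with_mean(po2)
print(cleaned)
```

Replacing with the mean keeps the row count unchanged, at the cost of pulling every flagged reading to the same value.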
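After replacement the box plots look as follows. They are not perfect, but they are a clear improvement over the originals:
Interpolation
First, find the most common measurement interval, in hours. The core code works as follows: convert each time to seconds, subtract adjacent measurement times to get the gap in seconds, convert the gap to hours, and finally tally the gaps.
The tally is shown below: a gap of 0 hours occurs 1199 times and a gap of 2 hours occurs 160 times. The data had not been deduplicated when the tally was taken, so the 0-hour gaps reflect repeated rows with identical timestamps. Ignoring them, the most frequent gap is 2 hours.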
{0: 1199, 2: 160, 6: 116, 1: 113, 4: 110, 3: 97, 5: 79, 7: 60, 8: 43, 9: 32, 11: 27, 10: 24, 12: 23, 13: 19, 14: 14, 15: 12, 16: 11, 18: 11, 25: 8, 19: 6, 21: 6, 24: 5, 17: 5, 26: 5, 22: 4, 20: 4, 23: 3, 27: 3, 49: 2, 92: 1, 4771: 1, 66: 1, 61: 1, 196: 1, 46: 1, 129: 1, 533: 1, 365: 1, 55: 1, 391: 1, 47: 1, 71: 1, 469: 1, 121: 1, 152: 1, 35: 1, 37: 1, 127: 1, 68: 1, 4442: 1, 43: 1, 95: 1, 3511: 1, 42: 1, 28: 1, 75: 1, 40: 1, 65: 1, 77: 1, 1151: 1, 1820: 1, 401: 1, 9683: 1, 80: 1, 41: 1, 53: 1, 5247: 1, 4689: 1, 4485: 1, 7292: 1, 390: 1, 1672: 1, 91: 1, 89: 1, 45: 1, 1294: 1, 218: 1, 222: 1, 44: 1, 60: 1, 82: 1, 62: 1, 90: 1, 29: 1, 311: 1}
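The tally and the choice of step can be sketched as below. This is an illustration on hypothetical rows, not the script's exact loop: it subtracts `datetime` objects directly instead of going through epoch seconds, and it picks the most frequent gap among the non-zero ones, which is the reasoning applied above.

```python
from collections import Counter
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def gap_histogram(rows):
    """Count rounded hour gaps between consecutive rows of the same patient."""
    gaps = Counter()
    for prev, cur in zip(rows, rows[1:]):
        if prev[0] == cur[0]:
            delta = datetime.strptime(cur[1], FMT) - datetime.strptime(prev[1], FMT)
            gaps[round(delta.total_seconds() / 3600)] += 1
    return gaps


# Hypothetical (subject_id, charttime) pairs, including one repeated timestamp.
rows = [
    ("10027", "2190-07-13 10:00:00"),
    ("10027", "2190-07-13 10:00:00"),
    ("10027", "2190-07-13 12:00:00"),
    ("10027", "2190-07-13 14:00:00"),
    ("10027", "2190-07-13 20:00:00"),
]
gaps = gap_histogram(rows)
# Ignore 0-hour gaps caused by repeated timestamps when choosing the step.
step = max((g for g in gaps if g > 0), key=gaps.get)
print(gaps, step)
```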
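Lagrange interpolation is therefore applied on a 2-hour grid. For each grid point to be computed, the two measurements before it and the two measurements after it are used to construct the interpolating polynomial, which is then evaluated at that point.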
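The local scheme described above can be sketched without any third-party library. The script itself relies on `scipy.interpolate.lagrange`; the hand-written `lagrange_value` below is a stand-in for illustration, and the helper names and sample data are hypothetical. The sketch assumes the sample times are distinct, since repeated times would cause a division by zero.

```python
def lagrange_value(xs, ys, x):
    """Evaluate the Lagrange polynomial through the points (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def interpolate_local(times, values, x, k=2):
    """Interpolate at x from the k samples before it and the k samples after it."""
    pairs = sorted(zip(times, values))
    before = [p for p in pairs if p[0] < x][-k:]
    after = [p for p in pairs if p[0] > x][:k]
    xs, ys = zip(*(before + after))
    return lagrange_value(xs, ys, x)


# Hypothetical samples lying on y = t**2 + 1, so the exact value at t = 4 is 17.
times = [0.0, 1.0, 3.0, 5.0, 7.0]
values = [1.0, 2.0, 10.0, 26.0, 50.0]
estimate = interpolate_local(times, values, 4.0)
print(estimate)
```

Using only four neighbours keeps the polynomial at degree three or lower, which avoids the large oscillations a single high-degree polynomial through all measurements would tend to produce.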
To keep the chart scales uniform, the x axis (time) is rescaled: it is expressed in hours, with the first measurement as the origin 0.
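The rescaling is a subtraction of the earliest timestamp followed by a conversion to hours. A minimal sketch, with a hypothetical helper name and made-up timestamps:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def hours_since_first(charttimes):
    """Convert timestamp strings to hours elapsed since the earliest one."""
    parsed = [datetime.strptime(t, FMT) for t in charttimes]
    start = min(parsed)
    return [(t - start).total_seconds() / 3600 for t in parsed]


# Hypothetical charting times for one patient.
hours = hours_since_first([
    "2190-07-13 10:00:00",
    "2190-07-13 11:30:00",
    "2190-07-14 10:00:00",
])
print(hours)
```

Subtracting `datetime` objects directly avoids any dependence on the local time zone, which matters only if the conversion goes through epoch seconds.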
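Take the patient with subject_id 10027 as an example and compute the first five interpolated values, that is, pO2 and pCO2 at x = 0, 2, 4, 6, 8 on the time axis.
The plot is shown below. Yellow points are the interpolated pO2 values at times 0, 2, 4, 6, 8; blue points are the original measurements.
The plot is shown below. Yellow points are the interpolated pCO2 values at times 0, 2, 4, 6, 8; blue points are the original measurements.
Code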
import csv
from datetime import datetime
import collections
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.interpolate import lagrange
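# Build a polynomial from the coefficients p and evaluate it at x
def func(p, x):
    f = np.poly1d(p)
    return f(x)


# Return the difference between the predicted values and the true values
def error(p, x, y):
    return func(p, x) - y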
# Each record has the layout subject_id / charttime / pO2 / pCO2
o2_co2_data = []
# Read the data and extract the required values
with open(r'C:\Users\DELL\Downloads\CHARTEVENTS.csv', 'r', newline='') as f1:
    for i in csv.reader(f1):
        if i[4] in ('490', '3785', '3837', '50821'):  # pO2
            o2_co2_data.append([i[1], i[5], i[8], ''])
        if i[4] in ('3784', '3835', '50818'):  # pCO2
            o2_co2_data.append([i[1], i[5], '', i[8]])
with open(r'C:\Users\DELL\Downloads\LABEVENTS.csv', 'r', newline='') as f2:
    for i in csv.reader(f2):
        if i[3] in ('490', '3785', '3837', '50821'):  # pO2
            o2_co2_data.append([i[1], i[4], i[5], ''])
        if i[3] in ('3784', '3835', '50818'):  # pCO2
            o2_co2_data.append([i[1], i[4], '', i[5]])
# Merge: fill the empty pO2 / pCO2 field of each row from another row
# with the same subject_id and charttime
for i in o2_co2_data:
    for j in o2_co2_data:
        if j[0] == i[0] and j[1] == i[1]:
            if j[2] == '':
                j[2] = i[2]
            if j[3] == '':
                j[3] = i[3]
# Write the data
head = ['subject_id', 'charttime', 'pO2', 'pCO2']
with open(r'C:\Users\DELL\Downloads\my.csv', 'w', newline='') as f3:
    writer = csv.writer(f3)
    writer.writerow(head)
    writer.writerows(o2_co2_data)
# # # Date to seconds
# # second=datetime.strptime('2163-05-14 20:07:00','%Y-%m-%d %H:%M:%S').timestamp()
# # # Seconds to date
# # print(time.strftime('%Y-%m-%d %H:%M:%S',time.localtime(second)))
# Tally the gaps between consecutive measurements, in hours,
# to find the most common one
test_gap = []
j = o2_co2_data[0]
for i in o2_co2_data:
    if i[0] == j[0]:
        test_gap.append(round((datetime.strptime(i[1], '%Y-%m-%d %H:%M:%S').timestamp()
                               - datetime.strptime(j[1], '%Y-%m-%d %H:%M:%S').timestamp()) / 3600))
    j = i
print(collections.Counter(test_gap))
# Read the data
data = pd.read_csv(r'C:\Users\DELL\Downloads\my.csv', encoding='utf-8')
# Deduplicate
# Check whether there are duplicate rows
print(data.duplicated().any())
data = data.drop_duplicates()
# Drop missing values
# Check whether there are nulls
print(data.isnull().any())
data = data.dropna()
# Detect outliers by drawing a box plot
outliner = data.boxplot(column=list(data.columns[2:]), return_type='dict')
out_pO2 = outliner['fliers'][0].get_ydata()
print(out_pO2)
out_pCO2 = outliner['fliers'][1].get_ydata()
plt.show()
# Remove the outliers by replacing them with the mean of the remaining values
pO2_mean = (data['pO2'].sum() - out_pO2.sum()) / (len(data['pO2']) - len(out_pO2))
pCO2_mean = (data['pCO2'].sum() - out_pCO2.sum()) / (len(data['pCO2']) - len(out_pCO2))
# Assign the result back instead of calling replace(inplace=True) on a column
data['pO2'] = data['pO2'].replace(list(out_pO2), pO2_mean)
data['pCO2'] = data['pCO2'].replace(list(out_pCO2), pCO2_mean)
outliner = data.boxplot(column=list(data.columns[2:]), return_type='dict')
plt.show()
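# Take patient 10027 as an example; sort by time so that the neighbour
# selection during interpolation sees the measurements in chronological order
person = data[data['subject_id'] == 10027].sort_values('charttime')
print(person)
# Rescale the x axis to hours, with the first measurement at zero
x_time = person['charttime']
x_second = []
for i in x_time:
    x_second.append(datetime.strptime(i, '%Y-%m-%d %H:%M:%S').timestamp() / 3600)
x_start = min(x_second)
x_second2 = [i - x_start for i in x_second]  # rescaled x coordinates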
# Lagrange interpolation of pCO2
y_pCO2 = list(person['pCO2'])
# x coordinates to interpolate at
lag_x_second2 = list(range(0, int(max(x_second2)), 2))
lag_y_pCO2 = []
# First five grid points (0, 2, 4, 6, 8 hours)
for i in lag_x_second2[:5]:
    x_small = []
    y_small = []
    x_big = []
    y_big = []
    # Split the measurements into those before and those after i;
    # a measurement taken exactly at i is not used
    for k, j in enumerate(x_second2):
        if i > j:
            x_small.append(j)
            y_small.append(y_pCO2[k])
        if i < j:
            x_big.append(j)
            y_big.append(y_pCO2[k])
    # Use up to two points before i and up to two points after i
    x = x_small[-2:] + x_big[:2]
    y = y_small[-2:] + y_big[:2]
    lag = lagrange(x, y)
    lag_y_pCO2.append(lag(i))
plt.scatter(x_second2, y_pCO2)
plt.scatter(lag_x_second2[:5], lag_y_pCO2)
plt.show()
# Lagrange interpolation of pO2
y_pO2 = list(person['pO2'])
# x coordinates to interpolate at
lag_x_second2 = list(range(0, int(max(x_second2)), 2))
lag_y_pO2 = []
# First five grid points (0, 2, 4, 6, 8 hours)
for i in lag_x_second2[:5]:
    x_small = []
    y_small = []
    x_big = []
    y_big = []
    # Split the measurements into those before and those after i;
    # a measurement taken exactly at i is not used
    for k, j in enumerate(x_second2):
        if i > j:
            x_small.append(j)
            y_small.append(y_pO2[k])
        if i < j:
            x_big.append(j)
            y_big.append(y_pO2[k])
    # Use up to two points before i and up to two points after i
    x = x_small[-2:] + x_big[:2]
    y = y_small[-2:] + y_big[:2]
    lag = lagrange(x, y)
    lag_y_pO2.append(lag(i))
plt.scatter(x_second2, y_pO2)
plt.scatter(lag_x_second2[:5], lag_y_pO2)
plt.show()