「Python」用mysql处理excel数据
1 前言
这是一个python的英文题目,原文如下:
The UK Department of Health and Social Care provides a multitude of medical services to people. The UK government is considering developing a new information system to get faster and better decision-support information. The government has limited experience with information system development, so it engages a consulting company to plan and implement it. You need to take the role of an employee of the consulting company and are in charge of this project. The first business process the government will implement is the hospital admission process.
The government looks after many public hospitals with a range of medical officers(MO)within different medical fields(Specialty). Whenever a patient is admitted, a record is created with an admission number(EpisodeNo), date and time, urgency and admission, as well as hospital, patient no., and other patient data. During the stay,all the procedures performed and costs are recorded. When patients are discharged,again the date and time as well as the status get recorded.
At this stage, the government is providing you with test data only. The spreadsheet will be provided with different worksheets in which data are explained further.
Your project consists of a range of tasks,beginning with designing the databaseimplementinc it,creating queries,documenting the design and implementation stepsand answering clients questions. You are required to answer each of the following tasks in sequence:
You need to set up the database operating environment in Python and then use Python to
answer and discuss the following questions:
- Admission count per hospital, by specialty,in a certain period of time;
- Average length of stay by hospital, by class, by medical officer;
- Average cost by hospital, by class, by age, by age group(age groups are 0-1617-3839-6061 and above);
简单说就是有一组数量比较多的患者就医情况数据和医院相关信息,实际上是一份excel表格,包括多个sheet,要求用python脚本对数据库进行操作,实现三个查询问题。
2 分析
2.1 数据库建库方案
使用mysql建立数据库,并将excel表格导入。在此之前,需要先构建一个数据库及其中的各个表。建库方案如下:
灰色格子为表名,其他为字段名,箭头为外键。
2.2 程序分析
- python使用第三方库pymysql实现对数据库的操作;
- python使用第三方库pandas将excel表格数据导入数据库表;
- 程序开始时判断数据库是否已经存在,若不存在则建库建表并导入数据,若存在则进行查询操作。
3 代码
3.1 判断是否存在相应的库
import os
try:
import pymysql
import pandas as pd
import openpyxl # pandas读取excel的.xlsx以来此库
except:
os.system("pip install pandas pymysql openpyxl")
import pymysql
import pandas as pd
import openpyxl
3.2 基本框架
class myDatabase():
def __init__(self) -> None:
self.createTable()
self.insertValue()
def createTables(self): # 建表
pass
def insertValue(self): # 导入excel表格数据
pass
def CountbyClass(self): # 实现查询1.Admission count per hospital, by specialty,in a certain period of time;
pass
def averageAdmtime(self): # 实现查询2.Average length of stay by hospital, by class, by medical officer;
pass
def averageCost(self): #实现查询3.Average cost by hospital, by class, by age, by age group(age groups are 0-1617-3839-6061 and above);
pass
if __name__ == "__main__":
mysql = myDatabase() # 实例化一个myDatabase对象
mysql.CountbyClass() # 查询1
mysql.averageAdmtime() # 查询2
mysql.averageCost() # 查询3
3.3 pymysql库的使用
首先,使用pymysql需要连接到mysql或mysql下的一个数据库,pymysql提供了这样一个connect函数:
db=pymysql.connect(host,user,password,charset)
# host是mysql所在的ip,本机的可以不传入这个参数或者host="localhost"
# user和password是mysql用户和密码
# charset是字符编码
其次,连接上mysql后,需要用在指针进行操作,所以连接成功后需要有一个变量存储指针:
cursor=db.cursor()
最后,对mysql的操作还是需要使用sql语句进行的,也就是说python将sql命令传给mysql执行。用到的函数:
cursor.execute(sql)
# sql是sql语句,可以不带分号;
3.4 sql语句
/*创建数据库*/
create database databaseName;
/*创建数据表*/
create table tableName(word1 class, word2 class, word3 class……); -- 直接创建,如果已存在这个表会报错
create table if not exists tableName(word1 class, word2 class, word3 class……); -- 判断如果不存在则创建表
-- word是字段名,class是数据类型(如int,char,double,date等)
/*插入数据*/
insert into tableName(word1,word2,word3……) value (value1,value2,value3……);
/*删除表*/
drop table tableName;
/*删除数据库*/
drop database databaseName;
/*查询*/
select word1,word2 from tableName; -- 从表tableName中查询word1和word2字段的记录
select * from tableName; -- 显示tableName表的全部记录
select count(*) from tableName; -- 统计tableName的全部记录数量
select count(*) from tableName group by word1; -- 以word1为条件进行分组统计
select avg(word1) from tableName; -- 计算tableName的word1字段的平均值
select word11,word21 from (select word11,word12,word21,word22 from table1,table2 where table1.word1n=table2.word2m) -- 嵌套查询
运用以上命令即可完成开头提到的三个查询问题:
- Admission count per hospital, by specialty,in a certain period of time;
- Average length of stay by hospital, by class, by medical officer;
- Average cost by hospital, by class, by age, by age group(age groups are 0-1617-3839-6061 and above);
以1为例,查询按医院、类别、医务人员分列的平均住院时间。查询语句如下:
select Hospital,
avg(timestampdiff(day,AdmDT,DischDT)) -- avg函数计算平均值,timestampdiff函数计算时间差
from(
select Hospital,
str_to_date(concat(AdmDate," ",AdmTime),"%Y-%m-%d %H:%i:%s") as AdmDT, -- concat函数将原本分为两个字段的入院日期(AdmDate)和入院时间(AdmTime)合为一个字符串,str_to_date函数将字符串转为时间格式,as AdmDT给这个数据取一个别名
str_to_date(concat(DischDate," ",DischTime),"%Y-%m-%d %H:%i:%s") as DischDT
from Patientinfo,AdmDisch -- 医院数据Hospital来自表Patientinfo,出入院时间来自表AdmDisch
where Patientinfo.EpisodeNo=AdmDisch.EpisodeNo) as hospital -- 将Patientinfo表和AdmDisch表按EpisodeNo进行合并
group by Hospital; -- 按医院分组求平均住院时间