【数据库】 postgresql 和 SQL初级使用(TPC-H数据应用)
工 具:postgresql
运行环境:ubantu 14.0
数据来源:TPC-H dbgen
TASK1:
1.下载postgresql(sudo install postgresql 即可),成果如下。
TASK2:
1.按照教程操作,进入dbgen文件夹修改makefile文件。
然后执行make运用TPC-H生成数据。这里可以看http://www.cnblogs.com/joyeecheung/p/3599698.html 的前面5点。
另外,TPC-H的关系图文件在官网上就可以找到:http://www.tpc.org/tpc_documents_current_versions/pdf/tpch2.17.1.pdf
(2016-4-6更新:上面这个链接失效了,貌似人家网站更新了,新的资源地址:http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp)
2.将tbl格式文件转换成json格式文件。
1)代码:主要使用c++文件读写来完成,读取指定tbl文件,并按照tpch官
网介绍的数据模型来转换,写json文件,示例为转换orders.tbl文件。
当然这里为了想的快,就直接一个一个表格转换,也可以直接对所有的文件同时进行转换操作。
#include<iostream> #include<fstream> #include<string> #include<vector> using namespace std; vector<string> split(string s, string c) { string::size_type pos; vector<string> result; int size = s.size(); for (int i = 0; i < size; i++) { pos = s.find(c, i); if(pos < size) { result.push_back(s.substr(i, pos-i)); i = pos + c.size() -1; } } return result; }
int main() { string s; int i, j; ifstream in; in.open("orders.tbl"); ofstream out; out.open("orders.json"); string title[9] = { "\"orderkey\"","\"custkey\"","\"orderstatus\"", "\"totalprice\"","\"orderdate\"","\"orderpriority\"", "\"clerk\"","\"shippriority\"","\"comment\""}; if (in.is_open()) { for (i = 1; getline(in, s); i++) { vector<string> result = split(s, "|"); /*if (i == 3270184 ){ cout << "!!!" << s << endl; break; }*/ s = "\t{"; if (result.size() == 9) { out << i; for (j = 0; j < 8; j++) { s = s + title[j] + ":\"" + result[j] + "\","; } s = s + title[j] + ":\"" + result[j] + "\"}\n"; out << s; } } out.close(); } }
2)效果:我的电脑就要崩了.jpg
TASK3:
1.创建数据库和对应的表,并向里面导入转换好的json格式文件。
2.开始查询。
1)开始查询之前先学习SQL查询的知识,尤其是要知道它的选表、拼表原理,比如select、from、where、group by等等的语句的执行顺序,from的原理,否则你开始做题目的时候就要担心你的查询语句是否是你所期望的了,以及相关的聚集函数的用法等等;
2)要弄清楚这个数据库的三个数据实体customer,part,supplier以及他们之间的联系集之间的关系,这样才能使用恰当的方法查询到你想要的查询结果。
Q1:Display the key and quantity of parts which are shipped in March 13th, 1996.
查询语句(SQL):
select data->>'partkey' as part, data->>'quantity' as quantity from l where data->>'shipdate' = '1996-03-13';
查询结果:
Q2: For each ship mode, find the total quantity of items that are shipped before
查询语句(SQL):
select sum(cast(data->>'quantity' as numeric)) as total, data->>'shipmode' as shipmode from l where data->>'shipdate' < '1998-12-01' group by data->>'shipmode' order by total asc;
查询结果:
Q3:Display the total quantity of parts which are satisfied the following conditions in LINEITEM: – the ship mode is AIR – the priority of the order that the parts belong to is URGENT
查询语句(SQL):
select sum(cast(data->>'quantity' as numeric)) as total from l where data->>'shipmode'='AIR' and data->>'orderkey' in (select data->>'orderkey' from o where data->>'orderpriority'='1-URGENT');
查询结果:
Q4: Find out the suppliers that can supply the part whose key is 100. Display their names, nation and regions.
查询语句(SQL):
select data->>'name' as name, data->>'nationkey' as nation, data->>'regionkey' as region from n where data->>'nationkey' in (select s.data->>'nationkey' from s, ps where s.data->>'suppkey'=ps.data->>'suppkey' and ps.data->>'partkey'='100' and ps.data->>'partkey' < ps.data->>'suppkey');
查询结果:
Q5: Calculate the number of distinct customers who have ordered parts whose type is STEEL.
查询语句(SQL):
select count(distinct data->>'custkey') from c where data->>'custkey' in (select o.data->>'custkey' from o, l where o.data->>'orderkey' = l.data->>'orderkey' and l.data->>'partkey' in (select data->>'partkey' from p where data->>'type' like '%STEEL%'));
查询结果:
Q6: Find the top 10 parts, in terms of their total quantity shipped in 1997. List the parts’ keys,names, brands and total quality shipped in 1997. Display the results in descending order of the total quantity.
查询语句(SQL):
(方法1,效率比较慢):
select data->>'partkey' as key, data->>'name' as name, data->>'brand' as brand, (select sum(cast(l.data->>'quantity' as numeric)) as total from l where l.data->>'shipdate' like '%1997%' and l.data->>'partkey' = p.data->>'partkey' group by l.data->>'partkey') from p order by total desc limit 10;
(方法2,效率比较快):
select p.data->>'partkey' as key, p.data->>'name' as name, p.data->>'brand' as brand, s.total from p, (select l.data->>'partkey' as partkey, sum(cast(l.data->>'quantity' as numeric)) as total from l where l.data->>'shipdate' like '%1997%' group by l.data->>'partkey' order by total desc limit 10 ) as s where p.data->>'partkey' = s.partkey;
查询结果:
方法1和方法2的主要差别其实就是先选10行再拼表还是先拼表再选行,法2十几秒内就可以出结果,法1需要更长的时间。当然,我其实还有法0,直接将l和p进行拼表然后选取……睡觉前输入查询,睡醒了都没出结果(七八个钟头)= =。
Q7: Find out the customers who had at least order 10 times in 1996. Display their keys, names, addresses, phones and the total of their orders in 1996. List the results in descending order of the total orders.
查询语句(SQL):
select c.data->>'custkey' as customer, c.data->>'name' as name, c.data->>'address' as address, c.data->>'phone' as phone, op.count as count from c, (select o.data->>'custkey' as cust, count(*) as count from o group by o.data->>'custkey' having count(*) >= 10) as op where c.data->>'custkey' = op.cust;
查询结果:
Q8: Find out the lineitem which has the max extended price and is ordered in 1998. Display the value of this max extended price
查询语句(SQL):
方法1(由Q6受到启发,降序选第一个即为最大值)
select cast(data->>'extendedprice' as numeric) as maxprice from l, (select data->>'orderkey' as orderkey from o where data->>'orderdate' like '%1998%') as op where l.data->>'orderkey' = op.orderkey order by maxprice desc limit 1;
方法2(正常人的思维,应用sum的聚集函数)
select max(cast(lp.price as numeric)) as maxextendedprice from (select l.data->>'extendedprice' as price from l where l.data->>'orderkey' in (select o.data->>'orderkey' from o where o.data->>'orderdate' like '%1998%')) as lp;
查询结果:
可以发现这两种方法得到的查询结果是一样的~至于效率,直观感觉是差不多的,如果数据库的排序是通过堆排序的话,那确实从算法上来说两种方法的效率是一样的。(特别要注意的是使用排序的时候要先将数据强制转换成numeric类型,否则此时是文本格式,比较的结果就会是9999……)
Q9: Display total quantity of lineitems which are ordered in May 1995 for each ship mode except truck and rail.
查询语句(SQL):
select sum(cast(l.data->>'quantity' as numeric)) as total, l.data->>'shipmode' as mode from l where l.data->>'orderkey' in (select o.data->>'orderkey' from o where o.data->>'orderdate' like '1995-05%') and l.data->>'shipmode' <> 'TRUCK' and l.data->>'shipmode' <> 'RAIL' group by l.data->>'shipmode';
查询结果:
心得体会:
1.安装过程完全按照教程来走,没有遇到任何问题。
2.其实这次作业是很简单的,但开始的时候我很盲目地就开始做作业,都没搞清楚原理,遇到什么属性就去转化什么文件再导入进数据库。后来把数据全删掉了,首先先弄明白TPC-H的实体和联系之间的关系,也才能正确理解题意。第二件重要的事情就是学习SQL语句,尤其重点是语句执行的顺序,因为先执行group by、having再执行where还是顺序颠倒,得到的结果是很不一样的。
3.查询的时候我的模式基本上是:满足XX的并且XX的XX的XX……先找主语再找所在的表。
4.SQL查询在数据很大的时候查询反应好慢……然后听同学说建立索引来进行查询,都只是几秒的事情……