Large Database Application Technology: In-Class Test 05
Problem:
Field descriptions for the Result file:
Ip: 106.39.41.166 (city)
Date: 10/Nov/2016:00:01:02 +0800 (date)
Day: 10 (day of month)
Traffic: 54 (traffic)
Type: video (type: video or article)
Id: 8701 (id of the video or article)
1. Data cleaning: clean the data as described below, then load the cleaned data into the Hive database.
Two-stage data cleaning:
(1) Stage one: extract the required fields from the raw log
ip: 199.30.25.88
time: 10/Nov/2016:00:01:03 +0800
traffic: 62
article: article/11325
video: video/3235
(2) Stage two: refine the extracted fields (a small sketch of these conversions follows the schema below)
ip ---> city (the city is derived from the IP)
date--> time:2016-11-10 00:01:03
day: 10
traffic:62
type:article/video
id:11325
(3) Hive table schema:
create table data(
  ip string, time string, day string, traffic bigint,
  type string, id string
);
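To make the two stage-two conversions concrete, here is a minimal Java sketch (the class name CleanDemo is mine, not part of the assignment) of reformatting the date and splitting type/id:

import java.text.SimpleDateFormat;
import java.util.Locale;

public class CleanDemo {
    public static void main(String[] args) throws Exception {
        // Date refinement: 10/Nov/2016:00:01:03 +0800 -> 2016-11-10 00:01:03
        SimpleDateFormat in = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);
        SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        System.out.println(out.format(in.parse("10/Nov/2016:00:01:03 +0800")));

        // Type/id refinement: "article/11325" -> type "article", id "11325"
        String[] parts = "article/11325".split("/");
        System.out.println("type: " + parts[0] + ", id: " + parts[1]);
    }
}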
2. Data analysis: compute the following statistics in Hive.
(1) The top 10 most popular videos/articles by access count (video/article)
(2) The top 10 most popular courses by city (ip)
(3) The top 10 most popular courses by traffic (traffic)
Step 1:
Since I do not know how to reformat the data inside MySQL itself, I use a Java program to read the rows, clean them into an ArrayList, and then insert them into a new table.
import java.sql.*;
import java.util.ArrayList;

public class test002 {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/test03", "root", "zber1574");

        // Read every row of the raw table and clean it into memory.
        ArrayList<beanlist> temp = new ArrayList<beanlist>();
        PreparedStatement pt = conn.prepareStatement("select * from data");
        ResultSet rs = pt.executeQuery();
        while (rs.next()) {
            String c = rs.getString("ip");
            String time = rs.getString("time");
            String day = rs.getString("day");
            String traffic = rs.getString("traffic");
            String type = rs.getString("type");
            String id = rs.getString("id");
            // Keep only the first two octets of the IP as the "city" key.
            String[] parts = c.split("\\.");
            String ip = parts[0] + "." + parts[1];
            temp.add(new beanlist(ip, time, day, traffic, type, id));
        }
        rs.close();
        pt.close();
        conn.close();

        // Write the cleaned rows into the new table.
        add a = new add();
        for (int i = 0; i < temp.size(); i++) {
            String sql1 = "INSERT INTO data2 (ip,time,day,traffic,type,id) VALUES('"
                    + temp.get(i).getIp() + "','" + temp.get(i).getTime() + "','"
                    + temp.get(i).getDay() + "'," + temp.get(i).getTraffic() + ",'"
                    + temp.get(i).getType() + "','" + temp.get(i).getId() + "')";
            a.insert(sql1);
            System.out.println(i);
        }
    }
}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class add {
    // Executes a single INSERT statement against the target database.
    public void insert(String sql) throws SQLException {
        try {
            Class.forName("com.mysql.jdbc.Driver");
        } catch (ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/test03", "root", "zber1574");
        PreparedStatement pt = conn.prepareStatement(sql);
        pt.executeUpdate();
        pt.close();
        conn.close();
    }
}
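The code above references a beanlist class that the original post does not show; a minimal sketch, assuming it is a plain value holder with getters matching the calls in test002:

// Minimal sketch of the beanlist holder assumed by test002 (not shown in the original).
public class beanlist {
    private final String ip, time, day, traffic, type, id;

    public beanlist(String ip, String time, String day,
                    String traffic, String type, String id) {
        this.ip = ip; this.time = time; this.day = day;
        this.traffic = traffic; this.type = type; this.id = id;
    }

    public String getIp() { return ip; }
    public String getTime() { return time; }
    public String getDay() { return day; }
    public String getTraffic() { return traffic; }
    public String getType() { return type; }
    public String getId() { return id; }
}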
That completes the extraction, parsing, restructuring, and re-import on the database side.
PS: the .csv file exported from the database at this point may have its fields wrapped in double quotes (""), which interferes with Hive's parsing.
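One way to handle the stray quotes (a sketch under my own assumptions about file names; the quotes could also be avoided in the export settings): strip them in Java before uploading, which is safe here because the cleaned fields contain no embedded quotes or commas.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class StripQuotes {
    public static void main(String[] args) throws IOException {
        // Read the exported CSV, drop every double quote, and write a clean copy.
        List<String> cleaned = Files.readAllLines(Paths.get("city.csv"), StandardCharsets.UTF_8)
                .stream()
                .map(line -> line.replace("\"", ""))
                .collect(Collectors.toList());
        Files.write(Paths.get("city_clean.csv"), cleaned, StandardCharsets.UTF_8);
    }
}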
Then transfer the CSV file to the virtual machine with Xftp, open Hive, and create the table:
create table city(
  ip string comment "city",
  c_date string comment "date",
  c_day string comment "day",
  traffic string comment "traffic",
  type string comment "type",
  Id string comment "id"
)
row format delimited fields terminated by ',';
Load the file into the table:
hadoop fs -put /opt/software/city.csv /user/hive/warehouse/city
Table and query for task (1):
create table type10(
  id String,
  total String
)
row format delimited fields terminated by ','
stored as textfile;

insert into type10
select id, count(*) as total
from result1
group by id
order by total desc
limit 10;
create table if not exists tb_ip comment "ip" as
select ip, type, id, count(*) as paixu
from city1
group by ip, type, id
order by paixu desc
limit 10;
create table if not exists tb_traffic comment "traffic" as
select type, id, sum(traffic) as liuliang
from city1
group by type, id
order by liuliang desc
limit 10;
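I ran these statements in the Hive CLI; as a side note (my own sketch, not part of the original workflow), the task-1 query could also be run from Java through the Hive JDBC driver, much like the MySQL access earlier, assuming HiveServer2 is up on its default port:

import java.sql.*;

// Runs the task-1 Top-10 query over HiveServer2 via JDBC.
// Assumes HiveServer2 is listening on the default port 10000.
public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
        Statement st = conn.createStatement();
        ResultSet rs = st.executeQuery(
                "select id, count(*) as total from result1 group by id order by total desc limit 10");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        rs.close();
        st.close();
        conn.close();
    }
}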
Finally, visualize the results with fineUI and the test is complete.