Data Collection: Assignment 2

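Assignment ①:
  Requirement: crawl the 7-day weather forecast for a given set of cities from China Weather (http://www.weather.com.cn) and save it to a database.
  Steps:
    1) First fetch the page source of the 7-day forecast for the given city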
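import requests

# `headers` is the request-header dict defined elsewhere in the full code (not shown here)
def getHtml(url):
    # Return the page text, or an empty string if the request fails
    try:
        r = requests.get(url, timeout=25, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""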
    2) Extract the required information
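import bs4
from bs4 import BeautifulSoup

def fillList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    # Each <li> under <ul class="t clearfix"> holds one day's forecast
    for li in soup.find('ul', attrs={"class": 't clearfix'}).children:
        # Skip the whitespace text nodes between the <li> tags
        if isinstance(li, bs4.element.Tag):
            date = li('h1')        # li('h1') is shorthand for li.find_all('h1')
            weather = li('p')
            temperature = li('i')
            # Keep the first match of each: date, weather description, temperature
            ulist.append([date[0].text.strip(), weather[0].text, temperature[0].text.strip()])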

    3) Store in the database

import sqlite3

def Store(uinfo):
    # Note: CREATE TABLE raises sqlite3.OperationalError if WEATHER already exists,
    # so delete cyz_2_test1.db before running this a second time.
    conn = sqlite3.connect('cyz_2_test1.db')
    print("Opened database successfully")
    c = conn.cursor()
    c.execute('''CREATE TABLE WEATHER
           (ID INT PRIMARY KEY     NOT NULL,
           NAME           TEXT     NOT NULL,
           DATE           TEXT     NOT NULL,
           WEA        TEXT,
           TEMPERATURE    TEXT );''')
    print("Table created successfully")
    # Each entry of uinfo is [date, weather, temperature]; IDs start at 1
    for i, u in enumerate(uinfo, start=1):
        # The city name is hard-coded to 福州 (Fuzhou)
        c.execute("INSERT INTO WEATHER (ID,NAME,DATE,WEA,TEMPERATURE) \
              VALUES (?,?,?,?,?)", (i, '福州', str(u[0]), str(u[1]), str(u[2])))
    conn.commit()

    # Read the rows back to check what was stored
    cursor = c.execute("SELECT id,name,date,wea,temperature from WEATHER")
    for row in cursor:
        print("ID = ", row[0])
        print("NAME = ", row[1])
        print("DATE = ", row[2])
        print("WEA = ", row[3])
        print("TEMPERATURE = ", row[4], "\n")
    conn.close()
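
A table created with a plain `CREATE TABLE` can only be created once per database file, so a storage function like the one above fails when it is run a second time. Below is a minimal sketch of a re-runnable variant, assuming it is acceptable to overwrite rows that share an ID. The names `store_weather` and `db_path` and the sample rows are made up for illustration and are not part of the original code.

```python
import os
import sqlite3
import tempfile

def store_weather(rows, city, db_path):
    """Insert [date, weather, temperature] rows and return everything stored.

    Hypothetical helper, not part of the original assignment code.
    """
    conn = sqlite3.connect(db_path)
    try:
        c = conn.cursor()
        # IF NOT EXISTS lets the function run against an existing database file
        c.execute("""CREATE TABLE IF NOT EXISTS WEATHER
                     (ID INT PRIMARY KEY NOT NULL,
                      NAME TEXT NOT NULL,
                      DATE TEXT NOT NULL,
                      WEA TEXT,
                      TEMPERATURE TEXT)""")
        # OR REPLACE overwrites a row whose ID is already present
        c.executemany(
            "INSERT OR REPLACE INTO WEATHER (ID, NAME, DATE, WEA, TEMPERATURE) "
            "VALUES (?, ?, ?, ?, ?)",
            [(i, city, d, w, t) for i, (d, w, t) in enumerate(rows, start=1)],
        )
        conn.commit()
        return c.execute(
            "SELECT ID, NAME, DATE, WEA, TEMPERATURE FROM WEATHER ORDER BY ID"
        ).fetchall()
    finally:
        conn.close()

# Made-up sample rows in the [date, weather, temperature] shape
sample = [["Day 14", "Cloudy", "20"], ["Day 15", "Light rain", "18"]]
path = os.path.join(tempfile.mkdtemp(), "weather_demo.db")
store_weather(sample, "福州", path)
stored = store_weather(sample, "福州", path)  # the second run does not fail
```

The trade-off is that `INSERT OR REPLACE` silently overwrites earlier rows with the same ID; if old forecasts should be kept, the ID scheme would need to change instead.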

    4) Results

    

    Full code: 陈杨泽/D_BeiMing - Gitee.com

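Assignment ②:
  Requirement: use requests together with an extraction method of your choice to crawl stock information from a target site, and store it in a database.
  Candidate sites: Eastmoney: https://www.eastmoney.com/
  Sina Stock: http://finance.sina.com.cn/stock/
  Tip: open the F12 developer tools in Chrome and capture the network traffic to find the URL that loads the stock list, then analyze the values the API returns.
  Adjust the API request parameters to match what is required. In the URL, the request parameters f1, f2, and so on
  select different values, so parameters can be removed as needed.
  Reference: https://zhuanlan.zhihu.com/p/50099084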
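
The idea of adjusting the request parameters can be sketched with the standard library. This is only an illustration: `build_url` is a hypothetical helper, the base URL and the parameter names (`pn`, `pz`, `fields`, and so on) are copied from the request URL captured for this assignment, and the shortened `fields` list is an arbitrary choice. Whether the server accepts a request without the `cb`, `ut`, and `_` parameters has not been verified here.

```python
from urllib.parse import urlencode

BASE = "http://4.push2.eastmoney.com/api/qt/clist/get"

def build_url(page, page_size=20, fields=("f2", "f3", "f12", "f14")):
    """Build a stock-list URL for one page (hypothetical helper)."""
    params = {
        "pn": page,        # page number
        "pz": page_size,   # records per page
        "po": 1,
        "np": 1,
        "fltt": 2,
        "invt": 2,
        "fid": "f3",
        "fs": "m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23",
        "fields": ",".join(fields),
    }
    # safe=":,+" leaves these separators unescaped, as in the captured URL
    return BASE + "?" + urlencode(params, safe=":,+")

url_page3 = build_url(3)
```

Building the query from a dict makes it easy to loop over pages or drop fields without editing a long string by hand.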
  Steps:
    1) Fetch the page source
import re
import requests

# Fetch the page source
def getHtml(url, page):
    # `page` is not used here; the page number is already part of `url`
    r = requests.get(url)
    # Returns a list of every substring that runs from "data" to the next ";"
    data = re.compile(r'data.*?;', re.S).findall(r.text)
    # print(data)
    return data

page = 1
url = ('http://4.push2.eastmoney.com/api/qt/clist/get?cb=jQuery1124040282861066519904_'
       '1634125184060&pn=' + str(page) + '&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&'
       'invt=2&fid=f3&fs=m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23&fields=f1,f2,f3,f4,f5,f6,f7'
       ',f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f1'
       '36,f115,f152&_=1634125184061')

    2) Extract the wanted fields with regular expressions

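def getData(html):
    # `html` must be a single string, e.g. one element of the list returned by getHtml
    data = []
    # Positions (within each record) of the fields to keep
    out = [11, 13, 1, 2, 3, 4, 5, 6, 14, 15, 16, 17]
    # Each match is the text between { and } of one stock record
    for i in re.findall(r'[\[,]{(.*?)}', html):
        temp = []
        for k, j in enumerate(i.split(',')):
            if k in out:
                # Values are appended in response order, not in the order listed in `out`
                temp.append(j.split(':')[1])
                # print(j.split(':')[1], end="\t")
        data.append(temp)

    return data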
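
Splitting each record on commas works only as long as no value contains a comma or a colon. Because the request URL carries a `cb=jQuery...` callback parameter, the response is JSON wrapped in a function call, so an alternative is to strip the wrapper and decode the payload with `json`. The sketch below assumes the response has the shape `callback({"data": {"diff": [...]}});`. The sample text is made up for illustration, `parse_jsonp` and `get_rows` are hypothetical helpers, and the chosen keys are arbitrary.

```python
import json
import re

def parse_jsonp(text):
    """Strip a JSONP wrapper such as cb({...}); and return the decoded object."""
    match = re.search(r"\((.*)\)", text, re.S)
    if match is None:
        raise ValueError("no JSONP payload found")
    return json.loads(match.group(1))

def get_rows(text, keys=("f12", "f14", "f2", "f3")):
    """Pick the given fields, in the given order, from every record."""
    payload = parse_jsonp(text)
    # Assumed layout: the records sit in a list under data -> diff
    records = (payload.get("data") or {}).get("diff", [])
    return [[rec.get(k) for k in keys] for rec in records]

# Made-up response text in the assumed shape
sample = ('jQuery123({"data":{"total":2,"diff":['
          '{"f2":10.5,"f3":1.2,"f12":"000001","f14":"Demo, Inc"},'
          '{"f2":3.3,"f3":-0.4,"f12":"000002","f14":"Example"}]}});')
rows = get_rows(sample)
```

Selecting fields by key rather than by position also fixes the column order explicitly, which makes the later INSERT easier to keep in sync.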

    3) Store in the database and query the stored results

import sqlite3

def Store(uinfo):
    # Note: CREATE TABLE raises sqlite3.OperationalError if STOCK already exists,
    # so delete cyz_2_test2.db before running this a second time.
    conn = sqlite3.connect('cyz_2_test2.db')
    print("Opened database successfully")
    c = conn.cursor()
    c.execute('''CREATE TABLE STOCK
           (ID INT PRIMARY KEY     NOT NULL,
           NAME           TEXT     NOT NULL,
           NUMBER         TEXT     NOT NULL,
           NEWPRICE       TEXT     NOT NULL,
           RISERANGE      TEXT     NOT NULL,
           RISEPRICE      TEXT     NOT NULL,
           DEALNUMBER     TEXT     NOT NULL,
           DEALPRICE      TEXT     NOT NULL,
           CHANGE         TEXT     NOT NULL,
           HIGHEST        TEXT     NOT NULL,
           LOWEST         TEXT     NOT NULL);''')
    print("Table created successfully")
    # Each entry of uinfo is one record from getData; only u[1] through u[10] are stored
    for i, u in enumerate(uinfo, start=1):
        c.execute("INSERT INTO STOCK (ID,NAME,NUMBER,NEWPRICE,RISERANGE,RISEPRICE,DEALNUMBER,DEALPRICE,CHANGE,HIGHEST,LOWEST) \
              VALUES (?,?,?,?,?,?,?,?,?,?,?)", (i, u[1], u[2], u[3], u[4], u[5], u[6], u[7], u[8], u[9], u[10]))
    conn.commit()

    # Read the rows back to check what was stored
    cursor = c.execute(
        "SELECT ID,NAME,NUMBER,NEWPRICE,RISERANGE,RISEPRICE,DEALNUMBER,DEALPRICE,CHANGE,HIGHEST,LOWEST from STOCK")
    for row in cursor:
        print("ID = ", row[0])
        print("NAME = ", row[1])
        print("NUMBER = ", row[2])
        print("NEWPRICE = ", row[3])
        print("RISERANGE = ", row[4])
        print("RISEPRICE = ", row[5])
        print("DEALNUMBER = ", row[6])
        print("DEALPRICE = ", row[7])
        print("CHANGE = ", row[8])
        print("HIGHEST = ", row[9])
        print("LOWEST = ", row[10], '\n')
    conn.close()
    4) Results
  
  Full code: 陈杨泽/D_BeiMing - Gitee.com
 
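Assignment ③:
  Requirement: crawl the information for every institution in the Best Chinese Universities Ranking 2021 main list (https://www.shanghairanking.cn/rankings/bcur/2021) and store it in a database. Also record the browser F12 debugging and analysis process as a GIF and include it in the blog post.
  Tip: analyze the requests the site sends and find the API that returns the data.
  Steps:
    1) Analyze and debug with F12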
    

    2) Fetch the page source

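import re
import requests

# Fetch the page source
# `headers` is the request-header dict defined elsewhere in the full code (not shown here)
def getHtml(url, page):
    # `page` is not used inside this function
    r = requests.get(url, headers=headers)
    # Returns a list of every substring that runs from "data" to the next ";"
    data = re.compile(r'data.*?;', re.S).findall(r.text)
    # print(data)
    return data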

    3) Match the required data with regular expressions

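def getData(html):
    # Returns a flat list: [name1, score1, name2, score2, ...]
    data = []
    for i in re.findall(r'univNameCn:.*?",.*?score:.*?,', html):
        name = re.findall(r'[\u4e00-\u9fa5]+', i)         # match Chinese characters
        # Anchor on "score:" so digits elsewhere in the chunk are not picked up
        score = re.findall(r'score:(\d+(?:\.\d+)?)', i)   # match the number after "score:"
        data.append(name[0])
        if len(score) != 0:
            data.append(score[0])
        else:
            data.append('暂无')  # "not available"
    return data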
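
To see what the chunk pattern in `getData` captures, here is a small self-contained run on a made-up fragment written in the same `key:value` style. The sample text, the university names in it, and the `extract_pairs` name are assumptions for illustration. Returning `(name, score)` tuples instead of a flat list is a design variation, not the behaviour of the original code.

```python
import re

def extract_pairs(text):
    """Return (name, score) tuples; score is None when it is not a number."""
    pairs = []
    # Same chunk pattern as getData: from univNameCn up to the comma after score
    for chunk in re.findall(r'univNameCn:.*?",.*?score:.*?,', text):
        name = re.search(r'univNameCn:"(.*?)"', chunk).group(1)
        score = re.search(r'score:(\d+(?:\.\d+)?),', chunk)
        pairs.append((name, score.group(1) if score else None))
    return pairs

# Made-up fragment; the second score is a non-numeric placeholder
sample = ('univNameCn:"示例大学",univNameEn:"Example University",score:900.5,ranking:"1",'
          'univNameCn:"样例学院",univNameEn:"Sample College",score:a,ranking:"2",')
pairs = extract_pairs(sample)
```

Keeping each name and score together in one tuple avoids the index arithmetic needed to walk a flat list two items at a time.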

    4) Store in the database

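import sqlite3

def Store(data):
    # Note: CREATE TABLE raises sqlite3.OperationalError if RANK already exists,
    # so delete cyz_2_test3.db before running this a second time.
    conn = sqlite3.connect('cyz_2_test3.db')
    print("Opened database successfully")
    c = conn.cursor()
    # The COURSE column holds the score extracted by getData
    c.execute('''CREATE TABLE RANK
           (ID INT PRIMARY KEY     NOT NULL,
           NAME           TEXT     NOT NULL,
           COURSE         TEXT     NOT NULL);''')
    print("Table created successfully")

    # data is a flat list [name1, score1, name2, score2, ...]; walk it two items at a time
    for i in range(0, len(data), 2):
        d = i // 2 + 1  # IDs start at 1
        c.execute("INSERT INTO RANK (ID, NAME, COURSE) \
              VALUES (?,?,?)", (d, data[i], data[i+1]))
    conn.commit()

    # Read the rows back to check what was stored
    cursor = c.execute("SELECT id,name,course from RANK")
    for row in cursor:
        print("ID = ", row[0])
        print("NAME = ", row[1])
        print("COURSE = ", row[2], "\n")
    conn.close()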

    5) Results

 

 Full code: 陈杨泽/D_BeiMing - Gitee.com

 

 
posted @ 2021-10-14 20:34  D北冥