20220703 Web scraping & data processing

1、

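Yesterday I managed to fetch the data. Today I found that in the DataFrame all the records ended up in a single row, which makes them hard to split into columns, so I looked it up online. Converting a list to a DataFrame this way normally stores it as one row, so it has to be transposed. After transposing, each row does become one comma-separated string. The code is as follows: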

data = get_data()
df = pd.DataFrame(data=[data],index=['a']).T
print(df.head())

What if I want to convert the list into a dict first and then store it as a DataFrame? (Reference: https://blog.csdn.net/linxent/article/details/104345845)
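
A minimal sketch of the dict route, assuming each scraped record is a comma-separated string with four fields, as produced by get_data(). The helper name rows_to_columns, the sample rows, and the column names are made up for illustration and are not part of the original script.

```python
def rows_to_columns(rows, columns):
    """Turn comma-separated rows into a dict of column name -> list of values."""
    table = {name: [] for name in columns}
    for row in rows:
        fields = row.split(",")
        if len(fields) != len(columns):
            # Columns of unequal length could not be turned into a DataFrame later.
            raise ValueError(
                "expected %d fields, got %d: %r" % (len(columns), len(fields), row)
            )
        for name, value in zip(columns, fields):
            table[name].append(value)
    return table


# Hypothetical sample rows in the same "a,b,c,d" shape that get_data() returns.
rows = [
    "project-1,status-1,person-1,time-1",
    "project-2,status-2,person-2,time-2",
]
# Hypothetical column names; the real meaning of each field depends on the page.
columns = ["project_name", "status", "recent_person", "last_updated"]

table = rows_to_columns(rows, columns)
print(table["project_name"])
```

Passing the result to pd.DataFrame(table) should then give one column per key, with no transpose or str.split step. A field that itself contains a comma would break the split, so this only holds while the scraped text is comma-free.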

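from time import sleep

import pandas as pd
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

# `wait` is the WebDriverWait instance created during browser setup (not shown here).
def get_data():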
    j = 1
    total = []
    while j <= 3:
        sleep(1)
        lst = []
        lst1 = []
        lst2 = []
        lst3 = []
        for i in range(1,11):
            Project_name = wait.until(EC.presence_of_element_located((By.XPATH, "//*[@id='main-frame']/div[4]/div/div[1]/div[2]/div[1]/div[4]/div[2]/table/tbody/tr[%s]/td[1]"%i)))
            Stat_tel = wait.until(EC.presence_of_element_located((By.XPATH, "//*[@id='main-frame']/div[4]/div/div[1]/div[2]/div[1]/div[3]/table/tbody/tr[%s]/td[2]/div/span"%i)))
            Recent_person = wait.until(EC.presence_of_element_located((By.XPATH,"//*[@id='main-frame']/div[4]/div/div[1]/div[2]/div[1]/div[3]/table/tbody/tr[%s]/td[5]"%i)))
            Last_Updated = wait.until(EC.presence_of_element_located((By.XPATH,"//*[@id='main-frame']/div[4]/div/div[1]/div[2]/div[1]/div[3]/table/tbody/tr[%s]/td[6]/div/span"%i)))
            lst.append(Project_name.text)
            lst1.append(Stat_tel.text)
            lst2.append(Recent_person.text)
            lst3.append(Last_Updated.text)
        len_a = len(lst)
        len_b = len(lst1)
        len_c = len(lst2)
        len_d = len(lst3)
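        # All four lists must have the same length, otherwise the rows cannot be paired up.
        if not (len_a == len_b == len_c == len_d):
            print("The columns contain different numbers of scraped items")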
        for i in range(len_a):
            total.append(lst[i]+","+ lst1[i]+"," + lst2[i]+","+ lst3[i])
        fanye = wait.until(EC.presence_of_element_located((By.XPATH, "//*[@id='main-frame']/div[4]/div/div[1]/div[2]/div[2]/div[2]/div/ul/li[%s]"%j)))
        fanye.click()
        print("Scraped page %s" % j)
        #fanye = wait.until(EC.presence_of_element_located((By.XPATH, "//*[@id='main-frame']/div[4]/div/div[1]/div[2]/div[2]/div[2]/div/span[2]/div/input")))
        #fanye.send_keys(j)
        sleep(2)
        #fanye.send_keys(Keys.ENTER)
        j += 1
    sleep(1)
    return total

#def data_clean():
data = get_data()
df = pd.DataFrame(data=[data],index=['a']).T
print(df.head())
df1 = df.join(df['a'].str.split(',', expand=True))
print(df1)

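The method above does split the single column into separate columns. The remaining problem is in the scraped columns themselves: the time is shown as labels such as 上午 (morning) or 星期一 (Monday) rather than as a standard date string. I'll fix that later.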
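
One possible way to handle those labels later, as a rough sketch only. It assumes that a label starting with 上午 or 下午 refers to the scrape day itself and that a bare weekday such as 星期一 means the most recent such day before today. Neither assumption has been checked against the target site, and normalize_time_label is a hypothetical helper name.

```python
import datetime

# date.weekday() numbers Monday as 0, which this mapping follows.
WEEKDAY_LABELS = {
    "星期一": 0, "星期二": 1, "星期三": 2, "星期四": 3,
    "星期五": 4, "星期六": 5, "星期日": 6, "星期天": 6,
}


def normalize_time_label(label, today):
    """Map a relative label to an ISO date string; unknown labels are returned unchanged."""
    label = label.strip()
    # Assumption: 上午 / 下午 (morning / afternoon) labels refer to the scrape day.
    if label.startswith(("上午", "下午")):
        return today.isoformat()
    if label in WEEKDAY_LABELS:
        # Assumption: a bare weekday is the most recent such day strictly before today.
        days_back = (today.weekday() - WEEKDAY_LABELS[label]) % 7 or 7
        return (today - datetime.timedelta(days=days_back)).isoformat()
    return label


today = datetime.date(2022, 7, 3)  # a Sunday
for label in ["上午", "星期一", "2022-06-01"]:
    print(label, "->", normalize_time_label(label, today))
```

On the split DataFrame this could be applied with df1[3].map(lambda s: normalize_time_label(s, datetime.date.today())), since the last-updated field ends up in column 3.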

2、

How do I write the data out as an Excel workbook under a specified path? See: https://blog.csdn.net/m0_47671192/article/details/117742544. The current implementation is:

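import datetime
import os

def write():
    # GetDesktopPath() and data_clean() are defined elsewhere in the script.
    path = GetDesktopPath()
    df = pd.DataFrame(data_clean())
    # strftime("%Y%m%d") already yields digits only, so no regex cleanup is needed.
    filename = "呼叫系统数据" + datetime.datetime.now().strftime("%Y%m%d") + '.xlsx'
    # Join with the desktop path so the file lands in the specified directory.
    with pd.ExcelWriter(os.path.join(path, filename), mode='w', engine="openpyxl") as writer:
        df.to_excel(writer, sheet_name='呼叫系统高级查询', index=False)
#    writer = pd.ExcelWriter(filename)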

Goal achieved for now.
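
The path handling can be isolated into a small helper, sketched here under a few assumptions: build_output_path is a hypothetical name, and the Desktop folder under the home directory only stands in for whatever GetDesktopPath() returns in the real script.

```python
import datetime
import os


def build_output_path(directory, prefix, when=None):
    """Return directory/prefixYYYYMMDD.xlsx for the given date (default: today)."""
    when = when or datetime.date.today()
    filename = "%s%s.xlsx" % (prefix, when.strftime("%Y%m%d"))
    return os.path.join(directory, filename)


# Assumed stand-in for GetDesktopPath(); the real helper may resolve the folder differently.
desktop = os.path.join(os.path.expanduser("~"), "Desktop")
print(build_output_path(desktop, "呼叫系统数据", datetime.date(2022, 7, 3)))
```

The returned string can be passed straight to pd.ExcelWriter, which keeps the file naming separate from the writing step and makes it easy to check without touching the disk.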

3、

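For how to package the script, see: https://zhuanlan.zhihu.com/p/370914926. The author explains it in great detail.

I won't go into the library installation process here.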

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pandas

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple python-docx

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pyinstaller

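Install the required libraries in the conda environment with the commands above. That is enough as long as the script's runtime requirements are met.

Problem: during scraping, the page number was not picked up. After analyzing the page elements carefully, I found that the XPath of a pager item does not follow the page number. The pager only ever shows 7 elements, so once paging has passed page 4 the page-turning approach has to switch.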

    # Fragment from the body of the page-turning function:
    # i is the current page number, total is the number of pages.
    if i < 5:
        for j in range(1,5):
            fanye = wait.until(EC.presence_of_element_located((By.XPATH, "//*[@id='main-frame']/div[4]/div/div[1]/div[2]/div[2]/div[2]/div/ul/li[%s]"%i)))
            fanye.click()
            sleep(sleeptime)
            i += 1
        print("Scraped page %s" % i)
    # A chained comparison replaces the original list literal, which was always truthy.
    elif 5 <= i < total:
        fanye = wait.until(EC.presence_of_element_located((By.XPATH, "//*[@id='main-frame']/div[4]/div/div[1]/div[2]/div[2]/div[2]/div/ul/li[6]")))
        fanye.click()
        sleep(sleeptime)
    i += 1
    print("Scraped page %s" % i)
    return i
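
The position rule in the snippet above can be pulled out into a pure function, which makes it testable without a browser. This is only a sketch: pager_li_index is a hypothetical name, the thresholds 5 and 6 are taken from the snippet and are specific to this page, and the stop on the last page is an added assumption.

```python
PAGER_XPATH = "//*[@id='main-frame']/div[4]/div/div[1]/div[2]/div[2]/div[2]/div/ul/li[%s]"


def pager_li_index(page, total):
    """Return the li[...] position to click while on `page`, or None on the last page."""
    if page >= total:
        # Added assumption: nothing is left to click once the last page is reached.
        return None
    if page < 5:
        # Early pages: the position matches the page number, as in the snippet.
        return page
    # From page 5 on the pager window slides, so the snippet always clicks li[6].
    return 6


for page in (1, 4, 5, 9, 10):
    index = pager_li_index(page, total=10)
    print(page, PAGER_XPATH % index if index else "no further page")
```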

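For the page-turning logic, see https://blog.csdn.net/weixin_40000457/article/details/111633965, which can be used here. If I can't find an answer to a problem, I must be asking it the wrong way, so let me rethink. The actual question is: how do I locate a dynamic page number with XPath? (References: https://www.cnblogs.com/String-song/p/14304230.html, https://zhuanlan.zhihu.com/p/68118490, https://blog.csdn.net/qq_33673130/article/details/89468967)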

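3、The page turns out to be loaded dynamically. The jump-to-page control can only be located and triggered once, after which the script exits on its own without any error. I don't know how to solve this yet.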

4、

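0705: solved today. The cause was the clear() operation: the input box apparently has a default value, so it could not be emptied. The key point is that after clear() the input seems to lose focus, at which point it picks up the default value again. Someone more experienced explained it this way: it is a front-end input issue, and when clear() runs the page detects that the box is empty and assigns the default value back. You can open the console in the browser's F12 tools and check whether setting the value directly with a JS statement works. I searched for the problem of Selenium being unable to clear an input's default value and tried double-clicking the input box, which solved it completely, so paging now works. The other approach, selecting all and then clearing with clear(), had no effect on the target site. (References: https://blog.csdn.net/sun_977759/article/details/108731881, https://www.cnblogs.com/lulua/p/10882971.html)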

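from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

# find_element(By.XPATH, ...) is the Selenium 4 form of find_element_by_xpath.
element = driver.find_element(By.XPATH, 'XPath of the input box')
# Double-clicking selects the current value, so the typed text replaces it.
ActionChains(driver).double_click(element).perform()
element.send_keys('009')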
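
The JS route suggested above was not what fixed the problem here, but it could look roughly like the sketch below. set_input_value and SET_VALUE_JS are hypothetical names, and whether the page accepts a value set this way is an assumption that has not been tested on the target site. driver.execute_script itself is the standard Selenium call.

```python
# Set the value, then fire the events a front-end framework usually listens for.
SET_VALUE_JS = """
arguments[0].value = arguments[1];
arguments[0].dispatchEvent(new Event('input', {bubbles: true}));
arguments[0].dispatchEvent(new Event('change', {bubbles: true}));
"""


def set_input_value(driver, element, value):
    """Set an input's value through JavaScript instead of clear() + send_keys()."""
    # execute_script hands the extra arguments to the script as arguments[0], arguments[1], ...
    driver.execute_script(SET_VALUE_JS, element, str(value))
```

With a live driver this would be called as set_input_value(driver, element, 9) on the page-number input, followed by whatever key press or click confirms the jump.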

posted @ 2022-07-03 13:19 dion至君
