pandas read_html error: No tables found

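pandas is a great tool that I'm sure many people have used. I picked up the basics about a year ago when my teacher taught it. Its fast data loading and its wide range of easy-to-use processing functions make it hard to put down. Only in the last month did I suddenly discover that pandas can fetch the tables of a web page directly (a pleasant surprise). Until then I had been using something like requests + xpath + lxml to locate and extract the data I was interested in. pd.read_html trims that code down and makes processing easier, which is a real pleasure. Enough rambling; here is a note on the problem I ran into today.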

1. Problem

The page I was interested in contains tables (it is a static page), so I called pd.read_html() and unexpectedly got the error: No tables found

2. Solution

1.1 Add attributes to locate the table

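pd.read_html(url, attrs={'': ''})

This is where I spotted a problem: the table tag has none of the usual attributes such as name, class, or id, so I targeted its parent container, a div, instead.

pd.read_html(url, attrs={'class': 'table_xxx'})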

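Unfortunately, it still could not find the table. (attrs is matched against the <table> element itself, so a class on the parent div cannot match.)

I then tried the table tag's own layout attributes, with the same result.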
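To make the attrs behavior concrete, here is a minimal sketch on an inline HTML string. The markup, the id `scores`, and the placement of the `table_xxx` class on the div are made up for illustration; the real page's structure is not reproduced here. Under those assumptions, it shows that attrs filters on the attributes of the <table> element, so an attribute of the surrounding div does not select the table.

```python
from io import StringIO

import pandas as pd

# Hypothetical markup: the class sits on the parent div, the id on the table.
html = """
<div class="table_xxx">
  <table id="scores">
    <tr><th>name</th><th>score</th></tr>
    <tr><td>a</td><td>1</td></tr>
  </table>
</div>
"""

# The id belongs to the <table> element, so this lookup finds it.
by_id = pd.read_html(StringIO(html), attrs={'id': 'scores'})

# The class belongs to the div, not the table, so read_html raises ValueError.
try:
    pd.read_html(StringIO(html), attrs={'class': 'table_xxx'})
    div_class_matched = True
except ValueError:
    div_class_matched = False

print(by_id[0])
print('div class matched:', div_class_matched)
```

If the table really has no usable attributes, the match parameter of read_html (a string or regex that must appear in the table's text) may be a more practical filter than attrs.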

1.2

Going back to what the source code says, the first thing that caught my eye was the first paragraph of the docstring:

io : str or file-like

A URL, a file-like object, or a raw string containing HTML. Note that lxml only accepts the http, ftp and file url protocols. If you have a URL that starts with 'https' you might try removing the 's'.

So io accepts a URL, a file, or a string. lxml does not accept https URLs, so I tried removing the s.

Result: failed.

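1.3 Specify header and add an encoding

Note: the header here is not headers (the HTTP request headers); it is the row that holds the column titles.

The struggle here was futile, so I won't type out the code.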
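For reference, a small sketch of what the header parameter does, using a made-up inline table rather than the real page. It only selects which parsed row becomes the column names and has no effect on how the page is requested, which fits the observation that it cannot fix a "No tables found" error.

```python
from io import StringIO

import pandas as pd

# Hypothetical table whose first row holds the column titles as plain <td> cells.
html = """
<table>
  <tr><td>city</td><td>population</td></tr>
  <tr><td>A</td><td>100</td></tr>
  <tr><td>B</td><td>200</td></tr>
</table>
"""

# Without header, the title row is treated as ordinary data.
no_header = pd.read_html(StringIO(html))[0]

# With header=0, row 0 is used as the column names.
with_header = pd.read_html(StringIO(html), header=0)[0]

print(no_header)
print(with_header)
```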

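1.4 A roundabout approach

Since there is an io parameter, I figured read_html does not have to fetch the page itself: get the HTML source first, then parse it. That works as well.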

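Using selenium
    from io import StringIO

    import pandas as pd
    from selenium import webdriver

    url = 'xxx'
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    # options.add_argument('--disable-gpu')
    # Set the browser window size; options must be set before the driver is created
    options.add_argument('--window-size=1334,750')

    # Initialize the driver
    driver = webdriver.Chrome(options=options)
    # driver.maximize_window()

    driver.get(url)
    html = driver.page_source
    driver.quit()

    # read_html returns a list of DataFrames; StringIO avoids passing a literal HTML string
    df = pd.read_html(StringIO(html), header=0)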

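Using requests
    from io import StringIO

    import pandas as pd
    import requests

    url = 'xxxx'
    # headers is assumed to be defined elsewhere; this User-Agent is only a placeholder
    headers = {'User-Agent': 'Mozilla/5.0'}

    response = requests.get(url=url, headers=headers)
    # Decode the response bytes to text (UTF-8 by default)
    res = response.content.decode()

    df = pd.read_html(StringIO(res), header=0)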

And with that I got the table 😼

19:59:07

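3. Cause

My knowledge is limited (I wrote this up right after finishing the code, and at this moment I don't know the cause either 😂). If any expert passes by, I'd be grateful for some pointers.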
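One possible way to narrow the cause down, offered as a sketch and not a confirmed diagnosis: count the <table> tags in the HTML that each fetch method actually returns. If the source obtained through requests or selenium contains tables while a direct pd.read_html(url) fails, the difference probably lies in how the page is requested, not in the parsing. The names TableCounter and count_tables below are hypothetical helpers, and the sample strings stand in for real page source.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts opening <table> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == 'table':
            self.tables += 1


def count_tables(html: str) -> int:
    """Return the number of <table> start tags found in html."""
    counter = TableCounter()
    counter.feed(html)
    return counter.tables


# Stand-ins for page source obtained in different ways.
with_table = '<div><table><tr><td>1</td></tr></table></div>'
without_table = '<div id="app"></div>'

print(count_tables(with_table))
print(count_tables(without_table))
```

In practice the same function could be applied to response.text from requests and to driver.page_source from selenium to compare what each one sees.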

posted @ 2021-02-19 19:58 cheflone
