Getting table data from a web page with Python

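#### How to get table data from a web page with Python

```python
# -*- coding: utf-8 -*-
import io
import urllib.request as ur

import pandas as pd

# Display options: print every row and column without truncation
pd.set_option('display.width', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.float_format', '{:,.3f}'.format)
# Count East Asian wide characters as two columns so printed tables stay aligned
pd.set_option('display.unicode.ambiguous_as_wide', True)
pd.set_option('display.unicode.east_asian_width', True)


class HtmlDownloader(object):
    def download(self, url):
        """Return the page body as bytes, or None if url is missing or the status is not 200."""
        if url is None:
            return None
        response = ur.urlopen(url)
        if response.getcode() != 200:
            return None
        return response.read()


# Download the target page
hd = HtmlDownloader()
html = hd.download(url='http://www.stats.gov.cn/xxgk/sjfb/zxfb2020/202204/t20220419_1829785.html')
# print(html)
# print(type(html))

# Read the tables on the page; read_html is a very handy scraping tool.
# It always returns a list of DataFrames, one per <table>, even if the page has only one table.
# pandas 2.1+ deprecates passing literal HTML, so the bytes are wrapped in a file-like BytesIO.
df = pd.read_html(io.BytesIO(html))
# print(type(df[0]))

# Create a new Excel file to hold the tables. ExcelWriter acts as a container for the sheets,
# and the with-block saves and closes the file on exit.
with pd.ExcelWriter("网页的表格.xlsx") as writer:
    # print(type(writer))
    # A page may contain several tables, so loop over all of them
    for cnt, df1 in enumerate(df, start=1):
        # index: whether to write the row index; header: whether to write the column names
        df1.to_excel(writer, sheet_name='表' + str(cnt), index=False, header=False)
```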
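
The `download` method above lets any network error propagate, and `urlopen` already raises `HTTPError` for non-2xx responses, so the `getcode() != 200` check rarely triggers. A more defensive variant might look like the sketch below. It is not the original author's code: the function name `download`, the timeout value, and the `User-Agent` string are all assumptions made for illustration.

```python
import urllib.request

# Hypothetical values chosen for this sketch; neither comes from the original post.
DEFAULT_TIMEOUT = 10  # seconds
USER_AGENT = "Mozilla/5.0 (table-scraper sketch)"


def download(url, timeout=DEFAULT_TIMEOUT):
    """Return the page body as bytes, or None if the request fails."""
    if not url:
        return None
    try:
        # Some servers reject urllib's default User-Agent, so send an explicit one.
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError) as exc:
        # URLError and HTTPError are subclasses of OSError, as are socket timeouts;
        # ValueError covers strings that are not valid URLs.
        print(f"download failed for {url!r}: {exc}")
        return None
```

The returned bytes can be passed to `pd.read_html` in the same way as `html` in the script above, after checking that the result is not `None`.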

posted @ 2023-07-11 14:37 冀未然
