Getting the JavaScript window object from Python

Python dependencies

pip install PyExecJS
pip install lxml
pip install beautifulsoup4
pip install requests

Node.js dependencies

Global install command

npm install jsdom -g
or
yarn global add jsdom

Once jsdom is installed, the following code runs normally in Node

const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
window = dom.window;
document = window.document;
XMLHttpRequest = window.XMLHttpRequest;

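With jsdom installed globally, the code above works fine in node. When the same code is run from Python through execjs, however, a global-only install is not enough.
With only a global install, the call fails with the following error, saying jsdom cannot be found

execjs._exceptions.ProgramError: Error: Cannot find module 'jsdom'

There are two ways to fix this
1. Install jsdom with npm in the directory the Python script runs from
2. Use the cwd parameter to point at the directory that contains the module. For example, with jsdom installed globally, running npm root -g in cmd prints the global module path: C:\Users\w001\AppData\Roaming\npm\node_modules
The code can then be written as follows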

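import execjs

# 'your_script.js' is a placeholder for the JavaScript file to run
with open(r'your_script.js', 'r', encoding='utf-8') as f:
    js = f.read()
# cwd points at the global node_modules directory so that require("jsdom") resolves
ct = execjs.compile(js, cwd=r'C:\Users\w001\AppData\Roaming\npm\node_modules')
print(ct.call('Rohr_Opt.reload', '1'))
# eval must be called on the compiled context (ct), not on the source string (js)
print(ct.eval("window.pageData"))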
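
Hard-coding the global module path ties the script to one machine. The following is a minimal sketch, not part of the original post, that asks npm for the path by running the same npm root -g command. The helper name npm_global_root and its parameters are hypothetical, and it assumes npm is on PATH.

```python
import shutil
import subprocess


def npm_global_root(npm_path=None, run=subprocess.run):
    """Return the directory printed by `npm root -g`.

    npm_path and run are hypothetical parameters added for illustration:
    npm_path overrides the PATH lookup, and run lets a caller substitute
    the command runner (for example in a test).
    """
    # shutil.which also resolves npm.cmd on Windows
    npm = npm_path or shutil.which("npm")
    if npm is None:
        raise FileNotFoundError("npm was not found on PATH")
    result = run(
        [npm, "root", "-g"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    # npm prints the path followed by a newline
    return result.stdout.strip()
```

The returned path can then be passed as the cwd argument, for example execjs.compile(js, cwd=npm_global_root()).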

A Python crawler example

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# @Author: Irving Shi

import execjs
import json
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"
}


def get_company(key):
    res = requests.get("https://aiqicha.baidu.com/s?q=" + key, headers=headers)
    soup = BeautifulSoup(res.text, features="lxml")
    tag = soup.find_all("script")[2].decode_contents()
    tag = """const jsdom = require("jsdom");
    const { JSDOM } = jsdom;
    const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
    window = dom.window;
    document = window.document;
    XMLHttpRequest = window.XMLHttpRequest; """ + tag
    js = execjs.compile(tag, cwd=r'C:\Users\Administrator\AppData\Roaming\npm\node_modules')

    res = js.eval("window.pageData").get("result").get("resultList")[0]
    return res


res = get_company("91360000158304717T")
# for i in res.items():
#     print(i)

pid = res.get("pid")
r = requests.get("https://aiqicha.baidu.com/detail/basicAllDataAjax?pid=" + pid, headers=headers)
data = json.loads(r.text).get("data").get("basicData")
for i in data.items():
    print(i)
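
The crawler above picks the script tag by position (find_all("script")[2]), which breaks as soon as the page adds or removes a script. A minimal sketch, assuming the wanted script is the one whose text contains window.pageData, selects it by content instead. It uses the standard library html.parser rather than BeautifulSoup only so that it runs without extra packages; the names ScriptCollector and find_script are hypothetical.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self.scripts.append("".join(self._buf))
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it
        if self._in_script:
            self._buf.append(data)


def find_script(html, marker="window.pageData"):
    """Return the first script body containing marker, or None."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    for script in parser.scripts:
        if marker in script:
            return script
    return None
```

In get_company, the result of find_script(res.text) would take the place of the tag variable before the jsdom prelude is prepended; a None result means the page no longer contains the expected script.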

Running JavaScript through Python's execjs can raise this error:

UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 41: illegal multibyte sequence

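The cause is an encoding mismatch: on a Chinese-locale Windows system, subprocess most likely decodes the Node process's output with the default gbk codec, while that output is UTF-8. The fix used here is to change the default value of the encoding parameter in the __init__ constructor of the Popen class in subprocess.py to utf-8.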

Before

    _child_created = False  # Set here since __del__ checks it

    def __init__(self, args, bufsize=-1, executable=None,
                 stdin=None, stdout=None, stderr=None,
                 preexec_fn=None, close_fds=_PLATFORM_DEFAULT_CLOSE_FDS,
                 shell=False, cwd=None, env=None, universal_newlines=False,
                 startupinfo=None, creationflags=0,
                 restore_signals=True, start_new_session=False,
                 pass_fds=(), *, encoding=None, errors=None):
        """Create new Popen instance."""
        _cleanup()
        # Held while anything is calling waitpid before returncode has been
        # updated to prevent clobbering returncode if wait() or poll() are
        # called from multiple threads at once.  After acquiring the lock,
        # code must re-check self.returncode to see if another thread just
        # finished a waitpid() call.
        self._waitpid_lock = threading.Lock()

After

    _child_created = False  # Set here since __del__ checks it

    def __init__(self, args, bufsize=-1, executable=None,
                 stdin=None, stdout=None, stderr=None,
                 preexec_fn=None, close_fds=_PLATFORM_DEFAULT_CLOSE_FDS,
                 shell=False, cwd=None, env=None, universal_newlines=False,
                 startupinfo=None, creationflags=0,
                 restore_signals=True, start_new_session=False,
                 pass_fds=(), *, encoding="utf-8", errors=None):
        """Create new Popen instance."""
        _cleanup()
        # Held while anything is calling waitpid before returncode has been
        # updated to prevent clobbering returncode if wait() or poll() are
        # called from multiple threads at once.  After acquiring the lock,
        # code must re-check self.returncode to see if another thread just
        # finished a waitpid() call.
        self._waitpid_lock = threading.Lock()

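Because this modifies standard library source code, it is recommended to work in a virtual environment (venv). Note that a virtual environment shares the standard library with its base interpreter, so an edit to that interpreter's subprocess.py still affects every environment built on it.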

pip install virtualenv
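
An alternative that leaves subprocess.py untouched is to change the default at runtime, for the current process only. The following is a minimal sketch of that idea, not the method used in this post. It assumes execjs picks up Popen from the subprocess module when it is imported, so the patch must run before import execjs. The names patch_popen_utf8 and read_child_output are hypothetical.

```python
import subprocess
import sys
from functools import partial


def patch_popen_utf8():
    """Make subprocess.Popen default to encoding="utf-8" in this process."""
    # Side effects: every Popen pipe is opened in text mode unless the
    # caller passes its own encoding argument, and subprocess.Popen is no
    # longer a class afterwards (isinstance checks against it will fail).
    if not isinstance(subprocess.Popen, partial):
        subprocess.Popen = partial(subprocess.Popen, encoding="utf-8")


def read_child_output(code):
    """Run a Python one-liner in a child process and return its stdout."""
    completed = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return completed.stdout
```

Calling patch_popen_utf8() at the top of the script, before import execjs, is intended to have the same effect as the edit shown above without touching the installed Python. read_child_output is only there to check that UTF-8 output from a child process is decoded correctly after the patch.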

posted @ 2020-12-06 14:07 陨落&新生
