Perfect Data

Question

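I am trying to scrape some data from a flight search page.

The page works like this:

You fill in a form and click the search button; that part is fine. Clicking the button redirects you to a page with the results, and this is where the problem lies. The page keeps adding results for a while, for example for a minute, which is not a big deal in itself. The problem is getting all of those results. In a real browser you have to scroll down the page for these results to appear, so I tried to scroll down with Selenium. But it reaches the bottom of the page so quickly, or perhaps jumps there instead of scrolling, that the page does not load any new results.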

When you scroll down slowly, the page loads more results, but if you scroll very quickly it stops loading.
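
One way to approximate the slow manual scrolling described above is to move down in small steps and pause after each one. The following is a minimal sketch, not a tested solution for this particular site: the function name `scroll_slowly` and the values of `step`, `pause` and `max_steps` are assumptions made for illustration, and `driver` only needs to provide Selenium's `execute_script` method.

```python
import time


def scroll_slowly(driver, step=300, pause=0.5, max_steps=1000):
    """Scroll down in small increments so lazily loaded results can appear.

    `step` (pixels), `pause` (seconds) and `max_steps` are illustrative
    values, not tuned for any real page. Returns the number of steps made.
    """
    position = 0
    steps = 0
    while steps < max_steps:
        # Re-read the height on every pass: it grows as results are appended.
        height = driver.execute_script("return document.body.scrollHeight;")
        if position >= height:
            break  # the target offset has reached the current page height
        position += step
        driver.execute_script("window.scrollTo(0, arguments[0]);", position)
        steps += 1
        time.sleep(pause)  # give the page time to fetch and render results
    return steps
```

With a real browser this would be called as `scroll_slowly(self.driver)` in place of the single jump to `document.body.scrollHeight`. Whether small steps are enough to trigger loading depends on how the page detects scrolling, so the step size and pause may need adjusting.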

I'm not sure whether my code helps in understanding the problem, so I am attaching it.

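import re

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

import mLib  # helper module not shown in the question; getSoup_html presumably returns a BeautifulSoup object

# Placeholder for the real search URL template; prepare_get fills six % slots in it.
SEARCH_STRING = """URL"""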

class spider():

    def __init__(self):
        self.driver = webdriver.Firefox()

    @staticmethod
    def prepare_get(dep_airport,arr_airport,dep_date,arr_date):
        string = SEARCH_STRING%(dep_airport,arr_airport,arr_airport,dep_airport,dep_date,arr_date)
        return string


    def find_flights_html(self,dep_airport, arr_airport, dep_date, arr_date):
        # Several departure airports are joined with a URL-encoded space.
        if isinstance(dep_airport, list):
            dep_airport = '%20'.join(dep_airport)

        wait = WebDriverWait(self.driver, 60) # wait for results
        self.driver.get(spider.prepare_get(dep_airport, arr_airport, dep_date, arr_date))
        wait.until(EC.invisibility_of_element_located((By.XPATH, '//img[contains(@src, "loading")]')))
        wait.until(EC.invisibility_of_element_located((By.XPATH, u'//div[. = "Poprosíme o trpezlivosť, hľadáme pre Vás ešte viac letov"]/preceding-sibling::img')))
        self.driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")

        self.driver.find_element_by_xpath('//body').send_keys(Keys.CONTROL+Keys.END)
        return self.driver.page_source

    @staticmethod
    def get_info_from_borderbox(div):
        # The price element is located here but not printed yet.
        price = div.find('div',class_='pricebox').find('div',class_=re.compile('price'))
        # The second 'departure' div is the one read for the date and airport.
        departure = div.find_all('div',class_='departure')[1].contents
        date_departure = departure[1].text
        airport_departure = departure[5].text
        arrival = div.find_all('div', class_= 'arrival')[0].contents
        date_arrival = arrival[1].text
        airport_arrival = arrival[3].text[1:]
        print 'DEPARTURE: '
        print date_departure,airport_departure
        print 'ARRIVAL: '
        print date_arrival,airport_arrival

    @staticmethod
    def get_flights_from_result_page(html):

        def match_tag(tag, classes):
            return (tag.name == 'div'
                    and 'class' in tag.attrs
                    and all([c in tag['class'] for c in classes]))

        soup = mLib.getSoup_html(html)
        divs = soup.find_all(lambda t: match_tag(t, ['borderbox', 'flightbox', 'p2']))

        for div in divs:
            spider.get_info_from_borderbox(div)

        print len(divs)


spider_inst = spider()

# get_flights_from_result_page prints what it finds and returns None, so its result is not printed.
spider.get_flights_from_result_page(spider_inst.find_flights_html(['BTS','BRU','PAR'], 'MAD', '2015-07-15', '2015-08-15'))

So I think the main problem is that it scrolls too quickly to trigger the loading of new results.

Do you have any idea how to make it work?
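
If scrolling too fast is indeed the cause, one possible approach is to keep scrolling in small steps until the number of results on the page stops growing. The sketch below is only an illustration under that assumption: `scroll_until_stable`, `count_results` and all default values are hypothetical names and numbers, and `count_results` is a function supplied by the caller that reports how many results are currently on the page.

```python
import time


def scroll_until_stable(driver, count_results, step=300, pause=0.5,
                        idle_rounds=5, max_rounds=2000):
    """Scroll in small steps until `count_results()` stops increasing.

    Stops after `idle_rounds` consecutive steps without a new result, or
    after `max_rounds` steps in total. All defaults are illustrative.
    Returns the last observed result count.
    """
    last_count = count_results()
    idle = 0
    for _ in range(max_rounds):
        if idle >= idle_rounds:
            break  # nothing new appeared for a while; assume loading is done
        driver.execute_script("window.scrollBy(0, arguments[0]);", step)
        time.sleep(pause)  # let the page react to the scroll
        current = count_results()
        if current > last_count:
            last_count = current
            idle = 0  # new results arrived, so keep going
        else:
            idle += 1
    return last_count
```

With Selenium, `count_results` could be something like `lambda: len(driver.find_elements(By.CSS_SELECTOR, "div.borderbox.flightbox.p2"))`, using the class names from the code above. The page source would then be read only after the function returns.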


Scroll down a page slowly using Selenium
