A Small Web Scraper

compbio94 2017-06-12

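Now that I'm home, I'm writing up a small program I built at work this month, for everyone's reference!

My boss recently asked me to scrape some gaokao (college entrance exam) admission data, so I went looking for a website.

This is the site I found. Looking at it, I noticed the following: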

https://gkcx.eol.cn/soudaxue/queryProvinceScore.html?&page=1

The URLs are generated incrementally: the page parameter runs 1, 2, 3, 4, ... up to the last page.

That makes things easy: a single loop is enough.

for i in range(1, 10000):
    url = 'https://gkcx.eol.cn/soudaxue/queryProvinceScore.html?&page=%s' % i
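The loop above can also be wrapped in a small generator so that the page range is explicit. This is only a minimal sketch: the names `BASE_URL`, `page_urls`, `first_page` and `last_page` are made up for illustration, and the real last page number has to be read off the site.

```python
# URL pattern taken from the post; only the page number changes.
BASE_URL = 'https://gkcx.eol.cn/soudaxue/queryProvinceScore.html?&page=%s'


def page_urls(first_page, last_page):
    """Yield the listing URL for each page number, inclusive on both ends."""
    for page in range(first_page, last_page + 1):
        yield BASE_URL % page


# Print the URLs of the first three pages.
for url in page_urls(1, 3):
    print(url)
```

Keeping the URL construction in one place means the scraping code only has to deal with page numbers.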

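But the page source turns out to contain no data at all. That means the content must be loaded asynchronously via AJAX or generated as dynamic HTML (DHTML), so the JavaScript or jQuery code in the page has to be executed before the data can be read.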
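One way to confirm that the data is missing from the raw source is to parse the HTML without running any JavaScript and look at the text inside the target element. The sketch below uses only the standard library. The two sample HTML strings are invented for illustration, the names `ElementTextFinder` and `static_text` are hypothetical, and the id `seachtab` is the one used in the source code at the end of this post.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the depth.
VOID_TAGS = {'br', 'img', 'input', 'hr', 'meta', 'link'}


class ElementTextFinder(HTMLParser):
    """Collect the text found inside elements with a given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0      # > 0 while we are inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get('id') == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def static_text(html, element_id):
    """Return the text of the element as it appears WITHOUT running scripts."""
    finder = ElementTextFinder(element_id)
    finder.feed(html)
    return ' '.join(finder.chunks)


# Invented samples: an empty shell filled in later by JavaScript,
# and the same element once the rows are present.
shell = '<div id="seachtab"></div><script>/* rows loaded later */</script>'
filled = '<div id="seachtab"><table><tr><td>A</td><td>600</td></tr></table></div>'

print(repr(static_text(shell, 'seachtab')))
print(repr(static_text(filled, 'seachtab')))
```

If the raw HTML fetched from the site yields an empty string here while the browser shows a full table, the rows are being added by script, and a plain HTTP request plus an HTML parser will not be enough.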

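So this is the tool for the job (Selenium driving PhantomJS, as in the source code below):

And with that, we get the data.

As for installing PhantomJS, you can follow the guides available online; in my experience these setup problems are fairly easy to solve. Feel free to get in touch to discuss.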

================================================

Source code:

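import time

from selenium import webdriver

# Path to the local PhantomJS executable
PHANTOMJS = 'C:\\Users\\jyjh\\Desktop\\phantomjs-2.1.1-windows\\bin\\phantomjs'

data = []
for i in range(2, 3):  # only page 2 here; widen the range to fetch more pages
    time.sleep(3)  # pause between requests
    driver = webdriver.PhantomJS(executable_path=PHANTOMJS)
    url = 'https://gkcx.eol.cn/soudaxue/queryProvinceScore.html?&page=%s' % i
    driver.get(url)
    # Grab the rendered text of the element with id 'seachtab'
    data.append(driver.find_element_by_id('seachtab').text)
    driver.quit()  # shut down the PhantomJS process, not just the window
    print(i)

# Write every scraped page to one text file, one page after another
with open('C:\\Users\\jyjh\\Desktop\\zhongguojiaoyuzaixian.txt', 'w', encoding='utf-8') as f:
    for page_text in data:
        f.write(page_text + '\n')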
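The script above depends on `webdriver.PhantomJS` and `find_element_by_id`, which recent Selenium 4 releases no longer provide. The following is a rough sketch of how the same idea might look with headless Chrome. It assumes Selenium 4 with Chrome available on the machine, it has not been run against the site, and the helper names `page_url`, `scrape_pages` and `save_pages` are made up for illustration.

```python
import time

BASE_URL = 'https://gkcx.eol.cn/soudaxue/queryProvinceScore.html?&page=%s'


def page_url(page):
    """Build the listing URL for one page number."""
    return BASE_URL % page


def scrape_pages(pages, delay=3, timeout=10):
    """Return the rendered text of the 'seachtab' element for each page."""
    # Imported here so the other helpers work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)  # one browser for all pages
    texts = []
    try:
        for page in pages:
            driver.get(page_url(page))
            # Poll until the element exists and its text is non-empty,
            # i.e. until the page's JavaScript has filled it in.
            text = WebDriverWait(driver, timeout).until(
                lambda d: d.find_element(By.ID, 'seachtab').text.strip())
            texts.append(text)
            time.sleep(delay)  # pause between requests
    finally:
        driver.quit()  # always shut the browser down
    return texts


def save_pages(texts, path):
    """Write each page's text to the file, one page per block."""
    with open(path, 'w', encoding='utf-8') as f:
        for text in texts:
            f.write(text + '\n')
```

Calling `save_pages(scrape_pages(range(2, 3)), 'zhongguojiaoyuzaixian.txt')` would mirror the original run for page 2. Reusing one browser for all pages avoids starting a new process on every iteration, and the explicit wait replaces guessing how long the page needs to load.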

陈浩杰

2017.6.12

