Blog post: Web scraping with the requests library -- 俞惠铭's blog

	http://blog.sysuschool.com/u/mygod/index.html

2020/10/12 15:27:00


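The previous post summarized the urllib library; this one covers the requests library. Note that requests is a third-party library, not the urllib.request module from the standard library, so it must be installed first: run pip install requests in a terminal.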

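The requests module sends a request and returns a response. The two most common entry points are requests.get() and requests.post(), which accept url, headers, params, data, and proxies directly. These are, respectively: the URL; the request headers (which may carry browser information in User-Agent, the Referer, and a Cookie); the URL query string (as a dict); the data submitted by a POST (as a dict); and the proxies to use.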

The call looks like this:

requests.xxx(url=xx, headers=xx, params=xx, data=xx, proxies=xx)

# Much more concise than urllib; everything except url is optional
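To see how these arguments combine, a request can be built without being sent. The sketch below is only an illustration: the URL, the query parameters, and the User-Agent string are made-up placeholders, not values from this post. It uses requests.Request(...).prepare() to inspect the final URL and headers offline.

```python
import requests

# Hypothetical values, for illustration only.
url = "https://example.com/search"
params = {"q": "python", "page": 2}
headers = {"User-Agent": "demo-client/0.1"}

# Build the request without sending it, then inspect what would be sent.
prepared = requests.Request("GET", url, params=params, headers=headers).prepare()

print(prepared.method)                 # GET
print(prepared.url)                    # https://example.com/search?q=python&page=2
print(prepared.headers["User-Agent"])  # demo-client/0.1
```

The params dict is encoded into the query string for you, which is why the URL itself can stay free of hand-built "?key=value" fragments.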

The response object returned by requests.get() and requests.post() has the following commonly used attributes and methods. Given, for example:

response = requests.get("https://www.baidu.com/")

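response.text # response body as str, decoded with a guessed encoding (a wrong guess produces garbled text)

response.content # response body as bytes; decode it with .decode(), passing "utf-8", "gbk", etc.

response.encoding # encoding used to decode response.text

response.status_code # HTTP status code

response.request.headers # request headers

response.headers # response headers (Set-Cookie, etc.)

response.request.url # request URL

response.url # response URL (may differ from the request URL, e.g. after a redirect)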

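response.cookies # cookies set by the response, held in a RequestsCookieJar. A raw cookie string (for example one copied from the browser) can be turned into a dict with cookies = {i.split("=")[0]: i.split("=")[1] for i in cookies.split("; ")}; for response.cookies itself, use the method below

response.cookies.get_dict() # extract the cookie data as a dict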
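The two cookie conversions can be tried without any network access. This is a minimal sketch: the helper name cookie_string_to_dict and the cookie values are invented for illustration. It splits on the first "=" only, so values that themselves contain "=" survive, and it builds a RequestsCookieJar by hand to stand in for response.cookies.

```python
from requests.cookies import RequestsCookieJar


def cookie_string_to_dict(raw):
    """Turn a Cookie header value such as 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    pairs = (item.split("=", 1) for item in raw.split("; ") if "=" in item)
    return {name: value for name, value in pairs}


# Hypothetical cookie string, e.g. copied from the browser's developer tools.
raw_cookie = "sessionid=abc123; token=x=y=z"
parsed = cookie_string_to_dict(raw_cookie)
print(parsed)  # {'sessionid': 'abc123', 'token': 'x=y=z'}

# A hand-built jar standing in for response.cookies.
jar = RequestsCookieJar()
jar.set("sessionid", "abc123")
jar_dict = jar.get_dict()
print(jar_dict)  # {'sessionid': 'abc123'}
```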

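The two most important attributes are text and content; both hold the data returned by the server. text is a str that has already been decoded, but with a guessed encoding, so a wrong guess yields garbled characters. content is the same data left undecoded, as bytes; decode it yourself with .decode(), which defaults to UTF-8 and is usually what you want.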
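Why a wrong encoding guess matters can be shown with plain bytes, no request needed. The helper decode_body below is a hypothetical name introduced for this sketch, and the candidate list ("utf-8", then "gbk") is an assumption that suits Chinese sites; with requests, raw would be response.content.

```python
def decode_body(raw, encodings=("utf-8", "gbk")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Nothing matched: fall back to a lossy decode instead of raising.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


# Bytes as a GBK-encoded page would deliver them.
raw = "中国".encode("gbk")
text, used = decode_body(raw)
print(text, used)  # 中国 gbk
```

UTF-8 is tried first because it rejects most non-UTF-8 input, whereas GBK will happily decode many byte sequences into the wrong characters. When the correct encoding is known in advance, setting response.encoding before reading response.text achieves the same result.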

Example: fetch a response from Baidu (GET request), simulating a Baidu search for the keyword "中国"

import requests

params = {
    'wd': '中国'
}

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}

response = requests.get('https://www.baidu.com/s', params=params, headers=headers)

with open('baidu.html', 'w', encoding='utf-8') as f:
    # content is bytes and must be decoded before writing to a text file
    f.write(response.content.decode('utf-8'))

Example: scrape job listings from Lagou (POST request)

import requests

# An AJAX POST request
url = 'https://www.lagou.com/jobs/positionAjax.json?city=%E5%B9%BF%E5%B7%9E&needAddtionalResult=false'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput='
}

data = {
    'first': 'true',
    'pn': 1,
    'kd': 'python'
}

response = requests.post(url, data=data, headers=headers)

print(response.text)    # response.text is the raw JSON string
print(response.json())  # response.json() parses it into a dict

# The website json.cn can be used to inspect JSON data
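What the data argument actually puts on the wire can be checked offline. The sketch below reuses the form fields from the example above, but the target URL is a placeholder (an assumption, not the Lagou endpoint), and nothing is sent: the request is only prepared so its body and Content-Type can be inspected.

```python
import requests
from urllib.parse import parse_qs

data = {"first": "true", "pn": 1, "kd": "python"}

# Placeholder URL; the request is prepared but never sent.
prepared = requests.Request("POST", "https://example.com/api", data=data).prepare()

print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
print(prepared.body)                     # first=true&pn=1&kd=python

# Parse the body back to confirm every field was encoded.
form = parse_qs(prepared.body)
print(form)
```

A dict passed as data is form-encoded, the same format a browser uses when submitting an HTML form, which is why non-string values such as the page number 1 arrive at the server as text.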

Example: using a proxy

import requests

proxies = {
    "http": "http://12.34.56.79:9527",
    "https": "https://12.34.56.79:9527",
}

requests.get("http://www.baidu.com", proxies=proxies)
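The keys of the proxies dict are matched against the scheme of the target URL. As a rough illustration, the helper requests.utils.select_proxy (a utility function inside requests, used here only to show the lookup, not something the original example calls) returns the proxy that would be chosen for a given URL. The proxy addresses are the placeholder ones from the example above.

```python
from requests.utils import select_proxy

proxies = {
    "http": "http://12.34.56.79:9527",
    "https": "https://12.34.56.79:9527",
}

# The URL's scheme decides which entry is used.
http_proxy = select_proxy("http://www.baidu.com", proxies)
https_proxy = select_proxy("https://www.baidu.com/s", proxies)

print(http_proxy)   # http://12.34.56.79:9527
print(https_proxy)  # https://12.34.56.79:9527
```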

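Handling cookies and session requests

requests provides a Session class that keeps the conversation between client and server alive across requests.

# Instantiate a Session object and send GET or POST requests through it
session = requests.Session()

# First send the login request through the session; the cookies are stored in the session
# (headers must be passed as a keyword argument)
response = session.get(url, headers=headers)

# Then use the same session to request pages that require login; the session automatically sends the cookies saved at login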
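How a session carries state into later requests can be observed without logging in anywhere. In this sketch the cookie is set by hand as a stand-in for what a login response would normally store, and the domain, cookie name, and User-Agent are all made-up values. session.prepare_request() then shows which headers the next request would carry.

```python
import requests

session = requests.Session()
session.headers.update({"User-Agent": "demo-client/0.1"})

# Stand-in for a cookie that a successful login response would set.
session.cookies.set("token", "abc", domain="example.com", path="/")

# Prepare (but do not send) a follow-up request through the session.
request = requests.Request("GET", "http://example.com/profile")
prepared = session.prepare_request(request)

print(prepared.headers["User-Agent"])  # demo-client/0.1
print(prepared.headers.get("Cookie"))  # token=abc
```

Both the session-level headers and the stored cookie are merged into the outgoing request, which is what makes a login performed once remain effective for the pages requested afterwards.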

Example: scrape a personal profile page on Renren

# coding=utf-8
import requests

session = requests.Session()

post_url = "http://www.renren.com/PLogin.do"  # Renren
post_data = {"email": "xxx@163.com", "password": "xxx"}
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36"}

# Send the login POST through the session; the cookies are stored in it
session.post(post_url, data=post_data, headers=headers)

# Use the same session to request an address that requires login
r = session.get("http://www.renren.com/327550029/profile", headers=headers)

# Save the page
with open("renren1.html", "w", encoding="utf-8") as f:
    f.write(r.content.decode())

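# SSL certificate verification: HTTPS requests verify the server certificate by default; verification can be turned off

response = requests.get("https://www.12306.cn/mormhweb/", verify=False)  # verify controls certificate verification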
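The verify setting can also be applied once per session instead of per call. The following is a sketch under assumptions: make_session is a hypothetical helper name, and the CA bundle path is a placeholder that is never opened here. When verification is off, requests emits an InsecureRequestWarning on each call, so the helper silences it through urllib3.

```python
import requests
import urllib3


def make_session(verify=True):
    """Return a Session whose requests use the given `verify` setting."""
    session = requests.Session()
    session.verify = verify
    if verify is False:
        # Silence the warning printed for every unverified HTTPS request.
        urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    return session


insecure = make_session(verify=False)
# verify may also be a path to a CA bundle (hypothetical path, not read here).
pinned = make_session(verify="/path/to/ca-bundle.pem")

print(insecure.verify)  # False
print(pinned.verify)    # /path/to/ca-bundle.pem
```

Turning verification off removes protection against forged certificates, so pointing verify at the correct CA bundle is the safer choice whenever one is available.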

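That wraps up this summary of the requests module. Its syntax is concise, so I included several examples. Time permitting, the next post will summarize scraping with the Selenium automation tool.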
