Python 3 Web Crawler Study Notes (basic library usage, ongoing series)

2020年2月12日23:27:58


Using urllib

The request module (request)

urlopen()

The urlopen function sends a request to a website, much as a browser would.

Example:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# author m77 time:2020/2/10
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

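The output is the HTML source of the Python home page. Calling read() returns the response body as bytes, which decode('utf-8') turns into text.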
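
To see what read() returns without depending on a live site, the sketch below opens a data: URL, which urlopen handles locally. The URL and its "Hello crawler" text are made up for illustration; with a real page the calls are the same, only the body arrives over the network.

```python
import urllib.request

# A data: URL carries its own body, so no network access is needed.
# The text "Hello crawler" is a made-up payload for this sketch.
url = 'data:text/plain;charset=utf-8,Hello%20crawler'

with urllib.request.urlopen(url) as response:
    first = response.read(5)   # read(n) returns at most n bytes
    rest = response.read()     # read() returns everything that is left
    charset = response.headers.get_content_charset()

# The body is bytes; decode it with the charset the response declares.
text = (first + rest).decode(charset)
print(type(first), text)
```

Reading the charset from the headers instead of hard-coding 'utf-8' is a design choice of this sketch; pages that declare no charset would need a fallback.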

※ The timeout parameter

Example:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# author m77 time:2020/2/10
import socket
import urllib.error
import urllib.request

try:
    # Give up if the server has not responded within 0.1 seconds.
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    # A timeout is reported as a URLError whose reason is socket.timeout.
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

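Output (when the server does not respond within 0.1 seconds):

TIME OUT

Summary: timeout sets how long to wait for a response, in seconds. If the limit is exceeded an exception is raised, so it is usually combined with exception handling.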
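
One case the example does not cover: depending on when the timeout fires, urlopen may raise socket.timeout directly instead of wrapping it in URLError (an observation about CPython's behavior, worth checking on your own version). A minimal sketch that treats both cases the same way follows; the helper names is_timeout and fetch are hypothetical.

```python
import socket
import urllib.error
import urllib.request


def is_timeout(exc):
    """Return True if exc is a timeout, wrapped in URLError or not."""
    if isinstance(exc, urllib.error.URLError):
        return isinstance(exc.reason, socket.timeout)
    return isinstance(exc, socket.timeout)


def fetch(url, timeout=5.0):
    """Return the response body as bytes, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout) as exc:
        if is_timeout(exc):
            return None
        raise  # any other failure is passed on to the caller
```

A call such as fetch('http://httpbin.org/get', timeout=0.1) would then return None where the original example prints TIME OUT.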

The Request class

Example:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# author 7z time:2020/2/10
import urllib.request
request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

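Output: the HTML source of the page, as in the urlopen() example.

Summary: the request is built as a Request object and passed to urlopen, which makes it possible to configure the request (for example its headers, data, and method) before it is sent.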
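
The point of a Request object is that it can be configured before it is sent. The sketch below builds one and inspects it without sending anything; the form data and the User-Agent value are made-up examples.

```python
import urllib.parse
import urllib.request

# Hypothetical form data and header value, chosen only for illustration.
form = urllib.parse.urlencode({'name': 'germey'}).encode('utf-8')
request = urllib.request.Request(
    'http://httpbin.org/post',
    data=form,                               # the body must be bytes
    headers={'User-Agent': 'Mozilla/5.0'},   # extra request headers
    method='POST',
)

# Nothing has been sent yet; the object only describes the request.
print(request.get_method())
print(request.full_url)
print(request.get_header('User-agent'))  # header names are stored capitalized
print(request.data)
```

Passing this object to urllib.request.urlopen(request) would send it.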

※ Handling the authentication prompt

Example:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# author 7z time:2020/2/11
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError
username = 'username'
password = 'password'
url = 'http://localhost:5000'
p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

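Summary: build_opener creates an opener that carries the authentication handler. When the link is opened with the opener's open() method, the opener supplies the username and password to the server, which completes the authentication.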
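
To make the password manager less of a black box, the sketch below looks up the stored credentials and builds the header value that HTTP Basic authentication uses. It runs without a server; the realm name 'Some Realm' and the path '/page' are made-up values.

```python
import base64
from urllib.request import HTTPPasswordMgrWithDefaultRealm

username = 'username'
password = 'password'
url = 'http://localhost:5000'

p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)

# Because the realm was registered as None, the lookup succeeds for any
# realm the server names and for any path below the registered URL.
found = p.find_user_password('Some Realm', url + '/page')

# HTTP Basic authentication sends "username:password" encoded as Base64.
token = base64.b64encode(f'{username}:{password}'.encode('utf-8')).decode('ascii')
header = 'Basic ' + token
print(found, header)
```

Base64 is an encoding, not encryption, so Basic authentication is only safe over HTTPS.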

※ Handling cookies

Example:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# author 7z time:2020/2/11
import http.cookiejar
import urllib.request
cookie = http.cookiejar.CookieJar()
handle = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handle)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)

Output: one name=value line for each cookie the site sets.

※ Exception handling

URLError

Example:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# author 7z time:2020/2/14
from urllib import request,error
try:
    response = request.urlopen('https://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)

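Output: the reason the request failed.

Summary: catch URLError to handle a failed request, and print its reason attribute to see why it failed.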

HTTPError

Example:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# author 7z time:2020/2/14
from urllib import request, error
try:
    response = request.urlopen('https://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

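Output: if the server answers with an error status, the reason, status code, and response headers are printed on separate lines; if the request succeeds, Request Successfully is printed.

Summary: HTTPError is a subclass of URLError, so catch the subclass first and the parent class second.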
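
The order of the except clauses matters because HTTPError inherits from URLError. The sketch below checks that relationship and runs hand-built exceptions through the same clause order, so no request is made; the describe helper and the example.com URL are hypothetical.

```python
import io
from email.message import Message
from urllib import error


def describe(exc):
    """Classify an exception using the same clause order as the example."""
    try:
        raise exc
    except error.HTTPError as e:   # subclass first
        return f'HTTP {e.code}: {e.reason}'
    except error.URLError as e:    # then the parent class
        return f'URL error: {e.reason}'


# Hand-built exceptions, so nothing is fetched. The URL is a placeholder.
http_err = error.HTTPError('http://example.com/missing', 404, 'Not Found',
                           Message(), io.BytesIO(b''))
url_err = error.URLError('connection refused')

print(issubclass(error.HTTPError, error.URLError))
print(describe(http_err))
print(describe(url_err))
```

If the two except clauses were swapped, the URLError clause would also catch every HTTPError and the status code would never be reported.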

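※ Parsing links

The urlparse() method (identifying and splitting a URL)

Example:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# author 7z time:2020/2/15
from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)

Output:

<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

Summary: urlparse splits a URL into six parts. scheme is the protocol, netloc is the domain name, path is the access path, params holds the parameters, query is the query condition, and the part after # is the fragment, an anchor that locates a position inside the page. The standard link format is therefore: scheme://netloc/path;params?query#fragment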
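
The result of urlparse is a named tuple, and the function takes two optional arguments, scheme and allow_fragments. The sketch below shows what each does, reusing the Baidu URL from the example with its query written as id=5.

```python
from urllib.parse import urlparse

url = 'http://www.baidu.com/index.html;user?id=5#comment'
result = urlparse(url)

# ParseResult is a named tuple: fields are available by name or by index.
print(result.scheme, result[0])
print(result.netloc, result[1])

# scheme= only supplies a default for URLs that carry no scheme themselves.
# Without "//" the host is not recognized and lands in path instead.
no_scheme = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(no_scheme.scheme, repr(no_scheme.netloc), no_scheme.path)

# allow_fragments=False leaves the "#..." part inside the preceding component.
no_frag = urlparse(url, allow_fragments=False)
print(no_frag.query, repr(no_frag.fragment))
```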

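The urlunparse() method (constructing a URL)

Example:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# author 7z time:2020/2/15
from urllib.parse import urlunparse
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

Output:

http://www.baidu.com/index.html;user?a=6#comment

Summary: the opposite of urlparse; it builds a URL from its parts. The argument must contain exactly 6 items.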

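The urlsplit() method (similar to urlparse)

It differs only in that params is not split out but stays in path, so the result has 5 parts instead of 6.

The urlunsplit() method (similar to urlunparse)

This method also assembles a URL, but the argument must have 5 items.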
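
Since these two methods come without an example, here is a minimal sketch of both, using the same Baidu URL and parts as the urlparse and urlunparse examples.

```python
from urllib.parse import urlsplit, urlunsplit

parts = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(parts)
# Five fields, and ";user" is still attached to the path.
print(len(parts), parts.path)

# urlunsplit takes five items: scheme, netloc, path, query, fragment.
data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']
joined = urlunsplit(data)
print(joined)
```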

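The urljoin() method (joining two URLs)

Example:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# author 7z time:2020/2/15
from urllib.parse import urljoin
print(urljoin('http://baidu.com', 'FAQ.html'))
print(urljoin('http://baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
print(urljoin('http://baidu.com', '?category=2#comment'))
print(urljoin('http://baidu.com#comment', '?category=2'))

Output:

http://baidu.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html?question=2
https://cuiqingcai.com/index.php
http://baidu.com?category=2#comment
http://baidu.com?category=2

Summary: urljoin fills in whatever the second URL is missing (scheme, netloc, path) from the first, base URL. If the second URL already has these parts, its own values are used and the base URL is ignored.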
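
In a crawler, urljoin is typically used to turn the href values found on a page into absolute URLs. A minimal sketch, assuming a made-up page on example.com and a handful of made-up links:

```python
from urllib.parse import urljoin

# Hypothetical page URL and the href values found on it.
page = 'http://example.com/docs/index.html'
hrefs = ['page2.html', '/about', '../img/logo.png', 'https://www.python.org/']

# Resolve every link against the URL of the page it was found on.
absolute = [urljoin(page, href) for href in hrefs]
for link in absolute:
    print(link)
```

Relative links are resolved against the directory of the page, a leading slash restarts from the site root, and a link that is already absolute is returned unchanged.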

※ The urlencode() method (turning a dictionary of parameters into a query string)

Example:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# author 7z time:2020/2/16
from urllib.parse import urlencode
params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

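Output:

http://www.baidu.com?name=germey&age=22

Summary: keep the parameters in a dictionary and let urlencode serialize them into the URL when they are needed.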
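
urlencode also takes care of escaping, which matters once values contain spaces or non-ASCII text. A short sketch; the parameter names wd, q, and tag are only examples.

```python
from urllib.parse import urlencode

# Non-ASCII text is percent-encoded as UTF-8 and spaces become "+".
escaped = urlencode({'wd': '壁纸', 'q': 'hello world'})
print(escaped)

# doseq=True expands a list into repeated keys. 'tag' is a made-up parameter.
repeated = urlencode({'tag': ['a', 'b']}, doseq=True)
print(repeated)
```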

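※ The parse_qs() method (the reverse of urlencode: turns a GET query string into a dictionary)

Example:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# author 7z time:2020/2/16
from urllib.parse import parse_qs
query = 'name=germey&age=22'
print(parse_qs(query))

Output:

{'name': ['germey'], 'age': ['22']}

The parse_qsl() method (similar to parse_qs, but returns a list of tuples)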
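
The difference between the two is easiest to see side by side. A minimal sketch using the query string name=germey&age=22:

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

query = 'name=germey&age=22'
as_dict = parse_qs(query)     # values are lists, since a key may repeat
as_pairs = parse_qsl(query)   # a list of (key, value) tuples, in order
print(as_dict)
print(as_pairs)

# The pairs can be passed straight back to urlencode.
round_trip = urlencode(as_pairs)
print(round_trip == query)
```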

※ The quote() method (URL-encoding content)

Example:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# author 7z time:2020/2/16
from urllib.parse import quote
keyword = '壁纸'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)

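Output:

https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8

Summary: quote percent-encodes the Chinese characters so that they can be placed in a URL.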

※ The unquote() method (URL-decoding)

Example:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# author 7z time:2020/2/16
from urllib.parse import unquote
url = 'https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8'
print(unquote(url))

Output:

https://www.baidu.com/s?wd=壁纸

Summary: the opposite of quote; it decodes a URL-encoded string.
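
Which characters quote leaves alone is controlled by its safe argument, and quote_plus is the variant meant for query strings. A short sketch; the sample text 'a b/c' is made up.

```python
from urllib.parse import quote, quote_plus, unquote

text = 'a b/c'   # made-up sample containing a space and a slash
default = quote(text)            # "/" is left alone by default
strict = quote(text, safe='')    # safe='' encodes "/" as well
plus = quote_plus(text)          # spaces become "+", and "/" is encoded
print(default, strict, plus)

# quote and unquote are inverses of each other.
keyword = '壁纸'
print(unquote(quote(keyword)) == keyword)
```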

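※ The Robots protocol (crawler protocol)

The Robots protocol tells crawlers which pages of a site may be crawled and which may not. It usually takes the form of a text file named robots.txt in the site's root directory. Common forms are shown below:

Sample 1: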

User-agent:*
Disallow:/
Allow:/public/

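User-agent names the crawler the rules apply to; * means they apply to every crawler. Disallow lists the directories that may not be crawled; / means no page may be crawled. Allow is normally used together with Disallow to make exceptions to it. Here /public/ means that nothing may be crawled except the public directory.

Sample 2: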

User-agent:*
Disallow:/

No crawler may crawl any page.

Sample 3:

User-agent:*
Disallow:

Every crawler may crawl any directory. An empty robots.txt file has the same effect: everything may be crawled by default.

Sample 4:

User-agent:*
Disallow:/private/
Disallow:/tmp/

Crawlers are barred from the listed directories.

Sample 5:

User-agent:WebCrawler
Disallow:
User-agent:*
Disallow:/

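Only one crawler is allowed access.

※ The RobotFileParser class (checking whether a site may be crawled)

set_url(): sets the URL of the robots.txt file, usually in the site's root directory. The URL can also be passed in when the object is created.
read(): fetches the robots.txt file and analyzes it. Either this method or parse() must be called; otherwise every check returns False.
parse(): parses the lines of a robots.txt file.
can_fetch(): takes two arguments, a user-agent and the URL to crawl, and returns True or False depending on whether that crawler may fetch the URL.
mtime(): returns the time robots.txt was last fetched and analyzed.
modified(): sets the time robots.txt was last fetched and analyzed to the current time.

Example:

#!/usr/bin/env python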
# -*- coding:utf-8 -*-
# author 7z time:2020/2/16
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url('https://www.7z7z7z.cn/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://www.7z7z7z.cn'))
print(rp.mtime())

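Output: True or False, depending on the site's robots.txt, followed by the time at which the file was fetched.

Summary: this class checks whether a site's robots.txt allows a page to be crawled.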
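
parse() makes it possible to try the class without fetching anything. The sketch below feeds it the text of Sample 4 and Sample 5 from the Robots protocol section; the build_parser helper, the example.com URLs, and the crawler name OtherBot are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Parse robots.txt text that is already in memory."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp


# Sample 4: two directories are off-limits to every crawler.
sample4 = build_parser('User-agent:*\nDisallow:/private/\nDisallow:/tmp/')
index_ok = sample4.can_fetch('*', 'http://example.com/index.html')
private_ok = sample4.can_fetch('*', 'http://example.com/private/a.html')
print(index_ok, private_ok)

# Sample 5: only the crawler named WebCrawler is allowed in.
sample5 = build_parser('User-agent:WebCrawler\nDisallow:\nUser-agent:*\nDisallow:/')
crawler_ok = sample5.can_fetch('WebCrawler', 'http://example.com/index.html')
other_ok = sample5.can_fetch('OtherBot', 'http://example.com/index.html')
print(crawler_ok, other_ok)
```

In a real crawler the same can_fetch check would run before each request, with the parser filled by set_url() and read() instead of parse().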

※ Using requests

Basic usage

GET requests (similar to urlopen)

Example:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# author 7z time:2020/2/17
import requests
r = requests.get('https://httpbin.org/get')
print(r.text)

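Output: a JSON document describing the request.

Summary: this URL echoes back information about the request it received; here it confirms that a GET request was sent.

Example:

#!/usr/bin/env python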
# -*- coding:utf-8 -*-
# author 7z time:2020/2/17
import requests
data = {
    'name': 'germey',
    'age': '22'
}
r = requests.get("http://httpbin.org/get",params=data)
print(r.text)

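Output: the same kind of JSON document, now with name and age listed under args.

Summary: store the parameters in a dictionary, then pass it as params to build a URL that carries them.

Fetching binary data

Example:

#!/usr/bin/env python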
# -*- coding:utf-8 -*-
# author 7z time:2020/2/17
import requests
r = requests.get("https://github.com/favicon.ico")
with open('favicon.ico', 'wb') as f:
    f.write(r.content)

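Output:

A favicon.ico file is created in the same folder.

Summary: the first argument to open is the file name; the second, 'wb', opens the file for writing in binary mode. r.content holds the response body as bytes.

POST requests

Example:

#!/usr/bin/env python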
# -*- coding:utf-8 -*-
# author 7z time:2020/2/17
import requests
data = {
    'name': 'germey',
    'age': '22'
}
r = requests.post("http://httpbin.org/post", data = data)
print(r.text)

Output: httpbin echoes the request back, with the submitted name and age listed under form.
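
To see where params and data end up without contacting httpbin, a request can be prepared but not sent. This is a minimal sketch that assumes the requests package is installed; Request(...).prepare() is used here only for inspection and is not how the examples in this post send their requests.

```python
import requests

data = {'name': 'germey', 'age': '22'}

# prepare() builds the request exactly as it would be sent, without sending it.
get_req = requests.Request('GET', 'http://httpbin.org/get', params=data).prepare()
post_req = requests.Request('POST', 'http://httpbin.org/post', data=data).prepare()

print(get_req.url)                       # params end up in the query string
print(post_req.url)                      # the POST URL is left untouched
print(post_req.body)                     # data is form-encoded into the body
print(post_req.headers['Content-Type'])
```

The same dictionary therefore travels in the URL for a GET request and in the body for a POST request.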
