
Atguigu (尚硅谷) Web Scraping note004

news 2025/8/23 22:11:32

I. The urllib library

1. Bundled with Python; no installation required

# _*_ coding : utf-8 _*_
# @Time : 2025/2/11 09:39
# @Author : 20250206-里奥
# @File : demo14_urllib
# @Project : PythonProject10-14

# Use urllib to fetch the source of the Baidu home page.
# 1. Import urllib.request
import urllib.request

# 2. Define the URL: the address to visit
url = "http://www.baidu.com"

# 3. Simulate a browser sending a request to the server
response = urllib.request.urlopen(url)

# 4. Get the page source from the response
# read() returns the content as bytes
# content = response.read()
# Decode: convert the bytes to a string with decode()
content = response.read().decode("utf-8")

# 5. Print the data
print(content)
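
The read-then-decode step can be tried without touching the network. Below is a minimal sketch, assuming a throwaway server on 127.0.0.1 stands in for Baidu; `PageHandler`, `fetch_text`, and the page body are hypothetical names made up for illustration. Instead of hard-coding "utf-8", it reads the charset from the Content-Type header and falls back to UTF-8 when none is declared.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = "<html><body>你好, urllib</body></html>"  # made-up page body


class PageHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: serves PAGE as UTF-8 HTML for every GET."""

    def do_GET(self):
        body = PAGE.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_text(url):
    """Return (raw bytes, decoded str) for the page at url."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes, not str
        charset = response.headers.get_content_charset() or "utf-8"
    return raw, raw.decode(charset)


# Assumption: ignore any system proxy so the request goes straight to 127.0.0.1.
urllib.request.install_opener(
    urllib.request.build_opener(urllib.request.ProxyHandler({}))
)

server = HTTPServer(("127.0.0.1", 0), PageHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    raw, text = fetch_text(f"http://127.0.0.1:{server.server_port}/")
finally:
    server.shutdown()
    server.server_close()

print(type(raw).__name__, "->", type(text).__name__)
print(len(raw), "bytes ->", len(text), "characters")
```

The byte count is larger than the character count because each Chinese character takes several bytes in UTF-8, which is why the bytes must be decoded before being treated as text.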

2. One type, six methods

2-1) One type

# One type and six methods
response = urllib.request.urlopen(url)
# print(type(response))

The response is an http.client.HTTPResponse object.

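2-2) Six methods

2-2-1) read()

# Read the response body as bytes (the whole body when called without an argument)
content = response.read()
print(content)

# read(n) returns at most n bytes; read(5) returns 5 bytes
content = response.read(5)
print(content)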

2-2-2) readline()

# Read one line
content = response.readline()
print(content)

2-2-3) readlines()

# Read all remaining lines; returns a list of bytes objects
content = response.readlines()
print(content)

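2-2-4) response.getcode()

# Get the HTTP status code (200 means success)
print(response.getcode())

2-2-5) response.geturl()

# Return the URL that was fetched
print(response.geturl())

2-2-6) response.getheaders()

# Get the response headers as a list of (name, value) tuples
print(response.getheaders())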
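
One detail the snippets above leave implicit: the response body is a stream that can be consumed only once, so each read call continues where the previous one stopped, and each snippet above assumes a fresh response. Below is a minimal sketch of all six methods on a single response, assuming a throwaway local server in place of Baidu so the results are predictable; `LinesHandler` and `BODY` are made-up names. The Python documentation has marked getcode() and geturl() as deprecated since 3.9 in favor of the `status` and `url` attributes, although both calls still work.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"line one\nline two\nline three\n"  # made-up response body


class LinesHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: returns BODY as plain text for every GET."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Assumption: ignore any system proxy so the request goes straight to 127.0.0.1.
urllib.request.install_opener(
    urllib.request.build_opener(urllib.request.ProxyHandler({}))
)

server = HTTPServer(("127.0.0.1", 0), LinesHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

try:
    with urllib.request.urlopen(url) as response:
        type_name = type(response).__name__    # the "one type"
        status = response.getcode()            # same value as response.status
        final_url = response.geturl()          # same value as response.url
        headers = dict(response.getheaders())  # list of tuples -> dict
        first_five = response.read(5)          # first 5 bytes of the body
        rest_of_line = response.readline()     # continues after those 5 bytes
        remaining = response.readlines()       # every line still unread
        leftover = response.read()             # nothing is left by now
finally:
    server.shutdown()
    server.server_close()

print(type_name, status, final_url)
print(first_five, rest_of_line, remaining, leftover)
```

To apply several of these methods to the full body, either call urlopen() again or read the body once into a variable and work on that.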

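II. Downloading from a URL

1. Download a web page (xx.html)

# 1. Download a web page
url_page = "http://www.baidu.com"
# url: the address to download from; filename: the name of the file to save to
urllib.request.urlretrieve(url_page, "baidu.html")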

2. Download an image (xx.jpg)

# 2. Download an image
url_img = "https://img2.baidu.com/it/u=872152568,3550679156&fm=253&fmt=auto&app=138&f=JPEG?w=500&h=667"
urllib.request.urlretrieve(url=url_img, filename="meiduan.jpg")

3. Download a video (xx.mp4)

# 3. Download a video
# The Expires and ssig parameters suggest a signed, time-limited link, so this
# URL has probably expired; substitute a current one before running.
url_video = "https://f.video.weibocdn.com/o0/vsYwMHCVlx08lDCdVqAE01041200QVNm0E010.mp4?label=mp4_720p&template=1278x720.25.0&media_id=5129605300813840&tp=8x8A3El:YTkl0eM8&us=0&ori=1&bf=4&ot=h&ps=3lckmu&uid=3ZoTIp&ab=,15568-g4,8012-g2,8013-g0,3601-g36,3601-g36,3601-g37,3601-g37&Expires=1739332655&ssig=pMM%2F7nCPyN&KID=unistore,video"
urllib.request.urlretrieve(url=url_video, filename="jingtian.mp4")
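
For large files such as videos it helps to see how far a download has progressed. urlretrieve() accepts an optional `reporthook` callback for this. Below is a minimal sketch, assuming a local file addressed through a file:// URL stands in for the remote resource so it runs offline; `source.bin`, `copy.bin`, and `report` are made-up names.

```python
import tempfile
import urllib.request
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
source = workdir / "source.bin"   # hypothetical stand-in for a remote file
source.write_bytes(b"x" * 20_000)
target = workdir / "copy.bin"     # where the "download" is saved

progress = []  # bytes reported so far, one entry per hook call


def report(block_num, block_size, total_size):
    """Called by urlretrieve once at the start and once after each block."""
    done = block_num * block_size
    if total_size > 0:
        # The last block is usually partial, so cap at the real size.
        done = min(done, total_size)
    progress.append(done)


saved_path, headers = urllib.request.urlretrieve(
    source.as_uri(), filename=str(target), reporthook=report
)

print(saved_path)
print(progress[-1], "of", headers["Content-Length"], "bytes")
```

urlretrieve() returns the saved file name together with the response headers. The Python documentation lists it as a legacy interface that may be deprecated in the future, so for new code a plain urlopen() plus a manual write to a file is a reasonable alternative.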

III. Customizing the request object

1. User-Agent (getting past User-Agent anti-scraping checks)

# _*_ coding : utf-8 _*_
# @Time : 2025/2/12 11:11
# @Author : 20250206-里奥
# @File : demo17_qingqiuduixaingdedingzhi
# @Project : PythonProject10-14
import urllib.request

# Request headers, stored in a dictionary
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:135.0) Gecko/20100101 Firefox/135.0"
}
url = "https://www.baidu.com"

# urlopen() has no headers parameter, so the dictionary cannot be passed to it
# directly. Build a customized Request object instead.
# In Request's parameter list, data sits between url and headers, so url and
# headers cannot simply be passed positionally; use keyword arguments.
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode("utf-8")
print(content)

# A URL has roughly six parts: 1. protocol; 2. host; 3. port; 4. path; 5. parameters; 6. anchor...
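
The six URL parts can be pulled apart with the standard library's urllib.parse.urlsplit(). Below is a minimal sketch; the URL is made up for illustration and was chosen only because it contains every part.

```python
from urllib.parse import urlsplit

# Made-up URL that contains all six parts.
example = "https://www.example.com:8443/s/index.html?wd=python&page=2#results"
parts = urlsplit(example)

print(parts.scheme)    # 1. protocol:   https
print(parts.hostname)  # 2. host:       www.example.com
print(parts.port)      # 3. port:       8443
print(parts.path)      # 4. path:       /s/index.html
print(parts.query)     # 5. parameters: wd=python&page=2
print(parts.fragment)  # 6. anchor:     results

# When the URL omits the port, urlsplit reports None; the client then uses
# the protocol's default (80 for http, 443 for https).
print(urlsplit("http://www.baidu.com").port)  # None
```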

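
The keyword-argument point can be checked without sending any request, because a Request object can be inspected after it is built. Below is a minimal sketch reusing the same url and headers; the variable name `wrong` is made up for illustration. It shows what happens when the dictionary is passed positionally: it lands in the data slot, which turns the request into a POST that carries no custom headers at all.

```python
import urllib.request

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:135.0) Gecko/20100101 Firefox/135.0"
}
url = "https://www.baidu.com"

# Keyword arguments: the dictionary ends up in the headers slot.
request = urllib.request.Request(url=url, headers=headers)
print(request.get_method())              # GET
print(request.data)                      # None
# Request stores header names capitalized as "User-agent", so look it up that way.
print(request.get_header("User-agent"))

# Positional arguments: the second slot is data, not headers.
wrong = urllib.request.Request(url, headers)
print(wrong.get_method())                # POST, because data is no longer None
print(wrong.header_items())              # [] : no User-Agent was set
```

Nothing is sent over the network here; the sketch only builds and inspects the two objects.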