python_Simple Image Scraping (how to scrape images with Python)

Reader submission · 291 · 2022-08-30



Contents

overview · version1 · version2

overview

Tested and runnable on Python 3.9+. The regex matching rules can be adjusted to fit the source of the specific site being scraped. Version 2 uses BeautifulSoup instead of regular expressions; note that you may need to adjust the file save paths.
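Since the actual pattern used in version 1 was stripped out of this post, here is a minimal sketch of the kind of pattern it expects. The pattern and the sample HTML below are illustrative assumptions, not the originals; the real pattern depends entirely on the markup of the site you target:

import re

# hypothetical pattern: capture .jpg sources from <img> tags; adjust it to the
# real markup of the site you are scraping
img_pattern_str = r'<img[^>]+src="([^"]+\.jpg)"'

sample_html = '<img src="//cdn.example.com/pics/1.jpg" alt="demo">'
print(re.findall(img_pattern_str, sample_html))  # ['//cdn.example.com/pics/1.jpg']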

version1:

# -*- coding: utf-8 -*-
import os
import re
import urllib.error    # for error handling
import urllib.request  # mainly for opening and reading URLs

prefix_path = r"image1"

def picCraw(url, topic, img_pattern_str):
    count = 1
    file_name = os.path.join(prefix_path, topic + ".html")
    print("saving: " + file_name)
    read_result_bytes = urllib.request.urlopen(url).read()
    # set save path
    to_save_path = os.path.join(prefix_path, topic)
    if not os.path.isdir(to_save_path):
        os.makedirs(to_save_path)  # create a directory to save the images into
    # decode the bytes into text (note that calling decode() without arguments
    # can trigger ascii-codec errors on non-ASCII pages)
    page_data_str = read_result_bytes.decode("utf8", "ignore")
    MatchedImage_link_list = re.findall(img_pattern_str, page_data_str)  # find all matches
    print("MatchedImages:", MatchedImage_link_list)
    for image in MatchedImage_link_list:  # check every matched link with a regex
        pattern = re.compile(r'//.*\.jpg$')  # match files in .jpg format
        if pattern.search(image):  # if it matches, fetch the image; otherwise move on to the next one
            try:
                # this condition was garbled in the original post; most likely it
                # prefixed protocol-relative links ("//...") with a scheme:
                if "http" not in image:
                    image = "https:" + image
                image_data_bytes = urllib.request.urlopen(image).read()  # fetch the image data
                image_path = os.path.join(prefix_path, topic, str(count) + ".jpg")  # name the image
                count += 1
                with open(image_path, "wb") as image_file:
                    image_file.write(image_data_bytes)  # write the image into a .jpg file
            except urllib.error.URLError as e:
                print("Download failed")
    # with open(file_name, "wb") as file:  # write the page itself to a file
    #     file.write(read_result_bytes)

if __name__ == "__main__":
    # url = ''  # (the URL literal was stripped out of the original post)
    # the pattern that matches the images; work it out by inspecting the page source
    # findall() returns the collection of substrings matched by the capture group
    img_pattern_str = r''  # (the pattern literal was stripped out of the original post)
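Because both literals above were lost in extraction, here is a hedged sketch of what the truncated __main__ block plausibly looked like. The URL, topic name, and pattern are placeholder assumptions, not the originals:

if __name__ == "__main__":
    # placeholder values, not the ones from the original post
    url = "https://example.com/wallpapers"
    img_pattern_str = r'<img[^>]+src="([^"]+\.jpg)"'  # capture .jpg sources from <img> tags
    picCraw(url, "wallpapers", img_pattern_str)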

version2:

# -*- coding: utf-8 -*-
"""
Created on Fri May 21 22:17:08 2021
@author: zero
"""
# bs4.pics.py

'''1. fetch the source of the main page'''
import requests
from bs4 import BeautifulSoup

# url = ""  # (the URL literals were stripped out of the original post)
url = ""  # the target page; fill in the address of the site you want to scrape
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
}
response = requests.get(url=url, headers=headers)
response.encoding = "utf-8"
"""
Property text of requests.models.Response

@property
def text(self) -> str
    Content of the response, in unicode.
    If Response.encoding is None, encoding will be guessed using chardet.
    The encoding of the response content is determined based solely on HTTP
    headers, following RFC 2616 to the letter. If you can take advantage of
    non-HTTP knowledge to make a better guess at the encoding, you should set
    r.encoding appropriately before accessing this property.
"""
resp_text = response.text

'''2. extract the links inside the <img> tags from the page source'''
soup = BeautifulSoup(resp_text, 'html.parser')
# print(soup)
img_tags_RS = soup.find_all("img")
# print(img_tags_RS)

'''3. download the image behind each link and write it to a file'''
for img_tag in img_tags_RS:
    href = img_tag.get('src')  # hyperlink target: href
    # inspect the value of the src attribute just obtained
    print(href)
    # disconnect your proxy first (if any), otherwise the request may fail
    # from here on it is essentially the same requests workflow:
    #   get the Response instance
    #   get the bytes from the Response instance
    #   save the bytes as a file (with open() as ...)
    img_response = requests.get(href)
    img_content_bytes = img_response.content
    img_name = href.split('/')[-1]
    with open("image2/" + img_name, mode='wb') as fos:
        fos.write(img_content_bytes)
response.close()
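One caveat worth noting: version 2 passes img_tag.get('src') straight to requests.get, which fails on relative or protocol-relative paths, and it assumes the image2 directory already exists. A minimal hardened sketch of the download loop, reusing the url, headers, and img_tags_RS names from above:

import os
from urllib.parse import urljoin

os.makedirs("image2", exist_ok=True)  # create the output directory if it is missing

for img_tag in img_tags_RS:
    src = img_tag.get('src')
    if not src:
        continue  # skip <img> tags that carry no src attribute
    absolute_src = urljoin(url, src)  # resolve relative and protocol-relative links
    img_response = requests.get(absolute_src, headers=headers)
    img_name = absolute_src.split('/')[-1]
    with open(os.path.join("image2", img_name), mode='wb') as fos:
        fos.write(img_response.content)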

