# Scrapy Image and File Downloads

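Bulk-downloading images or files is a common requirement in crawler development. Scrapy ships with built-in media pipelines for this task, and the most commonly used one is ImagesPipeline. It downloads images asynchronously, skips images that were downloaded recently, and lets you configure storage paths and file names.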

# 1. How ImagesPipeline Works

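Understanding the ImagesPipeline workflow makes it easier to use and customize:

1. The Spider yields an Item: in your Spider, you extract image URLs from the page, put them into a specific Item field (image_urls by default), and yield the Item.
2. The Pipeline intercepts the Item: when ImagesPipeline receives the Item, it checks whether the image_urls field is present.
3. Download requests are scheduled: ImagesPipeline iterates over every URL in the image_urls list, creates a Request object for each one, and sends these requests to the scheduler for download.
4. Download and processing: once the downloader has fetched an image, ImagesPipeline saves it to the directory you specified in settings.py (IMAGES_STORE).
5. Results are returned: finally, ImagesPipeline stores the download results (local path, original URL, checksum, and so on) in another Item field (images by default) and passes the updated Item on to the next pipeline, if there is one.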
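As a rough illustration of step 5, the sketch below shows what the images field might look like once the pipeline has run, and how a later pipeline could turn those entries into local paths. The item is a plain dict, the URL, file name, and checksum values are made-up placeholders, and `local_image_paths` is a hypothetical helper, not part of Scrapy.

```python
import os

# Hypothetical item after ImagesPipeline has processed it.
# All values below are illustrative placeholders, not real download results.
item = {
    "image_urls": ["https://example.com/a.jpg"],
    "images": [
        {
            "url": "https://example.com/a.jpg",  # original URL
            "path": "full/0123abcd.jpg",         # path relative to IMAGES_STORE
            "checksum": "d41d8cd9",              # hash of the image content (placeholder)
        }
    ],
}


def local_image_paths(item, images_store):
    """Return the on-disk path of every image listed in the results field."""
    return [
        os.path.join(images_store, result["path"])
        for result in item.get("images", [])
    ]


paths = local_image_paths(item, "downloaded_images")
print(paths)
```

An item whose images list is empty or missing simply yields an empty list, which a downstream pipeline could use to decide whether to keep the item.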

# 2. Steps for Using ImagesPipeline

# Step 1: Install the Dependency

ImagesPipeline relies on the Pillow library to process images, so make sure it is installed first.

pip install Pillow

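# Step 2: Define the Item

Define an Item in items.py. By ImagesPipeline convention, it needs two fields:

- image_urls: holds the list of image URLs scraped from the page. It must be a list.
- images: holds the result information that ImagesPipeline produces after downloading the images. You do not need to assign it yourself.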

# myproject/items.py
import scrapy

class ImageItem(scrapy.Item):
    # field for scraped image urls
    image_urls = scrapy.Field()
    # field for download results
    images = scrapy.Field()

# Step 3: Configure settings.py

You need to do two things: enable ImagesPipeline and specify the directory where images are stored.

# settings.py

# 1. Enable ImagesPipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

# 2. Set the image storage directory
# This is a local folder path where downloaded images are saved
IMAGES_STORE = 'downloaded_images'

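ImagesPipeline is given a priority of 1 here. Pipelines run in ascending order of this number, so a small value makes ImagesPipeline run before the other pipelines, which then receive the Item only after its images have been handled.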
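The field names image_urls and images are only defaults. A minimal sketch of overriding them in settings.py, together with the expiration window that decides how long a downloaded image counts as recent, could look like this. The names `cover_urls` and `covers` are hypothetical and chosen only for illustration.

```python
# settings.py (optional overrides)

# Read URLs from item['cover_urls'] instead of item['image_urls'] (hypothetical name)
IMAGES_URLS_FIELD = "cover_urls"
# Write download results to item['covers'] instead of item['images'] (hypothetical name)
IMAGES_RESULT_FIELD = "covers"

# Skip re-downloading images fetched within the last 30 days (Scrapy's default is 90)
IMAGES_EXPIRES = 30
```

If you change the field names, the Item class has to declare fields with the same names.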

# Step 4: Write the Spider

The Spider's job is simple: find the image URLs and put them into the image_urls field of an ImageItem.

# myproject/spiders/image_spider.py
import scrapy
from myproject.items import ImageItem

class ImageSpider(scrapy.Spider):
    name = 'image_spider'
    start_urls = ['https://desk.zol.com.cn/dongman/']  # example site

    def parse(self, response, **kwargs):
        # Assume the image links sit inside a specific li > a > img structure
        img_urls = response.css('li.photo-list-padding a.pic img::attr(src)').getall()
        
        # To get the higher-resolution image, the ZOL URL needs some processing
        # Original src: //desk-fd.zol-img.com.cn/t_s208x130c5/g5/M00/02/05/ChMkJ1bKyZmIWCwXAAKw-3G_pXwAALG1gAAAAAACrET351.jpg
        # After replacement: //desk-fd.zol-img.com.cn/t_s960x600c5/g5/M00/02/05/ChMkJ1bKyZmIWCwXAAKw-3G_pXwAALG1gAAAAAACrET351.jpg
        processed_urls = [url.replace('t_s208x130c5', 't_s960x600c5') for url in img_urls]
        
        # Instantiate the Item and fill in the image_urls field
        item = ImageItem()
        # Make sure each URL is absolute by using response.urljoin()
        item['image_urls'] = [response.urljoin(url) for url in processed_urls]
        
        yield item
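The URL handling in the spider above can be tried outside Scrapy. The sketch below uses `urllib.parse.urljoin` as a stand-in for `response.urljoin()`, on the assumption that the response URL is the base; `to_full_size_url` is a hypothetical helper name.

```python
from urllib.parse import urljoin


def to_full_size_url(page_url, src):
    """Swap the thumbnail size token for the larger one and make the URL absolute."""
    larger = src.replace("t_s208x130c5", "t_s960x600c5")
    # urljoin fills in the scheme for protocol-relative URLs such as //host/path
    return urljoin(page_url, larger)


src = "//desk-fd.zol-img.com.cn/t_s208x130c5/g5/M00/02/05/ChMkJ1bKyZmIWCwXAAKw-3G_pXwAALG1gAAAAAACrET351.jpg"
full_url = to_full_size_url("https://desk.zol.com.cn/dongman/", src)
print(full_url)
```

The protocol-relative src picks up the https scheme from the page URL, which is why the spider joins each URL before handing it to the pipeline.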

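With these four steps done, run the spider with scrapy crawl image_spider. Scrapy downloads the images into the downloaded_images folder. Because IMAGES_STORE is a relative path here, the folder is created relative to the directory the command is run from (normally the project root), and by default the files are placed in a full/ subdirectory inside it.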

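# 3. Customizing ImagesPipeline

By default, ImagesPipeline names each file after the SHA-1 hash of its URL, which is often not what we want. We can subclass ImagesPipeline and override its methods to customize this behavior, for example to name files based on the image title.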
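To make the default naming concrete, the sketch below approximates it with the standard library: hash the URL with SHA-1 and store the result under full/ with a .jpg extension. `default_image_path` is a hypothetical helper, and the exact implementation inside Scrapy may differ between versions.

```python
import hashlib


def default_image_path(url):
    """Approximate the default ImagesPipeline file name: SHA-1 of the URL under full/."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"full/{digest}.jpg"


path = default_image_path("https://example.com/cat.png")
print(path)
```

The same URL always maps to the same file name, but the name carries no hint of the image title or its original file name.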

# 3.1 Custom Pipeline Class

Create a new Pipeline class in pipelines.py.

# myproject/pipelines.py
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from urllib.parse import urlparse

class CustomImagesPipeline(ImagesPipeline):
    
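    # Override get_media_requests to read data from the item, such as the image title
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            # Attach the item to each request via meta so file_path can use it later
            yield scrapy.Request(image_url, meta={'image_item': item})

    # Override file_path to control the storage path and file name of each image
    def file_path(self, request, response=None, info=None, *, item=None):
        # Retrieve the item that was passed along in the request's meta
        image_item = request.meta['image_item']
        
        # Use the image title as the folder name (assumes the item has a title field);
        # fall back to a default when the title is missing or empty
        folder_name = image_item.get('title') or 'default_folder'
        
        # Extract the original file name from the URL
        original_filename = urlparse(request.url).path.split('/')[-1]
        
        # Final file path, relative to IMAGES_STORE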
        return f'{folder_name}/{original_filename}'
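Scraped titles can contain slashes, colons, or other characters that are awkward in folder names, so it may be worth cleaning the title before using it in file_path. The helper below is a minimal sketch; `safe_folder_name` and its replacement rules are assumptions for illustration, not part of Scrapy.

```python
import re


def safe_folder_name(title, default="default_folder"):
    """Turn an arbitrary title into a name that is safe to use as a single folder."""
    if not title:
        return default
    # Replace path separators, whitespace, and characters invalid on common file systems
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("._")
    return cleaned or default


name = safe_folder_name("Anime / Wallpapers: 2024")
print(name)
```

Inside file_path, the folder name could then be computed as safe_folder_name(image_item.get('title')), which also covers a missing title.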

# 3.2 Update the Spider and Item

To work with the custom Pipeline above, add a title field to the Item and populate it in the Spider.

# items.py
class ImageItem(scrapy.Item):
    title = scrapy.Field()  # new title field
    image_urls = scrapy.Field()
    images = scrapy.Field()

# image_spider.py
# ... inside the parse method ...
    # Assume every image has a title
    for li in response.css('li.photo-list-padding'):
        title = li.css('a.pic::attr(title)').get()
        img_url = li.css('a.pic img::attr(src)').get()
        
        # ... URL processing logic (as in step 4) that produces processed_url ...
        
        item = ImageItem()
        item['title'] = title
        item['image_urls'] = [response.urljoin(processed_url)]
        yield item

# 3.3 Enable the Custom Pipeline

Finally, in settings.py, replace the default ImagesPipeline with your custom Pipeline.

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.CustomImagesPipeline': 1,
}
IMAGES_STORE = 'downloaded_images'

Run the spider again: each image is now saved in a subfolder named after its title and keeps its original file name.

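# 4. FilesPipeline

Scrapy also provides a general-purpose FilesPipeline, which works almost exactly like ImagesPipeline. The main differences are the default Item field names and the storage setting (FILES_STORE instead of IMAGES_STORE):

- URL field: file_urls
- Result field: files

You can use it to download any type of file, such as PDF or ZIP. It is customized in the same way as ImagesPipeline.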
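A minimal settings.py sketch for enabling FilesPipeline, mirroring the ImagesPipeline configuration from step 3. The folder name `downloaded_files` is a placeholder chosen for illustration.

```python
# settings.py
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}

# Directory for downloaded files (the folder name is a placeholder)
FILES_STORE = "downloaded_files"
```

The Item then needs file_urls and files fields in place of image_urls and images.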

Last updated: 2025/07/27, 04:30:11
