splash

Splash is a JavaScript rendering service with an HTTP API. It's a lightweight browser, implemented in Python 3 using Twisted and Qt 5.

It's fast, lightweight, and stateless, which makes it easy to distribute.

# usage

Using Splash involves the following steps:

Install Splash:

First, install the Splash service. Docker is the quickest way to deploy it, though other installation methods are available. An example using Docker:

docker pull scrapinghub/splash
docker run -p 8050:8050 scrapinghub/splash

This starts the Splash service on local port 8050.
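Before moving on, it can be useful to confirm that something is actually listening on that port. The following is a minimal sketch using only the Python standard library; the helper name `is_port_open` is hypothetical, and a successful TCP connection only shows that the port is open, not that the service behind it is Splash.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # 8050 is the port published by the docker run command above.
    print("Splash port reachable:", is_port_open("localhost", 8050))
```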

Install the Splash SDK:

For Python, the scrapy-splash package integrates Splash with Scrapy. Install it with pip:

scrapy-splash: Scrapy+Splash for JavaScript integration

pip install scrapy-splash

Configure Splash in the Scrapy project:

In the project's settings.py file, configure Splash and specify the Splash server address:

SPLASH_URL = 'http://localhost:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
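The three settings above cover the server address, the duplicate filter, and the cache storage. The scrapy-splash README also lists downloader and spider middlewares to enable. A fuller settings.py sketch might look like the following; the middleware names and priority numbers are recalled from that README and should be checked against the installed version.

```python
# settings.py (sketch): Splash-related settings for a Scrapy project.

# Address of the Splash instance started with Docker above.
SPLASH_URL = 'http://localhost:8050'

# Downloader middlewares that route SplashRequest objects through Splash.
# Priorities follow the scrapy-splash README (assumption: unchanged in your version).
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Spider middleware that handles repeated (cached) Splash arguments.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Splash-aware duplicate filter and HTTP cache storage, as in the settings above.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```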

Use Splash in a Scrapy spider:

In a Scrapy spider, use SplashRequest in place of the regular Request so that the page is rendered by Splash. Example:

import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = 'example.com'

    def start_requests(self):
        url = 'http://example.com'
        yield SplashRequest(url, self.parse,
            args={'wait': 2},  # wait 2 seconds so the page can finish loading
        )

    def parse(self, response):
        # Process the rendered page here.
        # Selectors such as response.xpath can be used to extract data.
        pass

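Configure Splash parameters:

The args parameter configures Splash's behavior, for example the wait time or a JavaScript script to run. The example above uses args={'wait': 2} to wait 2 seconds, giving the page time to finish loading.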
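To make it clearer what args amounts to: at the HTTP level these are ordinary endpoint parameters, which Splash accepts either as GET arguments or as a JSON body sent with a Content-Type: application/json header. The sketch below builds such a JSON request for the render.html endpoint with the standard library only. The helper name `build_render_request` is hypothetical, and the request is built but not sent, since sending needs a running Splash instance.

```python
import json
import urllib.request


def build_render_request(splash_url: str, page_url: str, args: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request to Splash's render.html endpoint."""
    # The target page goes in "url"; everything in args becomes a sibling parameter.
    payload = {"url": page_url, **args}
    return urllib.request.Request(
        splash_url.rstrip("/") + "/render.html",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_render_request(
    "http://localhost:8050",
    "http://example.com",
    {"wait": 2, "timeout": 10},
)
# With Splash running, the rendered HTML could be fetched like this:
# html = urllib.request.urlopen(req).read().decode("utf-8")
```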

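Start Scrapy:

Start the crawl with the scrapy crawl spider_name command. Scrapy sends the requests to Splash and receives the page content once it has been rendered.

Splash has more features and configuration options for more complex cases, such as executing JavaScript or clicking page elements. See the official Splash documentation for details: https://splash.readthedocs.io/en/stable/index.html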

# script

Splash UI provides an easy way to try scripts: there is a code editor for Lua and a button to submit a script to execute. Visit http://127.0.0.1:8050/ (or whatever host/port Splash is listening on). See the Splash scripting tutorial for reference.

# example

function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end

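This code is a Lua script that renders a page in Splash and extracts data from it. Splash is a lightweight WebKit-based browser that renders web pages and exposes their content for extraction; it is typically used for screenshots, page rendering, and scraping.

The script defines a function named main that takes two parameters, splash and args. splash is the Splash object used to perform operations, and args is a table holding the arguments passed to the script.

The function first opens the given URL with splash:go(args.url), then waits 0.5 seconds with splash:wait(0.5). It then returns a table with three key-value pairs:

html: the HTML content of the current page
png: a screenshot of the current page (PNG format)
har: HAR (HTTP Archive) data for the current page, recording the network requests and responses made while the page loaded

This script suits rendering a page in Splash and extracting the related data. Through the Splash API it can be integrated with other systems, such as the Scrapy framework, to automate screenshots, scraping, and similar tasks.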
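As a rough illustration of that integration, the sketch below submits the script to Splash's execute endpoint through the lua_source parameter and decodes the returned table. It assumes, as I understand the Splash documentation, that a table result is returned as JSON with the PNG data base64-encoded; the helper names `build_execute_request` and `decode_result` are hypothetical.

```python
import base64
import json
import urllib.request

# The Lua script from the example above.
LUA_SCRIPT = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end
"""


def build_execute_request(splash_url: str, page_url: str, lua_source: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request to Splash's execute endpoint."""
    # "url" becomes args.url inside the Lua main function.
    payload = {"url": page_url, "lua_source": lua_source}
    return urllib.request.Request(
        splash_url.rstrip("/") + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def decode_result(body: bytes) -> dict:
    """Parse the JSON table returned by main() and decode the PNG to raw bytes."""
    result = json.loads(body)
    # Assumption: binary PNG data arrives base64-encoded inside the JSON table.
    result["png"] = base64.b64decode(result["png"])
    return result


req = build_execute_request("http://localhost:8050", "http://example.com", LUA_SCRIPT)
# With Splash running:
# result = decode_result(urllib.request.urlopen(req).read())
# open("page.png", "wb").write(result["png"])
```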

# http api

scrapy shell 'http://localhost:8050/render.html?url=http://example.com/page-with-javascript.html&timeout=10&wait=0.5': play with splash
curl 'http://localhost:8050/render.html?url=http://domain.com/page-with-javascript.html&timeout=10&wait=0.5': curl example
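The commands above pass the target URL unencoded, which works for simple addresses but can break when the target has its own query string containing `&`. A small Python sketch that builds the same kind of render.html URL with the target percent-encoded might look like this; the helper name `render_html_url` is hypothetical, and only the url, timeout, and wait parameters shown above are used.

```python
from urllib.parse import urlencode


def render_html_url(splash_url: str, page_url: str, **params) -> str:
    """Return a render.html URL with page_url and extra parameters percent-encoded."""
    query = urlencode({"url": page_url, **params})
    return f"{splash_url.rstrip('/')}/render.html?{query}"


url = render_html_url(
    "http://localhost:8050",
    "http://example.com/page-with-javascript.html",
    timeout=10,
    wait=0.5,
)
print(url)
# http://localhost:8050/render.html?url=http%3A%2F%2Fexample.com%2Fpage-with-javascript.html&timeout=10&wait=0.5
```

The resulting string can be passed to curl or scrapy shell in place of the hand-written URL.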

# link

Last updated: 2025/07/21, 21:50:24
