Official site
https://scrapy.org
Reference docs
https://docs.scrapy.org
Installation
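Scrapy installs straight from PyPI:

```bash
pip install scrapy
```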
DEMO
```python
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://www.zyte.com/blog/']

    def parse(self, response):
        for title in response.css('.oxy-post-title'):
            yield {'title': title.css('::text').get()}

        for next_page in response.css('a.next'):
            yield response.follow(next_page, self.parse)
```
```bash
scrapy runspider myspider.py
```
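To persist the scraped items instead of only seeing them in the log, runspider can also write a feed file (`-o` appends; Scrapy 2.1+ adds `-O` to overwrite):

```bash
scrapy runspider myspider.py -o blog_titles.json
```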
CLI
```bash
scrapy startproject hydrabot
scrapy genspider httpbin httpbin.org
```
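genspider drops a skeleton like the one below into hydrabot/spiders/httpbin.py; the exact template varies slightly between Scrapy versions:

```python
import scrapy


class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/']

    def parse(self, response):
        # The generated stub is empty; parsing logic goes here
        pass
```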
Ecosystem
- scrapy-splash
- scrapy-redis
- scrapyd
- scrapyd-client
- python-scrapyd-api
- scrapyrt
- gerapy
scrapyrt
Reference docs
https://github.com/scrapinghub/scrapyrt/pull/120/commits/b0302994e1df2a4784f6342bddebea91ccfd7a72
https://scrapyrt.readthedocs.io/en/stable/api.html#post
Running
Run scrapyrt inside the Scrapy project directory (the one containing scrapy.cfg).
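A typical launch, assuming the defaults are fine (`-p` sets the listening port; 9080 is also scrapyrt's default, matching the curl calls below):

```bash
cd hydrabot  # the directory containing scrapy.cfg
scrapyrt -p 9080
```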
```bash
curl http://localhost:9080/crawl.json -d '{"start_requests": "true", "spider_name": "teacher", "crawl_args": {"task_id": "1", "entry_id": "3070"}}'
curl http://localhost:9080/crawl.json -d '{"start_requests": "true", "spider_name": "policy_site", "crawl_args": {"task_id": "5"}}'
```
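scrapyrt forwards crawl_args to the spider as keyword arguments, the same way `scrapy crawl -a key=value` does. A minimal sketch of how the teacher spider might pick them up; the URL and yielded fields are placeholders, not the project's real spider:

```python
import scrapy


class TeacherSpider(scrapy.Spider):
    name = 'teacher'

    def __init__(self, task_id=None, entry_id=None, *args, **kwargs):
        # crawl_args from the POST body arrive here as strings
        super().__init__(*args, **kwargs)
        self.task_id = task_id    # '1' in the first curl call above
        self.entry_id = entry_id  # '3070' in the first curl call above

    def start_requests(self):
        # Placeholder URL: the real entry pages are not shown in these notes
        yield scrapy.Request(f'https://example.com/entries/{self.entry_id}')

    def parse(self, response):
        yield {'task_id': self.task_id, 'url': response.url}
```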
conda
https://blog.csdn.net/weixin_43840215/article/details/89599559
```bash
wget -c https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
wget -c https://repo.anaconda.com/miniconda/Miniconda3-py39_4.9.2-Linux-x86_64.sh
```
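The downloaded script is then run with bash; `-b` accepts the license non-interactively and `-p` picks the install prefix:

```bash
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
```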
scrapyd
Installation
```
pip install scrapyd

vi /etc/scrapyd/scrapyd.conf
[scrapyd]
eggs_dir = /data/scrapyd/eggs
logs_dir = /data/scrapyd/logs
items_dir = /data/scrapyd/items
jobs_to_keep = 100
dbs_dir = /data/scrapyd/dbs
max_proc = 0
max_proc_per_cpu = 10
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
http_port = 6800
debug = off
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root

[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus

nohup scrapyd > scrapyd.log 2>&1 &
open http://10.211.55.101:6800

# Install the spiders' dependencies on the scrapyd server beforehand
pip install -r requirements.txt
```
API
```bash
curl http://10.211.55.101:6800/daemonstatus.json
curl http://10.211.55.101:6800/addversion.json -F project=hydrabot -F version=1.0.0 -F egg=@hydrabot.egg
curl http://10.211.55.101:6800/schedule.json -d project=hydrabot -d spider=teacher -d task_id=1 -d entry_id=3070
curl http://10.211.55.101:6800/cancel.json -d project=hydrabot -d job=6487ec79947edab326d6db28a2d86S11e8247444

curl http://10.211.55.101:6800/listprojects.json
curl http://10.211.55.101:6800/listversions.json?project=hydrabot
curl http://10.211.55.101:6800/listspiders.json?project=hydrabot
curl http://10.211.55.101:6800/listjobs.json?project=hydrabot

curl http://10.211.55.101:6800/delversion.json -d project=hydrabot -d version=1.0.0
curl http://10.211.55.101:6800/delproject.json -d project=hydrabot
```
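The addversion.json call above expects a packaged egg. One way to build it is scrapyd-deploy from scrapyd-client (listed in the ecosystem above), run from the project root:

```bash
pip install scrapyd-client
scrapyd-deploy --build-egg hydrabot.egg
```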
gerapy
Official site
https://github.com/Gerapy/Gerapy
Installation
```bash
pip3 install gerapy
gerapy init
cd gerapy
gerapy migrate
gerapy createsuperuser  # hydrabot/hydrabot
gerapy runserver
open http://127.0.0.1:8000/
```
CrawlSpider
Reference docs
https://cloud.tencent.com/developer/article/1518542
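For reference, a minimal CrawlSpider wired against the list page used in the shell example below; the link-extractor patterns, spider name, and selectors are assumptions for illustration, not taken from a real spider in this project:

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class WjwSpider(CrawlSpider):
    name = 'wjw'  # hypothetical spider name
    start_urls = ['http://wjw.yangzhou.gov.cn/yzwshjh/csdt/wjw_list.shtml']

    rules = (
        # Keep following pagination links on the list pages
        Rule(LinkExtractor(allow=r'wjw_list'), follow=True),
        # Hand every other .shtml page to parse_item as a detail page
        Rule(LinkExtractor(allow=r'\.shtml$'), callback='parse_item'),
    )

    def parse_item(self, response):
        # Illustrative extraction only; selectors depend on the actual page
        yield {'url': response.url, 'title': response.css('title::text').get()}
```

Note that a CrawlSpider must not override parse(), which the rules machinery uses internally; callbacks get their own names.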
shell
```bash
scrapy shell http://wjw.yangzhou.gov.cn/yzwshjh/csdt/wjw_list.shtml
```
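Inside the shell, response is already fetched, so selectors can be tried interactively before they go into a spider (the selectors here are illustrative, not verified against that page):

```python
>>> response.css('a::attr(href)').getall()   # all link targets on the page
>>> response.xpath('//title/text()').get()   # page title
>>> view(response)                           # open the downloaded page in a browser
```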