Official site
https://scrapy.org
Reference docs
https://docs.scrapy.org
Installation
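Scrapy installs straight from PyPI:

```bash
pip install scrapy
```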
DEMO
```python
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://www.zyte.com/blog/']

    def parse(self, response):
        for title in response.css('.oxy-post-title'):
            yield {'title': title.css('::text').get()}

        for next_page in response.css('a.next'):
            yield response.follow(next_page, self.parse)
```
```bash
scrapy runspider myspider.py
```
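To persist the scraped items instead of only seeing them in the log, runspider can also write a feed file (`-o` appends; Scrapy 2.1+ adds `-O` to overwrite):

```bash
scrapy runspider myspider.py -o blog_titles.json
```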
CLI
```bash
scrapy startproject hydrabot
scrapy genspider httpbin httpbin.org
```
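genspider drops a skeleton like the one below into hydrabot/spiders/httpbin.py; the exact template varies slightly between Scrapy versions:

```python
import scrapy


class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/']

    def parse(self, response):
        # The generated stub is empty; parsing logic goes here
        pass
```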
Ecosystem
- scrapy-splash
- scrapy-redis
- scrapyd
- scrapyd-client
- python-scrapyd-api
- scrapyrt
- gerapy
scrapyrt
Reference docs
https://github.com/scrapinghub/scrapyrt/pull/120/commits/b0302994e1df2a4784f6342bddebea91ccfd7a72
https://scrapyrt.readthedocs.io/en/stable/api.html#post
Running
Run scrapyrt inside the Scrapy project directory (the one containing scrapy.cfg).
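A typical launch, assuming the defaults are fine (`-p` sets the listening port; 9080 is also scrapyrt's default, matching the curl calls below):

```bash
cd hydrabot  # the directory containing scrapy.cfg
scrapyrt -p 9080
```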
```bash
curl http://localhost:9080/crawl.json -d '{"start_requests": "true", "spider_name": "teacher", "crawl_args": {"task_id": "1", "entry_id": "3070"}}'
curl http://localhost:9080/crawl.json -d '{"start_requests": "true", "spider_name": "policy_site", "crawl_args": {"task_id": "5"}}'
```
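scrapyrt forwards crawl_args to the spider as keyword arguments, the same way `scrapy crawl -a key=value` does. A minimal sketch of how the teacher spider might pick them up; the URL and yielded fields are placeholders, not the project's real spider:

```python
import scrapy


class TeacherSpider(scrapy.Spider):
    name = 'teacher'

    def __init__(self, task_id=None, entry_id=None, *args, **kwargs):
        # crawl_args from the POST body arrive here as strings
        super().__init__(*args, **kwargs)
        self.task_id = task_id    # '1' in the first curl call above
        self.entry_id = entry_id  # '3070' in the first curl call above

    def start_requests(self):
        # Placeholder URL: the real entry pages are not shown in these notes
        yield scrapy.Request(f'https://example.com/entries/{self.entry_id}')

    def parse(self, response):
        yield {'task_id': self.task_id, 'url': response.url}
```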
conda
https://blog.csdn.net/weixin_43840215/article/details/89599559
```bash
wget -c https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
wget -c https://repo.anaconda.com/miniconda/Miniconda3-py39_4.9.2-Linux-x86_64.sh
```
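The downloaded script is then run with bash; `-b` accepts the license non-interactively and `-p` picks the install prefix:

```bash
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
```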
scrapyd
Installation
```
pip install scrapyd

vi /etc/scrapyd/scrapyd.conf
[scrapyd]
eggs_dir = /data/scrapyd/eggs
logs_dir = /data/scrapyd/logs
items_dir = /data/scrapyd/items
jobs_to_keep = 100
dbs_dir = /data/scrapyd/dbs
max_proc = 0
max_proc_per_cpu = 10
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
http_port = 6800
debug = off
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root

[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus

nohup scrapyd > scrapyd.log 2>&1 &
open http://10.211.55.101:6800

# Install the spiders' dependencies on the scrapyd server beforehand
pip install -r requirements.txt
```
API
```bash
curl http://10.211.55.101:6800/daemonstatus.json
curl http://10.211.55.101:6800/addversion.json -F project=hydrabot -F version=1.0.0 -F egg=@hydrabot.egg
curl http://10.211.55.101:6800/schedule.json -d project=hydrabot -d spider=teacher -d task_id=1 -d entry_id=3070
curl http://10.211.55.101:6800/cancel.json -d project=hydrabot -d job=6487ec79947edab326d6db28a2d86S11e8247444

curl http://10.211.55.101:6800/listprojects.json
curl http://10.211.55.101:6800/listversions.json?project=hydrabot
curl http://10.211.55.101:6800/listspiders.json?project=hydrabot
curl http://10.211.55.101:6800/listjobs.json?project=hydrabot

curl http://10.211.55.101:6800/delversion.json -d project=hydrabot -d version=1.0.0
curl http://10.211.55.101:6800/delproject.json -d project=hydrabot
```
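The addversion.json call above expects a packaged egg. One way to build it is scrapyd-deploy from scrapyd-client (listed in the ecosystem above), run from the project root:

```bash
pip install scrapyd-client
scrapyd-deploy --build-egg hydrabot.egg
```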
gerapy
Official site
https://github.com/Gerapy/Gerapy
Installation
```bash
pip3 install gerapy
gerapy init
cd gerapy
gerapy migrate
gerapy createsuperuser  # hydrabot/hydrabot
gerapy runserver
open http://127.0.0.1:8000/
```
CrawlSpider
Reference docs
https://cloud.tencent.com/developer/article/1518542
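For reference, a minimal CrawlSpider wired against the list page used in the shell example below; the link-extractor patterns, spider name, and selectors are assumptions for illustration, not taken from a real spider in this project:

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class WjwSpider(CrawlSpider):
    name = 'wjw'  # hypothetical spider name
    start_urls = ['http://wjw.yangzhou.gov.cn/yzwshjh/csdt/wjw_list.shtml']

    rules = (
        # Keep following pagination links on the list pages
        Rule(LinkExtractor(allow=r'wjw_list'), follow=True),
        # Hand every other .shtml page to parse_item as a detail page
        Rule(LinkExtractor(allow=r'\.shtml$'), callback='parse_item'),
    )

    def parse_item(self, response):
        # Illustrative extraction only; selectors depend on the actual page
        yield {'url': response.url, 'title': response.css('title::text').get()}
```

Note that a CrawlSpider must not override parse(), which the rules machinery uses internally; callbacks get their own names.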
shell
```bash
scrapy shell http://wjw.yangzhou.gov.cn/yzwshjh/csdt/wjw_list.shtml
```
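Inside the shell, response is already fetched, so selectors can be tried interactively before they go into a spider (the selectors here are illustrative, not verified against that page):

```python
>>> response.css('a::attr(href)').getall()   # all link targets on the page
>>> response.xpath('//title/text()').get()   # page title
>>> view(response)                           # open the downloaded page in a browser
```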