Scrapy Startproject Genspider Templates

"scrapy startproject genspider templates"

Posted by Stephen on June 14, 2020

Preface

After writing spider after spider with Scrapy's genspider templates, the boilerplate starts to feel repetitive.

This post explains how Scrapy creates projects and spiders from templates, and finishes with a hands-on example of building a template of your own.

It assumes some familiarity with the Scrapy framework.

Environment

System

Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.4 LTS
Release:	18.04
Codename:	bionic
Linux version:	5.3.0-46-generic (buildd@lcy01-amd64-013)
Gcc version:	7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)

Software

Version:
     Scrapy 1.7.3
     Python runs in a virtual environment at
     ~/software/anaconda3/envs/YS/lib/python3.7

Walkthrough

A quick review first.

Scrapy's "Hello World":

$ scrapy startproject helloScrapy
New Scrapy project 'helloScrapy', using template directory '~/software/anaconda3/envs/YS/lib/python3.7/site-packages/scrapy/templates/project', created in:
    ~/helloScrapy
You can start your first spider with:
    cd helloScrapy
    scrapy genspider example example.com
$ cd helloScrapy/
$ scrapy genspider helloscrapy baidu.com
Created spider 'helloscrapy' using template 'basic' in module:
  helloScrapy.spiders.helloscrapy

Starting from the very first command, observant readers will notice that scrapy startproject helloScrapy printed a path:

New Scrapy project 'helloScrapy', using template directory '~/software/anaconda3/envs/YS/lib/python3.7/site-packages/scrapy/templates/project', 

Taken literally: startproject used the project template directory under ~/software/anaconda3/envs/YS/lib/python3.7/site-packages/scrapy/templates/.

~/software/anaconda3/envs/YS/lib/python3.7 is where my own Python environment's packages live; substitute the path of your Python environment.
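If you are not sure where that directory is on your machine, you can print it with a one-liner (assuming Scrapy is importable in the active environment):

$ python -c "import scrapy, os; print(os.path.join(scrapy.__path__[0], 'templates'))"
/home/stephen/software/anaconda3/envs/YS/lib/python3.7/site-packages/scrapy/templates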

stephen@stephen-PC:~/software/anaconda3/envs/YS/lib/python3.7/site-packages/scrapy/templates$ tree
.
├── project
│   ├── module
│   │   ├── __init__.py
│   │   ├── items.py.tmpl
│   │   ├── middlewares.py.tmpl
│   │   ├── pipelines.py.tmpl
│   │   ├── __pycache__
│   │   │   └── __init__.cpython-37.pyc
│   │   ├── settings.py.tmpl
│   │   └── spiders
│   │       ├── __init__.py
│   │       └── __pycache__
│   │           └── __init__.cpython-37.pyc
│   └── scrapy.cfg
└── spiders
    ├── basic.tmpl
    ├── crawl.tmpl
    ├── csvfeed.tmpl
    └── xmlfeed.tmpl

6 directories, 13 files

Look at this tree: it should feel familiar. The layout under project is exactly the scaffolding we get whenever we create a Scrapy project, except each file carries an extra .tmpl suffix. Keep this project directory layout and the files under spiders in mind.

Now the next line of Scrapy's "Hello World" (cd omitted):

$ scrapy genspider helloscrapy baidu.com
Created spider 'helloscrapy' using template 'basic' in module:
  helloScrapy.spiders.helloscrapy

Anyone familiar with scrapy genspider knows (if you are not, run scrapy genspider -h) that the full command line is:

$ scrapy genspider -t basic helloscrapy baidu.com

The default argument -t basic was simply omitted. Does basic look familiar? It is basic.tmpl from the spiders template directory we just listed.
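Incidentally, you can ask genspider itself for the list; given the four .tmpl files above, scrapy genspider -l should print:

$ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed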

Interesting.

With that curiosity in hand, let's go read the Scrapy source. Reading source code is a genuinely interesting exercise.

After a round of reviewing the source, the relevant code turns out to live under scrapy/commands/, in startproject.py and genspider.py.

Let's look first at genspider.py, the smaller unit: it generates a single spider file inside a project.

from __future__ import print_function
import os
import shutil
import string

from importlib import import_module
from os.path import join, dirname, abspath, exists, splitext

import scrapy
from scrapy.commands import ScrapyCommand
from scrapy.utils.template import render_templatefile, string_camelcase
from scrapy.exceptions import UsageError


def sanitize_module_name(module_name):
    """Sanitize the given module name, by replacing dashes and points
    with underscores and prefixing it with a letter if it doesn't start
    with one
    """
    module_name = module_name.replace('-', '_').replace('.', '_')
    if module_name[0] not in string.ascii_letters:
        module_name = "a" + module_name
    return module_name


class Command(ScrapyCommand):

    requires_project = False
    default_settings = {'LOG_ENABLED': False}

    def syntax(self):
        return "[options] <name> <domain>"

    def short_desc(self):
        return "Generate new spider using pre-defined templates"

    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-l", "--list", dest="list", action="store_true",
            help="List available templates")
        parser.add_option("-e", "--edit", dest="edit", action="store_true",
            help="Edit spider after creating it")
        parser.add_option("-d", "--dump", dest="dump", metavar="TEMPLATE",
            help="Dump template to standard output")
        parser.add_option("-t", "--template", dest="template", default="basic",
            help="Uses a custom template.")
        parser.add_option("--force", dest="force", action="store_true",
            help="If the spider already exists, overwrite it with the template")

    def run(self, args, opts):
        if opts.list:
            self._list_templates()
            return
        if opts.dump:
            template_file = self._find_template(opts.dump)
            if template_file:
                with open(template_file, "r") as f:
                    print(f.read())
            return
        if len(args) != 2:
            raise UsageError()

        name, domain = args[0:2]
        module = sanitize_module_name(name)

        if self.settings.get('BOT_NAME') == module:
            print("Cannot create a spider with the same name as your project")
            return

        try:
            spidercls = self.crawler_process.spider_loader.load(name)
        except KeyError:
            pass
        else:
            # if spider already exists and not --force then halt
            if not opts.force:
                print("Spider %r already exists in module:" % name)
                print("  %s" % spidercls.__module__)
                return
        template_file = self._find_template(opts.template)
        if template_file:
            self._genspider(module, name, domain, opts.template, template_file)
            if opts.edit:
                self.exitcode = os.system('scrapy edit "%s"' % name)

    def _genspider(self, module, name, domain, template_name, template_file):
        """Generate the spider module, based on the given template"""
        tvars = {
            'project_name': self.settings.get('BOT_NAME'),
            'ProjectName': string_camelcase(self.settings.get('BOT_NAME')),
            'module': module,
            'name': name,
            'domain': domain,
            'classname': '%sSpider' % ''.join(s.capitalize() \
                for s in module.split('_'))
        }
        if self.settings.get('NEWSPIDER_MODULE'):
            spiders_module = import_module(self.settings['NEWSPIDER_MODULE'])
            spiders_dir = abspath(dirname(spiders_module.__file__))
        else:
            spiders_module = None
            spiders_dir = "."
        spider_file = "%s.py" % join(spiders_dir, module)
        shutil.copyfile(template_file, spider_file)
        render_templatefile(spider_file, **tvars)
        print("Created spider %r using template %r " % (name, \
            template_name), end=('' if spiders_module else '\n'))
        if spiders_module:
            print("in module:\n  %s.%s" % (spiders_module.__name__, module))

    def _find_template(self, template):
        template_file = join(self.templates_dir, '%s.tmpl' % template)
        if exists(template_file):
            return template_file
        print("Unable to find template: %s\n" % template)
        print('Use "scrapy genspider --list" to see all available templates.')

    def _list_templates(self):
        print("Available templates:")
        for filename in sorted(os.listdir(self.templates_dir)):
            if filename.endswith('.tmpl'):
                print("  %s" % splitext(filename)[0])

    @property
    def templates_dir(self):
        _templates_base_dir = self.settings['TEMPLATES_DIR'] or \
            join(scrapy.__path__[0], 'templates')
        return join(_templates_base_dir, 'spiders')
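Before walking through the main flow, note the helper at the top: sanitize_module_name turns a spider name into a valid Python module name. A quick REPL check (the inputs here are made up, assuming Scrapy 1.7.x is importable):

from scrapy.commands.genspider import sanitize_module_name

sanitize_module_name('my-spider.v2')  # dashes/dots become underscores: 'my_spider_v2'
sanitize_module_name('123bot')        # prefixed with a letter: 'a123bot'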

Walking through the command's logic, the key piece is the templates_dir property, which locates the spider template files. Printing the base path in a Python shell:

import scrapy
scrapy.__path__[0]
Out[3]: '/home/stephen/software/anaconda3/envs/YS/lib/python3.7/site-packages/scrapy'

Joined with 'templates' and 'spiders', this is exactly the directory we listed above.

Take -t basic as the example and skim the flow: the command first locates basic.tmpl via _find_template, copies it to the target spider_file path under the project's spiders directory, and then renders it in place with render_templatefile(spider_file, **tvars). Under the hood the rendering uses Python's string.Template.

Note: the **tvars being passed in is:

 tvars = {
            'project_name': self.settings.get('BOT_NAME'),
            'ProjectName': string_camelcase(self.settings.get('BOT_NAME')),
            'module': module,
            'name': name,
            'domain': domain,
            'classname': '%sSpider' % ''.join(s.capitalize() \
                for s in module.split('_'))
        }
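To make the rendering step concrete, here is a minimal sketch of what that substitution amounts to, using only the standard library's string.Template (the template string and values below are illustrative, not Scrapy's actual code):

from string import Template

raw = (
    "class $classname(scrapy.Spider):\n"
    "    name = '$name'\n"
    "    allowed_domains = ['$domain']\n"
)
tvars = {'classname': 'HelloscrapySpider', 'name': 'helloscrapy', 'domain': 'baidu.com'}

# Each $var is replaced by the matching tvars entry -- the same
# transformation basic.tmpl goes through in render_templatefile
print(Template(raw).substitute(tvars))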

Now look closely at the contents of basic.tmpl:

# -*- coding: utf-8 -*-
import scrapy


class $classname(scrapy.Spider):
    name = '$name'
    allowed_domains = ['$domain']
    start_urls = ['http://$domain/']

    def parse(self, response):
        pass

Isn't this exactly the helloscrapy.py file we just generated?

# -*- coding: utf-8 -*-
import scrapy


class HelloscrapySpider(scrapy.Spider):
    name = 'helloscrapy'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        pass

And those template variables are filled in from the command line:

$ scrapy genspider -t basic helloscrapy baidu.com

With the genspider code logic roughly clear, the next step is to create a spider template of our own. This takes writing plenty of spiders to get a feel for: whenever you catch yourself rewriting the same code, that code belongs in your own spider template.

Create a test_basic.tmpl file under '/home/stephen/software/anaconda3/envs/YS/lib/python3.7/site-packages/scrapy/templates/spiders' (the templates/spiders directory we located earlier), with the following contents:

# -*- coding: utf-8 -*-
"""
	My own spider template
	Updated:
	Author:
"""
import scrapy



class $classname(scrapy.Spider):
    name = '$name'
    allowed_domains = ['$domain']

    custom_settings = {

    }

    def parse(self, response):
        pass
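One caveat before testing: anything you add under site-packages will be wiped out when Scrapy is upgraded or reinstalled. As the templates_dir property above shows, genspider also honors a TEMPLATES_DIR setting, so a safer sketch (the directory name here is my own choice) is to keep templates inside the project:

# In your Scrapy project's settings.py -- genspider appends 'spiders'
# to this path, so the .tmpl files would live in <project>/templates/spiders/
TEMPLATES_DIR = 'templates'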

Test

$ scrapy genspider -t test_basic test test.com
Created spider 'test' using template 'test_basic' in module:
  helloScrapy.spiders.test
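Opening the generated helloScrapy/spiders/test.py should show our template rendered with the command-line values (classname becomes TestSpider):

# -*- coding: utf-8 -*-
"""
	My own spider template
	Updated:
	Author:
"""
import scrapy



class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['test.com']

    custom_settings = {

    }

    def parse(self, response):
        pass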

Afterword
