
Scrapy spider does not output items to the pipeline

    fusae · 2016-07-18 16:28:55 +08:00 · 5473 views

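    I don't know why, but if I comment out the section marked "bug code" below, the item does reach the pipeline.

    The code first requests a URL to obtain a cookie, then uses that cookie to download the CAPTCHA image and decode it, and finally POSTs the form data. If the CAPTCHA was wrong, it downloads and decodes a new CAPTCHA and POSTs again, repeating until the POST succeeds. The buggy code is at the end, in the part that re-downloads the CAPTCHA. Does anyone familiar with Scrapy know why this happens?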

    
    # -*- coding: gbk -*-
    import scrapy
    from scrapy.http import FormRequest
    import json
    import os
    from datetime import datetime
    from scrapy.selector import Selector
    from teacherCourse.handlePic import handle
    from teacherCourse.items import DetailProfItem
    from teacherCourse.items import DetailProfCourseItem
    from teacherCourse.items import containItem
    
    class GetTeacherCourseSpider(scrapy.Spider):
        name = 'TeacherCourse'
    #    custom_settings = {
    #            'ITEM_PIPELINES': {
    #                'teacherCourse.pipelines.TeacherCoursePipeline': 300,
    #                }
    #            }
    
        def __init__(self, selXNXQ='', titleCode=''):
            self.getUrl = 'http://jwxt.dgut.edu.cn/jwweb/ZNPK/TeacherKBFB.aspx' # first
            self.vcodeUrl = 'http://jwxt.dgut.edu.cn/jwweb/sys/ValidateCode.aspx' # second
            self.postUrl = 'http://jwxt.dgut.edu.cn/jwweb/ZNPK/TeacherKBFB_rpt.aspx' # third
            self.findSessionId = None # to save the cookies
            self.XNXQ = selXNXQ
            self.titleCode = titleCode
    
        def start_requests(self):
            request = scrapy.Request(self.getUrl,
                   callback = self.downloadPic)
            yield request
    
        def downloadPic(self, response):
            # download the picture
            # find the session id
            self.findSessionId = response.headers.getlist('Set-Cookie')[0].decode().split(";")[0].split("=")
            request = scrapy.Request(self.vcodeUrl,
                    cookies= {self.findSessionId[0]: self.findSessionId[1]},
                    callback = self.getAndHandleYzm)
            yield request
        
        def getAndHandleYzm(self, response):
            yzm = handle(response.body)
            
            yield FormRequest(self.postUrl,
                    formdata={'Sel_XNXQ': '20151',
                              'sel_zc': '011',
                              'txt_yzm': yzm,
                              'type': '2'},
                    headers={
                        'Referer': 'http://jwxt.dgut.edu.cn/jwweb/ZNPK/TeacherKBFB.aspx',
                        'Cookie': self.findSessionId[0] + '=' + self.findSessionId[1],
                        },
                    callback=self.parse)
    
        def parse(self, response):
            body = response.body.decode('gbk')
    # bug code begin
            num = body.find('alert')
            if num != -1:
                # means CAPTCHA validation fails, need to re-request the CAPTCHA
                yield scrapy.Request(self.vcodeUrl+'?t='+'%.f' % (datetime.now().microsecond / 1000),
                headers={
                        'Referer': 'http://jwxt.dgut.edu.cn/jwweb/ZNPK/TeacherKBFB.aspx',
                        'Cookie': self.findSessionId[0]+'='+self.findSessionId[1]
                        },
                callback=self.getAndHandleYzm) # re-request the CAPTCHA after a failed validation
    # bug code done
            else:
                # parse data
    #            self.parseData(body)
                item = containItem()
                item['first'] = len(body)
                return item
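
A side note on the retry request above: `'%.f' % (datetime.now().microsecond / 1000)` only yields values from 0 to 1000, so a retry URL can repeat, and Scrapy's default duplicate filter drops a request it has already seen unless `dont_filter=True` is set. The sketch below shows one possible alternative that uses a full millisecond timestamp. The helper name `captcha_url` and its `now` parameter are hypothetical, and it assumes the server accepts any value for `t`.

```python
import time

VCODE_URL = 'http://jwxt.dgut.edu.cn/jwweb/sys/ValidateCode.aspx'


def captcha_url(base_url, now=None):
    """Return base_url with a millisecond Unix timestamp as the `t` parameter.

    Hypothetical helper, not part of the original spider. `now` is a Unix
    time in seconds and exists only to make the function easy to test.
    """
    if now is None:
        now = time.time()
    return '%s?t=%d' % (base_url, int(now * 1000))


# Each call made at a different millisecond produces a different URL.
print(captcha_url(VCODE_URL))
```

Passing `dont_filter=True` to the retry `scrapy.Request` is another way to keep the duplicate filter from discarding it.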
    
    
    1 reply · 2016-07-21 16:42:50 +08:00

    #1 · fusae (OP) · 2016-07-21 16:42:50 +08:00
    Answering my own question: change `return` to `yield`.
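
The reason this fix works can be shown without Scrapy. Because `parse` contains a `yield`, Python treats the whole method as a generator, and in Python 3 a `return item` inside a generator only ends it: the value is attached to `StopIteration` and never reaches whoever iterates the result. With the retry block commented out there is no `yield`, so `parse` is an ordinary function and `return item` hands the item over as expected. The sketch below uses hypothetical stand-in functions and placeholder values rather than real Scrapy objects.

```python
# Stand-ins for the spider's parse(); the string and dict are placeholders,
# not real Scrapy requests or items.

def parse_with_return(captcha_failed):
    if captcha_failed:
        yield 'retry CAPTCHA request'
    else:
        # Inside a generator this only sets StopIteration.value;
        # nothing is emitted to the consumer.
        return {'first': 123}


def parse_with_yield(captcha_failed):
    if captcha_failed:
        yield 'retry CAPTCHA request'
    else:
        yield {'first': 123}


def collect(callback_output):
    """Consume a callback result by iterating it, as a framework would."""
    return list(callback_output)


print(collect(parse_with_return(captcha_failed=False)))  # []
print(collect(parse_with_yield(captcha_failed=False)))   # [{'first': 123}]
```

In the spider itself, this corresponds to replacing `return item` in the `else` branch of `parse` with `yield item`.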