Another Scrapy question.

def parse(self, response):
    sel = Selector(response)
    for link in sel.xpath('//h2/a/@href').extract():
        request = scrapy.Request(link, callback=self.parse_item)
        yield request

    pages = sel.xpath("//div[@class='navigation']/div[@id='wp_page_numbers']/ul/li/a/@href").extract()
    print('pages: %s' % pages)
    if len(pages) > 2:
        page_link = pages[-2]
        page_link = page_link.replace('/a/', '')
        request = scrapy.Request('http://www.meizitu.com/a/%s' % page_link, callback=self.parse)
        yield request

def parse_item(self, response):
    l = ItemLoader(item=MeizituItem(), response=response)
    l.add_xpath('name', '//h2/a/text()')
    l.add_xpath('tags', "//div[@id='maincontent']/div[@class='postmeta  clearfix']/div[@class='metaRight']/p")
    l.add_xpath('image_urls', "//div[@id='picture']/p/img/@src", Identity())
    l.add_value('url', response.url)
    return l.load_item()
    
    
    
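I found some Scrapy code to study.
The first method, parse, scrapes the links.
The second method, parse_item, scrapes the content.
I'd like to know how to make it write out whatever each of these methods scrapes. I want to see what the data looks like so that I can adapt the code myself.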

8 replies • 2016-08-12 15:59:48 +08:00

laoyur

2016-08-12 11:59:26 +08:00

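OP is single-handedly holding up half of the Python node...

OK, I'm a Python newbie too :)
I haven't been learning Scrapy for long either.

In parse, if you return/yield a Request, it is added to the scheduler and waits to be processed; if you return/yield an item, it goes into the pipelines, where you can process the item yourself, store it in a database, or whatever.
So your question is a bit vague: what exactly do you mean by "output"? Can't you just log in a pipeline, or log directly in parse_item?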
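
To make the "log in a pipeline" idea concrete, here is a minimal sketch. The class name `ItemLoggingPipeline` is made up for illustration, and it only does anything once it is listed in `ITEM_PIPELINES` in settings.py; `process_item(self, item, spider)` is the method Scrapy calls for each item.

```python
import logging

logger = logging.getLogger(__name__)


class ItemLoggingPipeline(object):
    """Hypothetical pipeline: log every item, then pass it on unchanged."""

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item objects and for plain dicts.
        logger.info("scraped by %s: %r", spider.name, dict(item))
        # Returning the item lets any later pipelines see it too.
        return item
```

Since the item is returned untouched, this should be safe to leave enabled next to other pipelines while you figure out what the fields look like.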

recall704

2016-08-12 12:01:10 +08:00

Uh... it's probably better to read the docs for this.

If you want to debug, just use the shell:

```
scrapy shell www.xxoo.com
```

If you want to look at the HTML content inside parse, just use response.text:

```
def parse(self, response):
    self.log(response.text)
```
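
If the log gets too noisy, one option is to dump each page to disk instead. This is only a rough sketch: it assumes nothing beyond the `response.url` and `response.body` attributes, and `dump_response` and the `pages` directory are names made up here.

```python
import hashlib
import os


def dump_response(response, directory='pages'):
    """Write the raw bytes of a response to <directory>/<sha1 of url>.html."""
    os.makedirs(directory, exist_ok=True)
    # Hash the URL so the file name is always valid on disk.
    name = hashlib.sha1(response.url.encode('utf-8')).hexdigest() + '.html'
    path = os.path.join(directory, name)
    with open(path, 'wb') as f:
        f.write(response.body)  # response.body is bytes
    return path
```

You would call `dump_response(response)` at the top of parse or parse_item and then open the saved files in an editor or browser.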

laoyur

2016-08-12 12:02:25 +08:00

Sorry, I got it wrong in the first reply. I mistook you for another member, because I often see him posting in the Python node.

knightdf

2016-08-12 12:49:56 +08:00

item pipeline

KoleHank

2016-08-12 13:24:48 +08:00

Write a JSON pipeline, then register it in settings.py, and have the pipeline store the scraped items.

The code looks something like this:
```
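import json


class JsonWriterPipeline(object):

    def __init__(self):
        # One JSON object per line, written as UTF-8 (Python 3).
        self.file = open('item.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Called when the spider finishes; flushes and closes the file.
        self.file.close()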
```
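
Registering it in settings.py would look roughly like this. The module path `meizitu.pipelines` is an assumption about the project layout, so swap in your own package name; the number is the order value, where lower values run first and the convention is to stay within 0-1000.

```python
# settings.py (fragment)
# 'meizitu.pipelines' is an assumed module path, not taken from the original project.
ITEM_PIPELINES = {
    'meizitu.pipelines.JsonWriterPipeline': 300,
}
```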

yutian2211

2016-08-12 14:04:56 +08:00

Just print(response.text)

xiaoyu9527

2016-08-12 15:34:45 +08:00

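@laoyur I simply don't know what format the saved data is in, which is why I want to look at it. Plain text output would be fine.

@yutian2211 Printing isn't ideal; it would be better to save it.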
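
For looking at what actually got saved, something along these lines might do. It is a sketch that assumes the one-JSON-object-per-line `item.json` written by the pipeline in the earlier reply; `load_items` and `print_items` are helper names made up here.

```python
import json


def load_items(path):
    """Read a JSON-lines file and return the items as a list of dicts."""
    items = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                items.append(json.loads(line))
    return items


def print_items(path):
    """Pretty-print every item in the file, one after another."""
    for item in load_items(path):
        # ensure_ascii=False keeps non-ASCII titles readable in the terminal.
        print(json.dumps(item, ensure_ascii=False, indent=2))
```

After a crawl, `print_items('item.json')` in a Python shell shows each item with its fields (name, tags, image_urls, url) laid out one per line.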

zanpen2000

2016-08-12 15:59:48 +08:00

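As I recall, you can write your own middleware to do extra processing: you add the handler you defined to ITEM_PIPELINES in settings.py, and you also need to define your handler at the spider's entry point. The project below is something I wrote last year; the code is pretty rough, so make do with it.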

http://git.oschina.net/zanpen2000/gov_website_crawler_checker