
LogParser v0.8.0 released: a Python library for periodic, incremental parsing of Scrapy crawl logs; used together with ScrapydWeb, it enables visualization of crawl progress

    my8100 · 2019-01-21 18:16:30 +08:00 · 1692 views

    Open source on GitHub

    https://github.com/my8100/logparser

    Installation

    • Via pip:
    pip install logparser
    
    • Via git:
    git clone https://github.com/my8100/logparser.git
    cd logparser
    python setup.py install
    

    Usage

    Run as a service

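    1. Make sure Scrapyd is installed and running on the current host
    2. Start LogParser with the command logparser
    3. Visit http://127.0.0.1:6800/logs/stats.json (assuming Scrapyd is running on port 6800)
    4. Visit http://127.0.0.1:6800/logs/projectname/spidername/jobid.json to get the detailed log analysis for a specific crawl job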
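The two endpoints in steps 3 and 4 can also be read from a script instead of a browser. The following is only a minimal sketch: the helper names (`stats_url`, `job_url`, `fetch_json`) are made up for illustration, and it assumes Scrapyd and LogParser are both running on 127.0.0.1:6800 as described above. The layout of the returned JSON is not documented in this post, so the sketch just decodes it and leaves inspection to the caller.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: Scrapyd serves its logs directory on this host and port.
DEFAULT_BASE = "http://127.0.0.1:6800"


def stats_url(base=DEFAULT_BASE):
    """URL of the aggregated stats file from step 3."""
    return f"{base.rstrip('/')}/logs/stats.json"


def job_url(project, spider, job, base=DEFAULT_BASE):
    """URL of the per-job analysis file from step 4."""
    # Escape each path segment so unusual names cannot break the URL.
    parts = "/".join(quote(str(p), safe="") for p in (project, spider, job))
    return f"{base.rstrip('/')}/logs/{parts}.json"


def fetch_json(url, timeout=5):
    """Download a URL and decode its body as JSON (needs a live server)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

With the service running, `fetch_json(job_url("projectname", "spidername", "jobid"))` would return the decoded analysis for that job; the three names are the placeholders from step 4, not real values.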

    Visualize crawl progress with ScrapydWeb

    See https://github.com/my8100/scrapydweb for details.

    Use in Python code

    In [1]: from logparser import parse
    
    In [2]: log = """2018-10-23 18:28:34 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: demo)
       ...: 2018-10-23 18:29:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
       ...: {'downloader/exception_count': 3,
       ...:  'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 3,
       ...:  'downloader/request_bytes': 1336,
       ...:  'downloader/request_count': 7,
       ...:  'downloader/request_method_count/GET': 7,
       ...:  'downloader/response_bytes': 1669,
       ...:  'downloader/response_count': 4,
       ...:  'downloader/response_status_count/200': 2,
       ...:  'downloader/response_status_count/302': 1,
       ...:  'downloader/response_status_count/404': 1,
       ...:  'dupefilter/filtered': 1,
       ...:  'finish_reason': 'finished',
       ...:  'finish_time': datetime.datetime(2018, 10, 23, 10, 29, 41, 174719),
       ...:  'httperror/response_ignored_count': 1,
       ...:  'httperror/response_ignored_status_count/404': 1,
       ...:  'item_scraped_count': 2,
       ...:  'log_count/CRITICAL': 5,
       ...:  'log_count/DEBUG': 14,
       ...:  'log_count/ERROR': 5,
       ...:  'log_count/INFO': 75,
       ...:  'log_count/WARNING': 3,
       ...:  'offsite/domains': 1,
       ...:  'offsite/filtered': 1,
       ...:  'request_depth_max': 1,
       ...:  'response_received_count': 3,
       ...:  'retry/count': 2,
       ...:  'retry/max_reached': 1,
       ...:  'retry/reason_count/twisted.internet.error.TCPTimedOutError': 2,
       ...:  'scheduler/dequeued': 7,
       ...:  'scheduler/dequeued/memory': 7,
       ...:  'scheduler/enqueued': 7,
       ...:  'scheduler/enqueued/memory': 7,
       ...:  'start_time': datetime.datetime(2018, 10, 23, 10, 28, 35, 70938)}
       ...: 2018-10-23 18:29:42 [scrapy.core.engine] INFO: Spider closed (finished)"""
    
    In [3]: d = parse(log, headlines=1, taillines=1)
    
    In [4]: d
    Out[4]:
    OrderedDict([('head',
                  '2018-10-23 18:28:34 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: demo)'),
                 ('tail',
                  '2018-10-23 18:29:42 [scrapy.core.engine] INFO: Spider closed (finished)'),
                 ('first_log_time', '2018-10-23 18:28:34'),
                 ('latest_log_time', '2018-10-23 18:29:42'),
                 ('elapsed', '0:01:08'),
                 ('first_log_timestamp', 1540290514),
                 ('latest_log_timestamp', 1540290582),
                 ('datas', []),
                 ('pages', 3),
                 ('items', 2),
                 ('latest_matches',
                  {'resuming_crawl': '',
                   'latest_offsite': '',
                   'latest_duplicate': '',
                   'latest_crawl': '',
                   'latest_scrape': '',
                   'latest_item': '',
                   'latest_stat': ''}),
                 ('latest_crawl_timestamp', 0),
                 ('latest_scrape_timestamp', 0),
                 ('log_categories',
                  {'critical_logs': {'count': 5, 'details': []},
                   'error_logs': {'count': 5, 'details': []},
                   'warning_logs': {'count': 3, 'details': []},
                   'redirect_logs': {'count': 1, 'details': []},
                   'retry_logs': {'count': 2, 'details': []},
                   'ignore_logs': {'count': 1, 'details': []}}),
                 ('shutdown_reason', 'N/A'),
                 ('finish_reason', 'finished'),
                 ('last_update_timestamp', 1547559048),
                 ('last_update_time', '2019-01-15 21:30:48')])
    
    In [5]: d['elapsed']
    Out[5]: '0:01:08'
    
    In [6]: d['pages']
    Out[6]: 3
    
    In [7]: d['items']
    Out[7]: 2
    
    In [8]: d['finish_reason']
    Out[8]: 'finished'
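Once the parsed result is available, derived figures such as crawl speed are easy to compute. The sketch below is not part of LogParser: `parse_elapsed` and `summarize` are hypothetical helpers, and `sample` is a plain dict that copies a few values from the Out[4] output above so that the snippet runs without LogParser installed. It also assumes `elapsed` has the plain H:MM:SS form shown in the session.

```python
from datetime import timedelta


def parse_elapsed(text):
    """Convert an 'H:MM:SS' string such as '0:01:08' into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def summarize(d):
    """Derive run length, throughput, and non-zero log category counts."""
    seconds = d["latest_log_timestamp"] - d["first_log_timestamp"]

    def per_minute(count):
        # Guard against a zero-length run to avoid dividing by zero.
        return round(count * 60 / seconds, 2) if seconds > 0 else 0.0

    return {
        "seconds": seconds,
        "pages_per_minute": per_minute(d["pages"]),
        "items_per_minute": per_minute(d["items"]),
        "problems": {
            name: info["count"]
            for name, info in d["log_categories"].items()
            if info["count"]
        },
    }


# Values copied from the Out[4] result shown above.
sample = {
    "elapsed": "0:01:08",
    "first_log_timestamp": 1540290514,
    "latest_log_timestamp": 1540290582,
    "pages": 3,
    "items": 2,
    "log_categories": {
        "critical_logs": {"count": 5, "details": []},
        "error_logs": {"count": 5, "details": []},
        "warning_logs": {"count": 3, "details": []},
        "redirect_logs": {"count": 1, "details": []},
        "retry_logs": {"count": 2, "details": []},
        "ignore_logs": {"count": 1, "details": []},
    },
}

summary = summarize(sample)
print(summary)
```

The two timestamps differ by 68 seconds, which agrees with the `elapsed` value of '0:01:08', so either field can be used as the run length.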
    
    
    2 replies · 2019-01-24 09:48:52 +08:00
    1. ospider · 2019-01-21 19:17:30 +08:00
    Why not use Grafana?
    2. 15399905591 · 2019-01-24 09:48:52 +08:00
    Hats off to you. I had always used SpiderKeeper before, but it has far too many pitfalls.