Using the sitemap listed in robots.txt is a convenient way to crawl the URLs a site created today - V2EX

This topic was created 2675 days ago; the information mentioned may have changed since then.

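Some sites publish sitemap files, which exist precisely for crawlers such as Baidu's. The site puts the URLs created that day and the recently updated URLs into the sitemap, so a crawler does not have to traverse the whole site, which saves bandwidth. This matters most for large sites with tens of millions or even hundreds of millions of pages, where fetching the sitemap is an effective way to discover new URLs. This article describes the same idea: https://www.yuanrenxue.com/crawler/crawler-tricks-3.html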
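A minimal sketch of that idea in Python, using only the standard library. It is not taken from the linked article. It assumes the site declares its sitemaps with `Sitemap:` lines in robots.txt, that the sitemap entries carry a `<lastmod>` value beginning with `YYYY-MM-DD`, and that the sitemaps are served as plain (not gzipped) XML. The function names are made up for illustration.

```python
import datetime
import urllib.request
import xml.etree.ElementTree as ET


def sitemap_urls_from_robots(robots_txt):
    """Return the sitemap URLs declared by `Sitemap:` lines in robots.txt."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own "://" stays intact.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def _local(tag):
    """Drop the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_data):
    """Return (kind, entries) for a sitemap document.

    kind is the root tag ('urlset' or 'sitemapindex'); entries is a list of
    (loc, lastmod) tuples, with lastmod set to None when it is absent.
    """
    root = ET.fromstring(xml_data)
    entries = []
    for node in root:
        fields = {_local(child.tag): (child.text or "").strip() for child in node}
        if fields.get("loc"):
            entries.append((fields["loc"], fields.get("lastmod") or None))
    return _local(root.tag), entries


def urls_modified_on(entries, day):
    """Keep the locs whose lastmod starts with the given date (time zone ignored)."""
    prefix = day.isoformat()
    return [loc for loc, lastmod in entries if lastmod and lastmod.startswith(prefix)]


def fetch(url, timeout=10):
    """Download a URL and return the raw response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def new_urls_today(site, day=None):
    """Collect URLs whose lastmod matches `day` (default: today) for one site."""
    day = day or datetime.date.today()
    robots_txt = fetch(site.rstrip("/") + "/robots.txt").decode("utf-8", "replace")
    pending = sitemap_urls_from_robots(robots_txt)
    seen, found = set(), []
    while pending:
        sitemap_url = pending.pop()
        if sitemap_url in seen:
            continue  # guard against sitemap indexes that reference each other
        seen.add(sitemap_url)
        kind, entries = parse_sitemap(fetch(sitemap_url))
        if kind == "sitemapindex":
            # An index file lists further sitemaps rather than pages.
            pending.extend(loc for loc, _ in entries)
        else:
            found.extend(urls_modified_on(entries, day))
    return found
```

Calling `new_urls_today("https://example.com")` (a placeholder domain) would return the pages whose `<lastmod>` is today's date. Sites that omit `<lastmod>` would need a different strategy, such as comparing the sitemap against the previous day's copy.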

1 reply

1

bighead22

OP

Jan 10, 2019

Does this count as a slick trick?
