How do I scrape content inside an iframe?

    Gary_Cheung · 2016-04-21 11:01:35 +08:00 · 3907 views
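    Target URL [1]: http://www.shian.gov.cn/web/jghq.aspx
    The "批发市场商品价格汇总统计" (wholesale market commodity price summary) section on that page is embedded through an iframe, and I want to scrape the vegetable prices in it.

    Looking at the frame source, the page that actually holds the prices, target page [2], is: www.shian.gov.cn/web/jghq_static.aspx

    When I fetch target page [2], I get the content below.
    [Problem] The vegetable names and prices are missing entirely. How should I handle this?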
    <html>
    <head>
    <title>价格行情</title>
    <meta content="zh-cn" http-equiv="Content-Language"/>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
    <link href="indexcss.css" rel="stylesheet" type="text/css"/>
    </head>
    <body bgcolor="#ffffff" leftmargin="0" topmargin="0">
    <br/>
    <form action="jghq_static.aspx" id="Form1" method="post">
    <input id="__VIEWSTATE" name="__VIEWSTATE" type="hidden" value="/wEPDwUJOTkzMTA4NzM4D2QWAgIBD2QWAgIFD2QWEgIBDzwrAAsBAA8WDB4IUGFnZVNpemUCFB4QQ3VycmVudFBhZ2VJbmRleGYeCERhdGFLZXlzFgAeC18hSXRlbUNvdW50Zh4JUGFnZUNvdW50AgEeFV8hRGF0YVNvdXJjZUl0ZW1Db3VudGZkZAIDDw8WAh4EVGV4dAUBMWRkAgUPDxYCHwYFATFkZAIHDw8WAh8GBQIyMGRkAhUPDxYCHwZlZGQCFg8PFgIfBmVkZAIXDw8WAh8GZWRkAhgPDxYCHwZlZGQCGQ8PFgIfBmVkZGQ54k8xC1bweBsA6y8dJk8MPrrcbeg01u/XNx8eMcBHPA=="/>
    <input id="__EVENTVALIDATION" name="__EVENTVALIDATION" type="hidden" value="/wEdAAfRcPnPSVRcgynXhDGg9xqU4kXHexmHTU3XFH1VXAJoLKE9sXUIGLUYn9CF6aOsrFQY207xRgN32GhpklrIeNb1k9q+Dvz5GhUZi/1U8wQNg6SRIWS4Ty/Jk88HkugWH7zcouhQiaDF9I9OFtqm0AqvH7do95Mjb5DMi5nDzW0lYuiIxoUfHaCQbffhBrlC0Nc="/>
    <table align="center" border="0" cellpadding="0" cellspacing="0" width="550">
    <tr>
    <td align="middle" valign="top">
    <table align="center" border="0" id="table3" width="100%">
    <tr>
    <td align="middle" valign="top" width="70%">
    <p>
    <span id="goodstypename"></span>
    批发市场商品价格汇总统计
    </p><table border="0" cellpadding="0" cellspacing="0" width="95%">
    <tr>
    <td align="right" width="5"><img src="images/scgl_89.gif"/></td>
    <td align="middle" style="BACKGROUND-POSITION: right bottom; BACKGROUND-IMAGE: url(images/scgl_90.gif); BACKGROUND-REPEAT: repeat-x"><font face="宋体">
    <table border="0" cellpadding="0" cellspacing="0" class="unnamed1" id="Table5" width="100%">
    <tr>
    <td width="50%"> 日期:
    <span id="fromdate"></span></td>
    <td align="middle" width="50%"><font face="宋体">单位: 元 /公斤</font></td>
    </tr>
    </table>
    </font>
    </td>
    <td align="left" width="5"><img src="images/scgl_91.gif"/></td>
    </tr>
    <tr>
    <td align="right" background="images/left.gif" width="5"></td>
    <td align="middle" valign="top">
    <p>
    </p><table border="0" cellpadding="0" cellspacing="0" id="Table1" width="100%">
    <tr>
    <td align="middle"><table cellpadding="3" cellspacing="0" id="PriceStaticControl1_DataGrid1" rules="all" width="100%">
    <tr align="center">
    <td class="pricet">品种名称</td><td class="pricet">最高价</td><td class="pricet">最低价</td><td class="pricet">平均价</td>
    </tr>
    </table><p></p>
    </td>
    </tr>
    <tr>
    <td align="middle" width="100%">
    <table border="0" cellpadding="0" cellspacing="0" id="Table2" width="100%">
    <tr>
    <td align="middle" class="black"><font face="宋体"> <img alt="" border="0" src="/web/images/multipage_icon.gif"/>
    第 </font><font color="red">
    <span id="PriceStaticControl1_currentpage">1</span></font><font face="宋体"> 页 /共
    </font><font color="red">
    <span id="PriceStaticControl1_pagenum">1</span></font><font face="宋体"> 页
    </font><font color="red">
    <span id="PriceStaticControl1_sizeperpage">20</span></font><font face="宋体"> 条 /页
    </font>
    <a href="javascript:__doPostBack('PriceStaticControl1$firstpage','')" id="PriceStaticControl1_firstpage">首页</a><font face="宋体">
    </font>
    <a href="javascript:__doPostBack('PriceStaticControl1$prepage','')" id="PriceStaticControl1_prepage">前页</a><font face="宋体">
    </font>
    <a href="javascript:__doPostBack('PriceStaticControl1$nextpage','')" id="PriceStaticControl1_nextpage">后页</a><font face="宋体">
    </font>
    <a href="javascript:__doPostBack('PriceStaticControl1$rearpage','')" id="PriceStaticControl1_rearpage">尾页</a><font face="宋体">
    转到第 </font>
    <input border="1" id="PriceStaticControl1_pageno" name="PriceStaticControl1:pageno" type="text"/><font face="宋体">页
    </font>
    <input class="button" id="PriceStaticControl1_gotopage" name="PriceStaticControl1:gotopage" type="submit" value="Go"/></td>
    </tr>
    </table>
    </td>
    </tr>
    <tr style="DISPLAY: none">
    <td align="middle" width="100%"></td>
    </tr>
    </table>
    <p><font face="宋体"></font> </p>
    </td>
    <td align="left" background="images/right.gif" width="5"></td>
    </tr>
    <tr>
    <td width="5"><img src="images/scgl_106.gif"/></td>
    <td style="BACKGROUND-POSITION-Y: top; BACKGROUND-IMAGE: url(images/scgl_107.gif); BACKGROUND-REPEAT: repeat-x"></td>
    <td width="5"><img src="images/scgl_108.gif"/></td>
    </tr>
    </table>
    </td>
    </tr>
    </table>
    </td>
    </tr>
    </table>
    </form>
    </body>
    </html>
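
The dump above can be checked mechanically: the price grid `PriceStaticControl1_DataGrid1` contains only its header row, and the form carries two hidden fields. Below is a minimal sketch using only the standard library. `FormInspector` is a hypothetical helper, and `abc` / `def` are placeholders standing in for the long `__VIEWSTATE` and `__EVENTVALIDATION` values. ASP.NET WebForms pages commonly expect those hidden fields to be sent back on a postback, but the thread does not confirm whether this server requires them.

```python
from html.parser import HTMLParser


class FormInspector(HTMLParser):
    """Collect hidden <input> values and count the rows of one table by id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.hidden = {}      # name -> value of every hidden input
        self.row_count = 0    # <tr> elements directly inside the target table
        self._depth = 0       # <table> nesting depth, counted from the target

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden":
            self.hidden[attrs.get("name")] = attrs.get("value", "")
        elif tag == "table":
            # Start counting at the target table; track nested tables too.
            if self._depth or attrs.get("id") == self.table_id:
                self._depth += 1
        elif tag == "tr" and self._depth == 1:
            self.row_count += 1

    def handle_endtag(self, tag):
        if tag == "table" and self._depth:
            self._depth -= 1


# Trimmed-down stand-in for the response shown above (placeholder values).
SAMPLE = """
<form action="jghq_static.aspx" id="Form1" method="post">
<input id="__VIEWSTATE" name="__VIEWSTATE" type="hidden" value="abc"/>
<input id="__EVENTVALIDATION" name="__EVENTVALIDATION" type="hidden" value="def"/>
<table id="PriceStaticControl1_DataGrid1">
<tr align="center"><td class="pricet">品种名称</td><td class="pricet">最高价</td></tr>
</table>
</form>
"""

inspector = FormInspector("PriceStaticControl1_DataGrid1")
inspector.feed(SAMPLE)
data_rows = inspector.row_count - 1  # subtract the header row
print(sorted(inspector.hidden), data_rows)
```

A `data_rows` of zero means the server returned an empty grid rather than the parser losing the rows, which suggests the request itself is missing something.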
    4 replies    2016-04-21 12:10:23 +08:00
    caspartse · #1 · 2016-04-21 11:18:50 +08:00
    I took a look with Firebug: the page needs data to be POSTed.
    goodsname=&goodstype=00&beginyear=2016&beginmonth=4&beginday=20&endyear=2016&endmonth=4&endday=21
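
The captured body above is an ordinary URL-encoded form string, so it can be turned into a dict rather than retyped by hand. A small sketch with the standard library follows; the variable names `captured` and `form` are just for illustration.

```python
from urllib.parse import parse_qsl

# The POST body as captured in the browser's network panel.
captured = (
    "goodsname=&goodstype=00&beginyear=2016&beginmonth=4&beginday=20"
    "&endyear=2016&endmonth=4&endday=21"
)

# keep_blank_values=True keeps the empty goodsname field instead of dropping it.
form = dict(parse_qsl(captured, keep_blank_values=True))
print(form)
```

Parsing the string this way also avoids transcription slips such as typing one key twice.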
    liaowu · #2 · 2016-04-21 11:24:23 +08:00
    1. Treat the iframe's URL as the next URL and add it to the queue;
    2. Use PhantomJS or a similar technique.
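
Option 1 can be sketched roughly as follows: collect every iframe `src` on the outer page, resolve it against the page URL, and queue the result. `IframeCollector` is a hypothetical helper, and the `sample` markup is an assumed stand-in, since the outer page's actual iframe tag is not shown in this thread.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeCollector(HTMLParser):
    """Collect absolute URLs of all <iframe src=...> elements on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                # Relative src values are resolved against the outer page URL.
                self.urls.append(urljoin(self.base_url, src))


page_url = "http://www.shian.gov.cn/web/jghq.aspx"
# Assumed markup for illustration only.
sample = '<html><body><iframe src="jghq_static.aspx" width="560"></iframe></body></html>'

collector = IframeCollector(page_url)
collector.feed(sample)
queue = list(collector.urls)  # URLs to crawl next
print(queue)
```

This only finds the frame's address. If the framed page fills its table in response to a form POST, as here, the crawler still has to send that POST.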
    Gary_Cheung (OP) · #3 · 2016-04-21 11:50:43 +08:00
    @caspartse Maybe I'm doing something wrong, but the POST has no effect either. My code is below, could you take a look? :)

    from bs4 import BeautifulSoup
    import requests

    headers = {
    'goodsname':'',
    'goodstype':'00',
    'beginyear':'2016',
    'beginmonth':'4',
    'beginday':'20',
    'endyear':'2016',
    'endmonth':'4',
    'endmonth':'21'
    }

    url = 'http://www.shian.gov.cn/web/jghq_static.aspx'
    web_data = requests.post(url,headers=headers)
    web_data.encoding = 'gb2312'
    soup = BeautifulSoup(web_data.text,'lxml')
    caspartse · #4 · 2016-04-21 12:10:23 +08:00 · ❤️ 1
    @Gary_Cheung

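    import requests

    # Form fields belong in the request body, matching the POST captured in reply #1.
    data = {
        'goodsname': '',
        'goodstype': '00',
        'beginyear': '2016',
        'beginmonth': '4',
        'beginday': '20',
        'endyear': '2016',
        'endmonth': '4',
        'endday': '21',  # end day of the range, as in the captured POST body
    }
    url = 'http://www.shian.gov.cn/web/jghq_static.aspx'
    web_data = requests.post(url, data=data)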

    It's data, not headers.
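
To see why the distinction matters, it helps to inspect a request before it is sent. The sketch below uses the standard library's `urllib.request.Request` as a stand-in for `requests.post(url, data=...)`, so it runs without network access or third-party packages. It shows the form fields ending up in the URL-encoded body, which is where the server reads them from, and not among the HTTP headers.

```python
from urllib.parse import urlencode
from urllib.request import Request

url = "http://www.shian.gov.cn/web/jghq_static.aspx"
form = {
    "goodsname": "",
    "goodstype": "00",
    "beginyear": "2016",
    "beginmonth": "4",
    "beginday": "20",
    "endyear": "2016",
    "endmonth": "4",
    "endday": "21",
}

# Encode the fields as application/x-www-form-urlencoded bytes.
body = urlencode(form).encode("ascii")

# Build the request object only; nothing is sent over the network here.
req = Request(url, data=body, method="POST")

print(req.get_method())          # HTTP method of the prepared request
print(req.data.decode("ascii"))  # the form fields, carried in the body
print(req.header_items())        # headers set so far; no form fields here
```

Passing the same dict as headers would instead produce header lines named `goodsname`, `goodstype`, and so on, with an empty body, which the server's form handling never reads.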