Scraping the Naruto Manga with WebHarvest



October 7, 2012

Think Naruto updates too slowly? Annoyed that those manga sites won't let you download the pages? Take a look at this ^_^

P.S. Web-Harvest: http://web-harvest.sourceforge.net

1. Logic file

<?xml version="1.0" encoding="UTF-8"?>
<config>
    <include path="functions.xml"/>

    <!-- Maximum number of index pages to walk through (passed on as maxloops) -->
    <var-def name="num" overwrite="false">1</var-def>

    <loop index="i" item="url">
        <list>
            <!-- Link texts collected from the comic index pages -->
            <var-def name="imagelinks">
                <call name="download-multipage-list">
                    <call-param name="pageUrl"><template>http://www.narutom.com/comic/index.html</template></call-param>
                    <call-param name="nextXPath">//div[@class='pagenav']/a[last()-1]/@href</call-param>
                    <call-param name="itemXPath">//div[@id='dm_name']/ul/li/a/text()</call-param>
                    <call-param name="maxloops"><template>${num}</template></call-param>
                </call>
            </var-def>
        </list>
        <body>
            <empty>
                <!-- Pull the chapter number out of the link text -->
                <var-def name="ordinal">
                    <regexp>
                        <regexp-pattern>^\D*(\d*)?\D*$</regexp-pattern>
                        <regexp-source><template>${url}</template></regexp-source>
                        <regexp-result>
                            <template>${_1}</template>
                        </regexp-result>
                    </regexp>
                </var-def>

                <!-- Download the pages of this chapter into a directory named after the link text -->
                <call name="getComic">
                    <call-param name="fromNum"><template>${ordinal}</template></call-param>
                    <call-param name="directory"><template>${url}</template></call-param>
                </call>
            </empty>
        </body>
    </loop>
</config>
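To see what the regexp step in the logic file does, here is a rough Python sketch of the same idea. It assumes the intended pattern is `^\D*(\d*)?\D*$` (with backslashes) and that each item from the index page is a chapter title with a single number in it. The function name `extract_ordinal` and the sample title are made up for illustration; they are not part of the Web-Harvest config.

```python
import re

# Assumed to be the pattern the <regexp-pattern> element is meant to hold.
ORDINAL_PATTERN = re.compile(r"^\D*(\d*)?\D*$")


def extract_ordinal(title: str) -> str:
    """Return the single run of digits in title, or "" if the pattern does not apply.

    The pattern only matches titles with at most one run of digits,
    so a title with two separate numbers yields "".
    """
    match = ORDINAL_PATTERN.match(title)
    if match is None:
        return ""
    return match.group(1) or ""


if __name__ == "__main__":
    # Hypothetical title; the real link texts on the site may look different.
    print(extract_ordinal("火影忍者第604话"))
```

One thing worth noticing: because the pattern is anchored at both ends, a title that contains two separate numbers does not match at all, so the chapter number would come back empty for such a title.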

2. Function library file (functions.xml)

<?xml version="1.0" encoding="UTF-8"?>
<config>

    <!-- Collects the itemXPath matches from pageUrl and from each following page, up to maxloops pages -->
    <function name="download-multipage-list">
        <return>
            <while condition="${pageUrl.toString().length() != 0}" maxloops="${maxloops}" index="i">
                <empty>
                    <var-def name="content">
                        <html-to-xml>
                            <http url="${pageUrl}" charset="gb2312"/>
                        </html-to-xml>
                    </var-def>
                    <var-def name="nextLinkUrl">
                        <xpath expression="${nextXPath}">
                            <var name="content"/>
                        </xpath>
                    </var-def>
                    <var-def name="pageUrl">
                        <template>${nextLinkUrl.toString()}</template>
                    </var-def>
                </empty>
                <xpath expression="${itemXPath}">
                    <var name="content"/>
                </xpath>
            </while>
        </return>
    </function>

    <!-- Saves the page images of chapter fromNum as ${j}.png under ${directory}; stops when j reaches 20 -->
    <function name="getComic">
        <while index="j" condition="${j.toInt() != 20}">
            <var-def name="pageUrl">
                <template>http://wt2.narutom.com/d/manhua/naruto/${fromNum}/${j}.png</template>
            </var-def>
            <file action="write" path='/home/xyzqing/webharvest/naruto/naruto/${directory}/${j}.png' type="binary">
                <http url="${pageUrl}"/>
            </file>
        </while>
    </function>
</config>
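If the control flow of the two functions is hard to follow in XML, the following Python sketch may help. It is only an approximation of how I read the config, not a replacement for it. `fetch_page` is a hypothetical stand-in for the `http` + `html-to-xml` + `xpath` steps, and I am assuming the `while` index `j` starts at 1, so the `j != 20` condition gives pages 1 to 19.

```python
from pathlib import PurePosixPath
from typing import Callable, List, Tuple

# (items found on the page, URL of the next page or "" if there is none)
Page = Tuple[List[str], str]

BASE_URL = "http://wt2.narutom.com/d/manhua/naruto"
SAVE_ROOT = PurePosixPath("/home/xyzqing/webharvest/naruto/naruto")


def download_multipage_list(page_url: str, max_loops: int,
                            fetch_page: Callable[[str], Page]) -> List[str]:
    """Follow "next page" links, collecting items, for at most max_loops pages."""
    items: List[str] = []
    loops = 0
    while page_url != "" and loops < max_loops:
        page_items, next_url = fetch_page(page_url)  # hypothetical fetch step
        items.extend(page_items)
        page_url = next_url
        loops += 1
    return items


def page_jobs(from_num: str, directory: str, stop_at: int = 20) -> List[Tuple[str, str]]:
    """List the (image URL, target path) pairs that the getComic loop walks through."""
    jobs = []
    j = 1  # assumption: the Web-Harvest loop index starts at 1
    while j != stop_at:
        url = f"{BASE_URL}/{from_num}/{j}.png"
        path = SAVE_ROOT / directory / f"{j}.png"
        jobs.append((url, str(path)))
        j += 1
    return jobs


if __name__ == "__main__":
    # Fake two-page index, used instead of real HTTP requests.
    fake_site = {"p1": (["ch 1", "ch 2"], "p2"), "p2": (["ch 3"], "")}
    print(download_multipage_list("p1", 5, fake_site.__getitem__))
    print(page_jobs("604", "ch604")[0])
```

Note that the page count is fixed in this reading: a chapter with more than 19 pages would be cut short, and a shorter one would produce requests for pages that do not exist.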

3. Screenshot of the result