
A roundup of ways to fetch web page content with PHP

September 5, 2014

①、Fetching web page content with PHP

http://hi.baidu.com/quqiufeng/blog/item/7e86fb3f40b598c67d1e7150.html

header("Content-type: text/html; charset=utf-8");
1、
$xhr = new COM("MSXML2.XMLHTTP");
$xhr->open("GET","http://localhost/xxx.php?id=2",false);
$xhr->send();
echo $xhr->responseText

2、Using file_get_contents
<?php
$url="http://www.blogjava.net/pts";
echo file_get_contents( $url );
?>

3、Using fopen() with stream_get_contents
<?php
if ($stream = fopen('http://www.sohu.com', 'r')) {
    // print all the page starting at the offset 10
    echo stream_get_contents($stream, -1, 10);
    fclose($stream);
}

if ($stream = fopen('http://www.sohu.net', 'r')) {
    // print the first 5 bytes
    echo stream_get_contents($stream, 5);
    fclose($stream);
}
?>
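Both file_get_contents() and fopen() accept a stream context, which is the usual way to set a timeout or a User-Agent when using the HTTP wrappers. A minimal sketch, assuming a placeholder URL and header values of your own choosing:
<?php
// Hypothetical target URL; replace with the page you actually want to fetch.
$url = "http://www.example.com/";

// Stream context: 10-second timeout and a custom User-Agent header.
$context = stream_context_create(array(
    'http' => array(
        'timeout' => 10,
        'header'  => "User-Agent: MyFetcher/1.0\r\n",
    ),
));

// file_get_contents() returns false on failure, so check before using the result.
$html = @file_get_contents($url, false, $context);
if ($html === false) {
    die("Failed to fetch " . $url);
}
echo $html;
?>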

②、Fetching web page content with PHP (another write-up)

http://www.blogjava.net/pts/archive/2007/08/26/99188.html

The simple approaches shown there are the same file_get_contents() and fopen()/stream_get_contents() snippets already listed in section ① above, so they are not repeated here.

③、Fetching site content with PHP and saving it to a TXT file

http://blog.chinaunix.net/u1/44325/showart_348444.html

<?php
$my_book_url = 'http://book.yunxiaoge.com/files/article/html/4/4550/index.html';
// Extract the book's base URL (ereg() was removed in PHP 7, so preg_match() is used here)
preg_match("#http://book\.yunxiaoge\.com/files/article/html/[0-9]+/[0-9]+/#", $my_book_url, $myBook);
$my_book_txt = $myBook[0];

$file_handle = fopen($my_book_url, "r");          // open the book's index page
if (file_exists("test.txt")) unlink("test.txt");  // start with a fresh output file
$handle = fopen("test.txt", 'a');                 // open the output file once, outside the loop

while (!feof($file_handle)) {                     // loop until the end of the index page
    $line = fgets($file_handle);                  // read one line
    // Look for links to the book's chapter pages
    $line1 = preg_match("/href=\"[0-9]+\.html/", $line, $reg);
    if ($line1) {
        $my_book_txt_url = $reg[0];               // keep the match for further processing
        $my_book_txt_url = str_replace("href=\"", "", $my_book_txt_url);
        $my_book_txt_over_url = "$my_book_txt$my_book_txt_url"; // build the absolute chapter URL
        echo "$my_book_txt_over_url</p>";         // show progress
        $file_handle_txt = fopen($my_book_txt_over_url, "r");   // open the chapter page
        while (!feof($file_handle_txt)) {
            $line_txt = fgets($file_handle_txt);
            // Lines of body text on this site start with &nbsp;
            $line1 = preg_match("/^&nbsp.+/", $line_txt, $reg);
            if ($line1) {
                $my_over_txt = $reg[0];
                // Strip markup and entities from the captured text
                $my_over_txt = str_replace("&nbsp;&nbsp;&nbsp;&nbsp;", "    ", $my_over_txt);
                $my_over_txt = str_replace("<br />", "", $my_over_txt);
                $my_over_txt = str_replace("<script language=\"javascript\">", "", $my_over_txt);
                $my_over_txt = str_replace("&quot;", "", $my_over_txt);
                fwrite($handle, "$my_over_txt\n");  // append the cleaned line to test.txt
            }
        }
        fclose($file_handle_txt);                 // close the chapter page
    }
}
fclose($handle);
fclose($file_handle);                             // close the index page
echo "Done</p>";
?>
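The regex-based link scan above is fragile. As an alternative, a DOM-based sketch of the same chapter-link extraction, using the standard DOMDocument extension and assuming the same index URL:
<?php
// A minimal sketch: collect chapter links with the DOM extension instead of regexes.
$index_url = 'http://book.yunxiaoge.com/files/article/html/4/4550/index.html';
$html = file_get_contents($index_url);

$doc = new DOMDocument();
@$doc->loadHTML($html);              // suppress warnings triggered by sloppy markup

$links = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    // Chapter pages on this site are numeric .html files, e.g. 123456.html
    if (preg_match('/^[0-9]+\.html$/', $href)) {
        $links[] = $href;
    }
}
print_r($links);
?>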

Next is a more heavy-duty approach.
It uses a class called Snoopy.
I first saw it mentioned here:
Snoopy for fetching web page content in PHP
http://blog.declab.com/read.php/27.htm
Then the Snoopy project page:
http://sourceforge.net/projects/snoopy/
And some brief notes here:
Code snippets - the Snoopy class and basic usage
http://blog.passport86.com/?p=161
Download: http://sourceforge.net/projects/snoopy/

I only discovered this gem today and downloaded it right away to take a look; internally it is built on parse_url.
Personally, I am still more comfortable with cURL.
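For reference, a minimal cURL sketch of the same kind of fetch (the URL is only a placeholder):
<?php
// Minimal cURL fetch, assuming the curl extension is enabled.
$ch = curl_init("http://www.example.com/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
curl_setopt($ch, CURLOPT_TIMEOUT, 10);            // give up after 10 seconds
curl_setopt($ch, CURLOPT_USERAGENT, "MyFetcher/1.0");
$html = curl_exec($ch);
if ($html === false) {
    die("cURL error: " . curl_error($ch));
}
curl_close($ch);
echo $html;
?>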

Snoopy is a PHP class that simulates the behaviour of a web browser; it automates tasks such as fetching web page content and submitting forms.
Some of its features:
1、easily fetch the contents of a web page
2、easily fetch the text of a web page (with the HTML stripped)
3、easily fetch the links in a web page
4、proxy host support
5、basic user/password authentication
6、custom user agent, referer, cookies and header content
7、browser redirects, with a controllable redirect depth
8、expands the links in a page into fully qualified URLs (the default)
9、easy form submission and retrieval of the results
10、follows HTML frames (added in v0.92)
11、passes cookies along on redirects

See the documentation bundled with the download for detailed usage.

<?php
include "Snoopy.class.php";
$snoopy = new Snoopy;
$snoopy->fetchform("http://www.phpx.com/happy/logging.php?action=login");
print $snoopy->results;
?>
<?php
include "Snoopy.class.php";
$snoopy = new Snoopy;
$submit_url = "http://www.phpx.com/happy/logging.php?action=login";
$submit_vars["loginmode"]   = "normal";
$submit_vars["styleid"]     = "1";
$submit_vars["cookietime"]  = "315360000";
$submit_vars["loginfield"]  = "username";
$submit_vars["username"]    = "********"; // your username
$submit_vars["password"]    = "*******";  // your password
$submit_vars["questionid"]  = "0";
$submit_vars["answer"]      = "";
$submit_vars["loginsubmit"] = "提 交";    // the forum's submit button label (partly garbled in the original post)
$snoopy->submit($submit_url, $submit_vars);
print $snoopy->results;
?>
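Before trusting $snoopy->results it helps to check the status and headers the server sent back. A small sketch using the class variables documented in the README below (the form variables are abbreviated here):
<?php
include "Snoopy.class.php";
$snoopy = new Snoopy;

$submit_url  = "http://www.phpx.com/happy/logging.php?action=login";
$submit_vars = array("loginmode" => "normal", "username" => "********", "password" => "*******");

if ($snoopy->submit($submit_url, $submit_vars)) {
    echo "response code: " . $snoopy->response_code . "\n"; // HTTP status returned by the server
    foreach ($snoopy->headers as $key => $val) {            // response headers
        echo $key . ": " . $val . "\n";
    }
    echo htmlspecialchars($snoopy->results);                // the page returned after the POST
} else {
    echo "error fetching document: " . $snoopy->error . "\n";
}
?>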

Below is the Snoopy README:
NAME:

    Snoopy - the PHP net client v1.2.4
   
SYNOPSIS:

    include "Snoopy.class.php";
    $snoopy = new Snoopy;
   
    $snoopy->fetchtext("http://www.php.net/");
    print $snoopy->results;
   
    $snoopy->fetchlinks("http://www.phpbuilder.com/");
    print $snoopy->results;
   
    $submit_url = "http://lnk.ispi.net/texis/scripts/msearch/netsearch.html";
   
    $submit_vars["q"] = "amiga";
    $submit_vars["submit"] = "Search!";
    $submit_vars["searchhost"] = "Altavista";
       
    $snoopy->submit($submit_url,$submit_vars);
    print $snoopy->results;
   
    $snoopy->maxframes=5;
    $snoopy->fetch("http://www.ispi.net/");
    echo "<PRE>\n";
    echo htmlentities($snoopy->results[0]);
    echo htmlentities($snoopy->results[1]);
    echo htmlentities($snoopy->results[2]);
    echo "</PRE>\n";

    $snoopy->fetchform("http://www.altavista.com");
    print $snoopy->results;

DESCRIPTION:

    What is Snoopy?
   
    Snoopy is a PHP class that simulates a web browser. It automates the
    task of retrieving web page content and posting forms, for example.

    Some of Snoopy's features:
   
    * easily fetch the contents of a web page
    * easily fetch the text from a web page (strip html tags)
    * easily fetch the links from a web page
    * supports proxy hosts
    * supports basic user/pass authentication
    * supports setting user_agent, referer, cookies and header content
    * supports browser redirects, and controlled depth of redirects
    * expands fetched links to fully qualified URLs (default)
    * easily submit form data and retrieve the results
    * supports following html frames (added v0.92)
    * supports passing cookies on redirects (added v0.92)
   
   
REQUIREMENTS:

    Snoopy requires PHP with PCRE (Perl Compatible Regular Expressions),
    which should be PHP 3.0.9 and up. For read timeout support, it requires
    PHP 4 Beta 4 or later. Snoopy was developed and tested with PHP 3.0.12.

CLASS METHODS:

    fetch($URI)
    -----------
   
    This is the method used for fetching the contents of a web page.
    $URI is the fully qualified URL of the page to fetch.
    The results of the fetch are stored in $this->results.
    If you are fetching frames, then $this->results
    contains each frame fetched, in an array.
       
    fetchtext($URI)
    ---------------   
   
    This behaves exactly like fetch() except that it only returns
    the text from the page, stripping out html tags and other
    irrelevant data.       

    fetchform($URI)
    ---------------   
   
    This behaves exactly like fetch() except that it only returns
    the form elements from the page, stripping out html tags and other
    irrelevant data.       

    fetchlinks($URI)
    ----------------

    This behaves exactly like fetch() except that it only returns
    the links from the page. By default, relative links are
    converted to their fully qualified URL form.

    submit($URI,$formvars)
    ----------------------
   
    This submits a form to the specified $URI. $formvars is an
    array of the form variables to pass.
       
       
    submittext($URI,$formvars)
    --------------------------

    This behaves exactly like submit() except that it only returns
    the text from the page, stripping out html tags and other
    irrelevant data.       

    submitlinks($URI)
    ----------------

    This behaves exactly like submit() except that it only returns
    the links from the page. By default, relative links are
    converted to their fully qualified URL form.

CLASS VARIABLES:    (default value in parenthesis)

    $host            the host to connect to
    $port            the port to connect to
    $proxy_host        the proxy host to use, if any
    $proxy_port        the proxy port to use, if any
    $agent            the user agent to masquerade as (Snoopy v0.1)
    $referer        referer information to pass, if any
    $cookies        cookies to pass if any
    $rawheaders        other header info to pass, if any
    $maxredirs        maximum redirects to allow. 0=none allowed. (5)
    $offsiteok        whether or not to allow redirects off-site. (true)
    $expandlinks    whether or not to expand links to fully qualified URLs (true)
    $user            authentication username, if any
    $pass            authentication password, if any
    $accept            http accept types (image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*)
    $error            where errors are sent, if any
    $response_code    response code returned from server
    $headers        headers returned from server
    $maxlength        max return data length
    $read_timeout    timeout on read operations (requires PHP 4 Beta 4+)
                    set to 0 to disallow timeouts
    $timed_out        true if a read operation timed out (requires PHP 4 Beta 4+)
    $maxframes        number of frames we will follow
    $status            http status of fetch
    $temp_dir        temp directory that the webserver can write to. (/tmp)
    $curl_path        system path to cURL binary, set to false if none
   

EXAMPLES:

    Example:     fetch a web page and display the return headers and
                the contents of the page (html-escaped):
   
    include "Snoopy.class.php";
    $snoopy = new Snoopy;
   
    $snoopy->user = "joe";
    $snoopy->pass = "bloe";
   
    if($snoopy->fetch("http://www.slashdot.org/"))
    {
        echo "response code: ".$snoopy->response_code."<br>\n";
        while(list($key,$val) = each($snoopy->headers))
            echo $key.": ".$val."<br>\n";
        echo "<p>\n";
       
        echo "<PRE>".htmlspecialchars($snoopy->results)."</PRE>\n";
    }
    else
        echo "error fetching document: ".$snoopy->error."\n";

    Example:    submit a form. and print out the result headers
                and html-escaped page:

    include "Snoopy.class.php";
    $snoopy = new Snoopy;
   
    $submit_url = "http://lnk.ispi.net/texis/scripts/msearch/netsearch.html";
   
    $submit_vars["q"] = "amiga";
    $submit_vars["submit"] = "Search!";
    $submit_vars["searchhost"] = "Altavista";

       
    if($snoopy->submit($submit_url,$submit_vars))
    {
        while(list($key,$val) = each($snoopy->headers))
            echo $key.": ".$val."<br>\n";
        echo "<p>\n";
       
        echo "<PRE>".htmlspecialchars($snoopy->results)."</PRE>\n";
    }
    else
        echo "error fetching document: ".$snoopy->error."\n";

    Example:    showing functionality of all the variables:
   

    include "Snoopy.class.php";
    $snoopy = new Snoopy;

    $snoopy->proxy_host = "my.proxy.host";
    $snoopy->proxy_port = "8080";
   
    $snoopy->agent = "(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)";
    $snoopy->referer = "http://www.microsnot.com/";
   
    $snoopy->cookies["SessionID"] = 238472834723489l;
    $snoopy->cookies["favoriteColor"] = "RED";
   
    $snoopy->rawheaders["Pragma"] = "no-cache";
   
    $snoopy->maxredirs = 2;
    $snoopy->offsiteok = false;
    $snoopy->expandlinks = false;
   
    $snoopy->user = "joe";
    $snoopy->pass = "bloe";
   
    if($snoopy->fetchtext("http://www.phpbuilder.com"))
    {
        while(list($key,$val) = each($snoopy->headers))
            echo $key.": ".$val."<br>\n";
        echo "<p>\n";
       
        echo "<PRE>".htmlspecialchars($snoopy->results)."</PRE>\n";
    }
    else
        echo "error fetching document: ".$snoopy->error."\n";

    Example:     fetch framed content and display the results
   
    include "Snoopy.class.php";
    $snoopy = new Snoopy;
   
    $snoopy->maxframes = 5;
   
    if($snoopy->fetch("http://www.ispi.net/"))
    {
        echo "<PRE>".htmlspecialchars($snoopy->results[0])."</PRE>\n";
        echo "<PRE>".htmlspecialchars($snoopy->results[1])."</PRE>\n";
        echo "<PRE>".htmlspecialchars($snoopy->results[2])."</PRE>\n";
    }
    else
        echo "error fetching document: ".$snoopy->error."\n";

 

 

Below is a longer example, a small crawler script (PHP code):
<?php

// Collect all content URLs and save them to a file
function get_index($save_file, $prefix = "index_"){
    $count = 68;
    $i = 1;
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open ". $save_file ." failed");
    while($i < $count){
        $url = $prefix . $i . ".htm";
        echo "Get ". $url ."...";
        // note: get_content_url() also expects the base URL as its first argument
        $url_str = get_content_url(get_url($url));
        echo " OK\n";
        fwrite($fp, $url_str);
        ++$i;
    }
    fclose($fp);
}

// Fetch the target multimedia objects
function get_object($url_file, $save_file, $split = "|--:**:--|"){
    if (!file_exists($url_file)) die($url_file ." not exist");
    $file_arr = file($url_file);
    if (!is_array($file_arr) || empty($file_arr)) die($url_file ." not content");
    $url_arr = array_unique($file_arr);
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open save file ". $save_file ." failed");
    foreach($url_arr as $url){
        if (empty($url)) continue;
        echo "Get ". $url ."...";
        $html_str = get_url($url);
        // Debugging leftovers from the original script; they dump the first page and stop.
        // Uncomment them if you need to inspect a page:
        //echo $html_str;
        //echo $url;
        //exit;
        $obj_str = get_content_object($html_str);
        echo " OK\n";
        fwrite($fp, $obj_str);
    }
    fclose($fp);
}

// Walk a directory and extract from each file's contents
function get_dir($save_file, $dir){
    $dp = opendir($dir);
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open save file ". $save_file ." failed");
    while(($file = readdir($dp)) !== false){
        if ($file != "." && $file != ".."){
            echo "Read file ". $file ."...";
            $file_content = file_get_contents($dir . $file);
            $obj_str = get_content_object($file_content);
            echo " OK\n";
            fwrite($fp, $obj_str);
        }
    }
    fclose($fp);
}

// Fetch the contents of the given URL
function get_url($url){
    $reg = '/^http:\/\/[^\/].+$/';
    if (!preg_match($reg, $url)) die($url ." invalid");
    $fp = fopen($url, "r") or die("Open url: ". $url ." failed.");
    $content = "";
    while($fc = fread($fp, 8192)){
        $content .= $fc;
    }
    fclose($fp);
    if (empty($content)){
        die("Get url: ". $url ." content failed.");
    }
    return $content;
}

// Fetch the given page using a raw socket
function get_content_by_socket($url, $host){
    $fp = fsockopen($host, 80) or die("Open ". $url ." failed");
    // $url here is the path portion of the request (without the leading slash)
    $header  = "GET /". $url ." HTTP/1.1\r\n";
    $header .= "Accept: */*\r\n";
    $header .= "Accept-Language: zh-cn\r\n";
    $header .= "Accept-Encoding: gzip, deflate\r\n";
    $header .= "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon; InfoPath.1; .NET CLR 2.0.50727)\r\n";
    $header .= "Host: ". $host ."\r\n";
    $header .= "Connection: Keep-Alive\r\n";
    //$header .= "Cookie: cnzz02=2; rtime=1; ltime=1148456424859; cnzz_eid=56601755-\r\n\r\n";
    $header .= "Connection: Close\r\n\r\n";

    fwrite($fp, $header);
    $contents = "";
    while (!feof($fp)) {
        $contents .= fgets($fp, 8192);
    }
    fclose($fp);
    return $contents;
}

// Extract URLs from the given content
function get_content_url($host_url, $file_contents){
    //$reg = '/^(#|javascript.*?|ftp:\/\/.+|http:\/\/.+|.*?href.*?|play.*?|index.*?|.*?asp)+$/i';
    //$reg = '/^(down.*?\.html|\d+_\d+\.htm.*?)$/i';
    $rex = "/([hH][rR][eE][Ff])\s*=\s*['\"]*([^>'\"\s]+)[\"'>]*\s*/i";
    $reg = '/^(down.*?\.html)$/i';
    preg_match_all($rex, $file_contents, $r);
    $result = ""; //array();
    foreach($r as $c){
        if (is_array($c)){
            foreach($c as $d){
                if (preg_match($reg, $d)){ $result .= $host_url . $d . "\n"; }
            }
        }
    }
    return $result;
}

// Extract multimedia entries from the given content
function get_content_object($str, $split = "|--:**:--|"){
    $regx = "/href\s*=\s*['\"]*([^>'\"\s]+)[\"'>]*\s*(<b>.*?<\/b>)/i";
    preg_match_all($regx, $str, $result);

    if (count($result) == 3){
        // strip the "<b>多媒体: " label that appears on the source pages
        $result[2] = str_replace("<b>多媒体: ", "", $result[2]);
        $result[2] = str_replace("</b>", "", $result[2]);
        $result = $result[1][0] . $split . $result[2][0] . "\n";
    }
    return $result;
}

?>
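A hedged usage sketch for the functions above; the file names and the site URL are assumptions, not from the original post:
<?php
// Step 1: walk index_1.htm .. index_67.htm on the target site and save candidate URLs.
// The prefix must be a full URL because get_url() rejects anything not starting with http://.
get_index("urls.txt", "http://www.example.com/list/index_");

// Step 2: fetch every collected URL and extract the multimedia entries from each page.
get_object("urls.txt", "objects.txt");

// Alternative: if the pages were already downloaded to a local directory, parse them from disk.
get_dir("objects.txt", "./pages/");
?>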

Grabbing a specific div block and images from a web page with PHP

(2009-06-05 09:56:23)


1. Get all the images in a given page:
<?php
// Fetch the page at the given address and store it in $text
$text=file_get_contents('http://andy.diimii.com/');

// Match every img tag and store the matches in the array $match
// (the original post reused preg_match here; preg_match_all is needed to get them all)
preg_match_all('/<img[^>]*>/Ui',$text, $match);

// Print $match
print_r($match);
?>

-----------------
2. Get the first image in a given page:
<?php
// Fetch the page at the given address and store it in $text
$text=file_get_contents('http://andy.diimii.com/');

// Match the first img tag and store it in the array $match
preg_match('/<img[^>]*>/Ui',$text, $match);

// Print $match
print_r($match);
?>

------------------------------------

3. Get a specific div block from a page (identified by its id):
<?php
// Fetch the page at the given address and store it in $text
$text=file_get_contents('http://andy.diimii.com/2009/01/seo%e5%8c%96%e7%9a%84%e9%97%9c%e9%8d%b5%e5%ad%97%e5%bb%a3%e5%91%8a%e9%80%a3%e7%b5%90/');

// Strip newlines and whitespace (only needed when the content is serialized)
//$text=str_replace(array("\r","\n","\t","\s"),'', $text);

// Extract the contents of the div whose id is PostContent and store it in the array $match
preg_match('/<div[^>]*id="PostContent"[^>]*>(.*?)<\/div>/si',$text,$match);

// Print $match[0]
print($match[0]);
?>
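Regexes like the one above stop at the first </div>, so nested divs will truncate the result. A DOM-based sketch of the same extraction, assuming the same PostContent id (DOMDocument and DOMXPath are standard PHP extensions):
<?php
// Fetch the page and load it into a DOM tree; warnings from sloppy HTML are suppressed
$text = file_get_contents('http://andy.diimii.com/2009/01/seo%e5%8c%96%e7%9a%84%e9%97%9c%e9%8d%b5%e5%ad%97%e5%bb%a3%e5%91%8a%e9%80%a3%e7%b5%90/');
$doc = new DOMDocument();
@$doc->loadHTML($text);

// Query the div whose id is PostContent, including any nested markup it contains
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@id="PostContent"]');
if ($nodes->length > 0) {
    // saveHTML() with a node argument (PHP 5.3.6+) serializes just that element
    echo $doc->saveHTML($nodes->item(0));
}
?>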

-------------------------------------------
4. 上述2及3的結合:
<?php
//取得指定位址的內容,並儲存至text
$text=file_get_contents('http://andy.diimii.com/2009/01/seo%e5%8c%96%e7%9a%84%e9%97%9c%e9%8d%b5%e5%ad%97%e5%bb%a3%e5%91%8a%e9%80%a3%e7%b5%90/');   

//取出div標籤且id為PostContent的內容,並儲存至陣列match
preg_match('/<div[^>]*id="PostContent"[^>]*>(.*?)<\/div>/si',$text,$match);  

//取得第一個img標籤,並儲存至陣列match2
