PHP screen scraping: How to implement a web scraper in PHP?

Published: 2025/4/16


Which built-in PHP functions are useful for web scraping? What are some good resources (web or print) for getting up to speed on web scraping with PHP?

I'd like to recommend a class I came across recently: Simple HTML DOM Parser.

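PHP is a particularly bad language for this. It lacks an event-driven framework, which is almost a necessity for this task. Can you crawl one site with it? Yes. Will you crawl a lot of sites? No.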

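@EvanCarroll Would cURL and DOMDocument be suitable for getting product prices and images from multiple websites (to output on my own site)? For example, this Stack Overflow link. If not, what would you suggest?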

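Try it; if it works, it is good enough for you. Node is a better choice for building a web scraper. There is also phantom.js, if you need something modern that really has a DOM and runs the JavaScript on it.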
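For reference, the cURL plus DOMDocument approach asked about in this exchange might look roughly like the sketch below. The helper names (`fetchHtml`, `extractProduct`) and the `product` and `price` class names are assumptions made for illustration; real sites will need their own XPath expressions.

```php
<?php
// Hypothetical helper: download a page with cURL, returning null on failure.
function fetchHtml(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

// Extract a price and an image URL from an HTML string.
// The class names "product" and "price" are assumptions about the markup.
function extractProduct(string $html): array
{
    $previous = libxml_use_internal_errors(true); // tolerate sloppy real-world HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // string() returns the text of the first matching node, or '' if none match.
    $price = $xpath->evaluate('string(//*[@class="price"])');
    $image = $xpath->evaluate('string(//*[@class="product"]//img/@src)');

    return ['price' => trim($price), 'image' => $image];
}

// Offline demonstration; in real use the HTML would come from fetchHtml($url).
$sample = '<div class="product"><img src="/img/widget.png" alt="Widget">'
        . '<span class="price"> $19.99 </span></div>';
print_r(extractProduct($sample));
```

Keeping the download and the parsing in separate functions means the parsing can be tried against saved HTML without touching the network.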

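Scraping generally involves 3 steps:

First, you GET or POST your request to a specified URL.

Next, you receive the HTML that is returned as the response.

Finally, you parse the text you want to scrape out of that HTML.

To accomplish steps 1 and 2, below is a simple PHP class that uses cURL to fetch web pages using either GET or POST. Once you have the HTML back, you use regular expressions to accomplish step 3 by parsing out the text you want.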

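For regular expressions, my favorite tutorial site is the following: Regular Expressions Tutorial

My favorite program for working with regexes is RegexBuddy. I suggest you try the demo of that product even if you have no intention of buying it. It is an invaluable tool and will even generate code for the regexes you write in your language of choice (including PHP).

Usage: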

$curl = new Curl();

$html = $curl->get("http://www.google.com");

// now, do your regex work against $html

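As an illustration of the "regex work" in step 3, here is a minimal sketch that pulls link targets out of an HTML string with `preg_match_all`. The function name `extractLinks` and the sample markup are made up for this example, and the pattern only handles double-quoted `href` attributes.

```php
<?php
// Return every href value found in anchor tags of $html.
function extractLinks(string $html): array
{
    // Capture group 1 holds the URL; only double-quoted attributes are matched.
    preg_match_all('/<a\s[^>]*href="([^"]*)"/i', $html, $matches);
    return $matches[1];
}

// Sample markup standing in for the $html returned by the fetch step.
$html = '<p><a href="/one">One</a> and '
      . '<a class="x" href="https://example.com/two">Two</a></p>';
print_r(extractLinks($html));
```

Returning `$matches[1]` mirrors what the `getAll()` method of the class does: the caller gets only the captured group, not the full match.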

PHP class:

class Curl
{
    public $curl;           // current cURL handle, created by get() or postForm()
    public $cookieJar = ""; // file used to store and read cookies

    public function __construct($cookieJarFile = 'cookies.txt') {
        $this->cookieJar = $cookieJarFile;
    }

    // Apply browser-like headers, cookie handling and redirect options to the current handle.
    function setup()
    {
        $header = array();
        $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
        $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
        $header[] = "Cache-Control: max-age=0";
        $header[] = "Connection: keep-alive";
        $header[] = "Keep-Alive: 300";
        $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
        $header[] = "Accept-Language: en-us,en;q=0.5";
        $header[] = "Pragma:"; // browsers keep this blank.

        curl_setopt($this->curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7');
        curl_setopt($this->curl, CURLOPT_HTTPHEADER, $header);
        curl_setopt($this->curl, CURLOPT_COOKIEJAR, $this->cookieJar);
        curl_setopt($this->curl, CURLOPT_COOKIEFILE, $this->cookieJar);
        curl_setopt($this->curl, CURLOPT_AUTOREFERER, true);
        curl_setopt($this->curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($this->curl, CURLOPT_RETURNTRANSFER, true);
    }

    // Fetch $url with GET and return the response body.
    function get($url)
    {
        $this->curl = curl_init($url);
        $this->setup();
        return $this->request();
    }

    // Return the first capture group of every match of $reg in $str.
    function getAll($reg, $str)
    {
        preg_match_all($reg, $str, $matches);
        return $matches[1];
    }

    // POST $fields to $url, optionally sending a Referer header.
    function postForm($url, $fields, $referer = '')
    {
        $this->curl = curl_init($url);
        $this->setup();
        curl_setopt($this->curl, CURLOPT_URL, $url);
        curl_setopt($this->curl, CURLOPT_POST, 1);
        curl_setopt($this->curl, CURLOPT_REFERER, $referer);
        curl_setopt($this->curl, CURLOPT_POSTFIELDS, $fields);
        return $this->request();
    }

    // Return transfer info; 'lasturl' is a shortcut for the final URL after redirects.
    function getInfo($info)
    {
        $info = ($info == 'lasturl') ? curl_getinfo($this->curl, CURLINFO_EFFECTIVE_URL) : curl_getinfo($this->curl, $info);
        return $info;
    }

    // Execute the prepared request.
    function request()
    {
        return curl_exec($this->curl);
    }
}

Hmm, parsing HTML with regex is... well, I'll just let this guy explain: stackoverflow.com/questions/1732348/…

curl_setopt($this->curl,CURLOPT_COOKIEJAR, $this->cookieJar); curl_setopt($this->curl,CURLOPT_COOKIEFILE, $this->cookieJar);

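Also, if you need to GET or POST multiple forms on the same website, $this->curl = curl_init($url); will be a problem, because it opens a new session every time. This init is used in both the get function and the postForm function.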
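One way to address the point about `curl_init` being called on every request is to create the handle once and only change the URL between requests. The sketch below is an assumption about how that could be done, not the answer author's code; the class name `ReusableCurl` and the method `prepare` are hypothetical.

```php
<?php
// Sketch of a client that creates its cURL handle once and reuses it,
// so cookies and connections are shared across requests.
class ReusableCurl
{
    private $handle = null; // created lazily on first use
    private $cookieJar;

    public function __construct(string $cookieJarFile = 'cookies.txt')
    {
        $this->cookieJar = $cookieJarFile;
    }

    // Return the shared handle, pointed at $url for the next request.
    public function prepare(string $url)
    {
        if ($this->handle === null) {
            $this->handle = curl_init();
            curl_setopt($this->handle, CURLOPT_COOKIEJAR, $this->cookieJar);
            curl_setopt($this->handle, CURLOPT_COOKIEFILE, $this->cookieJar);
            curl_setopt($this->handle, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($this->handle, CURLOPT_RETURNTRANSFER, true);
        }
        curl_setopt($this->handle, CURLOPT_URL, $url);
        return $this->handle;
    }

    public function get(string $url)
    {
        $handle = $this->prepare($url);
        curl_setopt($handle, CURLOPT_HTTPGET, true); // switch back to GET after an earlier POST
        return curl_exec($handle);
    }

    public function post(string $url, array $fields)
    {
        $handle = $this->prepare($url);
        curl_setopt($handle, CURLOPT_POST, true);
        curl_setopt($handle, CURLOPT_POSTFIELDS, http_build_query($fields));
        return curl_exec($handle);
    }
}

// Usage (not executed here, since it needs network access):
// $client = new ReusableCurl();
// $client->post('https://example.com/login', ['user' => 'me', 'pass' => 'secret']);
// $html = $client->get('https://example.com/account');
```

Because `CURLOPT_POST` stays set on a reused handle, `get()` resets the method with `CURLOPT_HTTPGET` before each request.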

Excellent code. There is a mistake with the cookie jar and cookie file, though: replace $this->$cookieJar with $this->cookieJar.

I recommend Goutte, a simple PHP web scraper. Example usage:

Create a Goutte Client instance (which extends Symfony\Component\BrowserKit\Client):

use Goutte\Client;

$client = new Client();

Make requests with the request() method:

$crawler = $client->request('GET', 'http://www.symfony-project.org/');

The request method returns a Crawler object (Symfony\Component\DomCrawler\Crawler).

Click on links:

$link = $crawler->selectLink('Plugins')->link();

$crawler = $client->click($link);

Submit forms:

$form = $crawler->selectButton('sign in')->form();

$crawler = $client->submit($form, array('signin[username]' => 'fabien', 'signin[password]' => 'xxxxxx'));

Extract data:

$nodes = $crawler->filter('.error_list');

if ($nodes->count())
{
    die(sprintf("Authentification error: %s\n", $nodes->text()));
}

printf("Nb tasks: %d\n", $crawler->filter('#nb_tasks')->text());

ScraperWiki is a very interesting project. It helps you build scrapers online in Python, Ruby or PHP; I was able to get a simple attempt running within a few minutes.

If you need something that is easy to maintain rather than fast to execute, a scriptable browser such as SimpleTest's may help.

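Scraping can be quite complex, depending on what you want to do. Read this tutorial series on the basics of writing a scraper in PHP and see whether you can get to grips with it.

You can use similar methods to automate form sign-ups and logins, or even to fake clicks on ads! The main limitation of cURL, though, is that it does not run JavaScript, so if you are trying to scrape a site that uses Ajax for pagination, for example, it can get a little tricky... but there are ways around that too!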
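One common workaround for Ajax pagination is to request the JSON endpoint that the page's JavaScript calls, instead of the rendered page. The sketch below assumes a hypothetical endpoint that accepts a `page` query parameter and returns an object with an `items` array; both are assumptions, and the actual URL and response shape have to be read from the browser's network tab for the site in question.

```php
<?php
// Collect items from a paginated JSON endpoint until an empty page is returned.
// $fetch is any callable that takes a URL and returns the response body.
function scrapeAllPages(string $baseUrl, callable $fetch, int $maxPages = 50): array
{
    $items = [];
    for ($page = 1; $page <= $maxPages; $page++) {
        $body = $fetch($baseUrl . '?page=' . $page);
        $data = json_decode($body, true);
        if (!is_array($data) || empty($data['items'])) {
            break; // no more results, or an unexpected response
        }
        $items = array_merge($items, $data['items']);
    }
    return $items;
}

// Stand-in fetcher so the sketch runs offline; replace it with a cURL-based one.
$fakePages = [
    1 => '{"items": ["a", "b"]}',
    2 => '{"items": ["c"]}',
];
$fetch = function (string $url) use ($fakePages): string {
    parse_str((string) parse_url($url, PHP_URL_QUERY), $query);
    return $fakePages[(int) ($query['page'] ?? 0)] ?? '{"items": []}';
};

print_r(scrapeAllPages('https://example.com/api/products', $fetch));
```

Passing the fetcher in as a callable keeps the paging logic independent of how the HTTP request is made, and `$maxPages` guards against an endpoint that never returns an empty page.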

Here is another one: a simple PHP scraper without regex.

The scraper class from my framework:

Example:

$site = $this->load->cls('scraper', 'http://www.anysite.com');

$excss = $site->getExternalCSS();

$incss = $site->getInternalCSS();

$ids = $site->getIds();

$classes = $site->getClasses();

$spans = $site->getSpans();

print '<pre>';

print_r($excss);

print_r($incss);

print_r($ids);

print_r($classes);

print_r($spans);

class scraper
{
    // Despite its name, this holds the downloaded page source, not the URL.
    private $url = '';

    public function __construct($url)
    {
        $this->url = file_get_contents("$url");
    }

    // Each getter returns array(list of matches, number of matches).

    // Values of inline style="..." attributes.
    public function getInternalCSS()
    {
        $tmp = preg_match_all('/(style=")(.*?)(")/is', $this->url, $patterns);
        $result = array();
        array_push($result, $patterns[2]);
        array_push($result, count($patterns[2]));
        return $result;
    }

    // href values that end in .css.
    public function getExternalCSS()
    {
        $tmp = preg_match_all('/(href=")(\w.*\.css)"/i', $this->url, $patterns);
        $result = array();
        array_push($result, $patterns[2]);
        array_push($result, count($patterns[2]));
        return $result;
    }

    // Values of id="..." attributes.
    public function getIds()
    {
        $tmp = preg_match_all('/(id="(\w*)")/is', $this->url, $patterns);
        $result = array();
        array_push($result, $patterns[2]);
        array_push($result, count($patterns[2]));
        return $result;
    }

    // Values of class="..." attributes (single-word values only).
    public function getClasses()
    {
        $tmp = preg_match_all('/(class="(\w*)")/is', $this->url, $patterns);
        $result = array();
        array_push($result, $patterns[2]);
        array_push($result, count($patterns[2]));
        return $result;
    }

    // Contents of <span>...</span> elements.
    public function getSpans()
    {
        $tmp = preg_match_all('/(<span>)(.*)(<\/span>)/', $this->url, $patterns);
        $result = array();
        array_push($result, $patterns[2]);
        array_push($result, count($patterns[2]));
        return $result;
    }
}
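For comparison, the id and class lookups can also be done with PHP's DOM extension rather than regular expressions. This is only a sketch of that alternative, not part of the framework above; the function name `collectIdsAndClasses` is made up. Unlike the `\w*` patterns, it handles single-quoted attributes and attributes that list several class names.

```php
<?php
// Collect every id and every individual class name using DOMXPath.
function collectIdsAndClasses(string $html): array
{
    $previous = libxml_use_internal_errors(true); // suppress warnings for sloppy HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    $ids = [];
    foreach ($xpath->query('//@id') as $attr) {
        $ids[] = $attr->value;
    }

    $classes = [];
    foreach ($xpath->query('//@class') as $attr) {
        // A class attribute may hold several space-separated names.
        foreach (preg_split('/\s+/', trim($attr->value), -1, PREG_SPLIT_NO_EMPTY) as $name) {
            $classes[$name] = true; // array keys de-duplicate the names
        }
    }

    return ['ids' => $ids, 'classes' => array_keys($classes)];
}

$html = "<div id='main' class=\"box wide\"><span class='box'>hi</span></div>";
print_r(collectIdsAndClasses($html));
```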

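I'd use either libcurl or Perl's LWP (libwww for Perl). Is there a libwww for PHP?

If you are going to use LWP, use WWW::Mechanize, which wraps it with handy helper functions.

Mechanize is also available for Ruby, if you are open to something other than PHP.

file_get_contents() can take a remote URL and give you the source. You can then use regular expressions (with the Perl-compatible functions) to grab what you need.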
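A minimal sketch of that approach, assuming the goal is to read a page's title. The function name `fetchTitle`, the timeout and the user agent string are illustrative choices; fetching remote URLs this way also requires `allow_url_fopen` to be enabled.

```php
<?php
// Fetch a URL (or local path) with file_get_contents and pull out the <title> text.
function fetchTitle(string $location): ?string
{
    // The "http" options only apply to http(s) URLs; the values are examples.
    $context = stream_context_create([
        'http' => ['timeout' => 10, 'user_agent' => 'ExampleScraper/1.0'],
    ]);

    $html = @file_get_contents($location, false, $context);
    if ($html === false) {
        return null; // network error, missing file, or allow_url_fopen disabled
    }

    // Perl-compatible regex: non-greedy capture of everything inside <title>.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim(html_entity_decode($m[1]));
    }
    return null;
}

// Offline demonstration using a temporary file in place of a remote URL.
$tmp = tempnam(sys_get_temp_dir(), 'scrape');
file_put_contents($tmp, '<html><head><title> Example &amp; Co </title></head></html>');
var_dump(fetchTitle($tmp));
unlink($tmp);
```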

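Out of curiosity, what are you trying to scrape?

The curl library allows you to download web pages. You should look into regular expressions for doing the scraping.

-1 for recommending regex! Use an HTML parser.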
