Java Web Scraping (Part 1): Downloading Images from an Ordinary Website

Published: 2025/3/20 | java | 豆豆

A brief introduction to crawlers

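A user's machine and a web server interact through a simple exchange: the machine sends a request and the server returns a response.
A crawler imitates the user's machine: it sends a request to the server, receives the response, then parses the response to extract the data we want.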

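The process breaks down into roughly these steps:

Prepare the URL to crawl
Define the crawler's request parameters
Start the crawl
Receive the response data
Parse the data with a selector (this article uses jsoup's CSS selectors)
Extract the data we want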

Preparation

Create a new Maven project and add the following dependencies to pom.xml.

jsoup, the crawling and HTML-parsing library:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>

commons-io, used to write the downloaded bytes to disk:

<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.4</version>
</dependency>

Exercise: scraping images from a website

We will scrape the images on this page:

https://dou.yuanmazg.com/doutu?page=1

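Main steps and code

Pick one image and copy its selector (for example from the browser's developer tools):

#pic-detail > div > div.col-sm-9 > div.page-content > a:nth-child(2) > img

The a:nth-child(2) part pins the selector to one position: it matches only the a element that is the second child of its parent, which is why the copied selector hits a single image.

Remove that positional part and the selector matches every image in the list:

Elements select = dom.select("#pic-detail > div > div.col-sm-9 > div.page-content > a > img");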
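The step of dropping the positional part can be sketched with plain string handling. The helper name generalizeSelector below is hypothetical, introduced only for illustration, and is not part of jsoup; it strips every :nth-child(n) from a copied selector so the result matches all sibling elements instead of one.

```java
import java.util.regex.Pattern;

class SelectorSketch {
    // Matches a positional pseudo-class such as ":nth-child(2)".
    private static final Pattern NTH_CHILD = Pattern.compile(":nth-child\\(\\d+\\)");

    // Hypothetical helper (not a jsoup API): remove every :nth-child(n)
    // so the selector no longer targets a single position.
    static String generalizeSelector(String copiedSelector) {
        return NTH_CHILD.matcher(copiedSelector).replaceAll("");
    }

    public static void main(String[] args) {
        String copied =
                "#pic-detail > div > div.col-sm-9 > div.page-content > a:nth-child(2) > img";
        // Prints the selector with "a:nth-child(2)" reduced to "a".
        System.out.println(generalizeSelector(copied));
    }
}
```

For a single page, editing the selector by hand is just as good; a helper like this only pays off when many copied selectors need the same treatment.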

Define the crawler's parameters

Connection.Response response = Jsoup.connect("https://dou.yuanmazg.com/doutu?page=1")
        .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0")
        .ignoreContentType(true)
        .timeout(10000)
        .execute();

The line

.header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0")

tells the target server that the request comes from an ordinary user's browser.

execute() works like pressing Enter in the address bar: it is the call that actually sends the request.
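To show that the User-Agent is nothing more than an ordinary request header, and that building a request is separate from sending it, here is a minimal sketch. It deliberately uses the JDK's own java.net.http API (Java 11+) rather than jsoup, so it runs without extra dependencies; the class and method names are illustrative assumptions.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

class UserAgentSketch {
    static final String BROWSER_UA =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0";

    // Build (but do not send) a GET request carrying a browser-like User-Agent.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", BROWSER_UA)
                .timeout(Duration.ofSeconds(10)) // 10 seconds, mirroring timeout(10000)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest("https://dou.yuanmazg.com/doutu?page=1");
        // Nothing has gone over the network yet; this only inspects the prepared request.
        System.out.println(request.headers().firstValue("User-Agent").orElse("(none)"));
    }
}
```

In this API the request only goes out when it is handed to HttpClient.send, which plays the same role as execute() in the jsoup chain.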

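Write the file to the target folder

The target folder must already exist, because FileOutputStream does not create missing directories.

byte[] bytes = imgResponse.bodyAsBytes();
try (FileOutputStream out = new FileOutputStream(new File("d://斗圖啦//" + filename))) {
    IOUtils.write(bytes, out); // try-with-resources closes the stream afterwards
}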
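As an alternative sketch of the save step, the JDK's java.nio.file API can create the folder on demand and close the file automatically. The helper name saveBytes, the temporary directory, and the sample bytes are all assumptions made for illustration; they do not come from the original code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class SaveSketch {
    // Hypothetical helper: write bytes to dir/filename, creating dir if needed.
    static Path saveBytes(Path dir, String filename, byte[] bytes) throws IOException {
        Files.createDirectories(dir);       // does nothing if the folder already exists
        Path target = dir.resolve(filename);
        return Files.write(target, bytes);  // opens, writes, and closes the file
    }

    public static void main(String[] args) throws IOException {
        // A temp folder stands in for the real download folder.
        Path dir = Files.createTempDirectory("images").resolve("doutu");
        Path saved = saveBytes(dir, "example.gif", new byte[] {71, 73, 70}); // bytes of "GIF"
        System.out.println(saved + " (" + Files.size(saved) + " bytes)");
    }
}
```

With this approach a missing download folder no longer makes the first write fail.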

String img_url = element.attr("data-original");

This reads the value of the img element's "data-original" attribute, which the code then uses as the image path.
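Once the attribute value is in hand, two things are derived from it: the full download URL and the local file name. The sketch below shows one way to do both with the JDK only. The sample paths are made up (the article does not show real data-original values), and the helper names are hypothetical.

```java
import java.net.URI;

class ImageUrlSketch {
    // Resolve an image path against the site root. A site-relative path is
    // joined to the root; a value that is already a full URL is returned as-is.
    static String toAbsoluteUrl(String siteRoot, String imgPath) {
        return URI.create(siteRoot).resolve(imgPath).toString();
    }

    // Everything after the last '/' is used as the local file name.
    static String fileNameOf(String imgPath) {
        return imgPath.substring(imgPath.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        String imgPath = "/static/doutu/example.gif"; // made-up sample value
        System.out.println(toAbsoluteUrl("http://dou.yuanmazg.com/", imgPath));
        System.out.println(fileNameOf(imgPath));
    }
}
```

Plain concatenation, as in website + img_url, works only while every value is site-relative; resolving through URI also copes with values that are already absolute URLs.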

Complete code

package com.zygxy.parse;

import org.apache.commons.io.IOUtils;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class Jsoup_Study {
    public static void main(String[] args) throws IOException {
        // Use Jsoup to imitate a browser and send the request
        String website = "http://dou.yuanmazg.com";
        Connection.Response response = Jsoup.connect("https://dou.yuanmazg.com/doutu?page=1")
                .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36")
                .ignoreContentType(true)
                .timeout(10000)
                .execute();
        // System.out.println(response.header("Content-Type")); // response header
        System.out.println(response.body()); // response body
        String html = response.body();

        // Use Jsoup to parse the HTML
        Document dom = Jsoup.parse(html);

        // Selector that matches all images
        // (single image: #pic-detail > div > div.col-sm-9 > div.page-content > a:nth-child(2) > img)
        Elements select = dom.select("#pic-detail > div > div.col-sm-9 > div.page-content > a > img");

        // Handle the images one at a time
        for (Element element : select) {
            String img_url = element.attr("data-original");
            String realurl = website + img_url;
            int i = img_url.lastIndexOf("/");
            String filename = img_url.substring(i + 1);
            System.out.println(filename);
            System.out.println(realurl);
            Connection.Response imgResponse = Jsoup.connect(realurl)
                    .ignoreContentType(true)
                    .timeout(10000)
                    .maxBodySize(10 * 1024 * 1024) // accept response bodies up to 10 MB
                    .execute();
            // Images, audio, and video are binary data, so read the body as bytes
            byte[] bytes = imgResponse.bodyAsBytes();
            // The target folder must already exist; try-with-resources closes the stream
            try (FileOutputStream out = new FileOutputStream(new File("d://斗圖啦//" + filename))) {
                IOUtils.write(bytes, out);
            }
        }
    }
}
