
Big Data and Internet Architecture Stage: Java Crawlers

Published: 2024/4/30


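Java Crawlers

1. Introduction to Crawlers

乐贷网 (http://www.lete.com) is essentially a simple application of a crawler: you submit a product link and the site retrieves the product information for it.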

Goal

Crawl the information for every product on JD (京东).

Wrap each product in our own Item entity class.

Analysis:

Does JD allow crawlers to fetch its data?

At the time of writing, JD allowed crawling and did not appear to use anti-crawler techniques.

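Crawler tools:

httpClient: fetches the whole page as a single string. Processing and parsing that string is tedious, and locating a specific piece of data in it is unreliable.

htmlUnit: also fetches the whole page, can follow secondary (ajax) requests, and locates data fairly accurately. However, crawling with it is unstable, so you have to write your own resume-from-checkpoint code.

jsoup: a Java crawling library that is relatively stable, locates data precisely, and can also fetch secondary requests.

Python can be used for crawling too, with BeautifulSoup. The underlying principle is the same as jsoup; only the language differs.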

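jsoup

Fetching a whole page

Fetching a whole site (using JD as the example: collecting every link reachable from the home page)

Fetching one precisely located piece of data in a page

Fetching data from a secondary ajax request (for example, the price)

Fetching other jsonp data (for example, the product description)

If all five of these problems can be solved, jsoup can be used to crawl most websites.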

Examples

Whole page

This works the same way as with httpclient.

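    // Needs org.jsoup.Jsoup, org.jsoup.Connection, org.jsoup.Connection.Response,
    // org.junit.Test and java.io.IOException.

    /**
     * Fetch a whole page.
     */
    @Test
    public void test_01() throws IOException {
        String url = "http://www.jd.com";
        Connection connect = Jsoup.connect(url);
        Response execute = connect.execute();
        // body() returns the raw response body as a string
        System.out.println(execute.body());
    }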

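Whole site

The aim is to collect the vast majority of the site's links.

Looking at the site, most links are a tags with the address in the href attribute.

So we use jsoup to select every a tag and read its href value.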

    /**
     * Collect the links of a whole site.
     */
    @Test
    public void test_02() throws IOException {
        String url = "http://www.jd.com";
        Document document = Jsoup.connect(url).get();
        // Find every <a> tag
        Elements elementsByTag = document.getElementsByTag("a");
        for (Element element : elementsByTag) {
            String href = element.attr("href");
            // text() is the visible link text; val() only applies to form fields
            String text = element.text();
            System.out.println("Link: " + href + "---" + text);
        }
    }
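The href values collected this way are often not directly usable: the full crawler later in this article sees protocol-relative links such as //list.jd.com/..., and pages commonly contain javascript: pseudo-links as well. The following is a minimal sketch of normalizing them with the JDK's java.net.URI. The class name LinkNormalizer and the method toAbsolute are hypothetical helpers made up for this illustration; they are not part of jsoup.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

/** Hypothetical helper (not part of jsoup): turns raw href values into absolute http(s) URLs. */
final class LinkNormalizer {

    /**
     * Resolves href against the URL of the page it was found on and keeps only
     * http/https results. Returns Optional.empty() for blank or malformed hrefs
     * and for non-web links such as "javascript:;".
     */
    static Optional<String> toAbsolute(String pageUrl, String href) {
        if (href == null || href.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            URI resolved = new URI(pageUrl).resolve(new URI(href.trim()));
            String scheme = resolved.getScheme();
            if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                return Optional.of(resolved.toString());
            }
            return Optional.empty();
        } catch (URISyntaxException e) {
            // e.g. an href containing spaces or other illegal characters
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String page = "http://www.jd.com/";
        System.out.println(toAbsolute(page, "//list.jd.com/list.html?cat=12379,13302,13313"));
        System.out.println(toAbsolute(page, "/allSort.aspx"));
        System.out.println(toAbsolute(page, "javascript:;"));
    }
}
```

Note that the page URL is written with a trailing slash; resolving a relative path against a base URL with an empty path is a known weak spot of java.net.URI.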

Locating a piece of data

    /**
     * Extract one piece of information from a page
     * by locating the specific tag that holds it.
     */
    @Test
    public void test_03() throws IOException {
        String url = "http://item.jd.com/4329035.html";
        // get() returns the response parsed into a Document tree;
        // execute() returns the complete raw response.
        Document doc = Jsoup.connect(url).get();
        // Selectors work the same way as jQuery selectors.
        // Use a descendant selector chain so the match is unambiguous.
        Element select = doc.select("ul li .p-img a").get(0);
        System.out.println(select.attr("href"));
    }

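Fetching data from a secondary JSON (ajax) request

You have to find the URL of the ajax request that the page issues yourself, for example in the browser's network panel.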

    /**
     * Fetch a secondary request.
     * The product price is loaded by ajax after the page itself has loaded.
     */
    @Test
    public void test_04() throws IOException {
        String url = "http://p.3.cn/prices/mgets?skuIds=J_5089253";
        // ignoreContentType(true) lets jsoup accept a non-HTML (JSON) response
        Response response = Jsoup.connect(url).ignoreContentType(true).execute();
        String json = response.body();
        System.out.println(json);
        ObjectMapper mp = new ObjectMapper();
        JsonNode jn = mp.readTree(json);
        // Sample response:
        // [{"op":"8388.00","m":"9999.00","id":"J_5089253","p":"8388.00"}]
        // The response is an array, so take its first element
        String price = jn.get(0).get("p").asText();
        System.out.println(price);
    }
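The price arrives as a decimal string such as "8388.00". If it is going to be stored in an integer column, one option is to convert it to cents first so that the fractional part is not silently dropped. This is only a sketch of that idea: PriceParser and toCents are hypothetical names, and storing cents is an assumption of this example rather than something the article's Item class is known to do.

```java
import java.math.BigDecimal;

/** Hypothetical helper: converts a decimal price string to an integer number of cents. */
final class PriceParser {

    /**
     * "8388.00" becomes 838800.
     * Throws NumberFormatException for non-numeric text and
     * ArithmeticException if the price has more than two decimal places.
     */
    static long toCents(String priceText) {
        BigDecimal price = new BigDecimal(priceText.trim());
        // movePointRight(2) multiplies by 100 without any floating-point rounding;
        // longValueExact() refuses to discard a remaining fraction.
        return price.movePointRight(2).longValueExact();
    }

    public static void main(String[] args) {
        System.out.println(toCents("8388.00"));
    }
}
```

Using BigDecimal rather than double avoids binary rounding errors on values like 9.9.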

jsonp data

    /**
     * Fetch the data returned by a jsonp request.
     */
    @Test
    public void test_05() throws IOException {
        String url = "http://d.3.cn/desc/4329035";
        String jsonDesc = Jsoup.connect(url).ignoreContentType(true).execute().body();
        System.out.println(jsonDesc);
        // A jsonp response wraps the JSON in a callback call: strip callback( ... )
        String data = jsonDesc.substring(jsonDesc.indexOf("(") + 1, jsonDesc.lastIndexOf(")"));
        System.out.println(data);
        ObjectMapper mp = new ObjectMapper();
        JsonNode jn = mp.readTree(data);
        // The description is in the "content" field, as in getDesc() of the full crawler
        String desc = jn.get("content").asText();
        System.out.println(desc);
    }
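The substring call above fails with an unhelpful StringIndexOutOfBoundsException when the response is not actually wrapped in a callback, for example when the server returns an error page. A small guard makes that case explicit. This is a sketch using only the JDK; the class name Jsonp and the callback name showdesc in the usage line are placeholders invented for the illustration.

```java
/** Hypothetical helper: extracts the JSON payload from a JSONP response body. */
final class Jsonp {

    /**
     * For a body like callback({...}); returns the text between the first "("
     * and the last ")". Throws IllegalArgumentException when there is no such pair.
     */
    static String unwrap(String body) {
        int open = body.indexOf('(');
        int close = body.lastIndexOf(')');
        if (open < 0 || close <= open) {
            throw new IllegalArgumentException("not a JSONP response: " + body);
        }
        return body.substring(open + 1, close).trim();
    }

    public static void main(String[] args) {
        // "showdesc" is a made-up callback name for this example
        System.out.println(unwrap("showdesc({\"content\":\"<p>size (cm)</p>\"});"));
    }
}
```

Taking the first opening and the last closing parenthesis means parentheses inside the payload itself are left intact.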

Crawling JD product information

    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.ibatis.io.Resources;
    import org.apache.ibatis.session.SqlSession;
    import org.apache.ibatis.session.SqlSessionFactory;
    import org.apache.ibatis.session.SqlSessionFactoryBuilder;
    import org.jsoup.Connection.Response;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // Jackson 2.x package names are assumed here; the original listing shows no imports.
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    /**
     * Crawls the information of all JD products.
     * @author outman
     * 2018-1-31 17:48
     *
     * Steps:
     * 1. Collect the links of all third-level product categories.
     * 2. Visit each category link and collect the links of all products in it
     *    (the listing may be paginated).
     * 3. Visit each product link and extract the product information.
     *
     * Pay close attention to exception handling throughout: once an exception
     * occurs during the crawl, the following steps are affected too and the
     * collected data ends up corrupted.
     */
    public class JDCrawler {

        private static SqlSession session;

        static {
            try {
                // Open the MyBatis configuration as a stream
                InputStream in = Resources.getResourceAsStream("mybatis-config.xml");
                // Build the session factory
                SqlSessionFactory factory = new SqlSessionFactoryBuilder().build(in);
                // Open a session; true enables auto-commit (the default, false, needs manual commits)
                session = factory.openSession(true);
            } catch (Exception e) {
                // Without a session nothing can be saved, so fail immediately
                throw new ExceptionInInitializerError(e);
            }
        }

        /**
         * Entry point.
         */
        public static void main(String[] args) throws Exception {
            // Individual tests:
            // category overview page
            // getItemCatUrls("http://www.jd.com/allSort.aspx");
            // product listing of one category
            // getItemsPageUrls("http://list.jd.com/list.html?cat=12379,13302,13313");
            // one page of a product listing
            // getItemUrls("http://list.jd.com/list.html?cat=12379,13302,13313&page=2");
            // product detail page
            // getItem("http://item.jd.com/12017077901.html");
            // 12017077901 is the id of one product
            // getPrice(12017077901L);

            // Full run
            List<String> itemCatUrls = getItemCatUrls("http://www.jd.com/allSort.aspx");
            for (String itemCatUrl : itemCatUrls) {
                System.out.println("Category link: " + itemCatUrl);
                List<String> itemsPageUrls = getItemsPageUrls(itemCatUrl);
                for (String itemsPageUrl : itemsPageUrls) {
                    System.out.println("Listing page link: " + itemsPageUrl);
                    List<String> itemUrls = getItemUrls(itemsPageUrl);
                    for (String itemUrl : itemUrls) {
                        System.out.println("Product link: " + itemUrl);
                        Item item = getItem(itemUrl);
                        if (item == null) {
                            continue; // extraction failed; do not store an empty row
                        }
                        saveItem(item);
                        System.out.println(item);
                    }
                }
            }
        }

        /**
         * Collects the links of all JD product categories.
         */
        public static List<String> getItemCatUrls(String url) throws Exception {
            // Number of links seen before filtering
            int hrefPreNum = 0;
            List<String> itemCatUrls = new ArrayList<String>();
            // The exception is deliberately propagated: if this request fails, the URL
            // or the network is broken and nothing that follows makes sense.
            Document doc = Jsoup.connect(url).get();
            Elements eles = doc.select("dl dd a");
            for (Element ele : eles) {
                String href = ele.attr("href");
                hrefPreNum += 1;
                // Keep only protocol-relative links to listing pages
                if (href.startsWith("//list.jd.com/")) {
                    itemCatUrls.add("http:" + href);
                }
            }
            System.out.println("Third-level category links found: " + hrefPreNum);
            System.out.println("Links left after cleaning: " + itemCatUrls.size());
            return itemCatUrls;
        }

        /**
         * Collects the links of all listing pages in one third-level category.
         * A listing may be paginated, so all of its pages have to be known
         * before the product links can be collected.
         */
        public static List<String> getItemsPageUrls(String url) {
            List<String> itemsPages = new ArrayList<String>();
            try {
                // Read the page count from the listing page. The exception is caught so
                // the crawl continues; losing a little data here is acceptable.
                String num = Jsoup.connect(url).get().select("#J_topPage span i").get(0).text();
                int pages = Integer.parseInt(num.trim());
                for (int i = 1; i <= pages; i++) {
                    itemsPages.add(url + "&page=" + i);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            return itemsPages;
        }

        /**
         * Collects the product links on one listing page.
         */
        public static List<String> getItemUrls(String url) {
            List<String> itemUrls = new ArrayList<String>();
            try {
                Elements eles = Jsoup.connect(url).get().select("li div .p-img a");
                for (Element ele : eles) {
                    // The hrefs are protocol-relative (//item.jd.com/...)
                    itemUrls.add("http:" + ele.attr("href"));
                }
            } catch (Exception e) {
                System.out.println("Failed to collect product links from listing page: " + url);
            }
            return itemUrls;
        }

        /**
         * Visits a product link and extracts the product data.
         * Returns null if the product could not be read.
         */
        public static Item getItem(String url) {
            try {
                Document doc = Jsoup.connect(url).get();
                // id, e.g. //item.jd.com/12016709876.html
                Long id = Long.valueOf(url.substring(url.lastIndexOf("/") + 1, url.indexOf(".html")));
                // title
                String title = doc.select("#name h1").get(0).text();
                // Selling point: doc.select("#p-ad") yields "" because the value is
                // loaded by ajax as JSON, so it is fetched separately
                String sellPoint = getSellPoint(id);
                // Price: also loaded by a secondary ajax request
                Long price = getPrice(id);
                // Images
                String img = getImg(url);
                // Description: loaded by jsonp
                String desc = getDesc(id);

                // Populate the entity
                Item item = new Item();
                item.setId(id);
                item.setTitle(title);
                item.setSellPoint(sellPoint);
                item.setPrice(price);
                item.setImg(img);
                item.setDesc(desc);
                return item;
            } catch (Exception e) {
                System.out.println("Failed to read product: " + url);
                return null;
            }
        }

        /**
         * Fetches the selling point.
         * It is loaded by ajax after the page has loaded, so the JSON is fetched separately.
         * URL found by inspecting the page:
         * http://ad.3.cn/ads/mgets?skuids=AD_ + 12017077901
         */
        public static String getSellPoint(Long id) {
            String sellPoint = null;
            try {
                Response resp = Jsoup.connect("http://ad.3.cn/ads/mgets?skuids=AD_" + id)
                        .ignoreContentType(true).execute();
                ObjectMapper mapper = new ObjectMapper();
                sellPoint = mapper.readTree(resp.body()).get(0).get("ad").asText();
            } catch (Exception e) {
                System.out.println("Failed to fetch selling point");
            }
            return sellPoint;
        }

        /**
         * Fetches the product price.
         * It is loaded by ajax after the page has loaded, so it is fetched separately.
         * URL found by inspecting the page: //p.3.cn/prices/get?skuid=id
         */
        public static Long getPrice(Long id) {
            Long price = null;
            try {
                Response resp = Jsoup.connect("http://p.3.cn/prices/get?skuid=" + id)
                        .ignoreContentType(true).execute();
                ObjectMapper mapper = new ObjectMapper();
                JsonNode jsonNode = mapper.readTree(resp.body()).get(0);
                // "p" is the current price (see the sample response in test_04);
                // asLong() drops the decimal part
                price = jsonNode.get("p").asLong();
            } catch (Exception e) {
                System.out.println("Failed to fetch price");
            }
            return price;
        }

        /**
         * Collects the product images.
         * The image addresses were found by inspecting the page.
         * Returns the addresses joined with ";" (empty string if none were found).
         */
        public static String getImg(String url) {
            List<String> srcs = new ArrayList<String>();
            try {
                Document doc = Jsoup.connect(url).get();
                // The main image is at "#spec-n1 img"; the thumbnails are listed below it
                Elements smallsrcs = doc.select("#spec-list div ul li img");
                for (Element ele : smallsrcs) {
                    // Turn the thumbnail address into the large-image address
                    srcs.add(ele.attr("src").replace("n5", "n1"));
                }
            } catch (Exception e) {
                System.out.println("Failed to fetch images");
            }
            return String.join(";", srcs);
        }

        /**
         * Fetches the product description.
         * It is loaded by jsonp after the page has loaded, so it is fetched separately.
         * http://dx.3.cn/desc/10316672107
         */
        public static String getDesc(Long id) {
            String desc = null;
            try {
                Response resp = Jsoup.connect("http://dx.3.cn/desc/" + id)
                        .ignoreContentType(true).execute();
                ObjectMapper mapper = new ObjectMapper();
                String body = resp.body();
                // Strip the jsonp callback wrapper
                body = body.substring(body.indexOf("(") + 1, body.lastIndexOf(")"));
                desc = mapper.readTree(body).get("content").asText();
            } catch (Exception e) {
                System.out.println("Could not fetch the description of product " + id);
            }
            return desc;
        }

        /**
         * Stores a product in the database.
         */
        public static void saveItem(Item item) {
            session.insert("ItemMapper.saveItem", item);
        }
    }
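getItem derives the product id from the URL with substring arithmetic, which throws as soon as a link carries a query string in front of ".html" or is not a product link at all. A regular expression makes that extraction easier to reason about. The sketch below uses only the JDK; SkuIds and fromItemUrl are hypothetical names, and the pattern assumes the item.jd.com/<digits>.html shape seen in the URLs above.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the numeric product id from a JD item URL. */
final class SkuIds {

    // Assumes the URL shape used in this article: item.jd.com/<digits>.html
    private static final Pattern ITEM_URL = Pattern.compile("item\\.jd\\.com/(\\d+)\\.html");

    /** Returns the id, or OptionalLong.empty() if the URL is not a product link. */
    static OptionalLong fromItemUrl(String url) {
        Matcher m = ITEM_URL.matcher(url);
        if (!m.find()) {
            return OptionalLong.empty();
        }
        try {
            return OptionalLong.of(Long.parseLong(m.group(1)));
        } catch (NumberFormatException e) {
            // more digits than fit into a long
            return OptionalLong.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(fromItemUrl("http://item.jd.com/12017077901.html"));
        System.out.println(fromItemUrl("http://list.jd.com/list.html?cat=12379,13302,13313"));
    }
}
```

Returning an empty OptionalLong lets the caller skip a link without relying on an exception for control flow.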

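Things to watch out for when crawling

Networks are unreliable, so the crawl logic should be complete and rigorous, ideally able to resume from a checkpoint.

A crawler does not need much code (there are only a few kinds of logic); the most important work is analyzing the page structure.

When the site is redesigned, the crawler code has to be updated.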
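The resume-from-checkpoint idea mentioned above can be sketched with nothing more than a text file that records every URL once it has been processed. On restart the file is read back and finished URLs are skipped. CrawlCheckpoint and its methods are hypothetical names for this illustration, and a real crawler would more likely keep this state in its database.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: remembers finished URLs in a file, one URL per line. */
final class CrawlCheckpoint {

    private final Path file;
    private final Set<String> done = new HashSet<>();

    /** Loads the URLs recorded by a previous run, if the file exists. */
    CrawlCheckpoint(Path file) throws IOException {
        this.file = file;
        if (Files.exists(file)) {
            done.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
        }
    }

    boolean isDone(String url) {
        return done.contains(url);
    }

    /** Records the URL in memory and appends it to the file right away. */
    void markDone(String url) throws IOException {
        if (done.add(url)) {
            Files.write(file, Collections.singletonList(url), StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }

    public static void main(String[] args) throws IOException {
        // "crawl-checkpoint.txt" is an arbitrary file name chosen for the example
        CrawlCheckpoint checkpoint = new CrawlCheckpoint(Paths.get("crawl-checkpoint.txt"));
        String url = "http://item.jd.com/12017077901.html";
        if (!checkpoint.isDone(url)) {
            // ... fetch and store the product here ...
            checkpoint.markDone(url);
        }
    }
}
```

Appending after each URL keeps the file current even if the process is killed, at the cost of one small write per product.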

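Anti-crawler techniques

Frequently changing the style keywords (the CSS class names and ids that selectors depend on). This is the simplest anti-crawler mechanism.

nginx alone can block crawlers (using an nginx blacklist).

jsoup's request headers differ from a browser's, so its requests can be recognized.

jsoup can set request headers in code to imitate a browser. For disguising the headers, see: http://jilongliang.iteye.com/blog/2048459

Monitoring the request frequency: if it is too high, the IP is banned for a period of time.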
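Since frequency-based bans are triggered by sending requests too quickly, a crawler can space its own requests out. The following is a minimal sketch of a fixed-interval throttle; RequestThrottle is a hypothetical class, and the 500 ms interval in the usage line is an arbitrary value, not a limit published by any site. The clock is injectable so the logic can be checked without actually sleeping.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper: enforces a minimum interval between consecutive requests. */
final class RequestThrottle {

    private final long minIntervalMillis;
    private final LongSupplier clock;
    private long nextAllowedAt = Long.MIN_VALUE;

    RequestThrottle(long minIntervalMillis) {
        this(minIntervalMillis, System::currentTimeMillis);
    }

    /** The clock supplies the current time in milliseconds; tests can pass a fake one. */
    RequestThrottle(long minIntervalMillis, LongSupplier clock) {
        this.minIntervalMillis = minIntervalMillis;
        this.clock = clock;
    }

    /**
     * Reserves the next request slot and returns how many milliseconds
     * the caller has to wait before using it (0 if it may go immediately).
     */
    synchronized long reserve() {
        long now = clock.getAsLong();
        long start = Math.max(now, nextAllowedAt);
        nextAllowedAt = start + minIntervalMillis;
        return start - now;
    }

    /** Blocks until the reserved slot is reached. */
    void acquire() throws InterruptedException {
        long wait = reserve();
        if (wait > 0) {
            Thread.sleep(wait);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        RequestThrottle throttle = new RequestThrottle(500);
        for (int i = 0; i < 3; i++) {
            throttle.acquire();
            System.out.println("request " + i + " at " + System.currentTimeMillis());
        }
    }
}
```

In the crawler above, a call to acquire() would go in front of each Jsoup.connect(...) call.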

Open question

The data is updated and extended every day. How can the crawler pick up only the latest data on top of what it has already collected?
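One possible approach, offered here only as a sketch and not as the article's answer, is to keep a fingerprint (hash) of each product's content and compare it on the next crawl: unknown ids are new, a different fingerprint means the product changed, and everything else can be skipped. ChangeDetector is a hypothetical class that keeps the fingerprints in memory; a real crawler would have to persist them, for example in an extra database column.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: detects new and changed products by comparing content hashes. */
final class ChangeDetector {

    enum Change { NEW, UPDATED, UNCHANGED }

    // product id -> SHA-256 fingerprint of the content seen last time
    private final Map<Long, String> fingerprints = new HashMap<>();

    /** Classifies the content for this id and remembers its fingerprint. */
    Change check(long id, String content) {
        String fingerprint = sha256(content);
        String previous = fingerprints.put(id, fingerprint);
        if (previous == null) {
            return Change.NEW;
        }
        return previous.equals(fingerprint) ? Change.UNCHANGED : Change.UPDATED;
    }

    private static String sha256(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform implementation
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ChangeDetector detector = new ChangeDetector();
        System.out.println(detector.check(12017077901L, "title|8388.00"));
        System.out.println(detector.check(12017077901L, "title|8388.00"));
        System.out.println(detector.check(12017077901L, "title|7999.00"));
    }
}
```

Only items classified as NEW or UPDATED would then need to be written to the database.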
