
jsoup Crawler


Contents

    • 1. A brief introduction to the jsoup crawler
    • 2. Code
        • 2.1 Importing the pom dependencies
        • 2.2 Image crawling
        • 2.3 Localizing images
    • 3. Baidu Cloud link crawler

1. A brief introduction to the jsoup crawler

jsoup is a Java HTML parser that can extract and manipulate data in an HTML document through the DOM, CSS selectors, and jQuery-like methods.
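For instance, here is a minimal, self-contained sketch of the jsoup calls used throughout this article: parse an HTML string, select elements with a CSS selector, and read attributes. The sample HTML is invented for illustration; the selector mirrors the one the crawler below runs against cnblogs.com.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupDemo {
    public static void main(String[] args) {
        // Parse an in-memory HTML string; a fetched page body works the same way
        Document doc = Jsoup.parse(
                "<div id='post_list'><div class='post_item'><div class='post_item_body'>"
                + "<h3><a href='https://example.com/p/1'>Hello jsoup</a></h3>"
                + "</div></div></div>");
        // Same selector style the crawler below uses against the cnblogs home page
        Elements links = doc.select("#post_list .post_item .post_item_body h3 a");
        for (Element a : links) {
            System.out.println(a.text() + " -> " + a.attr("href"));
        }
    }
}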

The crawler involves the following points:

1. Fetching page content with HttpClient
2. Parsing the page content with jsoup
3. Deduplicating already-seen URLs with an ehcache cache, so that crawling is incremental (a sample ehcache.xml sketch follows this list)
4. Storing the crawled data in a database
5. Localizing the target site's static resources (only images are handled here) to get around hotlink protection
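The crawler.properties file below points ehcacheXmlPath at an ehcache.xml that the article never shows. As a reference, this is a minimal sketch of what such an ehcache 2.x config might look like: the cache names ("cnblog", "8gli_movies") match the ones the crawlers request, while the disk store path, sizes, and TTLs are assumptions.

<ehcache xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation="ehcache.xsd" updateCheck="false">
    <!-- where flushed cache entries are persisted between runs (assumed path) -->
    <diskStore path="C:/blogCrawler/ehcachestore"/>
    <defaultCache maxElementsInMemory="10000" eternal="false"
                  timeToIdleSeconds="120" timeToLiveSeconds="120"
                  overflowToDisk="true"/>
    <!-- crawled-URL caches: eternal + diskPersistent so dedup survives restarts -->
    <cache name="cnblog" maxElementsInMemory="10000" eternal="true"
           overflowToDisk="true" diskPersistent="true"/>
    <cache name="8gli_movies" maxElementsInMemory="10000" eternal="true"
           overflowToDisk="true" diskPersistent="true"/>
</ehcache>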

2. Code

2.1 Importing the pom dependencies

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"><modelVersion>4.0.0</modelVersion><groupId>com.zrh</groupId><artifactId>T226_jsoup</artifactId><version>0.0.1-SNAPSHOT</version><packaging>jar</packaging><name>T226_jsoup</name><url>http://maven.apache.org</url><properties><project.build.sourceEncoding>UTF-8</project.build.sourceEncoding></properties><dependencies><!-- jdbc驅(qū)動(dòng)包 --><dependency><groupId>mysql</groupId><artifactId>mysql-connector-java</artifactId><version>5.1.44</version></dependency><!-- 添加Httpclient支持 --><dependency><groupId>org.apache.httpcomponents</groupId><artifactId>httpclient</artifactId><version>4.5.2</version></dependency><!-- 添加jsoup支持 --><dependency><groupId>org.jsoup</groupId><artifactId>jsoup</artifactId><version>1.10.1</version></dependency><!-- 添加日志支持 --><dependency><groupId>log4j</groupId><artifactId>log4j</artifactId><version>1.2.16</version></dependency><!-- 添加ehcache支持 --><dependency><groupId>net.sf.ehcache</groupId><artifactId>ehcache</artifactId><version>2.10.3</version></dependency><!-- 添加commons io支持 --><dependency><groupId>commons-io</groupId><artifactId>commons-io</artifactId><version>2.5</version></dependency><dependency><groupId>com.alibaba</groupId><artifactId>fastjson</artifactId><version>1.2.47</version></dependency></dependencies> </project>

2.2 Image crawling

Change URL to the address of the image you want to crawl:

private static String URL = "http://www.yidianzhidao.com/UploadFiles/img_1_446119934_1806045383_26.jpg";
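The article does not list the rest of this image-crawl class, but the same HttpClient + commons-io pattern appears later in downloadImgList(). A minimal sketch of fetching that one URL to disk, assuming an output directory that mirrors blogImages in crawler.properties:

import java.io.File;

import org.apache.commons.io.FileUtils;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class ImgCrawler {
    private static String URL = "http://www.yidianzhidao.com/UploadFiles/img_1_446119934_1806045383_26.jpg";

    public static void main(String[] args) throws Exception {
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
                CloseableHttpResponse response = httpClient.execute(new HttpGet(URL))) {
            if (response.getStatusLine().getStatusCode() == 200) {
                HttpEntity entity = response.getEntity();
                // derive the file extension from the Content-Type, e.g. image/jpeg -> jpeg
                String ext = entity.getContentType().getValue().split("/")[1];
                // assumed output path, mirroring blogImages in crawler.properties
                FileUtils.copyInputStreamToFile(entity.getContent(),
                        new File("D:/blogCrawler/blogImages/demo." + ext));
            }
        }
    }
}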

2.3 Localizing images

crawler.properties

dbUrl=jdbc:mysql://localhost:3306/zrh?autoReconnect=true
dbUserName=root
dbPassword=123
jdbcName=com.mysql.jdbc.Driver
ehcacheXmlPath=C://blogCrawler/ehcache.xml
blogImages=D://blogCrawler/blogImages/

log4j.properties

log4j.rootLogger=INFO, stdout,D

#Console
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target = System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=[%-5p] %d{yyyy-MM-dd HH:mm:ss,SSS} method:%l%n%m%n

#D
log4j.appender.D = org.apache.log4j.RollingFileAppender
log4j.appender.D.File = C://blogCrawler/bloglogs/log.log
log4j.appender.D.MaxFileSize=100KB
log4j.appender.D.MaxBackupIndex=100
log4j.appender.D.Append = true
log4j.appender.D.layout = org.apache.log4j.PatternLayout
log4j.appender.D.layout.ConversionPattern = %-d{yyyy-MM-dd HH:mm:ss} [ %t:%r ] - [ %p ] %m%n

DbUtil.java

package com.zrh.util;

import java.sql.Connection;
import java.sql.DriverManager;

/**
 * Database utility class
 * @author user
 */
public class DbUtil {

    /**
     * Open a connection
     */
    public Connection getCon() throws Exception {
        Class.forName(PropertiesUtil.getValue("jdbcName"));
        Connection con = DriverManager.getConnection(PropertiesUtil.getValue("dbUrl"),
                PropertiesUtil.getValue("dbUserName"), PropertiesUtil.getValue("dbPassword"));
        return con;
    }

    /**
     * Close a connection
     */
    public void closeCon(Connection con) throws Exception {
        if (con != null) {
            con.close();
        }
    }

    public static void main(String[] args) {
        DbUtil dbUtil = new DbUtil();
        try {
            dbUtil.getCon();
            System.out.println("Database connection succeeded");
        } catch (Exception e) {
            e.printStackTrace();
            System.out.println("Database connection failed");
        }
    }
}

PropertiesUtil.java

package com.zrh.util;

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

/**
 * Properties utility class
 * @author user
 */
public class PropertiesUtil {

    /**
     * Look up the value for a key in crawler.properties
     */
    public static String getValue(String key) {
        Properties prop = new Properties();
        InputStream in = PropertiesUtil.class.getResourceAsStream("/crawler.properties");
        try {
            prop.load(in);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return prop.getProperty(key);
    }
}
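The crawler below also imports a com.zrh.util.DateUtil helper that the article never lists. Judging by how getCurrentDatePath() is used (a per-day subdirectory for localized images), it is presumably something like this sketch; the exact date pattern is a guess:

package com.zrh.util;

import java.text.SimpleDateFormat;
import java.util.Date;

/**
 * Date utility (not shown in the original article; reconstructed guess).
 */
public class DateUtil {

    /**
     * Returns a date-based directory name such as "2023-12-10",
     * used to group localized images by day.
     */
    public static String getCurrentDatePath() {
        return new SimpleDateFormat("yyyy-MM-dd").format(new Date());
    }
}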

Now for the most important part:
BlogCrawlerStarter.java (the core code)

package com.zrh.crawler;

import java.io.File;
import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

import org.apache.commons.io.FileUtils;
import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.apache.log4j.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.zrh.util.DateUtil;
import com.zrh.util.DbUtil;
import com.zrh.util.PropertiesUtil;

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Status;

/**
 * @author Administrator
 */
public class BlogCrawlerStarter {

    private static Logger logger = Logger.getLogger(BlogCrawlerStarter.class);
    // https://www.csdn.net/nav/newarticles
    private static String HOMEURL = "https://www.cnblogs.com/";
    private static CloseableHttpClient httpClient;
    private static Connection con;
    private static CacheManager cacheManager;
    private static Cache cache;

    /**
     * Fetch the home page with httpclient and hand its content off for parsing.
     */
    public static void parseHomePage() {
        logger.info("Started crawling home page: " + HOMEURL);
        cacheManager = CacheManager.create(PropertiesUtil.getValue("ehcacheXmlPath"));
        cache = cacheManager.getCache("cnblog");
        httpClient = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet(HOMEURL);
        RequestConfig config = RequestConfig.custom().setConnectTimeout(5000).setSocketTimeout(8000).build();
        httpGet.setConfig(config);
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            if (response == null) {
                logger.info(HOMEURL + ": no response");
                return;
            }
            if (response.getStatusLine().getStatusCode() == 200) {
                HttpEntity entity = response.getEntity();
                String homePageContent = EntityUtils.toString(entity, "utf-8");
                parseHomePageContent(homePageContent);
            }
        } catch (ClientProtocolException e) {
            logger.error(HOMEURL + "-ClientProtocolException", e);
        } catch (IOException e) {
            logger.error(HOMEURL + "-IOException", e);
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
                if (httpClient != null) {
                    httpClient.close();
                }
            } catch (IOException e) {
                logger.error(HOMEURL + "-IOException", e);
            }
        }
        // flush the cache to disk so dedup state survives restarts
        if (cache.getStatus() == Status.STATUS_ALIVE) {
            cache.flush();
        }
        cacheManager.shutdown();
        logger.info("Finished crawling home page: " + HOMEURL);
    }

    /**
     * Parse the page content with jsoup and pull out the data we want
     * (the links to the individual blog posts).
     */
    private static void parseHomePageContent(String homePageContent) {
        Document doc = Jsoup.parse(homePageContent);
        // #feedlist_id .list_con .title h2 a
        Elements aEles = doc.select("#post_list .post_item .post_item_body h3 a");
        for (Element aEle : aEles) {
            // a single post link from the home-page post list
            String blogUrl = aEle.attr("href");
            if (null == blogUrl || "".equals(blogUrl)) {
                logger.info("This post has no content; skipping the database insert!");
                continue;
            }
            if (cache.get(blogUrl) != null) {
                logger.info("This entry was already crawled; the database will not take it again!");
                continue;
            }
            parseBlogUrl(blogUrl);
        }
    }

    /**
     * Fetch a post's title and body from its URL.
     */
    private static void parseBlogUrl(String blogUrl) {
        logger.info("Started crawling post page: " + blogUrl);
        httpClient = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet(blogUrl);
        RequestConfig config = RequestConfig.custom().setConnectTimeout(5000).setSocketTimeout(8000).build();
        httpGet.setConfig(config);
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            if (response == null) {
                logger.info(blogUrl + ": no response");
                return;
            }
            if (response.getStatusLine().getStatusCode() == 200) {
                HttpEntity entity = response.getEntity();
                String blogContent = EntityUtils.toString(entity, "utf-8");
                parseBlogContent(blogContent, blogUrl);
            }
        } catch (ClientProtocolException e) {
            logger.error(blogUrl + "-ClientProtocolException", e);
        } catch (IOException e) {
            logger.error(blogUrl + "-IOException", e);
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
            } catch (IOException e) {
                logger.error(blogUrl + "-IOException", e);
            }
        }
        logger.info("Finished crawling post page: " + blogUrl);
    }

    /**
     * Parse a post page: extract the title and the full body.
     */
    private static void parseBlogContent(String blogContent, String link) {
        Document doc = Jsoup.parse(blogContent);
        if (!link.contains("ansion2014")) {
            System.out.println(blogContent); // debug output
        }
        Elements titleEles = doc
                // #mainBox main .blog-content-box .article-header-box .article-header .article-title-box h1
                .select("#topics .post h1 a");
        System.out.println(titleEles.toString());
        if (titleEles.size() == 0) {
            logger.info("Post title is empty; not inserting into the database!");
            return;
        }
        String title = titleEles.get(0).html();

        Elements blogContentEles = doc.select("#cnblogs_post_body ");
        if (blogContentEles.size() == 0) {
            logger.info("Post body is empty; not inserting into the database!");
            return;
        }
        String blogContentBody = blogContentEles.get(0).html();

//        Elements imgEles = doc.select("img");
//        List<String> imgUrlList = new LinkedList<String>();
//        if (imgEles.size() > 0) {
//            for (Element imgEle : imgEles) {
//                imgUrlList.add(imgEle.attr("src"));
//            }
//        }
//
//        if (imgUrlList.size() > 0) {
//            Map<String, String> replaceUrlMap = downloadImgList(imgUrlList);
//            blogContent = replaceContent(blogContent, replaceUrlMap);
//        }

        String sql = "insert into `t_jsoup_article` values(null,?,?,null,now(),0,0,null,?,0,null)";
        try {
            PreparedStatement pst = con.prepareStatement(sql);
            pst.setObject(1, title);
            pst.setObject(2, blogContentBody);
            pst.setObject(3, link);
            if (pst.executeUpdate() == 0) {
                logger.info("Failed to insert the crawled post into the database");
            } else {
                cache.put(new net.sf.ehcache.Element(link, link));
                logger.info("Inserted the crawled post into the database");
            }
        } catch (SQLException e) {
            logger.error("Data error-SQLException:", e);
        }
    }

    /**
     * Rewrite the post body, swapping the original image URLs for our local copies.
     */
    private static String replaceContent(String blogContent, Map<String, String> replaceUrlMap) {
        for (Map.Entry<String, String> entry : replaceUrlMap.entrySet()) {
            blogContent = blogContent.replace(entry.getKey(), entry.getValue());
        }
        return blogContent;
    }

    /**
     * Localize images hosted on the other site's server.
     */
    private static Map<String, String> downloadImgList(List<String> imgUrlList) {
        Map<String, String> replaceMap = new HashMap<String, String>();
        for (String imgUrl : imgUrlList) {
            CloseableHttpClient httpClient = HttpClients.createDefault();
            HttpGet httpGet = new HttpGet(imgUrl);
            RequestConfig config = RequestConfig.custom().setConnectTimeout(5000).setSocketTimeout(8000).build();
            httpGet.setConfig(config);
            CloseableHttpResponse response = null;
            try {
                response = httpClient.execute(httpGet);
                if (response == null) {
                    logger.info(HOMEURL + ": no response");
                } else {
                    if (response.getStatusLine().getStatusCode() == 200) {
                        HttpEntity entity = response.getEntity();
                        String blogImagesPath = PropertiesUtil.getValue("blogImages");
                        String dateDir = DateUtil.getCurrentDatePath();
                        String uuid = UUID.randomUUID().toString();
                        // derive the extension from the Content-Type, e.g. image/jpeg -> jpeg
                        String subfix = entity.getContentType().getValue().split("/")[1];
                        String fileName = blogImagesPath + dateDir + "/" + uuid + "." + subfix;
                        FileUtils.copyInputStreamToFile(entity.getContent(), new File(fileName));
                        replaceMap.put(imgUrl, fileName);
                    }
                }
            } catch (ClientProtocolException e) {
                logger.error(imgUrl + "-ClientProtocolException", e);
            } catch (IOException e) {
                logger.error(imgUrl + "-IOException", e);
            } catch (Exception e) {
                logger.error(imgUrl + "-Exception", e);
            } finally {
                try {
                    if (response != null) {
                        response.close();
                    }
                } catch (IOException e) {
                    logger.error(imgUrl + "-IOException", e);
                }
            }
        }
        return replaceMap;
    }

    public static void start() {
        while (true) {
            DbUtil dbUtil = new DbUtil();
            try {
                con = dbUtil.getCon();
                parseHomePage();
            } catch (Exception e) {
                logger.error("Database connection failed!");
            } finally {
                try {
                    if (con != null) {
                        con.close();
                    }
                } catch (SQLException e) {
                    logger.error("Error closing connection-SQLException:", e);
                }
            }
            try {
                Thread.sleep(1000 * 60);
            } catch (InterruptedException e) {
                logger.error("Main thread sleep interrupted-InterruptedException:", e);
            }
        }
    }

    public static void main(String[] args) {
        start();
    }
}
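The INSERT above targets an 11-column t_jsoup_article table whose DDL the article omits. A hypothetical schema consistent with insert into `t_jsoup_article` values(null,?,?,null,now(),0,0,null,?,0,null) might look like the sketch below; every column name not implied by the code is a guess.

CREATE TABLE `t_jsoup_article` (
  `id`         BIGINT PRIMARY KEY AUTO_INCREMENT,  -- inserted as NULL
  `title`      VARCHAR(255),  -- parameter 1: post title
  `content`    LONGTEXT,      -- parameter 2: post body HTML
  `summary`    VARCHAR(500),  -- inserted as NULL (name is a guess)
  `created`    DATETIME,      -- now()
  `view_count` INT,           -- inserted as 0 (name is a guess)
  `like_count` INT,           -- inserted as 0 (name is a guess)
  `author`     VARCHAR(100),  -- inserted as NULL (name is a guess)
  `link`       VARCHAR(500),  -- parameter 3: source URL, also used as the ehcache key
  `status`     INT,           -- inserted as 0 (name is a guess)
  `remark`     VARCHAR(255)   -- inserted as NULL (name is a guess)
);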


Run it and then check the database: the crawled posts have all been inserted.

3. Baidu Cloud link crawler

PanZhaoZhaoCrawler3.java

package com.zrh.crawler;

import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.LinkedList;
import java.util.List;

import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.apache.log4j.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.zrh.util.DbUtil;
import com.zrh.util.PropertiesUtil;

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Status;

public class PanZhaoZhaoCrawler3 {

    private static Logger logger = Logger.getLogger(PanZhaoZhaoCrawler3.class);
    private static String URL = "http://www.13910.com/daren/";
    private static String PROJECT_URL = "http://www.13910.com";
    private static Connection con;
    private static CacheManager manager;
    private static Cache cache;
    private static CloseableHttpClient httpClient;
    private static long total = 0;

    /**
     * Fetch the home page content with httpclient.
     */
    public static void parseHomePage() {
        logger.info("Started crawling: " + URL);
        manager = CacheManager.create(PropertiesUtil.getValue("ehcacheXmlPath"));
        cache = manager.getCache("cnblog");
        httpClient = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet(URL);
        RequestConfig config = RequestConfig.custom().setConnectTimeout(5000).setSocketTimeout(8000).build();
        httpGet.setConfig(config);
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            if (response == null) {
                logger.info("Connection timed out!");
            } else {
                if (response.getStatusLine().getStatusCode() == 200) {
                    HttpEntity entity = response.getEntity();
                    String pageContent = EntityUtils.toString(entity, "utf-8");
                    parsePageContent(pageContent);
                }
            }
        } catch (ClientProtocolException e) {
            logger.error(URL + "-parse error-ClientProtocolException", e);
        } catch (IOException e) {
            logger.error(URL + "-parse error-IOException", e);
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
                if (httpClient != null) {
                    httpClient.close();
                }
            } catch (IOException e) {
                logger.error(URL + "-parse error-IOException", e);
            }
        }
        // finally flush the cache to disk
        if (cache.getStatus() == Status.STATUS_ALIVE) {
            cache.flush();
        }
        manager.shutdown();
        logger.info("Finished crawling: " + URL);
    }

    /**
     * Parse the home page content with jsoup.
     */
    private static void parsePageContent(String pageContent) {
        Document doc = Jsoup.parse(pageContent);
        Elements aEles = doc.select(".showtop .key-right .darenlist .list-info .darentitle a");
        for (Element aEle : aEles) {
            String aHref = aEle.attr("href");
            logger.info("Extracted a sharer's profile page: " + aHref);
            String panZhaoZhaoUserShareUrl = PROJECT_URL + aHref;
            List<String> panZhaoZhaoUserShareUrls = getPanZhaoZhaoUserShareUrls(panZhaoZhaoUserShareUrl);
            for (String singlePanZhaoZhaoUserShareUrl : panZhaoZhaoUserShareUrls) {
                parsePanZhaoZhaoUserShareUrl(singlePanZhaoZhaoUserShareUrl);
            }
        }
    }

    /**
     * Collect the first 15 listing pages of a user's profile.
     */
    private static List<String> getPanZhaoZhaoUserShareUrls(String panZhaoZhaoUserShareUrl) {
        List<String> list = new LinkedList<String>();
        list.add(panZhaoZhaoUserShareUrl);
        for (int i = 2; i < 16; i++) {
            list.add(panZhaoZhaoUserShareUrl + "page-" + i + ".html");
        }
        return list;
    }

    /**
     * Parse a PanZhaoZhao-rewritten user share URL.
     * Original: http://yun.baidu.com/share/home?uk=1949795117
     * Rewritten: http://www.13910.com/u/1949795117/share/
     * @param panZhaoZhaoUserShareUrl the rewritten URL
     */
    private static void parsePanZhaoZhaoUserShareUrl(String panZhaoZhaoUserShareUrl) {
        logger.info("Started crawling a sharer's profile page: " + panZhaoZhaoUserShareUrl);
        HttpGet httpGet = new HttpGet(panZhaoZhaoUserShareUrl);
        RequestConfig config = RequestConfig.custom().setConnectTimeout(5000).setSocketTimeout(8000).build();
        httpGet.setConfig(config);
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            if (response == null) {
                logger.info("Connection timed out!");
            } else {
                if (response.getStatusLine().getStatusCode() == 200) {
                    HttpEntity entity = response.getEntity();
                    String pageContent = EntityUtils.toString(entity, "utf-8");
                    parsePanZhaoZhaoUserSharePageContent(pageContent, panZhaoZhaoUserShareUrl);
                }
            }
        } catch (ClientProtocolException e) {
            logger.error(panZhaoZhaoUserShareUrl + "-parse error-ClientProtocolException", e);
        } catch (IOException e) {
            logger.error(panZhaoZhaoUserShareUrl + "-parse error-IOException", e);
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
            } catch (IOException e) {
                logger.error(panZhaoZhaoUserShareUrl + "-parse error-IOException", e);
            }
        }
        logger.info("Finished crawling a sharer's profile page: " + panZhaoZhaoUserShareUrl);
    }

    /**
     * From a user's share page, collect every rewritten share link.
     * @param pageContent
     * @param panZhaoZhaoUserShareUrl the rewritten profile-page link
     */
    private static void parsePanZhaoZhaoUserSharePageContent(String pageContent, String panZhaoZhaoUserShareUrl) {
        Document doc = Jsoup.parse(pageContent);
        Elements aEles = doc.select("#flist li a");
        if (aEles.size() == 0) {
            logger.info("No Baidu Cloud address found");
            return;
        }
        for (Element aEle : aEles) {
            String ahref = aEle.attr("href");
            parseUserHandledTargetUrl(PROJECT_URL + ahref);
        }
    }

    /**
     * Parse a page that contains a rewritten Baidu Cloud address.
     */
    private static void parseUserHandledTargetUrl(String handledTargetUrl) {
        logger.info("Started crawling: " + handledTargetUrl);
        if (cache.get(handledTargetUrl) != null) {
            logger.info("This record is already in the database");
            return;
        }
        HttpGet httpGet = new HttpGet(handledTargetUrl);
        RequestConfig config = RequestConfig.custom().setConnectTimeout(5000).setSocketTimeout(8000).build();
        httpGet.setConfig(config);
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            if (response == null) {
                logger.info("Connection timed out!");
            } else {
                if (response.getStatusLine().getStatusCode() == 200) {
                    HttpEntity entity = response.getEntity();
                    String pageContent = EntityUtils.toString(entity, "utf-8");
                    parseHandledTargetUrlPageContent(pageContent, handledTargetUrl);
                }
            }
        } catch (ClientProtocolException e) {
            logger.error(handledTargetUrl + "-parse error-ClientProtocolException", e);
        } catch (IOException e) {
            logger.error(handledTargetUrl + "-parse error-IOException", e);
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
            } catch (IOException e) {
                logger.error(handledTargetUrl + "-parse error-IOException", e);
            }
        }
        logger.info("Finished crawling: " + handledTargetUrl);
    }

    /**
     * Parse the rewritten Baidu Cloud address page.
     */
    private static void parseHandledTargetUrlPageContent(String pageContent, String handledTargetUrl) {
        Document doc = Jsoup.parse(pageContent);
        Elements aEles = doc.select(".fileinfo .panurl a");
        if (aEles.size() == 0) {
            logger.info("No Baidu Cloud address found");
            return;
        }
        String ahref = aEles.get(0).attr("href");
        getUserBaiduYunUrl(PROJECT_URL + ahref);
    }

    /**
     * Fetch the content behind the processed Baidu Cloud link.
     */
    private static void getUserBaiduYunUrl(String handledBaiduYunUrl) {
        logger.info("Started crawling: " + handledBaiduYunUrl);
        HttpGet httpGet = new HttpGet(handledBaiduYunUrl);
        RequestConfig config = RequestConfig.custom().setConnectTimeout(5000).setSocketTimeout(8000).build();
        httpGet.setConfig(config);
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            if (response == null) {
                logger.info("Connection timed out!");
            } else {
                if (response.getStatusLine().getStatusCode() == 200) {
                    HttpEntity entity = response.getEntity();
                    String pageContent = EntityUtils.toString(entity, "utf-8");
                    parseHandledBaiduYunUrlPageContent(pageContent, handledBaiduYunUrl);
                }
            }
        } catch (ClientProtocolException e) {
            logger.error(handledBaiduYunUrl + "-parse error-ClientProtocolException", e);
        } catch (IOException e) {
            logger.error(handledBaiduYunUrl + "-parse error-IOException", e);
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
            } catch (IOException e) {
                logger.error(handledBaiduYunUrl + "-parse error-IOException", e);
            }
        }
        logger.info("Finished crawling: " + handledBaiduYunUrl);
    }

    /**
     * Extract the real Baidu Cloud link and store it.
     */
    private static void parseHandledBaiduYunUrlPageContent(String pageContent, String handledBaiduYunUrl) {
        Document doc = Jsoup.parse(pageContent);
        Elements aEles = doc.select("#check-result-no a");
        if (aEles.size() == 0) {
            logger.info("No Baidu Cloud address found");
            return;
        }
        String ahref = aEles.get(0).attr("href");
        if ((!ahref.contains("yun.baidu.com")) && (!ahref.contains("pan.baidu.com")))
            return;
        logger.info("Crawled target #" + (++total) + ": " + ahref);
        String sql = "insert into `t_jsoup_article` values(null,?,?,null,now(),0,0,null,?,0,null)";
        try {
            PreparedStatement pst = con.prepareStatement(sql);
            pst.setObject(1, "test content"); // placeholder title
            pst.setObject(2, ahref);
            pst.setObject(3, ahref);
            if (pst.executeUpdate() == 0) {
                logger.info("Failed to insert the crawled link into the database!");
            } else {
                cache.put(new net.sf.ehcache.Element(handledBaiduYunUrl, handledBaiduYunUrl));
                logger.info("Inserted the crawled link into the database!");
            }
        } catch (SQLException e) {
            logger.error(ahref + "-parse error-SQLException", e);
        }
    }

    public static void start() {
        DbUtil dbUtil = new DbUtil();
        try {
            con = dbUtil.getCon();
            parseHomePage();
        } catch (Exception e) {
            logger.error("Database connection failed", e);
        }
    }

    public static void main(String[] args) {
        start();
    }
}

It crawls the share links on the listing page at http://www.13910.com/daren/.

Crawling the movies you want:
MovieCrawlerStarter.java

package com.zrh.crawler;

import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.LinkedList;
import java.util.List;

import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.apache.log4j.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.zrh.util.DbUtil;
import com.zrh.util.PropertiesUtil;

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Status;

public class MovieCrawlerStarter {

    private static Logger logger = Logger.getLogger(MovieCrawlerStarter.class);
    private static String URL = "http://www.8gw.com/";
    private static String PROJECT_URL = "http://www.8gw.com";
    private static Connection con;
    private static CacheManager manager;
    private static Cache cache;
    private static CloseableHttpClient httpClient;
    private static long total = 0;

    /**
     * The 52 listing pages waiting to be crawled.
     */
    private static List<String> getUrls() {
        List<String> list = new LinkedList<String>();
        list.add("http://www.8gw.com/8gli/index8.html");
        for (int i = 2; i < 53; i++) {
            list.add("http://www.8gw.com/8gli/index8_" + i + ".html");
        }
        return list;
    }

    /**
     * Fetch the body of a listing URL.
     */
    private static void parseUrl(String url) {
        logger.info("Started crawling listing page: " + url);
        HttpGet httpGet = new HttpGet(url);
        RequestConfig config = RequestConfig.custom().setConnectTimeout(5000).setSocketTimeout(8000).build();
        httpGet.setConfig(config);
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            if (response == null) {
                logger.info("Connection timed out!");
            } else {
                if (response.getStatusLine().getStatusCode() == 200) {
                    HttpEntity entity = response.getEntity();
                    String pageContent = EntityUtils.toString(entity, "GBK");
                    parsePageContent(pageContent, url);
                }
            }
        } catch (ClientProtocolException e) {
            logger.error(url + "-parse error-ClientProtocolException", e);
        } catch (IOException e) {
            logger.error(url + "-parse error-IOException", e);
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
            } catch (IOException e) {
                logger.error(url + "-parse error-IOException", e);
            }
        }
        logger.info("Finished crawling listing page: " + url);
    }

    /**
     * Extract the individual film links from the current listing page.
     */
    private static void parsePageContent(String pageContent, String url) {
        Document doc = Jsoup.parse(pageContent);
        Elements liEles = doc.select(".span_2_800 #list_con li");
        for (Element liEle : liEles) {
            String movieUrl = liEle.select(".info a").attr("href");
            if (null == movieUrl || "".equals(movieUrl)) {
                logger.info("This film has no content; skipping the database insert!");
                continue;
            }
            if (cache.get(movieUrl) != null) {
                logger.info("This entry was already crawled; the database will not take it again!");
                continue;
            }
            parseSingleMovieUrl(PROJECT_URL + movieUrl);
        }
    }

    /**
     * Fetch a single film page.
     */
    private static void parseSingleMovieUrl(String movieUrl) {
        logger.info("Started crawling film page: " + movieUrl);
        httpClient = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet(movieUrl);
        RequestConfig config = RequestConfig.custom().setConnectTimeout(5000).setSocketTimeout(8000).build();
        httpGet.setConfig(config);
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            if (response == null) {
                logger.info(movieUrl + ": no response");
                return;
            }
            if (response.getStatusLine().getStatusCode() == 200) {
                HttpEntity entity = response.getEntity();
                String blogContent = EntityUtils.toString(entity, "GBK");
                parseSingleMovieContent(blogContent, movieUrl);
            }
        } catch (ClientProtocolException e) {
            logger.error(movieUrl + "-ClientProtocolException", e);
        } catch (IOException e) {
            logger.error(movieUrl + "-IOException", e);
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
            } catch (IOException e) {
                logger.error(movieUrl + "-IOException", e);
            }
        }
        logger.info("Finished crawling film page: " + movieUrl);
    }

    /**
     * Parse the film page body (film name, description, poster address).
     */
    private static void parseSingleMovieContent(String pageContent, String movieUrl) {
        Document doc = Jsoup.parse(pageContent);
        Elements divEles = doc.select(".wrapper .main .moviedteail");
        // .wrapper .main .moviedteail .moviedteail_tt h1
        // .wrapper .main .moviedteail .moviedteail_list .moviedteail_list_short a
        // .wrapper .main .moviedteail .moviedteail_img img
        Elements h1Eles = divEles.select(".moviedteail_tt h1");
        if (h1Eles.size() == 0) {
            logger.info("Film name is empty; not inserting into the database!");
            return;
        }
        String mname = h1Eles.get(0).html();
        Elements aEles = divEles.select(".moviedteail_list .moviedteail_list_short a");
        if (aEles.size() == 0) {
            logger.info("Film description is empty; not inserting into the database!");
            return;
        }
        String mdesc = aEles.get(0).html();
        Elements imgEles = divEles.select(".moviedteail_img img");
        if (imgEles.size() == 0) {
            logger.info("Film poster is empty; not inserting into the database!");
            return;
        }
        String mimg = imgEles.attr("src");
        String sql = "insert into movie(mname,mdesc,mimg,mlink) values(?,?,?,99)";
        try {
            System.out.println("****************" + mname + "***********************");
            System.out.println("****************" + mdesc + "***********************");
            System.out.println("****************" + mimg + "***********************");
            PreparedStatement pst = con.prepareStatement(sql);
            pst.setObject(1, mname);
            pst.setObject(2, mdesc);
            pst.setObject(3, mimg);
            if (pst.executeUpdate() == 0) {
                logger.info("Failed to insert the crawled film into the database");
            } else {
                cache.put(new net.sf.ehcache.Element(movieUrl, movieUrl));
                logger.info("Inserted the crawled film into the database");
            }
        } catch (SQLException e) {
            logger.error("Data error-SQLException:", e);
        }
    }

    public static void main(String[] args) {
        manager = CacheManager.create(PropertiesUtil.getValue("ehcacheXmlPath"));
        cache = manager.getCache("8gli_movies");
        httpClient = HttpClients.createDefault();
        DbUtil dbUtil = new DbUtil();
        try {
            con = dbUtil.getCon();
            List<String> urls = getUrls();
            for (String url : urls) {
                try {
                    parseUrl(url);
                } catch (Exception e) {
                    // skip failed listing pages and move on
                }
            }
        } catch (Exception e1) {
            logger.error("Database connection failed!");
        } finally {
            try {
                if (httpClient != null) {
                    httpClient.close();
                }
                if (con != null) {
                    con.close();
                }
            } catch (IOException e) {
                logger.error("Error closing HTTP client-IOException:", e);
            } catch (SQLException e) {
                logger.error("Error closing connection-SQLException:", e);
            }
        }
        // finally flush the cache to disk
        if (cache.getStatus() == Status.STATUS_ALIVE) {
            cache.flush();
        }
        manager.shutdown();
    }
}
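This crawler writes to a movie table whose DDL is likewise never shown in the article. A hypothetical schema matching insert into movie(mname,mdesc,mimg,mlink) values(?,?,?,99) might be:

CREATE TABLE `movie` (
  `mid`   BIGINT PRIMARY KEY AUTO_INCREMENT,  -- assumed surrogate key
  `mname` VARCHAR(255),  -- film title, from .moviedteail_tt h1
  `mdesc` TEXT,          -- description, from .moviedteail_list_short a
  `mimg`  VARCHAR(500),  -- poster URL, from .moviedteail_img img
  `mlink` INT            -- the crawler always inserts the literal 99 here
);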

