Jsoup: a Java toolkit for parsing web pages
Jsoup is a very good library for parsing web pages. It is written in Java and lets you find and extract content from a document using DOM-style methods and CSS-style selectors.
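Today I built a small project that parses a website with Jsoup. When connecting with Jsoup.connect(url).get(), the call occasionally failed with
java.net.SocketTimeoutException: Read timed out.
The cause is that the default timeout is fairly short (3 seconds in older jsoup releases; newer releases default to 30 seconds), while some sites respond slowly,
so the read gives up before the response has fully arrived.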
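The mechanism can be reproduced without jsoup or a real website. The following is a minimal sketch using only the JDK: a local server thread waits before sending a single byte, and the client reads with a configurable socket read timeout. The class name SlowServerDemo, the method readSucceeds, and the delay values are made up for illustration and are not part of jsoup.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SlowServerDemo {

    /**
     * Starts a local server that waits serverDelayMillis before sending one byte,
     * then reads from it with the given read timeout.
     * Returns true if the byte arrived, false if the read timed out.
     * (Hypothetical helper for illustration only.)
     */
    public static boolean readSucceeds(int serverDelayMillis, int readTimeoutMillis) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread slowServer = new Thread(() -> {
                try (Socket peer = server.accept()) {
                    Thread.sleep(serverDelayMillis);      // simulate a slow site
                    peer.getOutputStream().write('x');
                } catch (IOException | InterruptedException ignored) {
                    // The client may already have given up; nothing to do in this demo.
                }
            });
            slowServer.setDaemon(true);
            slowServer.start();

            try (Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
                client.setSoTimeout(readTimeoutMillis);   // read timeout in milliseconds
                return client.getInputStream().read() == 'x';
            } catch (SocketTimeoutException e) {
                return false;                             // "Read timed out"
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Server answers after 400 ms: a 100 ms timeout is too short, 3000 ms is enough.
        System.out.println("100 ms timeout succeeded: " + readSucceeds(400, 100));
        System.out.println("3000 ms timeout succeeded: " + readSucceeds(400, 3000));
    }
}
```

The same server delay produces a failure or a success depending only on the client's read timeout, which is the situation described above.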
Solution:
Set an explicit timeout when connecting.
doc = Jsoup.connect(url).timeout(5000).get();
The argument is in milliseconds, so 5000 sets the timeout to 5 s.
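A longer timeout makes the exception less likely but cannot rule it out, so it may be worth retrying a request that timed out. The following is only a sketch under that assumption: fetchWithRetry, maxAttempts and pauseMillis are hypothetical names, not jsoup API, and the "slow site" in main is simulated so the example runs with the JDK alone.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

public class TimeoutRetry {

    /**
     * Runs the task and retries it only when it fails with SocketTimeoutException.
     * Any other exception is propagated immediately.
     * maxAttempts and pauseMillis are illustrative parameters, not jsoup settings.
     */
    public static <T> T fetchWithRetry(Callable<T> task, int maxAttempts, long pauseMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SocketTimeoutException lastTimeout = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (SocketTimeoutException e) {
                lastTimeout = e;                 // remember the failure and try again
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis);   // short pause between attempts
                }
            }
        }
        throw lastTimeout;                       // every attempt timed out
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Stand-in for a slow site: times out twice, then answers.
        String page = fetchWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new SocketTimeoutException("Read timed out");
            }
            return "<title>ok</title>";
        }, 3, 10L);
        System.out.println(page + " after " + calls[0] + " attempts");
    }
}
```

With jsoup on the classpath, the task passed in would be something like () -> Jsoup.connect(url).timeout(5000).get(), which returns a Document instead of a String.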
The test code is as follows.
1. Without a timeout:
package jsoupTest;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JsoupTest {
    public static void main(String[] args) throws IOException {
        String url = "http://www.weather.com.cn/weather/101010400.shtml";
        long start = System.currentTimeMillis();
        Document doc = null;
        try {
            // No timeout() call, so jsoup's default timeout applies.
            doc = Jsoup.connect(url).get();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            System.out.println("Time is:" + (System.currentTimeMillis() - start) + "ms");
        }
        // If the request failed, doc is still null here and the next line
        // throws NullPointerException (visible at the end of the trace below).
        Elements elem = doc.getElementsByTag("Title");
        System.out.println("Title is:" + elem.text());
    }
}
Sometimes the request times out. Because doc is then still null, a NullPointerException follows:
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(Unknown Source)
at java.net.SocketInputStream.read(Unknown Source)
at java.io.BufferedInputStream.fill(Unknown Source)
at java.io.BufferedInputStream.read1(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
at sun.net.www.http.ChunkedInputStream.fastRead(Unknown Source)
at sun.net.www.http.ChunkedInputStream.read(Unknown Source)
at java.io.FilterInputStream.read(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(Unknown Source)
at java.util.zip.InflaterInputStream.fill(Unknown Source)
at java.util.zip.InflaterInputStream.read(Unknown Source)
at java.util.zip.GZIPInputStream.read(Unknown Source)
at java.io.BufferedInputStream.read1(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
at java.io.FilterInputStream.read(Unknown Source)
at org.jsoup.helper.DataUtil.readToByteBuffer(DataUtil.java:113)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:447)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:393)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:159)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:148)
at jsoupTest.JsoupTest.main(JsoupTest.java:17)
Time is:3885ms
Exception in thread "main" java.lang.NullPointerException
at jsoupTest.JsoupTest.main(JsoupTest.java:25)
2. With the timeout set, the request generally does not time out:
package jsoupTest;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JsoupTest {
    public static void main(String[] args) throws IOException {
        String url = "http://www.weather.com.cn/weather/101010400.shtml";
        long start = System.currentTimeMillis();
        Document doc = null;
        try {
            // Allow up to 5000 ms before giving up.
            doc = Jsoup.connect(url).timeout(5000).get();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            System.out.println("Time is:" + (System.currentTimeMillis() - start) + "ms");
        }
        if (doc == null) {
            // The request can still fail with a longer timeout; skip parsing
            // instead of running into a NullPointerException.
            return;
        }
        Elements elem = doc.getElementsByTag("Title");
        System.out.println("Title is:" + elem.text());
    }
}
Output:
Time is:4158ms
Title is:顺义天气预报-今日_明日_一周天气预报:16日星期五 多云转晴 11/-4℃