Sharing how my novel site collects novels

Published: 2023/12/25

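In my spare time I put together a novel site. I had no particular goal: it was partly a hobby and partly a small contribution to fellow novel readers (a friend once asked me to top up his account on a novel site so he could keep reading, which hardly seemed worth the money). On the surface it is just a novel site with a handful of pages: novel details, the chapter list, and the chapter content. The core business is collecting chapter content. It does not look like much, yet it cost me a lot of time and energy, and I took plenty of wrong turns. I am writing down the bumps along the way as a reference for anyone else building a novel site. This post only covers the collection logic; building, launching, running, and optimizing the site are out of scope.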

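The collection tool went through three stages. The first stage relied directly on pirate sites. It was the most naive and also the simplest: pick a few pirate sites, collect novel detail URLs from their update lists, collect the novel details (title, author, synopsis, genre, cover, chapter-list URL, and so on) from each detail URL, collect the chapter information (chapter name, update time, word count, chapter content URL) from the chapter-list URL, and finally fetch the chapter content from the chapter content URL. Everything was done with regular expressions, which is simple enough that someone who has never used them can get it working in a couple of days. In short, the flow is: update list -> novel details -> chapter list -> chapter content.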
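As a rough illustration of the regex step in that flow, here is a minimal sketch that pulls chapter names and URLs out of a chapter-list page. The class name `ChapterLinkExtractor`, the method names, and the sample HTML are assumptions made for this example; a real site needs a pattern written for its own markup.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of the original tool.
public static class ChapterLinkExtractor
{
    // Matches <a ... href="URL" ...>TEXT</a>. Group 1 is the URL, group 2 the link text.
    private static readonly Regex AnchorRegex = new Regex(
        "<a\\s[^<>]*?href\\s*=\\s*['\"]?([^'\"<>\\s]+)['\"]?[^<>]*>(.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns (chapter name, chapter URL) pairs in page order.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var links = new List<KeyValuePair<string, string>>();
        if (string.IsNullOrEmpty(html)) return links;

        foreach (Match m in AnchorRegex.Matches(html))
        {
            string url = m.Groups[1].Value.Trim();
            // Drop any tags nested inside the link text, such as <b> or <span>.
            string name = Regex.Replace(m.Groups[2].Value, "<[^<>]+>", "").Trim();
            if (name.Length > 0)
                links.Add(new KeyValuePair<string, string>(name, url));
        }
        return links;
    }

    // Usage example on an in-memory page; prints one "name -> url" line per link.
    public static void Demo()
    {
        string html =
            "<ul><li><a href=\"/book/1/1.html\">Chapter 1 Arrival</a></li>" +
            "<li><a class='c' href='/book/1/2.html'>Chapter 2 <b>Storm</b></a></li></ul>";
        foreach (KeyValuePair<string, string> link in Extract(html))
            Console.WriteLine(link.Key + " -> " + link.Value);
    }
}
```

The pattern is deliberately loose about quoting and attribute order, which is exactly why it has to be revisited whenever a site changes its layout.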

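Collecting novels this way has several problems:

1. Regular expressions have to be written specifically for each pirate site (update list, novel details, chapter list, chapter content). Pirate sites change their layout often, and the expressions have to change with them.

2. Pirate sites are unstable. For lack of investment or some other reason, a site may be gone tomorrow or the day after.

3. The chosen pirate site is not necessarily the fastest to update. When you depend directly on one specific site, you cannot get the latest chapter until that site has it, so you are slow.

4. Chapters can go missing. This one is fatal: many readers leave as soon as they hit a gap.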

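To get around these problems I started thinking again. The fatal problem had to be solved first: no missing chapters. That means the chapter list has to come from the licensed site, and whatever order the licensed site uses, including volumes, is the order I use. After some thought I arrived at the second stage: collect novel detail URLs from the licensed site's update list, collect the novel details from the detail URL, collect the complete chapter list from the chapter-list URL, and collect the content of the non-VIP chapters from their content URLs (VIP chapter content cannot be collected there). Then look the book up by title on a designated pirate site to find its chapter-list URL, collect that chapter list, match the VIP chapter names against it to get the pirate content URLs, and fetch the VIP chapter content from those. In short, the flow is: licensed chapter list -> non-VIP content from the licensed site -> match VIP chapter names on the pirate site -> VIP content.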
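The name-matching step of this second stage can be sketched as follows. This is only an illustration under assumptions: the `ChapterRef` type, the `VipChapterResolver` class, and the example URLs are invented here, and names are compared exactly rather than with the fuzzier comparison a real tool would need.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical chapter record for illustration only.
public class ChapterRef
{
    public string Name;
    public bool IsVip;
    public string SourceUrl;   // where the chapter content will be fetched from
}

public static class VipChapterResolver
{
    // The licensed list stays authoritative for order and completeness.
    // VIP chapters get their SourceUrl from the mirror's name -> URL map;
    // the names of VIP chapters the mirror lacks are returned.
    public static List<string> Resolve(
        List<ChapterRef> officialChapters,
        IDictionary<string, string> mirrorUrlsByName)
    {
        var missing = new List<string>();
        foreach (ChapterRef chapter in officialChapters)
        {
            if (!chapter.IsVip) continue;   // free chapters keep their licensed URL

            string url;
            if (mirrorUrlsByName.TryGetValue(chapter.Name.Trim(), out url))
            {
                chapter.SourceUrl = url;
            }
            else
            {
                chapter.SourceUrl = null;
                missing.Add(chapter.Name);
            }
        }
        return missing;
    }

    // Usage example: one free chapter, one VIP chapter found, one VIP chapter missing.
    public static void Demo()
    {
        var chapters = new List<ChapterRef>
        {
            new ChapterRef { Name = "Chapter 1", IsVip = false, SourceUrl = "http://licensed.example/1" },
            new ChapterRef { Name = "Chapter 2", IsVip = true },
            new ChapterRef { Name = "Chapter 3", IsVip = true }
        };
        var mirror = new Dictionary<string, string> { { "Chapter 2", "http://mirror.example/2" } };

        foreach (string name in Resolve(chapters, mirror))
            Console.WriteLine("still missing: " + name);
    }
}
```

Because the loop walks the licensed list rather than the mirror's list, a chapter the mirror lacks shows up as an explicit gap to fill from another source instead of silently disappearing.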

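Compared with the first stage this solved the missing-chapter problem, but it introduced more tedious logic: the tool now depends on the licensed site and on pirate sites as well. At least collection no longer stops just because one pirate site goes down. A novel can map to chapter lists on several pirate sites, and the more pirate addresses we find for each novel, the more reliable its updates become. The workload grows, though: every additional pirate site needs its own regular expressions, and the stage-one problem remains, because there is no guarantee that the designated sites are the fastest to update. So I thought: how nice it would be if VIP chapters did not depend on specific pirate sites at all, but on some intermediate proxy address. Then I would not have to care how a site is structured or which site updates fastest. That led me to search engines. I have to admire them; whatever you have or have not thought of, they have an answer for you. After some thought and experimentation I finally managed to collect VIP chapters through a search engine. After collecting the chapter list from the licensed site, filter out the VIP chapters that need collecting, then search for pirate sites using the novel title plus a few search keywords to obtain their chapter-list URLs, match the needed VIP chapter names against each list to get the chapter content URLs, and fetch the VIP chapter content from those. In short, the flow is: licensed chapter list -> filter VIP chapters -> search engine -> pirate chapter lists -> match chapter names -> VIP content.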
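The reason several search results are better than one fixed site is that each result only has to supply the chapters no earlier result could. A minimal sketch of that accumulation loop follows; `MultiSourceCollector`, its parameters, and the use of in-memory dictionaries to stand in for fetched sites are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only.
public static class MultiSourceCollector
{
    // wanted: chapter names to collect (assumed unique).
    // sources: one name -> content map per candidate site, in search-result order.
    // stopWhenMissingBelow: stop trying further sites once fewer than this many
    // chapters are still missing.
    public static Dictionary<string, string> Collect(
        IList<string> wanted,
        IEnumerable<IDictionary<string, string>> sources,
        int stopWhenMissingBelow)
    {
        var collected = new Dictionary<string, string>();
        foreach (IDictionary<string, string> source in sources)
        {
            foreach (string name in wanted)
            {
                if (collected.ContainsKey(name)) continue;   // an earlier site already had it

                string content;
                if (source.TryGetValue(name, out content))
                    collected[name] = content;
            }

            // Good enough: do not hit more sites for a handful of chapters.
            if (collected.Count > 0 && wanted.Count - collected.Count < stopWhenMissingBelow)
                break;
        }
        return collected;
    }

    // Usage example with three stand-in "sites".
    public static void Demo()
    {
        var wanted = new List<string> { "a", "b", "c" };
        var sources = new List<IDictionary<string, string>>
        {
            new Dictionary<string, string> { { "a", "A from site 1" } },
            new Dictionary<string, string> { { "a", "A from site 2" }, { "b", "B from site 2" } },
            new Dictionary<string, string> { { "c", "C from site 3" } }
        };
        Dictionary<string, string> result = Collect(wanted, sources, 1);
        Console.WriteLine(result.Count + " of " + wanted.Count + " chapters collected");
    }
}
```

The early exit trades completeness for fewer requests, so the threshold is a tuning knob rather than a fixed rule.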

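The idea is good, but how do you implement it? We do not know which pirate sites the search engine will return, and we do not know the page structure of those sites, so how do we extract the chapter content?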

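Persistence pays off in the end. I looked at many novel sites and found that chapter content almost always sits between <td></td> or <div></div>, or is written straight into the page without any wrapping HTML tag. For the first case we can use htmlparse to turn the text into a doc object, walk through the content of every tag, and test that content. For the second case we still need a regular expression: match the runs of Chinese characters, then look at the HTML between each pair of adjacent runs. If the only things in between are <br>, <p>, or &nbsp;, the runs are almost certainly chapter content. Here is how the code does it.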
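For the first case, the test applied to each tag boils down to "does this block contain enough Chinese text?". The sketch below shows that idea on its own, using a regular expression instead of an HTML parser so that it stays self-contained. `ContentBlockPicker` and its methods are hypothetical names, and the pattern only handles <td> and <div> blocks that do not contain another block of the same kind, which is a simplification a real parser would not need.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class ContentBlockPicker
{
    // A <td> or <div> block; group 2 is its inner HTML. Nesting of the same tag is not handled.
    private static readonly Regex BlockRegex = new Regex(
        "<(td|div)\\b[^<>]*>(.*?)</\\1\\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Common CJK ideographs, the same range the original regex fallback uses.
    private static readonly Regex HanRegex = new Regex("[\u4E00-\u9FA5]");

    public static int CountHan(string text)
    {
        return string.IsNullOrEmpty(text) ? 0 : HanRegex.Matches(text).Count;
    }

    // Returns the inner HTML of the block with the most Chinese characters,
    // or an empty string when no block reaches minHanChars.
    public static string PickContent(string html, int minHanChars)
    {
        string best = string.Empty;
        int bestCount = 0;
        foreach (Match m in BlockRegex.Matches(html ?? string.Empty))
        {
            string inner = m.Groups[2].Value;
            string text = Regex.Replace(inner, "<[^<>]+>", "");   // strip tags before counting
            int count = CountHan(text);
            if (count >= minHanChars && count > bestCount)
            {
                best = inner;
                bestCount = count;
            }
        }
        return best;
    }

    // Usage example: the navigation block is skipped, the text block is returned.
    public static void Demo()
    {
        string html = "<div>menu 首頁</div>" +
                      "<div id='c'>第一段文字內容。<br>第二段文字內容。</div>";
        Console.WriteLine(PickContent(html, 5));
    }
}
```

Counting only Chinese characters, rather than raw length, keeps script-heavy or link-heavy blocks from being mistaken for chapter text.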

    using System;
    using System.Collections.Generic;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Web;

    // The members below live inside the collector class. NetSiteCatchManager, LogHelper,
    // BookChapterInfo and the HtmlDocument/HtmlElement parser are helpers not shown in this post.

    /// <summary>Minimum length a chapter's content must have (1000 by default).</summary>
    private int m_chapter_content_length = 1000;

    /// <summary>Collects VIP chapters through Baidu search results.</summary>
    /// <param name="i_book_name">Novel title</param>
    /// <param name="i_chapter_list">VIP chapters that need to be collected</param>
    public List<BookChapterInfo> CollectBookChapterList(string i_book_name, List<BookChapterInfo> i_chapter_list)
    {
        // Strip Latin letters, digits and brackets from the title before searching.
        i_book_name = Regex.Replace(i_book_name, "[a-zA-Z0-9()（）]", "", RegexOptions.Compiled | RegexOptions.IgnoreCase);
        if (i_chapter_list == null || i_chapter_list.Count <= 0) return null;

        Encoding t_encoding = Encoding.GetEncoding("gb2312");
        string t_key_word = string.Format("小說{0}最新章節txt", i_book_name);
        string t_baidu_url = string.Format("http://www.baidu.com/s?wd={0}", HttpUtility.UrlEncode(t_key_word, t_encoding));
        // Search result entry: group 1 is the result URL, group 2 the result title.
        string t_list_reg = "<h3\\s*?class=['\"]?t['\"]?><a[^<>]*?href\\s*=\\s*['\"]*([^\"']*)['\"]*[^<>]*?>(.*?)</a>\\s*?</h3>";
        List<BookChapterInfo> t_need_collect_list = new List<BookChapterInfo>();
        List<BookChapterInfo> t_collect_chapter_list = new List<BookChapterInfo>();
        List<BookChapterInfo> t_vip_chapter_list = new List<BookChapterInfo>();
        string t_book_url = string.Empty;
        try
        {
            string t_html = NetSiteCatchManager.ReadUrl(t_baidu_url, t_encoding);
            if (!string.IsNullOrEmpty(t_html))
            {
                MatchCollection t_ma = Regex.Matches(t_html, t_list_reg, RegexOptions.IgnoreCase | RegexOptions.Compiled);
                for (int index = 0; index < t_ma.Count; index++)
                {
                    t_book_url = t_ma[index].Groups[1].Value;
                    t_html = NetSiteCatchManager.ReadUrl(t_book_url, Encoding.Default);
                    // Only ask this site for chapters that earlier sites did not supply.
                    t_need_collect_list = GetNeedCollectChapter(i_chapter_list, t_vip_chapter_list);
                    t_collect_chapter_list = GetBookChapterList(t_book_url, t_html, t_need_collect_list, i_book_name);
                    if (t_collect_chapter_list != null && t_collect_chapter_list.Count > 0)
                        t_vip_chapter_list.AddRange(t_collect_chapter_list);
                    // Stop once fewer than 10 chapters are still missing.
                    if (t_vip_chapter_list.Count > 0 && i_chapter_list.Count - t_vip_chapter_list.Count < 10)
                        break;
                }
            }
        }
        catch (Exception ex)
        {
            LogHelper.Error("Failed to collect the chapter list from Baidu" + ex.ToString());
        }
        return t_vip_chapter_list;
    }

    /// <summary>Returns the chapters that have not been collected yet.</summary>
    private List<BookChapterInfo> GetNeedCollectChapter(List<BookChapterInfo> i_vip_list, List<BookChapterInfo> i_have_collect_list)
    {
        if (i_have_collect_list == null || i_have_collect_list.Count <= 0) return i_vip_list;
        List<BookChapterInfo> t_list = new List<BookChapterInfo>();
        foreach (BookChapterInfo t_chapter in i_vip_list)
        {
            bool t_collected = i_have_collect_list.Exists(delegate(BookChapterInfo t_have_chapter)
            {
                return t_chapter.ChapterName == t_have_chapter.ChapterName;
            });
            if (!t_collected) t_list.Add(t_chapter);
        }
        return t_list;
    }

    /// <summary>Matches the wanted chapters against the links on a chapter-list page and collects them.</summary>
    private List<BookChapterInfo> GetBookChapterList(string i_url, string i_html, List<BookChapterInfo> i_chapter_list, string i_book_name)
    {
        if (!NetSiteCatchManager.IsPiraticSite(i_url)) return null;
        if (string.IsNullOrEmpty(i_html)) return null;
        // Any link on the page: group 1 is the href, group 2 the link text.
        string t_chapter_name_reg = "<a[^<>]*?href\\s*=\\s*['\"]*([^\"']*)['\"]*[^<>]*?>(.*?)</a>";
        List<BookChapterInfo> t_chapter_list = new List<BookChapterInfo>();
        BookChapterInfo t_chapter = null;
        bool t_is_stop = false;
        string t_chapter_url = string.Empty;
        try
        {
            MatchCollection t_ma = Regex.Matches(i_html, t_chapter_name_reg, RegexOptions.IgnoreCase | RegexOptions.Compiled);
            foreach (BookChapterInfo t_ch in i_chapter_list)
            {
                foreach (Match t_mc in t_ma)
                {
                    if (!CompareChapterName(t_mc.Groups[2].Value.Trim(), t_ch.ChapterName)) continue;
                    t_chapter_url = NetSiteCatchManager.GetFullUrl(i_url, t_mc.Groups[1].Value.Trim());
                    if (string.IsNullOrEmpty(t_chapter_url)) { t_is_stop = true; break; }
                    t_chapter = GetBookChapter(t_chapter_url, t_ch, i_book_name);
                    // Give up on this site at the first chapter that cannot be collected.
                    if (t_chapter == null) { t_is_stop = true; break; }
                    t_chapter_list.Add(t_chapter);
                    break;
                }
                if (t_is_stop) break;
            }
            return t_chapter_list;
        }
        catch (Exception ex)
        {
            LogHelper.Error("Failed to extract chapter names from the search result page" + ex.ToString());
            return null;
        }
    }

    /// <summary>Builds the chapter record, fetching its content from the given URL.</summary>
    private BookChapterInfo GetBookChapter(string i_url, BookChapterInfo i_chapter, string i_book_name)
    {
        // The final chapter (name contains 完, 終 or 結) does not necessarily reach 1000 characters.
        if (i_chapter.ChapterName.IndexOf("完") > -1 || i_chapter.ChapterName.IndexOf("終") > -1 || i_chapter.ChapterName.IndexOf("結") > -1)
            m_chapter_content_length = 300;
        else
            m_chapter_content_length = 1000;

        string t_chapter_content = GetContent(i_url, i_chapter.ChapterName);
        t_chapter_content = NetSiteCatchManager.ReplaceContent(t_chapter_content);
        if (string.IsNullOrEmpty(t_chapter_content) || t_chapter_content.Length < m_chapter_content_length)
        {
            // Fall back to searching for the chapter by name.
            t_chapter_content = GetChapterContentByChapterName(i_book_name, i_chapter.ChapterName);
            t_chapter_content = NetSiteCatchManager.ReplaceContent(t_chapter_content);
            if (string.IsNullOrEmpty(t_chapter_content) || t_chapter_content.Length < m_chapter_content_length)
                return null;
        }
        t_chapter_content = string.Format("document.write('{0}');", t_chapter_content);

        BookChapterInfo t_chapter_info = new BookChapterInfo();
        t_chapter_info.ChapterName = i_chapter.ChapterName;
        t_chapter_info.ChapterContent = t_chapter_content;
        t_chapter_info.WordsCount = t_chapter_content.Length;
        t_chapter_info.Comfrom = i_url;
        t_chapter_info.IsVip = i_chapter.IsVip;
        t_chapter_info.UpdateTime = i_chapter.UpdateTime;
        t_chapter_info.VolumeName = i_chapter.VolumeName;
        t_chapter_info.BookId = i_chapter.BookId;
        t_chapter_info.SiteId = i_chapter.SiteId;
        return t_chapter_info;
    }

    /// <summary>Downloads a page and extracts the chapter content from it.</summary>
    private string GetContent(string i_url, string i_chapter_name)
    {
        Encoding t_encoding = Encoding.Default;
        try
        {
            string t_html = NetSiteCatchManager.ReadUrl(i_url, t_encoding);
            if (string.IsNullOrEmpty(t_html))
            {
                // Retry once.
                t_html = NetSiteCatchManager.ReadUrl(i_url, t_encoding);
            }
            return GetChapterContent(t_html);
        }
        catch (Exception ex)
        {
            LogHelper.Error("Failed to get the page content" + ex.ToString());
            return string.Empty;
        }
    }

    /// <summary>Extracts the chapter content from HTML: td elements first, then div, then the regex fallback.</summary>
    private string GetChapterContent(string i_html)
    {
        HtmlDocument t_html_doc = HtmlDocument.Create(i_html);
        string t_content = string.Empty;
        string t_temp_content = string.Empty;
        foreach (HtmlElement t_ele in t_html_doc.GetElementsByTagName("td"))
        {
            // Measure the text after removing leftover markup, Latin letters and digits.
            t_temp_content = t_ele.InnerText;
            t_temp_content = Regex.Replace(t_temp_content, "<.*?>.*?</.*?>", "", RegexOptions.IgnoreCase | RegexOptions.Compiled);
            t_temp_content = Regex.Replace(t_temp_content, "[a-zA-Z0-9]", "", RegexOptions.IgnoreCase | RegexOptions.Compiled);
            if (t_temp_content.Length > m_chapter_content_length) t_content = t_ele.HTML;
        }
        if (!string.IsNullOrEmpty(t_content)) return t_content;

        foreach (HtmlElement t_ele in t_html_doc.GetElementsByTagName("div"))
        {
            t_temp_content = t_ele.InnerText;
            t_temp_content = Regex.Replace(t_temp_content, "<.*?>.*?</.*?>", "", RegexOptions.IgnoreCase | RegexOptions.Compiled);
            t_temp_content = Regex.Replace(t_temp_content, "[a-zA-Z0-9,\\/;_()]", "", RegexOptions.IgnoreCase | RegexOptions.Compiled);
            if (t_temp_content.Length > m_chapter_content_length) t_content = t_ele.HTML;
        }
        if (string.IsNullOrEmpty(t_content) || t_content.Length < m_chapter_content_length)
            t_content = GetContentByReg(i_html);
        if (t_content.Length < m_chapter_content_length) return string.Empty;
        return t_content;
    }

    /// <summary>Extracts chapter content with a regular expression, for pages that do not wrap it in a tag.</summary>
    private string GetContentByReg(string i_html)
    {
        StringBuilder t_sb = new StringBuilder();
        // A run of text that starts and ends with a Chinese character and contains no tags.
        string t_reg = "([\u4E00-\u9FA5][^<>]*[\u4E00-\u9FA5])";
        MatchCollection t_ma = Regex.Matches(i_html, t_reg, RegexOptions.IgnoreCase | RegexOptions.Compiled);
        string t_sub_html = string.Empty;
        int t_start_index = 0;
        int t_length = 0;
        int t_total_count = t_ma.Count;
        for (int index = 0; index < t_total_count - 1; index++)
        {
            // The HTML between this run and the next one.
            t_start_index = t_ma[index].Index + t_ma[index].Groups[1].Value.Length;
            t_length = t_ma[index + 1].Index - t_start_index;
            t_sub_html = i_html.Substring(t_start_index, t_length);
            // Remove what normally separates paragraphs: &nbsp;, whitespace, <p>, <br> and punctuation.
            // (The post as published showed a plain space here; "&nbsp;" is the presumed original.)
            t_sub_html = Regex.Replace(t_sub_html, "&nbsp;|\\s", "", RegexOptions.IgnoreCase | RegexOptions.Compiled);
            t_sub_html = Regex.Replace(t_sub_html, "<[/]*p[^<>]*>", "", RegexOptions.IgnoreCase | RegexOptions.Compiled);
            t_sub_html = Regex.Replace(t_sub_html, "<[/]*br[^<>]*>", "", RegexOptions.IgnoreCase | RegexOptions.Compiled);
            t_sub_html = Regex.Replace(t_sub_html, "[【】（），！？(),!?;；、……]", "", RegexOptions.IgnoreCase | RegexOptions.Compiled);
            // Almost nothing else in between: treat the run as a paragraph of the chapter.
            if (t_sub_html.Length < 10)
            {
                t_sb.Append(t_ma[index].Groups[1].Value);
                t_sb.Append("<p>&nbsp;&nbsp;&nbsp;&nbsp;");
            }
        }
        return t_sb.ToString();
    }

    /// <summary>Decides whether two chapter names refer to the same chapter.</summary>
    private bool CompareChapterName(string i_chapter_source, string i_chapter_target)
    {
        if (i_chapter_source.Equals(i_chapter_target)) return true;
        // Remove whitespace and punctuation before comparing.
        i_chapter_source = Regex.Replace(i_chapter_source, "[\\s【】（），！？(),!?;\\.；、/……]", "", RegexOptions.IgnoreCase | RegexOptions.Compiled);
        i_chapter_target = Regex.Replace(i_chapter_target, "[\\s【】（），！？(),!?;；\\.、/……]", "", RegexOptions.IgnoreCase | RegexOptions.Compiled);
        // An empty name would be "contained" in every other name, so never treat it as a match.
        if (i_chapter_source.Length == 0 || i_chapter_target.Length == 0) return false;
        if (i_chapter_source.IndexOf(i_chapter_target) > -1 || i_chapter_target.IndexOf(i_chapter_source) > -1)
            return true;
        return false;
    }

    /// <summary>Searches for a single chapter by name and collects its content.</summary>
    private string GetChapterContentByChapterName(string i_book_name, string i_chapter_name)
    {
        string t_key_word = i_chapter_name;
        // Chapter names shorter than 5 characters are too ambiguous, so prepend the book title.
        if (i_chapter_name.Length < 5)
        {
            t_key_word = string.Format("{0} {1}", i_book_name, i_chapter_name);
        }
        Encoding t_encoding = Encoding.GetEncoding("gb2312");
        string t_baidu_url = string.Format("http://www.baidu.com/s?wd={0}", HttpUtility.UrlEncode(t_key_word, t_encoding));
        string t_list_reg = "<h3\\s*?class=['\"]?t['\"]?><a[^<>]*?href\\s*=\\s*['\"]*([^\"']*)['\"]*[^<>]*?>(.*?)</a>\\s*?</h3>";
        string t_chapter_url = string.Empty;
        string t_chapter_content = string.Empty;
        try
        {
            string t_html = NetSiteCatchManager.ReadUrl(t_baidu_url, t_encoding);
            if (!string.IsNullOrEmpty(t_html))
            {
                MatchCollection t_ma = Regex.Matches(t_html, t_list_reg, RegexOptions.IgnoreCase | RegexOptions.Compiled);
                foreach (Match t_mc in t_ma)
                {
                    t_chapter_url = t_mc.Groups[1].Value;
                    t_html = NetSiteCatchManager.ReadUrl(t_chapter_url, Encoding.Default);
                    if (string.IsNullOrEmpty(t_html))
                    {
                        // Retry once.
                        t_html = NetSiteCatchManager.ReadUrl(t_chapter_url, Encoding.Default);
                    }
                    // Skip pages that do not mention this book and chapter.
                    if (NetSiteCatchManager.IsContainChapterName(i_book_name, i_chapter_name, t_html) == false)
                        continue;
                    t_chapter_content = GetChapterContent(t_html);
                    if (!string.IsNullOrEmpty(t_chapter_content) && t_chapter_content.Length > m_chapter_content_length)
                        break;
                }
            }
            return t_chapter_content;
        }
        catch (Exception ex)
        {
            LogHelper.Error("Failed to collect the chapter by chapter name" + ex.ToString());
            return string.Empty;
        }
    }

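The stage-three code has now been running normally for two days, and in my excitement I could not wait to share it. Novel fans are welcome to take a look at my little site (http://www.dazhongxiaoshuo.com). I spend most of my time after work reading novels, so next I will put some time into a mobile reading feature; reading novels under the covers is what I am after.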
