[Java Web] Sensitive-Word Filtering Algorithms
1. DFA algorithm
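For the theory behind the DFA (deterministic finite automaton) algorithm, see the linked reference. In short, nested Maps are used to build a tree of sensitive words: every path from the root to a node flagged as the end of a word (in the simplest case, a leaf) spells out one sensitive word, for example:
[Figure: example sensitive-word tree]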
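To make the tree shape concrete, here is a minimal sketch of such a word tree built from three sample words. The class `WordTreeDemo`, its `Node` type, the helpers `build` and `isWord`, and the sample words are all hypothetical names chosen for illustration; the post's own code stores the end-of-word flag as an `"isEnd"` entry inside each Map rather than in a typed node.

```java
import java.util.HashMap;
import java.util.Map;

class WordTreeDemo {
    /** One tree node: child nodes keyed by character, plus an end-of-word flag. */
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isEnd;
    }

    /** Builds the tree; words with a common prefix share the nodes of that prefix. */
    static Node build(String... words) {
        Node root = new Node();
        for (String word : words) {
            Node cur = root;
            for (char c : word.toCharArray()) {
                cur = cur.children.computeIfAbsent(c, k -> new Node());
            }
            cur.isEnd = true; // the path from the root to this node spells one word
        }
        return root;
    }

    /** Walks the tree along s; true only if the walk ends on an end-of-word node. */
    static boolean isWord(Node root, String s) {
        Node cur = root;
        for (char c : s.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) {
                return false;
            }
        }
        return cur.isEnd;
    }

    public static void main(String[] args) {
        Node root = build("bad", "badge", "bug");
        System.out.println(root.children.size());  // 1, all three words start with 'b'
        System.out.println(isWord(root, "bad"));   // true, an inner node can end a word
        System.out.println(isWord(root, "badg"));  // false, a prefix only
        System.out.println(isWord(root, "badge")); // true
    }
}
```

Note that "bad" ends on an inner node of the path for "badge", which is why an explicit end-of-word flag is needed in addition to the tree structure.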
A simple implementation follows:
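import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TextFilterUtil {
    // Logger
    private static final Logger LOG = LoggerFactory.getLogger(TextFilterUtil.class);
    // Sensitive-word tree; keys are Characters (children) or the String "isEnd" (flag)
    private static Map<Object, Object> sensitiveWordMap = null;
    // Default encoding of the word file
    private static final String ENCODING = "gbk";
    // Word file on the classpath; getResourceAsStream returns null if it is missing
    private static final InputStream in =
            TextFilterUtil.class.getClassLoader().getResourceAsStream("sensitive/keyWords.txt");

    /**
     * Initializes the sensitive-word tree.
     */
    private static void init() {
        // Read the word file
        Set<String> keyWords = readSensitiveWords();
        // Build the tree
        sensitiveWordMap = new HashMap<>(keyWords.size());
        for (String keyWord : keyWords) {
            createKeyWord(keyWord);
        }
    }

    /**
     * Adds one word to the tree.
     *
     * @param keyWord the sensitive word to add
     */
    @SuppressWarnings("unchecked")
    private static void createKeyWord(String keyWord) {
        if (sensitiveWordMap == null) {
            LOG.error("sensitiveWordMap has not been initialized!");
            return;
        }
        Map<Object, Object> nowMap = sensitiveWordMap;
        for (char c : keyWord.toCharArray()) {
            Object obj = nowMap.get(c);
            if (obj == null) {
                Map<Object, Object> childMap = new HashMap<>();
                childMap.put("isEnd", "false");
                nowMap.put(c, childMap);
                nowMap = childMap;
            } else {
                nowMap = (Map<Object, Object>) obj;
            }
        }
        nowMap.put("isEnd", "true");
    }

    /**
     * Reads the sensitive-word file, one word per line.
     *
     * @return the set of words; empty if the file is missing or unreadable
     */
    private static Set<String> readSensitiveWords() {
        Set<String> keyWords = new HashSet<>();
        if (in == null) {
            LOG.error("Sensitive-word file does not exist!");
            return keyWords;
        }
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, ENCODING))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                // Skip blank lines so that the root is never flagged as a word end
                if (!word.isEmpty()) {
                    keyWords.add(word);
                }
            }
        } catch (UnsupportedEncodingException e) {
            LOG.error("Failed to decode the sensitive-word file!", e);
        } catch (IOException e) {
            LOG.error("Failed to read the sensitive-word file!", e);
        }
        return keyWords;
    }

    /**
     * Finds every sensitive word contained in the text.
     * The lazy initialization here is not thread-safe.
     *
     * @param text the text to check
     * @return all matches, in order of their start position
     */
    @SuppressWarnings("unchecked")
    public static List<String> checkSensitiveWord(String text) {
        if (sensitiveWordMap == null) {
            init();
        }
        List<String> sensitiveWords = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            Map<Object, Object> nowMap = sensitiveWordMap;
            // Follow the tree from position i for as long as the characters match
            for (int j = i; j < text.length(); j++) {
                Object obj = nowMap.get(text.charAt(j));
                if (obj == null) {
                    break;
                }
                nowMap = (Map<Object, Object>) obj;
                // Test the flag after every step, so that a word ending on the
                // last character of the text is reported as well
                if ("true".equals(nowMap.get("isEnd"))) {
                    sensitiveWords.add(text.substring(i, j + 1));
                }
            }
        }
        return sensitiveWords;
    }
}

2. TTMP algorithm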
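The TTMP algorithm was devised by a member of the online community; see the linked reference for its origin. The idea is to break the sensitive words down into a set of "dirty characters" and to test whether a candidate string is a sensitive word only when the candidate consists entirely of dirty characters, which reduces the number of comparisons. A simple implementation of this algorithm follows: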
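import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TextFilterUtil {
    // Logger
    private static final Logger LOG = LoggerFactory.getLogger(TextFilterUtil.class);
    // Default encoding of the word file
    private static final String ENCODING = "gbk";
    // Word file on the classpath; getResourceAsStream returns null if it is missing
    private static final InputStream in =
            TextFilterUtil.class.getClassLoader().getResourceAsStream("sensitive/keyWords.txt");
    // Dirty characters: every character that occurs in some sensitive word
    private static Set<Character> sensitiveCharSet = null;
    // Sensitive words
    private static Set<String> sensitiveWordSet = null;

    /**
     * Initializes the dirty-character set and the sensitive-word set.
     */
    private static void init() {
        // Create the containers
        sensitiveCharSet = new HashSet<>();
        sensitiveWordSet = new HashSet<>();
        // Read the file and fill both sets
        readSensitiveWords();
    }

    /**
     * Reads the local sensitive-word file, one word per line.
     */
    private static void readSensitiveWords() {
        if (in == null) {
            LOG.error("Sensitive-word file does not exist!");
            return;
        }
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, ENCODING))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (word.isEmpty()) {
                    continue;
                }
                sensitiveWordSet.add(word);
                for (char c : word.toCharArray()) {
                    sensitiveCharSet.add(c);
                }
            }
        } catch (UnsupportedEncodingException e) {
            LOG.error("Failed to decode the sensitive-word file!", e);
        } catch (IOException e) {
            LOG.error("Failed to read the sensitive-word file!", e);
        }
    }

    /**
     * Finds every sensitive word contained in the text.
     * The lazy initialization here is not thread-safe.
     *
     * @param text the text to check
     * @return all matches, in order of their start position
     */
    public static List<String> checkSensitiveWord(String text) {
        if (sensitiveWordSet == null || sensitiveCharSet == null) {
            init();
        }
        List<String> sensitiveWords = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (!sensitiveCharSet.contains(text.charAt(i))) {
                continue;
            }
            int j = i;
            while (j < text.length()) {
                // Stop at the first character that is not dirty. This must test the
                // character at j, not the one at i, or the loop never stops early.
                if (!sensitiveCharSet.contains(text.charAt(j))) {
                    break;
                }
                String key = text.substring(i, j + 1);
                if (sensitiveWordSet.contains(key)) {
                    sensitiveWords.add(key);
                }
                j++;
            }
        }
        return sensitiveWords;
    }
}

Note: the code above is only meant to illustrate the idea; in real use there is still plenty of room for optimization.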
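To try the dirty-character idea without a word file, the following is a minimal in-memory sketch. `TtmpSketch`, its constructor, the method `find`, and the sample words are hypothetical names chosen for illustration and are not part of the code above. It assumes the word list is passed in directly and shows how a candidate stops growing at the first character that is not a dirty character.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TtmpSketch {
    private final Set<String> words = new HashSet<>();
    private final Set<Character> dirtyChars = new HashSet<>();

    /** Collects the words and every character that occurs in any of them. */
    TtmpSketch(String... sensitiveWords) {
        for (String word : sensitiveWords) {
            if (word.isEmpty()) {
                continue;
            }
            words.add(word);
            for (char c : word.toCharArray()) {
                dirtyChars.add(c);
            }
        }
    }

    /** Returns every sensitive word found in text, ordered by start position. */
    List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            // Grow the candidate only while each new character is a dirty character;
            // the set lookup for the whole candidate is skipped for everything else.
            for (int j = i; j < text.length() && dirtyChars.contains(text.charAt(j)); j++) {
                String candidate = text.substring(i, j + 1);
                if (words.contains(candidate)) {
                    hits.add(candidate);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TtmpSketch filter = new TtmpSketch("bad", "bug");
        // "big" is cut off at 'i', which is not a dirty character;
        // "bag" consists of dirty characters only but is not in the word set.
        System.out.println(filter.find("a bad bug, a big bag")); // prints [bad, bug]
    }
}
```

The saving depends on the text: ordinary text that rarely contains dirty characters is skipped almost entirely, while text made up mostly of dirty characters still produces many candidate substrings.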