
ML / NB: Using the Naive Bayes (NB) algorithm (TfidfVectorizer, without removing stop words) to classify and evaluate the 20 newsgroups text dataset



Contents

Output results

Design approach

Core code



Output results


Design approach

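In outline, the workflow the title describes is: fetch the 20 newsgroups corpus, split it into training and test sets, convert the raw text into TF-IDF features with TfidfVectorizer left at its defaults (stop_words=None, so stop words are not removed), then train and evaluate a Multinomial Naive Bayes classifier. Below is a minimal sketch of the loading and vectorization half using standard scikit-learn APIs; the variable names and split parameters (test_size, random_state) are illustrative assumptions, not values taken from the original post.

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Download/load all 20 categories of the newsgroups corpus
news = fetch_20newsgroups(subset='all')

# Hold out 25% of the documents for evaluation (illustrative split)
X_train, X_test, y_train, y_test = train_test_split(
    news.data, news.target, test_size=0.25, random_state=33)

# stop_words stays at its default (None): stop words are NOT removed,
# matching the "without removing stop words" setup in the title
tfidf_vec = TfidfVectorizer()
X_train_tfidf = tfidf_vec.fit_transform(X_train)  # learn vocabulary and idf on training text
X_test_tfidf = tfidf_vec.transform(X_test)        # reuse the fitted vocabulary on test text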

Core code

# class TfidfVectorizer, found at: sklearn.feature_extraction.text

class TfidfVectorizer(CountVectorizer):
    """Convert a collection of raw documents to a matrix of TF-IDF features.

    Equivalent to CountVectorizer followed by TfidfTransformer.

    Read more in the :ref:`User Guide <text_feature_extraction>`.

    Parameters
    ----------
    input : string {'filename', 'file', 'content'}
        If 'filename', the sequence passed as an argument to fit is
        expected to be a list of filenames that need reading to fetch
        the raw content to analyze.
        If 'file', the sequence items must have a 'read' method (file-like
        object) that is called to fetch the bytes in memory.
        Otherwise the input is expected to be the sequence strings or
        bytes items are expected to be analyzed directly.

    encoding : string, 'utf-8' by default.
        If bytes or files are given to analyze, this encoding is used to
        decode.

    decode_error : {'strict', 'ignore', 'replace'}
        Instruction on what to do if a byte sequence is given to analyze that
        contains characters not of the given `encoding`. By default, it is
        'strict', meaning that a UnicodeDecodeError will be raised. Other
        values are 'ignore' and 'replace'.

    strip_accents : {'ascii', 'unicode', None}
        Remove accents during the preprocessing step.
        'ascii' is a fast method that only works on characters that have
        a direct ASCII mapping.
        'unicode' is a slightly slower method that works on any characters.
        None (default) does nothing.

    analyzer : string, {'word', 'char'} or callable
        Whether the feature should be made of word or character n-grams.
        If a callable is passed it is used to extract the sequence of features
        out of the raw, unprocessed input.

    preprocessor : callable or None (default)
        Override the preprocessing (string transformation) stage while
        preserving the tokenizing and n-grams generation steps.

    tokenizer : callable or None (default)
        Override the string tokenization step while preserving the
        preprocessing and n-grams generation steps.
        Only applies if ``analyzer == 'word'``.

    ngram_range : tuple (min_n, max_n)
        The lower and upper boundary of the range of n-values for different
        n-grams to be extracted. All values of n such that min_n <= n <= max_n
        will be used.

    stop_words : string {'english'}, list, or None (default)
        If a string, it is passed to _check_stop_list and the appropriate stop
        list is returned. 'english' is currently the only supported string
        value.
        If a list, that list is assumed to contain stop words, all of which
        will be removed from the resulting tokens.
        Only applies if ``analyzer == 'word'``.
        If None, no stop words will be used. max_df can be set to a value
        in the range [0.7, 1.0) to automatically detect and filter stop
        words based on intra corpus document frequency of terms.

    lowercase : boolean, default True
        Convert all characters to lowercase before tokenizing.

    token_pattern : string
        Regular expression denoting what constitutes a "token", only used
        if ``analyzer == 'word'``. The default regexp selects tokens of 2
        or more alphanumeric characters (punctuation is completely ignored
        and always treated as a token separator).

    max_df : float in range [0.0, 1.0] or int, default=1.0
        When building the vocabulary ignore terms that have a document
        frequency strictly higher than the given threshold (corpus-specific
        stop words).
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    min_df : float in range [0.0, 1.0] or int, default=1
        When building the vocabulary ignore terms that have a document
        frequency strictly lower than the given threshold. This value is also
        called cut-off in the literature.
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    max_features : int or None, default=None
        If not None, build a vocabulary that only consider the top
        max_features ordered by term frequency across the corpus.
        This parameter is ignored if vocabulary is not None.

    vocabulary : Mapping or iterable, optional
        Either a Mapping (e.g., a dict) where keys are terms and values are
        indices in the feature matrix, or an iterable over terms. If not
        given, a vocabulary is determined from the input documents.

    binary : boolean, default=False
        If True, all non-zero term counts are set to 1. This does not mean
        outputs will have only 0/1 values, only that the tf term in tf-idf
        is binary. (Set idf and normalization to False to get 0/1 outputs.)

    dtype : type, optional
        Type of the matrix returned by fit_transform() or transform().

    norm : 'l1', 'l2' or None, optional
        Norm used to normalize term vectors. None for no normalization.

    use_idf : boolean, default=True
        Enable inverse-document-frequency reweighting.

    smooth_idf : boolean, default=True
        Smooth idf weights by adding one to document frequencies, as if an
        extra document was seen containing every term in the collection
        exactly once. Prevents zero divisions.

    sublinear_tf : boolean, default=False
        Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

    Attributes
    ----------
    vocabulary_ : dict
        A mapping of terms to feature indices.

    idf_ : array, shape = [n_features], or None
        The learned idf vector (global term weights)
        when ``use_idf`` is set to True, None otherwise.

    stop_words_ : set
        Terms that were ignored because they either:
          - occurred in too many documents (`max_df`)
          - occurred in too few documents (`min_df`)
          - were cut off by feature selection (`max_features`).
        This is only available if no vocabulary was given.

    See also
    --------
    CountVectorizer
        Tokenize the documents and count the occurrences of token and return
        them as a sparse matrix

    TfidfTransformer
        Apply Term Frequency Inverse Document Frequency normalization to a
        sparse matrix of occurrence counts.

    Notes
    -----
    The ``stop_words_`` attribute can get large and increase the model size
    when pickling. This attribute is provided only for introspection and can
    be safely removed using delattr or set to None before pickling.
    """

    def __init__(self, input='content', encoding='utf-8',
                 decode_error='strict', strip_accents=None, lowercase=True,
                 preprocessor=None, tokenizer=None, analyzer='word',
                 stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
                 ngram_range=(1, 1), max_df=1.0, min_df=1,
                 max_features=None, vocabulary=None, binary=False,
                 dtype=np.int64, norm='l2', use_idf=True, smooth_idf=True,
                 sublinear_tf=False):
        super(TfidfVectorizer, self).__init__(
            input=input, encoding=encoding, decode_error=decode_error,
            strip_accents=strip_accents, lowercase=lowercase,
            preprocessor=preprocessor, tokenizer=tokenizer, analyzer=analyzer,
            stop_words=stop_words, token_pattern=token_pattern,
            ngram_range=ngram_range, max_df=max_df, min_df=min_df,
            max_features=max_features, vocabulary=vocabulary, binary=binary,
            dtype=dtype)
        self._tfidf = TfidfTransformer(norm=norm, use_idf=use_idf,
                                       smooth_idf=smooth_idf,
                                       sublinear_tf=sublinear_tf)

    # Broadcast the TF-IDF parameters to the underlying transformer instance
    # for easy grid search and repr
    @property
    def norm(self):
        return self._tfidf.norm

    @norm.setter
    def norm(self, value):
        self._tfidf.norm = value

    @property
    def use_idf(self):
        return self._tfidf.use_idf

    @use_idf.setter
    def use_idf(self, value):
        self._tfidf.use_idf = value

    @property
    def smooth_idf(self):
        return self._tfidf.smooth_idf

    @smooth_idf.setter
    def smooth_idf(self, value):
        self._tfidf.smooth_idf = value

    @property
    def sublinear_tf(self):
        return self._tfidf.sublinear_tf

    @sublinear_tf.setter
    def sublinear_tf(self, value):
        self._tfidf.sublinear_tf = value

    @property
    def idf_(self):
        return self._tfidf.idf_

    def fit(self, raw_documents, y=None):
        """Learn vocabulary and idf from training set.

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects

        Returns
        -------
        self : TfidfVectorizer
        """
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
        self._tfidf.fit(X)
        return self

    def fit_transform(self, raw_documents, y=None):
        """Learn vocabulary and idf, return term-document matrix.

        This is equivalent to fit followed by transform, but more efficiently
        implemented.

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects

        Returns
        -------
        X : sparse matrix, [n_samples, n_features]
            Tf-idf-weighted document-term matrix.
        """
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
        self._tfidf.fit(X)
        # X is already a transformed view of raw_documents so
        # we set copy to False
        return self._tfidf.transform(X, copy=False)

    def transform(self, raw_documents, copy=True):
        """Transform documents to document-term matrix.

        Uses the vocabulary and document frequencies (df) learned by fit (or
        fit_transform).

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects

        copy : boolean, default True
            Whether to copy X and operate on the copy or perform in-place
            operations.

        Returns
        -------
        X : sparse matrix, [n_samples, n_features]
            Tf-idf-weighted document-term matrix.
        """
        check_is_fitted(self, '_tfidf', 'The tfidf vector is not fitted')
        X = super(TfidfVectorizer, self).transform(raw_documents)
        return self._tfidf.transform(X, copy=False)
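With TF-IDF features like those produced above, the classification and evaluation half of the pipeline pairs them with scikit-learn's MultinomialNB. A hedged sketch that continues the variables from the earlier snippet (the smoothing parameter is left at its library default):

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

mnb = MultinomialNB()               # additive (Laplace) smoothing, alpha=1.0 by default
mnb.fit(X_train_tfidf, y_train)     # estimate per-class term weights from TF-IDF features
y_pred = mnb.predict(X_test_tfidf)  # predict one of the 20 newsgroup labels per document

# Overall accuracy plus per-class precision, recall, and F1
print('Accuracy:', mnb.score(X_test_tfidf, y_test))
print(classification_report(y_test, y_pred, target_names=news.target_names))

Because fit_transform and transform share the vocabulary learned on the training split, the test matrix has the same feature columns as the training matrix, which is what lets the fitted classifier score unseen documents.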


