
ML / NB: Using the Naive Bayes (NB) algorithm (TfidfVectorizer, without removing stop words) to classify and evaluate the 20 newsgroups text dataset



Contents

Output results

Design approach

Core code



Output results


Design approach

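In outline, the workflow the title describes is: fetch the 20 newsgroups corpus, split it into training and test sets, convert the raw text into TF-IDF features with TfidfVectorizer left at its defaults (stop_words=None, so stop words are not removed), then train and evaluate a Multinomial Naive Bayes classifier. Below is a minimal sketch of the loading and vectorization half using standard scikit-learn APIs; the variable names and split parameters (test_size, random_state) are illustrative assumptions, not values taken from the original post.

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Download/load all 20 categories of the newsgroups corpus
news = fetch_20newsgroups(subset='all')

# Hold out 25% of the documents for evaluation (illustrative split)
X_train, X_test, y_train, y_test = train_test_split(
    news.data, news.target, test_size=0.25, random_state=33)

# stop_words stays at its default (None): stop words are NOT removed,
# matching the "without removing stop words" setup in the title
tfidf_vec = TfidfVectorizer()
X_train_tfidf = tfidf_vec.fit_transform(X_train)  # learn vocabulary and idf on training text
X_test_tfidf = tfidf_vec.transform(X_test)        # reuse the fitted vocabulary on test text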

Core code

# class TfidfVectorizer, found at: sklearn.feature_extraction.text

class TfidfVectorizer(CountVectorizer):
    """Convert a collection of raw documents to a matrix of TF-IDF features.

    Equivalent to CountVectorizer followed by TfidfTransformer.

    Read more in the :ref:`User Guide <text_feature_extraction>`.

    Parameters
    ----------
    input : string {'filename', 'file', 'content'}
        If 'filename', the sequence passed as an argument to fit is
        expected to be a list of filenames that need reading to fetch
        the raw content to analyze.
        If 'file', the sequence items must have a 'read' method (file-like
        object) that is called to fetch the bytes in memory.
        Otherwise the input is expected to be the sequence strings or
        bytes items are expected to be analyzed directly.

    encoding : string, 'utf-8' by default.
        If bytes or files are given to analyze, this encoding is used to
        decode.

    decode_error : {'strict', 'ignore', 'replace'}
        Instruction on what to do if a byte sequence is given to analyze that
        contains characters not of the given `encoding`. By default, it is
        'strict', meaning that a UnicodeDecodeError will be raised. Other
        values are 'ignore' and 'replace'.

    strip_accents : {'ascii', 'unicode', None}
        Remove accents during the preprocessing step.
        'ascii' is a fast method that only works on characters that have
        a direct ASCII mapping.
        'unicode' is a slightly slower method that works on any characters.
        None (default) does nothing.

    analyzer : string, {'word', 'char'} or callable
        Whether the feature should be made of word or character n-grams.
        If a callable is passed it is used to extract the sequence of features
        out of the raw, unprocessed input.

    preprocessor : callable or None (default)
        Override the preprocessing (string transformation) stage while
        preserving the tokenizing and n-grams generation steps.

    tokenizer : callable or None (default)
        Override the string tokenization step while preserving the
        preprocessing and n-grams generation steps.
        Only applies if ``analyzer == 'word'``.

    ngram_range : tuple (min_n, max_n)
        The lower and upper boundary of the range of n-values for different
        n-grams to be extracted. All values of n such that min_n <= n <= max_n
        will be used.

    stop_words : string {'english'}, list, or None (default)
        If a string, it is passed to _check_stop_list and the appropriate stop
        list is returned. 'english' is currently the only supported string
        value.
        If a list, that list is assumed to contain stop words, all of which
        will be removed from the resulting tokens.
        Only applies if ``analyzer == 'word'``.
        If None, no stop words will be used. max_df can be set to a value
        in the range [0.7, 1.0) to automatically detect and filter stop
        words based on intra corpus document frequency of terms.

    lowercase : boolean, default True
        Convert all characters to lowercase before tokenizing.

    token_pattern : string
        Regular expression denoting what constitutes a "token", only used
        if ``analyzer == 'word'``. The default regexp selects tokens of 2
        or more alphanumeric characters (punctuation is completely ignored
        and always treated as a token separator).

    max_df : float in range [0.0, 1.0] or int, default=1.0
        When building the vocabulary ignore terms that have a document
        frequency strictly higher than the given threshold (corpus-specific
        stop words).
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    min_df : float in range [0.0, 1.0] or int, default=1
        When building the vocabulary ignore terms that have a document
        frequency strictly lower than the given threshold. This value is also
        called cut-off in the literature.
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    max_features : int or None, default=None
        If not None, build a vocabulary that only consider the top
        max_features ordered by term frequency across the corpus.
        This parameter is ignored if vocabulary is not None.

    vocabulary : Mapping or iterable, optional
        Either a Mapping (e.g., a dict) where keys are terms and values are
        indices in the feature matrix, or an iterable over terms. If not
        given, a vocabulary is determined from the input documents.

    binary : boolean, default=False
        If True, all non-zero term counts are set to 1. This does not mean
        outputs will have only 0/1 values, only that the tf term in tf-idf
        is binary. (Set idf and normalization to False to get 0/1 outputs.)

    dtype : type, optional
        Type of the matrix returned by fit_transform() or transform().

    norm : 'l1', 'l2' or None, optional
        Norm used to normalize term vectors. None for no normalization.

    use_idf : boolean, default=True
        Enable inverse-document-frequency reweighting.

    smooth_idf : boolean, default=True
        Smooth idf weights by adding one to document frequencies, as if an
        extra document was seen containing every term in the collection
        exactly once. Prevents zero divisions.

    sublinear_tf : boolean, default=False
        Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

    Attributes
    ----------
    vocabulary_ : dict
        A mapping of terms to feature indices.

    idf_ : array, shape = [n_features], or None
        The learned idf vector (global term weights)
        when ``use_idf`` is set to True, None otherwise.

    stop_words_ : set
        Terms that were ignored because they either:
          - occurred in too many documents (`max_df`)
          - occurred in too few documents (`min_df`)
          - were cut off by feature selection (`max_features`).
        This is only available if no vocabulary was given.

    See also
    --------
    CountVectorizer
        Tokenize the documents and count the occurrences of token and return
        them as a sparse matrix

    TfidfTransformer
        Apply Term Frequency Inverse Document Frequency normalization to a
        sparse matrix of occurrence counts.

    Notes
    -----
    The ``stop_words_`` attribute can get large and increase the model size
    when pickling. This attribute is provided only for introspection and can
    be safely removed using delattr or set to None before pickling.
    """

    def __init__(self, input='content', encoding='utf-8',
                 decode_error='strict', strip_accents=None, lowercase=True,
                 preprocessor=None, tokenizer=None, analyzer='word',
                 stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
                 ngram_range=(1, 1), max_df=1.0, min_df=1,
                 max_features=None, vocabulary=None, binary=False,
                 dtype=np.int64, norm='l2', use_idf=True, smooth_idf=True,
                 sublinear_tf=False):
        super(TfidfVectorizer, self).__init__(
            input=input, encoding=encoding, decode_error=decode_error,
            strip_accents=strip_accents, lowercase=lowercase,
            preprocessor=preprocessor, tokenizer=tokenizer, analyzer=analyzer,
            stop_words=stop_words, token_pattern=token_pattern,
            ngram_range=ngram_range, max_df=max_df, min_df=min_df,
            max_features=max_features, vocabulary=vocabulary, binary=binary,
            dtype=dtype)
        self._tfidf = TfidfTransformer(norm=norm, use_idf=use_idf,
                                       smooth_idf=smooth_idf,
                                       sublinear_tf=sublinear_tf)

    # Broadcast the TF-IDF parameters to the underlying transformer instance
    # for easy grid search and repr
    @property
    def norm(self):
        return self._tfidf.norm

    @norm.setter
    def norm(self, value):
        self._tfidf.norm = value

    @property
    def use_idf(self):
        return self._tfidf.use_idf

    @use_idf.setter
    def use_idf(self, value):
        self._tfidf.use_idf = value

    @property
    def smooth_idf(self):
        return self._tfidf.smooth_idf

    @smooth_idf.setter
    def smooth_idf(self, value):
        self._tfidf.smooth_idf = value

    @property
    def sublinear_tf(self):
        return self._tfidf.sublinear_tf

    @sublinear_tf.setter
    def sublinear_tf(self, value):
        self._tfidf.sublinear_tf = value

    @property
    def idf_(self):
        return self._tfidf.idf_

    def fit(self, raw_documents, y=None):
        """Learn vocabulary and idf from training set.

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects

        Returns
        -------
        self : TfidfVectorizer
        """
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
        self._tfidf.fit(X)
        return self

    def fit_transform(self, raw_documents, y=None):
        """Learn vocabulary and idf, return term-document matrix.

        This is equivalent to fit followed by transform, but more efficiently
        implemented.

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects

        Returns
        -------
        X : sparse matrix, [n_samples, n_features]
            Tf-idf-weighted document-term matrix.
        """
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
        self._tfidf.fit(X)
        # X is already a transformed view of raw_documents so
        # we set copy to False
        return self._tfidf.transform(X, copy=False)

    def transform(self, raw_documents, copy=True):
        """Transform documents to document-term matrix.

        Uses the vocabulary and document frequencies (df) learned by fit (or
        fit_transform).

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects

        copy : boolean, default True
            Whether to copy X and operate on the copy or perform in-place
            operations.

        Returns
        -------
        X : sparse matrix, [n_samples, n_features]
            Tf-idf-weighted document-term matrix.
        """
        check_is_fitted(self, '_tfidf', 'The tfidf vector is not fitted')
        X = super(TfidfVectorizer, self).transform(raw_documents)
        return self._tfidf.transform(X, copy=False)
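With TF-IDF features like those produced above, the classification and evaluation half of the pipeline pairs them with scikit-learn's MultinomialNB. A hedged sketch that continues the variables from the earlier snippet (the smoothing parameter is left at its library default):

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

mnb = MultinomialNB()               # additive (Laplace) smoothing, alpha=1.0 by default
mnb.fit(X_train_tfidf, y_train)     # estimate per-class term weights from TF-IDF features
y_pred = mnb.predict(X_test_tfidf)  # predict one of the 20 newsgroup labels per document

# Overall accuracy plus per-class precision, recall, and F1
print('Accuracy:', mnb.score(X_test_tfidf, y_test))
print(classification_report(y_test, y_pred, target_names=news.target_names))

Because fit_transform and transform share the vocabulary learned on the training split, the test matrix has the same feature columns as the training matrix, which is what lets the fitted classifier score unseen documents.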


