import pandas as pd

def read_data(url):
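The body of `read_data` is not shown. A minimal sketch, assuming the train/test files are plain CSVs that pandas can load directly from a path or URL:

```python
import pandas as pd

def read_data(url):
    # Assumption: the resource at `url` is a CSV file;
    # pd.read_csv accepts both local paths and URLs.
    return pd.read_csv(url)
```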
Data cleaning
def clean_text(text):
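`clean_text` is also truncated. One plausible sketch, assuming cleaning here means stripping HTML remnants and punctuation while keeping Chinese characters:

```python
import re

def clean_text(text):
    text = str(text)
    text = re.sub(r"<[^>]+>", " ", text)              # drop HTML tags
    text = re.sub(r"[^\w\u4e00-\u9fff]+", " ", text)  # keep word/CJK chars only
    return re.sub(r"\s+", " ", text).strip()          # collapse whitespace
```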
Tokenizing content & title
def jieba_data(data, i):
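A sketch of what `jieba_data` might do, assuming `data` is a DataFrame and `i` names the column to tokenize; tokens are space-joined so that `CountVectorizer` can split them back apart later:

```python
import jieba

def jieba_data(data, i):
    # Tokenize each cleaned cell of column `i` and rejoin with spaces.
    return data[i].apply(lambda t: " ".join(jieba.cut(clean_text(t))))
```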
Jieba tokenization for the training set
def train_jieba(data, i):
Jieba tokenization for the test set
def test_jieba(data, i):
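Both split-specific wrappers presumably just delegate to `jieba_data`, kept separate so the train and test pipelines can diverge later if needed. A sketch:

```python
def train_jieba(data, i):
    # Jieba tokenization for the training split.
    return jieba_data(data, i)

def test_jieba(data, i):
    # Jieba tokenization for the test split.
    return jieba_data(data, i)
```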
Reading the data
if __name__ == '__main__':
Output:
(2582, 5)
(1107, 5)
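The shapes above indicate a training set of 2582 rows and a test set of 1107 rows, each with 5 columns. A sketch of the entry point, with hypothetical file names since the real paths are not shown in the post:

```python
if __name__ == '__main__':
    train_data_x = read_data("train.csv")  # hypothetical path
    test_data_x = read_data("test.csv")    # hypothetical path
    print(train_data_x.shape)  # (2582, 5)
    print(test_data_x.shape)   # (1107, 5)
```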
Getting the label values
unimportant_idx = 44
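`unimportant_idx = 44` presumably identifies the "non-important news" category used in the binary task below. A sketch under that assumption; the `label` column name is also an assumption:

```python
unimportant_idx = 44  # assumption: category id for "non-important news"

# Binary target: 1.0 when the article belongs to class 44, else 0.0
# (matching the 0.0/1.0 labels in the classification report below).
train_y = (train_data_x['label'] == unimportant_idx).astype(float)
test_y = (test_data_x['label'] == unimportant_idx).astype(float)
```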
Tokenizing and saving the tokenized results
train_data_clean_x = tokenize_df_content(train_data_x)
Output:
(2582, 5)
(1107, 5)
(2582, 3)
(1107, 3)
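`tokenize_df_content` itself is not shown. Given that the frames shrink from 5 columns to 3, a plausible sketch keeps the tokenized content, the tokenized title, and the label; the column names and save paths are assumptions:

```python
def tokenize_df_content(data):
    return pd.DataFrame({
        'content': jieba_data(data, 'content'),  # assumed column name
        'title': jieba_data(data, 'title'),      # assumed column name
        'label': data['label'],                  # assumed column name
    })

train_data_clean_x = tokenize_df_content(train_data_x)
test_data_clean_x = tokenize_df_content(test_data_x)
train_data_clean_x.to_csv("train_clean.csv", index=False)  # hypothetical path
test_data_clean_x.to_csv("test_clean.csv", index=False)    # hypothetical path
```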
TF-IDF: building the weight matrix
vectorizer = CountVectorizer()  # considers only how often each term occurs in the text
Output: (2582, 72998)
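The raw counts are presumably rescaled with `TfidfTransformer`, the classic two-step pairing with `CountVectorizer`; the vocabulary of 72998 terms matches the printed shape. A sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer()    # raw term frequencies only
transformer = TfidfTransformer()  # rescale counts into tf-idf weights

train_counts = vectorizer.fit_transform(train_data_clean_x['content'])
train_tfidf = transformer.fit_transform(train_counts)
print(train_tfidf.shape)  # (2582, 72998)

# Reuse the fitted vocabulary on the test set; never refit on test data.
test_tfidf = transformer.transform(
    vectorizer.transform(test_data_clean_x['content']))
```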
TF-IDF: printing the feature terms
weight = train_tfidf.toarray()
Output:
************************ tf-idf weights of the terms in article 0 **********************
一致性 0.30716252706000763
仿制 0.21851023802630334
医疗机构 0.27535485762167283
原研 0.2036921591200047
......
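`toarray()` densifies the tf-idf matrix so individual weights can be read off. A sketch of the loop that would produce the listing above:

```python
# get_feature_names_out() requires scikit-learn >= 1.0;
# on older releases use vectorizer.get_feature_names() instead.
words = vectorizer.get_feature_names_out()
weight = train_tfidf.toarray()

doc = 0
print(f"************ tf-idf weights of the terms in article {doc} ************")
for j in range(len(words)):
    if weight[doc][j] > 0:
        print(words[j], weight[doc][j])
```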
Binary classification [non-important news]
svmmodel = SVC(C=1, kernel="linear")  # the rbf and poly kernels both did worse than linear here
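Training and prediction with the linear-kernel SVM; a sketch assuming the tf-idf matrices and binary labels from the steps above:

```python
from sklearn.svm import SVC

svmmodel = SVC(C=1, kernel="linear")  # rbf/poly underperformed linear here
svmmodel.fit(train_tfidf, train_y)
pred_y = svmmodel.predict(test_tfidf)
```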
Evaluation metrics
from sklearn.metrics import classification_report
Output:
             precision    recall  f1-score   support

        0.0       0.76      0.92      0.83       783
        1.0       0.59      0.29      0.39       324

avg / total       0.71      0.73      0.70      1107
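The table above is presumably produced along these lines (the `avg / total` row indicates an older scikit-learn release; newer versions print macro/weighted averages instead):

```python
from sklearn.metrics import classification_report

print(classification_report(test_y, pred_y))
```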