Web page main-text extraction tool #318

chrislinan · 2018-08-22T14:50:56Z

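Project recommendation

Project name:
cx-extractor-python
Project URL:
https://github.com/chrislinan/cx-extractor-python
Planned updates:
Add multi-language support
Project description:
A tool for extracting the main text of web pages. It is a Python version of the cx-extractor algorithm, improved over the original so that it supports both Chinese and English. It works well on news pages.
Why I recommend it:
It does not need to parse the HTML, and it extracts the main text quickly and accurately.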

Sample code:

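from crawler.cx_extractor_Python import cx_extractor_Python

cx = cx_extractor_Python()
# Alternatively, load a saved page from disk:
# test_html = cx.readHtml("E:\\Documents\\123.html")
# Fetch the raw HTML of a news page.
test_html = cx.getHtml('http://news.163.com/16/0101/10/BC84MRHS00014AED.html')
# Filter the tags out of the raw HTML, then extract the main text.
content = cx.filter_tags(test_html)
s = cx.getText(content)
print(s)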
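
For readers wondering how text can be extracted without parsing the HTML, here is a minimal sketch of the line-block density idea that cx-extractor is generally described as using. It is not the project's code: the names `strip_tags`, `extract_main_text`, `block_width`, and `threshold`, along with the default values, are assumptions made up for illustration, and the real implementation may differ considerably.

```python
import re


def strip_tags(html: str) -> str:
    """Remove scripts, styles, comments, and tags with regexes (no DOM parsing)."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    html = re.sub(r"(?s)<!--.*?-->", "", html)
    return re.sub(r"(?s)<[^>]*>", "", html)


def extract_main_text(html: str, block_width: int = 3, threshold: int = 40) -> str:
    """Return the densest run of text lines (hypothetical helper, toy heuristic)."""
    lines = [line.strip() for line in strip_tags(html).splitlines()]
    # block_len[i] = number of characters in lines[i : i + block_width]
    block_len = [
        sum(len(text) for text in lines[i:i + block_width])
        for i in range(len(lines))
    ]

    best_start, best_end, best_score = 0, 0, 0
    i = 0
    while i < len(lines):
        if block_len[i] < threshold:
            i += 1
            continue
        # Walk to the end of this run of consecutive dense blocks.
        start = i
        while i < len(lines) and block_len[i] >= threshold:
            i += 1
        # The last dense block starts at i - 1 and spans block_width lines.
        end = min(i + block_width - 1, len(lines))
        score = sum(len(text) for text in lines[start:end])
        if score > best_score:
            best_start, best_end, best_score = start, end, score

    # Keep only the non-empty lines of the best run.
    return "\n".join(text for text in lines[best_start:best_end] if text)
```

Calling `extract_main_text(page_html)` on a page's raw HTML string returns the lines from the run of blocks that holds the most text. Short navigation and footer lines tend to fall below the threshold when they are separated from the article by tag-only lines, but this toy heuristic can still pick up short lines directly adjacent to the body.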

Screenshot:


521xueweihan · 2018-09-27T16:20:45Z

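@chrislinan The project you recommended has been included in HelloGitHub issue 30, and you have been added to the contributors list.

You are welcome to keep recommending great projects like this one and to invite others to join the HelloGitHub project. Thank you 🙏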

521xueweihan added the Python 项目 (Python project) label Aug 24, 2018

521xueweihan added the 已收录（未发布） (included, not yet published) label Sep 27, 2018

521xueweihan closed this as completed Sep 27, 2018

521xueweihan added 已发布 (published) and removed 已收录（未发布） (included, not yet published) labels Sep 27, 2018
