Web page main-text extraction tool #318

Closed
chrislinan opened this issue Aug 22, 2018 · 1 comment

Comments


chrislinan commented Aug 22, 2018

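Project recommendation

  • Project name:
    cx-extractor-python
  • Project URL:
    https://github.com/chrislinan/cx-extractor-python
  • Planned updates:
    Add multi-language support
  • Project description:
    A tool for extracting the main text of web pages. It is a Python version of the cx-extractor algorithm, improved over the original so that it supports both Chinese and English, and it works well on news-style pages.
  • Why recommend it:
    No HTML parsing is needed; main-text extraction is fast and accurate.
  • Sample code: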
    from crawler.cx_extractor_Python import cx_extractor_Python
    cx = cx_extractor_Python()
    # Alternatively, read a page saved on disk:
    # test_html = cx.readHtml("E:\\Documents\\123.html")
    # Fetch the page over HTTP
    test_html = cx.getHtml('http://news.163.com/16/0101/10/BC84MRHS00014AED.html')
    # Strip the markup, then extract the main text
    content = cx.filter_tags(test_html)
    s = cx.getText(content)
    print(s)
    
  • Screenshots:
    raw
    text
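The "no HTML parsing" point comes from how line-block approaches such as cx-extractor work: markup is stripped with regular expressions, and the main text is taken to be the region where the remaining lines are densest. The following is a minimal sketch of that idea under simplifying assumptions, not the project's actual code. The names `strip_tags` and `extract_main_text` and the parameters `block_width` and `min_block_len` are hypothetical, and the selection rule (grow outward from the densest block) is a simplified variant.

```python
import re


def strip_tags(page):
    """Remove scripts, styles, comments and tags with regexes, keeping line breaks."""
    page = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", page)
    page = re.sub(r"(?s)<!--.*?-->", "", page)
    return re.sub(r"<[^>]+>", "", page)


def extract_main_text(page, block_width=3, min_block_len=30):
    """Return the lines around the densest block of text (simplified heuristic)."""
    lines = [line.strip() for line in strip_tags(page).split("\n")]

    # block_len[i] = number of characters in lines i .. i + block_width - 1
    block_len = [
        sum(len(line) for line in lines[i:i + block_width])
        for i in range(len(lines))
    ]

    # Start from the densest block; give up if even that one is too sparse.
    peak = max(range(len(lines)), key=block_len.__getitem__)
    if block_len[peak] < min_block_len:
        return ""

    # Grow the region in both directions while neighbouring blocks stay dense.
    start = peak
    while start > 0 and block_len[start - 1] >= min_block_len:
        start -= 1
    end = peak
    while end + 1 < len(lines) and block_len[end + 1] >= min_block_len:
        end += 1

    return "\n".join(line for line in lines[start:end + 1] if line)
```

Calling `extract_main_text(page)` on a fetched HTML string returns the dense region as plain text. Because each block looks ahead by `block_width` lines, short navigation lines that sit directly above the article can leak into the result; tuning both parameters per site is the usual remedy.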
@521xueweihan
Owner

@chrislinan The project you recommended has been included in the 30th edition of HelloGitHub, and you have been added to the contributors list.

You are welcome to keep recommending excellent projects like this one and to invite others to join the HelloGitHub project. Thank you 🙏
