Hello, how can I compute the distance between a node and the article body, and use it to pick the best date? #123
Comments
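Hi, the time extraction module is about to get a major overhaul, after which it will be more powerful and more general. Once it is updated, you can take a look at the new implementation.
OK, thank you very much for the kind reply. I'm really looking forward to the new approach.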
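Idea:
The publication time usually sits close to the article body, so I match dates in the tags around the start of the body.
Steps:
1. Get the HTML attribute of the tag at the start of the body: id="0RSOEB1K".
2. Use that attribute to locate the tag at the start of the body.
3. Match dates in the tags near the start of the body (the 5 tags before and after it).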
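The steps above can be sketched roughly as follows. This is only a minimal illustration, not GNE's implementation: it assumes the texts of the tags have already been collected in document order, and `DATE_RE`, `dates_near`, and the sample `tag_texts` are hypothetical names and data made up for the example. The regular expression covers only a few common date formats.

```python
import re

# Simplified pattern (an assumption, not GNE's own rules):
# YYYY-MM-DD, YYYY/MM/DD or YYYY年MM月DD日, optionally followed by HH:MM(:SS).
DATE_RE = re.compile(
    r'\d{4}[-/年]\d{1,2}[-/月]\d{1,2}日?(?:\s+\d{1,2}:\d{2}(?::\d{2})?)?'
)

def dates_near(texts, start, window=5):
    """Return (index, date) pairs found within `window` positions of `start`."""
    lo = max(0, start - window)
    hi = min(len(texts), start + window + 1)
    found = []
    for i in range(lo, hi):
        match = DATE_RE.search(texts[i])
        if match:
            found.append((i, match.group()))
    return found

# Hypothetical tag texts in document order; index 7 is the start of the body.
tag_texts = [
    'Home', 'News', 'Related: report from 2019-01-01',
    'Some headline', 'NetEase', '2022-05-09 10:30:00', 'Source: example',
    'First paragraph of the article.',
    'Second paragraph.',
]
print(dates_near(tag_texts, start=7))
```

Note that a window of 5 can still pick up unrelated dates (here, the one in the "Related" link), which is why the candidates need a further ranking step.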
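Problem:
To locate the tag at the start of the body, I have to look up its attribute name and value (id="0RSOEB1K") by hand.
What I'd like: obtain the attribute name and value of the body's opening tag in code, pass them on as variables, use them to fetch the tags near the start of the body, and then extract the date from those tags.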
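One possible way to do this with lxml is sketched below, under some assumptions. `element.attrib` exposes the attribute names and values, so they can be turned into an XPath without reading the page by hand. Alternatively, the `preceding::` and `following::` axes can be evaluated relative to the element itself, in which case the id is not needed at all. The helper names `xpath_for` and `neighbours` and the toy HTML are hypothetical, and `xpath_for` assumes the attribute value contains no double quotes.

```python
from lxml.html import fromstring

# Toy page standing in for the downloaded article (hypothetical markup).
HTML = """
<html><body>
  <div class="post_info">2022-05-09 10:30:00</div>
  <div class="post_body">
    <p id="0RSOEB1K">First paragraph of the article.</p>
    <p id="0RSOEB1L">Second paragraph.</p>
  </div>
</body></html>
"""

def xpath_for(node):
    """Build an XPath matching `node` by its tag and first attribute, if it has one."""
    attrs = dict(node.attrib)  # attribute names and values as a plain dict
    if not attrs:
        return f'//{node.tag}'
    name, value = next(iter(attrs.items()))
    return f'//{node.tag}[@{name}="{value}"]'

def neighbours(node, window=5):
    """Return up to `window` elements before and after `node`, without using its id."""
    before = node.xpath(f'preceding::*[position() <= {window}]')
    after = node.xpath(f'following::*[position() <= {window}]')
    return before + after

root = fromstring(HTML)
body = root.xpath('//div[@class="post_body"]')[0]  # stands in for content[0][1]['node']
start = body[0]                                    # first child: the tag at the start of the body

print(dict(start.attrib))
print(xpath_for(start))
print([el.tag for el in neighbours(start)])
```

The relative-axis variant is probably the more robust of the two, since it also works when the tag at the start of the body has no attributes.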
Implementation:
from lxml.html import fromstring
# Download the page and save it locally: https://www.163.com/dy/article/H6TFTRQ50514R9KC.html
html = open(r'.\gne\3.html', encoding='utf-8').read()
html
# Preprocess the HTML
from gne.utils import pre_parse, remove_noise_node, config, html2element, normalize_text
normal_html = normalize_text(html)
element = html2element(normal_html)
element = pre_parse(element)
# Compute the metrics for each node
from gne.extractor import ContentExtractor
content = ContentExtractor().extract(element)
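# Article body
content[0][1]
# The HTML element of the first node (i.e. the body) is stored under 'node'
tree = content[0][1]['node']
len(tree)
# Convert an HTML element to an HTML string
from lxml.html import tostring
# HTML of the tag at the start of the body, with the tag's full details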
html_start = tostring(tree[0],encoding='utf-8',pretty_print=True).decode('utf-8')
html_start
# Attribute values of the tag at the start of the body
attribute = ''.join(tree[0].xpath('./attribute::*'))
attribute
# Text content of the tags near the start of the body (the 30 tags before and after it)
tree.xpath('//*[@id="0RSOEB1K"]/preceding::*[position()<=30]//text() | //*[@id="0RSOEB1K"]/following::*[position()<=30]//text()')
# Extract the date from the tags near the start of the body (the 30 tags before and after it)
html_date = tree.xpath('//*[@id="0RSOEB1K"]/preceding::*[position()<=30] | //*[@id="0RSOEB1K"]/following::*[position()<=30]')
html_date
publish_times = []
from gne.extractor import TimeExtractor
for element in html_date:
    publish_time = TimeExtractor().extractor(element)
    publish_times.append(publish_time)
publish_times
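As for the question in the title, one simple notion of "distance" is the difference in document-order position between a candidate node and the tag at the start of the body. The sketch below ranks date candidates by that distance. It is an illustration only, not how GNE scores dates: `rank_dates`, `DATE_RE`, and the toy HTML are hypothetical, and a plain regular expression stands in for `TimeExtractor` so that the example runs without GNE installed.

```python
import re
from lxml.html import fromstring

# Simplified date pattern used only for this sketch.
DATE_RE = re.compile(r'\d{4}-\d{1,2}-\d{1,2}(?: \d{1,2}:\d{2}(?::\d{2})?)?')

# Toy page (hypothetical markup) with three dates at different distances from the body.
HTML = """
<html><body>
  <div class="related">Earlier report, 2019-01-01</div>
  <h1>Headline</h1>
  <div class="post_info">2022-05-09 10:30:00</div>
  <div class="post_body">
    <p id="0RSOEB1K">First paragraph.</p>
    <p>Second paragraph.</p>
    <p>Third paragraph.</p>
  </div>
  <div class="footer">Page generated 2022-06-01</div>
</body></html>
"""

def rank_dates(root, start):
    """Return (distance, date) pairs sorted by document-order distance to `start`."""
    order = {el: i for i, el in enumerate(root.iter('*'))}  # element -> position
    start_pos = order[start]
    ranked = []
    for el, pos in order.items():
        match = DATE_RE.search(el.text or '')  # only the element's own leading text
        if match:
            ranked.append((abs(pos - start_pos), match.group()))
    return sorted(ranked)

root = fromstring(HTML)
start = root.xpath('//*[@id="0RSOEB1K"]')[0]
ranked = rank_dates(root, start)
print(ranked[0][1])  # the candidate closest to the start of the body
```

Counting positions this way treats nodes before and after the body alike. Since the publication time usually appears above the body, it may be worth preferring preceding nodes when distances are close.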