Hello, how can I compute the distance between a node and the article body, so as to pick the best date? #123

Open
wzf9 opened this issue May 9, 2022 · 2 comments

wzf9 commented May 9, 2022

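Approach:
The publication time node is usually close to the article body, so I match dates around the tag where the body starts.
Steps:
1. Get the HTML attribute of the tag that starts the body: id="0RSOEB1K".
2. Use that attribute to match the body's first tag.
3. Match dates in the tags near the body's first tag (the 5 tags before and after it).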
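
The three steps above can be sketched roughly as follows. This is only an illustration: it uses the standard library's xml.etree.ElementTree on a small hand-written snippet instead of gne and the real page, and the `PAGE` markup, the simplified `DATE_RE` pattern, and the `dates_near` helper are all assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the downloaded article page.
PAGE = """
<html><body>
  <h1>Title</h1>
  <div class="info">2022-05-09 10:30:00 Source: example</div>
  <div class="body">
    <p id="0RSOEB1K">First paragraph of the article.</p>
    <p>Second paragraph.</p>
    <p>Third paragraph.</p>
    <p>Fourth paragraph.</p>
  </div>
  <div class="footer">Copyright 2010-01-01</div>
</body></html>
""".strip()

# Simplified date pattern (an assumption; real pages use many more formats).
DATE_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}(?: \d{1,2}:\d{2}(?::\d{2})?)?")


def dates_near(root, anchor, window=5):
    """Return date strings found in the `window` elements before and after `anchor`."""
    nodes = list(root.iter())              # every element, in document order
    idx = nodes.index(anchor)              # position of the body's first tag
    nearby = nodes[max(0, idx - window): idx + window + 1]
    found = []
    for node in nearby:
        # Only the element's leading text is checked; tail text is ignored here.
        match = DATE_RE.search(node.text or "")
        if match:
            found.append(match.group())
    return found


root = ET.fromstring(PAGE)
anchor = root.find(".//*[@id='0RSOEB1K']")   # steps 1 and 2: locate the first tag
narrow = dates_near(root, anchor, window=2)  # step 3 with a small window
wide = dates_near(root, anchor, window=5)    # a wider window also reaches the footer
```

A small window keeps only the date in the info line, while a larger one also picks up the unrelated footer date, which is why the window size matters.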

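Problem:
To locate the body's first tag, I currently have to inspect the page by hand to find its attribute name and value (id="0RSOEB1K").
What I would like: obtain the attribute name and value of the body's first tag in code, store them in variables, pass them into the code to get the tags near the start of the body, and then extract the date.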
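
One possible way to avoid the manual lookup is to read the attribute name and value from the element object itself and build the path expression from those variables. The sketch below uses the standard library's xml.etree.ElementTree on a made-up snippet; `SNIPPET`, `first_attribute`, and `locator_for` are hypothetical names introduced only for illustration, not part of gne.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the body node returned by the content extractor.
SNIPPET = """
<div class="post_body">
  <p id="0RSOEB1K">First paragraph of the article.</p>
  <p>Second paragraph.</p>
</div>
""".strip()


def first_attribute(element):
    """Return (name, value) of the element's first attribute, or None if it has none."""
    for name, value in element.attrib.items():
        return name, value
    return None


def locator_for(element):
    """Build a path expression matching elements that share the first attribute."""
    attr = first_attribute(element)
    if attr is None:
        return None
    name, value = attr
    # Values containing a single quote would need escaping; ignored in this sketch.
    return f".//*[@{name}='{value}']"


body = ET.fromstring(SNIPPET)
first_tag = body[0]                        # first child of the body node
name, value = first_attribute(first_tag)   # read programmatically, not by hand
path = locator_for(first_tag)
same_tag = body.find(path)                 # the expression finds the same element
```

Since the element object is already in hand, it may be simpler to skip the attribute entirely and query relative to the element, for example `tree[0].xpath('preceding::*[position()<30]')` in lxml. That also works when the first tag has no attributes at all.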

Implementation:

from lxml.html import fromstring

Download the page and save it locally: https://www.163.com/dy/article/H6TFTRQ50514R9KC.html

html = open(r'.\gne\3.html', encoding='utf-8').read()
html

Preprocess the HTML

from gne.utils import pre_parse, remove_noise_node, config, html2element, normalize_text
normal_html = normalize_text(html)
element = html2element(normal_html)
element = pre_parse(element)

Compute the metrics for each node

from gne.extractor import ContentExtractor
content = ContentExtractor().extract(element)

Article body (the top-ranked node)

content[0][1]

Get the html element object of the first node (i.e. the article body): node

tree = content[0][1]['node']
len(tree)

Convert an html element object to an HTML string

from lxml.html import tostring

HTML of the body's first tag: gives the tag's full information

html_start = tostring(tree[0],encoding='utf-8',pretty_print=True).decode('utf-8')
html_start

# Attribute values of the body's first tag

attribute = ''.join(tree[0].xpath('./attribute::*'))

attribute

Text content near the body's first tag (the 30 tags before and after it)

tree.xpath('//*[@id="0RSOEB1K"]/preceding::*[position()<30]//text() | //*[@id="0RSOEB1K"]/following::*[position()<30]//text()')

Extract the date: search near the body's first tag (the 30 tags before and after it)

html_date = tree.xpath('//*[@id="0RSOEB1K"]/preceding::*[position()<30] | //*[@id="0RSOEB1K"]/following::*[position()<30]')
html_date

from gne.extractor import TimeExtractor
publish_times = []
# Run the time extractor on every candidate tag near the start of the body
for element in html_date:
    publish_time = TimeExtractor().extractor(element)
    publish_times.append(publish_time)
publish_times
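
The loop above collects every candidate but does not yet choose between them. One way to define the "distance" asked about in the title is the number of elements, in document order, between a candidate and the body's first tag, and then to keep the candidate with the smallest distance. This is only a sketch of that idea on a made-up snippet using the standard library; `PAGE`, `DATE_RE`, and `closest_date` are assumptions for illustration and not gne's method.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page with three dates at different distances from the body.
PAGE = """
<html><body>
  <div class="related">Old story 2021-01-15</div>
  <h1>Title</h1>
  <span class="time">2022-05-09 08:00</span>
  <div class="body">
    <p id="0RSOEB1K">First paragraph.</p>
    <p>Second paragraph.</p>
    <p>Third paragraph.</p>
  </div>
  <div class="footer">Updated 2022-05-10</div>
</body></html>
""".strip()

# Simplified date pattern (an assumption; real pages use many more formats).
DATE_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}(?: \d{1,2}:\d{2}(?::\d{2})?)?")


def closest_date(root, anchor):
    """Return (date_string, distance) for the date nearest to `anchor`, or None."""
    nodes = list(root.iter())              # every element, in document order
    anchor_idx = nodes.index(anchor)
    best = None
    for idx, node in enumerate(nodes):
        # Only the element's leading text is checked; tail text is ignored here.
        match = DATE_RE.search(node.text or "")
        if not match:
            continue
        distance = abs(idx - anchor_idx)   # elements between candidate and anchor
        if best is None or distance < best[1]:
            best = (match.group(), distance)
    return best


root = ET.fromstring(PAGE)
anchor = root.find(".//*[@id='0RSOEB1K']")
result = closest_date(root, anchor)
```

Counting elements in document order is a crude measure. Tree depth or the path length through the common ancestor could be used instead, and ties would need an explicit rule, such as preferring candidates that appear before the body.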

Collaborator

kingname commented May 9, 2022

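Hello, the time extraction module will get a major overhaul soon and will become more powerful and more general. Once it is released, you can take a look at the new implementation.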

Author

wzf9 commented May 9, 2022

OK, thank you very much for the kind reply. I'm really looking forward to the new approach.
