Hello! How can I compute the distance between a node and the article body, and use it to select the best date? #123

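Idea:
The publish-time node is usually close to the article body, so dates are matched around the tag that opens the body.
Steps:
1. Get the attribute of the HTML tag that opens the article body: id="0RSOEB1K".
2. Use that attribute to locate the opening tag of the body.
3. Match dates in the tags around that opening tag (the 5 tags before and after it; the code further down uses 30).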
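
The steps above can be sketched with lxml alone. This is only a rough illustration under assumptions: the sample page, the `DATE_RE` pattern and the helper name `dates_near` are hypothetical (only the id "0RSOEB1K" comes from the question), and the regex covers `YYYY-MM-DD [HH:MM[:SS]]` dates only, which is much narrower than what GNE matches.

```python
import re
from lxml.html import fromstring

# Hypothetical sample page; only the id "0RSOEB1K" comes from the question.
SAMPLE_HTML = """
<html><body>
  <div class="nav">Home</div>
  <div class="info"><span>2022-05-09 10:30:00</span><span>Source: example</span></div>
  <div class="post_body">
    <p id="0RSOEB1K">First paragraph of the article.</p>
    <p>Second paragraph.</p>
  </div>
  <div class="footer">Copyright 2010-01-01</div>
</body></html>
"""

# Assumed, simplified date pattern (not the pattern list GNE ships with).
DATE_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}(?:\s+\d{1,2}:\d{2}(?::\d{2})?)?")


def dates_near(root, anchor_id, window=5):
    """Return date strings found in the `window` elements before/after the anchor."""
    anchor = root.xpath('//*[@id=$aid]', aid=anchor_id)[0]
    # On a reverse axis such as preceding::, position() counts backwards from
    # the anchor, so each predicate keeps the `window` closest elements.
    before = anchor.xpath('preceding::*[position() <= $n]', n=window)
    after = anchor.xpath('following::*[position() <= $n]', n=window)
    found = []
    for node in before + after:
        # node.text is the element's own leading text only, so a date is not
        # reported again for every container that wraps it.
        match = DATE_RE.search(node.text or "")
        if match:
            found.append(match.group())
    return found


root = fromstring(SAMPLE_HTML)
print(dates_near(root, "0RSOEB1K"))
```

With the default window the footer date is picked up as well, which is why some ranking by distance is still needed after matching.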

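Problem:
To locate the opening tag of the body, I currently have to inspect the page by hand to find its attribute name and value (id="0RSOEB1K").
Goal: obtain the attribute name and value of that tag in code, pass them on as variables, use them to fetch the tags near the start of the body, and then extract the date from those tags.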
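
One possible way to avoid the manual lookup, sketched under assumptions: the sample page and the variable names are hypothetical, and `content_node` stands in for the node returned by `ContentExtractor`. The attribute can be read from `Element.attrib` and passed to the XPath as variables, but since the element object is already in hand, a relative XPath evaluated on it needs no attribute at all.

```python
from lxml.html import fromstring

# Hypothetical sample page; only the id "0RSOEB1K" comes from the question.
SAMPLE_HTML = """
<html><body>
  <div class="info"><span>2022-05-09 10:30:00</span></div>
  <div class="post_body">
    <p id="0RSOEB1K">First paragraph of the article.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""

root = fromstring(SAMPLE_HTML)
# Stand-in for content[0][1]['node'] in the question.
content_node = root.xpath('//div[@class="post_body"]')[0]
start = content_node[0]  # first child element, i.e. the tag that opens the body

# Option 1: read the attribute name and value, then pass them as XPath variables.
name, value = next(iter(start.attrib.items()))
by_attr = root.xpath(
    '//*[@*[name() = $name] = $value]/preceding::*[position() <= 5]',
    name=name, value=value,
)

# Option 2: evaluate a relative XPath on the element itself. This also works
# when the opening tag has no attributes.
by_axis = start.xpath('preceding::*[position() <= 5]')

# Option 3: ask lxml for an absolute path that identifies the element.
path = root.getroottree().getpath(start)

print(name, value)
print([node.tag for node in by_axis])
print(path)
```

Option 2 is probably the most robust of the three, because attribute values such as generated ids change from page to page.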

Implementation:

from lxml.html import fromstring

# Page downloaded and saved locally from: https://www.163.com/dy/article/H6TFTRQ50514R9KC.html
with open(r'.\gne\3.html', encoding='utf-8') as f:
    html = f.read()
html

# Preprocess the HTML
from gne.utils import pre_parse, html2element, normalize_text
normal_html = normalize_text(html)
element = html2element(normal_html)
element = pre_parse(element)

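# Compute the scoring metrics for each node
from gne.extractor import ContentExtractor
content = ContentExtractor().extract(element)

# Article body: metrics of the top-ranked node
content[0][1]
# HTML element object of the top-ranked node (the article body): 'node'
tree = content[0][1]['node']
len(tree)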
# Serialize the HTML element object back to an HTML string
from lxml.html import tostring
# HTML of the tag that opens the body, with its complete markup
html_start = tostring(tree[0], encoding='utf-8', pretty_print=True).decode('utf-8')
html_start
# # Attribute values of the tag that opens the body
# attribute = ''.join(tree[0].xpath('./attribute::*'))
# attribute
# Text content of the tags around the start of the body (30 elements before and after it)
tree.xpath('//*[@id="0RSOEB1K"]/preceding::*[position()<=30]//text() | //*[@id="0RSOEB1K"]/following::*[position()<=30]//text()')
# Elements to search for a date: the 30 elements before and after the start of the body
html_date = tree.xpath('//*[@id="0RSOEB1K"]/preceding::*[position()<=30] | //*[@id="0RSOEB1K"]/following::*[position()<=30]')
html_date

# Run GNE's TimeExtractor on each neighbouring element
from gne.extractor import TimeExtractor
time_extractor = TimeExtractor()
publish_times = []
for node in html_date:
    publish_time = time_extractor.extractor(node)
    publish_times.append(publish_time)
publish_times
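
As for the distance itself, one simple definition is the number of elements, in document order, between a candidate node and the body node. The sketch below is an assumption-laden illustration rather than anything GNE provides: `nearest_date`, `DATE_RE` and the sample page are hypothetical, and candidates inside the body are skipped on the assumption that dates mentioned in the article text are not the publish time. It may also be worth checking whether `TimeExtractor` reads `<meta>` tags through absolute XPath expressions in your GNE version; if so, every element in the loop above could return the same page-level date.

```python
import re
from lxml.html import fromstring

# Hypothetical sample page used only for illustration.
SAMPLE_HTML = """
<html><body>
  <div class="nav">Home</div>
  <div class="info"><span>2022-05-09 10:30:00</span><span>Source: example</span></div>
  <div class="post_body">
    <p id="0RSOEB1K">First paragraph of the article.</p>
    <p>Second paragraph.</p>
  </div>
  <div class="comments"><p>No comments yet.</p></div>
  <div class="footer">Copyright 2010-01-01</div>
</body></html>
"""

# Assumed, simplified date pattern (not the pattern list GNE ships with).
DATE_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}(?:\s+\d{1,2}:\d{2}(?::\d{2})?)?")


def nearest_date(root, content_node):
    """Return the date whose element is closest to content_node in document order."""
    nodes = list(root.iter('*'))  # all elements, in document order
    body_start = nodes.index(content_node)
    # Descendants directly follow their ancestor in document order.
    body_end = body_start + sum(1 for _ in content_node.iterdescendants('*'))
    candidates = []
    for i, node in enumerate(nodes):
        if body_start <= i <= body_end:
            continue  # skip the body node and everything inside it
        match = DATE_RE.search(node.text or "")
        if match:
            # Measure from the start of the body for earlier nodes and from
            # its last descendant for later nodes.
            distance = body_start - i if i < body_start else i - body_end
            candidates.append((distance, match.group()))
    return min(candidates)[1] if candidates else None


root = fromstring(SAMPLE_HTML)
content_node = root.xpath('//div[@class="post_body"]')[0]
print(nearest_date(root, content_node))
```

A plain element count is fragile: a footer date that happens to sit right after a short body would win. Weighting nodes before the body more favourably than nodes after it is one adjustment to consider.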
