Scrapy 2.3: Extensions to CSS Selectors

Updated 2021-06-02 16:17

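According to the W3C standards, CSS selectors do not support selecting text nodes or attribute values. Selecting these is so essential in a web scraping context, however, that Scrapy (parsel) implements two non-standard pseudo-elements:

to select text nodes, use ::text
to select attribute values, use ::attr(name), where name is the name of the attribute whose value you want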

Warning

These pseudo-elements are specific to Scrapy/parsel. They will most probably not work with other libraries such as lxml or PyQuery.

Examples:

title::text selects the child text nodes of a descendant <title> element:

>>> response.css('title::text').get()
'Example website'

*::text selects all descendant text nodes of the current selector context:

>>> response.css('#images *::text').getall()
['\n   ',
 'Name: My image 1 ',
 '\n   ',
 'Name: My image 2 ',
 '\n   ',
 'Name: My image 3 ',
 '\n   ',
 'Name: My image 4 ',
 '\n   ',
 'Name: My image 5 ',
 '\n  ']

foo::text returns no results if the foo element exists but contains no text (i.e. the text is empty):

>>> response.css('img::text').getall()
[]

This means .css('foo::text').get() can return None even if the element exists. Use default='' if you always want a string:

>>> response.css('img::text').get()
>>> response.css('img::text').get(default='')
''

a::attr(href) selects the href attribute value of descendant links:

>>> response.css('a::attr(href)').getall()
['image1.html',
 'image2.html',
 'image3.html',
 'image4.html',
 'image5.html']

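Note

See also: Selecting element attributes.

Note

You cannot chain these pseudo-elements. In practice this would not make much sense anyway: text nodes have no attributes, and attribute values are already strings and have no child nodes.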