Beautiful Soup 4 输出
本章节主要介绍Beautiful Soup 4 输出相关内容,格式化输出、压缩输出以及输出格式与get_text()的用法都有详细介绍。
格式化输出
prettify()
方法将Beautiful Soup的文档树格式化后以Unicode编码输出,每个XML/HTML标签都独占一行
markup = '<a href="http://example.com/" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" >I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
soup.prettify()
# '<html>\n <head>\n </head>\n <body>\n <a href="http://example.com/" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" >\n...'
print(soup.prettify())
# <html>
# <head>
# </head>
# <body>
# <a href="http://example.com/" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" >
# I linked to
# <i>
# example.com
# </i>
# </a>
# </body>
# </html>
BeautifulSoup
对象和它的tag节点都可以调用 prettify()
方法:
print(soup.a.prettify())
# <a href="http://example.com/" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" >
# I linked to
# <i>
# example.com
# </i>
# </a>
压缩输出
如果只想得到结果字符串,不重视格式,那么可以对一个 BeautifulSoup
对象或 Tag
对象使用Python的 unicode()
或 str()
方法:
str(soup)
# '<html><head></head><body><a href="http://example.com/" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" >I linked to <i>example.com</i></a></body></html>'
unicode(soup.a)
# u'<a href="http://example.com/" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" >I linked to <i>example.com</i></a>'
str()
方法返回UTF-8编码的字符串,可以指定 编码 的设置.
还可以调用 encode()
方法获得字节码或调用 decode()
方法获得Unicode.
输出格式
Beautiful Soup输出是会将HTML中的特殊字符转换成Unicode,比如“&lquot;”:
soup = BeautifulSoup("“Dammit!” he said.")
unicode(soup)
# u'<html><head></head><body>\u201cDammit!\u201d he said.</body></html>'
如果将文档转换成字符串,Unicode编码会被编码成UTF-8.这样就无法正确显示HTML特殊字符了:
str(soup)
# '<html><head></head><body>\xe2\x80\x9cDammit!\xe2\x80\x9d he said.</body></html>'
get_text()
如果只想得到tag中包含的文本内容,那么可以嗲用 get_text()
方法,这个方法获取到tag中包含的所有文版内容包括子孙tag中的内容,并将结果作为Unicode字符串返回:
markup = '<a href="http://example.com/" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" rel="external nofollow" target="_blank" >\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup)
soup.get_text()
u'\nI linked to example.com\n'
soup.i.get_text()
u'example.com'
可以通过参数指定tag的文本内容的分隔符:
# soup.get_text("|")
u'\nI linked to |example.com|\n'
还可以去除获得文本内容的前后空白:
# soup.get_text("|", strip=True)
u'I linked to|example.com'
或者使用 .stripped_strings 生成器,获得文本列表后手动处理列表:
[text for text in soup.stripped_strings]
# [u'I linked to', u'example.com']
更多建议: