Reading and parsing an HTML file with lxml
Updated 2021-05-28 09:46
from lxml import etree
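html = etree.parse('test.html', etree.HTMLParser())  # HTMLParser repairs the markup while parsing, adding missing pieces such as the DOCTYPE declaration
result = etree.tostring(html)  # serialize the tree to bytes
# result = etree.tostringlist(html)  # serialize to a list instead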
print(type(html))
print(type(result))
print(result)
# Output:
<class 'lxml.etree._ElementTree'>
<class 'bytes'>
b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">\n<html><body><div> \n <ul> \n <li class="item-0"><a href="link1.html">first item</a></li>\n <li class="item-1"><a href="link2.html">second item</a></li>\n <li class="item-inactive"><a href="link3.html">third item</a></li>\n <li class="item-1"><a href="link4.html">fourth item</a></li>\n <li class="item-0"><a href="link5.html">fifth item</a> \n </li></ul> \n </div> \n</body></html>'
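The output above shows that the parser wrapped the fragment in html and body tags and closed the final li element. The following is a minimal sketch of the same steps that can run without test.html, assuming lxml is installed. The SAMPLE markup and the temporary file are hypothetical stand-ins for the original file, and the XPath query is added only to show that the repaired tree can be queried.

```python
import os
import tempfile

from lxml import etree

# Hypothetical sample markup (not the original test.html).
# The second <li> is deliberately left unclosed.
SAMPLE = (
    '<div>\n<ul>\n'
    '<li class="item-0"><a href="link1.html">first item</a></li>\n'
    '<li class="item-1"><a href="link2.html">second item</a>\n'
    '</ul>\n</div>\n'
)

# Write the sample to a temporary file so that etree.parse reads from disk,
# as in the example above.
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False,
                                 encoding='utf-8') as f:
    f.write(SAMPLE)
    path = f.name

try:
    tree = etree.parse(path, etree.HTMLParser())
finally:
    os.remove(path)

root = tree.getroot()                # the <html> element added by the parser
hrefs = tree.xpath('//li/a/@href')   # query the repaired tree
# encoding='unicode' returns str instead of bytes; method='html' keeps HTML syntax
text = etree.tostring(tree, encoding='unicode', method='html')

print(root.tag)
print(hrefs)
print(text)
```

Passing encoding='unicode' to etree.tostring is convenient when the result is only printed, because it avoids the b'...' representation with escaped \n sequences seen in the output above.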