Python 2.7 - Selecting nodes with non-ASCII characters in Scrapy
I have the following simple web scraper written in Scrapy:
    #!/usr/bin/env python
    # -*- coding: latin-1 -*-
    from scrapy.http import Request
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector

    class MySpiderTest(BaseSpider):
        name = 'myspidertest'
        allowed_domains = ["boliga.dk"]
        start_urls = ["http://www.boliga.dk/bbrinfo/3b71489c-aea0-44ca-a0b2-7bd909b35618",]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            item = BbrItem()  # item class defined elsewhere in the project (not shown)
            print hxs.select("id('unitcontrol')/div[2]/table/tbody/tr[td//text()[contains(.,'antal badeværelser')]]/td[2]/text()").extract()

But when I run the spider I get the following syntax error:
    SyntaxError: Non-ASCII character '\xe6' in file... on line 32, but no encoding declared

This is because of the æ in the XPath. The XPath works in XPath Checker for Firefox. I tried URL-encoding the æ, but that didn't work. What am I missing?
Thanks!
Update: I have added an encoding declaration at the beginning of the code (latin-1 should support Danish characters).
Use a unicode string for the XPath expression:

    hxs.select(u"id('unitcontrol')/div[2]/table/tbody/tr[td//text()[contains(.,'antal badeværelser')]]/td[2]/text()").extract()

or, with the character written as a unicode escape so the source file stays pure ASCII:

    hxs.select(u"id('unitcontrol')/div[2]/table/tbody/tr[td//text()[contains(.,'antal badev\u00e6relser')]]/td[2]/text()").extract()
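To see why the two spellings above are interchangeable, and where the '\xe6' in the error message comes from, here is a minimal sketch using only the standard library (no Scrapy involved). The helper name build_xpath is hypothetical and exists only for illustration, and the file is assumed to be saved as UTF-8 to match its encoding declaration.

```python
# -*- coding: utf-8 -*-
# Sketch only: plain standard-library Python, no Scrapy involved.
# Assumes this file is saved as UTF-8, matching the declaration above.

# The same label written two ways: with a literal æ and with an escape.
literal = u"antal badeværelser"
escaped = u"antal badev\u00e6relser"

# Both spellings produce the identical unicode string.
same_text = (literal == escaped)

# As bytes, æ depends on the encoding: a single byte in latin-1
# (the '\xe6' seen in the error message), two bytes in UTF-8.
latin1_bytes = u"\u00e6".encode("latin-1")
utf8_bytes = u"\u00e6".encode("utf-8")


def build_xpath(label):
    """Hypothetical helper: build the row-lookup XPath from a unicode label."""
    return (u"id('unitcontrol')/div[2]/table/tbody/"
            u"tr[td//text()[contains(.,'%s')]]/td[2]/text()" % label)


xpath = build_xpath(escaped)
```

The resulting xpath value is a unicode string and could be passed to hxs.select() in place of the inline expression. The escaped form is the safer of the two, since it does not depend on the editor's file encoding agreeing with the coding declaration.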