python 2.7 - Selecting nodes with non-ASCII characters in Scrapy


I have the following simple web scraper written in Scrapy:

#!/usr/bin/env python
# -*- coding: latin-1 -*-

from scrapy.http import Request
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class myspidertest(BaseSpider):
    name = 'myspidertest'
    allowed_domains = ["boliga.dk"]
    start_urls = ["http://www.boliga.dk/bbrinfo/3b71489c-aea0-44ca-a0b2-7bd909b35618",]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # bbritem is defined elsewhere in the project (not shown here)
        item = bbritem()
        print hxs.select("id('unitcontrol')/div[2]/table/tbody/tr[td//text()[contains(.,'antal badeværelser')]]/td[2]/text()").extract()

But when I run the spider, I get the following syntax error:

SyntaxError: Non-ASCII character '\xe6' in file... on line 32, but no encoding declared

This is because of the æ in the XPath. The XPath works in XPath Checker for Firefox. I tried URL-encoding the æ, but that didn't work. What am I missing?

Thanks!

Update: I have added an encoding declaration at the beginning of the code (latin-1 should support Danish characters).

Use a unicode string for the XPath expression:

hxs.select(u"id('unitcontrol')/div[2]/table/tbody/tr[td//text()[contains(.,'antal badeværelser')]]/td[2]/text()").extract() 

or

hxs.select(u"id('unitcontrol')/div[2]/table/tbody/tr[td//text()[contains(.,'antal badev\u00e6relser')]]/td[2]/text()").extract() 

See "Unicode Literals in Python Source Code" in the Python Unicode HOWTO.
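
To illustrate why the literal and the escaped spelling are interchangeable, here is a minimal sketch that uses only the standard library. It is not Scrapy code: `xml.etree.ElementTree` stands in for the Scrapy selector, the two-row table is made up for illustration, and the simplified path `.//tr[td='...']/td[2]` is an assumption that replaces the `contains()` expression, which ElementTree does not support.

```python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

# The escaped and the literal spelling denote the same unicode string.
escaped = u"antal badev\u00e6relser"
literal = u"antal badeværelser"
same_text = (escaped == literal)

# As bytes, the character differs per encoding: one byte (0xE6) in latin-1,
# two bytes (0xC3 0xA6) in UTF-8.
latin1_bytes = escaped.encode("latin-1")
utf8_bytes = escaped.encode("utf-8")

# Hypothetical markup, loosely modelled on the table in the question.
markup = (u"<table>"
          u"<tr><td>antal badev\u00e6relser</td><td>2</td></tr>"
          u"<tr><td>antal v\u00e6relser</td><td>5</td></tr>"
          u"</table>")
root = ET.fromstring(markup.encode("utf-8"))

# Build the path as a unicode string, as recommended above.
path = u".//tr[td='" + escaped + u"']/td[2]"
cells = [td.text for td in root.findall(path)]
print(cells)
```

The '\xe6' in the error message from the question matches the latin-1 byte computed here, which suggests the source file was saved in a single-byte encoding such as latin-1. With a unicode string, the comparison is done on characters rather than on bytes, so the file's encoding no longer affects the match.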

