java - DTD parsing with StAX


I want to parse XML files that declare the HTML 4.01 doctype.

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    <html>
    [...]
    </html>

I am using StAX and an XMLResolver to load a local copy of the DTD:

    XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
    xmlInputFactory.setXMLResolver(new LocalXmlResolver());
    xmlOutputFactory = XMLOutputFactory.newInstance();
    xmlOutputFactory.createXMLEventWriter(...)

    // Maps a public ID to the file name of the local DTD copy.
    private static final Map<String, String> DTDS = new HashMap<String, String>(){{
        // XHTML 1.0 DTDs
        put("-//W3C//DTD XHTML 1.0 Strict//EN", "xhtml1-strict.dtd");
        put("-//W3C//DTD XHTML 1.0 Transitional//EN", "xhtml1-transitional.dtd");
        put("-//W3C//DTD XHTML 1.0 Frameset//EN", "xhtml1-frameset.dtd");

        put("-//W3C//DTD HTML 4.01//EN", "strict.dtd");
        put("-//W3C//DTD HTML 4.01 Transitional//EN", "loose.dtd");
        put("-//W3C//DTD HTML 4.01 Frameset//EN", "frameset.dtd");
    }};

    private static final class LocalXmlResolver implements XMLResolver {

        @Override
        public Object resolveEntity(String publicID, String systemID, String baseURI, String namespace) throws XMLStreamException {
            Object result = null;

            String path = XHTML_DTD_PATH + DTDS.get(publicID);

            if (StringUtils.isNotBlank(path)) {
                // Load the DTD from the classpath instead of the network.
                result = getClass().getClassLoader().getResourceAsStream(path);
            }
            return result;
        }
    }

I retrieved the DTD from the W3C web site. I had to change the file to remove the comments inside declarations such as the one below:

    <!ENTITY % ContentType "CDATA"
        -- media type, per [RFC2045]
        -->

became:

    <!ENTITY % ContentType "CDATA">

But after these modifications, I still get an error:

    javax.xml.stream.XMLStreamException: ParseError at [row,col]:[184,11]
    Message: The element type is required in the element type declaration.
        [...]
    Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[184,11]
    Message: The element type is required in the element type declaration.
        at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:598)
        at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(XMLEventReaderImpl.java:83)

In the DTD file, line 184 is:

    <!ELEMENT (%fontstyle;|%phrase;) - - (%inline;)* >

Any ideas?

Thanks.

HTML is an SGML language, and it has an SGML DTD. You can find more information about SGML here: http://validator.w3.org/docs/sgml.html

SGML is a bit different from XML, so it's no wonder that an XML parser cannot parse it.

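The main example is:

Comments inside entity declarations (delimited by double hyphens: --this is a comment--) are allowed in an SGML DTD, whereas they are not in an XML DTD. Line 184 of the HTML 4.01 DTD is another case: SGML lets an element declaration name a group of elements and carry tag-omission indicators (- -), and XML allows neither.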

For more differences, see http://www.w3.org/TR/NOTE-sgml-xml-971215#null
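
The difference can be reproduced with the JDK's own StAX parser. The sketch below is only an illustration and is not part of the original setup: the class `SgmlVsXmlDtd` and its helper `parses` are hypothetical names, and the declarations are placed in an internal DTD subset so that nothing has to be loaded from disk or from the network.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, for illustration only.
class SgmlVsXmlDtd {

    // Returns true if the StAX parser accepts a document whose internal
    // DTD subset consists of the given declaration, false if it rejects it.
    static boolean parses(String declaration) {
        String xml = "<!DOCTYPE b [\n" + declaration + "\n]>\n<b>text</b>";
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            // Walk through every event so the DTD is actually scanned.
            while (reader.hasNext()) {
                reader.next();
            }
            reader.close();
            return true;
        } catch (XMLStreamException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // XML syntax: a single element name, no tag-omission indicators.
        System.out.println(parses("<!ELEMENT b (#PCDATA)>"));
        // SGML syntax in the style of the HTML 4.01 DTD: name group and "- -".
        System.out.println(parses("<!ELEMENT (b|i) - - (#PCDATA)* >"));
        // SGML comment inside an entity declaration.
        System.out.println(parses("<!ENTITY % ContentType \"CDATA\" -- media type -->"));
    }
}
```

Only the first declaration is valid XML DTD syntax; the other two use the SGML-only constructs described above.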

Nevertheless, you can disable DTD parsing for a specific DTD by creating your own XMLResolver:

    XMLInputFactory xmlInput = XMLInputFactory.newInstance();
    xmlInput.setXMLResolver(new XMLResolver() {
        @Override
        public Object resolveEntity(String publicID, String systemID, String baseURI, String namespace) throws XMLStreamException {
            ...
            // Disable DTD validation: hand the parser an empty DTD
            // (IOUtils is org.apache.commons.io.IOUtils)
            if ("the public id except".equals(publicID)) {
                return IOUtils.toInputStream("");
            }
            ...
        }
    });
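
A self-contained variant of this resolver, without the Commons IO dependency, might look like the sketch below. The class name `SkipHtmlDtd`, the helper `elementNames`, and the choice of the HTML 4.01 Strict public ID as the one to skip are assumptions made for illustration; the empty `ByteArrayInputStream` plays the role of `IOUtils.toInputStream("")`.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class, for illustration only.
class SkipHtmlDtd {

    // Assumption: this is the public ID whose DTD should be skipped.
    static final String HTML401_STRICT = "-//W3C//DTD HTML 4.01//EN";

    // Builds a factory whose resolver answers with an empty DTD for the
    // HTML 4.01 Strict public ID and leaves everything else to the parser.
    static XMLInputFactory newFactory() {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setXMLResolver((publicID, systemID, baseURI, namespace) -> {
            if (HTML401_STRICT.equals(publicID)) {
                return new ByteArrayInputStream(new byte[0]);
            }
            return null; // no replacement supplied for other entities
        });
        return factory;
    }

    // Parses the document and returns the local names of its start tags.
    static List<String> elementNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    newFactory().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("document could not be parsed", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01//EN\" "
                + "\"http://www.w3.org/TR/html4/strict.dtd\">\n"
                + "<html><head><title>t</title></head><body><p>hello</p></body></html>";
        System.out.println(elementNames(xml));
    }
}
```

With an empty DTD, named character entities such as `&nbsp;` are no longer declared, so documents that rely on them may not parse.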

If you need an actual HTML parser, consider http://jtidy.sourceforge.net/ or http://jsoup.org/ instead.

