Java - DTD parsing with StAX
I want to parse XML files that declare the HTML 4.01 doctype:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    <html> [...] </html>

I am using StAX, with an XMLResolver that loads a local copy of the DTD:
    XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
    xmlInputFactory.setXMLResolver(new LocalXmlResolver());
    xmlOutputFactory = XMLOutputFactory.newInstance();
    xmlOutputFactory.createXMLEventWriter(...)

    private static final Map<String, String> dtds = new HashMap<String, String>() {{
        // XHTML 1.0 DTDs
        put("-//W3C//DTD XHTML 1.0 Strict//EN", "xhtml1-strict.dtd");
        put("-//W3C//DTD XHTML 1.0 Transitional//EN", "xhtml1-transitional.dtd");
        put("-//W3C//DTD XHTML 1.0 Frameset//EN", "xhtml1-frameset.dtd");
        // HTML 4.01 DTDs
        put("-//W3C//DTD HTML 4.01//EN", "strict.dtd");
        put("-//W3C//DTD HTML 4.01 Transitional//EN", "loose.dtd");
        put("-//W3C//DTD HTML 4.01 Frameset//EN", "frameset.dtd");
    }};

    private static final class LocalXmlResolver implements XMLResolver {
        @Override
        public Object resolveEntity(String publicId, String systemId, String baseUri, String namespace)
                throws XMLStreamException {
            Object result = null;
            // Check the looked-up file name, not the concatenated path:
            // the path is never blank, even when the public ID is unknown.
            String fileName = dtds.get(publicId);
            if (StringUtils.isNotBlank(fileName)) {
                result = getClass().getClassLoader().getResourceAsStream(XHTML_DTD_PATH + fileName);
            }
            return result;
        }
    }

I retrieved the DTD from the W3C web site. I had to change the file to remove the comments inside declarations such as the one below:
Before:

    <!ENTITY % ContentType "CDATA" -- media type, per [RFC2045] -->

After:

    <!ENTITY % ContentType "CDATA">

But after these modifications, I still get an error:
    javax.xml.stream.XMLStreamException: ParseError at [row,col]:[184,11]
    Message: The element type is required in the element type declaration.
    [...]
    Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[184,11]
    Message: The element type is required in the element type declaration.
        at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:598)
        at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(XMLEventReaderImpl.java:83)

In the DTD file, line 184 is:
    <!ELEMENT (%fontstyle;|%phrase;) - - (%inline;)* >

Any ideas?
Thanks
HTML is an SGML language, so it has an SGML DTD. You can find more information about SGML here: http://validator.w3.org/docs/sgml.html
SGML DTD syntax differs from XML DTD syntax, so it's no wonder that an XML parser cannot parse it.
The main example is:
Comments inside entity declarations (delimited by double hyphens: --this is a comment--) are allowed in an SGML DTD but not in an XML DTD.
For more differences, see http://www.w3.org/TR/NOTE-sgml-xml-971215#null
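To see this difference from the parser's side, here is a minimal sketch that feeds two small documents to StAX. The class name `SgmlCommentDemo` and the sample documents are made up for illustration; they put the entity declaration in an internal DTD subset so that no external file is needed. It assumes the StAX implementation bundled with the JDK, which processes DTDs by default.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; not part of the original question.
final class SgmlCommentDemo {

    // XML-style declaration: nothing between the entity value and '>'.
    static final String XML_STYLE =
        "<!DOCTYPE r [ <!ENTITY % ContentType \"CDATA\"> ]><r/>";

    // SGML-style declaration: a "-- ... --" comment inside the declaration.
    static final String SGML_STYLE =
        "<!DOCTYPE r [ <!ENTITY % ContentType \"CDATA\" -- media type -- > ]><r/>";

    /** Returns true if the whole document can be read without a parse error. */
    static boolean parses(String xml) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                reader.next(); // the internal DTD subset is scanned along the way
            }
            reader.close();
            return true;
        } catch (XMLStreamException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("XML-style declaration parses:  " + parses(XML_STYLE));
        System.out.println("SGML-style declaration parses: " + parses(SGML_STYLE));
    }
}
```

Only the SGML-style declaration should be rejected. The same applies to the rest of the HTML 4.01 DTD, for example the `- -` tag-omission markers and the grouped element names on line 184, so stripping comments alone is not enough.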
Nevertheless, you can disable DTD parsing for a specific DTD by creating your own XMLResolver:
    xmlInput = XMLInputFactory.newInstance();
    xmlInput.setXMLResolver(new XMLResolver() {
        @Override
        public Object resolveEntity(String publicId, String systemId, String baseUri, String namespace)
                throws XMLStreamException {
            ...
            // Disable DTD validation: hand the parser an empty stream instead of the DTD.
            // IOUtils comes from Apache Commons IO.
            if ("the public id except".equals(publicId)) {
                return IOUtils.toInputStream("");
            }
            ...
        }
    });

If you need an HTML parser, consider http://jtidy.sourceforge.net/ or http://jsoup.org/ as a solution.
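The resolver idea above can be sketched end to end using only the JDK. The class name `SkipHtmlDtdDemo`, the sample document, and the `countStartElements` helper are made up for illustration. The sketch assumes the JDK's built-in StAX implementation, and it assumes the input is well-formed XML apart from its HTML doctype.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLResolver;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; not part of the original answer.
final class SkipHtmlDtdDemo {

    static final String HTML_401_STRICT = "-//W3C//DTD HTML 4.01//EN";

    // Well-formed XML that happens to declare the HTML 4.01 doctype.
    static final String SAMPLE =
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01//EN\" "
        + "\"http://www.w3.org/TR/html4/strict.dtd\">"
        + "<html><body><p>hello</p></body></html>";

    // Returns an empty stream for the HTML 4.01 DTD, so the parser never sees
    // its SGML syntax; returns null for anything else (default resolution).
    static final XMLResolver SKIP_HTML_DTD = new XMLResolver() {
        @Override
        public Object resolveEntity(String publicId, String systemId, String baseUri, String namespace)
                throws XMLStreamException {
            if (HTML_401_STRICT.equals(publicId)) {
                return new ByteArrayInputStream(new byte[0]);
            }
            return null;
        }
    };

    /** Parses the document with the resolver installed and counts start tags. */
    static int countStartElements(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setXMLResolver(SKIP_HTML_DTD);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        int count = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                count++;
            }
        }
        reader.close();
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println("start elements: " + countStartElements(SAMPLE));
    }
}
```

One consequence of this design: because the DTD is replaced by an empty stream, entities it would have declared (such as `&nbsp;`) are undefined, so documents that use them may still fail to parse.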