The application can set a variety of NekoHTML settings to more
precisely control the behavior of the parser. These settings
can be set directly on the HTMLConfiguration class
or on the supplied parser classes by calling the
setFeature and setProperty methods.
For example:

    // settings on HTMLConfiguration
    org.apache.xerces.xni.parser.XMLParserConfiguration config =
        new org.cyberneko.html.HTMLConfiguration();
    config.setFeature("http://cyberneko.org/html/features/augmentations", true);
    config.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");

    // settings on DOMParser
    org.cyberneko.html.parsers.DOMParser parser =
        new org.cyberneko.html.parsers.DOMParser();
    parser.setFeature("http://cyberneko.org/html/features/augmentations", true);
    parser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");

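
Because features are identified by plain strings, several of them can be collected in a map and applied in one pass. The following is a minimal sketch, assuming a parser that implements the standard org.xml.sax.XMLReader interface (such as a SAX parser). The class name FeatureSettings and the choice to collect rejected identifiers instead of failing on the first one are illustrative assumptions, not part of NekoHTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

/** Hypothetical helper: applies a batch of feature settings to a SAX reader. */
public final class FeatureSettings {
    private FeatureSettings() {}

    /**
     * Calls setFeature for every entry, in the map's iteration order.
     * Returns the identifiers the reader did not recognize or support.
     */
    public static List<String> apply(XMLReader reader, Map<String, Boolean> features) {
        List<String> rejected = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : features.entrySet()) {
            try {
                reader.setFeature(entry.getKey(), entry.getValue());
            } catch (SAXNotRecognizedException | SAXNotSupportedException ex) {
                // Remember the identifier so the caller can log or report it.
                rejected.add(entry.getKey());
            }
        }
        return rejected;
    }
}
```

When the reader accepts every identifier, the returned list is empty; a non-empty list usually points at a misspelled feature id or a parser that does not support it.
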
| Feature Id / Description | Default | 
|---|---|
| http://xml.org/sax/features/namespaces Specifies if the NekoHTML parser should perform namespace processing. If enabled, namespace binding attributes are processed and elements and attributes are bound to the defined namespaces. | true | 
| http://cyberneko.org/html/features/balance-tags Specifies if the NekoHTML parser should attempt to balance the tags in the parsed document. Balancing the tags fixes up many common mistakes by adding missing parent elements, automatically closing elements with optional end tags, and correcting unbalanced inline element tags. In order to process HTML documents as XML, this feature should not be turned off. This feature is provided as a performance enhancement for applications that only care about the appearance of specific elements, attributes, and/or content regardless of the document's ill-formed structure. | true | 
| http://cyberneko.org/html/features/override-doctype Specifies whether the NekoHTML parser should override the public and system identifier values specified in the document type declaration. See: http://cyberneko.org/html/properties/doctype/pubid | false | 
| http://cyberneko.org/html/features/insert-doctype Specifies whether the NekoHTML parser should insert a document type declaration into the document handler callbacks. The values for the public and system identifiers are taken from the sysid and pubid properties. Therefore, those properties should be set if this feature is turned on. Also, setting this feature to true will cause the parser to ignore any document type declaration that appears in the document. See: http://cyberneko.org/html/properties/doctype/pubid | false | 
| http://cyberneko.org/html/features/override-namespaces Specifies whether the NekoHTML parser should override the namespace URI bound to HTML elements and attributes. | false | 
| http://cyberneko.org/html/features/insert-namespaces Specifies whether the NekoHTML parser should insert namespace URI bindings to HTML elements and attributes. The value for the namespace URI is taken from the namespaces-uri property. Therefore, that property should be set if this feature is turned on. | false | 
| http://cyberneko.org/html/features/balance-tags/ignore-outside-content Specifies if the NekoHTML parser should ignore content after the end of the document root element. If this feature is set to true, all elements and character content appearing outside of the document body are consumed. If set to false, the end tags for the <body> and <html> elements are ignored, allowing content appearing outside of the document to be parsed and communicated to the application. | false | 
| http://cyberneko.org/html/features/balance-tags/document-fragment Specifies if the tag balancer should operate as if a fragment of HTML is being parsed. With this feature set, the tag balancer will not attempt to insert missing body elements around content and markup. However, proper parents for elements contained within the <body> element will still be inserted. This feature should not be used with the DOMParser class. In order to parse a DOMDocumentFragment, use the DOMFragmentParser class. | false | 
| http://cyberneko.org/html/features/scanner/normalize-attrs Specifies whether attribute values should be normalized according to section 3.3.3 of the XML 1.0 specification. When set to false, only end-of-line normalization and expansion of entities are performed. When set to true, leading and trailing whitespace is trimmed and consecutive whitespace is normalized to a single space character. Note: The raw attribute values can be queried by turning on the augmentations feature and using XNI. | false | 
| http://cyberneko.org/html/features/scanner/cdata-sections Specifies whether CDATA sections are reported as character content. If set to false, CDATA sections are reported as comments. When reported as comments, the comment text is prefixed with "[CDATA[" and ends with "]]". The prefix and suffix are not included when CDATA sections are reported as character content. | false | 
| http://apache.org/xml/features/scanner/notify-char-refs Specifies whether character entity references (e.g. &#32;, &#x20;, etc.) should be reported to the registered document handler. The name of the entity reported will contain the leading pound sign and optional 'x' character. For example, the name of the character entity reference &#x20; will be reported as "#x20". | false | 
| http://apache.org/xml/features/scanner/notify-builtin-refs Specifies whether the XML built-in entity references (e.g. &amp;, &lt;, etc.) should be reported to the registered document handler. This only applies to the five pre-defined XML general entities: "amp", "lt", "gt", "quot", and "apos". This is done for compatibility with the Xerces feature. To be notified of the built-in entity references in HTML, set the http://cyberneko.org/html/features/scanner/notify-builtin-refs feature to true. | false | 
| http://cyberneko.org/html/features/scanner/notify-builtin-refs Specifies whether the HTML built-in entity references (e.g. &nbsp;, &copy;, etc.) should be reported to the registered document handler. This includes the five pre-defined XML general entities. | false | 
| http://cyberneko.org/html/features/scanner/fix-mswindows-refs Specifies whether to fix character entity references for Microsoft Windows® characters as described at http://www.cs.tut.fi/~jkorpela/www/windows-chars.html. | false | 
| http://cyberneko.org/html/features/scanner/ignore-specified-charset Specifies whether to ignore the character encoding specified within the <meta http-equiv='Content-Type' content='text/html;charset=...'> tag or in the <?xml … encoding='…'> processing instruction. By default, NekoHTML checks META tags for a charset and changes the character encoding of the scanning reader object. Setting this feature to true allows the application to override this behavior. | false | 
| http://cyberneko.org/html/features/scanner/script/strip-comment-delims Specifies whether the scanner should strip HTML comment delimiters (i.e. "<!--" and "-->") from <script> element content. See: http://cyberneko.org/html/features/scanner/style/strip-comment-delims | false | 
| http://cyberneko.org/html/features/scanner/script/strip-cdata-delims Specifies whether the scanner should strip XHTML CDATA delimiters (i.e. "<![CDATA[" and "]]>") from <script> element content. See: http://cyberneko.org/html/features/scanner/style/strip-cdata-delims | false | 
| http://cyberneko.org/html/features/scanner/style/strip-comment-delims Specifies whether the scanner should strip HTML comment delimiters (i.e. "<!--" and "-->") from <style> element content. See: http://cyberneko.org/html/features/scanner/script/strip-comment-delims | false | 
| http://cyberneko.org/html/features/scanner/style/strip-cdata-delims Specifies whether the scanner should strip XHTML CDATA delimiters (i.e. "<![CDATA[" and "]]>") from <style> element content. See: http://cyberneko.org/html/features/scanner/script/strip-cdata-delims | false | 
| http://cyberneko.org/html/features/augmentations Specifies whether infoset items that correspond to the HTML events are included in the parsing pipeline. If included, the augmented item will implement the HTMLEventInfo interface found in the org.cyberneko.html package. The augmentations can be queried in XNI by calling the getItem method with the key "http://cyberneko.org/html/features/augmentations". Currently, the HTML event info augmentation can report event character boundaries and whether the event is synthesized. | false | 
| http://cyberneko.org/html/features/report-errors Specifies whether errors should be reported to the registered error handler. Since HTML applications are supposed to permit the liberal use (and abuse) of HTML documents, errors should normally be handled silently. However, if the application wants to know about errors in the parsed HTML document, this feature can be set to true. | false | 
| http://cyberneko.org/html/features/parse-noscript-content Specifies whether the content of a <noscript>...</noscript> node should be parsed or not. When set to false, the content will be treated as plain text, whereas when set to true, tags will be parsed normally. | true | 
| http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe Specifies whether a self-closing <iframe/> tag should be allowed or not. When set to true, the parser won't look for a corresponding closing </iframe> tag. | false | 
| http://cyberneko.org/html/features/scanner/allow-selfclosing-tags Specifies whether a self-closing tag (e.g. <div/>) should be allowed or not. When set to true, the parser won't look for a corresponding closing tag. | false | 
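
The effect described for the normalize-attrs feature when it is set to true (trim leading and trailing whitespace, collapse runs of whitespace to one space) can be illustrated with a small standalone function. This is a sketch of the described behavior only, not NekoHTML's implementation; the class name AttrNormalizer is hypothetical, and whitespace is assumed to mean the four XML 1.0 whitespace characters (space, tab, carriage return, line feed).

```java
/** Hypothetical illustration of full attribute-value whitespace normalization. */
public final class AttrNormalizer {
    private AttrNormalizer() {}

    /** Trims the value and collapses each whitespace run to a single space. */
    public static String normalize(String value) {
        StringBuilder out = new StringBuilder(value.length());
        boolean pendingSpace = false;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
                // Only remember a space if some non-whitespace text precedes it,
                // which drops leading whitespace.
                pendingSpace = out.length() > 0;
            } else {
                if (pendingSpace) {
                    out.append(' ');
                    pendingSpace = false;
                }
                out.append(c);
            }
        }
        // A pending space at the end is never written, which drops trailing whitespace.
        return out.toString();
    }
}
```

With the feature left at its default of false, the table above states that only end-of-line normalization and entity expansion are performed, so interior and surrounding whitespace would be kept.
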

| Property Id / Description | Values | Default | 
|---|---|---|
| http://cyberneko.org/html/properties/filters This property allows applications to append custom document processing components to the end of the default NekoHTML parser pipeline. The value of this property must be an array of type org.apache.xerces.xni.parser.XMLDocumentFilter, and no element of this array is allowed to be null. The document filters are appended to the parser pipeline in array order. Please refer to the filters documentation for more information. | XMLDocumentFilter[] | null | 
| http://cyberneko.org/html/properties/default-encoding Sets the default encoding the NekoHTML scanner should use when parsing documents. In the absence of an http-equiv directive in the source document, this setting is important because the parser does not have any support to auto-detect the encoding. See: http://cyberneko.org/html/features/scanner/ignore-specified-charset | IANA encoding names | |
| http://cyberneko.org/html/properties/names/elems Specifies how the NekoHTML components should modify recognized element names. Names can be converted to upper-case, converted to lower-case, or left as-is. The value of "match" specifies that element names are to be left as-is but the end tag name will be modified to match the start tag name. This is required to ensure that the parser generates a well-formed XML document. | "upper" "lower" "match" | "upper" | 
| http://cyberneko.org/html/properties/names/attrs Specifies how the NekoHTML components should modify attribute names of recognized elements. Names can be converted to upper-case, converted to lower-case, or left as-is. | "upper" "lower" | "lower" | 
| http://cyberneko.org/html/properties/doctype/pubid Specifies the document type declaration public identifier if the http://cyberneko.org/html/features/override-doctype feature is set to true. The default value is the HTML 4.01 transitional public identifier, "-//W3C//DTD HTML 4.01 Transitional//EN". | String | HTML 4.01 transitional public identifier | 
| http://cyberneko.org/html/properties/doctype/sysid Specifies the document type declaration system identifier if the http://cyberneko.org/html/features/override-doctype feature is set to true. The default value is the HTML 4.01 transitional system identifier, "http://www.w3.org/TR/html4/loose.dtd". | String | HTML 4.01 transitional system identifier | 
| http://cyberneko.org/html/properties/namespaces-uri Specifies the namespace binding URI if the http://cyberneko.org/html/features/override-namespaces feature is set to true. The default value is the XHTML 1.0 namespace, "http://www.w3.org/1999/xhtml". This property does not affect the case of element and attribute names and does not ensure that the output of the NekoHTML parser is valid according to the XHTML specification. | String | XHTML 1.0 namespace URI | 
| Experimental: http://cyberneko.org/html/properties/balance-tags/fragment-context-stack Specifies the stack of elements that should be considered as ancestors while parsing an HTML fragment. For instance, when the last item of the context stack is a TABLE (or a TBODY, THEAD, or TFOOT), the following fragment will be parsed as a new row: <tr><td>hello</td></tr>. When the context doesn't indicate that the parser is already within a table, TR and TD tags will be discarded. | QName[] | null |
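
The three values of the names/elems property can be read as a small name-mapping rule. The sketch below illustrates the semantics described in the table and is not taken from NekoHTML's source; the class NameCase and its method names are hypothetical, and Locale.ROOT is an assumption made so that case conversion does not depend on the platform locale.

```java
import java.util.Locale;

/** Hypothetical illustration of the "upper", "lower", and "match" name modes. */
public final class NameCase {
    private NameCase() {}

    /** Name reported for a start tag under the given mode. */
    public static String modify(String name, String mode) {
        switch (mode) {
            case "upper":
                return name.toUpperCase(Locale.ROOT);
            case "lower":
                return name.toLowerCase(Locale.ROOT);
            case "match":
                return name; // left as-is
            default:
                throw new IllegalArgumentException("unknown mode: " + mode);
        }
    }

    /**
     * Name reported for an end tag. In "match" mode the end tag takes the name
     * already reported for its start tag, so the pair stays well-formed.
     */
    public static String endTagName(String reportedStartName, String rawEndName, String mode) {
        return "match".equals(mode) ? reportedStartName : modify(rawEndName, mode);
    }
}
```

For a document containing `<Div>...</DIV>`, the "match" mode in this sketch reports both tags as Div, while "upper" and "lower" make the two names agree by converting each of them.
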