// Copyright (c) 2012-2018 Ugorji Nwoke. All rights reserved.
// Use of this source code is governed by a MIT license found in the LICENSE file.
A strict Non-validating namespace-aware XML 1.0 parser and (en|de)coder.

We are attempting this due to perceived issues with encoding/xml:
  - Complicated. It tries to do too much, and is not as simple to use as json.
  - Due to over-engineering, reflection is over-used AND performance suffers:
    java is 6X faster: http://fabsk.eu/blog/category/informatique/dev/golang/
    even PYTHON performs better: http://outgoing.typepad.com/outgoing/2014/07/exploring-golang.html

codec framework will offer the following benefits
  - VASTLY improved performance (when using reflection-mode or codecgen)
  - simplicity and consistency with the rest of the supported formats
  - all other benefits of codec framework (streaming, code generation, etc)

codec is not a drop-in replacement for encoding/xml.
It is a replacement, based on the simplicity and performance of codec.
Look at it like JAXB for Go.
  - Need to output XML preamble, with all namespaces at the right location in the output.
  - Each "end" block is dynamic, so we need to maintain a context-aware stack
  - How to decide when to use an attribute VS an element
  - How to handle chardata, attr, comment EXPLICITLY.
  - Should it output fragments?
    e.g. encoding a bool should just output true OR false, which is not well-formed XML.

Extend the struct tag. See representative example:
    ID uint8 `codec:"http://ugorji.net/x-namespace xid id,omitempty,toarray,attr,cdata"`
    // format: [namespace-uri ][namespace-prefix ]local-name, ...

Based on this, we encode
  - fields as elements, BUT
    encode as attributes if struct tag contains ",attr" and is a scalar (bool, number or string)
  - text as entity-escaped text, BUT encode as CDATA if struct tag contains ",cdata".
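The tag format above can be illustrated with a small parser. This is only a sketch: xmlTagInfo and parseXMLTag are hypothetical names, not part of codec, and the handling of a two-token name (treated here as namespace-uri plus local-name) is an assumption, since the notes do not say how that case is disambiguated.

```go
package main

import (
	"fmt"
	"strings"
)

// xmlTagInfo is a hypothetical holder for the pieces of a codec struct tag.
type xmlTagInfo struct {
	NamespaceURI string
	Prefix       string
	Local        string
	OmitEmpty    bool
	Attr         bool
	CDATA        bool
}

// parseXMLTag splits a tag value of the form
//   [namespace-uri ][namespace-prefix ]local-name,opt1,opt2,...
// Options other than omitempty, attr and cdata are ignored in this sketch.
func parseXMLTag(tag string) xmlTagInfo {
	var ti xmlTagInfo
	parts := strings.Split(tag, ",")
	names := strings.Fields(parts[0])
	switch len(names) {
	case 1:
		ti.Local = names[0]
	case 2:
		// Assumption: two tokens mean "namespace-uri local-name".
		ti.NamespaceURI, ti.Local = names[0], names[1]
	case 3:
		ti.NamespaceURI, ti.Prefix, ti.Local = names[0], names[1], names[2]
	}
	for _, opt := range parts[1:] {
		switch strings.TrimSpace(opt) {
		case "omitempty":
			ti.OmitEmpty = true
		case "attr":
			ti.Attr = true
		case "cdata":
			ti.CDATA = true
		}
	}
	return ti
}

func main() {
	ti := parseXMLTag("http://ugorji.net/x-namespace xid id,omitempty,toarray,attr,cdata")
	fmt.Printf("%+v\n", ti)
}
```

Whether a field tagged ",attr" is really written as an attribute would still depend on the scalar check described above; the sketch only records the flag.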
  - XMLHandle is denoted as being namespace-aware.
    Consequently, we WILL use the ns:name pair to encode and decode if defined, else use the plain name.
  - *Encoder and *Decoder know whether the Handle "prefers" namespaces.
  - add *Encoder.getEncName(*structFieldInfo).
    No one calls *structFieldInfo.indexForEncName directly anymore
  - OR better yet: indexForEncName is namespace-aware, and helper.go is all namespace-aware
    indexForEncName takes a parameter of the form namespace:local-name OR local-name
  - add *Decoder.getStructFieldInfo(encName string) // encName here is either like abc, or h1:nsabc
    by being a method on *Decoder, or maybe a method on the Handle itself.
    No one accesses .encName anymore
  - let encode.go and decode.go use these (for consistency)
  - only problem exists for gen.go, where we create a big switch on encName.
    Now, we also have to add a switch on strings.HasSuffix(kName, encNsName)
  - gen.go will need to have many more methods, and then double up on the 2 switches like:
        case !nsAware: panic(...)
        case strings.HasSuffix(kName, ":abc"): x.abc()
        case strings.HasSuffix(kName, ":def"): x.def()
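A minimal sketch of the two-level switch that generated code might use, under the assumption that the plain name is tried first and the namespace-qualified suffix only afterwards. fieldFor is a hypothetical helper; it returns the matched field name rather than calling x.abc() or x.def(), and returns an error where the notes say panic(...).

```go
package main

import (
	"fmt"
	"strings"
)

// fieldFor resolves an encoded key name to a field name.
// It first tries an exact match on the plain name, then, only when the
// handle is namespace-aware, a suffix match on ":local-name".
func fieldFor(kName string, nsAware bool) (string, error) {
	switch kName {
	case "abc":
		return "abc", nil
	case "def":
		return "def", nil
	}
	switch {
	case !nsAware:
		return "", fmt.Errorf("unknown field %q", kName)
	case strings.HasSuffix(kName, ":abc"):
		return "abc", nil
	case strings.HasSuffix(kName, ":def"):
		return "def", nil
	}
	return "", fmt.Errorf("unknown field %q", kName)
}

func main() {
	for _, k := range []string{"abc", "h1:def", "h1:xyz"} {
		f, err := fieldFor(k, true)
		fmt.Println(k, f, err)
	}
}
```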
The structure below accommodates this:

  type typeInfo struct {
    sfi []*structFieldInfo // sorted by encName
    sfins // sorted by namespace
    sfia // sorted, to have those with attributes at the top. Needed to write XML appropriately.
  type structFieldInfo struct {
indexForEncName is now an internal helper function that takes a sorted array
(one of ti.sfins or ti.sfi). It is only used by *Decoder.getStructFieldInfo(...)
There will be a separate parser from the builder.
The parser will have a method: next() xmlToken. It has lookahead support,
so you can pop multiple tokens, make a determination, and push them back in the order popped.
This will be needed to determine whether we are "nakedly" decoding a container or not.
The stack will be implemented using a slice and push/pop happens at the [0] element.

  - type uint8: 0 | ElementStart | ElementEnd | AttrKey | AttrVal | Text
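The lookahead idea can be sketched as a slice-backed stack whose front ([0]) is the next token. All names here (token, tokenStack, pop, pushBack) are hypothetical and only illustrate the technique. pushBack takes tokens in the order they were popped and restores them so they are read again in that same order.

```go
package main

import "fmt"

type tokenType uint8

const (
	tokElementStart tokenType = iota + 1
	tokElementEnd
	tokAttrKey
	tokAttrVal
	tokText
)

type token struct {
	Type  tokenType
	Value string
}

// tokenStack keeps pending tokens; index 0 is the next one to be read.
type tokenStack struct {
	toks []token
}

// pop removes and returns the token at [0]; ok is false when empty.
func (s *tokenStack) pop() (t token, ok bool) {
	if len(s.toks) == 0 {
		return token{}, false
	}
	t = s.toks[0]
	s.toks = s.toks[1:]
	return t, true
}

// pushBack re-inserts previously popped tokens at the front,
// preserving the order in which they were popped.
func (s *tokenStack) pushBack(popped ...token) {
	restored := make([]token, 0, len(popped)+len(s.toks))
	restored = append(restored, popped...)
	s.toks = append(restored, s.toks...)
}

func main() {
	s := &tokenStack{toks: []token{
		{tokElementStart, "a"}, {tokAttrKey, "k"}, {tokAttrVal, "v"},
	}}
	t1, _ := s.pop()
	t2, _ := s.pop()
	// Look ahead at two tokens, then put them back untouched.
	s.pushBack(t1, t2)
	next, _ := s.pop()
	fmt.Println(next.Value)
}
```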
SEE: http://www.xml.com/pub/a/98/10/guide0.html?page=3#ENTDECL

The following are skipped when parsing:
  - External Entities (from external file)
  - Notation Declaration e.g. <!NOTATION GIF87A SYSTEM "GIF">
  - Entity Declarations & References
  - XML Declaration (assume UTF-8)
  - XML Directive i.e. <! ... >
  - Other Declarations: Notation, etc.
  - Processing Instruction
  - schema / DTD for validation:
    We are not a VALIDATING parser. Validation is done elsewhere.
    However, some parts of the DTD internal subset are used (SEE BELOW).
    For Attribute List Declarations e.g.
    <!ATTLIST foo:oldjoke name ID #REQUIRED label CDATA #IMPLIED status ( funny | notfunny ) 'funny' >
    We considered using the ATTLIST to get "default" value, but not to validate the contents. (VETOED)
The following XML features are supported

The following DTD features are supported (when declared in an internal subset):
  - Internal Entities e.g.
    <!ENTITY burns "ugorji is cool" > AND entities for the set: [<>&"']
  - Parameter entities e.g.
    <!ENTITY % personcontent "ugorji is cool"> <!ELEMENT burns (%personcontent;)*>

At decode time, a structure containing the following is kept
  - default attribute values
  - all internal entities (<>&"' and others written in the document)
When decode starts, it parses XML namespace declarations and creates a map in the
xmlDecDriver. While parsing, that map continuously gets updated.
The only problem happens when a namespace declaration happens on the node that it defines.
e.g. <hn:name xmlns:hn="http://www.ugorji.net" >
To handle this, each Element must be fully parsed at a time,
even if it amounts to multiple tokens which are returned one at a time on request.

xmlns is a special attribute name.
  - It is used to define namespaces, including the default
  - It is never returned as an AttrKey or AttrVal.
  *We may decide later to allow user to use it e.g. you want to parse the xmlns mappings into a field.*
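The reason an element must be fully parsed before its name is resolved can be shown with a short sketch. attr and resolveElement are hypothetical names. The function scans every attribute of one start tag for xmlns declarations first, updates the prefix map, and only then resolves the element's own prefix. Scoping (dropping declarations when the element ends) is deliberately left out.

```go
package main

import (
	"fmt"
	"strings"
)

// attr is a raw attribute as read from a start tag.
type attr struct {
	Key, Val string
}

// resolveElement processes one fully parsed start tag.
// ns maps prefix -> namespace URI ("" is the default namespace) and is
// updated in place. xmlns attributes are consumed and never returned.
func resolveElement(qname string, attrs []attr, ns map[string]string) (space, local string, rest []attr) {
	for _, a := range attrs {
		switch {
		case a.Key == "xmlns":
			ns[""] = a.Val
		case strings.HasPrefix(a.Key, "xmlns:"):
			ns[strings.TrimPrefix(a.Key, "xmlns:")] = a.Val
		default:
			rest = append(rest, a)
		}
	}
	prefix := ""
	local = qname
	if i := strings.IndexByte(qname, ':'); i >= 0 {
		prefix, local = qname[:i], qname[i+1:]
	}
	return ns[prefix], local, rest
}

func main() {
	ns := map[string]string{}
	space, local, rest := resolveElement("hn:name",
		[]attr{{"xmlns:hn", "http://www.ugorji.net"}, {"id", "1"}}, ns)
	fmt.Println(space, local, rest)
}
```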
Number, bool, null, mapKey, etc can all be decoded from any xmlToken.
This accommodates map[int]string for example.

It should be possible to create a schema from the types,
or vice versa (generate types from schema with appropriate tags).
This is however out of scope for this parsing project.

We should write all namespace information at the first point that it is referenced in the tree,
and use the mapping for all child nodes and attributes. This means that state is maintained
at a point in the tree. This also means that calls to Decode or MustDecode will reset some state.

When decoding, it is important to keep track of entity references and default attribute values.
It seems these can only be stored in the DTD components. We should honor them when decoding.
Configuration for XMLHandle will look like this:

    NS map[string]string // ns URI to key, used for encoding
    // Decoding: in case ENTITY declared in external schema or dtd, store info needed here
    Entities map[string]string // map of entity rep to character

During encode, if a namespace mapping is not defined for a namespace found on a struct,
then we create a mapping for it using nsN (where N is 1..1000000, and doesn't conflict
with any other namespace mapping).
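One way the nsN allocation could work is sketched below. prefixFor and nextNSPrefix are hypothetical helpers, and the map mirrors the NS field described above (namespace URI to prefix). The linear scan is chosen for clarity, not speed.

```go
package main

import (
	"fmt"
	"strconv"
)

// nextNSPrefix returns the first nsN (N in 1..1000000) that is not
// already used as a prefix in ns, or "" if all are taken.
func nextNSPrefix(ns map[string]string) string {
	used := make(map[string]bool, len(ns))
	for _, p := range ns {
		used[p] = true
	}
	for n := 1; n <= 1000000; n++ {
		p := "ns" + strconv.Itoa(n)
		if !used[p] {
			return p
		}
	}
	return ""
}

// prefixFor returns the prefix mapped to uri, creating an nsN mapping
// (and recording it in ns) when none is defined yet.
func prefixFor(ns map[string]string, uri string) string {
	if p, ok := ns[uri]; ok {
		return p
	}
	p := nextNSPrefix(ns)
	ns[uri] = p
	return p
}

func main() {
	ns := map[string]string{"http://a": "ns1", "http://b": "x"}
	fmt.Println(prefixFor(ns, "http://c"), prefixFor(ns, "http://a"))
}
```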
Note that different fields in a struct can have different namespaces.
However, all fields will default to the namespace on the _struct field (if defined).

An XML document is a name, a map of attributes and a list of children.
Consequently, we cannot "DecodeNaked" into a map[string]interface{} (for example).
We have to "DecodeNaked" into something that resembles XML data.

To support DecodeNaked (decode into nil interface{}), we have to define some "supporting" types:
    type Name struct { // Preferred. Fewer allocations due to conversions.
    type Element struct {
      Attrs map[Name]string
      Children []interface{} // each child is either *Element or string
Only two "supporting" types are exposed for XML: Name and Element.
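To make the shape of these value objects concrete, here is a hedged sketch. The Space and Local fields of Name and the Name field of Element are assumptions (the notes only say a document is "a name, a map of attributes and a list of children"), and text is a hypothetical helper showing how a consumer might walk Children.

```go
package main

import (
	"fmt"
	"strings"
)

// Name is a namespace-aware name. Field names are assumed for this sketch.
type Name struct {
	Space string
	Local string
}

// Element is a plain value object: exported fields, no accessor methods.
type Element struct {
	Name     Name // assumed field
	Attrs    map[Name]string
	Children []interface{} // each child is either *Element or string
}

// text concatenates all character data under e, depth first.
func text(e *Element) string {
	var sb strings.Builder
	for _, c := range e.Children {
		switch v := c.(type) {
		case string:
			sb.WriteString(v)
		case *Element:
			sb.WriteString(text(v))
		}
	}
	return sb.String()
}

func main() {
	e := &Element{
		Name:  Name{Local: "p"},
		Attrs: map[Name]string{Name{Local: "id"}: "1"},
		Children: []interface{}{
			"hello ",
			&Element{Name: Name{Local: "b"}, Children: []interface{}{"world"}},
		},
	}
	fmt.Println(text(e), e.Attrs[Name{Local: "id"}])
}
```

Because Name is a comparable struct, it can be used directly as the map key for Attrs without building a combined string.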
// ------------------

We considered 'type Name string' where Name is like "Space Local" (space-separated).
We decided against it, because each creation of a name would lead to
double allocation (first convert []byte to string, then concatenate them into a string).
The benefit is that it is faster to read Attrs from a map. But given that Element is a value
object, we want to eschew methods and have public exposed variables.

We also considered the following, where xml types were not value objects, and we used
intelligent accessor methods to extract information and for performance.
*** WE DECIDED AGAINST THIS. ***
    // Element is a ValueObject: There are no accessor methods.
    // Make element self-contained.
    type Element struct {
      attrsMap map[string]string // where key is "Space Local"
      childrenI []int // each child is an index into T or E.
    func (x *Element) child(i int) interface{} // returns string or *Element

// ------------------
Per XML spec and our default handling, white space is always treated as
insignificant between elements, except in a text node. The xml:space='preserve'
attribute is ignored.

**Note: there is no xml: namespace. The xml: attributes were defined before namespaces.**
**So treat them as just "directives" that should be interpreted to mean something**.

On encoding, we support indenting aka prettifying markup in the same way we support it for json.

A document or element can only be encoded/decoded from/to a struct. In this mode:
  - struct name maps to element name (or tag-info from _struct field)
  - fields are mapped to child elements or attributes

A map is either encoded as attributes on current element, or as a set of child elements.
Maps are encoded as attributes iff their keys and values are primitives (number, bool, string).
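The attribute-versus-element rule for maps can be expressed as a type check. This is a sketch using reflection; isPrimitiveKind and mapEncodesAsAttrs are hypothetical helpers, and the rule is applied to the map's static key and value types (a map[string]interface{} is therefore treated as child elements even if it happens to hold only strings).

```go
package main

import (
	"fmt"
	"reflect"
)

// isPrimitiveKind reports whether k is a number, bool or string kind.
func isPrimitiveKind(k reflect.Kind) bool {
	switch k {
	case reflect.Bool, reflect.String,
		reflect.Int, reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64,
		reflect.Uint, reflect.Uint8, reflect.Uint16, reflect.Uint32, reflect.Uint64,
		reflect.Float32, reflect.Float64:
		return true
	}
	return false
}

// mapEncodesAsAttrs reports whether v is a map whose key and value
// types are both primitives, i.e. a candidate for attribute encoding.
func mapEncodesAsAttrs(v interface{}) bool {
	t := reflect.TypeOf(v)
	if t == nil || t.Kind() != reflect.Map {
		return false
	}
	return isPrimitiveKind(t.Key().Kind()) && isPrimitiveKind(t.Elem().Kind())
}

func main() {
	fmt.Println(mapEncodesAsAttrs(map[string]int{"a": 1}))
	fmt.Println(mapEncodesAsAttrs(map[string][]int{"a": {1}}))
}
```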
A list is encoded as a set of child elements.

Primitives (number, bool, string) are encoded as an element, attribute or text
depending on the context.

Extensions must encode themselves as a text string.

Encoding is tough, specifically when encoding mappings, because we need to encode
as either attribute or element. To do this, we need to default to encoding as attributes,
and then let Encoder inform the Handle when to start encoding as nodes.
i.e. Encoder does something like:

    h.Encode(), h.Encode(), ...
    h.EncodeMapNotAttrSignal() // this is not a bool, because it's a signal
    h.Encode(), h.Encode(), ...

Only XMLHandle understands this, and will set itself to start encoding as elements.

This support extends to maps. For example, if a struct field is a map, and it has
the struct tag signifying it should be attr, then all its fields are encoded as attributes.

    M map[string]int `codec:"m,attr"` // encode keys as attributes named

  - if encoding a map, what if map keys have spaces in them???
    Then they cannot be attributes or child elements. Error.
Options to consider adding later:
  - For attribute values, normalize by trimming beginning and ending white space,
    and converting every white space sequence to a single space.
  - ATTLIST restrictions are enforced.
    e.g. default value of xml:space, skipping xml:XYZ style attributes, etc.
  - Consider supporting NON-STRICT mode (e.g. to handle HTML parsing).
    Some elements e.g. br, hr, etc need not close and should be auto-closed
    ... (see http://www.w3.org/TR/html4/loose.dtd)
    An expansive set of entities is pre-defined.
  - Have an easy way to create an HTML parser:
    add an HTML() method to XMLHandle, that will set Strict=false, specify AutoClose,
    and add HTML Entities to the list.
  - Support validating element/attribute XMLName before writing it.
    Keep this behind a flag, which is set to false by default (for performance).
    type XMLHandle struct {

  - build encoder (1 day)
  - build decoder (based off xmlParser) (1 day)
  - implement xmlParser (2 days).
    Look at encoding/xml for inspiration.
  - integrate and TEST (1 day)
  - write article and post it (1 day)
// ---------- MORE NOTES FROM 2017-11-30 ------------

  - parse the attributes first
  - then parse the nodes

  - if encoding a field: we use the field name for the wrapper
  - if encoding a non-field, then just use the element type name

      map[string]string ==> <map><key>abc</key><value>val</value></map>... or
                            <map key="abc">val</map>... OR
                            <key1>val1</key1><key2>val2</key2>...       <- PREFERRED
      []string  ==> <string>v1</string><string>v2</string>...
      string v1 ==> <string>v1</string>
      bool true ==> <bool>true</bool>
      float 1.0 ==> <float>1.0</float>

      F1 map[string]string ==> <F1><key>abc</key><value>val</value></F1>... OR
                               <F1 key="abc">val</F1>... OR
                               <F1><abc>val</abc>...</F1>                <- PREFERRED
      F2 []string          ==> <F2>v1</F2><F2>v2</F2>...
      F3 bool              ==> <F3>true</F3>

  - a scalar is encoded as:
      (value) of type T  ==> <T><value/></T>
      (value) of field F ==> <F><value/></F>
  - A kv-pair is encoded as:
      (key,value)            ==> <map><key><value/></key></map> OR <map key="value">
      (key,value) of field F ==> <F><key><value/></key></F> OR <F key="value">
  - A map or struct is just a list of kv-pairs
  - A list is encoded as sequences of same node e.g.
  - we may have to singularize the field name when encoding into xml,
    and pluralize it when decoding.
  - bi-directional encode->decode->encode is not a MUST.
    even encoding/xml cannot decode correctly what was encoded:

    see https://play.golang.org/p/224V_nyhMS

      fmt.Println("Hello, playground")
      v := []interface{}{"hello", 1, true, nil, time.Now()}
      s, err := xml.Marshal(v)
      fmt.Printf("err: %v, \ns: %s\n", err, s)
      err = xml.Unmarshal(s, &v2)
      fmt.Printf("err: %v, \nv2: %v\n", err, v2)
      s, err = xml.Marshal(v3)
      fmt.Printf("err: %v, \ns: %s\n", err, s)
      err = xml.Unmarshal(s, &v4)
      fmt.Printf("err: %v, \nv4: %v\n", err, v4)

      s: <string>hello</string><int>1</int><bool>true</bool><Time>2009-11-10T23:00:00Z</Time>
      s: <T><V>hello</V><V>1</V><V>true</V><V>2009-11-10T23:00:00Z</V></T>
      v4: {[<nil> <nil> <nil> <nil>]}
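The round-trip loss shown above can be reproduced with a reduced, self-contained program. This is not the original playground code: the struct T with a single V []interface{} field is inferred from the <T><V>...</V></T> output, the time value is dropped to keep the result deterministic, and roundTrip is a hypothetical helper.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// T wraps a slice of arbitrary values, as inferred from the output above.
type T struct {
	V []interface{}
}

// roundTrip marshals in (wrapped in T) with encoding/xml and decodes
// the result back, returning the XML text and the decoded value.
func roundTrip(in []interface{}) (string, T, error) {
	s, err := xml.Marshal(T{V: in})
	if err != nil {
		return "", T{}, err
	}
	var out T
	err = xml.Unmarshal(s, &out)
	return string(s), out, err
}

func main() {
	s, out, err := roundTrip([]interface{}{"hello", 1, true})
	fmt.Printf("err: %v\ns: %s\nout: %v\n", err, s, out)
}
```

The decoded slice has one entry per <V> element, but each entry is nil: the element names carry no type information, so encoding/xml has nothing concrete to decode into.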
// ----------- PARSER  -------------------

type xmlTokenType uint8

	_ xmlTokenType = iota << 1

type xmlToken struct {
	Namespace string // blank for AttrVal and Text

type xmlParser struct {
	toks []xmlToken // list of tokens.
	ptr  int        // ptr into the toks slice
	done bool       // nothing else to parse. r now returns EOF.

func (x *xmlParser) next() (t *xmlToken) {
	// once x.done, or x.ptr == len(x.toks) == 0, then return nil (to signify finish)
	if !x.done && len(x.toks) == 0 {
	// parses one element at a time (into possibly many tokens)
	if x.ptr < len(x.toks) {
		if x.ptr == len(x.toks) {

// nextTag parses the next element and fills up toks.
// It sets the done flag if/once EOF is reached.
func (x *xmlParser) nextTag() {
// ----------- ENCODER -------------------

type xmlEncDriver struct {
	b [64]byte // scratch

// ----------- DECODER -------------------

type xmlDecDriver struct {
	r    decReader // *bytesDecReader decReader
	ct   valueType // container type. one of unset, array or map.
	bstr [8]byte   // scratch used for string \UXXX parsing
	b    [64]byte  // scratch

	// wsSkipped bool // whitespace skipped

// DecodeNaked will decode into an XMLNode

// XMLName is a value object representing a namespace-aware NAME
type XMLName struct {

// XMLNode represents a "union" of the different types of XML Nodes.
// Only one of fields (Text or *Element) is set.
type XMLNode struct {

// XMLElement is a value object representing a fully-parsed XML element.
type XMLElement struct {
	Attrs map[XMLName]string
	// Children is a list of child nodes, each being a *XMLElement or string
// ----------- HANDLE  -------------------

type XMLHandle struct {
	NS       map[string]string // ns URI to key, for encoding
	Entities map[string]string // entity representation to string, for encoding.

func (h *XMLHandle) newEncDriver(e *Encoder) encDriver {
	return &xmlEncDriver{e: e, w: e.w, h: h}

func (h *XMLHandle) newDecDriver(d *Decoder) decDriver {
	// d := xmlDecDriver{r: r.(*bytesDecReader), h: h}
	hd := xmlDecDriver{d: d, r: d.r, h: h}

func (h *XMLHandle) SetInterfaceExt(rt reflect.Type, tag uint64, ext InterfaceExt) (err error) {
	return h.SetExt(rt, tag, &extWrapper{bytesExtFailer{}, ext})

var _ decDriver = (*xmlDecDriver)(nil)
var _ encDriver = (*xmlEncDriver)(nil)