summaryrefslogtreecommitdiff
path: root/README
blob: 6db1b06e707241454697c9b25b621edc704927a0 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
README
======

Much of this content I learned from
http://www.ibm.com/developerworks/xml/library/x-matters17.html, and
Dave Mertz is also the one who introduced me to PYX. The authoritative
source, and inventor of this format, is Sean McGrath. His article is
here: http://www.xml.com/pub/a/2000/03/15/feature/index.html. You may
also want to browse a copy of "McGrath - XML Processing with Python,
Prentice Hall, 2000".

The idea behind PYX is to have xml files easily digestible by common
unix textutils, like grep, sed, awk, diff etc. The format itself owes
ideas to SGML's ESIS standard.

This is how the format works: each line has a prefix character, that
determines the content-type of the line. Start and end tags, and
individual attributes each get their own line, to facilitate easy
line-based processing. The prefix characters are:

 (  start-tag
 )  end-tag
 A  attribute
 -  character data (content)
 ?  processing instruction

So this xml snippet

<?xml version="1.0" encoding="utf-8"?>
<office:document-content xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0" xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0" xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0" xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0" xmlns:number="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0" xmlns:presentation="urn:oasis:names:tc:opendocument:xmlns:presentation:1.0" xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0" xmlns:chart="urn:oasis:names:tc:opendocument:xmlns:chart:1.0" xmlns:dr3d="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0" xmlns:math="http://www.w3.org/1998/Math/MathML" xmlns:form="urn:oasis:names:tc:opendocument:xmlns:form:1.0" xmlns:script="urn:oasis:names:tc:opendocument:xmlns:script:1.0" xmlns:ooo="http://openoffice.org/2004/office" xmlns:ooow="http://openoffice.org/2004/writer" xmlns:oooc="http://openoffice.org/2004/calc" xmlns:dom="http://www.w3.org/2001/xml-events" xmlns:xforms="http://www.w3.org/2002/xforms" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:smil="urn:oasis:names:tc:opendocument:xmlns:smil-compatible:1.0" xmlns:anim="urn:oasis:names:tc:opendocument:xmlns:animation:1.0" xmlns:rpt="http://openoffice.org/2005/report" xmlns:of="urn:oasis:names:tc:opendocument:xmlns:of:1.2" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:grddl="http://www.w3.org/2003/g/data-view#" xmlns:officeooo="http://openoffice.org/2009/office" xmlns:formx="urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:form:1.0" xmlns:field="urn:openoffice:names:experimental:ooo-ms-interop:xmlns:field:1.0" xmlns:tableooo="http://openoffice.org/2009/table" xmlns:css3t="http://www.w3.org/TR/css3-text/" office:version="1.2" grddl:transformation="http://docs.oasis-open.org/office/1.2/xslt/odf2rdf.xsl">
 <office:body>
  <office:presentation>
   <draw:page>
     blah
   </draw:page>
  </office:presentation>
 </office:body>
</office:document-content>

looks like this in PYX:

({urn:oasis:names:tc:opendocument:xmlns:office:1.0}document-content
A{urn:oasis:names:tc:opendocument:xmlns:office:1.0}version 1.2
A{http://www.w3.org/2003/g/data-view#}transformation http://docs.oasis-open.org/office/1.2/xslt/odf2rdf.xsl
-\n
- 
({urn:oasis:names:tc:opendocument:xmlns:office:1.0}body
-\n
-  
({urn:oasis:names:tc:opendocument:xmlns:office:1.0}presentation
-\n
-   
({urn:oasis:names:tc:opendocument:xmlns:drawing:1.0}page
-\n
-     blah
-\n
-   
){urn:oasis:names:tc:opendocument:xmlns:drawing:1.0}page
-\n
-  
){urn:oasis:names:tc:opendocument:xmlns:office:1.0}presentation
-\n
- 
){urn:oasis:names:tc:opendocument:xmlns:office:1.0}body
-\n
){urn:oasis:names:tc:opendocument:xmlns:office:1.0}document-content

Converting that back to xml via "pyx2xml -ns" yields this:

<?xml version="1.0" encoding="UTF-8"?>
<ns1:document-content xmlns:ns3="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0" xmlns:ns1="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:ns2="http://www.w3.org/2003/g/data-view#" ns1:version="1.2" ns2:transformation="http://docs.oasis-open.org/office/1.2/xslt/odf2rdf.xsl">
 <ns1:body>
  <ns1:presentation>
   <ns3:page>
     blah
   </ns3:page>
  </ns1:presentation>
 </ns1:body>
</ns1:document-content>

You notice that the conversion is not lossless in terms of octets
produced, but should reasonably faithfully conserve the xml info set.

One noticeable gap is the fact that conforming sax parsers will not
report the xml prologue, so this is not converted to pyx, instead
regenerated from scratch inside pyx2xml.py and thus loosing e.g. extra
standalone attributes.


INVOKING
========

xml2pyx.py <file.xml>: will convert xml file to pyx, on stdout

pyx2xml.py [-ns]: will read pyx data from stdin, and output xml to
                  stdout. Without the -ns flag, will operate as a pure
                  filter and output lines as they come in. *With* the
                  -ns flag, will store entire xml file in memory, to
                  later add all used namespaces to the first element.

MISC
====

Feedback to <thb at openoffice dot org>

Have fun hacking,

Thorsten