#-*-mode: org;-*-
* Purpose
"Oh, Ducks!" is an extension to cl-unification that makes parsing
structured documents easy, using CSS selectors.
* Installation
** Prerequisites
  + cl-unification
  + cl-ppcre
  + split-sequence
  + alexandria
  + asdf-system-connections
  * closure-html
  * cxml
  * named-readtables
  [+] Mandatory [*] Optional
** Loading
Loading "Oh, Ducks!" is just like loading any other ASDF system.
However, because it does not mandate a particular HTML or XML parser,
it does not generally become useful until you have also loaded an
HTML/XML parsing library such as cxml or closure-html.

Start with:
: (asdf:oos 'asdf:load-op :oh-ducks)
If you would like to use the built-in support for parsing via
closure-html (which you almost certainly do), you'll also want to load
closure-html:
: (asdf:oos 'asdf:load-op :closure-html)
And, if you want to use DOM objects provided by cxml:
: (asdf:oos 'asdf:load-op :cxml)
** Load-order Caveats
closure-html and cl-unification each define a reader on #t, and the
two conflict. To avoid load-order issues resulting in an indeterminate
reader on #t, you'll probably want to add
: #.(set-dispatch-macro-character #\# #\T 'unify::|sharp-T-reader|)
or
: (unify:enable-template-reader)
or
: (named-readtables:in-readtable unify:template-readtable)
to the top of any file which uses cl-unification's reader templates.
(The latter two currently only work if you have cl-unification from my
darcs repo.)

Please feel free to submit patches to closure-html and cl-unification
to fix this problem.
** Depending Upon in ASDF Systems
Before long, the easiest way to manage your dependencies upon ASDF
systems is to create an ASDF system for whatever project you're
currently engaged in.
Note that, in addition to depending upon oh-ducks, you'll also want to
depend upon whichever library provides your desired object model and
parser.

For example,
: :depends-on (:oh-ducks :closure-html :cxml)
** Differentiating between LHTML lists and XMLS lists
While it would, in theory, be possible to inspect lists and determine if they
are LHTML or XMLS lists, this is not currently done. You can, however, choose
which type you'd like to work with by pushing =:lists-are-xmls= or
=:lists-are-lhtml= to =*features*= before loading "Oh, Ducks!".

Unfortunately, this means you can only expect to use one list type in a single
lisp image. Patches to either automagically detect the list type, or to provide
layered functions are welcome.
* Usage
The combination of oh-ducks and closure-html provides an HTML template
for use with cl-unification, and has the following syntax:

   (match (#t(html [(:model <model>)]
              <selectors>+)
           <document>)
     &body)
   selectors := (<selector> . <binding>) |
                (<selector> . <template>) |
                (<selector> <selectors>+)
   document := <parsed-document> | <document-to-be-parsed>

:model is only necessary for unparsed documents (e.g., a pathname or string).

** Examples

   (match (#T(html (:model lhtml)
              ("#id" . ?div))
           "<div id=\"id\">I <i>like</i> cheese.</div>")
     (car div)) =>
   (:div ((:id "id")) "I " (:i () "like") " cheese.")

   (match (#T(html (:model dom)
              ("i" . #t(list ?j ?i))
              ("span>i" . ?span))
           "<div>I do <i>not</i> like cheese.</div><div><span>I like <i>cheese</i>.</span></div>")
     (values i span)) =>
   #<ELEMENT i "not">,
   (#<ELEMENT i "cheese">)

** Selectors
The goal is to support all CSS-level-3 selectors. See the section
[[*improve selector support][To Do > Improve Selector Support]] for a list of currently unsupported
simple selectors and combinators.

Each selector should match the same elements that the identical CSS
selector would affect. That is,
   #id      => elements with id of "id"
   .foo.bar => elements with both "foo" and "bar" classes
   div      => all <div>s
and so forth.

NOTE: selectors are currently bound in parallel. That is, given
   #t(html (<selector-1> ...)
           (<selector-2> ...))
selector-1 and selector-2 do not interact. If they are both "foo", they'll
return identical results. I often find myself wanting to also say something
like:
   #t(html (<selector-1> ...)
           (<element-after-selector-1> ...))
Ideas for a syntax to distinguish between the two cases are welcome: (:mode
parallel) vs. (:mode sequential), perhaps? (Or even adjacent, sibling?)

*** Limitations

Currently, selector terms are limited to alphanumeric characters, and
do not support CSS-style character escapes. Patches welcome!

** Included Object Models
*** LHTML (closure-html)
A list-based structure provided by closure-html. Cannot be used with
selectors which require asking about parent or sibling objects.
*** PT (closure-html)
A structure-based tree provided by closure-html.
*** DOM (cxml)
DOM objects as provided by cxml and defined by the W3C.
* Extending
** Adding an object model
While the supported models should generally be sufficient, you can add
your own fairly easily. All models are expected to implement the
generic functions in <traversal/interface.lisp>. See the other files
under the traversal/ directory for examples.

You might also want to see chtml.lisp and cxml.lisp.
** Adding a selector or combinator
See <selectors.lisp>. Generally, you should add a class which is a
subclass of combinator or simple-selector, augment parse-selector with
an appropriate regular expression, and define a method on
subject-p.

I also recommend submitting a patch.
Other people might want to use that selector, too!
* To Do
** working lhtml/xmls support [2/2]
   * [X] non-descendant cases (class, id, etc.)
   * [X] selectors involving descendants
     CAUTION: Won't produce sane results if the document tree is
     modified or you use nested (match)es.
** write documentation
** improve selector support
*** positional selectors [11/11]
    * [X] :nth-child
    * [X] :nth-last-child
    * [X] :first-child
    * [X] :last-child
    * [X] :nth-of-type
    * [X] :nth-last-of-type
    * [X] :first-of-type
    * [X] :last-of-type
    * [X] :only-child
    * [X] :only-of-type
    * [X] :empty
*** attribute selectors [2/7]
    * [X] attribute-present  [att]
    * [X] attribute-equal    [att=val]
    * [ ] attribute-member   [att~=val]
    * [ ] attribute-lang     [att|=val]
    * [ ] attribute-begins   [att^=val]
    * [ ] attribute-ends     [att$=val]
    * [ ] attribute-contains [att*=val]
*** :not(...)
*** any others?
** namespace support(?)
** Submit patch to cl-unification to add (enable/disable-template-reader) functions
Submitted. Was it ever accepted? Man, I don't remember.
** Submit patch to closure-html to add (enable/disable-reader) functions
** non-css templates (e.g., for matching on text of element)?
Maybe special-case string/regexp-templates, so for example
: #t(html ("div" (#t(regexp "f(o+)bar") . ?div)))
would match [<div>foooobar</div>]?

: #t(html ("div" . #t(regexp "f(o+)bar" (?o))))
might cause some difficulty, however: we should get a list of matched elements
for the div selector, but the regexp variable (?o) can only match once (without
some wacky environment merging, anyway).
** Element structure templates
For instance, sometimes it'd be nice to stuff the value of an attribute into a
variable, like so:
: (match (#t(attr ("href" ?href) ("name" ?name)) "<a href='url' name='link'></a>")
:   (values href name)) =>
: "url", "link"
While it's certainly easy enough to do that using, say, XMLS-style lists, a
general object-model-agnostic method would seem to be preferable.
** Layered functions so LHTML vs. XMLS support can be switched at runtime
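Until something like this exists, the list type is fixed when "Oh, Ducks!" is
loaded, as described under Installation. A minimal sketch of that load-time
selection, using only the feature names given there (this illustrates the
current workaround, not the proposed layered-function design):
: ;; Choose LHTML lists; push :lists-are-xmls instead for XMLS lists.
: ;; This must happen before oh-ducks is loaded.
: (pushnew :lists-are-lhtml *features*)
: (asdf:oos 'asdf:load-op :oh-ducks)
As noted under Installation, this allows only one list type per lisp image.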