|  | 
Programming languages have poor reading libraries since the lexical
information that can be specified is directly tied to the structure of
the language. For example, in C it's hard to read a rational number
because there is no type rational.  Programs have been written to
circumvent this problem: Lex [Lesk75], for example, is one of them.
We choose to incorporate in Bigloo a set of new functions to assist in
such parsing. The syntax for regular grammar (also known as regular
analyser) of Bigloo 2.0 (the one described in this document) is not
compatible with former Bigloo versions.
 | 11.1 A new way of reading
 | 
There is only one way in Bigloo to read text, regular reading ,
which is done by the new form: 
| 
The first argument is a regular grammar (also known as regular
analyser) and the second a Scheme port.  This way of reading is almost
the same as the Lex's one. The reader tries to match the longest
input, from the stream pointed to by| read/rp regular-grammar input-port | bigloo procedure |  input-port, with one of
several regular expressions contained inregular-grammar. If
many rules match, the reader takes the first one defined in the
grammar. When the regular rule has been found the corresponding Scheme
expression is evaluated.
 remark:  The traditional
 readScheme function is implemented as:
| (define-inline (read port)
   (read/rp scheme-grammar port))
 |  |  | 11.2 The syntax of the regular grammar
 | 
A regular grammar is built by the means of the form  regular-grammar: 
| 
The| regular-grammar (binding ...) rule ... | bigloo syntax |  bindingandruleare defined by 
the following grammar:
 
 
Here is a description of each construction.| <binding>  ==> (<variable> <re>)
             | <option>
<option>   ==> <variable>
<rule>     ==> <define>
             | (<cre> <s-expression> <s-expression> ...)
             | (else<s-expression> <s-expression> ...)
<define>   ==> (define <s-expression>)
<cre>      ==> <re>
             | (context<symbol> <re>)
             | (when<s-expr> <re>)
             | (bol<re>)
             | (eol<re>)
             | (bof<re>)
             | (eof<re>)
<re>       ==> <variable>
             | <char>
             | <string>
             | (:<re> ...)
             | (or<re> ...)
             | (*<re>)
             | (+<re>)
             | (?<re>)
             | (=<integer> <re>)
             | (>=<integer> <re>)
             | (**<integer> <integer> <re>)
             | (...<integer> <re>)
             | (uncase<re>)
             | (in<cset> ...)
             | (out<cset> ...)
             | (and<cset> <cset>)
             | (but<cset> <cset>)
             | (posix<string>)
<variable> ==> <symbol>
<cset>     ==> <string>
             | <char>
             | (<string>)
             | (<char> <char>) |  
 
  (context <symbol> <re>)
This allows us to protect an expression. A protected
expression matches (or accepts) a word only if the grammar has been set to
the corresponding context. See The Semantics Actions, for more details.
 
(when <s-expr> <re>)
This allows us to protect an expression. A protected
expression matches (or accepts) a word only if the evaluation of 
<s-expr>is#t. For instance,
 
 
| (define *g*
   (let ((armed #f))
      (regular-grammar ()
	 ((when (not armed) (: "#!" (+ (or #\/ alpha))))
	  (set! armed #t)
	  (print "start [" (the-string) "]")
	  (ignore))
	 ((+ (in #\Space #\Tab))
	  (ignore))
	 (else
	  (the-failure)))))
   
(define (main argv)
   (let ((port (open-input-string "#!/bin/sh #!/bin/zsh")))
      (print (read/rp *g* port))))
 | (bol <re>)
Matches <re>at the beginning of line.
 
(eol <re>)
Matches <re>at the end of line.
 
(bof <re>)
Matches <re>at the beginning of file.
 
(eof <re>)
Matches <re>at the end of file.
 
<variable>
This is the name of a variable bound by a <binding> construction. In addition 
to user defined variables, some already exist. These are:
 
 
It is a error to reference a variable that it is not bound by a <binding>.
Defining a variable that already exists is acceptable and causes the former
variable definition to be erased. Here is an example of a grammar that binds
two variables, one called ident and one called number. These
two variables are used within the grammar to match identifiers and numbers.| all    <=> (out #\Newline)
lower  <=> (in ("az"))
upper  <=> (in ("AZ"))
alpha  <=> (or lower upper)
digit  <=> (in ("09"))
xdigit <=> (uncase (in ("af09")))
alnum  <=> (uncase (in ("az09")))
punct  <=> (in ".,;!?")
blank  <=> (in #" \t\n")
space  <=> #\Space
 |  
 
 
| (regular-grammar ((ident  (: alpha (* alnum)))
                  (number (+ digit)))
   (ident  (cons 'ident (the-string)))
   (number (cons 'number (the-string)))
   (else   (cons 'else (the-failure))))
 | <char>
The regular language described by one unique character. Here is an example of
a grammar that accepts either the character #\aor the character#\b:
 
 
| (regular-grammar ()
   (#\a (cons 'a (the-string)))
   (#\b (cons 'b (the-string)))
   (else (cons 'else (the-failure))))
 | <string>
This simple form of regular expression denotes the language represented
by the string. For instance the regular expression "Bigloo"matches
only the string composed of#\B #\i #\g #\l #\o #\o. The regular 
expression".*["matches the string#\. #\* #\[.
 
(: <re> ...)
This form constructs sequence of regular expression. That is a form
<re1> <re2> ... <ren>matches the language construction by
concatenation of the language described by<re1>,<re2>,<ren>. Thus,(: "x" all "y")matches all words of three
letters, started by character the#\xand ended with the character#\y.
 
(or <re> ...)
This construction denotes conditions. The language described by
(or re1 re2)accepts words accepted by eitherre1orre2.(* <re>)
This is the Kleene operator, the language described by (* <re>)is
the language containing, 0 or more occurrences of<re>. Thus, 
the language described by(* "abc")accepts the empty word and
any word composed by a repetition of theabc(abc,abcabc,abcabcabc, ...).
 
(+ <re>)
This expression described non empty repetitions. The form (+ re)is
equivalent to(: re (* re)). Thus,(+ "abc")matches the
wordsabc,abcabc, etc.
 
(? <re>)
This expression described one or zero occurrence. Thus, 
(? "abc")matches the empty word or the wordsabc.
 
(= <integer> <re>)
This expression described a fix number of repetitions. The form
(= num re)is equivalent to(: re re ... re). Thus,
the expression(= 3 "abc")matches the only wordabcabcabc.
In order to avoid code size explosion when compiling,<integer>must be smaller than an arbitrary constant. In the current version that 
value is81.
 
(>= <integer> <re>)
The language described by the expression (>= int re)accepts word
that are, at least,intrepetitions ofre. For instance,(>= 10 #\a), accepts words compound of, at least, 10 times the
character#\a. In order to avoid code size explosion when compiling,<integer>must be smaller than an arbitrary constant. In the current 
version that value is81.(** <integer> <integer> <re>)
The language described by the expression (** min max re)accepts
word that are repetitions ofre; the number of repetition is in
the rangemin,max. For instance,(** 10 20 #\a).
In order to avoid code size explosion when compiling,<integer>must be smaller than an arbitrary constant. In the current 
version that value is81.
 
(... <integer> <re>)
The subexpression <re>has to be a sequence
of characters. Sequences are build by the operator:or by string
literals. The language described by(... int re), denotes, the
first letter ofre, or the two first letters ofre, or the
three first letters ofreor theintfirst letters ofre. Thus,(... 3 "begin")is equivalent to(or "b" "be" "beg").
 
(uncase <re>)
The subexpression <re>has to be a sequence
construction. The language described by(uncase re)is the
same asrewhere letters may be upper case or lower case. For
instance,(uncase "begin"), accepts the words"begin","beGin","BEGIN","BegiN", etc.
 
(in <cset> ...)
Denotes union of characters. Characters may be described individually
such as in (in #\a #\b #\c #\d). They may be described by
strings. The expression(in "abcd")is equivalent to(in
#\a #\b #\c #\d).  Characters may also be described using a range
notation that is a list of two characters. The expression(in (#\a
#\d))is equivalent to(in #\a #\b #\c #\d). The Ranges may be
expresses using lists of string. The expression(in ("ad"))is equivalent to(in #\a #\b #\c #\d).
 
(out <cset> ...)
The language described by (out cset ...)is opposite to
the one described by(in cset ...). For instance,(out ("azAZ") (#\0 #\9))accepts all words of one character
that are neither letters nor digits. One should not that if the character
numbered zero may be used inside regular grammar, theoutconstruction never matches it. Thus to write a rule that, for instances,
matches every character but#\Newlineincluding the character
zero, one should write:
 
 
| (or (out #\Newline) #a000)
 | (and <cset> <cset>)
The language described by (and cset1 cset2)accepts words 
made of characters that are in bothcset1andcset2.
 
(but <cset> <cset>)
The language described by (but cset1 cset2)accepts words 
made of characters ofcset1that are not member ofcset2.(posix <string>)
The expression (posix string)allows one to use Posix string
notation for regular expressions. So, for example, the following two
expressions are equivalent:
 
 
| (posix "[az]+|x*|y{3,5}")
 (or (+ (in ("az"))) (* "x") (** 3 5 "y"))
 |  |  
| 
This form dispatches on strings. it opens an input on| string-case string rule ... | bigloo syntax |  stringa read into it according to the regular grammar defined by thebindingandrule. Example:
 
 
| (define (suffix string)
   (string-case string
      ((: (* all) ".")
       (ignore))
      ((+ (out #\.))
       (the-string))
      (else
       "")))
 |  |  | 11.3 The semantics actions
 | 
The semantics actions are regular Scheme expressions. These expressions
appear in an environment where some ``extra procedures'' are defined.
These procedures are: 
| 
Returns the input port currently in used.| the-port | bigloo rgc procedure |  |  
| 
Get the length of the biggest matching string.| the-length | bigloo rgc procedure |  |  
| 
Get a copy of the last matching string. The function| the-string | bigloo rgc procedure |  the-stringreturns a fresh copy of the matching each time it is called. In consequence,
 
 
| (let ((l1 (the-string)) (l2 (the-string)))
   (eq? l1 l2))
   => #f
 |  |  
| 
Get a copy of a substring of the last matching string. If the| the-substring start len | bigloo rgc procedure |  lenis negative, it is subtracted to the whole match length.
Here is an example of a rule extracting a part of a match:
 
 
Which can also be written:| (regular-grammar ()
   ((: #\" (* (out #\")) #\")
    (the-substring 1 (-fx (the-length) 1))))
 |  
 
 
| (regular-grammar ()
   ((: #\" (* (out #\")) #\")
    (the-substring 1 -1)))
 |  |  
| 
| the-character | bigloo rgc procedure |  
Returns the first character of a match (respectively, the first byte).| the-byte | bigloo rgc procedure |  |  
| 
Returns the| the-byte-ref n | bigloo rgc procedure |  n-th bytes of the matching string. |  
| 
| the-symbol | bigloo rgc procedure |  
| the-downcase-symbol | bigloo rgc procedure |  
| the-upcase-symbol | bigloo rgc procedure |  
Convert the last matching string into a symbol. The function| the-subsymbol start length | bigloo rgc procedure |  the-subsymbolobeys the same rules asthe-substring. |  
| 
| the-keyword | bigloo rgc procedure |  
| the-downcase-keyword | bigloo rgc procedure |  
Convert the last matching string into a keyword.| the-upcase-keyword | bigloo rgc procedure |  |  
| 
The conversion of the last matching string to fixnum.| the-fixnum | bigloo rgc procedure |  |  
| 
The conversion of the last matching string to flonum.| the-flonum | bigloo rgc procedure |  |  
| 
Returns the first char that the grammar can't match or the end of file
object.| the-failure | bigloo rgc procedure |  |  
| 
Ignore the parsing, keep reading. It's better to use| ignore | bigloo rgc procedure |  (ignore)rather than an expression like(read/rp in semantics actions since thegrammarport)(ignore)call will be done in a
tail recursive way. For instance,
| (let ((g (regular-grammar ()
            (")" 
             '())
            ("(" 
             (let* ((car (ignore))
                    (cdr (ignore)))
                (cons car cdr)))
            ((+ (out "()"))
             (the-string))))
      (p (open-input-string "(foo(bar(gee)))")))
   (read/rp g p))
   => ("foo" ("bar" ("gee")))
 |  |  
| 
If no| rgc-context [context] | bigloo rgc procedure |  contextis provide, this procedure reset the reader context
state. That is the reader is in no context. With one argument,contextset the reader in the contextcontext. For instance,
 
 
Note that RGC context are preserved across different uses of| (let ((g (regular-grammar ()
            ((context foo "foo") (print 'foo-bis))
            ("foo" (rgc-context 'foo) (print 'foo) (ignore))
            (else 'done)))
      (p (open-input-string "foofoo")))
   (read/rp g p))
   -| foo
      foo-bis
 |  read/rp. |  
| 
Returns the value of the current Rgc context.| the-context | bigloo rgc procedure |  |  | 11.4 Options and user definitions
 | 
Options act as parameters that are transmitted to the parser on the call
to  read/rp. Local defines are user functions inserted in the produced
parser, at the same level as the pre-defined  ignore function. Here is an example of grammar using both 
| (define gram
   (regular-grammar (x y)
      
      (define (foo s)
	 (cons* 'foo x s (ignore)))
      (define (bar s)
	 (cons* 'bar y s (ignore)))
 ((+ #\a) (foo (the-string)))
      ((+ #\b) (bar (the-string)))
      (else '())))
 |  
This grammar uses two options  x and  y. Hence when invokes it
takes two additional values such as: 
| (with-input-from-string "aabb"
   (lambda ()
      (read/rp gram (current-input-port) 'option-x 'option-y)))
   => (foo option-x aa bar option-y bb)
 |  | 11.5 Examples of regular grammar
 | 
The reader who wants to find a real example should read the code of
Bigloo's reader. But here are small examples
 
The first example presents a grammar that simulates the Unix program  wc.
 
| (let ((*char* 0)
      (*word* 0)
      (*line* 0))
   (regular-grammar ()
      ((+ #\Newline)
       (set! *char* (+ *char* (the-length)))
       (set! *line* (+ *line* (the-length)))
       (ignore))
      ((+ (in #\space #\tab))
       (set! *char* (+ *char* (the-length)))
       (ignore))
      ((+ (out #\newline #\space #\tab))
       (set! *char* (+ *char* (the-length)))
       (set! *word* (+ 1 *word*))
       (ignore))))
 | 
The second example presents a grammar that reads Arabic and Roman number.
 
| (let ((par-open 0))
   (regular-grammar ((arabic (in ("09")))
                     (roman  (uncase (in "ivxlcdm"))))
      ((+ (in #" \t\n"))
       (ignore))
      ((+ arabic)
       (string->integer (the-string)))
      ((+ roman)
       (roman->arabic (the-string)))
      (#\(
       (let ((open-key par-open))
          (set! par-open (+ 1 par-open))
          (context 'pair)
          (let loop-pair ((walk (ignore))) 
             (cond
                ((= open-key par-open)
                 '())
                (else
                 (cons walk (loop-pair (ignore))))))))
      (#\)
       (set! par-open (- par-open 1))
       (if (< par-open 0)
           (begin
              (set! par-open 0)
              (ignore))
           #f))
      ((in "+-*\\")
       (string->symbol (the-string)))
      (else
       (let ((char (the-failure)))
          (if (eof-object? char)
              char
              (error "grammar-roman" "Illegal char" char))))))
 |  |