RE2C(1) RE2C(1)
NAME
re2c - convert regular expressions to C/C++ code
SYNOPSIS
re2c [OPTIONS] FILE
DESCRIPTION
re2c is a lexer generator for C/C++. It finds regular expression speci-
fications inside of C/C++ comments and replaces them with a hard-coded
DFA. The user must supply some interface code in order to control and
customize the generated DFA.
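To give a feel for what a "hard-coded DFA" looks like, here is a small hand-written sketch that recognizes a decimal integer ([0-9]+) with the automaton states encoded directly as labels and gotos. It is not actual re2c output: the function name match_digits and the NUL-terminated input convention are assumptions made for this illustration only.

```c
#include <stddef.h>

/* Hand-written illustration of a DFA for [0-9]+ encoded in control flow.
 * Not actual re2c output: the function name and the NUL-terminated input
 * convention are assumptions for this sketch.
 * Returns the length of the match at the start of `input`,
 * or 0 if the input does not start with a digit. */
size_t match_digits(const char *input)
{
    const char *cursor = input;   /* plays the role of YYCURSOR */
    char yych;                    /* current input symbol */

    /* State 0: initial state, at least one digit is required. */
    yych = *cursor;
    if (yych >= '0' && yych <= '9') goto state1;
    return 0;

state1:
    /* State 1: accepting state, loops on further digits. */
    yych = *++cursor;
    if (yych >= '0' && yych <= '9') goto state1;
    return (size_t)(cursor - input);
}
```

In a real scanner the comparisons and jumps are generated from the regular expressions, and the surrounding pointers come from the interface code described under INTERFACE CODE.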
OPTIONS
-? -h --help
Invoke a short help.
-b --bit-vectors
Implies -s. Use bit vectors as well in the attempt to coax bet-
ter code out of the compiler. Most useful for specifications
with more than a few keywords (e.g. for most programming lan-
guages).
-c --conditions
Enable (f)lex-like condition support.
-d --debug-output
Creates a parser that dumps information about the current posi-
tion and in which state the parser is while parsing the input.
This is useful to debug parser issues and states. If you use
this switch you need to define a macro YYDEBUG that is called
like a function with two parameters: void YYDEBUG (int state,
char current). The first parameter receives the state or -1 and
the second parameter receives the input at the current cursor.
-D --emit-dot
Emit Graphviz dot data. It can then be processed with e.g. dot
-Tpng input.dot > output.png. Please note that scanners with
many states may crash dot.
-e --ecb
Generate a parser that supports EBCDIC. The generated code can
deal with any character up to 0xFF. In this mode re2c assumes
that input character size is 1 byte. This switch is incompatible
with -w, -x, -u and -8.
-f --storable-state
Generate a scanner with support for storable state.
-F --flex-syntax
Partial support for flex syntax. When this flag is active then
named definitions must be surrounded by curly braces and can be
defined without an equal sign and the terminating semicolon.
Instead names are treated as direct double quoted strings.
-g --computed-gotos
Generate a scanner that utilizes GCC's computed goto feature.
That is re2c generates jump tables whenever a decision is of a
certain complexity (e.g. a lot of if conditions are otherwise
necessary). This is only useable with GCC and produces output
that cannot be compiled with any other compiler. Note that this
implies -b and that the complexity threshold can be configured
using the inplace configuration cgoto:threshold.
-i --no-debug-info
Do not output #line information. This is useful when you want to
use a CMS tool with the re2c output, which you might want if you
do not require your users to have re2c themselves when building
from your source.
-o OUTPUT --output=OUTPUT
Specify the OUTPUT file.
-r --reusable
Allows reuse of scanner definitions with /*!use:re2c */ after
/*!rules:re2c */. In this mode no /*!re2c */ block and exactly
one /*!rules:re2c */ must be present. The rules are being saved
and used by every /*!use:re2c */ block that follows. These
blocks can contain inplace configurations, especially
re2c:flags:e, re2c:flags:w, re2c:flags:x, re2c:flags:u and
re2c:flags:8. That way it is possible to create the same scan-
ner multiple times for different character types, different
input mechanisms or different output mechanisms. The
/*!use:re2c */ blocks can also contain additional rules that
will be appended to the set of rules in /*!rules:re2c */.
-s --nested-ifs
Generate nested ifs for some switches. Many compilers need this
assist to generate better code.
-t HEADER --type-header=HEADER
Create a HEADER file that contains types for the (f)lex-like
condition support. This can only be activated when -c is in use.
-u --unicode
Generate a parser that supports UTF-32. The generated code can
deal with any valid Unicode character up to 0x10FFFF. In this
mode re2c assumes that input character size is 4 bytes. This
switch is incompatible with -e, -w, -x and -8. This implies -s.
-v --version
Show version information.
-V --vernum
Show the version as a number XXYYZZ.
-w --wide-chars
Generate a parser that supports UCS-2. The generated code can
deal with any valid Unicode character up to 0xFFFF. In this
mode re2c assumes that input character size is 2 bytes. This
switch is incompatible with -e, -x, -u and -8. This implies -s.
-x --utf-16
Generate a parser that supports UTF-16. The generated code can
deal with any valid Unicode character up to 0x10FFFF. In this
mode re2c assumes that input character size is 2 bytes. This
switch is incompatible with -e, -w, -u and -8. This implies -s.
-8 --utf-8
Generate a parser that supports UTF-8. The generated code can
deal with any valid Unicode character up to 0x10FFFF. In this
mode re2c assumes that input character size is 1 byte. This
switch is incompatible with -e, -w, -x and -u.
--case-insensitive
All strings are case insensitive, so all "-expressions are
treated in the same way '-expressions are.
--case-inverted
Invert the meaning of single and double quoted strings. With
this switch single quotes are case sensitive and double quotes
are case insensitive.
--no-generation-date
Suppress date output in the generated file.
--no-version
Suppress version output in the generated file.
--encoding-policy POLICY
Specify how re2c must treat Unicode surrogates. POLICY can be
one of the following: fail (abort with error when surrogate
encountered), substitute (silently substitute surrogate with
error code point 0xFFFD), ignore (treat surrogates as normal
code points). By default re2c ignores surrogates (for backward
compatibility). Unicode standard says that standalone surrogates
are invalid code points, but different libraries and programs
treat them differently.
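The three policies can be modelled as a small decision function. The sketch below only illustrates the behaviour described above; it is not re2c's implementation, and the enum, the function name and the -1 error return are assumptions introduced here. The surrogate range 0xD800-0xDFFF is the one given under ENCODINGS.

```c
#include <stdint.h>

/* Hypothetical names mirroring the POLICY values fail, substitute, ignore. */
enum surrogate_policy { POLICY_IGNORE, POLICY_SUBSTITUTE, POLICY_FAIL };

/* Returns the code point to use for `cp` under `policy`,
 * or -1 when the policy is "fail" and `cp` is a surrogate. */
int32_t apply_surrogate_policy(uint32_t cp, enum surrogate_policy policy)
{
    int is_surrogate = (cp >= 0xD800 && cp <= 0xDFFF);

    if (!is_surrogate)
        return (int32_t)cp;       /* ordinary code points are never touched */

    switch (policy) {
    case POLICY_FAIL:
        return -1;                /* abort with an error */
    case POLICY_SUBSTITUTE:
        return 0xFFFD;            /* replace with the error code point */
    case POLICY_IGNORE:
    default:
        return (int32_t)cp;       /* treat as a normal code point */
    }
}
```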
--input INPUT
Specify re2c input API. INPUT can be one of the following:
default, custom.
-S --skeleton
Instead of embedding re2c-generated code into C/C++ source, gen-
erate a self-contained program for the same DFA. Most useful for
correctness and performance testing.
--empty-class POLICY
What to do if user inputs empty character class. POLICY can be
one of the following: match-empty (match empty input: pretty
illogical, but this is the default for backwards compatibility
reason), match-none (fail to match on any input), error (compi-
lation error). Note that there are various ways to construct
empty class, e.g: [], [^\x00-\xFF], [\x00-\xFF][\x00-\xFF].
--dfa-minimization <table | moore>
Internal algorithm used by re2c to minimize DFA (defaults to
moore). Both table filling and Moore's algorithms should pro-
duce identical DFA (up to states relabelling). Table filling
algorithm is much simpler and slower; it serves as a reference
implementation.
-1 --single-pass
Deprecated and does nothing (single pass is by default now).
-W Turn on all warnings.
-Werror
Turn warnings into errors. Note that this option alone doesn't
turn on any warnings, it only affects those warnings that have
been turned on so far or will be turned on later.
-W<warning>
Turn on individual warning.
-Wno-<warning>
Turn off individual warning.
-Werror-<warning>
Turn on individual warning and treat it as error (this implies
-W<warning>).
-Wno-error-<warning>
Don't treat this particular warning as error. This doesn't turn
off the warning itself.
-Wcondition-order
Warn if the generated program makes implicit assumptions about
condition numbering. One should use either -t, --type-header
option or /*!types:re2c*/ directive to generate mapping of con-
dition names to numbers and use autogenerated condition names.
-Wempty-character-class
Warn if regular expression contains empty character class. From
the rational point of view trying to match empty character class
makes no sense: it should always fail. However, for backwards
compatibility reasons re2c allows empty character class and
treats it as empty string. Use --empty-class option to change
default behaviour.
-Wmatch-empty-string
Warn if regular expression in a rule is nullable (matches empty
string). If DFA runs in a loop and empty match is unintentional
(input position is not advanced manually), lexer may get stuck
in an infinite loop.
-Wswapped-range
Warn if range lower bound is greater than upper bound. Default
re2c behaviour is to silently swap range bounds.
-Wundefined-control-flow
Warn if some input strings cause undefined control flow in lexer
(the faulty patterns are reported). This is the most dangerous
and common mistake. It can be easily fixed by adding default
rule * (this rule has the lowest priority, matches any code unit
and consumes exactly one code unit).
-Wuseless-escape
Warn if a symbol is escaped when it shouldn't be. By default
re2c silently ignores escape, but this may as well indicate a
typo or an error in escape sequence.
INTERFACE CODE
The user must supply interface code either in the form of C/C++ code
(macros, functions, variables, etc.) or in the form of INPLACE CONFIGU-
RATIONS. Which symbols must be defined and which are optional depends
on a particular use case.
YYCONDTYPE
In -c mode you can use -t to generate a file that contains the
enumeration used as conditions. Each of the values refers to a
condition of a rule set.
YYCTXMARKER
l-value of type YYCTYPE *. The generated code saves trailing
context backtracking information in YYCTXMARKER. The user only
needs to define this macro if a scanner specification uses
trailing context in one or more of its regular expressions.
YYCTYPE
Type used to hold an input symbol (code unit). Usually char or
unsigned char for ASCII, EBCDIC and UTF-8, unsigned short for
UTF-16 or UCS-2 and unsigned int for UTF-32.
YYCURSOR
l-value of type YYCTYPE * that points to the current input sym-
bol. The generated code advances YYCURSOR as symbols are
matched. On entry, YYCURSOR is assumed to point to the first
character of the current token. On exit, YYCURSOR will point to
the first character of the following token.
YYDEBUG (state, current)
This is only needed if the -d flag was specified. It allows one
to easily debug the generated parser by calling a user defined
function for every state. The function should have the following
signature: void YYDEBUG (int state, char current). The first
parameter receives the state or -1 and the second parameter
receives the input at the current cursor.
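A minimal way to satisfy this interface is to forward the macro to an ordinary function. The tracer below is only a sketch: the names yydebug_trace, yydebug_last_state, yydebug_last_char and yydebug_calls are hypothetical, and writing to stderr is just one possible choice.

```c
#include <stdio.h>

/* Last values seen by the tracer; kept only so the sketch is easy to check. */
int yydebug_last_state;
char yydebug_last_char;
unsigned long yydebug_calls;

/* Hypothetical tracer with the signature described in the manual. */
void yydebug_trace(int state, char current)
{
    yydebug_last_state = state;
    yydebug_last_char = current;
    yydebug_calls++;
    /* Print the input symbol as a number so control characters stay readable. */
    fprintf(stderr, "state %d, input 0x%02X\n",
            state, (unsigned)(unsigned char)current);
}

/* The generated code invokes YYDEBUG like a function with two arguments. */
#define YYDEBUG(state, current) yydebug_trace((state), (current))
```

The macro has to be visible before the re2c block in a file processed with -d.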
YYFILL (n)
The generated code "calls" YYFILL (n) when the buffer needs
(re)filling: at least n additional characters should be pro-
vided. YYFILL (n) should adjust YYCURSOR, YYLIMIT, YYMARKER and
YYCTXMARKER as needed. Note that for typical programming lan-
guages n will be the length of the longest keyword plus one. The
user can place a comment of the form /*!max:re2c*/ to insert
YYMAXFILL definition that is set to the maximum length value.
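One common way to implement the refill is to move the unconsumed part of the buffer to its start, adjust all pointers by the same amount, and append fresh data. The sketch below assumes a hypothetical in-memory data source and a fixed 16-byte buffer; struct input, input_init and input_fill are names invented for this example, and the marker is assumed never to point before the start of the current token.

```c
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 16               /* deliberately tiny, for illustration */

struct input {
    const char *src;              /* hypothetical in-memory data source */
    size_t src_len;
    size_t src_pos;
    char buf[BUF_SIZE];
    char *token;                  /* start of the current token */
    char *cursor;                 /* YYCURSOR */
    char *marker;                 /* YYMARKER, assumed to be >= token */
    char *limit;                  /* YYLIMIT */
};

void input_init(struct input *in, const char *src, size_t len)
{
    in->src = src;
    in->src_len = len;
    in->src_pos = 0;
    in->token = in->cursor = in->marker = in->limit = in->buf;
}

/* Returns 1 if at least `need` characters are available at the cursor
 * after refilling, 0 otherwise. A definition along the lines of
 *   #define YYFILL(n) if (!input_fill(in, (n))) return -1
 * could connect it to the generated code. */
int input_fill(struct input *in, size_t need)
{
    size_t shift = (size_t)(in->token - in->buf);
    size_t kept = (size_t)(in->limit - in->token);
    size_t room, avail, n;

    /* Drop everything before the current token, move the rest to the front. */
    memmove(in->buf, in->token, kept);
    in->token -= shift;
    in->cursor -= shift;
    in->marker -= shift;
    in->limit -= shift;

    /* Append as much new data as fits. */
    room = BUF_SIZE - kept;
    avail = in->src_len - in->src_pos;
    n = room < avail ? room : avail;
    memcpy(in->limit, in->src + in->src_pos, n);
    in->src_pos += n;
    in->limit += n;

    return (size_t)(in->limit - in->cursor) >= need;
}
```

The buffer must be large enough to hold the longest token plus the largest n that the generated code can request, which is the value that YYMAXFILL provides.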
YYGETCONDITION ()
This define is used to get the condition prior to entering the
scanner code when using -c switch. The value must be initialized
with a value from the enumeration YYCONDTYPE type.
YYGETSTATE ()
The user only needs to define this macro if the -f flag was
specified. In that case, the generated code "calls" YYGETSTATE
() at the very beginning of the scanner in order to obtain the
saved state. YYGETSTATE () must return a signed integer. The
value must be either -1, indicating that the scanner is entered
for the first time, or a value previously saved by YYSETSTATE
(s). In the second case, the scanner will resume operations
right after where the last YYFILL (n) was called.
YYLIMIT
Expression of type YYCTYPE * that marks the end of the buffer
(YYLIMIT[-1] is the last character in the buffer). The generated
code repeatedly compares YYCURSOR to YYLIMIT to determine when
the buffer needs (re)filling.
YYMARKER
l-value of type YYCTYPE *. The generated code saves backtrack-
ing information in YYMARKER. Some easy scanners might not use
this.
YYMAXFILL
This will be automatically defined by /*!max:re2c*/ blocks as
explained above.
YYSETCONDITION (c)
This define is used to set the condition in transition rules.
This is only being used when -c is active and transition rules
are being used.
YYSETSTATE (s)
The user only needs to define this macro if the -f flag was
specified. In that case, the generated code "calls" YYSETSTATE
just before calling YYFILL (n). The parameter to YYSETSTATE is a
signed integer that uniquely identifies the specific instance of
YYFILL (n) that is about to be called. Should the user wish to
save the state of the scanner and have YYFILL (n) return to the
caller, all they have to do is store that unique identifier in a
variable. Later, when the scanner is called again, it will
call YYGETSTATE () and resume execution right where it left off.
The generated code will contain both YYSETSTATE (s) and YYGET-
STATE even if YYFILL (n) is being disabled.
SYNTAX
Code for re2c consists of a set of RULES, NAMED DEFINITIONS and INPLACE
CONFIGURATIONS.
RULES
Rules consist of a regular expression (see REGULAR EXPRESSIONS) along
with a block of C/C++ code that is to be executed when the associated
regular expression is matched. You can either start the code with an
opening curly brace or the sequence :=. When the code starts with a
curly brace, re2c counts the brace depth and stops looking for code
automatically. Otherwise curly braces are not allowed and re2c stops
looking for code at the first line that does not begin with whitespace. If
two or more rules overlap, the first rule is preferred.
regular-expression { C/C++ code }
regular-expression := C/C++ code
There is one special rule: default rule *
* { C/C++ code }
* := C/C++ code
Note that default rule * differs from [^]: default rule has the lowest
priority, matches any code unit (either valid or invalid) and always
consumes one character; while [^] matches any valid code point (not
code unit) and can consume multiple code units. In fact, when vari-
able-length encoding is used, * is the only possible way to match
invalid input character (see ENCODINGS for details).
If -c is active then each regular expression is preceded by a list of
comma separated condition names. Besides normal naming rules there are
two special cases: <*> (such rules are merged to all conditions) and <>
(such a rule cannot have an associated regular expression; its code
is merged to all actions). Non-empty rules may furthermore specify the
new condition. In that case re2c will generate the necessary code to
change the condition automatically. Rules can use :=> as a shortcut to
automatically generate code that not only sets the new condition state
but also continues execution with the new state. A shortcut rule should
not be used in a loop where there is code between the start of the loop
and the re2c block unless re2c:cond:goto is changed to continue. If
code is necessary before all rules (though not simple jumps) you can
do so by using <!> pseudo-rules.
<condition-list> regular-expression { C/C++ code }
<condition-list> regular-expression := C/C++ code
<condition-list> * { C/C++ code }
<condition-list> * := C/C++ code
<condition-list> regular-expression => condition { C/C++ code }
<condition-list> regular-expression => condition := C/C++ code
<condition-list> * => condition { C/C++ code }
<condition-list> * => condition := C/C++ code
<condition-list> regular-expression :=> condition
<*> regular-expression { C/C++ code }
<*> regular-expression := C/C++ code
<*> * { C/C++ code }
<*> * := C/C++ code
<*> regular-expression => condition { C/C++ code }
<*> regular-expression => condition := C/C++ code
<*> * => condition { C/C++ code }
<*> * => condition := C/C++ code
<*> regular-expression :=> condition
<> { C/C++ code }
<> := C/C++ code
<> => condition { C/C++ code }
<> => condition := C/C++ code
<> :=> condition
<! condition-list> { C/C++ code }
<! condition-list> := C/C++ code
<!> { C/C++ code }
<!> := C/C++ code
NAMED DEFINITIONS
Named definitions are of the form:
name = regular-expression;
If -F is active, then named definitions are also of the form:
name { regular-expression }
INPLACE CONFIGURATIONS
re2c:condprefix = yyc;
Allows one to specify the prefix used for condition labels. That
is this text is prepended to any condition label in the gener-
ated output file.
re2c:condenumprefix = yyc;
Allows one to specify the prefix used for condition values. That
is this text is prepended to any condition enum value in the
generated output file.
re2c:cond:divider = /* *********************************** */ ;
Allows one to customize the divider for condition blocks. You
can use @@ to put the name of the condition or customize the
placeholder using re2c:cond:divider@cond.
re2c:cond:divider@cond = @@;
Specifies the placeholder that will be replaced with the condi-
tion name in re2c:cond:divider.
re2c:cond:goto = goto @@; ;
Allows one to customize the condition goto statements used with
:=> style rules. You can use @@ to put the name of the condition
or customize the placeholder using re2c:cond:goto@cond. You can
also change this to continue;, which would allow you to continue
with the next loop cycle including any code between loop start
and re2c block.
re2c:cond:goto@cond = @@;
Specifies the placeholder that will be replaced with the condition
label in re2c:cond:goto.
re2c:indent:top = 0;
Specifies the minimum indentation level to use. Requires a
numeric value greater than or equal to zero.
re2c:indent:string = \t ;
Specifies the string to use for indentation. Requires a string
that should contain only whitespace unless you need this for
external tools. The easiest way to specify spaces is to enclose
them in single or double quotes. If you do not want any indentation
at all you can simply set this to "".
re2c:yych:conversion = 0;
When this setting is non zero, then re2c automatically generates
conversion code whenever yych gets read. In this case the type
must be defined using re2c:define:YYCTYPE.
re2c:yych:emit = 1;
Generation of yych can be suppressed by setting this to 0.
re2c:yybm:hex = 0;
If set to zero then a decimal table is being used else a hexa-
decimal table will be generated.
re2c:yyfill:enable = 1;
Set this to zero to suppress generation of YYFILL (n). When
using this be sure to verify that the generated scanner does not
read past the end of the input. Allowing this behavior might
introduce severe security issues to your programs.
re2c:yyfill:check = 1;
This can be set to 0 to suppress output of the precondition using
YYCURSOR and YYLIMIT which becomes useful when YYLIMIT + YYMAX-
FILL is always accessible.
re2c:define:YYFILL = YYFILL ;
Substitution for YYFILL. Note that by default re2c generates
argument in braces and semicolon after YYFILL. If you need to
make YYFILL an arbitrary statement rather than a call, set
re2c:define:YYFILL:naked to non-zero and use
re2c:define:YYFILL@len to denote formal parameter inside of
YYFILL body.
re2c:define:YYFILL@len = @@ ;
Any occurrence of this text inside of YYFILL will be replaced
with the actual argument.
re2c:yyfill:parameter = 1;
Controls argument in braces after YYFILL. If zero, argument is
omitted. If non-zero, argument is generated unless
re2c:define:YYFILL:naked is set to non-zero.
re2c:define:YYFILL:naked = 0;
Controls argument in braces and semicolon after YYFILL. If zero,
argument is generated unless re2c:yyfill:parameter is set to
zero and semicolon is generated unconditionally. If non-zero,
both argument and semicolon are omitted.
re2c:startlabel = 0;
If set to a non zero integer then the start label of the next
scanner blocks will be generated even if not used by the scanner
itself. Otherwise the normal yy0 like start label is only being
generated if needed. If set to a text value then a label with
that text will be generated regardless of whether the normal
start label is being used or not. This setting is being reset to
0 after a start label has been generated.
re2c:labelprefix = yy ;
Allows one to change the prefix of numbered labels. The default
is yy and can be set to any string that is a valid label.
re2c:state:abort = 0;
When not zero and switch -f is active then the YYGETSTATE block
will contain a default case that aborts and a -1 case is used
for initialization.
re2c:state:nextlabel = 0;
Used when -f is active to control whether the YYGETSTATE block
is followed by a yyNext: label line. Instead of using yyNext
you can usually also use configuration startlabel to force a
specific start label or default to yy0 as start label. Instead
of using a dedicated label it is often better to separate the
YYGETSTATE code from the actual scanner code by placing a
/*!getstate:re2c*/ comment.
re2c:cgoto:threshold = 9;
When -g is active this value specifies the complexity threshold
that triggers generation of jump tables rather than using nested
if's and decision bitfields. The threshold is compared against a
calculated estimation of if-s needed where every used bitmap
divides the threshold by 2.
re2c:yych:conversion = 0;
When the input uses signed characters and -s or -b switches are
in effect re2c allows one to automatically convert to the
unsigned character type that is then necessary for its internal
single character. When this setting is zero or an empty string
the conversion is disabled. Using a non zero number the conver-
sion is taken from YYCTYPE. If that is given by an inplace con-
figuration that value is being used. Otherwise it will be (YYC-
TYPE) and changes to that configuration are no longer possible.
When this setting is a string the braces must be specified. Now
assuming your input is a char * buffer and you are using above
mentioned switches you can set YYCTYPE to unsigned char and this
setting to either 1 or (unsigned char).
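The reason for the conversion can be seen in plain C: a byte above 0x7F read through a signed character type becomes negative, so range checks and table lookups on it go wrong unless it is first converted to unsigned char. The two helper functions below are hypothetical and exist only to contrast the two cases.

```c
/* Range check on the raw signed value: a signed char never exceeds 127,
 * so this test can never succeed. */
int in_high_range_raw(signed char c)
{
    return c >= 0x80;
}

/* Same check after the kind of conversion that re2c:yych:conversion
 * asks the generated code to perform. */
int in_high_range_converted(signed char c)
{
    unsigned char yych = (unsigned char)c;   /* -23 becomes 0xE9 */
    return yych >= 0x80;
}
```

For the byte 0xE9 stored as the signed value -23, the first function reports no match while the second one matches as intended.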
re2c:define:YYCONDTYPE = YYCONDTYPE ;
Enumeration used for condition support with -c mode.
re2c:define:YYCTXMARKER = YYCTXMARKER ;
Allows one to overwrite the define YYCTXMARKER and thus avoiding
it by setting the value to the actual code needed.
re2c:define:YYCTYPE = YYCTYPE ;
Allows one to overwrite the define YYCTYPE and thus avoiding it
by setting the value to the actual code needed.
re2c:define:YYCURSOR = YYCURSOR ;
Allows one to overwrite the define YYCURSOR and thus avoiding it
by setting the value to the actual code needed.
re2c:define:YYDEBUG = YYDEBUG ;
Allows one to overwrite the define YYDEBUG and thus avoiding it
by setting the value to the actual code needed.
re2c:define:YYGETCONDITION = YYGETCONDITION ;
Substitution for YYGETCONDITION. Note that by default re2c gen-
erates braces after YYGETCONDITION. Set re2c:define:YYGETCONDI-
TION:naked to non-zero to omit braces.
re2c:define:YYGETCONDITION:naked = 0;
Controls braces after YYGETCONDITION. If zero, braces are
generated. If non-zero, braces are omitted.
re2c:define:YYSETCONDITION = YYSETCONDITION ;
Substitution for YYSETCONDITION. Note that by default re2c gen-
erates argument in braces and semicolon after YYSETCONDITION. If
you need to make YYSETCONDITION an arbitrary statement rather
than a call, set re2c:define:YYSETCONDITION:naked to non-zero
and use re2c:define:YYSETCONDITION@cond to denote formal parame-
ter inside of YYSETCONDITION body.
re2c:define:YYSETCONDITION@cond = @@ ;
Any occurrence of this text inside of YYSETCONDITION will be
replaced with the actual argument.
re2c:define:YYSETCONDITION:naked = 0;
Controls argument in braces and semicolon after YYSETCONDITION.
If zero, both argument and semicolon are generated. If non-zero,
both argument and semicolon are omitted.
re2c:define:YYGETSTATE = YYGETSTATE ;
Substitution for YYGETSTATE. Note that by default re2c generates
braces after YYGETSTATE. Set re2c:define:YYGETSTATE:naked to
non-zero to omit braces.
re2c:define:YYGETSTATE:naked = 0;
Controls braces after YYGETSTATE. If zero, braces are generated.
If non-zero, braces are omitted.
re2c:define:YYSETSTATE = YYSETSTATE ;
Substitution for YYSETSTATE. Note that by default re2c generates
argument in braces and semicolon after YYSETSTATE. If you need
to make YYSETSTATE an arbitrary statement rather than a call,
set re2c:define:YYSETSTATE:naked to non-zero and use
re2c:define:YYSETSTATE@state to denote formal parameter inside of
YYSETSTATE body.
re2c:define:YYSETSTATE@state = @@ ;
Any occurrence of this text inside of YYSETSTATE will be
replaced with the actual argument.
re2c:define:YYSETSTATE:naked = 0;
Controls argument in braces and semicolon after YYSETSTATE. If
zero, both argument and semicolon are generated. If non-zero, both
argument and semicolon are omitted.
re2c:define:YYLIMIT = YYLIMIT ;
Allows one to overwrite the define YYLIMIT and thus avoiding it
by setting the value to the actual code needed.
re2c:define:YYMARKER = YYMARKER ;
Allows one to overwrite the define YYMARKER and thus avoiding it
by setting the value to the actual code needed.
re2c:label:yyFillLabel = yyFillLabel ;
Allows one to overwrite the name of the label yyFillLabel.
re2c:label:yyNext = yyNext ;
Allows one to overwrite the name of the label yyNext.
re2c:variable:yyaccept = yyaccept;
Allows one to overwrite the name of the variable yyaccept.
re2c:variable:yybm = yybm ;
Allows one to overwrite the name of the variable yybm.
re2c:variable:yych = yych ;
Allows one to overwrite the name of the variable yych.
re2c:variable:yyctable = yyctable ;
When both -c and -g are active then re2c uses this variable to
generate a static jump table for YYGETCONDITION.
re2c:variable:yystable = yystable ;
Deprecated.
re2c:variable:yytarget = yytarget ;
Allows one to overwrite the name of the variable yytarget.
REGULAR EXPRESSIONS
"foo" literal string "foo". ANSI-C escape sequences can be used.
'foo' literal string "foo" (characters [a-zA-Z] treated case-insensi-
tive). ANSI-C escape sequences can be used.
[xyz] character class; in this case, regular expression matches either
x, y, or z.
[abj-oZ]
character class with a range in it; matches a, b, any letter
from j through o or Z.
[^class]
inverted character class.
r \ s match any r which isn't s. r and s must be regular expressions
which can be expressed as character classes.
r* zero or more occurrences of r.
r+ one or more occurrences of r.
r? optional r.
(r) r; parentheses are used to override precedence.
r s r followed by s (concatenation).
r | s either r or s (alternative).
r / s r but only if it is followed by s. Note that s is not part of
the matched text. This type of regular expression is called
"trailing context". Trailing context can only be the end of a
rule and not part of a named definition.
r{n} matches r exactly n times.
r{n,} matches r at least n times.
r{n,m} matches r at least n times, but not more than m times.
. match any character except newline.
name matches named definition as specified by name only if -F is off.
If -F is active then this behaves like it was enclosed in double
quotes and matches the string "name".
Character classes and string literals may contain octal or hexadecimal
character definitions and the following set of escape sequences: \a,
\b, \f, \n, \r, \t, \v, \\. An octal character is defined by a back-
slash followed by its three octal digits (e.g. \377). Hexadecimal
characters from 0 to 0xFF are defined by backslash, a lower cased x and
two hexadecimal digits (e.g. \x12). Hexadecimal characters from 0x100
to 0xFFFF are defined by backslash, a lower cased \u or an upper cased
\X and four hexadecimal digits (e.g. \u1234). Hexadecimal characters
from 0x10000 to 0xFFFFffff are defined by backslash, an upper cased \U
and eight hexadecimal digits (e.g. \U12345678).
The only portable "any" rule is the default rule *.
SCANNER WITH STORABLE STATES
When the -f flag is specified, re2c generates a scanner that can store
its current state, return to the caller, and later resume operations
exactly where it left off.
The default operation of re2c is a "pull" model, where the scanner asks
for extra input whenever it needs it. However, this mode of operation
assumes that the scanner is the "owner" of the parsing loop, and that may
not always be convenient.
Typically, if there is a preprocessor ahead of the scanner in the
stream, or for that matter any other procedural source of data, the
scanner cannot "ask" for more data unless both scanner and source live
in separate threads.
The -f flag is useful for just this situation: it lets users design
scanners that work in a "push" model, i.e. where data is fed to the
scanner chunk by chunk. When the scanner runs out of data to consume,
it just stores its state, and returns to the caller. When more input
data is fed to the scanner, it resumes operations exactly where it left
off.
Changes needed compared to the "pull" model:
o User has to supply macros YYSETSTATE (state) and YYGETSTATE ().
o The -f option inhibits declaration of yych and yyaccept. So the user
has to declare these. Also the user has to save and restore these.
In the example examples/push_model/push.re these are declared as
fields of the (C++) class of which the scanner is a method, so they
do not need to be saved/restored explicitly. For C they could e.g. be
made macros that select fields from a structure passed in as parame-
ter. Alternatively, they could be declared as local variables, saved
with YYFILL (n) when it decides to return and restored at entry to
the function. Also, it could be more efficient to save the state from
YYFILL (n) because YYSETSTATE (state) is called unconditionally.
YYFILL (n) however does not get state as parameter, so we would have
to store state in a local variable by YYSETSTATE (state).
o Modify YYFILL (n) to return (from the function calling it) if more
input is needed.
o Modify caller to recognise if more input is needed and respond appro-
priately.
o The generated code will contain a switch block that is used to
restore the last state by jumping behind the corresponding YYFILL (n)
call. This code is automatically generated in the epilog of the first
/*!re2c */ block. It is possible to trigger generation of the YYGET-
STATE () block earlier by placing a /*!getstate:re2c*/ comment. This
is especially useful when the scanner code should be wrapped inside a
loop.
Please see examples/push_model/push.re for "push" model scanner. The
generated code can be tweaked using inplace configurations state:abort
and state:nextlabel.
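The control flow of a "push" scanner can be sketched by hand. The example below counts space-separated words in input that arrives chunk by chunk: before each point where more input may be needed it records a state number, returns to the caller when the chunk is exhausted, and on re-entry jumps straight back to that point through a switch. This is not re2c output and not the code from examples/push_model/push.re; all names are hypothetical, and the convention that a NUL byte marks the end of input is an assumption of the sketch.

```c
#include <stddef.h>

enum scan_result { SCAN_NEED_MORE, SCAN_DONE };   /* hypothetical codes */

struct scanner {
    int state;            /* saved state; -1 means "entered for the first time" */
    const char *cursor;
    const char *limit;
    size_t words;
};

void scanner_init(struct scanner *s)
{
    s->state = -1;
    s->cursor = s->limit = NULL;
    s->words = 0;
}

/* The caller pushes the next chunk. Every chunk is consumed completely
 * before the scanner asks for more, so nothing has to be copied. */
void scanner_feed(struct scanner *s, const char *chunk, size_t len)
{
    s->cursor = chunk;
    s->limit = chunk + len;
}

enum scan_result scan_words(struct scanner *s)
{
    /* Plays the role of the YYGETSTATE () dispatch block. */
    switch (s->state) {
    case 0: goto fill0;
    case 1: goto fill1;
    default: break;       /* -1: start from the beginning */
    }

between:
    s->state = 0;         /* like YYSETSTATE (0) before the fill point */
fill0:
    if (s->cursor == s->limit) return SCAN_NEED_MORE;
    if (*s->cursor == '\0') return SCAN_DONE;
    if (*s->cursor++ == ' ') goto between;
    s->words++;           /* first character of a new word */
inword:
    s->state = 1;         /* like YYSETSTATE (1) before the fill point */
fill1:
    if (s->cursor == s->limit) return SCAN_NEED_MORE;
    if (*s->cursor == '\0') return SCAN_DONE;
    if (*s->cursor++ == ' ') goto between;
    goto inword;
}
```

Feeding "hello wo" and then "rld foo" (with its terminating NUL) yields three words: the word split across the two chunks is counted once, because the scanner resumes inside the word rather than restarting.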
SCANNER WITH CONDITION SUPPORT
You can precede regular expressions with a list of condition names when
using the -c switch. In this case re2c generates scanner blocks for
each condition, where each of the generated blocks has its own
precondition. The precondition is given by the interface define
YYGETCONDITION () and must be of type YYCONDTYPE.
There are two special rule types. First, the rules of the condition <*>
are merged to all conditions (note that they have lower priority than
other rules of that condition). And second the empty condition list
allows one to provide a code block that does not have a scanner part.
Meaning it does not allow any regular expression. The condition value
referring to this special block is always the one with the enumeration
value 0. This way the code of this special rule can be used to initial-
ize a scanner. It is in no way necessary to have these rules: but some-
times it is helpful to have a dedicated uninitialized condition state.
Non empty rules allow one to specify the new condition, which makes
them transition rules. Besides generating calls for the define
YYSETCONDITION no other special code is generated.
There is another kind of special rules that allow one to prepend code
to any code block of all rules of a certain set of conditions or to all
code blocks to all rules. This can be helpful when some operation is
common among rules. For instance this can be used to store the length
of the scanned string. These special setup rules start with an exclama-
tion mark followed by either a list of conditions <! condition, ... >
or a star <!*>. When re2c generates the code for a rule whose state
does not have a setup rule and a starred setup rule is present, then
that code will be used as setup code.
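The dispatch that conditions imply can be illustrated with a hand-written equivalent. The sketch below counts the characters of a NUL-terminated string that lie outside C-style comments, using two conditions and transitions between them. It is not generated code: the condition names, the function name and the local macro definitions are assumptions made for the example (the yyc prefix follows the default of re2c:condenumprefix).

```c
#include <stddef.h>

/* With -t, an enumeration like this one would be generated by re2c. */
enum YYCONDTYPE { yycnormal, yyccomment };

/* Counts characters outside of C-style comments. */
size_t count_code_chars(const char *cursor)
{
    enum YYCONDTYPE cond = yycnormal;
    size_t n = 0;

#define YYGETCONDITION()  cond
#define YYSETCONDITION(c) (cond = (c))

    for (;;) {
        /* Each condition has its own block of rules. */
        switch (YYGETCONDITION()) {
        case yycnormal:
            if (cursor[0] == '\0') return n;
            if (cursor[0] == '/' && cursor[1] == '*') {
                cursor += 2;                    /* transition rule */
                YYSETCONDITION(yyccomment);
            } else {
                cursor++;
                n++;
            }
            break;
        case yyccomment:
            if (cursor[0] == '\0') return n;
            if (cursor[0] == '*' && cursor[1] == '/') {
                cursor += 2;                    /* transition rule */
                YYSETCONDITION(yycnormal);
            } else {
                cursor++;                       /* skip comment text */
            }
            break;
        }
    }

#undef YYGETCONDITION
#undef YYSETCONDITION
}
```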
ENCODINGS
re2c supports the following encodings: ASCII (default), EBCDIC (-e),
UCS-2 (-w), UTF-16 (-x), UTF-32 (-u) and UTF-8 (-8). See also inplace
configuration re2c:flags.
The following concepts should be clarified when talking about encodings.
A code point is an abstract number that represents a single symbol of
the encoding. A code unit is the smallest unit of memory used in the
encoded text (it corresponds to one character in the input stream). One
or more code units may be needed to represent a single code point,
depending on the encoding. In a fixed-length encoding, each code point
is represented with the same number of code units. In a variable-length
encoding, different code points can be represented with different
numbers of code units.
ASCII is a fixed-length encoding. Its code space includes 0x100 code
points, from 0 to 0xFF. One code point is represented with
exactly one 1-byte code unit, which has the same value as the
code point. Size of YYCTYPE must be 1 byte.
EBCDIC is a fixed-length encoding. Its code space includes 0x100 code
points, from 0 to 0xFF. One code point is represented with
exactly one 1-byte code unit, which has the same value as the
code point. Size of YYCTYPE must be 1 byte.
UCS-2 is a fixed-length encoding. Its code space includes 0x10000 code
points, from 0 to 0xFFFF. One code point is represented with
exactly one 2-byte code unit, which has the same value as the
code point. Size of YYCTYPE must be 2 bytes.
UTF-16 is a variable-length encoding. Its code space includes all Uni-
code code points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF.
One code point is represented with one or two 2-byte code units.
Size of YYCTYPE must be 2 bytes.
UTF-32 is a fixed-length encoding. Its code space includes all Unicode
code points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF. One
code point is represented with exactly one 4-byte code unit.
Size of YYCTYPE must be 4 bytes.
UTF-8 is a variable-length encoding. Its code space includes all Uni-
code code points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF.
One code point is represented with a sequence of one, two, three
or four 1-byte code units. Size of YYCTYPE must be 1 byte.
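The difference between fixed-length and variable-length encodings can be
made concrete with a small sketch. The function names utf8_code_units()
and utf16_code_units() are hypothetical and not part of re2c; the
boundaries are the standard UTF-8 and UTF-16 ones.

```c
#include <stdint.h>

/* Number of 1-byte code units that UTF-8 needs for code point cp,
 * or 0 if cp is a surrogate or lies above 0x10FFFF. */
int utf8_code_units(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp <= 0x7F)     return 1;
    if (cp <= 0x7FF)    return 2;
    if (cp <= 0xFFFF)   return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

/* Number of 2-byte code units that UTF-16 needs for code point cp,
 * or 0 if cp is a surrogate or lies above 0x10FFFF. */
int utf16_code_units(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp <= 0xFFFF)   return 1;
    if (cp <= 0x10FFFF) return 2;
    return 0;
}
```

For example, the code point 0x20AC takes three code units in UTF-8 but
only one in UTF-16, while 0x1F600 takes four and two respectively.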
In Unicode, values in the range 0xD800 to 0xDFFF (surrogates) are not
valid code points. Any encoded sequence of code units that would map to
a code point in the range 0xD800-0xDFFF is ill-formed. The user can
control how re2c treats such ill-formed sequences with the
--encoding-policy <policy> flag (see OPTIONS for a full explanation).
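As an illustration of such an ill-formed sequence, the sketch below
decodes a three-byte UTF-8 sequence and tests the result against the
surrogate range. The helpers utf8_decode3() and is_surrogate() are
hypothetical names and say nothing about how re2c itself implements the
encoding policy.

```c
#include <stdint.h>

/* Decode a three-byte UTF-8 sequence (lead byte 0xE0-0xEF followed by
 * two continuation bytes 0x80-0xBF). No validation is performed. */
uint32_t utf8_decode3(uint8_t b0, uint8_t b1, uint8_t b2)
{
    return ((uint32_t)(b0 & 0x0F) << 12)
         | ((uint32_t)(b1 & 0x3F) << 6)
         |  (uint32_t)(b2 & 0x3F);
}

/* Nonzero if cp lies in the surrogate range 0xD800-0xDFFF. */
int is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}
```

The bytes 0xED 0xA0 0x80 have the shape of a three-byte sequence but
decode to 0xD800, so the sequence is ill-formed; 0xE2 0x82 0xAC decodes
to 0x20AC and is fine.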
For some encodings, there are code units that never occur in a valid
encoded stream (e.g. the 0xFF byte in UTF-8). If the generated scanner
must check for invalid input, the only reliable way to do so is to use
the default rule *. Note that the full range rule [^] won't catch
invalid code units when a variable-length encoding is used ([^] means
"all valid code points", while the default rule * means "all possible
code units").
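For UTF-8 the set of code units that can never occur is small enough to
list. The following sketch uses a hypothetical helper,
utf8_byte_never_valid(), and relies only on the standard definition of
UTF-8; it does not describe code that re2c generates.

```c
/* Nonzero if byte b cannot appear anywhere in well-formed UTF-8:
 * 0xC0 and 0xC1 would only start overlong two-byte forms, and
 * 0xF5-0xFF would start sequences beyond 0x10FFFF or are unused. */
int utf8_byte_never_valid(unsigned char b)
{
    return b == 0xC0 || b == 0xC1 || b >= 0xF5;
}
```

Such bytes are code units without any code point behind them, which is
why a rule phrased in terms of code points cannot match them.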
GENERIC INPUT API
re2c usually operates on input using pointer-like primitives YYCURSOR,
YYMARKER, YYCTXMARKER and YYLIMIT.
The generic input API (enabled with the --input custom switch) allows
one to customize input operations. In this mode, re2c will express all
operations on input in terms of the following primitives:
+----------------+----------------------------+
|YYPEEK () | get current input charac- |
| | ter |
+----------------+----------------------------+
|YYSKIP () | advance to the next char- |
| | acter |
+----------------+----------------------------+
|YYBACKUP () | backup current input posi- |
| | tion |
+----------------+----------------------------+
|YYBACKUPCTX () | backup current input posi- |
| | tion for trailing context |
+----------------+----------------------------+
|YYRESTORE () | restore current input |
| | position |
+----------------+----------------------------+
|YYRESTORECTX () | restore current input |
| | position for trailing con- |
| | text |
+----------------+----------------------------+
|YYLESSTHAN (n) | check if less than n input |
| | characters are left |
+----------------+----------------------------+
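A minimal sketch of how these primitives might be defined over an
in-memory buffer is shown below. The struct Input type, its fields and
the function match_number() are hypothetical; match_number() is written
by hand to stand in for a generated scanner and only shows where each
primitive would be called. It is not re2c output.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical input state: offsets instead of pointers. */
struct Input {
    const char *buf;
    size_t len;
    size_t cur;   /* current position */
    size_t mar;   /* saved position for YYBACKUP/YYRESTORE */
    size_t ctx;   /* saved position for trailing context */
};

#define YYPEEK()        (in->cur < in->len ? in->buf[in->cur] : '\0')
#define YYSKIP()        (++in->cur)
#define YYBACKUP()      (in->mar = in->cur)
#define YYRESTORE()     (in->cur = in->mar)
#define YYBACKUPCTX()   (in->ctx = in->cur)
#define YYRESTORECTX()  (in->cur = in->ctx)
#define YYLESSTHAN(n)   (in->len - in->cur < (size_t)(n))

static int is_digit(char c) { return c >= '0' && c <= '9'; }

/* Match [0-9]+ ("." [0-9]+)? at the start of text and return the
 * length of the match, or 0 if there is none. */
size_t match_number(const char *text)
{
    struct Input input = { text, strlen(text), 0, 0, 0 };
    struct Input *in = &input;

    if (YYLESSTHAN(1) || !is_digit(YYPEEK())) return 0;
    while (is_digit(YYPEEK())) YYSKIP();

    if (YYPEEK() == '.') {
        YYBACKUP();              /* integer part already matches */
        YYSKIP();
        if (!is_digit(YYPEEK())) {
            YYRESTORE();         /* no fraction: fall back */
            return in->cur;
        }
        while (is_digit(YYPEEK())) YYSKIP();
    }
    return in->cur;
}
```

Because the scanner body touches the input only through the primitives,
the same body could read from a file or a stream once the defines are
changed accordingly.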
A couple of useful links that provide some examples:
1. http://skvadrik.github.io/aleph_null/posts/re2c/2015-01-13-input_model.html
2. http://skvadrik.github.io/aleph_null/posts/re2c/2015-01-15-input_model_custom.html
ATTRIBUTES
See attributes(7) for descriptions of the following attributes:
+---------------+-----------------------+
|ATTRIBUTE TYPE | ATTRIBUTE VALUE |
+---------------+-----------------------+
|Availability | developer/parser/re2c |
+---------------+-----------------------+
|Stability | Uncommitted |
+---------------+-----------------------+
SEE ALSO
You can find more information about re2c on the website:
http://re2c.org. See also: flex(1), lex(1), quex
(http://quex.sourceforge.net).
AUTHORS
Peter Bumbulis peter@csg.uwaterloo.ca
Brian Young bayoung@acm.org
Dan Nuffer nuffer@users.sourceforge.net
Marcus Boerger helly@users.sourceforge.net
Hartmut Kaiser hkaiser@users.sourceforge.net
Emmanuel Mogenet mgix@mgix.com
Ulya Trofimovich skvadrik@gmail.com
VERSION INFORMATION
This manpage describes re2c version 0.16, package date 21 Jan 2016.
NOTES
This software was built from source available at
https://github.com/oracle/solaris-userland. The original community
source was downloaded from
https://github.com/skvadrik/re2c/releases/download/0.16/re2c-0.16.tar.gz
Further information about this software can be found on the open source
community website at http://re2c.org/.
RE2C(1)