YYAST - Manual

by Take Vos (2013)

YYAST is a library which completes Lex and Yacc by adding an Abstract Syntax Tree (AST) generator. The saved AST is designed to be easily parsed by compilers.

You can find the doxygen reference documentation here.

Description

I really like writing parsers using Lex and Yacc, but found that I had to create and AST for each language I was developing. Therefor I created this library which you can use directly inside the lex and yacc files. It was designed so you write as little as code as possible in these files, so it does not distract from the grammar of the language.

The second problem I tried to solve; is that I would like to use high level programming languages for writing compilers. However integrating Lex and Yacc with for example Python is problematic. The parser created with this library will create files which are very easy to write a parser for in most languages.

License

Since I am hoping that YYAST will be used in as many as project as possible I have decided to release the YYAST library under a very permissive and compatible license. I have chosen the BSD 2-Clause License.

Copyright (c) 2011-2013, Take Vos
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

- Redistributions of source code must retain the above copyright notice,
  this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, 
  this list of conditions and the following disclaimer in the documentation 
  and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE 
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR 
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN 
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 
POSSIBILITY OF SUCH DAMAGE.

Features

File format

A file consists of a single Abstract Syntax Tree, described by nested nodes. All nodes are structured in the same way to facilitate easy parsing.

The root node acts like the header of the file, child nodes of the root contain the metadata of the file. The last child node of the root is the AST of the parsed file.

Node

Every node in the file has the same structure. All Nodes are aligned to 64 bit boundary to ease parsing of the file when it is read in memory, some processors disallow reads from unaligned fields. All integer fields are in big endian, big endian is defacto standard for storage and transmission of information. The following table shows the header of every node.

type description
EightCC Node name.
uint64_t Node Size.
uint32_t Line number. The top most line is zero.
uint32_t Column number. The left most column is zero
uint32_t File number. The first file (the main file) is zero.
uint16_t Reserved, must be zero.
uint8_t Reserved, must be zero.
uint8_t Node type.

Name

Every node has a user defined name of 8 characters. Names that have less characters are padded with white spaces. A name which starts with an at '@' character are reserved by the yyast library for use as metadata in the header.

The following table shows the currently reserved named.

nameparenttypedescription
@ya 2013branchRoot node.
@files@ya 2013branchA list of original source files.
@file@filestextFilename of an original source file.
@nullnullA leaf node which implying that the optional child is not in use.

Node Size

The size in bytes of the complete node; header + data. The data must be in multiple of 64 bits.

Position: line, column, file number

The position in the parsed file where the node was found. The first character in the file will have line zero, column zero. The first file (zero) is the main source file. The line and column numbers are 32 bit, this is the same size as LLVM line and column tracking.

Using line control it is possible for generated code to reference the original source code. The filenames of these original sources are located in the header.

If part of the location is unknown or out of bounds, then these fields will be set to 0xffffffff.

Node type

The node type describes how the data of a node is structured.

NameValueDescription
null0A unused positional parameter, as a child node. Its name is '@null'.
leaf1A node without data or children.
branch2A node with children. All children are concatenated into the data of this node.
text3A node with a null terminated UTF-8 text string, with null padding for 64 bit alignment.
positive integer4An unsigned integer in big endian format. The size of the integer is a multiple of 8 bytes.
negative integer5An unsigned integer in big endian format. The size of the integer is a multiple of 8 bytes. The integer has an implicit negative sign.
binary float6A 'big-endian' IEEE 754 floating point number in "binary64" or "binary128" format.
decimal float7A 'big-endian' IEEE 754 floating point number in "decimal64" or "decimal128" format.
list254A temporary list node. This node is never written to file. Its name is '@list'.
count255A temporary line count node. This node is never written to file. Its name is '@count'.

Reserved

These fields are unused for now and must contain zeros for forward compatibility reasons.

API

This is a short introduction to the YYAST API, you can find more detailed information in the doxygen generated reference.

Include file

The function prototypes and macros to be used within lex and yacc are located in the yyast/yyast.h header file. This file should be included in both the lex and yacc grammar files.

For flex:

%{
#include <yyast/yyast.h>
%}
%option noyywrap
%%

For bison:

%{
#include <yyast/yyast.h>
int yylex();
void yyerror(const char *message);
%}
%defines
%error-verbose
%verbose
%%

Main

YYAST includes a main function to create a basic parser application which outputs and AST. You can use this function by adding the following to the epilogue of the yacc grammar file. The third argument of ya_main() is the extension given to the AST file when the name of the AST file is not given on the command line.

%%
void yyerror(const char *message)
{
    ya_error(message);
}

int main(int argc, char *argv[])
{
    return ya_main(argc, argv, "tap");
}

Literals

Literals In Lex

Literals are interpreted by the lexer and passed to parser as a node, through yylval. YYAST includes a set of literal construction functions to parse common types of literals.

functiondescription
ya_literal(name, type, buf, buf_len)A generic literal function for encoding user defined types..
ya_text(name, str, str_len)A text literal, used for strings and identifiers.
ya_positive_integer(name, str, str_len, base)A positive integer literal.
ya_negative_integer(name, str, str_len, base)A negative integer literal.
ya_binary_float(name, type, str, str_len, base, size)A binary float literal.

An example for a set of negative and positive integer literals of different bases.

{D}+       { yylval = ya_positive_integer("int", &yytext[0], yyleng - 0, 10); return T_INT; }
-{D}+      { yylval = ya_negative_integer("int", &yytext[1], yyleng - 1, 10); return T_INT; }
0[bB]{D}+  { yylval = ya_positive_integer("int", &yytext[2], yyleng - 2,  2); return T_INT; }
-0[bB]{D}+ { yylval = ya_negative_integer("int", &yytext[3], yyleng - 3,  2); return T_INT; }
0[oO]{D}+  { yylval = ya_positive_integer("int", &yytext[2], yyleng - 2,  8); return T_INT; }
-0[oO]{D}+ { yylval = ya_negative_integer("int", &yytext[3], yyleng - 3,  8); return T_INT; }
0[dD]{D}+  { yylval = ya_positive_integer("int", &yytext[2], yyleng - 2, 10); return T_INT; }
-0[dD]{D}+ { yylval = ya_negative_integer("int", &yytext[3], yyleng - 3, 10); return T_INT; }
0[xX]{D}+  { yylval = ya_positive_integer("int", &yytext[2], yyleng - 2, 16); return T_INT; }
-0[xX]{D}+ { yylval = ya_negative_integer("int", &yytext[3], yyleng - 3, 16); return T_INT; }

Literals In Yacc

Since literals are already converted to proper nodes in Lex, the literal value can simply be assigned to $$.

expr : T_INT      { $$ = $1; }
     | T_REAL     { $$ = $1; }
     | T_STRING   { $$ = $1; }
     ;

Branches

The YA_BRANCH function creates a new node with the childs given as arguments. The number of childs given on th argument is variable. The position of the branch is set to the same position of a child that appeared most early in the source file.

expr : expr '+' expr            { $$ = YA_BRANCH("+", &$1, &$3); }
     | expr '-' expr            { $$ = YA_BRANCH("-", &$1, &$3); }
     | expr '?' expr ':' expr   { $$ = YA_BRANCH("?:", &$1, &$3, &$5); }
     ;

Optional children.

In some cases a child of the branch is optional, which mirrors optional parts in the grammar of an expression or a statement. In the following example the for loop has an optional else block. You can mark an unused child with YA_NULL.

statement
    : T_FOR expression T_IN expression '{' block_content '}' {
        $$ = YA_BRANCH("for ", &$2, &$4, &$6, YA_NULL);
    }
    | T_FOR expression T_IN expression '{' block_content '}' T_ELSE '{' block_content '}' {
        $$ = YA_BRANCH("for ", &$2, &$4, &$6, &$10);
    }
    ;

Lists of children.

In the previous case we have seen that YA_BRANCH can accept variable children based on the grammar. But there are also cases where the number of children to a branch are variable based on the source file.

We use the YA_LIST function to assemble a temporary node. When an YA_LIST node is given as an argument to YA_LIST or YA_BRANCH then the children of YA_LIST become direct children of the node. For example a list literal is build up by the grammar in small steps. The result of this example is a single "list" branch node, with each expression branch node as a direct child of this node.

list_item       : expr                           { $$ = $1; }
                ;

list_item_list  : list_item                      { $$ = YA_LIST(&$1); }
                | list_item_list ',' list_item   { $$ = YA_LIST(&$1, &$3); }
                ;

list_content    :                                { $$ = YA_EMPTYLIST; }
                | ','                            { $$ = YA_EMPTYLIST; }
                | list_item_list                 { $$ = &1; }
                | list_item_list ','             { $$ = &1; }
                ;

expr            : '[' list_content ']'           { $$ = YA_BRANCH("list", &$2); }
                ;

Leafs

For the times you have a statement that act like a leave but isn't a literal.

statement : T_BREAK ';'     { $$ = YA_LEAF("break"); }
          | T_CONTINUE ';'  { $$ = YA_LEAF("continue"); }
          ;

Top level

The file is created by YA_HEADER, the result is passed to ya_start. It will cause the basic structure of the AST file to be created, including a list of filenames of the source code that is being tracked. YA_HEADER accepts a single node as argument. In the example below we see a source file that can have import statements, class definitions and functions at the top level. These are all nodes which are children of the top level "module" node.

%start module
%%
module_item
    : T_IMPORT fqname ';' {
        $$ = YA_BRANCH("import", &$2, YA_NULL);
    }
    | T_IMPORT fqname T_AS T_NAME ';' {
        $$ = YA_BRANCH("import", &$2, &$4);
    }
    | T_CLASS T_NAME '{' class_content '}' {
        $$ = YA_BRANCH("class", &$2, YA_NULL, &$4);
    }
    | T_CLASS T_NAME '(' sub_classes ')' '{' class_content '}' {
        $$ = YA_BRANCH("class", &$2, &$4, &$7);
    }
    | T_NAME '(' def_arguments ')' '{' block_content '}' {
        $$ = YA_BRANCH("function", &$1, &$3, &$6);
    }
    ;

module_list
    : module_item                 { $$ = YA_LIST(&$1); }
    | module_list module_item     { $$ = YA_LIST(&$1, &$2); }
    ;

module_content
    :                             { $$ = YA_EMPTYLIST; }
    | module_list                 { $$ = YA_LIST(&$1); }
    ;

module
    : module_content {
        $$ = YA_BRANCH("module", &$1);
        ya_start = YA_HEADER(&$$);
    }
    ;