Programming Languages
Compilers
Andrés Jaque P.
Universidad Javeriana
March 20, 2014
1 Basics
2 Lexical analysis
3 Parser - Syntactic analysis
Error handling
Top-Down Parsing
Recursive descent parsing
Left-recursive grammars
Predictive parsing
Left factoring
LL
Bottom-Up Parsing
LR
4 Semantic analysis
Basics
Purpose of compiler
Detect non-valid programs
Translate the valid ones
Any possible errors
Error kind    Example                     Detected by
Lexical       . . . $ . . .               Lexer
Syntax        . . . x ∗ % . . .           Parser
Semantic      . . . int x; y = x(3); . . .  Type checker
Correctness   whatever algorithm          Tester / User
Lexer
Input: string of characters
Output: string of tokens
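A minimal sketch of that input/output contract, assuming a tiny expression language. The token classes (INT, ID, OP) and the function name tokenize are illustrative choices, not taken from the slides, and the sketch uses Python's re module rather than lex or flex.

```python
import re

# Hypothetical token classes for a tiny expression language.
TOKEN_SPEC = [
    ("INT", r"\d+"),
    ("ID", r"[A-Za-z_]\w*"),
    ("OP", r"[+*()=;]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Turn a string of characters into a list of (kind, lexeme) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            # A character that starts no token is a lexical error.
            raise SyntaxError(f"lexical error at position {pos}: {text[pos]!r}")
        if match.lastgroup != "SKIP":  # whitespace produces no token
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

A stray character such as $ matches no token class, so it is reported here, before the parser ever runs.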
Additional resources
lex, flex
Parser types
Universal
Cocke-Younger-Kasami algorithm
Earley algorithm
In practice they are too slow
Top-down
Bottom-up
Errors
An error handler should
Report errors accurately and clearly
Recover from an error quickly
Not slow down compilation of valid code
Error-handling schemes
Panic mode
Error productions
Automatic local or global correction
Panic mode
When an error is detected, discard tokens until one with a clear role is found, and continue from there
That is, look for synchronizing tokens
Example
Consider the following erroneous expression:
(1 + +2) + 3
Panic-mode recovery:
Skip ahead to the next integer and then continue
Bison: use the special terminal error to describe how much input to skip
E → int | E + E | (E) | error int | (error)
Error productions
Specify known common mistakes in the grammar
Example:
Consider the erroneous expression 3 x, written instead of 3 ∗ x
Add the production E → . . . | E E to the grammar
Disadvantage:
Complicates the grammar
Error corrections
Find a correct “nearby” program
Try token insertions and deletions
Exhaustive search
Disadvantages:
Hard to implement
Slows down parsing of correct programs
“nearby” is not necessarily “the intended” program
Motivation
Many languages are not regular, for example:
Balanced parentheses
Nested decision structures
Parser
Input: string of tokens
Output: parse tree
Abstract syntax trees (AST)
Like parse trees but ignore some details
A parse tree
Traces the operation of the parser
Captures the nesting structure
But has too much information
Parentheses
Single-successor nodes
Example
E → int | E + E | (E)
[Figure: parse tree and abstract syntax tree (data structure representation)]
Recursive descent parsing
The parse tree is constructed
From the top
From left to right
Terminals are seen in order of appearance in the token stream
Example
Assume the grammar
E → T | T + E
T → int | int ∗ T | (E)
Start with the top-level nonterminal E
Try the rules for E in order
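A minimal sketch of this strategy with full backtracking, assuming tokens are the strings 'int', '+', '*', '(' and ')'. Each nonterminal is a generator that tries its rules in order and yields every position at which it can end. This representation is an illustrative choice, not the slides' implementation.

```python
def parse(tokens):
    """Return True if tokens derive from E, trying productions in order."""

    def term(expected, pos):
        # Match one terminal; yield the next position on success.
        if pos < len(tokens) and tokens[pos] == expected:
            yield pos + 1

    def E(pos):
        # E -> T
        yield from T(pos)
        # E -> T + E
        for after_t in T(pos):
            for after_plus in term("+", after_t):
                yield from E(after_plus)

    def T(pos):
        # T -> int
        yield from term("int", pos)
        # T -> int * T
        for after_int in term("int", pos):
            for after_star in term("*", after_int):
                yield from T(after_star)
        # T -> ( E )
        for after_open in term("(", pos):
            for after_e in E(after_open):
                yield from term(")", after_e)

    # Accept if some sequence of choices consumes the whole input.
    return any(end == len(tokens) for end in E(0))
```

On int * int + int the first choices (E → T, T → int) stop after one token, so the parser backs up and tries later rules until the whole input is consumed.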
Left-recursive grammars
A grammar is left-recursive if it has a nonterminal A such that there is a derivation A ⇒+ Aa for some string a.
Consider a production S → Sa
A recursive descent parser does not work on such grammars
Expanding S calls S again without consuming any input, so the parser goes into an infinite loop
Replacing left-recursive productions
Consider the productions S → Sa | b (immediate left recursion)
They can be replaced by non-left-recursive productions:
S → bS′
S′ → aS′ | λ
General case: the productions
A → A a_1 | A a_2 | . . . | A a_m | b_1 | b_2 | . . . | b_n
where no b_i begins with A, can be replaced by
A → b_1 A′ | b_2 A′ | . . . | b_n A′
A′ → a_1 A′ | a_2 A′ | . . . | a_m A′ | λ
Replacing left-recursive productions
S → Aa | b
A → Ac | Sd | λ
The grammar above is also left-recursive, although not immediately: S ⇒ Aa ⇒ Sda
The following algorithm systematically eliminates left recursion from a grammar that has no cycles or λ-productions

arrange the nonterminals in some order A_1, A_2, . . . , A_n
for each i from 1 to n do
    for each j from 1 to i − 1 do
        replace each production of the form A_i → A_j γ by the productions A_i → δ_1 γ | δ_2 γ | . . . | δ_k γ, where A_j → δ_1 | δ_2 | . . . | δ_k are all current A_j-productions
    end
    eliminate the immediate left recursion among the A_i-productions
end
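The algorithm can be sketched as follows. The grammar representation is an assumption made for illustration: a dict from nonterminal to a list of productions, each a tuple of symbols, with the empty tuple standing for λ. New nonterminals are named by appending a prime, which assumes that name is not already in use.

```python
def eliminate_immediate(nt, productions):
    """Rewrite A -> A a | b as A -> b A' and A' -> a A' | λ."""
    recursive = [p[1:] for p in productions if p and p[0] == nt]
    others = [p for p in productions if not p or p[0] != nt]
    if not recursive:
        return {nt: productions}
    new_nt = nt + "'"
    return {
        nt: [b + (new_nt,) for b in others],
        new_nt: [a + (new_nt,) for a in recursive] + [()],
    }

def eliminate_left_recursion(grammar, order):
    """Apply the ordering algorithm; order lists the nonterminals A_1..A_n."""
    grammar = {nt: list(prods) for nt, prods in grammar.items()}
    result = {}
    for i, a_i in enumerate(order):
        for a_j in order[:i]:
            expanded = []
            for prod in grammar[a_i]:
                if prod and prod[0] == a_j:
                    # Replace A_i -> A_j γ by A_i -> δ γ for each current A_j -> δ.
                    expanded.extend(delta + prod[1:] for delta in grammar[a_j])
                else:
                    expanded.append(prod)
            grammar[a_i] = expanded
        split = eliminate_immediate(a_i, grammar[a_i])
        grammar[a_i] = split[a_i]
        result.update(split)
    return result

# The grammar S -> Aa | b, A -> Ac | Sd | λ with the order S, A.
example = {"S": [("A", "a"), ("b",)], "A": [("A", "c"), ("S", "d"), ()]}
converted = eliminate_left_recursion(example, ["S", "A"])
```

For this grammar the result is S → Aa | b, A → bdA′ | A′, A′ → cA′ | adA′ | λ. The grammar has a λ-production, so the guarantee stated for the algorithm does not strictly cover it, but the substitution happens to be harmless here.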
Predictive parsing
Like recursive descent, but the parser can predict which production to use
By looking at the next few tokens
No backtracking
Predictive parsers accept LL(k) grammars
Left factoring
A → aβ_1 | aβ_2
becomes
A → aA′
A′ → β_1 | β_2
Example
E → T + E | T
T → int | int ∗ T | (E)
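Left factoring the example gives E → T E′, E′ → + E | λ, T → int T′ | (E), T′ → ∗ T | λ, and one token of lookahead is then enough to pick a production. A minimal sketch of the resulting predictive parser follows. The function name, the '$' end marker and the token strings are illustrative assumptions.

```python
def predictive_parse(tokens):
    """Recognize the left-factored grammar with one token of lookahead:
        E  -> T E'            E' -> + E | λ
        T  -> int T' | ( E )  T' -> * T | λ
    Returns True on success and raises SyntaxError otherwise.
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else "$"

    def expect(symbol):
        nonlocal pos
        if peek() != symbol:
            raise SyntaxError(f"expected {symbol!r} but found {peek()!r} at token {pos}")
        pos += 1

    def E():
        T()
        if peek() == "+":      # E' -> + E
            expect("+")
            E()
        # otherwise E' -> λ

    def T():
        if peek() == "int":    # T -> int T'
            expect("int")
            if peek() == "*":  # T' -> * T
                expect("*")
                T()
            # otherwise T' -> λ
        elif peek() == "(":    # T -> ( E )
            expect("(")
            E()
            expect(")")
        else:
            raise SyntaxError(f"unexpected {peek()!r} at token {pos}")

    E()
    expect("$")  # all input must be consumed
    return True
```

Every decision is made from the next token alone, so the parser never revisits a choice.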
Bottom-Up Parsing
Is more general than (deterministic) top-down parsing
Is the preferred method
Doesn’t need left-factored grammars
Bottom-up parsing reduces a string to the start symbol by inverting productions
Given the grammar
E → T + E | T
T → int ∗ T | int | (E)
Consider the string int ∗ int + int
Bottom-Up Parsing
A bottom-up parser traces a rightmost derivation in reverse
Let αβω be a step of a bottom-up parse
Assume the next reduction is by X → β
Then ω is a string of terminals
Because αXω ⇒ αβω is a step in a rightmost derivation
Shift-Reduce parsing
Idea: split the string into two substrings
The right substring has not yet been examined by the parser
The left substring has terminal and nonterminal symbols
The dividing point is marked with |
Bottom-Up Parsing
Bottom-up parsing uses only two kinds of actions:
Shift: move | one place to the right
Shifts a terminal onto the left string: ABC|xyz ⇒ ABCx|yz
Reduce: apply an inverse production at the right end of the left string
If A → xy is a production, then Cbxy|ijk ⇒ CbA|ijk
Shift-reduce parsing
The left string can be implemented by a stack
Shift pushes a terminal on the stack
Reduce
pops 0 or more symbols off the stack (the production's r-h-s)
pushes a nonterminal on the stack (the production's l-h-s)
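A minimal sketch of the stack mechanics for the grammar E → T + E | T, T → int ∗ T | int | (E). The shift or reduce decisions below are hand-written for this one grammar using the next token. They stand in for the table a parser generator would build and are an assumption of this sketch.

```python
def shift_reduce(tokens):
    """Return the list of shift and reduce moves that parses tokens."""
    stack, rest, trace = [], list(tokens), []

    def reduce(length, lhs, rule):
        # Pop the production's r-h-s and push its l-h-s.
        del stack[len(stack) - length:]
        stack.append(lhs)
        trace.append(f"reduce {rule}")

    while True:
        look = rest[0] if rest else "$"
        if stack == ["E"] and look == "$":
            return trace
        if stack[-3:] == ["int", "*", "T"]:
            reduce(3, "T", "T -> int * T")
        elif stack[-3:] == ["(", "E", ")"]:
            reduce(3, "T", "T -> ( E )")
        elif stack[-3:] == ["T", "+", "E"] and look in (")", "$"):
            reduce(3, "E", "E -> T + E")
        elif stack[-1:] == ["int"] and look != "*":
            reduce(1, "T", "T -> int")
        elif stack[-1:] == ["T"] and look in (")", "$"):
            reduce(1, "E", "E -> T")
        elif rest:
            # Shift: push the next terminal on the stack.
            stack.append(rest.pop(0))
            trace.append(f"shift {stack[-1]}")
        else:
            raise SyntaxError(f"cannot shift or reduce with stack {stack}")

moves = shift_reduce("int * int + int".split())
```

For int ∗ int + int the moves are: shift int, shift ∗, shift int, reduce T → int, reduce T → int ∗ T, shift +, shift int, reduce T → int, reduce E → T, reduce E → T + E. Read backwards, the reductions form a rightmost derivation.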
How do we decide when to shift or reduce?
In a given state, more than one action (shift or reduce) may lead to a valid parse
If it is legal to shift or reduce, there is a shift-reduce conflict
If it is legal to reduce by two different productions, there is a reduce-reduce conflict
Given the grammar
E → T + E | T
T → int ∗ T | int | (E)
Consider the step int | ∗ int + int
We could reduce by T → int, giving T | ∗ int + int
Mistake: there is then no way to reduce to the start symbol E
Handles
We want to reduce only if the result can still be reduced to the start symbol
Assume a rightmost derivation
S ⇒∗ αXω ⇒ αβω
Then αβ is a handle of αβω
A handle is a reduction that also allows further reductions back to the start symbol
We only want to reduce at handles
Handles
Informal induction on the number of reduce moves:
Initially the stack is empty
Immediately after reducing a handle
The rightmost nonterminal is on top of the stack
The next handle must be to the right of the rightmost nonterminal, because this is a rightmost derivation
A sequence of shift moves reaches the next handle
Bottom-Up Parsing
In shift-reduce parsing, handles appear only at the top of the stack, never inside
Handles are never to the left of the rightmost nonterminal
Therefore, shift-reduce moves are sufficient; the | never needs to move left
Bottom-up parsing algorithms are based on recognizing handles
Techniques for recognizing handles
There are no known efficient algorithms to recognize handles
However, there are good heuristics for guessing handles
On some CFGs, the heuristics always guess correctly
Recognizing handles
It is not obvious how to detect handles.
At each step the parser sees only the stack, not the entire input.
α is a viable prefix if there is an ω such that α|ω is a state of a shift-reduce parser.
Viable prefix
A viable prefix does not extend past the right end of the handle
It is a viable prefix because it is a prefix of the handle
As long as the parser has viable prefixes on the stack, no parsing error has been detected
Recognizing handles
For any grammar, the set of viable prefixes is a regular language.
An automaton can therefore accept viable prefixes
Item
An item is a production with a “.” somewhere on the r-h-s
Example
The items for T → (E) are
T → .(E)
T → (.E)
T → (E.)
T → (E).
The only item for X → λ is X → .
Recognizing handles
The problem in recognizing viable prefixes is that the stack has only bits and pieces of the r-h-s of productions (if it had a complete r-h-s, we could reduce)
These bits and pieces are always prefixes of the r-h-s of productions
Recognizing handles
Consider the input (int) and the grammar
E → T + E | T
T → int ∗ T | int | (E)
Then (E|) is a state of the shift-reduce parse
(E is a prefix of the r-h-s of T → (E) (it will be reduced after the next shift)
Item T → (E.) says that so far we have seen (E of this production and hope to see )
The stack may have many prefixes of r-h-s’s
Prefix_1 Prefix_2 . . . Prefix_{n−1} Prefix_n
Let Prefix_i be a prefix of the r-h-s of X_i → α_i
Eventually Prefix_i will reduce to X_i
The missing part of α_{i−1} starts with X_i
i.e. there is a production X_{i−1} → Prefix_{i−1} X_i β for some β
Recursively, Prefix_{k+1} . . . Prefix_n eventually reduces to the missing part of α_k
Consider the string (int ∗ int):
(int ∗ | int) is a state of the shift-reduce parse
( is a prefix of the r-h-s of T → (E)
λ is a prefix of the r-h-s of E → T
int ∗ is a prefix of the r-h-s of T → int ∗ T
The stack of items
T → (.E)
E → .T
T → int ∗ .T
says:
We’ve seen ( of T → (E)
We’ve seen λ of E → T
We’ve seen int ∗ of T → int ∗ T
To recognize viable prefixes, we must:
Recognize a sequence of partial r-h-s’s of productions, where
each partial r-h-s can eventually reduce to part of the missing suffix of its predecessor
Recognizing viable prefixes
1 Add a dummy production S′ → S to G.
2 The NFA states are the items of G, including those of the dummy production.
3 For each item E → α.Xβ, add a transition on X from E → α.Xβ to E → αX.β
4 For each item E → α.Xβ and production X → γ, add a λ-transition from E → α.Xβ to X → .γ
5 Every state is an accepting state.
6 The start state is S′ → .S
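The six steps can be sketched by simulating the NFA directly, assuming the running grammar E → T + E | T, T → int ∗ T | int | (E). An item is represented as a tuple (l-h-s, r-h-s, dot position). That representation and the function names are illustrative choices.

```python
GRAMMAR = {
    "S'": [("E",)],  # step 1: dummy production S' -> E
    "E": [("T", "+", "E"), ("T",)],
    "T": [("int", "*", "T"), ("int",), ("(", "E", ")")],
}

def closure(items):
    """Step 4: follow λ-moves from A -> α.Xβ to X -> .γ for every X -> γ."""
    result, todo = set(items), list(items)
    while todo:
        lhs, rhs, dot = todo.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for gamma in GRAMMAR[rhs[dot]]:
                item = (rhs[dot], gamma, 0)
                if item not in result:
                    result.add(item)
                    todo.append(item)
    return result

def is_viable_prefix(symbols):
    """Run the NFA on a sequence of grammar symbols (terminals or nonterminals)."""
    # Step 6: the start state is S' -> .E
    current = closure({("S'", ("E",), 0)})
    for symbol in symbols:
        # Step 3: on X, move the dot over X in every item that allows it.
        moved = {(lhs, rhs, dot + 1) for lhs, rhs, dot in current
                 if dot < len(rhs) and rhs[dot] == symbol}
        current = closure(moved)
    # Step 5: every state accepts, so the prefix is viable if any state survives.
    return bool(current)
```

For example, ( int ∗ leaves the item T → int ∗ .T alive, so it is a viable prefix, while int int leaves no item and is rejected.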
Tools
GNU Bison
Yacc
ANTLR
Semantic analysis
Last front end phase
Catches all the errors that remain after lexing and parsing
Some language constructs are not context-free
Different languages have different kinds of checks performed by semantic analysis
Semantic analysis
Two kinds of semantic analysis:
Static: checks are performed during compilation
Dynamic: checks are performed at run time
Additional code is generated to prevent:
Division by 0
Access to an invalid array position
Scope
Matching identifier declarations with their uses.
The scope of an identifier is the portion of a program in which that identifier is accessible
The same identifier may refer to different things in different parts of the program
An identifier may have restricted scope
Symbol tables
A simple symbol table can be implemented as a stack
Operations:
add symbol: push the symbol and its associated info (type) on the stack
find symbol: search for the symbol on the stack, starting from the top; return the first occurrence, or NULL if none is found
remove symbol: pop the stack
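A minimal sketch of these three operations, with Python's None standing in for NULL. The class and method names simply mirror the operation names above.

```python
class SymbolTable:
    """A stack of (name, info) pairs; the most recent declaration is on top."""

    def __init__(self):
        self._stack = []

    def add_symbol(self, name, info):
        # Push the symbol and its associated info (for example its type).
        self._stack.append((name, info))

    def find_symbol(self, name):
        # Search from the top so inner declarations shadow outer ones.
        for entry_name, info in reversed(self._stack):
            if entry_name == name:
                return info
        return None

    def remove_symbol(self):
        # Pop the most recently added symbol, e.g. when leaving its scope.
        return self._stack.pop()

table = SymbolTable()
table.add_symbol("x", "Int")       # outer declaration
table.add_symbol("x", "String")    # inner declaration shadows the outer one
inner = table.find_symbol("x")
table.remove_symbol()              # leave the inner scope
outer = table.find_symbol("x")
```

Searching from the top is what makes an inner declaration hide an outer one with the same name until its scope ends.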
Symbol tables
Class names can be used before being defined
We can’t check class names using a symbol table, or even in one pass
Solution:
1 Gather all class names
2 Do the verification
Semantic analysis requires multiple passes
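The two passes can be sketched as follows, assuming each class is given as a hypothetical (name, parent) pair in source order, where the parent is the only place a class name is used.

```python
def check_class_names(classes):
    """Return the parent names that are used but never defined.

    classes is a list of (name, parent) pairs; parent may be None.
    """
    # Pass 1: gather all class names, wherever they are defined.
    defined = {name for name, _ in classes}
    # Pass 2: verify every use against the complete set.
    return [parent for _, parent in classes
            if parent is not None and parent not in defined]
```

With classes A (parent B) and B declared in that order, a single pass would reject B at its first use, while the two-pass version accepts it.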
Types
What is a type?
Although the notion varies from language to language, a type is:
A set of values
A set of operations on those values
Classes are one instantiation of the modern notion of type
Types
A language’s type system specifies which operations are valid for which types
The goal of type checking is to ensure that operations are used only with correct types
Types
Kinds of languages
Statically typed: all or almost all type checking is done as part of compilation (C, Java)
Dynamically typed: almost all type checking is done as part of program execution (Lisp, Python, Perl)
Untyped: No type checking (machine code)
Types
Type checking is the process of verifying fully typed programs
The user declares types for identifiers
Type inference is the process of filling in missing type information
The compiler infers types for expressions
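A minimal sketch of the idea, assuming a toy language with types Int and Bool and the operators +, ∗ and <. The user supplies the declared types of identifiers in env, and the function works out the type of each expression or reports an error. The expression encoding as nested tuples is a hypothetical choice.

```python
def type_of(expr, env):
    """Return the type of expr ('Int' or 'Bool') or raise TypeError.

    expr is an int literal, a bool literal, an identifier (str),
    or a tuple (op, left, right) with op in {'+', '*', '<'}.
    env maps declared identifiers to their types.
    """
    if isinstance(expr, bool):  # checked first: bool is a subclass of int
        return "Bool"
    if isinstance(expr, int):
        return "Int"
    if isinstance(expr, str):
        if expr not in env:
            raise TypeError(f"undeclared identifier {expr!r}")
        return env[expr]
    op, left, right = expr
    left_type, right_type = type_of(left, env), type_of(right, env)
    # All three operators are only valid on Int operands.
    if left_type != "Int" or right_type != "Int":
        raise TypeError(f"operator {op!r} needs Int operands, "
                        f"got {left_type} and {right_type}")
    return "Bool" if op == "<" else "Int"
```

With x declared as Int, the expression x + 2 ∗ 3 gets type Int and x < 1 gets type Bool, while True + 1 is rejected before the program runs.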