Programming Languages

(1)

Compilers

Andrés Jaque P.

Universidad Javeriana

March 20, 2014

(2)

1 Basics

2 Lexical analysis

3 Parser - Syntactic analysis
    Error handling
    Top-Down Parsing
        Recursive descent parsing
        Left-recursive grammars
        Predictive parsing
        Left factoring
        LL
    Bottom-Up Parsing
        LR

4 Semantic analysis

(4)

Basics

Purpose of a compiler

Detect invalid programs
Translate the valid ones

Kinds of possible errors:

Error kind     Example                     Detected by
Lexical        ... $ ...                   Lexer
Syntax         ... x * % ...               Parser
Semantic       ... int x; y = x(3); ...    Type checker
Correctness    whatever algorithm          Tester, user

(5)

Lexer

Input: string of characters

Output: string of tokens
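As an illustration of the lexer's input and output, here is a minimal sketch in Python. The token classes (INT, ID, OP), their regular expressions and the function name tokenize are assumptions chosen for this example; they are not taken from the slides.

```python
import re

# Hypothetical token classes for a tiny expression language.
TOKEN_SPEC = [
    ("INT", r"\d+"),
    ("ID", r"[A-Za-z_]\w*"),
    ("OP", r"[+*()=;]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Turn a string of characters into a list of (kind, lexeme) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            # No token class matches here: this is a lexical error.
            raise SyntaxError(f"lexical error at position {pos}: {text[pos]!r}")
        if match.lastgroup != "SKIP":  # whitespace produces no token
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

With these assumed token classes, a character such as $ is rejected by the lexer, which corresponds to the lexical error row of the error table.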

(6)

Additional resources

lex, flex

(7)

Parser types

Universal
    Cocke-Younger-Kasami algorithm
    Earley algorithm
    In practice they are too slow

Top-down

Bottom-up

(8)

Errors

An error handler should:

Report errors accurately and clearly
Recover from an error quickly
Not slow down compilation of valid code

Error-handling schemes:

Panic mode
Error productions
Automatic local or global correction

(9)

Panic mode

When an error is detected, discard tokens until one with a clear role is found, and continue from there

The parser looks for such synchronizing tokens

(10)

Example

Consider the following erroneous expression:

(1 + +2) + 3

Panic-mode recovery: skip ahead to the next integer and then continue

Bison: use the special terminal error to describe how much input to skip

E → int | E + E | (E) | error int | (error)
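The recovery strategy can be sketched by hand in Python. This is not how Bison implements its error terminal; it is a small hand-written parser for sums of integers and parenthesized expressions, and the function name and error messages are assumptions made for the example. On an unexpected token it reports one error and discards tokens until the next integer, which plays the role of the synchronizing token.

```python
def parse_with_panic_mode(tokens):
    """Parse expr -> term ('+' term)*, term -> int | '(' expr ')'.

    Integers are Python ints, other tokens are strings.
    Returns the list of error messages (empty for a valid input).
    """
    errors = []
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def term():
        nonlocal pos
        if peek() == "(":
            pos += 1
            expr()
            if peek() == ")":
                pos += 1
            else:
                errors.append(f"expected ')' at token {pos}")
            return
        if not isinstance(peek(), int):
            errors.append(f"unexpected {peek()!r} at token {pos}")
            # Panic mode: discard tokens until the next integer.
            while pos < len(tokens) and not isinstance(peek(), int):
                pos += 1
        if pos < len(tokens):
            pos += 1  # consume the integer

    def expr():
        nonlocal pos
        term()
        while peek() == "+":
            pos += 1
            term()

    expr()
    if pos < len(tokens):
        errors.append(f"unexpected {peek()!r} at token {pos}")
    return errors
```

For the expression (1 + +2) + 3 the sketch reports a single error at the second + and then parses the rest of the input normally.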

(11)

Error productions

Specify known common mistakes in the grammar

Example: consider the erroneous expression 3x, written instead of 3 * x

Add the production E → ... | E E to the grammar

Disadvantage: complicates the grammar

(18)

Error corrections

Find a correct “nearby” program

Try token insertions and deletions
Exhaustive search

Disadvantages:

Hard to implement
Slows down parsing of correct programs
“Nearby” is not necessarily “the intended” program
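A minimal sketch of the exhaustive search, in Python. The function names, the max_edits bound and the balanced-parentheses acceptor are assumptions introduced for the example; any function that decides whether a token list is a valid program could be passed as accepts.

```python
def nearby_programs(tokens, vocabulary):
    """Yield every token list that is one deletion or one insertion away."""
    for i in range(len(tokens)):
        yield tokens[:i] + tokens[i + 1:]            # delete token i
    for i in range(len(tokens) + 1):
        for tok in vocabulary:
            yield tokens[:i] + [tok] + tokens[i:]    # insert tok at position i


def correct(tokens, accepts, vocabulary, max_edits=2):
    """Breadth-first search for an accepted program within max_edits edits."""
    frontier = [list(tokens)]
    for edits in range(max_edits + 1):
        for candidate in frontier:
            if accepts(candidate):
                return candidate
        if edits < max_edits:
            frontier = [n for c in frontier for n in nearby_programs(c, vocabulary)]
    return None


def balanced(tokens):
    """Example acceptor: balanced parentheses."""
    depth = 0
    for tok in tokens:
        depth += 1 if tok == "(" else -1
        if depth < 0:
            return False
    return depth == 0
```

The frontier grows very quickly with the number of edits, which is one way to see why this scheme is slow, and the first accepted candidate is simply the first one found, not necessarily the program the user intended.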

(25)

Motivation

Many languages are not regular, e.g.

Balanced parentheses
Nested decision structures

(27)

Parser

Input: string of tokens

Output: parse tree

(28)

Abstract syntax trees (AST)

Like parse trees, but ignore some details

A parse tree
    Traces the operation of the parser
    Captures the nesting structure
    But contains too much information
        Parentheses
        Single-successor nodes

(29)

Example

E → int | E + E | (E)

[Figure: parse tree and abstract syntax tree (data structure representation)]

(30)

Recursive descent parsing

The parse tree is constructed
    From the top
    From left to right

Terminals are seen in order of appearance in the token stream

(31)

Example

Assume the grammar

E → T | T + E
T → int | int * T | (E)

Start with the top-level variable (nonterminal) E

Try the rules for E in order
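A possible Python sketch of a recursive descent recognizer for this grammar. To make the backtracking complete, each function is written as a generator that yields every position where its nonterminal can end; this design and the function names are choices made for the example, not something prescribed by the slides.

```python
def parse_E(tokens, pos):
    """E -> T | T + E : yield every position where an E starting at pos can end."""
    for end in parse_T(tokens, pos):
        yield end                                   # E -> T
        if end < len(tokens) and tokens[end] == "+":
            yield from parse_E(tokens, end + 1)     # E -> T + E


def parse_T(tokens, pos):
    """T -> int | int * T | (E)"""
    if pos < len(tokens) and tokens[pos] == "int":
        yield pos + 1                               # T -> int
        if pos + 1 < len(tokens) and tokens[pos + 1] == "*":
            yield from parse_T(tokens, pos + 2)     # T -> int * T
    elif pos < len(tokens) and tokens[pos] == "(":
        for end in parse_E(tokens, pos + 1):        # T -> (E)
            if end < len(tokens) and tokens[end] == ")":
                yield end + 1


def accepts(tokens):
    """The input is accepted if some parse of E consumes all the tokens."""
    return any(end == len(tokens) for end in parse_E(tokens, 0))
```

The rules of each nonterminal are tried in the order given by the grammar, and a failed alternative simply yields nothing, so the caller moves on to the next one.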

(32)

Left-recursive grammars

A grammar is left-recursive if it has a nonterminal A such that there is a derivation A ⇒⁺ Aa for some string a.

Consider the production S → Sa

Recursive descent parsing does not work on such grammars: the procedure for S calls itself without consuming any input, so it goes into an infinite loop

(35)

Replacing left-recursive productions

Consider the productions S → Sa | b (immediate left recursion)

They can be replaced by non-left-recursive productions:

S → bS′
S′ → aS′ | λ

General case: the productions

A → Aa_1 | Aa_2 | ... | Aa_m | b_1 | b_2 | ... | b_n

can be replaced by

A → b_1A′ | b_2A′ | ... | b_nA′
A′ → a_1A′ | a_2A′ | ... | a_mA′ | λ

(38)

Replacing left-recursive productions

S → Aa | b
A → Ac | Sd | λ

The grammar above is also left-recursive, although not immediately (S ⇒ Aa ⇒ Sda)

The following algorithm systematically eliminates left recursion from a grammar, provided it has no cycles or λ-productions

(39)

arrange the nonterminals in some order A1, A2, ..., An
for each i from 1 to n do
    for each j from 1 to i−1 do
        replace each production of the form Ai → Aj γ by the productions
        Ai → δ1 γ | δ2 γ | ... | δk γ,
        where Aj → δ1 | δ2 | ... | δk are all the current Aj-productions
    end
    eliminate the immediate left recursion among the Ai-productions
end
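The algorithm can be sketched in Python as follows. The grammar representation (a dictionary from nonterminal to a list of right-hand sides, each a list of symbols, with the empty list standing for λ) and the function names are assumptions made for this example. The new nonterminal A′ is written as the name followed by an apostrophe.

```python
def eliminate_immediate(nt, rhss):
    """Replace A -> A a_i | b_j by A -> b_j A', A' -> a_i A' | λ."""
    new_nt = nt + "'"
    recursive = [rhs[1:] for rhs in rhss if rhs[:1] == [nt]]   # the a_i
    others = [rhs for rhs in rhss if rhs[:1] != [nt]]          # the b_j
    if not recursive:
        return {nt: rhss}
    return {
        nt: [b + [new_nt] for b in others],
        new_nt: [a + [new_nt] for a in recursive] + [[]],      # [] is λ
    }


def eliminate_left_recursion(grammar, order):
    """Apply the algorithm above with the nonterminals taken in the given order."""
    g = {nt: [list(rhs) for rhs in rhss] for nt, rhss in grammar.items()}
    for i, a_i in enumerate(order):
        for a_j in order[:i]:
            expanded = []
            for rhs in g[a_i]:
                if rhs[:1] == [a_j]:
                    # A_i -> A_j γ becomes A_i -> δ γ for every A_j -> δ
                    expanded += [delta + rhs[1:] for delta in g[a_j]]
                else:
                    expanded.append(rhs)
            g[a_i] = expanded
        g.update(eliminate_immediate(a_i, g[a_i]))
    return g
```

Applied to S → Aa | b, A → Ac | Sd | λ with the order S, A, the sketch leaves the S-productions unchanged and produces A → bdA′ | A′ and A′ → cA′ | adA′ | λ.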

(40)

Predictive parsing

Like recursive descent, but the parser can predict which production to use
    By looking at the next few tokens
    No backtracking

Predictive parsers accept LL(k) grammars

(41)

Left factoring

The productions

A → aβ_1 | aβ_2

are replaced by

A → aA′
A′ → β_1 | β_2

Example

E → T + E | T
T → int | int * T | (E)
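A possible Python sketch of one round of left factoring, applied to the productions of a single nonterminal. The representation (right-hand sides as lists of symbols, the empty list standing for λ) and the function names are assumptions made for the example. Alternatives are grouped by their first symbol and the longest common prefix of each group is factored out; a full implementation would repeat this on the new nonterminals.

```python
def common_prefix(sequences):
    """Longest list of symbols that starts every sequence."""
    prefix = []
    for symbols in zip(*sequences):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return prefix


def left_factor(nt, rhss):
    """One round of left factoring for the productions nt -> rhss."""
    groups = {}
    for rhs in rhss:
        groups.setdefault(rhs[0] if rhs else None, []).append(rhs)
    result = {nt: []}
    primes = 0
    for first, group in groups.items():
        if first is None or len(group) == 1:
            result[nt].extend(group)            # nothing to factor
            continue
        primes += 1
        new_nt = nt + "'" * primes              # A', A'', ...
        prefix = common_prefix(group)
        result[nt].append(prefix + [new_nt])    # A -> a A'
        result[new_nt] = [rhs[len(prefix):] for rhs in group]  # A' -> β_i
    return result
```

On the example grammar this gives E → T E′ with E′ → + E | λ, and T → int T′ | (E) with T′ → λ | * T.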

(43)

Bottom-Up Parsing

More general than (deterministic) top-down parsing

The preferred method in practice

Does not need left-factored grammars

(44)

Bottom-up parsing reduces a string to the start symbol by inverting productions

Given the grammar

E → T + E | T
T → int * T | int | (E)

Consider the string int * int + int

(45)

Bottom-Up Parsing

A bottom-up parser traces a rightmost derivation in reverse

Let αβω be a step of a bottom-up parse

Assume the next reduction is by X → β

Then ω is a string of terminals, because αXω ⇒ αβω is a step in a rightmost derivation

(49)

Shift-Reduce parsing

Idea: split the string into two substrings
    The right substring has not yet been examined by the parser
    The left substring contains terminals and nonterminals

The dividing point is marked with |

(50)

Bottom-Up Parsing

Bottom-up parsing uses only two kinds of actions:

Shift: move | one place to the right
    Shifts a terminal to the left string
    ABC|xyz ⇒ ABCx|yz

Reduce: apply an inverse production at the right end of the left string
    If A → xy is a production, then Cbxy|ijk ⇒ CbA|ijk

(52)

Shift-reduce parsing

The left string can be implemented by a stack

Shift pushes a terminal on the stack

Reduce
    pops 0 or more symbols off the stack (production r-h-s)
    pushes a nonterminal on the stack (production l-h-s)
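The stack mechanics can be sketched in Python. The sketch does not decide which action to take; it only replays a given list of actions, and the function name and the action encoding ("shift" or a pair of left-hand side and right-hand side) are assumptions made for the example. The action list below is one valid parse of int * int + int with the grammar E → T + E | T, T → int * T | int | (E).

```python
def shift_reduce(tokens, actions):
    """Replay shift and reduce actions, returning the 'stack | input' trace."""
    stack, rest = [], list(tokens)
    trace = []

    def snapshot():
        trace.append((" ".join(stack) + " | " + " ".join(rest)).strip())

    snapshot()
    for action in actions:
        if action == "shift":
            stack.append(rest.pop(0))           # push the next terminal
        else:
            lhs, rhs = action
            start = len(stack) - len(rhs)
            if start < 0 or stack[start:] != list(rhs):
                raise ValueError(f"{rhs} is not on top of the stack")
            del stack[start:]                   # pop the r-h-s
            stack.append(lhs)                   # push the l-h-s
        snapshot()
    return trace


ACTIONS = [
    "shift", "shift", "shift",
    ("T", ["int"]),
    ("T", ["int", "*", "T"]),
    "shift", "shift",
    ("T", ["int"]),
    ("E", ["T"]),
    ("E", ["T", "+", "E"]),
]
TRACE = shift_reduce(["int", "*", "int", "+", "int"], ACTIONS)
```

Printing TRACE line by line shows the | marker moving only to the right, and the last configuration has the start symbol E alone on the stack with no input left.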

(53)

How do we decide when to shift or reduce?

In a given state, more than one action (shift or reduce) may lead to a valid parse

If it is legal to shift or to reduce, there is a shift-reduce conflict

If it is legal to reduce by two different productions, there is a reduce-reduce conflict

Given the grammar

E → T + E | T
T → int * T | int | (E)

Consider the step int | * int + int

We could reduce by T → int, giving T | * int + int

Mistake: there is then no way to reduce to the start symbol E

(55)

Handles

We want to reduce only if the result can still be reduced to the start symbol

Assume a rightmost derivation

S ⇒* αXω ⇒ αβω

Then αβ is a handle of αβω

A handle is a reduction that also allows further reductions back to the start symbol

We only want to reduce at handles

(56)

Handles

Informal induction on the number of reduce moves:

Initially the stack is empty

Immediately after reducing a handle
    the rightmost nonterminal is on top of the stack
    the next handle must be to the right of the rightmost nonterminal, because this is a rightmost derivation
    a sequence of shift moves reaches the next handle

(57)

Bottom-Up Parsing

In shift-reduce parsing, handles appear only at the top of the stack, never inside

Handles are never to the left of the rightmost nonterminal

Therefore, shift and reduce moves are sufficient; the | never needs to move left

Bottom-up parsing algorithms are based on recognizing handles

(58)

Techniques for recognizing handles

There are no known efficient algorithms to recognize handles

However, there are good heuristics for guessing handles

On some CFGs, the heuristics always guess correctly

(59)

Recognizing handles

It is not obvious how to detect handles.

At each step the parser sees only the stack, not the entire input.

α is a viable prefix if there is an ω such that α|ω is a state of a shift-reduce parser.

(60)

Viable prefix

A viable prefix does not extend past the right end of the handle

It is a viable prefix because it is a prefix of the handle

As long as a parser has viable prefixes on the stack no parsing error has been detected

(61)

Recognizing handles

For any grammar, the set of viable prefixes is a regular language.

(62)

Automaton to accept viable prefixes

Item

An item is a production with a “.” somewhere on the r-h-s

Example

The items for T → (E) are

T → .(E)
T → (.E)
T → (E.)
T → (E).

The only item for X → λ is X → .
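Enumerating the items of a production is mechanical, as this small Python sketch shows. The tuple representation of an item (left-hand side, symbols before the dot, symbols after the dot) and the function names are assumptions made for the example.

```python
def items(lhs, rhs):
    """All items of lhs -> rhs: one for each position of the dot."""
    return [(lhs, tuple(rhs[:i]), tuple(rhs[i:])) for i in range(len(rhs) + 1)]


def show(item):
    """Render an item as text, for example 'T -> (.E)'."""
    lhs, before, after = item
    return f"{lhs} -> {''.join(before)}.{''.join(after)}"
```

A production with n symbols on its r-h-s has n + 1 items, so a λ-production has exactly one.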

(63)

Recognizing handles

The problem in recognizing viable prefixes is that the stack has only bits and pieces of the r-h-s of productions (if it had a complete r-h-s, we could reduce)

These bits and pieces are always prefixes of the r-h-s of productions

(64)

Recognizing handles

Consider the input (int) and the grammar

E → T + E | T
T → int * T | int | (E)

Then (E|) is a state of a shift-reduce parse

(E is a prefix of the r-h-s of T → (E), which will be reduced after the next shift

The item T → (E.) says that so far we have seen (E of this production and hope to see )

(65)

The stack may have many prefixes of r-h-s's

Prefix_1 Prefix_2 ... Prefix_{n−1} Prefix_n

Let Prefix_i be a prefix of the r-h-s of X_i → α_i
    Eventually Prefix_i will reduce to X_i
    The missing part of α_{i−1} starts with X_i
    i.e. there is an X_{i−1} → Prefix_{i−1} X_i β for some β

Recursively, Prefix_{k+1} ... Prefix_n eventually reduces to the missing part of α_k

(66)

Consider the string (int * int):

(int * | int) is a state of a shift-reduce parse

( is a prefix of the r-h-s of T → (E)
λ is a prefix of the r-h-s of E → T
int * is a prefix of the r-h-s of T → int * T

The stack of items

T → (.E)
E → .T
T → int * .T

says:

We've seen ( of T → (E)
We've seen λ of E → T
We've seen int * of T → int * T

(67)

To recognize viable prefixes, we must recognize a sequence of partial r-h-s's of productions, where each partial r-h-s can eventually reduce to part of the missing suffix of its predecessor.

(68)

Recognizing viable prefixes

1 Add a dummy production S′ → S to G.

2 The NFA states are the items of G, including the dummy production.

3 For each item E → α.Xβ, add a transition on X: E → α.Xβ --X--> E → αX.β

4 For each item E → α.Xβ and production X → γ, add a transition on λ: E → α.Xβ --λ--> X → .γ

5 Every state is an accepting state.

6 The start state is S′ → .S
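The six steps can be sketched in Python as follows. The grammar representation (a dictionary from nonterminal to a list of r-h-s tuples), the item encoding (left-hand side, r-h-s, dot position) and the function names are assumptions made for this example. The NFA is simulated directly with λ-closures instead of being converted to a DFA.

```python
def build_nfa(grammar, start):
    """Return (transitions, start_item); the label None stands for λ."""
    g = dict(grammar)
    g[start + "'"] = [(start,)]                       # step 1: S' -> S
    transitions = {}
    for lhs, rhss in g.items():
        for rhs in rhss:
            for dot in range(len(rhs)):               # step 2: items are states
                x = rhs[dot]
                edges = transitions.setdefault((lhs, rhs, dot), [])
                edges.append((x, (lhs, rhs, dot + 1)))        # step 3
                for gamma in g.get(x, []):
                    edges.append((None, (x, gamma, 0)))       # step 4
    return transitions, (start + "'", (start,), 0)    # step 6


def closure(states, transitions):
    """All states reachable from the given ones through λ-transitions."""
    seen, pending = set(states), list(states)
    while pending:
        state = pending.pop()
        for label, target in transitions.get(state, []):
            if label is None and target not in seen:
                seen.add(target)
                pending.append(target)
    return seen


def is_viable_prefix(symbols, transitions, start_item):
    current = closure({start_item}, transitions)
    for sym in symbols:
        moved = {t for s in current for label, t in transitions.get(s, []) if label == sym}
        current = closure(moved, transitions)
        if not current:
            return False
    return True                                       # step 5: every state accepts


GRAMMAR = {
    "E": [("T", "+", "E"), ("T",)],
    "T": [("int", "*", "T"), ("int",), ("(", "E", ")")],
}
NFA, START = build_nfa(GRAMMAR, "E")
```

With this grammar, ( int * is a viable prefix, while int + is not, because int has to be reduced to T before + can be shifted.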

(69)

Tools

GNU Bison

YACC

ANTLR

(70)

Semantic analysis

Last front-end phase

Catches all remaining errors, those not caught by the lexer and the parser

Some language constructs are not context-free

Different languages require different kinds of checks in semantic analysis

(74)

Semantic analysis

Two kinds of semantic analysis:

Static: checks are performed during compilation

Dynamic: checks are performed at run time
    Additional code is generated to prevent:
        Division by 0
        Access to an invalid array position
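As an illustration only, the additional code generated for dynamic checks behaves roughly like the following Python helpers. The function names are hypothetical; a real compiler would emit the equivalent tests inline in the target code.

```python
def checked_div(a, b):
    """Integer division guarded by the test a compiler could insert."""
    if b == 0:
        raise ZeroDivisionError("division by 0")
    return a // b


def checked_index(array, i):
    """Array access guarded by a bounds test."""
    if not 0 <= i < len(array):
        raise IndexError(f"invalid array position {i}")
    return array[i]
```

The price of these checks is extra work at run time on every division and every array access.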

(75)

Scope

Matching identifier declarations with uses.

The scope of an identifier is the portion of a program in which that identifier is accessible

The same identifier may refer to different things in different parts of the program

An identifier may have restricted scope

(76)

Symbol tables

A simple symbol table can be implemented as a stack

Operations:

add symbol: push the symbol and its associated info (type) on the stack

find symbol: search for the symbol on the stack, starting from the top; return the first occurrence, or NULL if none is found

remove symbol: pop the stack
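A minimal Python sketch of such a stack-based symbol table. The class and method names mirror the three operations above and are otherwise assumptions; None plays the role of NULL.

```python
class SymbolTable:
    """Symbol table kept as a stack of (name, info) pairs."""

    def __init__(self):
        self._stack = []

    def add_symbol(self, name, info):
        """Push the symbol and its associated info (for example its type)."""
        self._stack.append((name, info))

    def find_symbol(self, name):
        """Search from the top; return the first match's info, or None."""
        for entry_name, info in reversed(self._stack):
            if entry_name == name:
                return info
        return None

    def remove_symbol(self):
        """Pop the most recently added symbol."""
        return self._stack.pop()
```

Because the search starts at the top, an inner declaration hides an outer declaration of the same identifier until it is popped, which matches nested scopes.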

(77)

Symbol tables

Class names can be used before being defined

We can't check class names using a symbol table, or even in one pass

Solution:

1 Gather all class names

2 Do the verification

Semantic analysis requires multiple passes
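The two passes can be sketched in Python as follows. The input format, a list of pairs with a class name and the class names it refers to, is a simplification invented for this example, as is the function name.

```python
def check_class_names(classes):
    """Return (class, undefined name) pairs for every unknown class reference."""
    # Pass 1: gather all class names.
    declared = {name for name, _ in classes}
    # Pass 2: do the verification.
    errors = []
    for name, uses in classes:
        for used in uses:
            if used not in declared:
                errors.append((name, used))
    return errors
```

In the sketch, a class A that refers to a class B defined later in the program is accepted, which a single pass with a stack-based symbol table could not do.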

(83)

Types

What is a type?

Although the notion varies from language to language, a type is:

A set of values
A set of operations on those values

Classes are one instantiation of the modern notion of type

(88)

Types

A language’s type system specifies which operations are valid for which types

The goal of type checking is to ensure that operations are used only with correct types

(89)

Types

Kinds of languages

Statically typed: all or almost all type checking is done as part of compilation (C, Java)

Dynamically typed: almost all type checking is done as part of program execution (Lisp, Python, Perl)

Untyped: No type checking (machine code)

(90)

Types

Type checking is the process of verifying fully typed programs
    The user declares types for identifiers

Type inference is the process of filling in missing type information
    The compiler infers types for expressions
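A small Python sketch of both ideas for a toy expression language. The expression encoding (nested tuples), the two types int and bool, the operator set and the function name are assumptions made for the example. The declared types of identifiers come from the environment, and the type of each expression is computed from the types of its parts.

```python
def type_of(expr, env):
    """Return the type of expr, or raise TypeError if it is ill-typed.

    expr is a bool or int literal, an identifier (str), or a tuple
    (operator, left, right); env maps identifiers to declared types.
    """
    if isinstance(expr, bool):       # test bool first: bool is a subclass of int
        return "bool"
    if isinstance(expr, int):
        return "int"
    if isinstance(expr, str):
        if expr not in env:
            raise TypeError(f"undeclared identifier {expr}")
        return env[expr]
    op, left, right = expr
    left_type, right_type = type_of(left, env), type_of(right, env)
    if op in ("+", "*") and left_type == right_type == "int":
        return "int"
    if op == "<" and left_type == right_type == "int":
        return "bool"
    if op == "and" and left_type == right_type == "bool":
        return "bool"
    raise TypeError(f"operator {op} is not valid for {left_type} and {right_type}")
```

For example, with x declared as int, the expression (x + 1) < 3 gets the type bool, while x + True is rejected because + is only valid for two int operands in this toy type system.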