Tiger.ml
========
A Tiger-compiler implementation in (OCa)ML

Status
------

![screenshot-tests-head](screenshots/tests-head.jpg)
...
![screenshot-tests-tail](screenshots/tests-tail.jpg)

### Features
#### Done
- [x] ch 1: Warm-up AST
- [x] ch 2: Lexer
- [x] ch 3: Parser
- [x] ch 4: AST
- [x] ch 5: Semantic Analysis (type checking)
- [x] ch 6: Activation Records
#### In-progress
- [ ] ch 7: Translation to Intermediate Code
#### TODO (short-term)
- [ ] ch 08: Basic Blocks and Traces
- [ ] ch 09: Instruction Selection
- [ ] ch 10: Liveness Analysis
- [ ] ch 11: Register Allocation
- [ ] ch 12: Putting It All Together
#### TODO (long-term)
- [ ] ch 13: Garbage Collection
- [ ] ch 15: Functional Programming Languages
- [ ] ch 16: Polymorphic Types
- [ ] ch 17: Dataflow Analysis
- [ ] ch 18: Loop Optimizations
- [ ] ch 19: Static Single-Assignment Form
- [ ] ch 20: Pipelining and Scheduling
- [ ] ch 21: The Memory Hierarchy
#### Maybe
- [ ] ch 14: Object-Oriented Languages

### Technical issues
- [-] testing framework
  - [x] run arbitrary code snippets
  - [x] check non-failures
  - [x] check expected output
  - [-] check expected exceptions
    - [x] semant stage
    - [ ] generalized expect `Output ('a option) | Exception of (exn -> bool)`
  - [x] run all book test case files 
  - [-] grid view (cols: lex, pars, semant, etc.; rows: test cases.) 
    - [x] implementation
    - [ ] refactoring
  - [ ] test time-outs (motive: cycle non-detection caused an infinite loop)
    - [ ] parallel test execution
- [ ] Travis CI
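
The "generalized expect" item above could look roughly like the following. This is only a sketch: the `check` function, and the reading of `Output None` as "any non-failing result is fine", are assumptions made for illustration, not the framework's actual code.

```ocaml
(* Hypothetical sketch of the generalized expectation type from the TODO. *)
type 'a expect =
  | Output of 'a option         (* assumed: None accepts any non-failing result *)
  | Exception of (exn -> bool)  (* predicate the raised exception must satisfy *)

(* Run [f ()] and report whether its outcome matches the expectation. *)
let check (expect : 'a expect) (f : unit -> 'a) : bool =
  match f () with
  | actual ->
    (match expect with
     | Output None -> true
     | Output (Some expected) -> actual = expected
     | Exception _ -> false)
  | exception e ->
    (match expect with
     | Exception p -> p e
     | Output _ -> false)
```

With a single type like this, "check non-failures", "check expected output" and "check expected exceptions" become three uses of the same `check`, for example `check (Exception (function Not_found -> true | _ -> false)) (fun () -> raise Not_found)`.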

Implementation Notes
--------------------

### Parser

#### shift/reduce conflicts
##### grouping consecutive declarations
In order to support mutual recursion, we need to group consecutive
type and function declarations (see Tiger-book pages 97-99).

Initially, I defined the rules to do so as:

    decs:
      | dec      { $1 :: [] }
      | dec decs { $1 :: $2 }
      ;
    dec:
      | var_dec  { $1 }
      | typ_decs { Ast.TypeDecs $1 }
      | fun_decs { Ast.FunDecs $1 }
      ;

which, while straightforward (and working, because `ocamlyacc` defaults to
shift in case of a conflict), nonetheless caused a shift/reduce conflict in
each of `typ_decs` and `fun_decs`: the parser did not know whether to
shift and stay in the `(typ|fun)_decs` state or to reduce and get back to the
`dec` state.

Sadly, tagging the rules with a lower precedence (to explicitly favor
shifting) does not help :(

    %nonassoc LOWEST
    ...
    dec:
      | var_dec                { $1 }
      | typ_decs  %prec LOWEST { Ast.TypeDecs $1 }
      | fun_decs  %prec LOWEST { Ast.FunDecs $1 }
      ;

The difficulty seems to be in the lack of a separator token which would be
able to definitively mark the end of each sequence of consecutive
`(typ_|fun_)` declarations.

Keeping this in mind, an alternative is to manually capture the possible
interspersion patterns in the rules, like:

    (N * foo) followed-by (N * not-foo)

with the exception of `var_dec`, which, since we do not need to group its
consecutive sequences, can be reduced upon first sighting.

The final rules I ended up with are:

    decs:
      | var_dec   decs_any          { $1 :: $2 }
      | fun_decs  decs_any_but_fun  { (Ast.FunDecs  $1) :: $2 }
      | typ_decs  decs_any_but_typ  { (Ast.TypeDecs $1) :: $2 }
      ;

    decs_any:
      |                             { [] }
      | var_dec   decs_any          { $1 :: $2 }
      | fun_decs  decs_any_but_fun  { (Ast.FunDecs  $1) :: $2 }
      | typ_decs  decs_any_but_typ  { (Ast.TypeDecs $1) :: $2 }
      ;

    decs_any_but_fun:
      |                             { [] }
      | var_dec   decs_any          { $1 :: $2 }
      | typ_decs  decs_any_but_typ  { (Ast.TypeDecs $1) :: $2 }
      ;

    decs_any_but_typ:
      |                             { [] }
      | var_dec   decs_any          { $1 :: $2 }
      | fun_decs  decs_any_but_fun  { (Ast.FunDecs $1) :: $2 }
      ;
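
To make the intended result concrete, here is a minimal sketch of the same grouping done as a pass over a flat list of declarations rather than in the grammar. It is not how the parser above works, and the `dec` and `grouped` types are simplified, hypothetical stand-ins for the real `Ast` types (declarations are reduced to their names).

```ocaml
(* Hypothetical flat declarations, as if the parser did no grouping. *)
type dec =
  | Var of string
  | Fun of string
  | Typ of string

(* Hypothetical grouped form, mirroring Ast.FunDecs / Ast.TypeDecs. *)
type grouped =
  | VarDec of string
  | FunDecs of string list
  | TypeDecs of string list

(* Merge each maximal run of consecutive Fun (or Typ) declarations
   into one group; every Var stays on its own. *)
let group (decs : dec list) : grouped list =
  let step d acc =
    match d, acc with
    | Var v, _ -> VarDec v :: acc
    | Fun f, FunDecs fs :: tl -> FunDecs (f :: fs) :: tl
    | Fun f, _ -> FunDecs [f] :: acc
    | Typ t, TypeDecs ts :: tl -> TypeDecs (t :: ts) :: tl
    | Typ t, _ -> TypeDecs [t] :: acc
  in
  List.fold_right step decs []
```

For instance, `group [Typ "a"; Typ "b"; Var "x"; Fun "f"; Fun "g"]` evaluates to `[TypeDecs ["a"; "b"]; VarDec "x"; FunDecs ["f"; "g"]]`. The `decs_any_but_fun` and `decs_any_but_typ` rules enforce the same invariant: a group is never directly followed by another group of the same kind.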

##### lval

### AST

#### print as M-exp

I chose to pretty-print the AST as an (indented)
[M-expression](https://en.wikipedia.org/wiki/M-expression) - an underrated
format, used in Mathematica and originally intended for Lisp by McCarthy
himself; it is nearly as flexible as S-expressions, but significantly more
readable (IMO).

As an example, here is what `test28.tig` looks like after parsing and
pretty-printing:

    LetExp[
        [
        TypeDecs[
            TypeDec[
                arrtype1,
                ArrayTy[
                int]],
            TypeDec[
                arrtype2,
                ArrayTy[
                int]]],
        VarDec[
            arr1,
            arrtype1,
            ArrayExp[
                arrtype2,
                IntExp[
                    10],
                IntExp[
                    0]]]],
        SeqExp[
            VarExp[
                SimpleVar[
                    arr1]]]]
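
A printer in this style can be quite small. The sketch below assumes a hypothetical generic `tree` type that AST nodes would first be converted into; it is not the printer used in this repository and does not reproduce the exact indentation shown above.

```ocaml
(* Hypothetical generic tree that an AST could be converted into. *)
type tree =
  | Leaf of string
  | Node of string * tree list

(* Render a tree as an indented M-expression: Head[child, child, ...].
   Each nesting level is indented by four spaces, and closing brackets
   trail the last child instead of taking their own line. *)
let rec render (depth : int) (t : tree) : string =
  let pad = String.make (4 * depth) ' ' in
  match t with
  | Leaf s -> pad ^ s
  | Node (head, []) -> pad ^ head ^ "[]"
  | Node (head, children) ->
    let body =
      children
      |> List.map (render (depth + 1))
      |> String.concat ",\n"
    in
    pad ^ head ^ "[\n" ^ body ^ "]"

let to_string (t : tree) : string = render 0 t
```

Calling `print_endline (to_string (Node ("VarExp", [Node ("SimpleVar", [Leaf "arr1"])])))` prints `VarExp[`, then an indented `SimpleVar[`, then a further indented `arr1]]`.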

### Machine
Will most likely compile to a RISC target (MIPS) and execute using SPIM (as favored by Appel).