`antcc` is a small C compiler using its own independent backend.
It supports [most of C11 and some C23 features](doc/cstd.md), as well as some GNU extensions.
It is still at an experimental stage, but it can already build some
real-world C codebases such as Lua, SQLite,
[oksh](https://github.com/ibara/oksh), [tin](http://www.tin.org/) and itself.
`antcc` is inspired by other small C compilers like
[tcc](https://bellard.org/tcc/),
[cproc](https://git.sr.ht/~mcf/cproc),
[chibicc](https://github.com/rui314/chibicc),
and backends like [QBE](https://c9x.me/compile/) and [LLVM](https://llvm.org/).
## Requirements
`antcc` is written in standard C11 and can be built with any conforming
compiler toolchain. The `Makefile` requires GNU Make.
## Building
Run `./configure` to create `hostconfig.h` and `config.mk` for your system.
Build with
```
make
#or
make opt #compile with -O2
#or
make dbg #compile with UBsan and Asan
```
Install with `(sudo) make install`.
## Supported targets
For now, only x86-64 POSIX (Sys-V + ELF) is supported; an aarch64 backend is in the works. Tested and known to work:
- `x86_64-linux-gnu`
- `x86_64-linux-musl`
- `x86_64-unknown-openbsd`
## Usage
The driver is still incomplete, but it mimics that of compilers like gcc; see `--help`.
`antcc` compiles translation units to object files directly, but the driver
will invoke an external command to link to an executable if `-c` isn't passed.
Cross-compilation is partially supported: cross-compiling object files works
but an external cross-compiling toolchain for linking is required; the driver
will try to find one (invoking e.g. `aarch64-linux-gnu-gcc`, or falling back
to [`zig cc`](https://andrewkelley.me/post/zig-cc-powerful-drop-in-replacement-gcc-clang.html)),
and appropriate include paths must be specified manually. You can specify the compiler target architecture with `-target <triple>`.
## Testing
`bootstrap.sh` will bootstrap the compiler in 3 stages:
- Stage 0 builds the compiler with the system's C compiler
- Stage 1 builds the compiler with the stage 0 output
- Stage 2 builds the compiler with the stage 1 output
- Then stage 1 and 2 outputs are verified to be identical
There are tests in the `test` directory:
- `test/run.sh`: local tests
- `test/lua.sh`: compile Lua 5.4.0 and run its testsuite
- `test/c-testsuite.sh`: run [c-testsuite](https://github.com/c-testsuite/c-testsuite)
- `test/sqlite.sh`: compile SQLite and run its testsuite (must pull in external sqlite submodule: `git submodule update --init --recursive`)
- `test/metalang99.sh`: compile and run [metalang99](https://github.com/hirrolot/metalang99) tests (preprocessor stress testing)
## Issues and contributing
You can report issues on the [issue tracker](https://codeberg.org/lsof/antcc/issues).
Contributions are welcome as long as they aren't low-effort AI slop; send them as pull requests [on Codeberg](https://codeberg.org/lsof/antcc/pulls).
## Internals & Design
C type representation (`c_type.h` & `c_type.c`) is shared by the frontend and
backend because the backend is responsible for ABI-specific lowering of calling
conventions.
The C frontend is structured like so:
- Compiler driver (`a_main.c`), which parses command line options, inputs and
outputs and calls out to the core compiler to build individual object
files and possibly invoke an external command to link them together.
- Tokenizer & preprocessor (`c_lex.c`): The input file is scanned on-demand,
initially reading characters into an internal buffer after performing
backslash-newline deletion (and possibly trigraph substitution), then producing
one token at a time when the parser requests the next one. Preprocessing
(directives & macro expansion) is also done on the fly.
- Parser & IR generation (`c.c`): The handwritten parser reads declarations
and keeps them in a symbol table/environment. Static data is written to
buffers that correspond to the .rodata/.data sections of the final object
file, emitting relocations to the object file interface too. Function
bodies are parsed and transformed into the IR in one pass. Expressions are
parsed into expression trees before being emitted or compile-time evaluated
(`c_eval.c`), but there is no whole-program AST. When the end of a
function definition is reached, the backend is called to perform all of the
passes that will finally transform it into machine code written to the
.text section.
The backend (`ir_*`) uses an IR in Static Single Assignment (SSA) form.
Instructions have a return type and up to two operands. Because of SSA form,
temporaries are simply referenced by the instruction that provides their
definition, so an explicit output operand is not required. The list of
instructions is defined in `ir_op.def`. Each basic block in the control flow
graph consists of 0 or more phi functions, followed by 0 or more instructions,
terminated by a jump (unconditional/conditional branch, return, or trap).
The builder API (`ir_builder.c`) used by the frontend performs peephole
optimizations on the fly, mainly constant folding.
Object file interface routines are in `obj.[c/h]`, with the ELF implementation in
`o_elf.[c/h]`. Support for other object formats like PE and Mach-O is planned.
Debug information in the form of DWARF is also planned, but it is a sizeable
undertaking.
The `-d...` compiler flag can be used to print the output of different stages
of the backend for debugging.
The backend performs the following main passes:
- ABI lowering (`ir_abi0.c`, `t_x86-64_sysv.c`): implements target calling
convention details, such as lowering structures being passed/returned by
value in registers or the stack.
- Intrinsics lowering (`ir_intrin.c`): lowers some intrinsics emitted by the
frontend (currently just structcopy)
- mem2reg (`ir_mem2reg.c`): promotes stack slots to SSA temporaries. This is
an important pass because the frontend puts every C variable into a stack
slot, and this pass turns those slots into SSA temporaries and phi
instructions whenever possible (most of the time, unless they are
aggregates or their address is taken), which is also how clang/LLVM does
it. Can be disabled with `-O0`.
- With optimizations enabled (`-O1` and above):
  + inlining (`ir_inliner.c`)
  + common-subexpression elimination (`ir_cse.c`)
  + general arithmetic simplifications and branch simplification
    (`ir_simpl.c`)
- Stack lowering (`ir_stack.c`): `alloca` instructions are deleted and
corresponding stack slots replaced with calculated stack offsets.
- Instruction selection (`t_x86-64_isel.c`): architecture-specific
instruction selection, addressing mode utilization, introduction of
register constraints.
- Register allocation (`ir_regalloc.c`): performs linear scan register
allocation. A scratch register is reserved for operations with spilled
temporaries.
- Code emission (`t_x86-64_emit.c`): binary code for the target architecture is
emitted directly (not textual assembly). Relocations are deferred to the
object file interface too.
[ ... ]