213 lines
9.3 KiB
HTML
213 lines
9.3 KiB
HTML
<html>
|
|
<head>
|
|
<title>Bare Bones Backend</title>
|
|
<link rel="stylesheet" type="text/css" href="style.css">
|
|
</head>
|
|
<body>
|
|
<div id="banner">
|
|
<a href="http://www.webkit.org/" class="banner-link">
|
|
<div id="logo" class="site-logo">
|
|
WebKit
|
|
<span class="tagline">Open Source Web Browser Engine</span>
|
|
</div>
|
|
</a>
|
|
</div>
|
|
<div id="contents">
|
|
<h1>Bare Bones Backend</h1>
|
|
<p>The Bare Bones Backend, or B3 for short, is WebKit's optimizing JIT for procedures
|
|
containing C-like code. It's currently used as the default backend for the
|
|
<a href="https://trac.webkit.org/wiki/FTLJIT">FTL JIT</a> inside
|
|
<a href="https://trac.webkit.org/wiki/JavaScriptCore">JavaScriptCore</a>.</p>
|
|
|
|
<p>B3 comprises a <a href="intermediate-representation.html">C-like
|
|
SSA IR</a> known as "B3 IR", optimizations on B3 IR, an
|
|
<a href="assembly-intermediate-representation.html">assembly IR</a>
|
|
known as "Air", optimizations on Air, an instruction selector that turns B3 IR into Air,
|
|
and a code generator that assembles Air into machine code.</p>
|
|
|
|
<h2>Hello, World!</h2>
|
|
|
|
<p>Here's a simple example of C++ code that uses B3 to generate a function that adds two to
|
|
its argument and returns it:</p>
|
|
|
|
<pre><code>// Create a Procedure that holds our code.
|
|
Procedure proc;
|
|
BasicBlock* root = proc.addBlock();
|
|
root->appendNew<Value>(
|
|
proc, Return, Origin(),
|
|
root->appendNew<Value>(
|
|
proc, Add, Origin(),
|
|
root->appendNew<ArgumentRegValue>(proc, Origin(), GPRInfo::argumentGPR0),
|
|
root->appendNew<Const64Value>(proc, Origin(), 2)));
|
|
|
|
// Have B3 compile the Procedure into code. The code and all of its artifacts (constant pools, jump tables)
|
|
// will stay alive so long as Compilation stays alive.
|
|
std::unique_ptr<Compilation> compilation = std::make_unique<Compilation>(vm, proc);
|
|
|
|
// Get a function pointer that we can call. This function pointer points to JIT-generated machine code.
|
|
int64_t (*function)(int64_t) = bitwise_cast<int64_t (*)(int64_t)>(compilation->code().executableAddress());
|
|
|
|
// Run it and print the result!
|
|
printf("%lld\n", function(42)); // Prints 44.</code></pre>
|
|
|
|
<p>When compiled, the resulting machine code looks like this:</p>
|
|
|
|
<pre><code>0x3aa6eb801000: pushq %rbp
|
|
0x3aa6eb801001: movq %rsp, %rbp
|
|
0x3aa6eb801004: leaq 0x2(%rdi), %rax
|
|
0x3aa6eb801008: popq %rbp
|
|
0x3aa6eb801009: ret </code></pre>
|
|
|
|
<p>B3 always emits a frame pointer prologue/epilogue to play nice with debugging tools.
|
|
Besides that, you can see that B3 optimized the procedure's body to a single instruction:
|
|
in this case, a Load Effective Address to transfer %rdi + 2, where %rdi is the first
|
|
argument register, into %rax, which is the result register.</p>
|
|
|
|
<h2>B3 IR</h2>
|
|
|
|
<p>Clients of B3 usually interact with it using
|
|
<a href="intermediate-representation.html">B3 IR</a>. It's C-like, in the sense that it
|
|
models heap references as integers and does not attempt to verify memory accesses. It
|
|
enforces <a href="https://en.wikipedia.org/wiki/Static_single_assignment_form">static single
|
|
assignment</a>, or SSA for short. An SSA program will contain only one assignment to each
|
|
variable, which makes it trivial to trace from a use of a variable to the operation that
|
|
defined its value. B3 IR is designed to be easy to generate and cheap to manipulate.</p>
|
|
|
|
<p>In most ways, B3 IR is platform-agnostic. However, since B3 is designed to be used as a
|
|
backend for JITs, it does embrace some platform-specific concepts whenever it is pragmatic to
|
|
do so. B3 exposes direct control over argument registers, stack frame layout, the frame
|
|
pointer, and the call argument areas. It's possible to emit B3 IR that defines completely
|
|
novel calling conventions, both for callers of the procedure being generated and for callees of
|
|
the procedure's callsites. B3 also makes it easy to just emit a C call. There's an opcode for
|
|
that.</p>
|
|
|
|
<p>See <a href="intermediate-representation.html">the IR documentation</a> for more
|
|
info.</p>
|
|
|
|
<p>Here's an example of the IR from the example above:</p>
|
|
|
|
<pre><code>BB#0: ; frequency = 1.000000
|
|
Int64 @0 = ArgumentReg(%rdi)
|
|
Int64 @1 = Const64(2)
|
|
Int64 @2 = Add(@0, $2(@1))
|
|
Void @3 = Return(@2, Terminal)</code></pre>
|
|
|
|
<h2>B3 Optimizations</h2>
|
|
|
|
<p>B3 is fairly new - we only started working on it in late Oct 2015. But it already has
|
|
some awesome optimizations:</p>
|
|
|
|
<ul>
|
|
<li>CFG simplification.</li>
|
|
<li>Constant folding with some flow sensitivity.</li>
|
|
<li>Global CSE, including sophisticated load elimination.</li>
|
|
<li>Aggressive dead code elimination.</li>
|
|
<li>Tail duplication.</li>
|
|
<li>SSA fix-up</li>
|
|
<li>Optimal placement of constant materializations.</li>
|
|
<li>Integer overflow check elimination.</li>
|
|
<li>Reassociation.</li>
|
|
<li>Lots of miscellaneous strength reduction rules.</li>
|
|
</ul>
|
|
|
|
<h2>Air</h2>
|
|
|
|
<p><a href="assembly-intermediate-representation.html">Air</a>, or Assembly IR, is the way
|
|
that B3 represents the machine instruction sequence prior
|
|
to code generation. Air is like assembly, except that in addition to registers it has
|
|
temporaries, and in addition to the native address forms it has abstract ones like "stack"
|
|
(an abstract stack slot) and "callArg" (an abstract location in the outgoing call argument
|
|
area of the stack).</p>
|
|
|
|
<p>Here's the initial Air generated from the example above:</p>
|
|
|
|
<pre><code>BB#0: ; frequency = 1.000000
|
|
Move %rdi, %tmp1, @0
|
|
Move $2, %tmp2, $2(@1)
|
|
Add64 $2, %tmp1, %tmp0, @2
|
|
Move %tmp0, %rax, @3
|
|
Ret64 %rax, @3</code></pre>
|
|
|
|
<p>Note that the "@" references indicate the origin of the instruction in the B3 IR.</p>
|
|
|
|
<h2>Air Optimizations</h2>
|
|
|
|
<p>Air has sophisticated optimizations that transform programs that use temporaries and
|
|
abstract stack locations into ones that use registers directly. Air is also responsible
|
|
for ABI-related issues like stack layout and handling the C calling convention. Air has
|
|
the following optimizations:</p>
|
|
|
|
<ul>
|
|
<li><a href="https://www.cs.princeton.edu/research/techreps/TR-498-95">Iterated Register Coalescing</a>. This is our register allocator.</li>
|
|
<li>Graph coloring stack allocation.</li>
|
|
<li>Spill code fix-up.</li>
|
|
<li>Dead code elimination.</li>
|
|
<li>Partial register stall fix-up.</li>
|
|
<li>CFG simplification.</li>
|
|
<li>CFG layout optimization.</li>
|
|
</ul>
|
|
|
|
<p>Here's what these optimizations do to the example program:</p>
|
|
|
|
<pre><code>BB#0: ; frequency = 1.000000
|
|
Add64 $2, %rdi, %rax, @2
|
|
Ret64 %rax, @3</code></pre>
|
|
|
|
<h2>B3→Air lowering, also known as Instruction Selection</h2>
|
|
|
|
<p>The B3::LowerToAir phase converts B3 into Air by doing pattern-matching. It processes
|
|
programs backwards. At each B3 value, it greedily tries to match both the value and as
|
|
many of its children (i.e. Values it uses) and their children as possible to create a
|
|
single instruction. Different hardware targets support different instructions. Air
|
|
allows B3 to speak of the superset of all instructions on all targets, but exposes a fast
|
|
query to check if a given instruction, or specific instruction form (like 3-operand add,
|
|
for example) is available. The instruction selector simply cascades through the patterns
|
|
it knows about until it finds one that gives a legal instruction in Air.</p>
|
|
|
|
<p>The instruction selector is powerful enough to do basic things like compare-branch and
|
|
load-op-store fusion. It's smart enough to do things like what we call the Mega Combo,
|
|
where the following B3 IR:</p>
|
|
|
|
<pre><code>Int64 @0 = ArgumentReg(%rdi)
|
|
Int64 @1 = ArgumentReg(%rsi)
|
|
Int32 @2 = Trunc(@1)
|
|
Int64 @3 = ZExt32(@2)
|
|
Int32 @4 = Const32(1)
|
|
Int64 @5 = Shl(@3, $1(@4))
|
|
Int64 @6 = Add(@0, @5)
|
|
Int32 @7 = Load8S(@6, ControlDependent|Reads:Top)
|
|
Int32 @8 = Const32(42)
|
|
Int32 @9 = LessThan(@7, $42(@8))
|
|
Void @10 = Check(@9:WarmAny, generator = 0x103fe1010, earlyClobbered = [], lateClobbered = [],
|
|
usedRegisters = [], ExitsSideways|Reads:Top)</code></pre>
|
|
|
|
<p>Is turned into the following Air:</p>
|
|
|
|
<pre><code>Move %rsi, %tmp7, @1
|
|
Move %rdi, %tmp1, @0
|
|
Move32 %tmp7, %tmp2, @3
|
|
Patch &Branch8(3,SameAsRep), LessThan, (%tmp1,%tmp2,2), $42, @10</code></pre>
|
|
|
|
<p>And the resulting code ends up being:</p>
|
|
|
|
<pre><code>0x311001401004: movl %esi, %eax
|
|
0x311001401006: cmpb $0x2a, (%rdi,%rax,2)
|
|
0x31100140100a: jl 0x311001401015</code></pre>
|
|
|
|
<p>Other than the mandatory zero-extending operation to deal with the 32-bit argument being
|
|
used as an index, B3 is smart enough to convert the address computation, load, and compare
|
|
into a single instruction and then fuse that with the branch.</p>
|
|
|
|
<h2>Code generation</h2>
|
|
|
|
<p>The final form of Air contains no unallocated temporaries or abstract stack slots. Therefore,
|
|
it maps directly to machine code. The final code generation step is a very fast transformation
|
|
from Air's object-oriented way of representing those instructions to the target's machine
|
|
code. We use JavaScriptCore's macro assembler for this purpose.</p>
|
|
|
|
</div>
|
|
</body>
|
|
</html>
|
|
|
|
|