Domain-specific Language Implementation Patterns (Pt. 5): Compiler, Interpreter, and Transcompiler

5. Compiler, interpreter, and transcompiler

Compilers, interpreters, and transcompilers are generator programs: they transform a source unit, together with its Intermediate Representation (IR), usually an AST, into its final output form (a binary, execution logic, or source code in another language). These generators usually sit at the end of a multistage pipeline and run after the input source has been parsed and analyzed and the IR has been rewritten. Depending on how a language is implemented, generator programs fall into three main categories: compiler, interpreter, and transcompiler.

A compiler is a program that transforms the AST of a high-level programming language, along with its metadata (symbol tables, scopes), into a lower-level programming language. Compilation is one of the most widely used ways to implement a language. Producing a low-level source unit as the final output, sometimes Assembly, gives the programmer two main advantages. First, because low-level code expresses the primitive operations a computer actually performs (memory manipulation, binary operations), the hardware can execute it directly, without going through or being simulated by another layer of software. Second, many high-level constructs have no corresponding representation in a lower-level language, so information that is not needed for execution is discarded during compilation, and it becomes extremely difficult for a human to understand what an application does by reading its compiled form. Tools exist to analyze compiled code, reverse engineer it, and translate it back into an approximation of the “original” source. But because only part of the original information survives compilation, most reverse engineering tools today can recover only the general structure of a compiled application, not details such as symbol names or the original syntactic constructs. This makes it much harder to steal source code, which is treated as intellectual property and is often strictly protected by law through licensing. Compilation also has a drawback. Because operating systems and processor architectures differ in how they load and execute machine instructions, and even use different instruction sets, the output of a compilation usually works on only one combination of operating system and architecture. Take C as an example: most C compilers produce machine-dependent output, that is, Assembly that runs only on the instruction set it was generated for, so a program must be recompiled for every target, which makes developing portable applications in C much harder.
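
To make the lowering step concrete, here is a minimal sketch, assuming a tiny arithmetic AST and a made-up stack-based target. The node classes (Num, BinOp), the compile_expr function, and the mnemonics (PUSH, ADD, SUB, MUL) are all hypothetical and do not correspond to any real instruction set or compiler.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Num:
    value: int


@dataclass
class BinOp:
    op: str          # one of "+", "-", "*"
    left: "Expr"
    right: "Expr"


Expr = Union[Num, BinOp]

# Hypothetical mnemonics for an imaginary stack-based target.
MNEMONICS = {"+": "ADD", "-": "SUB", "*": "MUL"}


def compile_expr(node: Expr) -> List[str]:
    """Lower an expression tree into a flat list of low-level instructions."""
    if isinstance(node, Num):
        return [f"PUSH {node.value}"]
    if isinstance(node, BinOp):
        # Post-order walk: emit both operands first, then the operator.
        return (compile_expr(node.left)
                + compile_expr(node.right)
                + [MNEMONICS[node.op]])
    raise TypeError(f"unknown node: {node!r}")


# AST for (1 + 2) * 3
tree = BinOp("*", BinOp("+", Num(1), Num(2)), Num(3))
listing = compile_expr(tree)
print("\n".join(listing))
```

The listing no longer contains the nesting or the node names of the tree, only a flat sequence of operations. On a small scale, this is the loss of high-level information described above.
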
Attempts were made to solve this portability problem, most notably the two-phase compilation process. In this process, the high-level source unit is compiled into an intermediate form that resembles machine instructions but does not correspond to any specific instruction set. The instructions of this medium must be general enough to be portable and should not contain any environment-specific feature without declaring it. They must also stay close to real machine instructions, because they will be translated at runtime and that translation has to be fast. After the compilation phase, the output is portable code that most operating systems cannot execute natively, which calls for another kind of language application: the virtual machine. A virtual machine can read and understand the medium's code and execute it on the platform the virtual machine itself runs on, so the portable code produced during the compilation phase can be executed wherever a virtual machine is available. During the interpretation phase, some implementations compile the medium once more, using a tool called a just-in-time compiler, down to machine code that can be executed natively. Other implementations read the medium and execute each instruction directly using a built-in executor. This works without translating the medium code to machine code because the required machine code already exists inside the interpreter. Since the medium language is essentially an Assembly for a special runtime, each of its instructions maps one-to-one or one-to-many onto machine instructions, so the interpreter can simply run a prepared sequence of machine instructions whenever it encounters a particular intermediate instruction.
This is similar to the notion of a function in high-level programming, where the programmer calls a built-in function by name and a sequence of code is executed without the programmer needing to know how it does its job. In summary, the two-phase translation process spans both compile time and runtime. Its major benefit is portability across many platforms. Thanks to this portability, the JDK consists largely of machine-independent code, and a programmer who develops an application on Windows can expect the same application to generally work well on Linux or macOS. The second major benefit is performance: the runtime can use what it knows about the operating system and machine it is running on to optimize code in ways an ahead-of-time compiler cannot. The final result is an architecture that supports cross-platform execution at acceptable speed. Languages whose runtimes perform this kind of translation include C#, VB.NET, and F# on the Common Language Runtime (CLR), and Java, Kotlin, Scala, and Golo on the Java Virtual Machine (JVM). The internals of these runtimes are heavily tuned for performance. For instance, the CLR uses a highly optimized just-in-time compiler to apply optimizations where necessary (removing dead code, inlining functions, ...); coupled with the ability to write unsafe code and interoperate with native libraries, this makes applications written in CLR-compliant languages sufficiently fast while maintaining portability. The JVM (in its HotSpot implementation), on the other hand, interprets bytecode by default, based on the observation that most bytecode is executed only rarely, and compiles a method with its two compilers, C1 and C2, only after the method has been invoked a certain number of times. The result is also a sufficiently fast language, despite lacking the low-level constructs and the ability to write unsafe code that C# offers.
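
The runtime half of the two-phase process can be sketched as follows. This is a toy model under stated assumptions, not how the CLR or the JVM is implemented: the opcodes (LOAD_ARG, PUSH, ADD, MUL), the Method class, and the threshold of 3 are invented for illustration, and the "just-in-time compiler" here emits a Python function as a stand-in for real machine code.

```python
from typing import Callable, List, Optional, Tuple

Instr = Tuple[str, int]  # (opcode, operand); the operand is ignored by ADD/MUL


def interpret(code: List[Instr], arg: int) -> int:
    """Execute portable instructions one at a time on an operand stack."""
    stack: List[int] = []
    for op, operand in code:
        if op == "LOAD_ARG":
            stack.append(arg)
        elif op == "PUSH":
            stack.append(operand)
        elif op in ("ADD", "MUL"):
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if op == "ADD" else a * b)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack.pop()


def jit_compile(code: List[Instr]) -> Callable[[int], int]:
    """Translate the whole instruction list once into a Python function.

    Stand-in for a real JIT: it builds one expression instead of machine code.
    """
    exprs: List[str] = []
    for op, operand in code:
        if op == "LOAD_ARG":
            exprs.append("arg")
        elif op == "PUSH":
            exprs.append(str(operand))
        elif op in ("ADD", "MUL"):
            b, a = exprs.pop(), exprs.pop()
            exprs.append(f"({a} {'+' if op == 'ADD' else '*'} {b})")
        else:
            raise ValueError(f"unknown opcode: {op}")
    return eval(f"lambda arg: {exprs.pop()}")


class Method:
    HOT_THRESHOLD = 3  # arbitrary value, chosen only for this sketch

    def __init__(self, code: List[Instr]) -> None:
        self.code = code
        self.calls = 0
        self.compiled: Optional[Callable[[int], int]] = None

    def invoke(self, arg: int) -> int:
        if self.compiled is not None:
            return self.compiled(arg)      # fast path after compilation
        self.calls += 1
        if self.calls >= self.HOT_THRESHOLD:
            self.compiled = jit_compile(self.code)  # method became "hot"
        return interpret(self.code, arg)   # slow path: interpret this call


# Portable code for (arg + 2) * 3
code = [("LOAD_ARG", 0), ("PUSH", 2), ("ADD", 0), ("PUSH", 3), ("MUL", 0)]
m = Method(code)
results = [m.invoke(n) for n in range(5)]
print(results)
```

The first three calls go through interpret; from the fourth call on, the method runs the compiled function. Both paths compute the same values, which is the property a real runtime must preserve when it switches a method from interpretation to compiled code.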

There are also many other implementation strategies, although they are mostly based on the three discussed earlier. These strategies produce fast and efficient results, but they take a lot of skill and effort to write, not to mention the difficulty of maintaining the programs when the language design changes. They also do not work in the Web environment, where a new language can only be implemented natively once the major browser vendors reach consensus. This partly explains the dominance of JavaScript for logic execution on the Web: getting a new and potentially better language implemented and adopted for general use would take years. To work around these issues, a different translation strategy was devised, called transcompilation, or transpilation. In transcompilation, a high-level programming language is read, parsed, analyzed, and translated into another high-level programming language that already has a supported runtime. The output of the transcompilation process can then be used as normal source code and be compiled, interpreted, and executed by that runtime. This is the case for TypeScript, which is described as a “better” language that works on the Web. It achieves this by having a transcompiler translate TypeScript into JavaScript. Most of the time, programmers only need to care about writing TypeScript code; an automated process kicks in when necessary to generate the JavaScript files, place them in the appropriate folders, and resolve references between code files and between HTML and JavaScript. Transcompiling TypeScript to JavaScript is intended to let programmers write more efficient, less buggy code in less time while keeping all of their target platforms. This is a common theme among transcompiled programming languages. Other examples include CoffeeScript, NativeScript, and Xtend. They are essentially developed for two main purposes.
The first is to improve on an existing language by translating another, better language into it. The second is to allow code written for one platform to run on other platforms, as with NativeScript, where programmers write JavaScript code that runs on Android or iOS.
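
The first purpose, translating a richer language into an existing one, can be illustrated with a minimal sketch that erases type annotations while emitting JavaScript. It assumes a tiny, already-parsed function declaration; the Param and Function classes and both emitter functions are hypothetical, and this is not how the real TypeScript compiler is implemented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Param:
    name: str
    type_name: str   # annotation in the source language, e.g. "number"


@dataclass
class Function:
    name: str
    params: List[Param]
    return_type: str
    body_expr: str   # kept as an opaque expression string for brevity


def to_typed_source(fn: Function) -> str:
    """Print the declaration as it would look in the typed source language."""
    params = ", ".join(f"{p.name}: {p.type_name}" for p in fn.params)
    return (f"function {fn.name}({params}): {fn.return_type} "
            f"{{ return {fn.body_expr}; }}")


def to_javascript(fn: Function) -> str:
    """Emit plain JavaScript: same structure, type annotations dropped."""
    params = ", ".join(p.name for p in fn.params)
    return f"function {fn.name}({params}) {{ return {fn.body_expr}; }}"


add = Function(
    name="add",
    params=[Param("a", "number"), Param("b", "number")],
    return_type="number",
    body_expr="a + b",
)
typed = to_typed_source(add)
js = to_javascript(add)
print(typed)
print(js)
```

The output of to_javascript is ordinary source code for a language that already has a runtime, so it can be handed to any JavaScript engine unchanged. The annotations only matter before emission, where a real transcompiler would use them for analysis.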