Automating quine construction for C
A quine is a program that can output its own source code. Named after the 20th century philosopher Willard Van Orman Quine and popularized in Godel, Escher, Bach, writing quines has quickly become a quintessential bit of recreational coding. Countless examples can be found among the International Obfuscated C Code Contest (IOCCC) entries. While quines are possible in any Turing complete programming language— more on that theoretical result below— writing them can be tricky due to the structural requirements placed on the code for a valid quine. Success criteria is very strict: the output must be exactly identical to the original source, down to spacing and newlines. A miss is as good as a mile. The starting point for this blog post is a search for simplifying the creation of arbitrary quines. For concreteness, this proof of concept will focus on C. But the underlying ideas extend to other programming languages.
Since most interesting quines have some additional functionality besides being able to output their own code, it would be logical to divide development into two steps:
- Write the application as one normally would
- Convert it into a quine by feeding it through a compiler. (More accurately, a “transpiler” or “transcompiler” since the output itself is another valid program in the same language.)
Motivating example
Consider the canonical example of a quine. It takes no inputs and simply prints out its own source code to standard out. Here is a logical, if somewhat naive, starting point in C:
Minor problem: this is not a valid C program. Doing all the heavy lifting is a mystery function “get_self” that is not declared— much less defined— anywhere in the source file. (Historical aside: C used to permit calling such undeclared functions, with an implicit assumption of integer return type. Those liberties were revoked with the C99 standard. In any case, adding a declaration will merely postpone the inevitable: the compile-time error about undefined symbols becomes a link-time error about unresolved symbols.)
What if we could feed this snippet into another program that automatically transform it into a valid C program working as intended? Before going down that path, it is important to clarify the success criteria for a “working” quine. Informally we are looking for a transformation with these properties:
- Accepts as input an “almost valid” C program with one undefined function get_self(). This function takes no arguments and returns a nil-terminated C string.
- Outputs a valid C program with identical functionality
- That new program includes a valid C implementation of get_self() which returns the entire source code of the modified program, including the newly added function.
There are two subtleties here: First the implementation must return the source code of the modified program, incorporating all additions and changes made by the transformation. Printing the original, pre-transformed source would not qualify as a true quine. Second, the implementation provided for the mystery function get_self() must operate at source level. There are trivial but OS dependent ways to convert any program into a quine by mucking with the binary executable, after the compilation is done. For example, the ELF format for Linux executables allows adding new data sections to an existing file. One could take the source code and drop that into the compiled binary as a symbol with a specific name such as “my_source_code.” This allows for a simple get_self() implementation that parses the binary for the current executable, searching for that symbol in data sections and converting the result into a C string. Expedient as that sounds, such approaches are considered cheating and do not qualify as true quines according to generally accepted definition of a self-replicating program. Informally, the “quineness” property must be intrinsic to the source. In the hypothetical example where one is playing games with ELF sections, the capability originated with an after-the-fact modification of the binary and is not present anywhere in the source. Another example of cheating would be to commit the source code to an external location such as Github and download it at runtime. These are disqualified under our definition.
QCC: Quining C Compiler
qcc is a proof-of-concept that transforms arbitrary C programs to create valid quines according to the rules sketched above. For example, running the pre-quine sample from the last section through qcc yields a valid quine:
Looking at the transformed output, we observe the original program verbatim in the listing (lines 8-15) sandwiched between prepended and appended sections:
Changes preceding the original source are minimal:
- Courtesy warning against manual edits, as any change is likely to break the quine property. This is optional and can be omitted.
- Include directives for standard C libraries required by the get_self() implementation. This is also optional if the original source already includes them. (This PoC does not bother to check. Standard library headers commonly have include guards; second and subsequent inclusions become a noop.) Incidentally, includes can also be deferred until the section of source referencing them. There is no rule dictating that all headers must appear at the start; it is merely a style convention most C/C++ developers abide by.
- Declaration of get_self(). Also can be omitted if the original program has one.
More substantial additions appear after the original code:
- Definition of a helper function for hex-decoding.
- Two string constants carrying hex-encoded payloads. These could have been merged into a single string constant, but keeping them separate helps clarify the quine mechanism.
- Definition of the get_self() function. This is the core of the implementation.
Hex-decoding the first string shows that it is just the the original program, with minor changes for headers/declaration prepended and the helper function appended. There is nothing self-referential or unusual going on. But the second hex payload when decoded is identical to the function get_self(). That is the tell-tale of sign quines: inert “data” strings mirroring code that gets compiled into executable instructions. Note that while the first string constant is a function of the original input program, the second one is identical for all inputs. (This is why separating them helps clarify the mechanism.)
Making a quine out of the quining compiler
Careful readers will note that qcc itself is a C program, but the initial version referenced above is not itself a quine. That will not stand. One natural option is to add a new command line flag that will induce qcc to output its own source code instead of quining another program.
Boot-strapping that has one tricky aspect: one needs a functioning qcc executable before it can quine its own source code. While this can be achieved by maintaining two different versions of the compiler, it is possible to get by with a single version with some judicious use of preprocessor tricks and conditional compilation. The trick is defining a preprocessor macro and surrounding any code paths that reference quine-like behavior with #ifdef guards referencing that macro:
- Add an option to qcc to define this macro explicitly in the transformation when outputting a quine. That capability will come in handy for any application that needs to be a valid C program in its pre-quine state.
- Compile with default settings, resulting in the preprocessor macro being undefined and leaving those sections out of the build. (That is a good thing: otherwise compiler errors will result from the nonexistent self-referential function referenced in those lines.)
- Run this original, pre-quine qcc binary on its own source code, and specify that the preprocessor macro is to be defined in the output.
- Save and recompile the transformed output as the new qcc. No need to explicitly define the preprocessor macro via compiler flags; it is already hardwired into the transformed source.
This final version can now output its own source, in addition to quining other C programs:
Caveats and design options
As this is a proof of concept, there are caveats:
- It only operates on single files. As such it can not be used to create multiquines, where a collection of more than one program has access to the source code of every other program in the collection. That case can be handled by first concatenating them into a single program and some preprocessor tricks. Straight concatenation alone would result in an invalid C program, due to multiple definitions for main, not to mention any other conflicting symbol used. But conditional compilation with #ifdef allows activating only a subset of code during compilation.
- Choice of hex encoding is arbitrary. It is certainly not space-efficient, doubling the size of the original input. On the other hand, using hex avoids the visual clutter/confusion caused by having strings that look like valid C code. Raw C strings also have the drawback that special characters such as double quotes and backslashes must be escaped. While decoding is free in the sense that the C compiler takes care of it, helper functions are still needed for outputting escaped strings. Hex is simple enough to encode/decode in a handful of lines compared to more elaborate options such as base64 or compression, both of which would require more code inlined into every quine or pull in new library dependencies.
- Implementation of get_self() is not thread safe, due to use of static variables. If there is a race condition where multiple threads race to execute the first-time initialization code, the string holding the source code representation will get computed multiple times. All threads will still receive the correct value, but extra memory will be wasted for unused versions.
- String constants are currently emitted as a single line, which may overflow the maximum allowed depending on compiler. (C standard requires support for at least 4K characters and most compilers in practice far exceed that limit.)
- Finally: QCC transformations are easily recognizable. There is no attempt to hide the fact that a quine is being created. Functions and variable identifiers have meaningful, intuitive names. This is problematic in contexts such as IOCC where obfuscation and brevity are virtues.
Post-script: excursion on Kleene’s recursion theorem
Stepping back, qcc is an implementation of a transform that is guaranteed to exist by virtue of a result first proven in 1938, namely Kleene’s second-recursion theorem. While that theorem was originally stated in the context of recursive functions, here we state it in an equivalent setting with Turing machines.
Consider a Turing machine T that computes a function of two inputs x and y:
T(x, y)
Kleene’s result states that there is a Turing machine S which computes a function of just one input, such that the behavior of S on all inputs is closely related to the behavior of T:
∀x S(x) = T(x, #S)
where #S is the Turing number for the machine S— in other words, a reference to S itself. This statement says that for all inputs, S behaves just like the original machine T acting on the first input, with the second input hard-wired to its own (that is, of S) Turing number.
Since programming languages are effectively more convenient ways to construct Turing machines, we can translate this into even more familiar territory. Suppose we have a program P written in a high-level language which takes two inputs:
P(x, y) ⟹ output
Kleene’s theorem guarantees the existence of a program Q which takes a single input x and computes:
Q(x) = P(x, source(Q))
Here source(Q) is the analog of Turing numbers for high-level programming languages, namely a textual representation of the program Q in the appropriate alphabet, such as ASCII or Unicode.
For a concrete example, consider replicating a recent IOCCC entry: write a program that outputs its own SHA512 hash. At first this seems impossible, barring a catastrophic break in SHA512. As a cryptographic hash function, SHA512 is designed to be one-way: given a target hash, it is computationally infeasible to find a preimage input that would produce that target when fed into SHA512. Given the difficulty of that basic problem, finding a “self-consistent” program such that its own SHA512 hash is somehow contained in the source code seems daunting. But Kleene’s result makes it tractable. As a starting point, it is easy enough to write a program that receives some input (for example, from console) and outputs the SHA512 hash of that input. Squinting a little, we can view this as a degenerate function on two inputs. The first input is always ignored while the second input is the one processed through the hash function:
P(x, y) ⟹ SHA512(y)
Kleene’s recursion theorem then guarantees the existence of a corresponding program Q:
Q(x) = P(x, source(Q)) = SHA512(source(Q))
In other words, Q prints a cryptographic hash of its own source code.
Given this background, we have a theoretical view on QCC: For C programs that can be expressed in a single compilation unit, QCC constructs another C program that is the “fixed-point” guaranteed to exist by Kleene’s second recursion theorem, with the second input hard-wired to the source code of the new program.
CP



