Skip to content

Bootstrapping Executables

John Reppy edited this page Sep 4, 2024 · 21 revisions

Overview

This page documents the underlying steps that are used to build an executable program from SML sources. This method is used both to bootstrap the initial interactive system as part of building the system from scratch, as well as in the ml-build shell script (in the bin directory) that is used to create executable programs from source code.

Batch Compilation

The cmb-make script (in the system directory) and the ml-build script (part of the standard SML/NJ installation) compile SML code in "batch" mode (as opposed the compiling the code in the REPL). The result of batch compilation is a collection of binfiles and a boot-list file that specifies the order in which to link the binfiles.

The Boot-List File Format

Boot-list files are line-oriented text files. The first line of the file has the format

% num-specs max_line_len

which specifies the number of specifications and the maximum length of any line in the file (not counting the terminating newline). There are two forms of specifications: the line that specifies the runtime-system Persistent ID (RTPID) and the binfile specifications. The RTPID is specified as a line of the form

# pid

where pid is a 32-digit (16-byte) hexadecimal number.[1] The other lines in the boot-list file specify binfiles to be linked in their link order. These lines have the form

filename [ @ offset [ : desc ] ]

where filename is the name of the binfile, offset is an optional offset into the binfile (for stable libraries), and desc is the optional name of the object being loaded. The offset defaults to 0 and the desc defaults to filename. The desc field is used for messages, but does not affect the loading process. It typically includes both the filename and the offset, as well as the name of the source file being loaded.

Bootstrapping

The runtime system has a "boot loader" mode that will bootstrap an executable from a boot-file list and a collection of pre-compiled binfiles. The boot loader is invoked by specifying @SMLboot=bootlist as a command-line option to the sml command.

The process of bootstrapping is an iteration over the specifications in the boot-list file.

For each entry, the loader reads the specified binfile, resolves its imports as explained below, executes the code, and then binds the output PID to the result. In the special case of the RTPID specification, the loader binds the RTPID to the Assembly structure that implements the linkage to the runtime system.

Loading a Binfile

As discussed elsewhere, a binfile includes the PIDS of its imports (or dependencies) and the (optional) PID of its output. Executing the code of the binfile corresponds to evaluating the top-level declarations in the corresponding SML code. The result of this execution is a heap object representing the resulting module.[2] The bootstrap loader maintains a table that maps the output PIDS to these heap objects; the RTPID is mapped to the representation of the Assembly structure.

Recall that a binfile specification has a filename and optional offset. The bootstrap loader processes each binfile specification as follows:

  1. The specified binfile is opened for reading as a binary file.

  2. If the binfile is an archive, the loader seeks to the file offset given in the specification.

  3. The binfile header is read from the file; this information is used to determine the file offsets of the parts of the file that are read by the loader.

  4. Construct the import record from imports (see below for details). The size of this record is specified by the binfile header, with an additional slot added to the end for the literals.

  5. The code in the binfile is loaded. The entrypoint of the code is a function that is applied to the import record.

  6. The output PID for the binfile is bound to the result of executing the code. In the case that the binfile does not have an output PID, the result of executing the code is ignored.

Resolving Imports

The binfile header specifies the number of imports. The imports are specified as a sequence of pairs that are a serialization of the following SML types:

datatype tree = NODE of (int * tree) list

type import = pid * tree

Note that the number of pairs can be less than the number of imports, since each tree can specify multiple imports. We construct the import record by iterating over the import pairs. The following SML pseudocode describes the process, where lookup pid maps pid to the heap object that it is bound to, addImport(n, obj) adds obj to the import record at position n, and select(obj, i) selects the i`th component of the record `obj.

fun buildImportRecord (nImports, pidTrees) = let
      fun doPair (n, (pid, tree)) = let
            val obj = lookup pid
            fun doTree (n, NODE[], obj) = (addImport(n, obj); n+1)
              | doTree (n, NODE kids, obj) =
                  List.foldl
                    (fn ((i, tr), n) => doTree(n, tr, select(obj, i)))
                    n
                    kids
            in
              doTree (n, tree, obj)
            end
      in
        List.foldl doPair 0 pidTrees
      end

Future Improvements

The current handling of binfiles that are part of an archive is inefficient, since the archive file is opened and closed for each individual binfile. On the other hand, one has to be careful to limit the number of open files. Since we have a complete list of the files that we need to read from (i.e., the boot-list file), we can implement a cache of open files using the optimal page replacement algorithm.


1. Note that the RTPID does not have to appear at the beginning of boot-list file.
2. This object is a record of the top-level dynamic objects (e.g., function closures) produced by executing the top-level declarations.

Clone this wiki locally