On 01/09/2025 15:18, David Matthews wrote:
On 01/09/2025 12:20, Phil Clayton wrote:
On 31/08/2025 13:54, vqn wrote:
As far as I can understand, this requires being able to
1. extract free identifiers from a module;
2. compare the content of two interfaces (not just their exported identifiers);
3. link old compiled code to new code it depends on.
While (1) could probably be implemented through namespaces, I'm not sure how to go about (2) and (3), especially since I am only wrapping the compiler API, i.e. I am limited to a single '(source code * compiled env) -> fully compiled and linked code' operation.
Finding the free identifiers when compiling code is fairly easy. This is what PolyML.make does, although it only looks at functors, structures and signatures. There's no reason that other kinds of identifier couldn't be included if required.
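As a rough illustration of how free identifiers might be collected during compilation, the sketch below wraps a Poly/ML name space so that every structure, signature and functor lookup that reaches it is recorded. It assumes the record layout of PolyML.NameSpace.nameSpace found in recent Poly/ML releases; `recording`, `seen` and `freeModules` are hypothetical names, and this is not a description of how PolyML.make is actually written.

```sml
(* Sketch only: wrap a name space so that module-level lookups are logged.
   Assumes PolyML.NameSpace.nameSpace is a record of lookup/enter/all functions. *)
fun recording (ns: PolyML.NameSpace.nameSpace) =
let
    (* (kind, name) pairs, most recent first *)
    val seen : (string * string) list ref = ref []
    fun note kind lookup name = (seen := (kind, name) :: !seen; lookup name)
    val wrapped : PolyML.NameSpace.nameSpace =
        { lookupVal    = #lookupVal ns,
          lookupType   = #lookupType ns,
          lookupFix    = #lookupFix ns,
          lookupStruct = note "structure" (#lookupStruct ns),
          lookupSig    = note "signature" (#lookupSig ns),
          lookupFunct  = note "functor" (#lookupFunct ns),
          enterVal     = #enterVal ns,
          enterType    = #enterType ns,
          enterFix     = #enterFix ns,
          enterStruct  = #enterStruct ns,
          enterSig     = #enterSig ns,
          enterFunct   = #enterFunct ns,
          allVal       = #allVal ns,
          allType      = #allType ns,
          allFix       = #allFix ns,
          allStruct    = #allStruct ns,
          allSig       = #allSig ns,
          allFunct     = #allFunct ns }
in
    (wrapped, fn () => List.rev (!seen))
end;

(* Simulate two lookups that a compilation might perform. *)
val (ns, freeModules) = recording PolyML.globalNameSpace;
val _ = #lookupStruct ns "List";
val _ = #lookupSig ns "LIST";
val names = freeModules ();
```

Passing `ns` to the compiler (for instance through the PolyML.Compiler.CPNameSpace parameter of PolyML.compiler) should then log the module names that the compiled code looked up from outside itself. Extending `note` to `lookupVal` and `lookupType` would cover other kinds of identifier.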
Though for now the problem is more how to properly (de)serialize compiled code so that it can be reused for subsequent compilations. :)
Yes. This issue seems (sort of) related to linking names in old and new code, not for the object code itself (where types are, presumably, long since eliminated) but for the SML types associated with certain entities in the object code. Clearly I'm not familiar with the internals of Poly/ML compilation but I may take a closer look.
Serialising the result of the compilation and loading the serialised data into a subsequent computation are the difficult parts. When anything is compiled in Poly/ML the result is a graph in memory. Some of this is a data structure that describes the types and/or signatures and some of it is what might be described as the "value". Generally both the "type" and the "value" will involve the addresses of memory cells that were present before this particular computation. These might be the cells that make up the type "int", say, or the cells that make up the "print" function and link to other cells for "stdOut". Once the compilation is complete there's no way to go back from the graph and unpick it to work out which bits came from where.
This presents a problem for serialising if we want to be able to write out only part of the graph and then read it into a subsequent computation. PolyML.export, used to create object files, writes out the whole graph so there's no need to recreate from a partial graph.
It is possible to distinguish cells by whether they came from the executable, say. Newly created cells are created in the local heap but the cells in the executable are permanent and never garbage collected. PolyML.SaveState.saveState writes out new cells to the saved state. The addresses of cells in the parent executable are written as offsets in the parent. There's no way to know anything more about them so it's only possible to read the saved state back into the same executable. PolyML.saveModule does something similar.
Thank you for the high-level explanation - very helpful.
I'm not sure what this implies for CM/MLB since I'm not familiar with them. I can see that you might want to avoid unnecessary recompilation but is it also necessary to avoid duplication of the compiled code?
I would have thought it is necessary to avoid duplication of code where mutable state is involved but perhaps I have misunderstood. Still, I doubt the performance of CM could be matched if binary files contain multiple copies of the same code, so it is probably necessary, more so for large code bases where this is useful. (Also, I think users would expect incremental compilation to be an optimization, giving something equivalent to full compilation although not identical due to e.g. loading modules in a different order.)
If one module depends on another is the idea to avoid storing the compiled code for the dependencies with it?
Yes, because this wouldn't scale up for large code bases. In my case, the final binary (heap) from 32-bit SML/NJ is 45 MB and there are hundreds of modules.
Considering solutions to support cut-off incremental recompilation for MLB files, I wondered whether a checkpoint mechanism could allow only cells introduced after the checkpoint is declared to be written out to a file. References to cells in the base executable would be stored as offsets, as currently done, but references to other cells created before the checkpoint would not be stored as offsets but as ML names and types, along with their constructor and infix status. The thinking is that such a file could be loaded on top of new code that provided the same ML names and types with the same constructor/fixity status. This would introduce a slight overhead when a module is compiled for the first time but would decrease subsequent compilation time.
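To make the proposal a little more concrete, here is a minimal data model of the three kinds of reference such a file might hold, together with the check that a named reference can be relinked. Everything in it is hypothetical (`cellRef`, `provided`, `canRelink`), and representing an ML type as its printed string is a simplification made only for this sketch; nothing here corresponds to an existing Poly/ML format.

```sml
(* Hypothetical model of the references a checkpointed file could contain. *)
datatype fixity = Nonfix | Infix of int | Infixr of int

datatype cellRef =
    BaseOffset of int            (* cell in the base executable, stored as an offset *)
  | Local of int                 (* cell created after the checkpoint, stored in the file *)
  | Named of { name: string,     (* cell created before the checkpoint *)
               ty: string,       (* printed ML type, e.g. "int ref" (simplification) *)
               isConstructor: bool,
               fixity: fixity }

(* What the loading session offers under a given name. *)
type provided = { name: string, ty: string, isConstructor: bool, fixity: fixity }

(* A Named reference can be relinked only if the new code provides the same
   name with the same type, constructor status and fixity. *)
fun canRelink (env: provided list) (Named r) =
      List.exists (fn p => p = r) env
  | canRelink _ _ = true   (* offsets and local cells need no relinking *)

val env = [{ name = "A.r", ty = "int ref", isConstructor = false, fixity = Nonfix }]

(* Same name and type: the file could be loaded on top of the new code. *)
val ok = canRelink env
           (Named { name = "A.r", ty = "int ref", isConstructor = false, fixity = Nonfix })

(* Type changed since the file was written: the dependent must be recompiled. *)
val stale = canRelink env
              (Named { name = "A.r", ty = "string ref", isConstructor = false, fixity = Nonfix })
```

A failed `canRelink` is the point at which cut-off recompilation would stop reusing the cached file and recompile the dependent module instead.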
Currently vqn is trying to get a simpler incremental compilation scheme to work:
I have been trying to implement incremental compilation by caching compiled .mlb files (i.e. compiler namespaces) and exporting and reimporting them through {save,load}ModuleBasic.
Roughly speaking, an MLB file (http://mlton.org/MLBasis) defines a basis in terms of a list of SML files and other MLB files, evaluated in order. An MLB file is evaluated only once, so multiple references to the same MLB file reuse the result of its evaluation. I think this requires {save,load}Module and their basic variants to work hierarchically but I don't see how this is supported. I am guessing a saved module has its own copy of every dependency not in the (immutable) executable. This appears to be an issue for e.g. mutable state, as shown in the example below. (Note that `loadModule` seems to fail for Poly/ML built with compact32bit, so a non-compact32bit version is required.) Is there a way to make {save,load}Module give the expected behavior below?
Phil
(* Suppose we have a module A with state and a module B that depends on A. *)
structure A =
struct
  val r = ref 0
  fun set x = r := x
  fun get () = ! r
end

structure B =
struct
  fun get () = A.get () + 1
end;

A.set 5;
A.get ();  (* expect 5, ok *)
B.get ();  (* expect 6, ok *)
PolyML.SaveState.saveModule ("/tmp/a", {sigs = [], structs = ["A"], functors = [], onStartup = NONE});
PolyML.SaveState.saveModule ("/tmp/b", {sigs = [], structs = ["B"], functors = [], onStartup = NONE});
(* **** Fresh Poly/ML session **** *)
PolyML.SaveState.loadModule "/tmp/a";
PolyML.SaveState.loadModule "/tmp/b";

A.set 10;
A.get ();  (* expected 10, ok *)
B.get ();  (* expected 11, got 6: module B not using the same state as A! *)
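The observed result is consistent with the guess that each saved module carries its own copy of the cells it reaches. The plain SML sketch below models that guess without touching the file system: a save/load round trip is represented simply as copying the reference cell. This is a model of the suspected behaviour, not of Poly/ML internals, and `makeB`, `sharedB` and `copiedB` are names invented for the illustration.

```sml
(* Model, not Poly/ML internals: saving a module is treated as copying every
   mutable cell it reaches, so the loaded module owns a private copy. *)
structure A =
struct
  val r = ref 0
  fun set x = r := x
  fun get () = !r
end

(* B's get function, parameterised by the cell it reads. *)
fun makeB (cell: int ref) = fn () => !cell + 1

val () = A.set 5
val sharedB = makeB A.r            (* same cell as A, as in the original session *)
val copiedB = makeB (ref (!A.r))   (* private copy, as the round trip is modelled *)

val () = A.set 10
val shared = sharedB ()   (* 11: sees the update made through A *)
val copied = copiedB ()   (* 6: still based on the value at "save" time *)
```

Under this model, getting the expected behaviour would require module B to store a reference to A's cell by name rather than a copy of it.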