Hi Poly folk,
I am an Isabelle/HOL user interested in reducing the on-disk size of my Isabelle/HOL heaps. These are read and written via the save state functionality in Poly/ML. Saved states compress very well via conventional algorithms. On a representative Isabelle/HOL heap I see the following:
$ ls -lh
-r--r--r-- 1 matthew matthew 1.7G Feb 10 11:22 MyHeap
-r--r--r-- 1 matthew matthew 279M Feb 10 11:22 MyHeap.bz2
-r--r--r-- 1 matthew matthew 331M Feb 10 11:22 MyHeap.gz
-r--r--r-- 1 matthew matthew 173M Feb 10 11:22 MyHeap.xz
I would like to propose integrating compression into the save state functionality itself. The runtimes of Bzip2 and XZ are probably slow enough to irritate some users, but Gzip is reasonably fast for the benefit it provides. Gzipping saved states would require linking against zlib or something similar.
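To make the Gzip option concrete, here is a minimal sketch of what the write side might look like through zlib's gzFile API. The function name and the chunking are mine for illustration, not the actual Poly/ML save-state code.

#include <zlib.h>
#include <cstddef>

// Illustrative sketch only: route saved-state output through zlib's gzip
// stream API instead of plain stdio. The entry point name is hypothetical.
static bool writeSavedStateGz(const char *path, const unsigned char *data, size_t length)
{
    gzFile out = gzopen(path, "wb6"); // gzip output, compression level 6
    if (out == NULL) return false;
    while (length > 0)
    {
        // gzwrite takes an unsigned length, so write large images in chunks.
        unsigned chunk = length > (1u << 20) ? (1u << 20) : (unsigned)length;
        if (gzwrite(out, data, chunk) != (int)chunk) { gzclose(out); return false; }
        data += chunk;
        length -= chunk;
    }
    return gzclose(out) == Z_OK;
}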
David Matthews has suggested the save state code detect whether a state is compressed or not on loading and act accordingly, which would provide a nice transition path for users with existing saved states.
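Detection should be cheap: gzip output starts with the two magic bytes 0x1f 0x8b, so the loader could sniff those and otherwise fall back to the existing uncompressed path. A rough sketch, with hypothetical names:

#include <cstdio>

enum SavedStateFormat { FORMAT_UNCOMPRESSED, FORMAT_GZIP, FORMAT_UNREADABLE };

// Sniff the first two bytes: 0x1f 0x8b is the gzip magic number; anything
// else is assumed to be the existing uncompressed saved-state format.
static SavedStateFormat detectSavedStateFormat(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL) return FORMAT_UNREADABLE;
    unsigned char magic[2];
    size_t n = fread(magic, 1, sizeof magic, f);
    fclose(f);
    if (n == 2 && magic[0] == 0x1f && magic[1] == 0x8b) return FORMAT_GZIP;
    return FORMAT_UNCOMPRESSED;
}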
I'm interested in hearing others' opinions about this proposal. If we feel this is something reasonable to do, I'll work on implementing this. When replying to this email, please CC me directly as I'm not subscribed to the list.
Thanks, Matthew
I received several off-list replies to this, including a pointer to previous experiments by Data61 colleagues that concluded LZO [0] was preferable to Gzip, due to faster compression and decompression times. For the representative heap below, LZO produces a 490M output. Compression and decompression take only a couple of seconds, compared with ~40s for compression and ~9s for decompression with Gzip. LZO also has the advantage of almost zero memory overhead for both operations. These measurements are fairly coarse grained, using the default settings of the respective command-line tools, but I expect them to be representative.
LZO looks to be available for most platforms, and it is also ANSI C and GPLed, so we could include it in-tree. I do not know what people's preferences are in this area.
From my perspective, LZO seems the clear front runner for now. Please let me know if you have an alternative suggestion or if you outright object to compressing saved states.
[0]: http://www.oberhumer.com/opensource/lzo/
On 12/02/16 09:11, Matthew Fernandez wrote:
Hi Poly folk,
I am an Isabelle/HOL user interested in reducing the on-disk size of my Isabelle/HOL heaps. These are read and written via the save state functionality in Poly/ML. Saved states compress very well via conventional algorithms. On a representative Isabelle/HOL heap I see the following:
$ ls -lh
-r--r--r-- 1 matthew matthew 1.7G Feb 10 11:22 MyHeap
-r--r--r-- 1 matthew matthew 279M Feb 10 11:22 MyHeap.bz2
-r--r--r-- 1 matthew matthew 331M Feb 10 11:22 MyHeap.gz
-r--r--r-- 1 matthew matthew 173M Feb 10 11:22 MyHeap.xz
I would like to propose integrating compression into the save state functionality itself. The runtimes of Bzip2 and XZ are probably slow enough to irritate some users, but Gzip is reasonably fast for the benefit it provides. Gzipping saved states would require linking against zlib or something similar.
David Matthews has suggested the save state code detect whether a state is compressed or not on loading and act accordingly, which would provide a nice transition path for users with existing saved states.
I'm interested in hearing others' opinions about this proposal. If we feel this is something reasonable to do, I'll work on implementing this. When replying to this email, please CC me directly as I'm not subscribed to the list.
Thanks, Matthew
On Sun, 14 Feb 2016, Matthew Fernandez wrote:
I received several off-list replies to this, including a pointer to previous experiments by Data61 colleagues
How is this related to the experiment with heap hierarchies? I.e. a stack of heaps according to the session parent structure.
Makarius
On Sunday, 14 February 2016, Makarius <makarius at sketis.net> wrote:
On Sun, 14 Feb 2016, Matthew Fernandez wrote:
I received several off-list replies to this, including a pointer to previous experiments by Data61 colleagues
How is this related to the experiment with heap hierarchies? I.e. a stack of heaps according to the session parent structure.
They are essentially orthogonal. AIUI the heap hierarchies change involved modifying Isabelle to use existing Poly/ML support for saved states that reference others, but not modifying Poly/ML itself. In contrast, this proposed change is purely to Poly/ML. Both have the effect of reducing disk usage and their effects are cumulative.
Makarius
On 14/02/2016 21:13, Matthew Fernandez wrote:
On Sunday, 14 February 2016, Makarius <makarius at sketis.net> wrote:
On Sun, 14 Feb 2016, Matthew Fernandez wrote:
I received several off-list replies to this, including a pointer to previous experiments by Data61 colleagues
How is this related to the experiment with heap hierarchies? I.e. a stack of heaps according to the session parent structure.
Makarius: are you referring to the Poly/ML saved state mechanism or something else?
They are essentially orthogonal. AIUI the heap hierarchies change involved modifying Isabelle to use existing Poly/ML support for saved states that reference others, but not modifying Poly/ML itself. In contrast, this proposed change is purely to Poly/ML. Both have the effect of reducing disk usage and their effects are cumulative.
As far as I'm aware Isabelle just uses the shared state facilities of Poly/ML. That provides the ability to save states that are direct descendants of the executable or descendants of another saved state.
From a quick look at the code the main effect that child states have is that StateLoader::LoadFile needs to seek within the saved state file to get the name of the parent file. That has to be loaded before the child because the child may, almost certainly will, overwrite some of the parent data. That may affect how you compact the data. How well do the compression libraries cope with seeking within the file?
From my own point of view I'm concerned that compacting the heap files may add to the cost of loading. In my experimental IDE, while a file is being edited a lot of saved states are loaded and reloaded repeatedly to provide the context. I'd like to see what effect adding compaction has on it but it may be necessary to provide separate functions to save states with and without compaction. Loading is easier because the loader can look at the format. Note that there's no need to provide backwards compatibility because a saved state can only ever be reloaded into the executable that created it.
David
On Mon, 15 Feb 2016, David Matthews wrote:
On 14/02/2016 21:13, Matthew Fernandez wrote:
On Sunday, 14 February 2016, Makarius <makarius at sketis.net> wrote:
On Sun, 14 Feb 2016, Matthew Fernandez wrote:
I received several off-list replies to this, including a pointer to previous experiments by Data61 colleagues
How is this related to the experiment with heap hierarchies? I.e. a stack of heaps according to the session parent structure.
Makarius: are you referring to the Poly/ML saved state mechanism or something else?
Yes, but Isabelle puts a certain interpretation on top of it. The terms "session" and "heap" are from that vocabulary.
As far as I'm aware Isabelle just uses the shared state facilities of Poly/ML. That provides the ability to save states that are direct descendants of the executable or descendants of another saved state.
It has fallen a bit behind in recent years. Most notably, heaps are not hierarchic: after loading a saved state and adding to it, an independent saved state is produced with all content taken together. This can add up to rather large heaps, especially for structured developments like Isabelle examples + AFP taken together.
People at the former NICTA, now Data61, have recently started to reduce heap file sizes. So my questions ultimately go in this direction: what are typical heap sizes seen in these big L4.verified applications? What is the "before" and "after" situation?
From my own point of view I'm concerned that compacting the heap files may add to the cost of loading.
I also remember that concern from ancient SML/NJ times, when it was still used in practice and Poly/ML was not freely available. SML/NJ heaps were much bigger than Poly/ML's, and people started to think about implicit compression. Luckily such complications were avoided, because Poly/ML became freely available.
In my experimental IDE, while a file is being edited a lot of saved states are loaded and reloaded repeatedly to provide the context. I'd like to see what effect adding compaction has on it but it may be necessary to provide separate functions to save states with and without compaction. Loading is easier because the loader can look at the format. Note that there's no need to provide backwards compatibility because a saved state can only ever be reloaded into the executable that created it.
There is definitely a time vs. space trade-off.
My general impression is that loading and saving of states consumes significant time for highly structured applications like AFP. I have started thinking about avoiding saving states in the first place: just plough through all Isabelle sessions of AFP (excluding the "slow" group) in one big Poly/ML process.
Compression can take a lot of additional time, but there is also the paradoxical situation of SSDs (solid-state disks, which are not disks at all): heaps that require many GBs on conventional hard-disks may still fit into small SSDs and thus save/load much faster.
Makarius
On 16/02/16 00:54, Makarius wrote:
People at the former NICTA, now Data61, have recently started to reduce heap file sizes. So my questions ultimately go in this direction: what are typical heap sizes seen in these big L4.verified applications? What is the "before" and "after" situation?
I wasn't at the coalface of that work, so I'm not well placed to comment on it. My current attempts at reducing heap sizes may be inconsequential to others. The sizes of the heaps I see are typically around a few GBs at most, but I have many of them hosted on a small SSD in one of my machines so a factor of 4 compression is a giant productivity boost for me there. From what I understand, the most punishing L4.verified session is currently CRefine, which I have rarely had cause to build in recent times.
On 16/02/16 00:12, David Matthews wrote:
From a quick look at the code the main effect that child states have is that StateLoader::LoadFile needs to seek within the saved state file to get the name of the parent file. That has to be loaded before the child because the child may, almost certainly will, overwrite some of the parent data. That may affect how you compact the data. How well do the compression libraries cope with seeking within the file?
Admittedly this is not something I had thought to look for, and now that I do I note there are seeks performed during state *saving* as well where Poly/ML overwrites data at the start of the stream. A cursory glance at the LZO API makes it seem like a requirement to seek the stream may well be a deal breaker... Rafal Kolanski, can you comment any more on this?
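To illustrate the shape of the problem, here is a caricature of the pattern in question (not the actual SaveRequest::Perform): write a placeholder header, stream out the sections, then seek back and patch the real offsets in. That last step has no direct equivalent in a sequential compressed stream.

#include <cstdio>
#include <vector>

// Caricature of the save pattern under discussion, not Poly/ML's real code:
// a placeholder header is written first, the sections follow, and the header
// is then patched in place by seeking back to the start of the file.
struct FileHeader { long sectionOffsets[4]; };

static bool saveWithPatchedHeader(FILE *f, const std::vector<std::vector<char>> &sections)
{
    FileHeader hdr = {};                      // offsets not yet known
    if (fwrite(&hdr, sizeof hdr, 1, f) != 1) return false;

    for (size_t i = 0; i < sections.size() && i < 4; i++)
    {
        hdr.sectionOffsets[i] = ftell(f);     // remember where this section starts
        if (fwrite(sections[i].data(), 1, sections[i].size(), f) != sections[i].size())
            return false;
    }

    // The step that a sequential gzip/LZO stream cannot express:
    if (fseek(f, 0, SEEK_SET) != 0) return false;
    return fwrite(&hdr, sizeof hdr, 1, f) == 1;
}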
WRT the trade offs between time and space in loading/saving, my brief experiments and others' advice strongly suggested to me that the time cost was negligible in relation to everything else going on in the system. However, it's difficult to quantify this in any meaningful way without an implementation on hand. I'm travelling right now and in and out of internet range, but I can put together something when I'm back that can serve as a drop-in replacement for you both and myself to try out our use cases. If no one has given a canonical answer one way or the other by then, this effort would also answer whether it is possible to seek these streams.
On 17/02/2016 01:14, Matthew Fernandez wrote:
On 16/02/16 00:12, David Matthews wrote:
From a quick look at the code the main effect that child states have is that StateLoader::LoadFile needs to seek within the saved state file to get the name of the parent file. That has to be loaded before the child because the child may, almost certainly will, overwrite some of the parent data. That may affect how you compact the data. How well do the compression libraries cope with seeking within the file?
Admittedly this is not something I had thought to look for, and now that I do I note there are seeks performed during state *saving* as well where Poly/ML overwrites data at the start of the stream. A cursory glance at the LZO API makes it seem like a requirement to seek the stream may well be a deal breaker... Rafal Kolanski, can you comment any more on this?
It may be possible to rework the code to avoid the seeks. Perhaps it would be easier to compact each section of the data separately rather than process the file as a whole, if that's possible. The seeks are just to move between sections.
Of more concern is that LZO is licensed under GPL rather than LGPL. Poly/ML is licensed under LGPL and that means that it cannot include or even link to LZO without coming under GPL. That doesn't preclude experimenting with it but for distribution I'd prefer a library that didn't have these problems.
David
On 17 Feb 2016, at 18:34, David Matthews <David.Matthews at prolingua.co.uk> wrote: ... Of more concern is that LZO is licensed under GPL rather than LGPL. Poly/ML is licensed under LGPL and that means that it cannot include or even link to LZO without coming under GPL. That doesn't preclude experimenting with it but for distribution I'd prefer a library that didn't have these problems.
It would cause me significant problems if the Poly/ML licence changed to GPL. I can't understand why the LZO developers chose GPL rather than LGPL, unless they hoped that decision would force people to buy their commercial offering (which I believe is a doomed business plan).
Regards,
Rob.
On 18/02/16 05:34, David Matthews wrote:
On 17/02/2016 01:14, Matthew Fernandez wrote:
On 16/02/16 00:12, David Matthews wrote:
From a quick look at the code the main effect that child states have is that StateLoader::LoadFile needs to seek within the saved state file to get the name of the parent file. That has to be loaded before the child because the child may, almost certainly will, overwrite some of the parent data. That may affect how you compact the data. How well do the compression libraries cope with seeking within the file?
Admittedly this is not something I had thought to look for, and now that I do I note there are seeks performed during state *saving* as well where Poly/ML overwrites data at the start of the stream. A cursory glance at the LZO API makes it seem like a requirement to seek the stream may well be a deal breaker... Rafal Kolanski, can you comment any more on this?
It may be possible to rework the code to avoid the seeks. Perhaps it would be easier to compact each section of the data separately rather than process the file as a whole, if that's possible. The seeks are just to move between sections.
Of more concern is that LZO is licensed under GPL rather than LGPL. Poly/ML is licensed under LGPL and that means that it cannot include or even link to LZO without coming under GPL. That doesn't preclude experimenting with it but for distribution I'd prefer a library that didn't have these problems.
David
I think the salient points at this stage are the following:
1. Poly/ML performs seeks on the save path to rewind the FD and update metadata including byte offsets of other sections in the file. Here I'm referring to SaveRequest::Perform.
2. LZO is GPL v2, while Poly/ML is LGPL v2.1. Thanks David and Rob for correcting me; I had misread the licence.
3. LZO streams do not appear to be seekable. Gzip streams seem seekable only for reading, and this is acknowledged to be slow.
The naïve ways I can see of working around (1) are either (a) construct the entire state in memory first then stream it out to a compressed file, (b) effectively run the save state logic twice to predict the offset values in the first pass so the second pass that does the actual writing can run linearly uninterrupted or (c) write the state out then compress it to a second file and delete the first. None of these are particularly palatable to me. David, you mentioned that it might be possible to avoid the seeks. Did you have a different idea?
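For concreteness, (c) would amount to roughly the following. This is only a hedged sketch: the helper name, chunk size and error handling are arbitrary, and the existing save code is left untouched.

#include <zlib.h>
#include <cstdio>

// Hedged sketch of option (c): compress an already written, uncompressed
// saved state into a second file and remove the original. Error handling
// is abbreviated; names and the 1MB chunk size are arbitrary.
static bool compressExistingState(const char *srcPath, const char *dstPath)
{
    FILE *in = fopen(srcPath, "rb");
    if (in == NULL) return false;
    gzFile out = gzopen(dstPath, "wb6");
    if (out == NULL) { fclose(in); return false; }

    static char buffer[1 << 20];
    size_t n;
    while ((n = fread(buffer, 1, sizeof buffer, in)) > 0)
        if (gzwrite(out, buffer, (unsigned)n) != (int)n)
        { fclose(in); gzclose(out); return false; }

    bool ok = !ferror(in);
    fclose(in);
    ok = (gzclose(out) == Z_OK) && ok;
    if (ok) remove(srcPath); // delete the uncompressed original
    return ok;
}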
As you've noted, there are also seeks on the load path, but to me this is a lesser hurdle to overcome than the seeks on the save path.
As for (2), the licensing issue... this appears to be a show stopper for using LZO. As I've said, I'm not wed to any particular compression algorithm, so I'm happy to revert to Gzip or another suggestion if there's one. For my own use case, my precious resources are RAM and disk space. Runtime is not a concern to me as this operation is already dwarfed by other things on my critical path. I suspect this is not the case for others, so it may make sense to implement this as an opt-in feature. As always, any and all comments welcome.
Matt,
On 25/02/16 12:20, Matthew Fernandez wrote:
I think the salient points at this stage are the following:
1. Poly/ML performs seeks on the save path to rewind the FD and update metadata including byte offsets of other sections in the file. Here I'm referring to SaveRequest::Perform.
This is not a problem because segments can be compressed separately. Seeks are only performed to metadata sections and segment starts, as far as I can tell. The vast majority of the data, within the segments, can be compressed without issues as they are not fseek targets.
I had experimented with segment compression earlier, but ran into mysterious segfaults and gave up.
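For reference, the rough shape of what I mean by per-segment compression is below. The descriptor is hypothetical, not Poly/ML's actual saved-state structures: the idea is to leave the metadata table uncompressed so the existing seeks keep working, and only run each segment's payload through zlib's one-shot API, recording both lengths.

#include <zlib.h>
#include <vector>

// Hypothetical descriptor, not Poly/ML's real layout: the metadata table
// containing these stays uncompressed so fseek to segment starts still works.
struct SegmentDescriptor
{
    unsigned long originalLength;    // length before compression
    unsigned long compressedLength;  // length as stored on disk
};

// Compress one segment's payload with zlib's one-shot API.
static bool compressSegment(const std::vector<char> &payload,
                            std::vector<char> &outBuffer,
                            SegmentDescriptor &desc)
{
    uLongf destLen = compressBound((uLong)payload.size());
    outBuffer.resize(destLen);
    if (compress2(reinterpret_cast<Bytef *>(outBuffer.data()), &destLen,
                  reinterpret_cast<const Bytef *>(payload.data()),
                  (uLong)payload.size(), Z_DEFAULT_COMPRESSION) != Z_OK)
        return false;
    outBuffer.resize(destLen);
    desc.originalLength = (unsigned long)payload.size();
    desc.compressedLength = (unsigned long)destLen;
    return true;
}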
2. LZO is GPL v2, while Poly/ML is LGPL v2.1. Thanks David and Rob for correcting me; I had misread the licence.
There are many alternatives to LZO, such as LZ4 (https://github.com/Cyan4973/lz4, BSD).
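The LZ4 block API is about as simple as it gets. A quick round-trip sketch (the buffer handling is illustrative; LZ4_compress_default and LZ4_decompress_safe are the real liblz4 entry points):

#include <lz4.h>
#include <vector>

// Round-trip a buffer through LZ4's block API. Everything other than the
// three liblz4 calls is illustrative scaffolding.
static bool lz4RoundTrip(const std::vector<char> &input)
{
    const int bound = LZ4_compressBound((int)input.size());
    std::vector<char> compressed(bound);
    const int clen = LZ4_compress_default(input.data(), compressed.data(),
                                          (int)input.size(), bound);
    if (clen <= 0) return false;

    std::vector<char> restored(input.size());
    const int dlen = LZ4_decompress_safe(compressed.data(), restored.data(),
                                         clen, (int)restored.size());
    return dlen == (int)input.size() && restored == input;
}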
3. LZO streams do not appear to be seekable. Gzip streams seem seekable only for reading, and this is acknowledged to be slow.
As for (1).
-- Japheth
On 25/02/2016 04:24, Japheth Lim wrote:
On 25/02/16 12:20, Matthew Fernandez wrote:
I think the salient points at this stage are the following:
1. Poly/ML performs seeks on the save path to rewind the FD and update metadata including byte offsets of other sections in the file. Here I'm referring to SaveRequest::Perform.
This is not a problem because segments can be compressed separately. Seeks are only performed to metadata sections and segment starts, as far as I can tell. The vast majority of the data, within the segments, can be compressed without issues as they are not fseek targets.
I had experimented with segment compression earlier, but ran into mysterious segfaults and gave up.
That's how I would have approached it. Is there any chance you could make your experiment available? It may be possible to find out what the segfaults are about. At the very least it would be something to build on.
David
On Thu, 25 Feb 2016, Matthew Fernandez wrote:
As for (2), the licensing issue... this appears to be a show stopper for using LZO. As I've said, I'm not wed to any particular compression algorithm, so I'm happy to revert to Gzip or another suggestion if there's one. For my own use case, my precious resources are RAM and disk space. Runtime is not a concern to me as this operation is already dwarfed by other things on my critical path.
From what I have seen in recent years, xz and 7z appear to be rather good compression algorithms, but also a bit slow.
7z at least seems to support multithreading: today that should be standard for any time-consuming algorithm, but it still isn't.
Makarius
On 25/02/2016 01:20, Matthew Fernandez wrote:
The naïve ways I can see of working around (1) are either (a) construct the entire state in memory first then stream it out to a compressed file, (b) effectively run the save state logic twice to predict the offset values in the first pass so the second pass that does the actual writing can run linearly uninterrupted or (c) write the state out then compress it to a second file and delete the first. None of these are particularly palatable to me. David, you mentioned that it might be possible to avoid the seeks. Did you have a different idea?
As you've noted, there are also seeks on the load path, but to me this is a lesser hurdle to overcome than the seeks on the save path.
The main use of seeks in the code is to get the parent name and, if necessary, to change it. ShowParent seeks to the file name to read and return it, RenameParent writes a new one and updates the location on disc.
If the loading sequence exactly mirrored the saving there would be no need for seeks for the relocations. It might be possible to make that change but I'm rather reluctant to change the code until we know that it would be worthwhile.
David
On Mon, 15 Feb 2016, David Matthews wrote:
As far as I'm aware Isabelle just uses the shared state facilities of Poly/ML. That provides the ability to save states that are direct descendants of the executable or descendants of another saved state.
I've just changed that to use PolyML.SaveState.saveChild and PolyML.SaveState.loadHierarchy -- see http://isabelle.in.tum.de/repos/isabelle/rev/0189fe0f6452
From a quick look at the code the main effect that child states have is that StateLoader::LoadFile needs to seek within the saved state file to get the name of the parent file. That has to be loaded before the child because the child may, almost certainly will, overwrite some of the parent data. That may affect how you compact the data. How well do the compression libraries cope with seeking within the file?
From my own point of view I'm concerned that compacting the heap files may add to the cost of loading.
I'd like to see what effect adding compaction has on it but it may be necessary to provide separate functions to save states with and without compaction. Loading is easier because the loader can look at the format. Note that there's no need to provide backwards compatibility because a saved state can only ever be reloaded into the executable that created it.
Overall I don't quite see the benefit of spending a lot of run-time and implementation complexity to compress a few GB. Even a small SSD already has 64-128 GB.
What I often see is that a 32bit poly process is unable to save a heap due to memory shortage. Adding compression in the same process address space would probably make the situation worse.
Note that for the present Isabelle setup, it is trivial to add heap compression outside the poly process, operating on the already saved heap file. What remains is the problem of loading it again without decompressing into a separate file.
Anyway, what is the state of more detailed measurements of the application?
It would also be worthwhile to step back a bit and ask why the heaps are so large. There might be some Isabelle/ML programming mishaps in the application, with persistent values that refer to Context.theory instead of Context.theory_id. Standard containers like the Simplifier or Classical context take care of that, but add-ons might not.
Makarius
Apologies for my radio silence on this; I got preempted by some other pressing things.
I'm still in the dark about the Data61 experiments and have no details to add there. My own pressing need for this has evaporated, so I'm unlikely to implement this unless someone else urgently needs it. Please speak up if you do. Some further replies to Makarius are inline below.
On 17 March 2016 at 09:58, Makarius <makarius at sketis.net> wrote:
Overall I don't quite see the benefit of spending a lot of run-time and implementation complexity to compress a few GB. Even a small SSD already has 64-128 GB.
The machine in question at the time had a 128GB SSD that was always on the brink of being full. Anything that gained me >500MB was most welcome. Whether this is a niche scenario, I do not know.
What I often see is that a 32bit poly process is unable to save a heap due to memory shortage. Adding compression in the same process address space would probably make the situation worse.
Quite possible. I do not know as I exclusively run 64-bit Poly.
Note that for the present Isabelle setup, it is trivial to add heap compression outside the poly process, operating on the already saved heap file. What remains is the problem of loading it again without decompressing into a separate file.
Yes, you're right. I would be open to this solution as well, but as I'm more proficient in C++ than Scala or ML, I looked to the Poly/ML source first.
Anyway, what is the state of more detailed measurements of the application?
It would also be worthwhile to step back a bit and ask why the heaps are so large. There might be some Isabelle/ML programming mishaps in the application, with persistent values that refer to Context.theory instead of Context.theory_id. Standard containers like the Simplifier or Classical context take care of that, but add-ons might not.
You are almost certainly correct. L4.verified is a large repository of research artefacts. Many things in there have been written in haste by single authors without code review. Some of the authors are juniors or people like myself with comparatively shallow exposure to Isabelle, and we don't always make the best implementation choices. I expect these details will not surprise anyone from an academic background. It would be nice to do a bit more tuning, but there are only so many hours in the day.