Persistence

MAIN GOALS

What are the goals of adding persistence to Galaxy?

  • Stopping, restarting, and refocusing computations.

    This is best described by an example. Imagine running a Galaxy simulation, stopping at an arbitrary point, saving the state, and then continuing at a later date. Beyond stopping and restarting computations, we could also change the focus of the calculation (e.g. change the frontier, or some other setting) and then continue from a saved point.

  • Archiving objects and results. This is useful for several reasons:
    • We can save time and cycles by avoiding re-computation of results, or reconstruction of objects. For example, a DataTree constructed from a large data set could be reused by several different simulations, but would only ever need to be constructed once.

    • Verification and validation of scientific results. We should be able to know how we constructed an object, so we are able to verify, validate and possibly repeat experiments.

    • Dating and witnessing of results. The system could sign and date the creation time of objects and results, providing evidence of IP rights. This is probably not something that would be part of the system from the beginning, but is nevertheless an interesting concept.

APPROACH

The current plan is to build the functionality in an ad hoc manner into Galaxy, so that we can better understand the issues and difficulties. Perhaps then we can step back and consider how one would incorporate this functionality into a system in a more generic fashion.

SPECULATIVE CLI SESSION

What would this mean in terms of system interaction? Here are some *very* speculative CLI session extracts:

  >set datatree = new DataTree(datastream, tessellation, distfactory)
  >psave datatree
  >quit

A few weeks later....

  >prestore datatree
  >set sim = new MultiLevelSimulation(datatree, ...)
  > ....

Or, if we wanted to know how we created the data tree:

  >pinfo datatree
  Created: January 12th 1979
  Created by: weinberg
  Type: DataTree*
  Galaxy Version: 0.9.45
  UPO ID: 17938
  Components used to build: datastream, tessellation, distfactory
  Dependent on: tessellation
  Objects dependent on datatree: simulation3, alistairssim...
  
  >pinfo tessellation  

  ....

Or, maybe we could export the object and send it to somebody else. This would have to include dependencies too, so that the person at the other end could successfully reconstruct the object.

  >pexport datatree XMLFormat "datatree.xml"
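
The requirement that an export carries its dependencies can be sketched as a walk over the dependency graph. The following is a minimal C++ illustration, not part of Galaxy: the names DependencyMap, collect and exportSet are hypothetical, and the dependency data is assumed to be available as a simple name-to-names map.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical representation: object name -> names of the objects it was built from.
using DependencyMap = std::map<std::string, std::vector<std::string>>;

// Depth-first walk that appends each dependency before the object that needs it.
static void collect(const std::string& name, const DependencyMap& deps,
                    std::set<std::string>& seen, std::vector<std::string>& order) {
    if (!seen.insert(name).second) return;  // already visited
    auto it = deps.find(name);
    if (it != deps.end())
        for (const std::string& d : it->second) collect(d, deps, seen, order);
    order.push_back(name);
}

// Everything that must travel with `root`: dependencies first, root last.
std::vector<std::string> exportSet(const std::string& root, const DependencyMap& deps) {
    std::set<std::string> seen;
    std::vector<std::string> order;
    collect(root, deps, seen, order);
    return order;
}
```

Emitting dependencies before the objects that use them means the receiver could rebuild each object in the order it reads them.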

There comes a time when certain objects are no longer required. However, we must make sure consistency is maintained:

  >premove tessellation
  This object cannot be deleted.  The following objects depend on this object:
   datatree
  
  >premove datatree
  datatree was deleted.
  >premove tessellation
  tessellation was deleted.
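
The consistency rule behind this session could look roughly like the sketch below. It is an in-memory illustration only: UpoStore and its methods are hypothetical names, and a real store would keep this information on disk or in a database.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical in-memory model of the object store; all names are illustrative.
class UpoStore {
public:
    // Register an object together with the names of the objects it was built from.
    void save(const std::string& name, const std::vector<std::string>& dependsOn = {}) {
        deps_[name] = std::set<std::string>(dependsOn.begin(), dependsOn.end());
    }

    // Names of stored objects that list `name` as a dependency.
    std::vector<std::string> dependents(const std::string& name) const {
        std::vector<std::string> out;
        for (const auto& entry : deps_)
            if (entry.second.count(name)) out.push_back(entry.first);
        return out;
    }

    // Remove only when nothing else depends on the object; returns true on success.
    bool remove(const std::string& name) {
        if (!deps_.count(name) || !dependents(name).empty()) return false;
        deps_.erase(name);
        return true;
    }

    bool contains(const std::string& name) const { return deps_.count(name) != 0; }

private:
    std::map<std::string, std::set<std::string>> deps_;  // object -> what it depends on
};
```

With tessellation and datatree saved as in the session, remove("tessellation") is refused until datatree has been removed, and dependents("tessellation") supplies the list shown in the refusal message.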

USER PERSISTENT OBJECTS

The main new concept is the User Persistent Object, or UPO. I've found it hard to pin down a complete definition of a UPO, so bear with me in this description. UPOs are different from C++ objects, and from the traditional persistent objects of orthogonal persistence. UPOs are the abstract objects that a user deals with when interacting with the system - in Galaxy these correspond almost exactly to the objects that can be created in the CLI. So UPOs are tessellations, data trees, simulations, streams (buffered), stream filters, and models. Some examples of objects in Galaxy that are not UPOs: Nodes, MethodTable, SymbolTable, clivector. It is important to realize that we are not keeping track of all C++ objects, and that only a few of the C++ classes would correspond to UPOs. The user is only interested in the high-level objects related to the computation or experiment being performed.

Eliot used the term "objects with semantic integrity" when describing UPOs. I'm not comfortable with this term because I'm not completely sure of its meaning. Perhaps someone can help me here?

IMMUTABILITY

Maintaining dependencies is much more straightforward when UPOs are immutable. The main advantage is that an object doesn't become invalid because something it was built from has changed. Some objects can be modeled very naturally as immutable objects. In Galaxy, tessellations are a great example of this -- once built, their structure does not change. However, some objects are inherently mutable -- a good example would be an output stream, which changes every time a new record is appended.

We can get around this problem of mutable objects if we shift our requirement of immutability from objects to object states. We can model a mutable object as a series of immutable states, and a truly immutable object as an object with a single state. If we had to record the state every time a mutable object changed, this approach would not be feasible. Fortunately, we are only interested in having a record at certain points in history - specifically at the points when the user requests that the state is saved.

We can reduce the amount of data used when storing mutable object states by storing them as differences from another state. It might take a little longer to reconstruct object states, but this will be essential for objects such as streams, where the mutation is an append operation.
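
For an append-only stream, storing states as differences might be sketched as follows. This is an assumption-laden illustration: StreamHistory is a hypothetical class, records are plain strings, and each saved state holds only the records appended since the previous save.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical history of an append-only stream. Each saved state stores only
// the records appended since the previous saved state.
class StreamHistory {
public:
    // Append a record to the live (not yet saved) stream.
    void append(const std::string& record) { pending_.push_back(record); }

    // Freeze the current contents as a new immutable state; returns its index.
    std::size_t saveState() {
        deltas_.push_back(pending_);
        pending_.clear();
        return deltas_.size() - 1;
    }

    // Rebuild the full contents of a saved state by replaying deltas 0..state.
    std::vector<std::string> reconstruct(std::size_t state) const {
        std::vector<std::string> records;
        for (std::size_t i = 0; i <= state && i < deltas_.size(); ++i)
            records.insert(records.end(), deltas_[i].begin(), deltas_[i].end());
        return records;
    }

private:
    std::vector<std::vector<std::string>> deltas_;  // one delta per saved state
    std::vector<std::string> pending_;              // appended since the last save
};
```

Reconstruction replays every delta up to the requested state, which is the extra reconstruction time mentioned above; the saving is that no record is ever stored twice.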

PROPERTIES OF OBJECTS

UPOs will have certain generic properties common to all of them. This is probably an incomplete list:

  • Date of creation (possibly signed).
  • Creator.
  • Class (type).
  • Superclass (really a property of the class).
  • Version numbers for pertinent source files, or at least a version number for the whole system.

They will also have properties specific to their type:

  • Other UPOs this UPO depends on.
  • Other UPOs that depend on this UPO.
  • Parameters to constructor, or method call that prompted their creation.
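
As a rough sketch, the two property lists could be gathered into one metadata record per UPO. The struct and field names below are hypothetical, not Galaxy's, and describe() merely imitates part of the speculative pinfo output shown earlier.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical metadata record; field names are illustrative only.
struct UpoInfo {
    // Generic properties, common to every UPO.
    std::string created;        // creation date, possibly signed
    std::string creator;
    std::string type;           // class name, e.g. "DataTree"
    std::string systemVersion;  // version number for the whole system

    // Properties whose content differs from object to object.
    std::vector<std::string> dependsOn;        // UPOs this one depends on
    std::vector<std::string> dependents;       // UPOs that depend on this one
    std::vector<std::string> constructorArgs;  // parameters that prompted creation
};

// Join names with ", " for display.
static std::string join(const std::vector<std::string>& names) {
    std::string out;
    for (std::size_t i = 0; i < names.size(); ++i) {
        if (i > 0) out += ", ";
        out += names[i];
    }
    return out;
}

// Render part of the record in the style of the speculative pinfo output.
std::string describe(const UpoInfo& info) {
    return "Created: " + info.created + "\n" +
           "Created by: " + info.creator + "\n" +
           "Type: " + info.type + "\n" +
           "Galaxy Version: " + info.systemVersion + "\n" +
           "Dependent on: " + join(info.dependsOn) + "\n";
}
```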

NAMING, SEARCHING, FINDING, AND STORING OBJECTS

There are many possible ways of archiving objects - we could use a database, a mixture of the file system and a database, or the file system alone. There are advantages and disadvantages to all of these approaches. Also, there are plenty of representations we can use to store object content - CDF(?), XML, CORBA....

EVOLUTION OF DATA STORAGE FORMAT

While not a problem that will be encountered during initial development, it is prudent to also consider how objects and their data storage format will evolve. It is hoped that the abstract nature of UPOs will mean that implementation changes will not always lead to a changed storage format. For example, while the way that tessellations are implemented might change, the way that a tessellation is described in a file need not.

Realistically, however, there will be points when the storage format needs to change. For minor changes it might be possible to provide converters that upgrade stored objects, but where major changes are required even this may not be possible, because there is no obvious mapping from the old format to the new one.
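
One way to picture such converters is as a chain of small upgrade steps, one per format version. The sketch below assumes, purely for illustration, that formats are numbered with integers and that stored content is a string; FormatUpgrader is a hypothetical name. A missing step corresponds to a change with no obvious mapping.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical registry: the converter stored under v upgrades a payload
// from format version v to format version v + 1.
class FormatUpgrader {
public:
    using Converter = std::function<std::string(const std::string&)>;

    void registerConverter(int fromVersion, Converter c) {
        converters_[fromVersion] = std::move(c);
    }

    // Apply converters one step at a time until the payload reaches `target`.
    // Throws if a step is missing, i.e. no mapping exists for that change.
    std::string upgrade(std::string payload, int from, int target) const {
        for (int v = from; v < target; ++v) {
            auto it = converters_.find(v);
            if (it == converters_.end())
                throw std::runtime_error("no converter from format " + std::to_string(v));
            payload = it->second(payload);
        }
        return payload;
    }

private:
    std::map<int, Converter> converters_;
};
```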

GARBAGE COLLECTION

There will be issues of garbage collection to think about at some point. Obviously, we can't just keep on saving objects forever - this will eventually swamp the disks. Garbage collection will be user driven at the highest level - we can't just start deleting objects the user has specifically saved away. However, there will be occasions where UPOs contain references to other UPOs that were not explicitly created by the user (e.g. a Frontier UPO being saved when a Simulation state is saved). Once nothing the user saved refers to them any longer, these become orphaned objects that we will have to clear up.
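
Finding orphans could be treated as a reachability problem: anything the user saved explicitly is a root, and anything not reachable from a root is a candidate for collection. The sketch below is a minimal illustration under that assumption; StoredUpo, savedByUser and findOrphans are hypothetical names.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical store entry.
struct StoredUpo {
    bool savedByUser;                    // true if the user explicitly saved it
    std::vector<std::string> dependsOn;  // UPOs this one refers to
};

// Names of stored objects that are neither user-saved nor reachable
// from any user-saved object.
std::set<std::string> findOrphans(const std::map<std::string, StoredUpo>& store) {
    std::set<std::string> live;
    std::vector<std::string> work;

    // Roots: everything the user asked to keep.
    for (const auto& entry : store)
        if (entry.second.savedByUser) work.push_back(entry.first);

    // Mark everything reachable from the roots.
    while (!work.empty()) {
        std::string name = work.back();
        work.pop_back();
        if (!live.insert(name).second) continue;  // already marked
        auto it = store.find(name);
        if (it == store.end()) continue;
        for (const std::string& d : it->second.dependsOn) work.push_back(d);
    }

    // Whatever was never marked is an orphan.
    std::set<std::string> orphans;
    for (const auto& entry : store)
        if (!live.count(entry.first)) orphans.insert(entry.first);
    return orphans;
}
```

A Frontier saved implicitly with a Simulation stays live for as long as that Simulation is stored; it only shows up as an orphan after the user removes the Simulation.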

There is also the issue of maintaining consistency in the UPO store. We must protect the user from deleting UPOs that are required by other UPOs.


Send suggestions, questions, and feedback to WEINBERG at ASTRO dot UMASS dot EDU.
Documentation generated at Fri Mar 26 00:35:11 2010 by doxygen