Visualisation development — a Java prototype

The following is a technical summary and analysis of a Java prototype developed by BODC in 2003-4. The prototype (JSerplo) demonstrated that Java provided the facilities and performance to support our visualisation operations. GUI elements from the prototype, combined with development work on our software's data manipulation capabilities, led to Edserplo, in use since 2005.

The following is targeted at readers who have a sound knowledge of Object-Oriented software engineering.

Introduction

BODC data scientists used visualisation software to quality control oceanographic data. Suspect data are highlighted but the actual values are not changed.

Name   | Data type           | Languages                   | Year deployed
Serplo | Time/depth series   | Fortran 77, Fortran 90      | 1989
EDTEVA | Tide gauge data     | Fortran 77, Fortran 90, SQL | 1993
Waview | Spectral wave data  | C++, Fortran 77             | 1996
Xerplo | 2D time series data | C++, Fortran 77             | 1997

The visualisation programs above depended on Silicon Graphics hardware. The falling cost of hardware made it prudent to change to Linux or Windows. 

Design objectives

  • Adopt an integrated approach allowing the simultaneous replacement of the four visualisation programs with one application, reducing overall effort.
  • Provide a modular system for handling file formats.
  • New ability to capture state at the conclusion of a session and re-open with a particular configuration.
  • Allow dynamic generation of data series and subsequent output with the option to view or not to view.
  • New ability to allow the flagging of outliers (points divorced from the main body of data) on scatter plots.
  • Dynamic generation of derived channels, in particular tides and residuals.
  • Ability to tidally analyse non-port data, e.g. current meter data, including 2-dimensional data, such as acoustic doppler current profiler (ADCP) data.
  • Ability to select subsets of bins.
  • Ability to display ADCP scatter plots for separate bins.
  • Introduce a standardised mechanism for dealing with selection. Selection ranges over
    • series
    • channels
    • bins
    • ports
    • casts
    • wave histogram blocks
    • cells
  • Implicit selection of data cycles via "blocking". Selection emerges at three levels and has to be supported efficiently.

Series, channels and selection

Each series is an instance of the series class, itself an incarnation of the BODC series model. Each series and each channel has a selection switch. A SeriesChannel object has two switches, one from each parent (its series and its channel).

If a channel is deselected, all associated SeriesChannel objects are immediately deselected, because they hold a reference to the same selection object. Similarly, if a series is deselected, all of its SeriesChannel objects are automatically deselected, since they share the selection object of that series.
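
The shared-reference mechanism can be sketched as follows. This is a minimal illustration, not the JSerplo source: the class names Selection and SeriesChannelSketch and their methods are assumptions made for the example.

```java
// Hypothetical sketch: names are illustrative, not taken from JSerplo.
final class Selection {
    private boolean selected = true;
    boolean isSelected() { return selected; }
    void setSelected(boolean selected) { this.selected = selected; }
}

final class SeriesChannelSketch {
    private final Selection fromSeries;   // the very object held by the parent series
    private final Selection fromChannel;  // the very object held by the parent channel

    SeriesChannelSketch(Selection fromSeries, Selection fromChannel) {
        this.fromSeries = fromSeries;
        this.fromChannel = fromChannel;
    }

    // Selected only while both parents are selected; no notification is needed
    // because the switches are shared by reference.
    boolean isSelected() {
        return fromSeries.isSelected() && fromChannel.isSelected();
    }
}

final class SelectionDemo {
    public static void main(String[] args) {
        Selection series = new Selection();
        Selection temperature = new Selection();
        SeriesChannelSketch sc = new SeriesChannelSketch(series, temperature);
        System.out.println(sc.isSelected());   // true
        temperature.setSelected(false);        // deselect the channel
        System.out.println(sc.isSelected());   // false
    }
}
```

Because nothing is copied, deselecting a parent costs one assignment however many SeriesChannel objects depend on it.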

Each channel is identified uniquely by name, type, and rank. Each series is uniquely identified by internal name, external name and also ordinal within set (so that repeated instances of the same series can be distinguished as can similar data from the same port emanating from the same file, as happens with tide gauge data).
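
Identity by name, type and rank maps naturally onto a value class with equals() and hashCode(). The sketch below assumes string-valued name and type and an integer rank; the class name ChannelKey and the field types are hypothetical and do not describe the actual ChannelId class.

```java
import java.util.Objects;

// Hypothetical value class: field types are assumptions for illustration.
final class ChannelKey {
    private final String name;
    private final String type;
    private final int rank;

    ChannelKey(String name, String type, int rank) {
        this.name = Objects.requireNonNull(name);
        this.type = Objects.requireNonNull(type);
        this.rank = rank;
    }

    // Two keys are equal only if name, type and rank all match.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof ChannelKey)) return false;
        ChannelKey k = (ChannelKey) other;
        return rank == k.rank && name.equals(k.name) && type.equals(k.type);
    }

    // Consistent with equals(), so keys can be used in a HashMap or HashSet.
    @Override
    public int hashCode() {
        return Objects.hash(name, type, rank);
    }
}
```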

ChannelSilo, IO and option handling

JSerplo's data ingester reads the options and files listed in the driver file, determines each file's format and reads the data into series objects, cloning the DriverOptions object into each series. The series are in turn collected into a ChannelSilo object, which holds all the series data, iterators and other page-specific information. The ChannelSilo object can be serialised and takes the place of the dump file when operating in 'EDTEVA' mode.
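
Serialising a container object in place of a dump file can be done with standard Java object serialisation. The following is a sketch only: SiloState and its fields are hypothetical stand-ins, and it writes to a byte array rather than to a file to stay self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a serialisable session container.
final class SiloState implements Serializable {
    private static final long serialVersionUID = 1L;

    final List<String> seriesNames = new ArrayList<>();
    int currentSeries;   // example of session state worth restoring

    // Write the whole object graph to a byte array.
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(this);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Rebuild the object from bytes produced by toBytes().
    static SiloState fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (SiloState) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Could not restore state", e);
        }
    }
}
```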

The ChannelSilo object is passed to the page constructors. Each page derives an iterator by pruning those series that cannot be expressed through that page. Thus if the series does not have a time channel it cannot be displayed by TimPage. If the series does not have a second dimension it cannot be expressed through the TCADPage. If there is no (rank-1) latitude and longitude, there cannot be a series track map.
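
The pruning step amounts to keeping only those series that carry every channel a page requires. A minimal sketch, in which SeriesStub, the string channel identifiers and the prune() method are all assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PagePruning {
    // Hypothetical stand-in for a series: a name plus the channels it carries.
    static final class SeriesStub {
        final String name;
        final Set<String> channelIds;

        SeriesStub(String name, String... channelIds) {
            this.name = name;
            this.channelIds = new HashSet<>(Arrays.asList(channelIds));
        }
    }

    // Keep only the series that hold every channel named in the page's criterion.
    static List<SeriesStub> prune(List<SeriesStub> all, Set<String> criterion) {
        List<SeriesStub> viewable = new ArrayList<>();
        for (SeriesStub s : all) {
            if (s.channelIds.containsAll(criterion)) {
                viewable.add(s);
            }
        }
        return viewable;
    }

    public static void main(String[] args) {
        List<SeriesStub> all = Arrays.asList(
                new SeriesStub("mooring", "time", "speed", "direction"),
                new SeriesStub("profile", "depth", "temperature"));
        Set<String> timePageCriterion = new HashSet<>(Arrays.asList("time"));
        for (SeriesStub s : prune(all, timePageCriterion)) {
            System.out.println(s.name);   // prints "mooring"
        }
    }
}
```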

Channel aliasing and derived channels

The derived channels can be calculated and the aliased channels generated.

Examples of derived channels include: Cartesian current vectors (derived from speed and direction) and residual currents or sea levels (after the tidal signal has been removed). These channels do not belong to the series but reside as separate objects within the ChannelSilo. They incorporate the SeriesId object appropriate to the series to which they relate. Aliased channels share arrays with their counterparts within the series objects.
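
As an illustration of a derived channel, the sketch below converts speed and direction to Cartesian components. It assumes direction is given in degrees clockwise from north, so that east = speed × sin(direction) and north = speed × cos(direction); that convention, the class name and the method names are assumptions, not taken from JSerplo. The main method also shows what array sharing means for an alias.

```java
final class DerivedChannels {
    // Returns {east, north} components, allocated as new arrays (a derived channel).
    // Assumes direction in degrees clockwise from north.
    static double[][] toCartesian(double[] speed, double[] directionDeg) {
        if (speed.length != directionDeg.length) {
            throw new IllegalArgumentException("Channels differ in length");
        }
        double[] east = new double[speed.length];
        double[] north = new double[speed.length];
        for (int i = 0; i < speed.length; i++) {
            double rad = Math.toRadians(directionDeg[i]);
            east[i] = speed[i] * Math.sin(rad);
            north[i] = speed[i] * Math.cos(rad);
        }
        return new double[][] {east, north};
    }

    public static void main(String[] args) {
        double[] speed = {1.0, 2.0};
        double[] direction = {90.0, 180.0};
        double[][] en = toCartesian(speed, direction);
        System.out.printf("east=%.3f north=%.3f%n", en[0][0], en[1][0]);

        // An alias holds a reference to the same array, so no data are copied
        // and a change made through either name is seen through both.
        double[] alias = speed;
        alias[0] = 5.0;
        System.out.println(speed[0] == 5.0);
    }
}
```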

Iterators

There are separate iterator classes, e.g.

  • SeriesIterator
  • ChannelIterator
  • SeriesChannelIterator
  • PortIterator
  • BinIterator
  • CTDCastIterator

These iterators are more complicated and richer in functionality than those of the Java Collections Framework.

  • They can be traversed in either direction.
  • They may or may not be in singleton mode.
  • They are responsive to the selectability of the underlying component.
  • They can be reset.
  • Processing starts with the current pointer and proceeds, if not in singleton mode, in ring-buffer fashion to terminate with the item that precedes it.

A fragment of code might look like the following:

ChannelSilo cs = new ChannelSilo();
ChannelId[] criterion = new ChannelId[2];
  ...
// Obtain the series iterator for this page
SeriesIterator serit = cs.getSeriesIterator(page, criterion);
serit.start();
// Visit each selected series in turn
while (serit.next()) {
  Series sr = serit.get();
  System.out.println("Series " + (SeriesId) sr);
}
// Report the case where the loop body never ran
if (!serit.doneAny()) {
  System.out.println("Nothing selected");
}

Note that serit.get() can return a null pointer if it is not called inside a next() loop. If the item to which the iterator's internal pointer refers has been deselected, get() re-assigns the pointer by searching in the currently indicated direction for the next selected item. If there is none, it returns a null pointer.
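
The traversal rules (start at the current pointer, wrap around, skip deselected items, optional singleton mode) can be sketched with a small generic class. This is a simplified illustration, not the JSerplo iterator: RingIterator and its setters are hypothetical, it only traverses forwards, and its get() simply returns the item found by the last next() rather than re-assigning the pointer.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of the traversal rules; forward direction only.
final class RingIterator<T> {
    private final List<T> items;
    private final Predicate<T> isSelected;
    private boolean singleton;
    private int current;      // index at which a traversal starts
    private int cursor;       // next index to examine
    private int examined;     // items examined so far in this traversal
    private boolean doneAny;
    private T item;

    RingIterator(List<T> items, Predicate<T> isSelected) {
        this.items = items;
        this.isSelected = isSelected;
    }

    void setCurrent(int index) { current = index; }
    void setSingleton(boolean singleton) { this.singleton = singleton; }

    // Reset the traversal so that it begins at the current pointer.
    void start() {
        cursor = current;
        examined = 0;
        doneAny = false;
        item = null;
    }

    // Advance to the next selected item, wrapping round the end of the list.
    // In singleton mode only the current item is considered.
    boolean next() {
        int limit = singleton ? Math.min(1, items.size()) : items.size();
        while (examined < limit) {
            T candidate = items.get(cursor);
            cursor = (cursor + 1) % items.size();
            examined++;
            if (isSelected.test(candidate)) {
                item = candidate;
                doneAny = true;
                return true;
            }
        }
        item = null;
        return false;
    }

    T get() { return item; }
    boolean doneAny() { return doneAny; }
}
```

With items a, b, c, d, the current pointer on c and c deselected, a traversal visits d, a, b and then stops.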

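These classes are not synchronised, so in a multithreaded environment get() could return a null pointer even inside the loop illustrated above, for example if another thread deselects an item between the calls to next() and get(). To avoid this, one could synchronise on the ChannelSilo object.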
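
A minimal sketch of that idea follows, with the container itself used as the lock. SiloSketch and its methods are hypothetical; the point is only that readers and writers take the same monitor, so a traversal never observes a selection change part-way through.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical container: every access to shared state holds the same lock.
final class SiloSketch {
    private final List<String> seriesNames = new ArrayList<>();
    private final List<Boolean> selected = new ArrayList<>();

    void add(String name) {
        synchronized (this) {
            seriesNames.add(name);
            selected.add(Boolean.TRUE);
        }
    }

    // A writer must wait until any traversal in progress has finished.
    void deselect(int index) {
        synchronized (this) {
            selected.set(index, Boolean.FALSE);
        }
    }

    // The whole traversal runs under the lock, so the selection state
    // cannot change between testing an item and using it.
    List<String> selectedNames() {
        List<String> result = new ArrayList<>();
        synchronized (this) {
            for (int i = 0; i < seriesNames.size(); i++) {
                if (selected.get(i)) {
                    result.add(seriesNames.get(i));
                }
            }
        }
        return result;
    }
}
```

Holding the lock for a whole traversal is simple but coarse; it trades some concurrency for predictable behaviour.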

ChannelSilo and page synchronisation

The ChannelSilo object retains a HashMap of the entities required for the smooth running of the program, including iterators and inter-page communication such as data cycle locking.

When fielding a request for a given iterator for a given page, the ChannelSilo first looks in the HashMap. If it finds no such iterator, it creates one, stores it and passes it to the requesting page. Otherwise it passes the existing object.

EDTEVA maintained synchrony of Port currency between Page1 and TimPage, whereas Serplo did not maintain currency between the corresponding pages. This was not a critical issue but illustrates the power of the mechanism. By asking for the iterator for another page and using it, automatic synchronisation is obtained.

Note that the criterion object for any given page is available via a static method for that page. A criterion object identifies the channel identifiers (ChannelId objects) which must be present for a series to be viewable. If no criterion argument is given (the alternative get() method) only the existing object can be retrieved and no object will be generated.
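
The look-up-or-create behaviour, and the variant without a criterion that never creates anything, can be sketched with a map keyed by page. IteratorRegistry, PageIterator and the use of a page name as the key are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

final class IteratorRegistry {
    // Hypothetical stand-in for a per-page iterator.
    static final class PageIterator {
        final String page;
        final String[] criterion;
        int position;   // shared state: every holder sees the same position

        PageIterator(String page, String[] criterion) {
            this.page = page;
            this.criterion = criterion;
        }
    }

    private final Map<String, PageIterator> byPage = new HashMap<>();

    // With a criterion: return the stored iterator, creating it on first request.
    PageIterator get(String page, String[] criterion) {
        return byPage.computeIfAbsent(page, p -> new PageIterator(p, criterion));
    }

    // Without a criterion: only an existing iterator can be retrieved (null if none).
    PageIterator get(String page) {
        return byPage.get(page);
    }
}
```

Because both pages receive the same object, moving the position through one reference is immediately visible through the other, which is the synchronisation effect described above.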

Outlier chasing on scatter plots

An outlier is a point divorced from the main body of data. Typically you want to view its context in a time series plot or to flag it immediately. As BODC stores vectors in polar form (direction and magnitude), and the plot is generated via Cartesian (X-Y) coordinates, this was not straightforward.

An object called a CannedOutlier, similar to an iterator, is created within the ChannelSilo. It is initially empty but can store a list of points, each identified by cycle number and series. The chief problem was to decide what conditions apply to garbage collection, as a list is expensive in terms of storage. The choice was between having a set for each series or a single set for the current series, and then deciding under what conditions the selected set would outlive the current session.
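
One way to relate a point picked on the Cartesian plot back to polar storage is to convert each cycle on the fly and keep the cycle nearest to the picked position. The sketch below does that and records the result by series and cycle number. It is an assumption-laden illustration: OutlierSketch, its methods and the direction convention (degrees clockwise from north) are hypothetical and do not describe the CannedOutlier class.

```java
import java.util.ArrayList;
import java.util.List;

final class OutlierSketch {
    // A stored point is identified by its series and its data cycle number.
    static final class Point {
        final String seriesId;
        final int cycle;

        Point(String seriesId, int cycle) {
            this.seriesId = seriesId;
            this.cycle = cycle;
        }
    }

    private final List<Point> points = new ArrayList<>();

    // Find the cycle whose Cartesian position lies nearest to (x, y).
    // Returns -1 for empty channels.
    static int nearestCycle(double[] speed, double[] directionDeg, double x, double y) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < speed.length; i++) {
            double rad = Math.toRadians(directionDeg[i]);
            double dx = speed[i] * Math.sin(rad) - x;
            double dy = speed[i] * Math.cos(rad) - y;
            double distance = dx * dx + dy * dy;   // squared distance is enough for ranking
            if (distance < bestDistance) {
                bestDistance = distance;
                best = i;
            }
        }
        return best;
    }

    void add(String seriesId, int cycle) { points.add(new Point(seriesId, cycle)); }
    int size() { return points.size(); }
    Point get(int index) { return points.get(index); }
    void clear() { points.clear(); }
}
```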

Waview had a similar requirement. Points associated with a particular histogram block can be inspected in turn on the TimPage display. This functionality was why Waview remained as a separate program rather than being incorporated within Xerplo.

Selection entities

The selection entities of interest in each program are:

  • In EDTEVA — Ports, Series and Channels.
  • In Xerplo — Series, Bins, Cells and Channels.
  • In Serplo — Series, Casts and Channels.
  • In Waview — Series, Histogram Blocks and Channels.

EDTEVA is different in having a larger, more encompassing entity than the Series: EDTEVA's Page1 selects Ports and Channels. In JSerplo any series sporting a non-null PortId will qualify for the PortPage and will not appear, in the first instance, on the SeriesPage. A mechanism was created to allow the series related to a given port to be listed separately.

Array handling and blocking

In C++ and Fortran you can pass an array simply by giving the start address. Contiguous subsetting can be done by incrementing the address to the start of the subset. In Java the situation is not so simple, because an array carries its own defined length and cannot be addressed from an arbitrary offset. Passing a subset as an array therefore means copying it, which can be grossly inefficient by comparison. Copying also creates update problems, as you are no longer updating the master array.

The alternative used was to pass the beginning and endpoints along with the master array. The master array however is encapsulated within the Channel object. For rank-2 data the storage order is bin within data cycle. The solution was to equip the Channel class with appropriate subsetting methods.
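
A sketch of index-based subsetting over an encapsulated master array follows. It assumes the "bin within data cycle" order means the element for (cycle, bin) sits at index cycle × bins + bin; ChannelSketch and its methods are hypothetical, not the actual Channel class.

```java
// Hypothetical channel wrapper: subsets are expressed as index ranges,
// so the master array is never copied.
final class ChannelSketch {
    private final double[] values;   // master array, bin within data cycle
    private final int bins;          // 1 for rank-1 data

    ChannelSketch(double[] values, int bins) {
        if (bins < 1 || values.length % bins != 0) {
            throw new IllegalArgumentException("Array length must be a multiple of bins");
        }
        this.values = values;
        this.bins = bins;
    }

    int cycles() { return values.length / bins; }

    double get(int cycle, int bin) { return values[cycle * bins + bin]; }

    // Updates go straight to the master array.
    void set(int cycle, int bin, double value) { values[cycle * bins + bin] = value; }

    // Maximum of one bin over cycles fromCycle (inclusive) to toCycle (exclusive).
    double max(int fromCycle, int toCycle, int bin) {
        double result = Double.NEGATIVE_INFINITY;
        for (int cycle = fromCycle; cycle < toCycle; cycle++) {
            result = Math.max(result, get(cycle, bin));
        }
        return result;
    }
}
```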

Blocking was introduced to improve plotting speed. The data are subsetted into blocks of, say, 200 cycles and only blocks which intersect the plotting window are plotted. Tests with the prototype indicated no cost penalty for plotting without blocking, but it is believed that the data set was probably too small to demonstrate the effect. Thus we retained blocking for the time being.
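
The block test can be sketched as an interval-overlap check on each block's first and last time value. This assumes the time channel is in ascending order; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class Blocking {
    // Returns {from, to} index pairs (to is exclusive) for every block whose
    // time span overlaps the window [windowStart, windowEnd].
    // Assumes time[] is sorted in ascending order.
    static List<int[]> visibleBlocks(double[] time, int blockSize,
                                     double windowStart, double windowEnd) {
        if (blockSize < 1) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<int[]> blocks = new ArrayList<>();
        for (int from = 0; from < time.length; from += blockSize) {
            int to = Math.min(from + blockSize, time.length);
            double first = time[from];
            double last = time[to - 1];
            // Two intervals overlap when each starts before the other ends.
            if (last >= windowStart && first <= windowEnd) {
                blocks.add(new int[] {from, to});
            }
        }
        return blocks;
    }
}
```

Only the cycles inside the returned index ranges need to be passed to the plotting code; blocks wholly outside the window are rejected after two comparisons.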

One efficiency gain is that the underlying files are updated on a channel-wide basis, as each series channel keeps track of whether it has been modified. This noticeably reduces the time taken to write data to file.
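
The modification tracking can be sketched with a flag per series channel that is set on the first real change and cleared on save. DirtyTracking, FlaggedChannel and the single-character flag values are assumptions for illustration, and the actual file output is omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DirtyTracking {
    // Hypothetical series channel holding one quality flag per data cycle.
    static final class FlaggedChannel {
        final String name;
        private final char[] flags;
        private boolean modified;

        FlaggedChannel(String name, int cycles) {
            this.name = name;
            this.flags = new char[cycles];
            Arrays.fill(flags, ' ');
        }

        // Mark the channel as modified only when a flag really changes.
        void setFlag(int cycle, char flag) {
            if (flags[cycle] != flag) {
                flags[cycle] = flag;
                modified = true;
            }
        }

        boolean isModified() { return modified; }
    }

    // Returns the names of the channels that would be rewritten and
    // clears their modified state; unmodified channels are skipped.
    static List<String> save(List<FlaggedChannel> channels) {
        List<String> written = new ArrayList<>();
        for (FlaggedChannel c : channels) {
            if (c.modified) {
                written.add(c.name);
                c.modified = false;
            }
        }
        return written;
    }
}
```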