BODC's Transfer system
The following explores BODC's Transfer system. It assumes prior knowledge of software programming techniques.
Transfer is BODC's process of converting externally supplied data (we've received data in over 320 different formats) into our in-house format.
The data have to conform to BODC's series model. By converting to a standard format much of the processing can be standardised. In addition, quality control begins with Transfer itself.
The Transfer system was created in 1981. It eliminated repetitious coding and improved the maintainability of the BODC software libraries.
Standardised elements were built into the system with a well-defined interface to format-specific functions. The Transfer programmer must write these functions for each format.
Stamping each file with the appropriate series identification is one of the key functions of Transfer. BODC uses a number of identifiers for each series. These become embedded within file names and within files and databases.
- Inventory numbers
- Accession, or data entry, identifier (oooyyaaaa)
- Sub accession identifier (s)
- Intermediate Processing Serial (IPS) number (iii00)
- Originator's identifier (CSHOID)
- BODC series reference number
The accession identifies the organisation (ooo), the last two digits of the year (yy), and the number within the year (aaaa). Transfer applies the accession, sub accession and IPS (oooyyaaaa/s/iii00) to each series.
The sub accession letter allows large accessions to be split. Zero is used when splitting is not required. The IPS is allocated sequentially within the range 00000-99900 and is always divisible by 100. These identifiers are then married to the BODC series reference number in a subsequent process.
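The oooyyaaaa/s/iii00 form lends itself to mechanical checking. The following is a minimal Python sketch, not BODC's code: the function and class names are hypothetical, and it assumes (the text does not say) that the organisation code is three alphanumeric characters and that the sub accession is a single letter or zero.

```python
import re
from typing import NamedTuple

# Hypothetical pattern for the oooyyaaaa/s/iii00 form.
# Assumed, not confirmed by the source: 'ooo' is three alphanumeric
# characters and 's' is a single letter or the digit zero.
SERIES_ID = re.compile(
    r"^(?P<org>[A-Za-z0-9]{3})(?P<year>\d{2})(?P<number>\d{4})"
    r"/(?P<sub>[A-Za-z0])"
    r"/(?P<ips>\d{3}00)$"  # five digits, always divisible by 100
)


class SeriesId(NamedTuple):
    organisation: str   # ooo
    year: str           # yy, last two digits of the year
    number: str         # aaaa, number within the year
    sub_accession: str  # s, zero when the accession is not split
    ips: int            # iii00, in the range 0 to 99900


def parse_series_id(text: str) -> SeriesId:
    """Split an identifier into its parts or raise ValueError."""
    match = SERIES_ID.match(text)
    if match is None:
        raise ValueError(f"not in oooyyaaaa/s/iii00 form: {text!r}")
    return SeriesId(
        match["org"],
        match["year"],
        match["number"],
        match["sub"],
        int(match["ips"]),
    )
```

With these assumptions, an invented identifier such as "ABC970012/0/00300" parses cleanly, while one whose IPS is not a multiple of 100 is rejected.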
Why do we need all these separate identifiers?
Inventory numbers can predate the series' arrival at BODC. At some stage the two have to be connected. The oooyyaaaa/s/iii00 form is handy because it emphasises issues that are pertinent to the data banking process, e.g. the organisation that provided the data and when they were delivered to BODC.
Consecutive IPS numbers indicate consecutive files (series reference numbers are not allocated consecutively). Furthermore, and not infrequently, the same series can be supplied more than once. It is useful for discussion and comparison to have a unique identifier within the system and the oooyyaaaa/s/iii00 form is guaranteed to be unique.
CSHOID is a 12-character concatenation of identifiers which the originator of the data might use to identify the series. Transfer must always generate this identifier, but it is not guaranteed to be unique.
Transfer, written in the Fortran programming language, became operational in 1982. It ran on a Honeywell mainframe with output to magnetic tape and microfiche. Headers were stored as disk files. When the system was migrated to an IBM 4381 mainframe in 1987, Transfer was re-engineered to use disk output exclusively.
The programmer had three subroutines to write:
- process the series header (HDnnn, 'nnn' the format identifier)
- process a series data cycle (CYnnn)
- wrap up the processing for the individual series (namely the trailer subroutine - TLnnn)
The routines were then slotted into the mainline program (TRnnn).
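The division of labour between a generic mainline and three format-specific routines can be illustrated in Python. This is a sketch of the pattern only, not a rendering of the original Fortran: the input format, the function names (mainline, hd_demo, cy_demo, tl_demo) and the header fields are all invented for the example.

```python
from typing import Callable, Dict, Iterable, List

Header = Dict[str, str]
Cycle = List[float]


def mainline(
    header_line: str,
    data_lines: Iterable[str],
    process_header: Callable[[str], Header],
    process_cycle: Callable[[str], Cycle],
    process_trailer: Callable[[Header, List[Cycle]], Header],
) -> Dict[str, object]:
    """Generic driver: format-specific work is delegated to three hooks."""
    header = process_header(header_line)                    # HDnnn role
    cycles = [process_cycle(line) for line in data_lines]   # CYnnn role
    header = process_trailer(header, cycles)                # TLnnn role
    return {"header": header, "cycles": cycles}


# Hypothetical hooks for an invented comma-separated format.
def hd_demo(line: str) -> Header:
    station, date = line.split(",")
    return {"station": station.strip(), "date": date.strip()}


def cy_demo(line: str) -> Cycle:
    return [float(value) for value in line.split(",")]


def tl_demo(header: Header, cycles: List[Cycle]) -> Header:
    # Wrap up the series: here, simply record how many cycles were read.
    return {**header, "cycle_count": str(len(cycles))}


result = mainline(
    "ST01, 1997-05-01",
    ["1.0,10.5", "2.0,10.7"],
    hd_demo,
    cy_demo,
    tl_demo,
)
```

Supporting a new format in this arrangement means writing three new hooks; the driver itself is untouched, which is the maintainability gain the system was designed to deliver.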
The programmer also had to prepare a Channel Specification Table (CST) in which the path for each data channel was mapped out. The table was converted into its binary equivalent, the Channel Descriptor Vector (CDV). It was then read by the mainline program.
Not all formats have a simple fixed form as the parameter set may vary from series to series. To cope with this situation dynamic channel handling allowed the header module to specify which channels were present.
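Dynamic channel handling can be pictured as follows. This Python sketch is illustrative only and does not reproduce the CST or CDV layouts, which the text does not describe: the channel codes, the table contents and the function names are assumptions.

```python
from typing import Dict, List

# Hypothetical table of every channel an invented format may carry.
CHANNEL_TABLE: Dict[str, str] = {
    "PRES": "pressure",
    "TEMP": "temperature",
    "PSAL": "salinity",
}


def channels_present(header_codes: List[str]) -> List[str]:
    """Return the channels a header declares, in file column order."""
    unknown = [code for code in header_codes if code not in CHANNEL_TABLE]
    if unknown:
        raise ValueError(f"channels not in table: {unknown}")
    return list(header_codes)


def read_cycle(line: str, present: List[str]) -> Dict[str, float]:
    """Map one whitespace-separated data cycle onto the declared channels."""
    values = [float(value) for value in line.split()]
    if len(values) != len(present):
        raise ValueError("column count does not match the header")
    return dict(zip(present, values))


# This series carries pressure and temperature but no salinity.
present = channels_present(["PRES", "TEMP"])
cycle = read_cycle("10.0 12.5", present)
```

The header step decides which subset of the table applies to the series, so the cycle step never has to assume a fixed column layout.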
The Fortran version of the Transfer software is no longer in use.
Only four C++ transfers were ever written, the first dating to 1997. Only the last of these had proper classes for "B" (header) file handling and series identification.
There was no CST. The C++ transfers were developed because the Fortran Transfer system could not handle 2D data. However, the object-oriented analysis required for the proper handling of header information and series identification carried straight over into the MATLAB version.
MATLAB Transfer was coded and became operational in early 2001. By 2004 it had supplanted the other forms of Transfer.
MATLAB Transfer uses object classes to manage the "B" file and series identifiers. In other respects it is a reversion to a form of Transfer employed before the Fortran system was developed, and it reflects the extension of the BODC data model to 2D data.
With 1D data there are only two ways to receive the data: by data cycle or by parameter.
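The two 1D layouts carry the same information and differ only in orientation, as this short Python sketch shows. The function names and the sample channel names are invented for illustration.

```python
from typing import Dict, List, Sequence


def cycles_to_parameters(
    names: Sequence[str], cycles: Sequence[Sequence[float]]
) -> Dict[str, List[float]]:
    """Convert data supplied cycle by cycle into one list per parameter."""
    columns: Dict[str, List[float]] = {name: [] for name in names}
    for cycle in cycles:
        if len(cycle) != len(names):
            raise ValueError("cycle length does not match parameter count")
        for name, value in zip(names, cycle):
            columns[name].append(value)
    return columns


def parameters_to_cycles(
    names: Sequence[str], columns: Dict[str, Sequence[float]]
) -> List[List[float]]:
    """Convert data supplied parameter by parameter back into cycles."""
    return [list(row) for row in zip(*(columns[name] for name in names))]


names = ["depth", "temperature"]
by_cycle = [[1.0, 10.5], [2.0, 10.7], [3.0, 10.9]]
by_parameter = cycles_to_parameters(names, by_cycle)
```

Because only two orientations exist, a single pair of conversions covers every 1D case.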
With 2D data the variety of structures that may be submitted to BODC increases enormously. Such variety is best addressed by giving freedom back to the programmer. This allows them to construct transfers as they see fit, subject to certain conditions.
The transferred files are in BODC's in-house QXF format. Once the appropriate checks have been passed, the series is registered as Transferred.
Transfer "B" file
The "B" file carries header data. This can differ depending on the type of data. In some formats space and time coordinates for the series are provided and in others they are not.
If coordinates are provided, the values should be recorded in the "B" file in the fields provided for the purpose. This mechanism allows database fields to be updated directly at a later stage and greatly reduces the problems associated with manual transcription.
Textual comment, channel limit information and identifiers are also present in the "B" file.
The next step
After data have been successfully transferred, a copy of the transferred version is archived.
The data are now ready for the next stage of BODC's data processing steps, normally screening.