There exists a private document "Scatter-Gather File System Format" (Roger Cappallo, 2014.7.1) with a description of the overall structure of scatter-gather files. However, there is apparently no *public* documentation yet on the Mark6 non-RAID recording format. The information here was reverse-engineered from a disk set with actual Mark6 scatter-gather (SG) mode recordings, the data in the individual files, and the Mark6 d-plane 1.12 source code.

Contents:

0) Some points on SG mode vs. RAID
1) Description of scatter-gather mode file sets and mounted disks
2) The metadata files under /mnt/disks/.meta/[1-4]/[0-7]/ ('group', 'slist')
3) The layout of data and headers in files of a SG mode file set
4) Failure modes


0) Some points on SG mode vs. RAID

Disk failure after recording in:
   SG mode     --> a subset of data is lost (periodic chunks missing in the
                   reconstructed file)
   RAID 0 mode --> the XFS file system will be broken and all files are lost,
                   although a low-level VDIF-specific recovery might be
                   possible if there are suitable tools for this...
   RAID 5 mode --> a single disk failure is survivable, no data lost

Disk failure during recording in:
   SG mode     --> might be able to continue recording? Mark6 dplane
                   implementation not quite clear...
   RAID 0 mode --> recording probably freezes
   RAID 5 mode --> should be able to continue writing even with the degraded
                   RAID, at least until the next disk fails

Recording-time performance:
   SG mode     --> should allow the best throughput and should be resilient to
                   disk failures, although this depends on the Mark6 dplane
                   implementation
   RAID 5 mode --> probably fine for 4 Gbps recording, but for 4 to 16+ Gbps it
                   might be too slow? (untested...)

Data reassembly in:
   SG mode     --> data from files scattered across all disks has to be
                   combined, removing the embedded metadata during this
                   gathering process
   RAID 0&5    --> no special reassembly steps

Mark6 disk mounting:
   SG mode
      Each of the two LSI SAS HBA cards in the Mark6 handles 16 disks.
      The Mark6 software divides the disks into 8-disk groups (historical
      reasons?). Hence there can be 4 groups with 8 disks each (groups 1 to 4;
      disks 0 to 7). Each disk has its own XFS file systems (2 XFS partitions
      per disk: data and metadata).
      Automounting? Apparently not; one must mount with a script, or manually,
      or via dplane+cplane+da-client.
   RAID mode
      Certainly just a single XFS file system distributed over all disks.
      Not tested whether individual disks have two partitions (could have 1 in
      the RAID, 1 with disk-specific metadata).
      Automounting? mdadm automount perhaps?


1) Description of scatter-gather mode file sets and mounted disks

In scatter-gather mode there is a set of files associated with each VLBI scan.
The files reside on the scatter-gather related disk partitions, on XFS file
systems. All partitions are:

   cat /proc/partitions
      8     0   488386584  sda     -- OS disk (fixed)
      8     1   468596736  sda1       Linux, Mark6 root file system
      8     2           1  sda2       ?
      8     5    19786752  sda5       Linux, swap
      8    16  3907018584  sdb     -- VLBI disk, 1 of 16 (removable)
      8    17  3906919424  sdb1       data partition, XFS file system
      8    18       97280  sdb2       meta data partition, XFS file system
      ...
      8   240  3907018584  sdp     -- VLBI disk, 15 of 16 (removable)
      8   241  3906919424  sdp1       data partition, XFS file system
      8   242       97280  sdp2       meta data partition, XFS file system
     65     0  3907018584  sdq     -- VLBI disk, 16 of 16 (removable)
     65     1  3906919424  sdq1       data partition, XFS file system
     65     2       97280  sdq2       meta data partition, XFS file system

Mounted as:

   /dev/sdb1 on /mnt/disks/1/0 type xfs (rw)
   /dev/sdb2 on /mnt/disks/.meta/1/0 type xfs (rw)
   ...
   /dev/sdp1 on /mnt/disks/2/6 type xfs (rw)
   /dev/sdp2 on /mnt/disks/.meta/2/6 type xfs (rw)
   /dev/sdq1 on /mnt/disks/2/7 type xfs (rw)
   /dev/sdq2 on /mnt/disks/.meta/2/7 type xfs (rw)
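Since the SG disks are apparently not automounted, it can be useful to check that all expected data and metadata partitions are actually mounted before reading anything from them. Below is a minimal Python sketch (not part of the Mark6 software) that assumes the /mnt/disks/<group>/<disk> and /mnt/disks/.meta/<group>/<disk> layout shown above and that modules sit in slots 1 and 2; adjust the group range for other configurations.

   #!/usr/bin/env python
   # Minimal sketch: report which Mark6 SG mount points are present.
   # Assumes the /mnt/disks/<group>/<disk> and /mnt/disks/.meta/<group>/<disk>
   # layout shown above, with modules in slots 1 and 2 (adjust as needed).
   import os

   for group in (1, 2):
       for disk in range(8):
           for path in ('/mnt/disks/%d/%d' % (group, disk),
                        '/mnt/disks/.meta/%d/%d' % (group, disk)):
               status = 'mounted' if os.path.ismount(path) else 'NOT mounted'
               print('%-26s %s' % (path, status))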
Meta data files (see 2):
   /mnt/disks/.meta/[1-4]/[0-7]/group    diskpack volume serial number
   /mnt/disks/.meta/[1-4]/[0-7]/slist    scan list in JSON format

Scatter-gather files (see 3):
   /mnt/disks/[1-4]/[0-7]/filename

When scans are recorded across 16 disks ("2 diskpacks") in SG mode the
resulting set of data files is, for example:

   /mnt/disks/1/0:
   -rw-r--r-- 1 oper mark6 1.2G Oct 26 15:13 n14st02c_fila10_2014y299d06h12m40s.vdif
   -rw-r--r-- 1 oper mark6 1.3G Oct 26 15:09 n14st02c_fila10_2014y299d15h08m49s.vdif

   /mnt/disks/1/1:
   -rw-r--r-- 1 oper mark6 1.4G Oct 26 15:13 n14st02c_fila10_2014y299d06h12m40s.vdif
   -rw-r--r-- 1 oper mark6 1.3G Oct 26 15:09 n14st02c_fila10_2014y299d15h08m49s.vdif

   /mnt/disks/1/2:
   -rw-r--r-- 1 oper mark6 1.4G Oct 26 15:13 n14st02c_fila10_2014y299d06h12m40s.vdif
   -rw-r--r-- 1 oper mark6 1.3G Oct 26 15:09 n14st02c_fila10_2014y299d15h08m49s.vdif

   ...

   /mnt/disks/1/7:
   -rw-r--r-- 1 oper mark6 1.3G Oct 26 15:13 n14st02c_fila10_2014y299d06h12m40s.vdif
   -rw-r--r-- 1 oper mark6 1.4G Oct 26 15:09 n14st02c_fila10_2014y299d15h08m49s.vdif

   /mnt/disks/2/0:
   -rw-r--r-- 1 oper mark6 1.3G Oct 26 15:13 n14st02c_fila10_2014y299d06h12m40s.vdif
   -rw-r--r-- 1 oper mark6 1.4G Oct 26 15:09 n14st02c_fila10_2014y299d15h08m49s.vdif

   ...

   /mnt/disks/2/7:
   -rw-r--r-- 1 oper mark6 1.3G Oct 26 15:13 n14st02c_fila10_2014y299d06h12m40s.vdif
   -rw-r--r-- 1 oper mark6 1.3G Oct 26 15:09 n14st02c_fila10_2014y299d15h08m49s.vdif

In the above example two scans (n14st02c_fila10_2014y299d06h12m40s.vdif and
n14st02c_fila10_2014y299d15h08m49s.vdif) were recorded onto the two diskpacks.
The scatter files are about 1.3 GB in size each. The total size of one scan was
about 21 GB, i.e. 16 files x 1.3 GB/file.


2) The metadata files under /mnt/disks/.meta/[1-4]/[0-7]/ ('group', 'slist')

The 'group' meta data files contain a single line, without a newline. The
information reflects (hopefully) the volume serial number sticker on the
diskpack. When recording across two diskpacks (16 disks total) the metadata
might look like this:

   $ for ii in `seq 0 7`; do cat /mnt/disks/.meta/1/$ii/group; echo; done
   2:KVN00500/32000/4/8:KVN00600/32000/4/8
   2:KVN00500/32000/4/8:KVN00600/32000/4/8
   2:KVN00500/32000/4/8:KVN00600/32000/4/8
   2:KVN00500/32000/4/8:KVN00600/32000/4/8
   2:KVN00500/32000/4/8:KVN00600/32000/4/8
   2:KVN00500/32000/4/8:KVN00600/32000/4/8
   2:KVN00500/32000/4/8:KVN00600/32000/4/8
   2:KVN00500/32000/4/8:KVN00600/32000/4/8

   $ for ii in `seq 0 7`; do cat /mnt/disks/.meta/2/$ii/group; echo; done
   2:KVN00500/32000/4/8:KVN00600/32000/4/8
   2:KVN00500/32000/4/8:KVN00600/32000/4/8
   2:KVN00500/32000/4/8:KVN00600/32000/4/8
   2:KVN00500/32000/4/8:KVN00600/32000/4/8
   2:KVN00500/32000/4/8:KVN00600/32000/4/8
   2:KVN00500/32000/4/8:KVN00600/32000/4/8
   2:KVN00500/32000/4/8:KVN00600/32000/4/8
   2:KVN00500/32000/4/8:KVN00600/32000/4/8
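The 'group' string appears to be colon-separated: a leading count of the modules in the group, followed by one field per module of the form VSN/... (KVN00500/32000/4/8 in the example above). The meaning of the numeric fields after the VSN is not documented here, so the following hedged sketch simply keeps them as strings; parse_group() is an illustrative helper, not part of the Mark6 software.

   #!/usr/bin/env python
   # Hedged sketch: split a Mark6 'group' metadata string such as
   #   2:KVN00500/32000/4/8:KVN00600/32000/4/8
   # into a module count and per-module fields. The numeric fields after the
   # VSN are kept uninterpreted since their exact meaning is not documented here.

   def parse_group(line):
       fields = line.strip().split(':')
       nmodules = int(fields[0])           # leading count of modules in the group
       modules = []
       for entry in fields[1:]:
           parts = entry.split('/')
           modules.append({'vsn': parts[0], 'extra': parts[1:]})
       return nmodules, modules

   if __name__ == '__main__':
       with open('/mnt/disks/.meta/1/0/group') as f:
           print(parse_group(f.read()))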
The 'slist' meta data file contains a list of the scans on the disks. Each disk
should contain a copy identical to the 'slist' files on the other disks of the
diskpack or disk group:

   $ for ii in `seq 0 7`; do md5sum /mnt/disks/.meta/1/$ii/slist; done
   ff872a296c46b64138041cb27b1197b1  /mnt/disks/.meta/1/0/slist
   ff872a296c46b64138041cb27b1197b1  /mnt/disks/.meta/1/1/slist
   ff872a296c46b64138041cb27b1197b1  /mnt/disks/.meta/1/2/slist
   ff872a296c46b64138041cb27b1197b1  /mnt/disks/.meta/1/3/slist
   ff872a296c46b64138041cb27b1197b1  /mnt/disks/.meta/1/4/slist
   ff872a296c46b64138041cb27b1197b1  /mnt/disks/.meta/1/5/slist
   ff872a296c46b64138041cb27b1197b1  /mnt/disks/.meta/1/6/slist
   ff872a296c46b64138041cb27b1197b1  /mnt/disks/.meta/1/7/slist

   $ for ii in `seq 0 7`; do md5sum /mnt/disks/.meta/2/$ii/slist; done
   ff872a296c46b64138041cb27b1197b1  /mnt/disks/.meta/2/0/slist
   ff872a296c46b64138041cb27b1197b1  /mnt/disks/.meta/2/1/slist
   ff872a296c46b64138041cb27b1197b1  /mnt/disks/.meta/2/2/slist
   ff872a296c46b64138041cb27b1197b1  /mnt/disks/.meta/2/3/slist
   ff872a296c46b64138041cb27b1197b1  /mnt/disks/.meta/2/4/slist
   ff872a296c46b64138041cb27b1197b1  /mnt/disks/.meta/2/5/slist
   ff872a296c46b64138041cb27b1197b1  /mnt/disks/.meta/2/6/slist
   ff872a296c46b64138041cb27b1197b1  /mnt/disks/.meta/2/7/slist

The 'slist' files are in JSON format, for example (with slight reformatting):

   { 1: {'status': 'recorded', 'num_str': 1, 'start_tm': 1410876577.706074,
         'create_time': '2014y259d14h09m38s', 'sn': 'wrtest_2nd_test_001',
         'dur': 100, 'spc': 0, 'size': '7.546'},
     2: {'status': 'recorded', 'size': '8.533', 'start_tm': 1410909599.106127,
         'create_time': '2014y259d23h20m00s', 'sn': 'wrtest_2nd_tvg_001',
         'dur': 100, 'spc': 0, 'num_str': 1},
     3: {'status': 'recorded', 'size': '0.000', 'start_tm': 1410938164.948671,
         'create_time': '2014y260d07h16m05s', 'sn': 'wrtest_2nd_cw_001',
         'dur': 100, 'spc': 0, 'num_str': 1},
     ...
     59: {'status': 'recorded', 'size': '15.226', 'start_tm': 1415584753.080439,
          'create_time': '2014y314d01h59m14s',
          'sn': 'wrtest_kjcc_test_2014y314d10h59m09s',
          'dur': 15, 'spc': 0, 'num_str': 1}}

We can look at the last scan ('wrtest_kjcc_test_2014y314d10h59m09s', 16 files
called "wrtest_kjcc_test_2014y314d10h59m09s.vdif"). The recording was started
manually and the file name part 2014y314d10h59m09s is not entirely correct; in
addition, the Mark6 system was in the KST time zone (Korean Standard Time). The
16 files can be combined into a readable VDIF file using Mark6 'gather' or the
'm6sg_gather' of this library package.

   JSON/slist: the create_time of the scan is 2014y314d01h59m14s
   VDIF file:  the first time stamp is MJD 56971/01:59:14.19
               (10 Nov 2014 / DOY 314), as set by the VLBI backend

The VDIF time stamp and the time stamp in the JSON-formatted metadata agree. If
one wanted to inspect the contents of a diskpack in, say, DiFX, and correlate a
full diskpack without explicitly specifying a list of individual scans, one
could *probably* use the JSON data to map different time ranges to scan names.
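The scan list lends itself to exactly this kind of time-range-to-scan-name mapping. Note that the quoting in the example above is Python-literal style (single quotes, integer keys) rather than strict JSON, so the hedged sketch below reads it with ast.literal_eval() instead of the json module; read_slist() and the default path are illustrative only.

   #!/usr/bin/env python
   # Minimal sketch: read one 'slist' file and print scan number, scan name,
   # creation time, duration and size. The example slist above uses Python-style
   # quoting rather than strict JSON, hence ast.literal_eval() instead of json.
   import ast

   def read_slist(fn='/mnt/disks/.meta/1/0/slist'):
       with open(fn, 'r') as f:
           return ast.literal_eval(f.read())

   if __name__ == '__main__':
       scans = read_slist()
       for num in sorted(scans.keys()):
           s = scans[num]
           print('%3d  %-40s  %s  %5s s  %7s GB' %
                 (num, s.get('sn'), s.get('create_time'), s.get('dur'), s.get('size')))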
3) The layout of data and headers in files of a SG mode file set

Each file in a SG file set (the files associated with the same scan) looks like
this:

   [file header]
   [block a header] [block a data (~10MB)]
   [block b header] [block b data (~10MB)]
   [block c header] [block c data (~10MB)]
   ...

The file header:

   sync_word      uint32   0xfeed6666
   version        uint32   version number of the SG file format
   block_size     uint32   nominal size of blocks, including the block header
   packet_format  uint32   0:vdif, 1:mark5, 2:unknown (not sure what this is
                           used for)
   packet_size    uint32   length of data packets (frames?) in bytes

The block header (version 1):

   blocknum       uint32   sequence number of the block, increasing and unique
                           within the file set of the scan

The block header (version 2):

   blocknum       uint32   sequence number of the block, increasing and unique
                           within the file set of the scan
   wb_size        uint32   size of this block including the block header; might
                           be shorter than the block_size of the file header,
                           e.g. the last block in a scan might be only
                           partially filled

The file and block headers describe mainly the packet size and the size of a
data block. In the Mark6 source code these blocks are sometimes referred to as
"cells" (?). The data section of each block is ensured to contain an integer
number of VLBI frames. In version 2 of the file format the blocks can differ in
size from one block to the next, perhaps to accommodate some kind of VLBI data
that yields a dynamically changing packet size.

The block numbers within a file are always (?) increasing. Consecutive blocks
in the same file have increasing block numbers that have "gaps" (e.g., a=0,
b=16, c=35, ...). The block numbers that are "missing" in one file (in this
example the missing blocks would be 1,2,3,...,15,17,18,...,34,...) are found in
one of the other files associated with the scan.

The Mark6 recording software is able to do on-the-fly conversion from Mark5B
into VDIF, and one may probably safely assume that the recorded data on the
disks are always VDIF.

During scatter-gather mode recording the Mark6 opens one file on each of the
disks. Network data are written to this set of open files. Because the 10GbE
recording in the Mark6 software is not done round-robin across these files, the
order of blocks across the files is somewhat random. In a 4-disk example, the
four files of one SG recording (one file on every disk) might contain:

   file 0    file 1    file 2    file 3
   ------------------------------------
   block 0   block 1   block 2   block 3
   block 4   block 7   block 5   block 6
   block 9   block 8   block 10  block 11
   ...

It is not clear from the Mark6 source code whether the following is also
possible, e.g., when the disk containing file 3 is very slow compared to the
other disks:

   file 0    file 1    file 2    file 3
   ------------------------------------
   block 0   block 1   block 2   block 3
   block 4   block 5   block 6   block 9
   block 7   block 8   block 10  block 13
   block 11  block 12  block 14  ...


4) Failure modes

Hard failures

   In RAID 0 mode most failures are probably catastrophic.

   In RAID 5 mode a disk failure probably requires the operator to insert a new
   disk and start the rebuild of the degraded array. At that time, or even when
   just reading data in degraded mode without starting the array rebuild,
   Murphy's law dictates that a second disk will fail, causing a catastrophic
   failure.

Soft failures

   Dead/unmounted disk in SG mode: regular pieces will be missing in the VLBI
   data. Equivalent to reading a VDIF file where some frames are missing.

   Missing file in the file set of some scan: equivalent to the above case.

   Disks attached but not mounted: just mount them normally.

Bizarre failures

   Scatter-gather file metadata corrupt, or XFS file systems partially corrupt
   and returning bad file data: probably needs manual intervention; one might
   try to rename the file in the file set that contains the bad metadata, or
   unmount that particular disk/file system.

   Mismatch between the /mnt/disks/.meta/[1-4]/[0-7]/group and ./slist JSON
   metadata and the actual contents of the XFS file systems on the
   scatter-gather disks: needs manual intervention? Or just ignore those
   metadata; they are largely just for convenience anyway?
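To tie sections 3 and 4 together, here is a hedged sketch of a gather-style reader: it parses the file header and block headers of every available file of a scan, then writes the block payloads out in increasing block-number order, silently tolerating block numbers that never appear (e.g. because a disk of the set is dead or unmounted, or a file is missing). Little-endian fields are assumed (x86 recorder), and index_blocks()/gather() are illustrative helpers; this is not the actual 'gather'/'m6sg_gather' implementation.

   #!/usr/bin/env python
   # Illustrative sketch (not the actual 'gather'/'m6sg_gather' code): index the
   # blocks of every file in one SG scan via the headers described in section 3,
   # then concatenate the block payloads in increasing block-number order.
   # Missing block numbers are simply skipped, mimicking the "soft failure"
   # behaviour described in section 4.
   import glob
   import struct
   import sys

   # file header: sync_word, version, block_size, packet_format, packet_size
   FILE_HDR = struct.Struct('<5I')

   def index_blocks(fn):
       """Return a list of (blocknum, payload offset, payload size) for one SG file."""
       blocks = []
       with open(fn, 'rb') as f:
           sync, ver, blksz, pktfmt, pktsz = FILE_HDR.unpack(f.read(FILE_HDR.size))
           if sync != 0xfeed6666:
               raise ValueError('%s: not an SG file (bad sync word)' % fn)
           while True:
               if ver == 1:                       # version 1: fixed-size blocks
                   hdr = f.read(4)
                   if len(hdr) < 4:
                       break
                   (blocknum,) = struct.unpack('<I', hdr)
                   payload = blksz - 4            # block_size includes the 4-byte header
               else:                              # assume version 2 block headers
                   hdr = f.read(8)
                   if len(hdr) < 8:
                       break
                   blocknum, wb_size = struct.unpack('<2I', hdr)
                   payload = wb_size - 8          # wb_size includes the 8-byte header
               blocks.append((blocknum, f.tell(), payload))
               f.seek(payload, 1)                 # skip payload, go to next block header
       return blocks

   def gather(scanname, outfile):
       files = sorted(glob.glob('/mnt/disks/[1-4]/[0-7]/%s' % scanname))
       index = []                                 # (blocknum, filename, offset, size)
       for fn in files:
           index += [(b, fn, off, size) for (b, off, size) in index_blocks(fn)]
       index.sort()                               # global block-number order; gaps tolerated
       handles = dict((fn, open(fn, 'rb')) for fn in files)
       with open(outfile, 'wb') as out:
           for blocknum, fn, off, size in index:
               handles[fn].seek(off)
               out.write(handles[fn].read(size))
       for f in handles.values():
           f.close()

   if __name__ == '__main__':
       gather(sys.argv[1], sys.argv[2])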