From wbrisken@parallax Thu Nov 11 14:56:12 2010 Date: Thu, 11 Nov 2010 14:56:11 -0700 (MST) From: Walter Brisken To: V-FASTER exploder Subject: New architecture for transient capture Hi folks, Its a holiday here so I'm doing something fun: trying to figure out how to properly implement the code to capture baseband data for interesting events. My initial concept had a fatal flaw in that information needed was collected by the wrong fork() and digging out of that situation would have involved all sorts of complicated locks and so on. I think I have a better concept now. I'll try to describe that here. The general framework that we have to live with is as follows. The operator GUI sends a message to the software correlator head node "mk5daemon" process specifying the job to start and the list of processors and mark5 units to use. A fork() call within mk5daemon allows a change of userid from which the whole process is started. There are two critical end of job messages. The first, DIFX_STATE_DONE, is sent from mpifxcorr itself indicating that the job completed properly. The second, DIFX_STATE_MPIDONE, is emitted by the fork()ed mk5daemon process that started the correlator, indicating to the gui that the next job can be started. The transient copy has to happen between these to function calls. My new concept calls for mpifxcorr itself to be called from yet another wrapper program that first sets up a multicast receiver to listens for and prioritizes transient event messages and then spawns mpifxcorr. Once that process exits, if DIFX_STATE_DONE was received (rather than receiving DIFX_STATE_ABORTING or nothing at all) the copying can begin. Since each Mark5 unit will be running its own wrapper program, the copying will automatically happen in parallel, and each unit will copy as much data as it can within the predetermined time window (typically a couple % of the job run-time; perhaps we can link this to the database to see how big the queue is and dynamically change the maximum copy duration...). Once the time limit, or copy queue, has run out, the wrapper program will exit and mk5daemon will finally emit DIFX_STATE_MPIDONE. There are several additional benefits to this. One is that each version of difx can have its own independent version of this wrapper program. Second, each wrapper program only has to be responsible for one job and one Mark5 unit. There is one pitfall to this whole design that was present even in the earlier design. This transient copy will only work for data residing on a Mark5 unit, not the case where baseband data is stored on a disk file. This will mean that testing will be slightly hindered, but I don't see that as a big draw back. With some work a different copying algorithm could be used that could use disk files. Unlike the initial design, the full information about the job can be accessed by the wrapper program that would allow non-mark5 copying to be contemplated. I hope today to at least get the infrastructure in place to use the wrapper program and I'll see how far I get before sundown. Cheers, Walter