/**
\page jobControlWindow Job Control Window
\brief Complete description of the Job Control Window and its functions.
\tableofcontents
Each Job in the Queue Browser Window has a Job "Control/Monitor" Window
associated with it. This window, which can be produced by picking
the "Control/Monitor for [JOB NAME]" option in the Job menu (right
click on a job in the Queue browser to raise this menu), contains all
items required to tailor a DiFX job, and controls to actually run and
monitor it.
Like many DiFX GUI windows, the Control/Monitor Window is organized into "Sections"
that can be opened or closed depending on whether you are interested in
the details they contain. To open a section, click on its title
bar. Click in the same location to close it.
\section jobControlInputFile Input File Section
This section has an editable window containing the text of the
DiFX "input" file - the file read by DiFX
that governs a run. There is also a field showing the full
path to the input file on the DiFX host, and two buttons:
- Save uploads the exact
context of the text window to the DiFX host and places it in the
location given by the full path. This can be done if any
hand-editing changes have been made to the Input file (see the
warning
below about edits!). Unless this button is pushed, any
editing
changes you have made will have no effect.
- Refresh obtains the
contents of the file (given by the full path) on the DiFX host
and puts it in the editable text window. This can be done if you
make changes that you don't like or are unsure of. All of your
changes will be lost. Note that "Refresh" can only undo changes
made since that last "Save".
Warning: The text field
allows you to edit the Input file, but remember that DiFX depends on this file to
run properly. Because the original is automatically generated there
is not a lot of format-checking of the content or clear and concise
error reporting if anything you put in there is incorrect - DiFX may just blow up.
Don't mess around with this file unless you really know what you are
doing.
\section jobControlCalcFile Calc File Section
This section is identical to the "Input File" section except that
it holds the "calc" file for the job, which is automatically created when
the job is created. The same controls are available and the same warnings apply.
\section jobControlMachinesList Machines List Section
The Machines List section provides controls that can be used to
select the data sources and processors used to process the job.
Selections are used to produce "Machines" and "Threads" files on the
DiFX host which are used by mpirun to control
multiprocessing. The .machines
and .threads
files can be
generated using the "Apply" button, and then examined in the
\ref jobControlMachinesFile "Machines File" and \ref jobControlThreadsFile "Threads File" sections as
outlined below. Alternatively the "Apply" button can be ignored and the
.machines
and .threads
files will be generated automatically when the
job is started.
The Machines List Section has two primary panels which govern
\ref jobControlDataSources "Data Sources" (which machines are used as the source of data for each
antenna involved in a correlation) and
ref jobControlProcessors "Processors" (which machines are devoted to round-robin DiFX processing).
When the Job Control window is first opened, these fields are populated automatically
in accordance with setting in the \ref dataSourceDefaults "Data Source Defaults"
and \ref processorDefaults "Processor Defaults" areas of the Settings window.
\subsection jobControlDataSources Data Sources
The purpose of the Data Sources section is to allow the user to
select the machines that will serve as the "source" of data for
each
station involved in a correlation. For "MODULE" data types
the
data source will be the Mark5 unit that contains the required
module. For "FILE" data types the data source will be the
machine
that reads the data files off of disk. A "NETWORK" data
source is
the machine that serves as the network destination of a data
stream
from a remote source. Each station involved in a correlation
can
have only one data source machine assigned to it, and ideally each
station will have a different machine assigned to it.
The Data Sources panel includes a section for each station.
These sections differ somewhat depending on the type of data
source
being used. Different stations can use different data source
types, however for the moment it is assumed that all of the data
for a
particular station will be using the same data source type.
The following image shows the Data Sources panel for a two-station
correlation. The first station uses a FILE data source, and
the
second station uses a MODULE data source:
For each data source a pull-down menu is provided from which any
appropriate available machine can be selected. This is the
actual
data source that will be included in the resulting .machines
file. Following it is a checkbox labelled "Skip" that can be
used
to remove the station from the job (see
Removing Stations From a
Job
below), a column for the station involved (two letter
abbreviation),
the format for the data, and the data source type.
Subsequent
columns differ based on the data source type.
Initial Data Source Choice
When the Job Control Window is first opened, the GUI will have
chosen
what it feels are logical machine picks for each data
source. For
module data, a MarkV with the appropriate module will be used if
it can
be found. The machine for file data will be chosen based on
several user settings (see
Data Source
Defaults
in the Job Processing Settings of the Settings Window), an option
among
which is a specific assignment based on path.
FILE Data Sources
A file can be read by any data source that has access to the file,
thus
the source selection pull-down menu provides everything seen on the
DiFX cluster, including processors and Mark5 units (some of these
machines may not have access to the storage system containing the
file,
but unfortunately the GUI has no way of knowing this).
The last column of a FILE data source line shows the first in the
list of files that contain the data for the antenna (usually one
file
for each scan). This list can be expanded to see all of the
files. The names of the files are obtained from the "DATA
TABLE"
section of the .input file.
If the "DATA TABLE" section
of
the .input file does not contain anything (this can happen of the
data
for a station were listed in a File List
and vex2difx failed to
find
appropriate data files), the data are considered "missing" and the
data
source will be outlined in red.
This job will not process correctly, however there is a way to
repair
it - see Removing Stations
From a Job below.
MODULE Data Sources
A module data source can only be a machine capable of reading a
module
- a Mark5 unit. The pull down menu for a module data type
includes only the Mark5 machines that can be seen on the DiFX
cluster.
The VSNs of modules installed in each Mark5 are
listed
in the pull down menu - the Mark5 unit containing the VSN required
for
the experiment and antenna should be chosen as the data
source.
If the required VSN is not installed in the current data source
the VSN
name will be colored red, as in the
image
above. If the data source does contain the correct module
the VSN
name will be colored blue (see the
previous images).
NETWORK Data Sources
Currently a work in progress.
Choose
Data Source Based on Module
- When this option is checked, the GUI will attempt to pick
Mark5
data sources that contain the modules required for the
job. If
the proper modules are not present they will appear in red in
the data
source list. If you are not using Mark5 modules as your
data
source this option will have no effect. I might get rid of
this
check box - instead make this behavior a quiet and constant
default.
Removing Stations
From a
Job
- It may be desirable to run a job without one of its
associated stations. The most obvious situation where this
would
be the case would be when the data for a station was "missing"
for some
reason. In the example pictured below, the data source for
the
station "KE" is missing because the .input file had no data
associated
with it (it would have been "D/STREAM 1" in the .input
file).
Running DiFX on this job will result in an error. To draw
attention to this situation, the Data Source menu has outlined
the KE
station data in red.
To remove the missing station from the experiment (or to remove any
other station, if desired), check the "Skip" box associated with it,
and then click "Rebuild Job". The job can then be re-run
normally.
What is the GUI Doing
When it Rebuilds a Job?
To "rebuild" a job that runs with a different set of stations, the
GUI
has to
create an entirely new set of job-specific files, including the
.input
file, the
.calc file, and some others. This is done by creating a new,
job-specific .v2d file (based on the .v2d file for the experiment)
with
the station selections, then running vex2difx on it to create the new job files.
Before running vex2difx,
existing files are first renamed with ".0" appended to their
original
name for the first rebuild, ".1" for the second rebuild, etc, so
nothing is thrown away. The list of files for a job named "JOB
NAME" that are renamed (if they exist) are [JOB NAME].calc, [JOB
NAME].difxlog, [JOB NAME].flag, [JOB NAME].im, [JOB NAME].difx (a
directory), [JOB NAME].input, [JOB NAME].machines, [JOB
NAME].threads,
and [MANGLED JOB NAME].v2d (from a previous rebuild). The
"MANGED
JOB NAME" of the .v2d file matches the JOB NAME of other files
except
that all underscores have been replaced by hyphens (vex2difx does not permit
underscores in the .v2d file name).
Automating
the Rebuild Procedure
When data for a station are missing from a job it is often the case
that the job is part of a large, many-job experiment, where, for
whatever reason, some portion of the data for a scheduled station
does
not exist. In this case it may be many jobs that will have to
be
rebuilt and re-run. The GUI scheduling system can be set to
recognize when a scheduled job fails to run due to missing data,
rebuild the job with the offending station omitted, and re-run the
job
automatically. In fact, this is the default behavior of the
scheduler. See Running
Jobs With the Scheduler and Scheduler Settings.
Processors
- Eliminate Non-Responding Processors
- When guiServer on the DiFX host creates the Machines
and
Threads files, it will run a quick check on each processor and
data
source (if applicable) to make sure it has the proper
permissions to
execute an mpirun on the specific host. If a
processor
fails this test and this option is selected, it will be
automatically
removed from the list of processors used on the job.
- Eliminate Processors Over ____ % Busy
- This option will de-select any processors that have more than
a
given percentage of their CPU time consumed.
General Guidelines
- In general you have complete flexibility in which machines you
choose to do what, although the GUI will try to alert you if you
request something that won't work. Given that, there are a
number
of ways you can wisely choose your machines such that your
correlation
processing is more efficient.
With some exceptions (such as situations where you have
no
choice) a machine should not appear in both the Data Sources and
Processors list.
Each antenna should, if possible, have an independent
data
source associated with it. The nature of correlation
requires
that data will be required for all antennas simultaneously
and
having more than one antenna on a data source will cause that
source to
swap inefficiently between files.
If possible the head node should have a dedicated
machine
(not just a single thread on a machine). The head node
should not
appear in the list of processors.
If a processor machine has n cores, assign at
most n
- 1 threads to it. The GUI will try to do this
automatically.
\section jobControlMachinesFile Machines File Section
Slightly different - changes to the "Machines List" settings will
cause the regeneration of this file.
\section jobControlThreadsFile Threads File Section
more of the same
\section jobControlRunControls Run Controls Section
This section contains the controls for starting and stopping jobs.
\section jobControlStatusBar Status Bar
The Status Bar contains the current run state of the job.
Progress and some simple information are included. This section
has no label, and cannot be opened or closed. It resides just
below the \ref jobControlsRunControls "Run Controls" section.
\section jobControlRealTimeMonitors Real Time Monitors
The Real Time Monitors section can be used to produce plots of job
output data as they are produced (reasonably close to real time).
Some first-level analysis can also be done. The primary aim of
these plots is not to produce final, reduced results, but to quickly
give the user an idea whether a job was a success, or in what ways its
parameters need to be adjusted to make it better.
\section jobControlMessageMonitor Message Monitor
The message monitor captures and displays all messages that
relate to the job.
*/