Job Control Window
Each Job in the Queue Browser Window has a Job "Control/Monitor"
Window
associated with it. This window, which can be produced by
picking
the "Control/Monitor for [JOB NAME]" option in the Job menu (right
click on a job in the Queue browser to raise this menu), contains
all
items required to tailor a DiFX job, and controls to actually run
and
monitor it.
Like many DiFX GUI
windows, the Control/Monitor Window is organized into "Sections"
that
can be opened or closed depending on whether you are interested in
the
details they contain. To open a section, click on its title
bar. Click in the same location to close it.
"Input File" Section
This section has an editable window containing the text of the
DiFX
"input" file - the file read by difx
that governs a run. There is also a field showing the full
path
to the input file on the DiFX host, and two buttons:
- Save uploads the exact
context of the text window to the DiFX host and places it in the
location given by the full path. This can be done if any
hand-editing changes have been made to the Input file (see the
warning
below about edits!). Unless this button is pushed, any
editing
changes you have made will have no effect.
- Refresh obtains the
contents of the file (given by the full path) on the DiFX host
and puts
it in the editable text window. This can be done if you
make
changes that you don't like or are unsure of. All of your
changes
will be lost. Note that "Refresh" can only undo changes
made
since that last "Save".
Warning: The text
field
allows you to edit the Input file, but remember that difx depends on this file to
run
properly. Because the original is automatically generated
there
is not a lot of format-checking of the content or clear and concise
error reporting if anything you put in there is incorrect - difx may just blow up.
Don't
mess around with this file unless you really know what you are
doing.
"Calc File" Section
This section is identical to the "Input File" section except that
it
holds the "calc" file (used for...uh...what is it used
for?). The
same controls are available and the same warnings apply.
"Machines List" Section
The Machines List section provides controls that can be used to
select the data sources and processors used to process the
job.
Selections are used to produce "Machines" and "Threads" files on
the
DiFX host which are used by mpirun to control
multiprocessing. The .machines and .threads files can be
generated using the "Apply" button, and then examined in the "Machines File" and "Threads File" sections as
outlined
below. Alternatively the "Apply" button can be ignored and
the
.machines and .threads files will be generated automatically when
the
job is started.
The Machines List Section has two primary panels which govern
"Data
Sources" (which machines are used as the source of data for each
antenna involved in a correlation) and "Processors" (which
machines are
devoted to round-robin DiFX processing).
Data Sources
The purpose of the Data Sources section is to allow the user to
select the machines that will serve as the "source" of data for
each
station involved in a correlation. For "MODULE" data types
the
data source will be the Mark5 unit that contains the required
module. For "FILE" data types the data source will be the
machine
that reads the data files off of disk. A "NETWORK" data
source is
the machine that serves as the network destination of a data
stream
from a remote source. Each station involved in a correlation
can
have only one data source machine assigned to it, and ideally each
station will have a different machine assigned to it.
The Data Sources panel includes a section for each station.
These sections differ somewhat depending on the type of data
source
being used. Different stations can use different data source
types, however for the moment it is assumed that all of the data
for a
particular station will be using the same data source type.
The following image shows the Data Sources panel for a two-station
correlation. The first station uses a FILE data source, and
the
second station uses a MODULE data source:
For each data source a pull-down menu is provided from which any
appropriate available machine can be selected. This is the
actual
data source that will be included in the resulting .machines
file. Following it is a checkbox labelled "Skip" that can be
used
to remove the station from the job (see
Removing Stations From a
Job
below), a column for the station involved (two letter
abbreviation),
the format for the data, and the data source type.
Subsequent
columns differ based on the data source type.
Initial Data Source Choice
When the Job Control Window is first opened, the GUI will have
chosen
what it feels are logical machine picks for each data
source. For
module data, a MarkV with the appropriate module will be used if
it can
be found. The machine for file data will be chosen based on
several user settings (see
Data Source
Defaults
in the Job Processing Settings of the Settings Window), an option
among
which is a specific assignment based on path.
FILE Data Sources
A file can be read by any data source that has access to the file,
thus
the source selection pull-down menu provides everything seen on the
DiFX cluster, including processors and Mark5 units (some of these
machines may not have access to the storage system containing the
file,
but unfortunately the GUI has no way of knowing this).
The last column of a FILE data source line shows the first in the
list of files that contain the data for the antenna (usually one
file
for each scan). This list can be expanded to see all of the
files. The names of the files are obtained from the "DATA
TABLE"
section of the .input file.
If the "DATA TABLE" section
of
the .input file does not contain anything (this can happen of the
data
for a station were listed in a File List
and vex2difx failed to
find
appropriate data files), the data are considered "missing" and the
data
source will be outlined in red.
This job will not process correctly, however there is a way to
repair
it - see Removing Stations
From a Job below.
MODULE Data Sources
A module data source can only be a machine capable of reading a
module
- a Mark5 unit. The pull down menu for a module data type
includes only the Mark5 machines that can be seen on the DiFX
cluster.
The VSNs of modules installed in each Mark5 are
listed
in the pull down menu - the Mark5 unit containing the VSN required
for
the experiment and antenna should be chosen as the data
source.
If the required VSN is not installed in the current data source
the VSN
name will be colored red, as in the
image
above. If the data source does contain the correct module
the VSN
name will be colored blue (see the
previous images).
NETWORK Data Sources
Currently a work in progress.
Choose
Data Source Based on Module
- When this option is checked, the GUI will attempt to pick
Mark5
data sources that contain the modules required for the
job. If
the proper modules are not present they will appear in red in
the data
source list. If you are not using Mark5 modules as your
data
source this option will have no effect. I might get rid of
this
check box - instead make this behavior a quiet and constant
default.
Removing Stations
From a
Job
- It may be desirable to run a job without one of its
associated stations. The most obvious situation where this
would
be the case would be when the data for a station was "missing"
for some
reason. In the example pictured below, the data source for
the
station "KE" is missing because the .input file had no data
associated
with it (it would have been "D/STREAM 1" in the .input
file).
Running DiFX on this job will result in an error. To draw
attention to this situation, the Data Source menu has outlined
the KE
station data in red.
To remove the missing station from the experiment (or to remove any
other station, if desired), check the "Skip" box associated with it,
and then click "Rebuild Job". The job can then be re-run
normally.
What is the GUI Doing
When it Rebuilds a Job?
To "rebuild" a job that runs with a different set of stations, the
GUI
has to
create an entirely new set of job-specific files, including the
.input
file, the
.calc file, and some others. This is done by creating a new,
job-specific .v2d file (based on the .v2d file for the experiment)
with
the station selections, then running vex2difx on it to create the new job files.
Before running vex2difx,
existing files are first renamed with ".0" appended to their
original
name for the first rebuild, ".1" for the second rebuild, etc, so
nothing is thrown away. The list of files for a job named "JOB
NAME" that are renamed (if they exist) are [JOB NAME].calc, [JOB
NAME].difxlog, [JOB NAME].flag, [JOB NAME].im, [JOB NAME].difx (a
directory), [JOB NAME].input, [JOB NAME].machines, [JOB
NAME].threads,
and [MANGLED JOB NAME].v2d (from a previous rebuild). The
"MANGED
JOB NAME" of the .v2d file matches the JOB NAME of other files
except
that all underscores have been replaced by hyphens (vex2difx does not permit
underscores in the .v2d file name).
Automating
the Rebuild Procedure
When data for a station are missing from a job it is often the case
that the job is part of a large, many-job experiment, where, for
whatever reason, some portion of the data for a scheduled station
does
not exist. In this case it may be many jobs that will have to
be
rebuilt and re-run. The GUI scheduling system can be set to
recognize when a scheduled job fails to run due to missing data,
rebuild the job with the offending station omitted, and re-run the
job
automatically. In fact, this is the default behavior of the
scheduler. See Running
Jobs With the Scheduler and Scheduler Settings.
Processors
- Eliminate Non-Responding Processors
- When guiServer on the DiFX host creates the Machines
and
Threads files, it will run a quick check on each processor and
data
source (if applicable) to make sure it has the proper
permissions to
execute an mpirun on the specific host. If a
processor
fails this test and this option is selected, it will be
automatically
removed from the list of processors used on the job.
- Eliminate Processors Over ____ % Busy
- This option will de-select any processors that have more than
a
given percentage of their CPU time consumed.
General Guidelines
- In general you have complete flexibility in which machines you
choose to do what, although the GUI will try to alert you if you
request something that won't work. Given that, there are a
number
of ways you can wisely choose your machines such that your
correlation
processing is more efficient.
With some exceptions (such as situations where you have
no
choice) a machine should not appear in both the Data Sources and
Processors list.
Each antenna should, if possible, have an independent
data
source associated with it. The nature of correlation
requires
that data will be required for all antennas simultaneously
and
having more than one antenna on a data source will cause that
source to
swap inefficiently between files.
If possible the head node should have a dedicated
machine
(not just a single thread on a machine). The head node
should not
appear in the list of processors.
If a processor machine has n cores, assign at
most n
- 1 threads to it. The GUI will try to do this
automatically.
"Machines File" Section
Slightly different - changes to the "Machines List" settings will
cause the regeneration of this file.
"Threads File" Section
more of the same
"Run Controls" Section
This section contains the controls for starting and stopping
jobs.
Status Bar
The Status Bar contains the current run state of the job.
Progress and some simple information are included. This
section
has no label, and cannot be opened or closed. It resides
just
below the "Run Controls" section.
Real Time Monitors
The Real Time Monitors section can be used to produce plots of
job
output data as they are produced (reasonably close to real
time).
Some first-level analysis can also be done. The primary aim
of
these plots is not to produce final, reduced results, but to
quickly
give the user an idea whether a job was a success, or in what ways
its
parameters need to be adjusted to make it better.
Message Monitor
The message monitor captures and displays all messages that
relate
to the job.