source: tags/release-1.0.7/docs/executionFlow.tex

\secA{What \duchamp\ is doing}
\label{sec-flow}

The execution flow of \duchamp\ is detailed here, indicating the main
algorithmic steps that are used. The program is written in C/C++ and
makes use of the \textsc{cfitsio}, \textsc{wcslib} and \textsc{pgplot}
libraries.

\secB{Image input}
\label{sec-input}

The cube is read in using basic \textsc{cfitsio} commands, and stored
as an array in a special C++ class. This class keeps track of the list
of detected objects, as well as any reconstructed arrays that are made
(see \S\ref{sec-recon}). The World Coordinate System
(WCS)\footnote{This is the information necessary for translating the
pixel locations to quantities such as position on the sky, frequency,
velocity, and so on.} information for the cube is also obtained from
the FITS header by \textsc{wcslib} functions \citep{greisen02,
calabretta02}, and this information, in the form of a \texttt{wcsprm}
structure, is also stored in the same class.

A sub-section of an image can be requested via the \texttt{subsection}
parameter -- this can be a good idea if the cube has very noisy edges,
which may produce many spurious detections. The generalised form of
the subsection that is used by \textsc{cfitsio} is
\texttt{[x1:x2:dx,y1:y2:dy,z1:z2:dz,...]}, such that the x-coordinates run
from \texttt{x1} to \texttt{x2} (inclusive), with steps of
\texttt{dx}. The step value can be omitted (so a subsection of the
form \texttt{[2:50,2:50,10:1000]} is still valid). \duchamp\ does not
make use of any step value present in the subsection string, and any
that are present are removed before the file is opened.

If one wants the full range of a coordinate then replace the range
with an asterisk, \eg \texttt{[2:50,2:50,*]}. If one wants to use a
subsection, one must set \texttt{flagSubsection = 1}. A complete
description of the section syntax can be found at the \textsc{fitsio}
web site%
\footnote{%
\href%
{http://heasarc.gsfc.nasa.gov/docs/software/fitsio/c/c\_user/node90.html}%
{http://heasarc.gsfc.nasa.gov/docs/software/fitsio/c/c\_user/node90.html}}.
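
As a rough illustration of the step-value handling described above (a
sketch only, not \duchamp's actual parsing code; the function name is
hypothetical), a subsection string could be cleaned of any \texttt{:dx}
step fields like this:
\begin{verbatim}
// Illustrative sketch: remove any step values (":dx") from a cfitsio-style
// subsection string such as "[1:100:2,1:100,*]", leaving "[1:100,1:100,*]".
#include <sstream>
#include <string>

std::string removeStepValues(const std::string& section)
{
    std::string inner = section.substr(1, section.size() - 2);  // strip [ and ]
    std::istringstream axes(inner);
    std::string axis, result = "[";
    while (std::getline(axes, axis, ',')) {
        size_t first  = axis.find(':');
        size_t second = axis.find(':', first + 1);
        if (second != std::string::npos) axis.erase(second);    // drop ":dx"
        result += axis + ",";
    }
    result.back() = ']';                                         // close the section
    return result;
}
\end{verbatim}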

\secB{Image modification}
\label{sec-modify}

Several modifications to the cube can be made that improve the
execution and efficiency of \duchamp\ (their use is optional, governed
by the relevant flags in the parameter file).

\secC{BLANK pixel removal}

If the imaged area of a cube is non-rectangular (see the example in
Fig.~\ref{fig-moment}, a cube from the HIPASS survey), BLANK pixels are
used to pad it out to a rectangular shape. The value of these pixels
is given by the FITS header keywords BLANK, BSCALE and BZERO. While
these pixels make the image a nice shape, they will unnecessarily
interfere with the processing (as well as taking up needless
memory). The first step, then, is to trim them from the edge. This is
done when the parameter \texttt{flagBlankPix=true}. If the above
keywords are not present, the user can specify the BLANK value by the
parameter \texttt{blankPixValue}.
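
For reference, the physical value used for these BLANK pixels follows
the standard FITS scaling convention:
\[
\mathrm{value}_{\rm BLANK} = \mathrm{BZERO} + \mathrm{BSCALE}\times\mathrm{BLANK}.
\]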

Removing BLANK pixels is particularly important for the reconstruction
step, as large numbers of BLANK pixels on the edges will smooth out
features in the wavelet calculation stage. The trimming will also
reduce the size of the cube's array, speeding up the execution. The
amount of trimming is recorded, and these pixels are added back in once
the source-detection is completed (so that quoted pixel positions are
applicable to the original cube).

Rows and columns are trimmed one at a time until the first non-BLANK
pixel is reached, so that the image remains rectangular. In practice,
this means that there will be some BLANK pixels left in the trimmed
image (if the non-BLANK region is non-rectangular). However, these are
ignored in all further calculations done on the cube.

\secC{Baseline removal}

Second, the user may request the removal of baselines from the
spectra, via the parameter \texttt{flagBaseline}. This may be
necessary if there is a strong baseline ripple present, which can
result in spurious detections at the high points of the ripple. The
baseline is calculated from a wavelet reconstruction procedure (see
\S\ref{sec-recon}) that keeps only the two largest scales. This is
done separately for each spatial pixel (\ie for each spectrum in the
cube), and the baselines are stored and added back in before any
output is done. In this way the quoted fluxes and displayed spectra
are as one would see from the input cube itself -- even though the
detection (and reconstruction if applicable) is done on the
baseline-removed cube.

The presence of very strong signals (for instance, masers at several
hundred Jy) could affect the determination of the baseline, and would
lead to a large dip centred on the signal in the baseline-subtracted
spectrum. To prevent this, the signal is trimmed prior to the
reconstruction process at a standard threshold of $8\sigma$ above
the mean. The baseline determined should thus be representative of
the true, signal-free baseline. Note that this trimming is only a
temporary measure which does not affect the source-detection.
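
A minimal sketch of this clipping step (illustrative only; the function
name is hypothetical, and the robust mean and $\sigma$ are assumed to
have been estimated as described in the statistics note below) might
look like:
\begin{verbatim}
// Illustrative sketch: trim a spectrum at 8 sigma above the (robust) mean
// before the baseline is estimated, so that strong signals do not produce
// a dip in the baseline-subtracted spectrum.
#include <vector>

std::vector<float> clipForBaseline(const std::vector<float>& spectrum,
                                   float mean, float sigma)
{
    const float limit = mean + 8.0f * sigma;   // the 8-sigma trimming threshold
    std::vector<float> clipped = spectrum;
    for (float& value : clipped)
        if (value > limit) value = limit;      // temporary trim only
    return clipped;
}
\end{verbatim}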

\secC{Ignoring bright Milky Way emission}

Finally, a single set of contiguous channels can be ignored -- these
may exhibit very strong emission, such as that from the Milky Way as
seen in extragalactic \hi\ cubes (hence the references to ``Milky
Way'' in relation to this task -- apologies to Galactic
astronomers!). Such dominant channels will produce many detections
that are unnecessary, uninteresting (if one is interested in
extragalactic \hi) and large (in size and hence in memory usage), and
so will slow the program down and detract from the interesting
detections.

The use of this feature is controlled by the \texttt{flagMW}
parameter, and the exact channels concerned can be set by the
user (using \texttt{minMW} and \texttt{maxMW}, which give an
inclusive range of channels). When employed, these channels are
ignored for the searching, and the scaling of the spectral output (see
Fig.~\ref{fig-spect}) will not take them into account. They will be
present in the reconstructed array, however, and so will be included
in the saved FITS file (see \S\ref{sec-reconIO}). When the final
spectra are plotted, the range of channels covered by these parameters
is indicated by a green hashed box.
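
The channel test implied by these parameters is simply the following
(a trivial illustration; the function name is hypothetical):
\begin{verbatim}
// A channel z is ignored in the search when flagMW is set and z lies in
// the inclusive range [minMW, maxMW].
bool isMilkyWayChannel(int z, bool flagMW, int minMW, int maxMW)
{
    return flagMW && (z >= minMW) && (z <= maxMW);
}
\end{verbatim}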

\secB{Image reconstruction}
\label{sec-recon}

The user can direct \duchamp\ to reconstruct the data cube using the
\atrous\ wavelet procedure. A good description of the procedure can be
found in \citet{starck02:book}. The reconstruction is an effective way
of removing a lot of the noise in the image, allowing one to search
reliably to fainter levels, and reducing the number of spurious
detections. This is an optional step, but one that greatly enhances
the source-detection process, with the trade-off that it can be
relatively time- and memory-intensive.

\secC{Algorithm}

The steps in the \atrous\ reconstruction are as follows (a small
illustrative sketch is given after the list):
\begin{enumerate}
\item The reconstructed array is set to 0 everywhere.
\item The input array is discretely convolved with a given filter
  function. This is determined from the parameter file via the
  \texttt{filterCode} parameter -- see Appendix~\ref{app-param} for
  details on the filters available.
\item The wavelet coefficients are calculated by taking the difference
  between the convolved array and the input array.
\item If the wavelet coefficients at a given point are above the
  requested threshold (given by \texttt{snrRecon} as the number of
  $\sigma$ above the mean and adjusted to the current scale -- see
  Appendix~\ref{app-scaling}), add these to the reconstructed array.
\item The separation of the filter coefficients is doubled. (Note that
  this step provides the name of the procedure\footnote{\atrous\ means
  ``with holes'' in French.}, as gaps or holes are created in the
  filter coverage.)
\item The procedure is repeated from step 2, using the convolved array
  as the input array.
\item Continue until the required maximum number of scales is reached.
\item Add the final smoothed (\ie convolved) array to the
  reconstructed array. This provides the ``DC offset'', as each of the
  wavelet coefficient arrays will have zero mean.
\end{enumerate}
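
The following one-dimensional sketch shows the structure of this loop.
It is illustrative only: the $B_3$-spline filter coefficients, the
boundary treatment and the fixed threshold are assumptions, and in
\duchamp\ the threshold is recomputed for each scale from the robust
statistics described below.
\begin{verbatim}
// Illustrative 1-D "a trous" reconstruction loop (not Duchamp's actual code).
#include <cmath>
#include <vector>

std::vector<double> atrousReconstruct(const std::vector<double>& input,
                                      int numScales, double threshold)
{
    const double filter[5] = {1.0/16, 1.0/4, 3.0/8, 1.0/4, 1.0/16}; // B3 spline
    const int N = static_cast<int>(input.size());
    std::vector<double> recon(N, 0.0);       // step 1: reconstruction set to zero
    std::vector<double> current = input;     // array to be convolved
    int spacing = 1;                         // separation of filter coefficients

    for (int scale = 1; scale <= numScales; ++scale) {
        // Step 2: discrete convolution with the dilated filter.
        std::vector<double> smoothed(N, 0.0);
        for (int i = 0; i < N; ++i) {
            for (int j = -2; j <= 2; ++j) {
                int k = i + j * spacing;
                if (k < 0) k = 0;            // simplest boundary treatment: clamp
                if (k >= N) k = N - 1;
                smoothed[i] += filter[j + 2] * current[k];
            }
        }
        // Steps 3 & 4: wavelet coefficients, kept only above the threshold.
        for (int i = 0; i < N; ++i) {
            double w = current[i] - smoothed[i];
            if (std::fabs(w) > threshold) recon[i] += w;
        }
        spacing *= 2;                        // step 5: double the separation
        current = smoothed;                  // step 6: convolved array becomes input
    }                                        // step 7: loop over the scales
    for (int i = 0; i < N; ++i) recon[i] += current[i];  // step 8: add DC offset
    return recon;
}
\end{verbatim}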

The reconstruction has at least two iterations. The first iteration
makes an initial pass at the wavelet reconstruction (the process
outlined in the eight steps above), but the residual array will likely
have some structure still in it, so the wavelet filtering is done on
the residual, and any significant wavelet terms are added to the final
reconstruction. This step is repeated until the change in the measured
standard deviation of the background (see note below on the evaluation
of this quantity) is less than some fiducial amount.

It is important to note that the \atrous\ decomposition is an example
of a ``redundant'' transformation. If no thresholding is performed,
the sum of all the wavelet coefficient arrays and the final smoothed
array is identical to the input array. The thresholding thus removes
only the unwanted structure in the array.

Note that any BLANK pixels that are still in the cube will not be
altered by the reconstruction -- they will be left as BLANK so that
the shape of the valid part of the cube is preserved.

\secC{Note on Statistics}

The correct calculation of the reconstructed array needs good
estimators of the underlying mean and standard deviation of the
background noise distribution. These statistics are estimated using
robust methods, to avoid corruption by strong outlying points. The
mean of the distribution is actually estimated by the median, while
the median absolute deviation from the median (MADFM) is calculated
and corrected assuming Gaussianity to estimate the underlying standard
deviation $\sigma$. The Gaussianity (or Normality) assumption is
critical, as the MADFM does not give the same value as the usual rms
or standard deviation value -- for a normal distribution
$N(\mu,\sigma)$ we find MADFM$=0.6744888\sigma$. Since this ratio is
corrected for, the user need only think in the usual multiples of
$\sigma$ when setting \texttt{snrRecon}. See Appendix~\ref{app-madfm}
for a derivation of this value.
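
These estimators amount to the following short calculation (a rough
sketch rather than \duchamp's actual routines; the even-length median
here is approximate):
\begin{verbatim}
// Robust noise estimate: median for the mean, MADFM corrected by 0.6744888
// for the standard deviation.
#include <algorithm>
#include <cmath>
#include <vector>

double median(std::vector<double> values)
{
    std::nth_element(values.begin(), values.begin() + values.size()/2, values.end());
    return values[values.size()/2];          // approximate for even-length arrays
}

double robustSigma(const std::vector<double>& data)
{
    const double med = median(data);
    std::vector<double> absDev(data.size());
    for (size_t i = 0; i < data.size(); ++i)
        absDev[i] = std::fabs(data[i] - med);
    return median(absDev) / 0.6744888;       // convert MADFM to Gaussian sigma
}
\end{verbatim}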

When thresholding the different wavelet scales, the value of $\sigma$
as measured from the wavelet array needs to be scaled to account for
the increased amount of correlation between neighbouring pixels (due
to the convolution). See Appendix~\ref{app-scaling} for details on
this scaling.

\secC{User control of reconstruction parameters}

The most important parameter for the user to select in relation to the
reconstruction is the threshold for each wavelet array. This is set
using the \texttt{snrRecon} parameter, and is given as a multiple of
the rms (estimated by the MADFM) above the mean (which for the wavelet
arrays should be approximately zero). Several other parameters can
also be altered to affect the outcome of the reconstruction.

By default, the cube is reconstructed in three dimensions, using a
3-dimensional filter and 3-dimensional convolution. This can be
altered, however, using the parameter \texttt{reconDim}. If set to 1,
this means the cube is reconstructed by considering each spectrum
separately, whereas \texttt{reconDim=2} will mean the cube is
reconstructed by doing each channel map separately. The merits of
these choices are discussed in \S\ref{sec-notes}, but it should be
noted that a 2-dimensional reconstruction can be susceptible to edge
effects if the spatial shape of the pixel array is not rectangular.

The user can also select the minimum scale to be used in the
reconstruction. The first scale exhibits the highest frequency
variations, and so ignoring this one can sometimes be beneficial in
removing excess noise. The default is to use all scales
(\texttt{minscale = 1}).

Finally, the filter that is used for the convolution can be selected
by using \texttt{filterCode} and the relevant code number -- the
choices are listed in Appendix~\ref{app-param}. A larger filter will
give a better reconstruction, but take longer and use more memory when
executing. When multi-dimensional reconstruction is selected, this
filter is used to construct a 2- or 3-dimensional equivalent.

\secB{Input/Output of reconstructed arrays}
\label{sec-reconIO}

The reconstruction stage can be relatively time-consuming,
particularly for large cubes and reconstructions in 3-D. To get around
this, \duchamp\ provides a shortcut to allow users to perform multiple
searches (\eg with different thresholds) on the same reconstruction
without calculating the reconstruction each time.

The first step is to choose to save the reconstructed array as a FITS
file by setting \texttt{flagOutputRecon = true}. The file will be
saved in the same directory as the input image, so the user needs to
have write permissions for that directory.

The filename will be derived from the input filename, with extra
information detailing the reconstruction that has been done. For
example, suppose \texttt{image.fits} has been reconstructed using a
3-dimensional reconstruction with filter \#2, thresholded at $4\sigma$
using all scales. The output filename will then be
\texttt{image.RECON-3-2-4-1.fits} (\ie it uses the four parameters
relevant for the \atrous\ reconstruction as listed in
Appendix~\ref{app-param}). The new FITS file will also have these
parameters as header keywords. If a subsection of the input image has
been used (see \S\ref{sec-input}), the format of the output filename
will be \texttt{image.sub.RECON-3-2-4-1.fits}, and the subsection that
has been used is also stored in the FITS header.
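
As a purely illustrative sketch (the function and its arguments are
hypothetical, not \duchamp's code), the output name could be assembled
as follows:
\begin{verbatim}
// Hypothetical helper forming the RECON filename from the four atrous
// parameters: reconDim, filterCode, snrRecon and minscale.
#include <cstdio>
#include <string>

std::string reconFilename(const std::string& base, int reconDim, int filterCode,
                          float snrRecon, int minscale)
{
    char suffix[64];
    std::snprintf(suffix, sizeof(suffix), ".RECON-%d-%d-%g-%d.fits",
                  reconDim, filterCode, snrRecon, minscale);
    return base + suffix;   // e.g. "image" becomes "image.RECON-3-2-4-1.fits"
}
\end{verbatim}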

Likewise, the residual image, defined as the difference between the
input and reconstructed arrays, can also be saved in the same manner
by setting \texttt{flagOutputResid = true}. Its filename will be the
same as above, with \texttt{RESID} replacing \texttt{RECON}.

If a reconstructed image has been saved, it can be read in and used
instead of redoing the reconstruction. To do so, the user should set
\texttt{flagReconExists = true}. The user can indicate the name of the
reconstructed FITS file using the \texttt{reconFile} parameter, or, if
this is not specified, \duchamp\ searches for the file with the name
as defined above. If the file is not found, the reconstruction is
performed as normal. Note that to do this, the user needs to set
\texttt{flagAtrous = true} (obviously, if this is \texttt{false}, the
reconstruction is not needed).

\secB{Smoothing the cube}
\label{sec-smoothing}

An alternative to doing the wavelet reconstruction is to Hanning
smooth the cube. This technique can be useful in reducing the noise
level slightly (at the cost of making neighbouring pixels correlated
and blurring any signal present), and is particularly well suited to
the case where a particular signal width is believed to be present in
the data. It is also substantially faster than the wavelet
reconstruction.

The cube is smoothed only in the spectral domain. That is, each
spectrum is independently smoothed, and the smoothed spectra are then
put together to form the smoothed cube. This is then treated in the
same way as the reconstructed cube, and is used for the searching
algorithm (see below). Note that if both the reconstruction and
smoothing options are requested, the reconstruction will take
precedence and the smoothing will \emph{not} be done.

There is only one parameter necessary to define the degree of
smoothing -- the Hanning width $a$ (given by the user parameter
\texttt{hanningWidth}). The coefficients of the Hanning filter are
defined by
\[
\frac{1+\cos(\pi x/a)}{2},\quad -\frac{a+1}{2}\leq x \leq \frac{a+1}{2},
\]
and zero elsewhere. Note that the width specified must be an odd
integer (if the parameter provided is even, it is incremented by one).
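
A minimal sketch of these coefficients (illustrative only; the function
name is hypothetical, and whether the filter is normalised to unit sum
is not specified here) is:
\begin{verbatim}
// Hanning filter coefficients of odd width a, following the definition above:
// c(x) = (1 + cos(pi x / a)) / 2 for x = -(a-1)/2 ... (a-1)/2.
#include <cmath>
#include <vector>

std::vector<double> hanningCoefficients(int a)
{
    if (a % 2 == 0) ++a;                          // width must be an odd integer
    std::vector<double> coeffs(a);
    for (int i = 0; i < a; ++i) {
        double x = i - (a - 1) / 2.0;             // offset from the filter centre
        coeffs[i] = 0.5 * (1.0 + std::cos(M_PI * x / a));
    }
    return coeffs;
}
\end{verbatim}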

The user is able to save the smoothed array in exactly the same manner
as for the reconstructed array -- set \texttt{flagOutputSmooth =
true}, and then the smoothed array will be saved in
\texttt{image.SMOOTH-a.fits}, where \texttt{a} is replaced by the
Hanning width used. Similarly, a saved file can be read in by setting
\texttt{flagSmoothExists = true} and either specifying a file to be
read with the \texttt{smoothFile} parameter or relying on \duchamp\ to
find the file with the name as given above.

\secB{Searching the image}
\label{sec-detection}

The image is searched for detections in two ways: spectrally (a
1-dimensional search in the spectrum in each spatial pixel), and
spatially (a 2-dimensional search in the spatial image in each
channel). In both cases, the algorithm finds connected pixels that are
above the user-specified threshold. In the case of the spatial image
search, the algorithm of \citet{lutz80} is used to raster-scan through
the image and connect groups of pixels on neighbouring rows.

Note that this algorithm cannot be applied directly to a 3-dimensional
case, as it requires that objects are completely nested in a row: that
is, if you are scanning along a row, and one object finishes and
another starts, you know that you will not get back to the first one
(if at all) until the second is completely finished for that
row. Three-dimensional data does not have this property, which is why
we break up the searching into 1- and 2-dimensional cases.

The basic idea behind detection is to locate sets of contiguous voxels
that lie above some threshold. \duchamp\ now calculates one threshold
for the entire cube (previous versions calculated thresholds for each
spectrum and image). This enables calculation of signal-to-noise
ratios for each source (see Section~\ref{sec-output} for details). The
user can manually specify a value (using the parameter
\texttt{threshold}) for the threshold, which will override the
calculated value. Note that this only applies for the first of the two
cases discussed below -- the FDR case ignores any manually-set
threshold value.

The determination of the threshold is done in one of two ways. The
first way is a simple sigma-clipping, where a threshold is set at a
fixed number $n$ of standard deviations above the mean, and pixels
above this threshold are flagged as detected. The value of $n$ is set
with the parameter \texttt{snrCut}. As before, the value of the
standard deviation is estimated by the MADFM, and corrected by the
ratio derived in Appendix~\ref{app-madfm}.
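
That is, the threshold in this case takes the form
\[
T = \mu_{\rm est} + n\,\sigma_{\rm est},
\]
where $\mu_{\rm est}$ and $\sigma_{\rm est}$ are the robust estimates
described above (the median, and the MADFM divided by 0.6744888), and
$n$ is given by \texttt{snrCut}.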

The second method uses the False Discovery Rate (FDR) technique
\citep{miller01,hopkins02}, whose basis we briefly detail here. The
false discovery rate (given by the number of false detections divided
by the total number of detections) is fixed at a certain value
$\alpha$ (\eg $\alpha=0.05$ implies 5\% of detections are false
positives). In practice, an $\alpha$ value is chosen, and the ensemble
average FDR (\ie $\langle{\rm FDR}\rangle$) when the method is used will
be less than $\alpha$. One calculates $p$ -- the probability,
assuming the null hypothesis is true, of obtaining a test statistic as
extreme as the pixel value (the observed test statistic) -- for each
pixel, and sorts them in increasing order. One then calculates $d$
where
\[
d = \max_j \left\{ j : P_j < \frac{j\alpha}{c_N N} \right\},
\]
and then rejects all hypotheses whose $p$-values are less than or
equal to $P_d$. (So a pixel with $P_i<P_d$ will be rejected even if
$P_i \geq i\alpha/(c_N N)$.) Note that ``reject hypothesis'' here means
``accept the pixel as an object pixel'' (\ie we are rejecting the null
hypothesis that the pixel belongs to the background).

The $c_N$ values here are normalisation constants that depend on the
correlated nature of the pixel values. If all the pixels are
uncorrelated, then $c_N=1$. If $N$ pixels are correlated, then their
tests will be dependent on each other, and so $c_N = \sum_{i=1}^N
i^{-1}$. \citet{hopkins02} consider real radio data, where the pixels
are correlated over the beam. In this case the sum is made over the
$N$ pixels that make up the beam. The value of $N$ is calculated from
the FITS header (if the correct keywords -- BMAJ, BMIN -- are not
present, the size of the beam is taken from the parameter
\texttt{beamSize}).
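
The calculation of the cutoff $P_d$ can be sketched as follows (an
illustrative rendering of the formula above, not \duchamp's code;
\texttt{beamN} is assumed to be the number of beam pixels over which
the $c_N$ sum is made):
\begin{verbatim}
// Find the FDR p-value cutoff P_d: the largest j with P_j < j*alpha/(c_N * N).
// Pixels whose p-value is <= the returned cutoff are accepted as object pixels.
#include <algorithm>
#include <vector>

double fdrCutoff(std::vector<double> pvalues, double alpha, int beamN)
{
    std::sort(pvalues.begin(), pvalues.end());        // P_1 <= P_2 <= ... <= P_N
    double cN = 0.0;
    for (int i = 1; i <= beamN; ++i) cN += 1.0 / i;   // c_N = sum_{i=1}^{N} 1/i
    const double N = static_cast<double>(pvalues.size());
    double cutoff = 0.0;                              // this will become P_d
    for (std::size_t j = 1; j <= pvalues.size(); ++j)
        if (pvalues[j - 1] < j * alpha / (cN * N))
            cutoff = pvalues[j - 1];                  // keep the largest valid j
    return cutoff;
}
\end{verbatim}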

The theory behind the FDR method implies a direct connection between
the choice of $\alpha$ and the fraction of detections that will be
false positives. However, due to the merging process, this direct
connection is lost when looking at the final number of detections --
see discussion in \S\ref{sec-notes}. The effect is that the number of
false detections will be less than indicated by the $\alpha$ value
used.

If the cube has been reconstructed or smoothed, the residuals (defined
in the sense of original $-$ reconstruction) are used to estimate the
noise parameters of the cube. Otherwise they are estimated directly
from the cube itself. In both cases, robust estimators are used.

Detections must have a minimum number of pixels to be counted. This
minimum number is given by the input parameters \texttt{minPix} (for
2-dimensional searches) and \texttt{minChannels} (for 1-dimensional
searches).

Finally, the search only looks for positive features. If one is
interested instead in negative features (such as absorption lines),
set the parameter \texttt{flagNegative = true}. This will invert the
cube (\ie multiply all pixels by $-1$) prior to the search, and then
re-invert the cube (and the fluxes of any detections) after searching
is complete. All outputs are done in the same manner as normal, so
that fluxes of detections will be negative.

\secB{Merging detected objects}
\label{sec-merger}

The searching step produces a list of detected objects that will have
many repeated detections of a given object -- for instance, spectral
detections in adjacent pixels of the same object and/or spatial
detections in neighbouring channels. These are then combined in an
algorithm that matches all objects judged to be ``close'', according
to one of two criteria.

One criterion is to define two thresholds -- one spatial and one in
velocity -- and say that two objects should be merged if there is at
least one pair of pixels that lie within these threshold distances of
each other. These thresholds are specified by the parameters
\texttt{threshSpatial} and \texttt{threshVelocity} (in units of pixels
and channels respectively).

Alternatively, the spatial requirement can be changed to say that
there must be a pair of pixels that are \emph{adjacent} -- a stricter,
but perhaps more realistic requirement, particularly when the spatial
pixels have a large angular size (as is the case for \hi\
surveys). This method can be selected by setting the parameter
\texttt{flagAdjacent} to 1 (\ie \texttt{true}) in the parameter
file. The velocity thresholding is done in the same way as the first
option.
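
The ``closeness'' test for a pair of pixels can be sketched as below.
This is illustrative only: the function name is hypothetical, and
whether the spatial threshold is applied to the Euclidean separation or
to each axis separately is an assumption.
\begin{verbatim}
// Decide whether two pixels, separated by (dx, dy) spatially and dz channels,
// are close enough for their parent objects to be merged.
#include <cmath>

bool pixelsClose(double dx, double dy, double dz,
                 double threshSpatial, double threshVelocity, bool flagAdjacent)
{
    const bool spatialOK = flagAdjacent
        ? (std::fabs(dx) <= 1.0 && std::fabs(dy) <= 1.0)  // strictly adjacent
        : (std::hypot(dx, dy) <= threshSpatial);          // within spatial threshold
    const bool velocityOK = std::fabs(dz) <= threshVelocity;
    return spatialOK && velocityOK;
}
\end{verbatim}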

Once the detections have been merged, they may be ``grown''. This is a
process of increasing the size of the detection by adding adjacent
pixels that are above some secondary threshold. This threshold is
lower than the one used for the initial detection, but above the noise
level, so that faint pixels are only detected when they are close to a
bright pixel. The value of this threshold is set by the input
parameter \texttt{growthCut}, with a default value of
$1.5\sigma$. The use of the growth algorithm is controlled by the
\texttt{flagGrowth} parameter, whose default value is
\texttt{false}. If the detections are grown, they are sent through the
merging algorithm a second time, to pick up any detections that now
overlap or have grown over each other.

Finally, to be accepted, the detections must span \emph{both} a
minimum number of channels (to remove any spurious single-channel
spikes that may be present), and a minimum number of spatial
pixels. These numbers, as for the original detection step, are set
with the \texttt{minChannels} and \texttt{minPix} parameters. The
channel requirement means there must be at least one set of
\texttt{minChannels} consecutive channels in the source for it to be
accepted.