% source: trunk/docs/executionFlow.tex @ 167
% Last change: revision 162, checked in by Matthew Whiting
% (Editing of Guide. Mostly minor, with one new parameter added (beamSize).)
\secA{What \duchamp\ is doing}
\label{sec-flow}

The execution flow of \duchamp\ is detailed here, indicating the main
algorithmic steps that are used. The program is written in C/C++ and
makes use of the \textsc{cfitsio}, \textsc{wcslib} and \textsc{pgplot}
libraries.

\secB{Image input}
\label{sec-input}

The cube is read in using basic \textsc{cfitsio} commands, and stored
as an array in a special C++ class. This class keeps track of the list
of detected objects, as well as any reconstructed arrays that are made
(see \S\ref{sec-recon}). The World Coordinate System
(WCS)\footnote{This is the information necessary for translating the
pixel locations to quantities such as position on the sky, frequency,
velocity, and so on.} information for the cube is also obtained from
the FITS header by \textsc{wcslib} functions \citep{greisen02,
calabretta02}, and this information, in the form of a \texttt{wcsprm}
structure, is also stored in the same class.

A sub-section of an image can be requested via the \texttt{subsection}
parameter -- this can be a good idea if the cube has very noisy edges,
which may produce many spurious detections. The generalised form of
the subsection that is used by \textsc{cfitsio} is
\texttt{[x1:x2:dx,y1:y2:dy,z1:z2:dz,...]}, such that the x-coordinates run
from \texttt{x1} to \texttt{x2} (inclusive), with steps of
\texttt{dx}. The step value can be omitted (so a subsection of the
form \texttt{[2:50,2:50,10:1000]} is still valid). \duchamp\ does not
make use of any step value present in the subsection string, and any
that are present are removed before the file is opened.

If one wants the full range of a coordinate, the range can be replaced
with an asterisk, \eg \texttt{[2:50,2:50,*]}. To use a subsection, one
must set \texttt{flagSubsection = 1}. A complete description of the
section syntax can be found at the \textsc{cfitsio} web site%
\footnote{%
\href%
{http://heasarc.gsfc.nasa.gov/docs/software/fitsio/c/c\_user/node90.html}%
{http://heasarc.gsfc.nasa.gov/docs/software/fitsio/c/c\_user/node90.html}}.

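The removal of step values from the subsection string can be sketched as follows. This is an illustrative helper only (the function name is hypothetical, not part of the Duchamp or \textsc{cfitsio} API): it keeps just the \texttt{x1:x2} part of each axis specification.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper (not the actual Duchamp code): strip any step
// values from a cfitsio-style subsection string, since Duchamp ignores
// them. "[2:50:2,2:50,*]" becomes "[2:50,2:50,*]".
std::string removeStepValues(const std::string &subsection)
{
    // Drop the surrounding brackets, then split on commas.
    std::string inner = subsection.substr(1, subsection.size() - 2);
    std::stringstream ss(inner);
    std::string axis, result = "[";
    while (std::getline(ss, axis, ',')) {
        // Keep only the first two colon-separated fields (x1:x2).
        size_t first = axis.find(':');
        size_t second = (first == std::string::npos)
                            ? std::string::npos
                            : axis.find(':', first + 1);
        if (second != std::string::npos) axis.erase(second);
        if (result.size() > 1) result += ',';
        result += axis;
    }
    return result + "]";
}
```

An asterisk or a plain \texttt{x1:x2} range passes through unchanged.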
\secB{Image modification}
\label{sec-modify}

Several modifications to the cube can be made that improve the
execution and efficiency of \duchamp\ (their use is optional, governed
by the relevant flags in the parameter file).

\secC{BLANK pixel removal}

If the imaged area of a cube is non-rectangular (see the example in
Fig.~\ref{fig-moment}, a cube from the HIPASS survey), BLANK pixels are
used to pad it out to a rectangular shape. The value of these pixels
is given by the FITS header keywords BLANK, BSCALE and BZERO. While
these pixels make the image a convenient shape, they will unnecessarily
interfere with the processing (as well as taking up needless
memory). The first step, then, is to trim them from the edges. This is
done when the parameter \texttt{flagBlankPix=true}. If the above
keywords are not present, the user can specify the BLANK value via the
parameter \texttt{blankPixValue}.

Removing BLANK pixels is particularly important for the reconstruction
step, as lots of BLANK pixels on the edges will smooth out features in
the wavelet calculation stage. The trimming will also reduce the size
of the cube's array, speeding up the execution. The amount of trimming
is recorded, and these pixels are added back in once the
source-detection is completed (so that quoted pixel positions are
applicable to the original cube).

Rows and columns are trimmed one at a time until the first non-BLANK
pixel is reached, so that the image remains rectangular. In practice,
this means that there will be some BLANK pixels left in the trimmed
image (if the non-BLANK region is non-rectangular). However, these are
ignored in all further calculations done on the cube.

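The edge-trimming logic can be sketched for a single 2-D plane as below. This is an illustrative simplification (types and names are invented, not Duchamp's internals): rows and columns are peeled off each edge until a non-BLANK pixel is hit, yielding the bounding box of the valid region.

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch only (not Duchamp's actual code): find the
// bounding box of non-BLANK pixels in a 2D image, trimming rows and
// columns from each edge until a non-BLANK pixel is encountered.
struct Bounds { int xmin, xmax, ymin, ymax; };

Bounds trimBlanks(const std::vector<std::vector<float>> &image, float blank)
{
    int ny = image.size(), nx = image[0].size();
    Bounds b{0, nx - 1, 0, ny - 1};
    auto rowBlank = [&](int y) {
        for (int x = 0; x < nx; x++)
            if (image[y][x] != blank) return false;
        return true;
    };
    auto colBlank = [&](int x) {
        for (int y = 0; y < ny; y++)
            if (image[y][x] != blank) return false;
        return true;
    };
    while (b.ymin < b.ymax && rowBlank(b.ymin)) b.ymin++;
    while (b.ymax > b.ymin && rowBlank(b.ymax)) b.ymax--;
    while (b.xmin < b.xmax && colBlank(b.xmin)) b.xmin++;
    while (b.xmax > b.xmin && colBlank(b.xmax)) b.xmax--;
    return b;
}
```

As the text notes, any BLANK pixels remaining inside this bounding box (for a non-rectangular valid region) are simply ignored in later calculations.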
\secC{Baseline removal}

Second, the user may request the removal of baselines from the
spectra, via the parameter \texttt{flagBaseline}. This may be
necessary if there is a strong baseline ripple present, which can
result in spurious detections at the high points of the ripple. The
baseline is calculated from a wavelet reconstruction procedure (see
\S\ref{sec-recon}) that keeps only the two largest scales. This is
done separately for each spatial pixel (\ie for each spectrum in the
cube), and the baselines are stored and added back in before any
output is done. In this way the quoted fluxes and displayed spectra
are as one would see from the input cube itself -- even though the
detection (and reconstruction if applicable) is done on the
baseline-removed cube.

The presence of very strong signals (for instance, masers at several
hundred Jy) could affect the determination of the baseline, and would
lead to a large dip centred on the signal in the baseline-subtracted
spectrum. To prevent this, the signal is trimmed prior to the
reconstruction process at a standard threshold of $8\sigma$ above
the mean. The baseline determined should thus be representative of
the true, signal-free baseline. Note that this trimming is only a
temporary measure which does not affect the source-detection.

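The pre-baseline clipping step can be sketched as follows. This is an assumed simplification, not the Duchamp implementation: it uses plain mean and standard deviation for brevity, whereas Duchamp uses robust (median-based) estimators, and the clipped spectrum would then be fed to the two-largest-scales wavelet baseline fit.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative sketch (not the actual Duchamp code): clip a spectrum
// at nsigma (default 8) above the mean before estimating the
// baseline, so that strong sources do not drag the baseline upwards.
std::vector<float> clipSpectrum(std::vector<float> spec, float nsigma = 8.f)
{
    double sum = 0, sumsq = 0;
    for (float v : spec) { sum += v; sumsq += double(v) * v; }
    double mean = sum / spec.size();
    double sigma = std::sqrt(sumsq / spec.size() - mean * mean);
    float limit = float(mean + nsigma * sigma);
    for (float &v : spec) v = std::min(v, limit);  // trim strong peaks
    return spec;
}
```

The clipping is temporary: only the baseline estimate is derived from the clipped spectrum, and source-detection proceeds on the unclipped data.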
\secC{Ignoring bright Milky Way emission}

Finally, a single set of contiguous channels can be ignored -- these
may exhibit very strong emission, such as that from the Milky Way as
seen in extragalactic \hi\ cubes (hence the references to ``Milky
Way'' in relation to this task -- apologies to Galactic
astronomers!). Such dominant channels will produce many detections
that are unnecessary, uninteresting (if one is interested in
extragalactic \hi) and large (in size and hence in memory usage), and
so will slow the program down and detract from the interesting
detections.

The use of this feature is controlled by the \texttt{flagMW}
parameter, and the exact channels concerned can be set by the
user (using \texttt{maxMW} and \texttt{minMW} -- these give an
inclusive range of channels). When employed, these channels are
ignored for the searching, and the scaling of the spectral output (see
Fig.~\ref{fig-spect}) will not take them into account. They will be
present in the reconstructed array, however, and so will be included
in the saved FITS file (see \S\ref{sec-reconIO}). When the final
spectra are plotted, the range of channels covered by these parameters
is indicated by a green hatched box.

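The channel test implied by these parameters is simple enough to state directly. The predicate below is a hypothetical helper (the parameter names mirror those in the parameter file, but this is not the Duchamp API): a channel is skipped only when \texttt{flagMW} is set and the channel lies in the inclusive range.

```cpp
#include <cassert>

// Minimal sketch (assumed helper, not the Duchamp API): mark channels
// inside the inclusive [minMW, maxMW] range as ignored when flagMW is set.
bool isMilkyWayChannel(int z, bool flagMW, int minMW, int maxMW)
{
    return flagMW && z >= minMW && z <= maxMW;
}
```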
\secB{Image reconstruction}
\label{sec-recon}

The user can direct \duchamp\ to reconstruct the data cube using the
\atrous\ wavelet procedure. A good description of the procedure can be
found in \citet{starck02:book}. The reconstruction is an effective way
of removing a lot of the noise in the image, allowing one to search
reliably to fainter levels, and reducing the number of spurious
detections. This is an optional step, but one that greatly enhances
the source-detection process, with the trade-off that it can be
relatively time- and memory-intensive.

\secC{Algorithm}

The steps in the \atrous\ reconstruction are as follows:
\begin{enumerate}
\item The reconstructed array is set to 0 everywhere.
\item The input array is discretely convolved with a given filter
  function. This is determined from the parameter file via the
  \texttt{filterCode} parameter -- see Appendix~\ref{app-param} for
  details on the filters available.
\item The wavelet coefficients are calculated by taking the difference
  between the convolved array and the input array.
\item If the wavelet coefficients at a given point are above the
  requested threshold (given by \texttt{snrRecon} as the number of
  $\sigma$ above the mean and adjusted to the current scale -- see
  Appendix~\ref{app-scaling}), add these to the reconstructed array.
\item The separation of the filter coefficients is doubled. (Note that
  this step provides the name of the procedure\footnote{\atrous\ means
  ``with holes'' in French.}, as gaps or holes are created in the
  filter coverage.)
\item The procedure is repeated from step 2, using the convolved array
  as the input array.
\item Continue until the required maximum number of scales is reached.
\item Add the final smoothed (\ie convolved) array to the
  reconstructed array. This provides the ``DC offset'', as each of the
  wavelet coefficient arrays will have zero mean.
\end{enumerate}

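The loop above can be sketched in one dimension. This is an illustrative simplification, not the Duchamp source: it assumes the B3-spline filter $\{1/16, 1/4, 3/8, 1/4, 1/16\}$, reflects at the array boundaries, and takes a caller-supplied absolute threshold, whereas Duchamp thresholds at \texttt{snrRecon} robust $\sigma$s per scale and works in up to three dimensions.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One-dimensional sketch of the a trous steps above (assumptions: the
// B3-spline filter, reflective boundaries, a fixed absolute threshold).
std::vector<double> atrousReconstruct(std::vector<double> input,
                                      int maxScale, double threshold)
{
    const double filter[5] = {1/16., 1/4., 3/8., 1/4., 1/16.};
    const int n = input.size();
    std::vector<double> recon(n, 0.0);               // step 1
    for (int scale = 1; scale <= maxScale; scale++) {
        int spacing = 1 << (scale - 1);              // doubled each scale (step 5)
        std::vector<double> smoothed(n, 0.0);
        for (int i = 0; i < n; i++) {                // step 2: convolution
            for (int j = -2; j <= 2; j++) {
                int k = std::abs(i + j * spacing);   // reflect at boundaries
                if (k >= n) k = 2 * (n - 1) - k;
                smoothed[i] += filter[j + 2] * input[k];
            }
        }
        for (int i = 0; i < n; i++) {
            double w = input[i] - smoothed[i];       // step 3: wavelet coeffs
            if (std::fabs(w) > threshold)            // step 4: threshold
                recon[i] += w;
        }
        input = smoothed;                            // step 6
    }
    for (int i = 0; i < n; i++) recon[i] += input[i];  // step 8: DC offset
    return recon;
}
```

Running this with a threshold of zero returns the input array exactly, which illustrates the redundancy property of the decomposition discussed in the text.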
The reconstruction has at least two iterations. The first iteration
makes a first pass at the wavelet reconstruction (the process outlined
in the eight stages above), but the residual array will likely have some
structure still in it, so the wavelet filtering is done on the
residual, and any significant wavelet terms are added to the final
reconstruction. This step is repeated until the change in the measured
standard deviation of the background (see the note below on the
evaluation of this quantity) is less than some fiducial amount.

It is important to note that the \atrous\ decomposition is an example
of a ``redundant'' transformation. If no thresholding is performed,
the sum of all the wavelet coefficient arrays and the final smoothed
array is identical to the input array. The thresholding thus removes
only the unwanted structure in the array.

Note that any BLANK pixels that are still in the cube will not be
altered by the reconstruction -- they will be left as BLANK so that
the shape of the valid part of the cube is preserved.

\secC{Note on Statistics}

The correct calculation of the reconstructed array needs good
estimators of the underlying mean and standard deviation of the
background noise distribution. These statistics are estimated using
robust methods, to avoid corruption by strong outlying points. The
mean of the distribution is actually estimated by the median, while
the median absolute deviation from the median (MADFM) is calculated
and corrected assuming Gaussianity to estimate the underlying standard
deviation $\sigma$. The Gaussianity (or Normality) assumption is
critical, as the MADFM does not give the same value as the usual rms
or standard deviation value -- for a normal distribution
$N(\mu,\sigma)$ we find MADFM$=0.6744888\sigma$. Since this ratio is
corrected for, the user need only think in the usual multiples of
$\sigma$ when setting \texttt{snrRecon}. See Appendix~\ref{app-madfm}
for a derivation of this value.

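The robust estimators just described can be sketched as follows (these are illustrative functions, not the actual Duchamp routines): the median stands in for the mean, and the MADFM divided by 0.6744888 stands in for $\sigma$.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Sketch of the robust noise estimators described above (assumed
// helpers, not Duchamp's own functions).
double median(std::vector<double> v)
{
    std::sort(v.begin(), v.end());
    size_t n = v.size();
    return (n % 2) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

double madfmAsSigma(const std::vector<double> &v)
{
    double med = median(v);                 // robust estimate of the mean
    std::vector<double> dev;
    for (double x : v) dev.push_back(std::fabs(x - med));
    return median(dev) / 0.6744888;         // correct for Gaussianity
}
```

Note how the single outlier in a data set barely moves these estimates, whereas it would dominate the ordinary mean and rms.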
When thresholding the different wavelet scales, the value of $\sigma$
as measured from the wavelet array needs to be scaled to account for
the increased amount of correlation between neighbouring pixels (due
to the convolution). See Appendix~\ref{app-scaling} for details on
this scaling.

\secC{User control of reconstruction parameters}

The most important parameter for the user to select in relation to the
reconstruction is the threshold for each wavelet array. This is set
using the \texttt{snrRecon} parameter, and is given as a multiple of
the rms (estimated by the MADFM) above the mean (which for the wavelet
arrays should be approximately zero). There are several other
parameters that can be altered as well that affect the outcome of the
reconstruction.

By default, the cube is reconstructed in three dimensions, using a
3-dimensional filter and 3-dimensional convolution. This can be
altered, however, using the parameter \texttt{reconDim}. If set to 1,
this means the cube is reconstructed by considering each spectrum
separately, whereas \texttt{reconDim=2} will mean the cube is
reconstructed by doing each channel map separately. The merits of
these choices are discussed in \S\ref{sec-notes}, but it should be
noted that a 2-dimensional reconstruction can be susceptible to edge
effects if the spatial shape of the pixel array is not rectangular.

The user can also select the minimum scale to be used in the
reconstruction. The first scale exhibits the highest frequency
variations, and so ignoring this one can sometimes be beneficial in
removing excess noise. The default is to use all scales
(\texttt{minscale = 1}).

Finally, the filter that is used for the convolution can be selected
by using \texttt{filterCode} and the relevant code number -- the
choices are listed in Appendix~\ref{app-param}. A larger filter will
give a better reconstruction, but take longer and use more memory when
executing. When multi-dimensional reconstruction is selected, this
filter is used to construct a 2- or 3-dimensional equivalent.

\secB{Input/Output of reconstructed arrays}
\label{sec-reconIO}

The reconstruction stage can be relatively time-consuming,
particularly for large cubes and reconstructions in 3-D. To get around
this, \duchamp\ provides a shortcut to allow users to perform multiple
searches (\eg with different thresholds) on the same reconstruction
without calculating the reconstruction each time.

The first step is to choose to save the reconstructed array as a FITS
file by setting \texttt{flagOutputRecon = true}. The file will be
saved in the same directory as the input image, so the user needs to
have write permissions for that directory.

The filename will be derived from the input filename, with extra
information detailing the reconstruction that has been done. For
example, suppose \texttt{image.fits} has been reconstructed using a
3-dimensional reconstruction with filter \#2, thresholded at $4\sigma$
using all scales. The output filename will then be
\texttt{image.RECON-3-2-4-1.fits} (\ie it uses the four parameters
relevant for the \atrous\ reconstruction as listed in
Appendix~\ref{app-param}). The new FITS file will also have these
parameters as header keywords. If a subsection of the input image has
been used (see \S\ref{sec-input}), the format of the output filename
will be \texttt{image.sub.RECON-3-2-4-1.fits}, and the subsection that
has been used is also stored in the FITS header.

Likewise, the residual image, defined as the difference between the
input and reconstructed arrays, can also be saved in the same manner
by setting \texttt{flagOutputResid = true}. Its filename will be the
same as above, with \texttt{RESID} replacing \texttt{RECON}.

If a reconstructed image has been saved, it can be read in and used
instead of redoing the reconstruction. To do so, the user should set
\texttt{flagReconExists = true}. The user can indicate the name of the
reconstructed FITS file using the \texttt{reconFile} parameter, or, if
this is not specified, \duchamp\ searches for the file with the name
as defined above. If the file is not found, the reconstruction is
performed as normal. Note that to do this, the user needs to set
\texttt{flagAtrous = true} (obviously, if this is \texttt{false}, the
reconstruction is not needed).

\secB{Searching the image}
\label{sec-detection}

The image is searched for detections in two ways: spectrally (a
1-dimensional search in the spectrum in each spatial pixel), and
spatially (a 2-dimensional search in the spatial image in each
channel). In both cases, the algorithm finds connected pixels that are
above the user-specified threshold. In the case of the spatial image
search, the algorithm of \citet{lutz80} is used to raster-scan through
the image and connect groups of pixels on neighbouring rows.

Note that this algorithm cannot be applied directly to a 3-dimensional
case, as it requires that objects are completely nested in a row: that
is, if you are scanning along a row, and one object finishes and
another starts, you know that you will not get back to the first one
(if at all) until the second is completely finished for that
row. Three-dimensional data does not have this property, which is why
we break up the searching into 1- and 2-dimensional cases.

The determination of the threshold is done in one of two ways. The
first way is a simple sigma-clipping, where a threshold is set at a
fixed number $n$ of standard deviations above the mean, and pixels
above this threshold are flagged as detected. The value of $n$ is set
with the parameter \texttt{snrCut}. As before, the value of the
standard deviation is estimated by the MADFM, and corrected by the
ratio derived in Appendix~\ref{app-madfm}.

The second method uses the False Discovery Rate (FDR) technique
\citep{miller01,hopkins02}, whose basis we briefly detail here. The
false discovery rate (given by the number of false detections divided
by the total number of detections) is fixed at a certain value
$\alpha$ (\eg $\alpha=0.05$ implies 5\% of detections are false
positives). In practice, an $\alpha$ value is chosen, and the ensemble
average FDR (\ie $\langle FDR \rangle$) when the method is used will
be less than $\alpha$. One calculates $p$ -- the probability,
assuming the null hypothesis is true, of obtaining a test statistic as
extreme as the pixel value (the observed test statistic) -- for each
pixel, and sorts the values in increasing order. One then calculates $d$,
where
\[
d = \max \left\{ j : P_j < \frac{j\alpha}{c_N N} \right\},
\]
and then rejects all hypotheses whose $p$-values are less than or
equal to $P_d$. (So a pixel with $P_i<P_d$ will be rejected even if
$P_i \geq i\alpha/c_N N$.) Note that ``reject hypothesis'' here means
``accept the pixel as an object pixel'' (\ie we are rejecting the null
hypothesis that the pixel belongs to the background).

The $c_N$ values here are normalisation constants that depend on the
correlated nature of the pixel values. If all the pixels are
uncorrelated, then $c_N=1$. If $N$ pixels are correlated, then their
tests will be dependent on each other, and so $c_N = \sum_{i=1}^N
i^{-1}$. \citet{hopkins02} consider real radio data, where the pixels
are correlated over the beam. In this case the sum is made over the
$N$ pixels that make up the beam. The value of $N$ is calculated from
the FITS header (if the correct keywords -- BMAJ, BMIN -- are not
present, a default value of 10 pixels is assumed).

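The threshold selection above can be sketched as follows. This is an assumed rendering of the formula for $d$, not Duchamp's actual routine: the $p$-values are sorted in increasing order, the largest $j$ with $P_j < j\alpha/(c_N N)$ is found, and $P_d$ is returned as the rejection threshold.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Sketch of FDR threshold selection (assumed form, not the Duchamp
// source): pixels with p <= the returned threshold become object pixels.
double fdrThreshold(std::vector<double> pvals, double alpha, int beamPixels)
{
    std::sort(pvals.begin(), pvals.end());
    double cN = 0.0;                       // normalisation for correlated pixels
    for (int i = 1; i <= beamPixels; i++) cN += 1.0 / i;
    const double N = pvals.size();
    double threshold = 0.0;
    for (size_t j = 1; j <= pvals.size(); j++)
        if (pvals[j - 1] < j * alpha / (cN * N))
            threshold = pvals[j - 1];      // P_d for the largest such j
    return threshold;
}
```

With uncorrelated pixels (\texttt{beamPixels = 1}, so $c_N = 1$) this reduces to the standard Benjamini--Hochberg step-up rule.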
The theory behind the FDR method implies a direct connection between
the choice of $\alpha$ and the fraction of detections that will be
false positives. However, due to the merging process, this direct
connection is lost when looking at the final number of detections --
see discussion in \S\ref{sec-notes}. The effect is that the number of
false detections will be less than indicated by the $\alpha$ value
used.

If a reconstruction has been made, the residuals (defined in the sense
of original $-$ reconstruction) are used to estimate the noise
parameters of the cube. Otherwise they are estimated directly from the
cube itself. In both cases, robust estimators are used.

Detections must have a minimum number of pixels to be counted. This
minimum number is given by the input parameters \texttt{minPix} (for
2-dimensional searches) and \texttt{minChannels} (for 1-dimensional
searches).

Finally, the search only looks for positive features. If one is
interested instead in negative features (such as absorption lines),
set the parameter \texttt{flagNegative = true}. This will invert the
cube (\ie multiply all pixels by $-1$) prior to the search, and then
re-invert the cube (and the fluxes of any detections) after searching
is complete. All outputs are done in the same manner as normal, so
that fluxes of detections will be negative.

\secB{Merging detected objects}
\label{sec-merger}

The searching step produces a list of detected objects that will have
many repeated detections of a given object -- for instance, spectral
detections in adjacent pixels of the same object and/or spatial
detections in neighbouring channels. These are then combined in an
algorithm that matches all objects judged to be ``close'', according
to one of two criteria.

One criterion is to define two thresholds -- one spatial and one in
velocity -- and say that two objects should be merged if there is at
least one pair of pixels that lie within these threshold distances of
each other. These thresholds are specified by the parameters
\texttt{threshSpatial} and \texttt{threshVelocity} (in units of pixels
and channels respectively).

Alternatively, the spatial requirement can be changed to say that
there must be a pair of pixels that are \emph{adjacent} -- a stricter,
but perhaps more realistic requirement, particularly when the spatial
pixels have a large angular size (as is the case for \hi\
surveys). This method can be selected by setting the parameter
\texttt{flagAdjacent} to 1 (\ie \texttt{true}) in the parameter
file. The velocity thresholding is done in the same way as the first
option.

Once the detections have been merged, they may be ``grown''. This is a
process of increasing the size of the detection by adding adjacent
pixels that are above some secondary threshold. This threshold is
lower than the one used for the initial detection, but above the noise
level, so that faint pixels are only detected when they are close to a
bright pixel. The value of this threshold is a possible input
parameter (\texttt{growthCut}), with a default value of
$1.5\sigma$. The use of the growth algorithm is controlled by the
\texttt{flagGrowth} parameter -- the default value of which is
\texttt{false}. If the detections are grown, they are sent through the
merging algorithm a second time, to pick up any detections that now
overlap or have grown over each other.

Finally, to be accepted, the detections must span \emph{both} a
minimum number of channels (to remove any spurious single-channel
spikes that may be present), and a minimum number of spatial
pixels. These numbers, as for the original detection step, are set
with the \texttt{minChannels} and \texttt{minPix} parameters. The
channel requirement means there must be at least one set of
\texttt{minChannels} consecutive channels in the source for it to be
accepted.