% source: trunk/docs/executionFlow.tex @ 167
% Last change: revision 162, checked in by Matthew Whiting
% (Editing of Guide. Mostly minor, with one new parameter added (beamSize).)
\secA{What \duchamp\ is doing}
\label{sec-flow}

The execution flow of \duchamp\ is detailed here, indicating the main
algorithmic steps that are used. The program is written in C/C++ and
makes use of the \textsc{cfitsio}, \textsc{wcslib} and \textsc{pgplot}
libraries.

\secB{Image input}
\label{sec-input}

The cube is read in using basic \textsc{cfitsio} commands, and stored
as an array in a special C++ class. This class keeps track of the list
of detected objects, as well as any reconstructed arrays that are made
(see \S\ref{sec-recon}). The World Coordinate System
(WCS)\footnote{This is the information necessary for translating the
pixel locations to quantities such as position on the sky, frequency,
velocity, and so on.} information for the cube is also obtained from
the FITS header by \textsc{wcslib} functions \citep{greisen02,
calabretta02}, and this information, in the form of a \texttt{wcsprm}
structure, is also stored in the same class.

A sub-section of an image can be requested via the \texttt{subsection}
parameter -- this can be a good idea if the cube has very noisy edges,
which may produce many spurious detections. The generalised form of
the subsection that is used by \textsc{cfitsio} is
\texttt{[x1:x2:dx,y1:y2:dy,z1:z2:dz,...]}, such that the x-coordinates run
from \texttt{x1} to \texttt{x2} (inclusive), with steps of
\texttt{dx}. The step value can be omitted (so a subsection of the
form \texttt{[2:50,2:50,10:1000]} is still valid). \duchamp\ does not
make use of any step value present in the subsection string, and any
that are present are removed before the file is opened.

If one wants the full range of a coordinate, the range can be replaced
with an asterisk, \eg \texttt{[2:50,2:50,*]}. To use a subsection, one
must set \texttt{flagSubsection = 1}. A complete description of the
section syntax can be found at the \textsc{cfitsio} web site%
\footnote{%
\href%
{http://heasarc.gsfc.nasa.gov/docs/software/fitsio/c/c\_user/node90.html}%
{http://heasarc.gsfc.nasa.gov/docs/software/fitsio/c/c\_user/node90.html}}.

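The removal of step values from the subsection string can be sketched as follows. This is an illustrative helper only (the function name is hypothetical, not part of the Duchamp or \textsc{cfitsio} API): it keeps just the \texttt{x1:x2} part of each axis specification.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper (not the actual Duchamp code): strip any step
// values from a cfitsio-style subsection string, since Duchamp ignores
// them. "[2:50:2,2:50,*]" becomes "[2:50,2:50,*]".
std::string removeStepValues(const std::string &subsection)
{
    // Drop the surrounding brackets, then split on commas.
    std::string inner = subsection.substr(1, subsection.size() - 2);
    std::stringstream ss(inner);
    std::string axis, result = "[";
    while (std::getline(ss, axis, ',')) {
        // Keep only the first two colon-separated fields (x1:x2).
        size_t first = axis.find(':');
        size_t second = (first == std::string::npos)
                            ? std::string::npos
                            : axis.find(':', first + 1);
        if (second != std::string::npos) axis.erase(second);
        if (result.size() > 1) result += ',';
        result += axis;
    }
    return result + "]";
}
```

An asterisk or a plain \texttt{x1:x2} range passes through unchanged.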
\secB{Image modification}
\label{sec-modify}

Several modifications to the cube can be made that improve the
execution and efficiency of \duchamp\ (their use is optional, governed
by the relevant flags in the parameter file).

\secC{BLANK pixel removal}

If the imaged area of a cube is non-rectangular (see the example in
Fig.~\ref{fig-moment}, a cube from the HIPASS survey), BLANK pixels are
used to pad it out to a rectangular shape. The value of these pixels
is given by the FITS header keywords BLANK, BSCALE and BZERO. While
these pixels make the image a convenient shape, they will unnecessarily
interfere with the processing (as well as taking up needless
memory). The first step, then, is to trim them from the edges. This is
done when the parameter \texttt{flagBlankPix=true}. If the above
keywords are not present, the user can specify the BLANK value via the
parameter \texttt{blankPixValue}.

Removing BLANK pixels is particularly important for the reconstruction
step, as lots of BLANK pixels on the edges will smooth out features in
the wavelet calculation stage. The trimming will also reduce the size
of the cube's array, speeding up the execution. The amount of trimming
is recorded, and these pixels are added back in once the
source-detection is completed (so that quoted pixel positions are
applicable to the original cube).

Rows and columns are trimmed one at a time until the first non-BLANK
pixel is reached, so that the image remains rectangular. In practice,
this means that there will be some BLANK pixels left in the trimmed
image (if the non-BLANK region is non-rectangular). However, these are
ignored in all further calculations done on the cube.

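The edge-trimming logic can be sketched for a single 2-D plane as below. This is an illustrative simplification (types and names are invented, not Duchamp's internals): rows and columns are peeled off each edge until a non-BLANK pixel is hit, yielding the bounding box of the valid region.

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch only (not Duchamp's actual code): find the
// bounding box of non-BLANK pixels in a 2D image, trimming rows and
// columns from each edge until a non-BLANK pixel is encountered.
struct Bounds { int xmin, xmax, ymin, ymax; };

Bounds trimBlanks(const std::vector<std::vector<float>> &image, float blank)
{
    int ny = image.size(), nx = image[0].size();
    Bounds b{0, nx - 1, 0, ny - 1};
    auto rowBlank = [&](int y) {
        for (int x = 0; x < nx; x++)
            if (image[y][x] != blank) return false;
        return true;
    };
    auto colBlank = [&](int x) {
        for (int y = 0; y < ny; y++)
            if (image[y][x] != blank) return false;
        return true;
    };
    while (b.ymin < b.ymax && rowBlank(b.ymin)) b.ymin++;
    while (b.ymax > b.ymin && rowBlank(b.ymax)) b.ymax--;
    while (b.xmin < b.xmax && colBlank(b.xmin)) b.xmin++;
    while (b.xmax > b.xmin && colBlank(b.xmax)) b.xmax--;
    return b;
}
```

As the text notes, any BLANK pixels remaining inside this bounding box (for a non-rectangular valid region) are simply ignored in later calculations.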
\secC{Baseline removal}

Second, the user may request the removal of baselines from the
spectra, via the parameter \texttt{flagBaseline}. This may be
necessary if there is a strong baseline ripple present, which can
result in spurious detections at the high points of the ripple. The
baseline is calculated from a wavelet reconstruction procedure (see
\S\ref{sec-recon}) that keeps only the two largest scales. This is
done separately for each spatial pixel (\ie for each spectrum in the
cube), and the baselines are stored and added back in before any
output is done. In this way the quoted fluxes and displayed spectra
are as one would see from the input cube itself -- even though the
detection (and reconstruction if applicable) is done on the
baseline-removed cube.

The presence of very strong signals (for instance, masers at several
hundred Jy) could affect the determination of the baseline, and would
lead to a large dip centred on the signal in the baseline-subtracted
spectrum. To prevent this, the signal is trimmed prior to the
reconstruction process at a standard threshold of $8\sigma$ above
the mean. The baseline determined should thus be representative of
the true, signal-free baseline. Note that this trimming is only a
temporary measure which does not affect the source-detection.

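The pre-baseline clipping step can be sketched as follows. This is an assumed simplification, not the Duchamp implementation: it uses plain mean and standard deviation for brevity, whereas Duchamp uses robust (median-based) estimators, and the clipped spectrum would then be fed to the two-largest-scales wavelet baseline fit.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative sketch (not the actual Duchamp code): clip a spectrum
// at nsigma (default 8) above the mean before estimating the
// baseline, so that strong sources do not drag the baseline upwards.
std::vector<float> clipSpectrum(std::vector<float> spec, float nsigma = 8.f)
{
    double sum = 0, sumsq = 0;
    for (float v : spec) { sum += v; sumsq += double(v) * v; }
    double mean = sum / spec.size();
    double sigma = std::sqrt(sumsq / spec.size() - mean * mean);
    float limit = float(mean + nsigma * sigma);
    for (float &v : spec) v = std::min(v, limit);  // trim strong peaks
    return spec;
}
```

The clipping is temporary: only the baseline estimate is derived from the clipped spectrum, and source-detection proceeds on the unclipped data.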
\secC{Ignoring bright Milky Way emission}

Finally, a single set of contiguous channels can be ignored -- these
may exhibit very strong emission, such as that from the Milky Way as
seen in extragalactic \hi\ cubes (hence the references to ``Milky
Way'' in relation to this task -- apologies to Galactic
astronomers!). Such dominant channels will produce many detections
that are unnecessary, uninteresting (if one is interested in
extragalactic \hi) and large (in size and hence in memory usage), and
so will slow the program down and detract from the interesting
detections.

The use of this feature is controlled by the \texttt{flagMW}
parameter, and the exact channels concerned can be set by the
user (using \texttt{maxMW} and \texttt{minMW} -- these give an
inclusive range of channels). When employed, these channels are
ignored for the searching, and the scaling of the spectral output (see
Fig.~\ref{fig-spect}) will not take them into account. They will be
present in the reconstructed array, however, and so will be included
in the saved FITS file (see \S\ref{sec-reconIO}). When the final
spectra are plotted, the range of channels covered by these parameters
is indicated by a green hatched box.

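The channel test implied by these parameters is simple enough to state directly. The predicate below is a hypothetical helper (the parameter names mirror those in the parameter file, but this is not the Duchamp API): a channel is skipped only when \texttt{flagMW} is set and the channel lies in the inclusive range.

```cpp
#include <cassert>

// Minimal sketch (assumed helper, not the Duchamp API): mark channels
// inside the inclusive [minMW, maxMW] range as ignored when flagMW is set.
bool isMilkyWayChannel(int z, bool flagMW, int minMW, int maxMW)
{
    return flagMW && z >= minMW && z <= maxMW;
}
```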
\secB{Image reconstruction}
\label{sec-recon}

The user can direct \duchamp\ to reconstruct the data cube using the
\atrous\ wavelet procedure. A good description of the procedure can be
found in \citet{starck02:book}. The reconstruction is an effective way
of removing a lot of the noise in the image, allowing one to search
reliably to fainter levels, and reducing the number of spurious
detections. This is an optional step, but one that greatly enhances
the source-detection process, with the trade-off that it can be
relatively time- and memory-intensive.

\secC{Algorithm}

The steps in the \atrous\ reconstruction are as follows:
\begin{enumerate}
\item The reconstructed array is set to 0 everywhere.
\item The input array is discretely convolved with a given filter
  function. This is determined from the parameter file via the
  \texttt{filterCode} parameter -- see Appendix~\ref{app-param} for
  details on the filters available.
\item The wavelet coefficients are calculated by taking the difference
  between the convolved array and the input array.
\item If the wavelet coefficients at a given point are above the
  requested threshold (given by \texttt{snrRecon} as the number of
  $\sigma$ above the mean and adjusted to the current scale -- see
  Appendix~\ref{app-scaling}), add these to the reconstructed array.
\item The separation of the filter coefficients is doubled. (Note that
  this step provides the name of the procedure\footnote{\atrous\ means
  ``with holes'' in French.}, as gaps or holes are created in the
  filter coverage.)
\item The procedure is repeated from step 2, using the convolved array
  as the input array.
\item Continue until the required maximum number of scales is reached.
\item Add the final smoothed (\ie convolved) array to the
  reconstructed array. This provides the ``DC offset'', as each of the
  wavelet coefficient arrays will have zero mean.
\end{enumerate}

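The loop above can be sketched in one dimension. This is an illustrative simplification, not the Duchamp source: it assumes the B3-spline filter $\{1/16, 1/4, 3/8, 1/4, 1/16\}$, reflects at the array boundaries, and takes a caller-supplied absolute threshold, whereas Duchamp thresholds at \texttt{snrRecon} robust $\sigma$s per scale and works in up to three dimensions.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One-dimensional sketch of the a trous steps above (assumptions: the
// B3-spline filter, reflective boundaries, a fixed absolute threshold).
std::vector<double> atrousReconstruct(std::vector<double> input,
                                      int maxScale, double threshold)
{
    const double filter[5] = {1/16., 1/4., 3/8., 1/4., 1/16.};
    const int n = input.size();
    std::vector<double> recon(n, 0.0);               // step 1
    for (int scale = 1; scale <= maxScale; scale++) {
        int spacing = 1 << (scale - 1);              // doubled each scale (step 5)
        std::vector<double> smoothed(n, 0.0);
        for (int i = 0; i < n; i++) {                // step 2: convolution
            for (int j = -2; j <= 2; j++) {
                int k = std::abs(i + j * spacing);   // reflect at boundaries
                if (k >= n) k = 2 * (n - 1) - k;
                smoothed[i] += filter[j + 2] * input[k];
            }
        }
        for (int i = 0; i < n; i++) {
            double w = input[i] - smoothed[i];       // step 3: wavelet coeffs
            if (std::fabs(w) > threshold)            // step 4: threshold
                recon[i] += w;
        }
        input = smoothed;                            // step 6
    }
    for (int i = 0; i < n; i++) recon[i] += input[i];  // step 8: DC offset
    return recon;
}
```

Running this with a threshold of zero returns the input array exactly, which illustrates the redundancy property of the decomposition discussed in the text.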
The reconstruction has at least two iterations. The first iteration
makes a first pass at the wavelet reconstruction (the process outlined
in the eight stages above), but the residual array will likely have some
structure still in it, so the wavelet filtering is done on the
residual, and any significant wavelet terms are added to the final
reconstruction. This step is repeated until the change in the measured
standard deviation of the background (see the note below on the
evaluation of this quantity) is less than some fiducial amount.

It is important to note that the \atrous\ decomposition is an example
of a ``redundant'' transformation. If no thresholding is performed,
the sum of all the wavelet coefficient arrays and the final smoothed
array is identical to the input array. The thresholding thus removes
only the unwanted structure in the array.

Note that any BLANK pixels that are still in the cube will not be
altered by the reconstruction -- they will be left as BLANK so that
the shape of the valid part of the cube is preserved.

\secC{Note on Statistics}

The correct calculation of the reconstructed array needs good
estimators of the underlying mean and standard deviation of the
background noise distribution. These statistics are estimated using
robust methods, to avoid corruption by strong outlying points. The
mean of the distribution is actually estimated by the median, while
the median absolute deviation from the median (MADFM) is calculated
and corrected assuming Gaussianity to estimate the underlying standard
deviation $\sigma$. The Gaussianity (or Normality) assumption is
critical, as the MADFM does not give the same value as the usual rms
or standard deviation value -- for a normal distribution
$N(\mu,\sigma)$ we find MADFM$=0.6744888\sigma$. Since this ratio is
corrected for, the user need only think in the usual multiples of
$\sigma$ when setting \texttt{snrRecon}. See Appendix~\ref{app-madfm}
for a derivation of this value.

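The robust estimators just described can be sketched as follows (these are illustrative functions, not the actual Duchamp routines): the median stands in for the mean, and the MADFM divided by 0.6744888 stands in for $\sigma$.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Sketch of the robust noise estimators described above (assumed
// helpers, not Duchamp's own functions).
double median(std::vector<double> v)
{
    std::sort(v.begin(), v.end());
    size_t n = v.size();
    return (n % 2) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

double madfmAsSigma(const std::vector<double> &v)
{
    double med = median(v);                 // robust estimate of the mean
    std::vector<double> dev;
    for (double x : v) dev.push_back(std::fabs(x - med));
    return median(dev) / 0.6744888;         // correct for Gaussianity
}
```

Note how the single outlier in a data set barely moves these estimates, whereas it would dominate the ordinary mean and rms.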
When thresholding the different wavelet scales, the value of $\sigma$
as measured from the wavelet array needs to be scaled to account for
the increased amount of correlation between neighbouring pixels (due
to the convolution). See Appendix~\ref{app-scaling} for details on
this scaling.

\secC{User control of reconstruction parameters}

The most important parameter for the user to select in relation to the
reconstruction is the threshold for each wavelet array. This is set
using the \texttt{snrRecon} parameter, and is given as a multiple of
the rms (estimated by the MADFM) above the mean (which for the wavelet
arrays should be approximately zero). There are several other
parameters that can be altered as well that affect the outcome of the
reconstruction.

By default, the cube is reconstructed in three dimensions, using a
3-dimensional filter and 3-dimensional convolution. This can be
altered, however, using the parameter \texttt{reconDim}. If set to 1,
this means the cube is reconstructed by considering each spectrum
separately, whereas \texttt{reconDim=2} will mean the cube is
reconstructed by doing each channel map separately. The merits of
these choices are discussed in \S\ref{sec-notes}, but it should be
noted that a 2-dimensional reconstruction can be susceptible to edge
effects if the spatial shape of the pixel array is not rectangular.

The user can also select the minimum scale to be used in the
reconstruction. The first scale exhibits the highest frequency
variations, and so ignoring this one can sometimes be beneficial in
removing excess noise. The default is to use all scales
(\texttt{minscale = 1}).

Finally, the filter that is used for the convolution can be selected
by using \texttt{filterCode} and the relevant code number -- the
choices are listed in Appendix~\ref{app-param}. A larger filter will
give a better reconstruction, but take longer and use more memory when
executing. When multi-dimensional reconstruction is selected, this
filter is used to construct a 2- or 3-dimensional equivalent.

\secB{Input/Output of reconstructed arrays}
\label{sec-reconIO}

The reconstruction stage can be relatively time-consuming,
particularly for large cubes and reconstructions in 3-D. To get around
this, \duchamp\ provides a shortcut to allow users to perform multiple
searches (\eg with different thresholds) on the same reconstruction
without calculating the reconstruction each time.

The first step is to choose to save the reconstructed array as a FITS
file by setting \texttt{flagOutputRecon = true}. The file will be
saved in the same directory as the input image, so the user needs to
have write permissions for that directory.

The filename will be derived from the input filename, with extra
information detailing the reconstruction that has been done. For
example, suppose \texttt{image.fits} has been reconstructed using a
3-dimensional reconstruction with filter \#2, thresholded at $4\sigma$
using all scales. The output filename will then be
\texttt{image.RECON-3-2-4-1.fits} (\ie it uses the four parameters
relevant for the \atrous\ reconstruction as listed in
Appendix~\ref{app-param}). The new FITS file will also have these
parameters as header keywords. If a subsection of the input image has
been used (see \S\ref{sec-input}), the format of the output filename
will be \texttt{image.sub.RECON-3-2-4-1.fits}, and the subsection that
has been used is also stored in the FITS header.

Likewise, the residual image, defined as the difference between the
input and reconstructed arrays, can also be saved in the same manner
by setting \texttt{flagOutputResid = true}. Its filename will be the
same as above, with \texttt{RESID} replacing \texttt{RECON}.

If a reconstructed image has been saved, it can be read in and used
instead of redoing the reconstruction. To do so, the user should set
\texttt{flagReconExists = true}. The user can indicate the name of the
reconstructed FITS file using the \texttt{reconFile} parameter, or, if
this is not specified, \duchamp\ searches for the file with the name
as defined above. If the file is not found, the reconstruction is
performed as normal. Note that to do this, the user needs to set
\texttt{flagAtrous = true} (obviously, if this is \texttt{false}, the
reconstruction is not needed).

\secB{Searching the image}
\label{sec-detection}

The image is searched for detections in two ways: spectrally (a
1-dimensional search in the spectrum in each spatial pixel), and
spatially (a 2-dimensional search in the spatial image in each
channel). In both cases, the algorithm finds connected pixels that are
above the user-specified threshold. In the case of the spatial image
search, the algorithm of \citet{lutz80} is used to raster-scan through
the image and connect groups of pixels on neighbouring rows.

Note that this algorithm cannot be applied directly to a 3-dimensional
case, as it requires that objects are completely nested in a row: that
is, if you are scanning along a row, and one object finishes and
another starts, you know that you will not get back to the first one
(if at all) until the second is completely finished for that
row. Three-dimensional data does not have this property, which is why
we break up the searching into 1- and 2-dimensional cases.

The determination of the threshold is done in one of two ways. The
first way is a simple sigma-clipping, where a threshold is set at a
fixed number $n$ of standard deviations above the mean, and pixels
above this threshold are flagged as detected. The value of $n$ is set
with the parameter \texttt{snrCut}. As before, the value of the
standard deviation is estimated by the MADFM, and corrected by the
ratio derived in Appendix~\ref{app-madfm}.

The second method uses the False Discovery Rate (FDR) technique
\citep{miller01,hopkins02}, whose basis we briefly detail here. The
false discovery rate (given by the number of false detections divided
by the total number of detections) is fixed at a certain value
$\alpha$ (\eg $\alpha=0.05$ implies 5\% of detections are false
positives). In practice, an $\alpha$ value is chosen, and the ensemble
average FDR (\ie $\langle FDR \rangle$) when the method is used will
be less than $\alpha$. One calculates $p$ -- the probability,
assuming the null hypothesis is true, of obtaining a test statistic as
extreme as the pixel value (the observed test statistic) -- for each
pixel, and sorts the values in increasing order. One then calculates $d$,
where
\[
d = \max \left\{ j : P_j < \frac{j\alpha}{c_N N} \right\},
\]
and then rejects all hypotheses whose $p$-values are less than or
equal to $P_d$. (So a pixel with $P_i<P_d$ will be rejected even if
$P_i \geq i\alpha/c_N N$.) Note that ``reject hypothesis'' here means
``accept the pixel as an object pixel'' (\ie we are rejecting the null
hypothesis that the pixel belongs to the background).

The $c_N$ values here are normalisation constants that depend on the
correlated nature of the pixel values. If all the pixels are
uncorrelated, then $c_N=1$. If $N$ pixels are correlated, then their
tests will be dependent on each other, and so $c_N = \sum_{i=1}^N
i^{-1}$. \citet{hopkins02} consider real radio data, where the pixels
are correlated over the beam. In this case the sum is made over the
$N$ pixels that make up the beam. The value of $N$ is calculated from
the FITS header (if the correct keywords -- BMAJ, BMIN -- are not
present, a default value of 10 pixels is assumed).

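The threshold selection above can be sketched as follows. This is an assumed rendering of the formula for $d$, not Duchamp's actual routine: the $p$-values are sorted in increasing order, the largest $j$ with $P_j < j\alpha/(c_N N)$ is found, and $P_d$ is returned as the rejection threshold.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Sketch of FDR threshold selection (assumed form, not the Duchamp
// source): pixels with p <= the returned threshold become object pixels.
double fdrThreshold(std::vector<double> pvals, double alpha, int beamPixels)
{
    std::sort(pvals.begin(), pvals.end());
    double cN = 0.0;                       // normalisation for correlated pixels
    for (int i = 1; i <= beamPixels; i++) cN += 1.0 / i;
    const double N = pvals.size();
    double threshold = 0.0;
    for (size_t j = 1; j <= pvals.size(); j++)
        if (pvals[j - 1] < j * alpha / (cN * N))
            threshold = pvals[j - 1];      // P_d for the largest such j
    return threshold;
}
```

With uncorrelated pixels (\texttt{beamPixels = 1}, so $c_N = 1$) this reduces to the standard Benjamini--Hochberg step-up rule.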
The theory behind the FDR method implies a direct connection between
the choice of $\alpha$ and the fraction of detections that will be
false positives. However, due to the merging process, this direct
connection is lost when looking at the final number of detections --
see discussion in \S\ref{sec-notes}. The effect is that the number of
false detections will be less than indicated by the $\alpha$ value
used.

If a reconstruction has been made, the residuals (defined in the sense
of original $-$ reconstruction) are used to estimate the noise
parameters of the cube. Otherwise they are estimated directly from the
cube itself. In both cases, robust estimators are used.

Detections must have a minimum number of pixels to be counted. This
minimum number is given by the input parameters \texttt{minPix} (for
2-dimensional searches) and \texttt{minChannels} (for 1-dimensional
searches).

Finally, the search only looks for positive features. If one is
interested instead in negative features (such as absorption lines),
set the parameter \texttt{flagNegative = true}. This will invert the
cube (\ie multiply all pixels by $-1$) prior to the search, and then
re-invert the cube (and the fluxes of any detections) after searching
is complete. All outputs are done in the same manner as normal, so
that fluxes of detections will be negative.

\secB{Merging detected objects}
\label{sec-merger}

The searching step produces a list of detected objects that will have
many repeated detections of a given object -- for instance, spectral
detections in adjacent pixels of the same object and/or spatial
detections in neighbouring channels. These are then combined in an
algorithm that matches all objects judged to be ``close'', according
to one of two criteria.

One criterion is to define two thresholds -- one spatial and one in
velocity -- and say that two objects should be merged if there is at
least one pair of pixels that lie within these threshold distances of
each other. These thresholds are specified by the parameters
\texttt{threshSpatial} and \texttt{threshVelocity} (in units of pixels
and channels respectively).

Alternatively, the spatial requirement can be changed to say that
there must be a pair of pixels that are \emph{adjacent} -- a stricter,
but perhaps more realistic requirement, particularly when the spatial
pixels have a large angular size (as is the case for \hi\
surveys). This method can be selected by setting the parameter
\texttt{flagAdjacent} to 1 (\ie \texttt{true}) in the parameter
file. The velocity thresholding is done in the same way as the first
option.

Once the detections have been merged, they may be ``grown''. This is a
process of increasing the size of the detection by adding adjacent
pixels that are above some secondary threshold. This threshold is
lower than the one used for the initial detection, but above the noise
level, so that faint pixels are only detected when they are close to a
bright pixel. The value of this threshold is a possible input
parameter (\texttt{growthCut}), with a default value of
$1.5\sigma$. The use of the growth algorithm is controlled by the
\texttt{flagGrowth} parameter -- the default value of which is
\texttt{false}. If the detections are grown, they are sent through the
merging algorithm a second time, to pick up any detections that now
overlap or have grown over each other.

Finally, to be accepted, the detections must span \emph{both} a
minimum number of channels (to remove any spurious single-channel
spikes that may be present), and a minimum number of spatial
pixels. These numbers, as for the original detection step, are set
with the \texttt{minChannels} and \texttt{minPix} parameters. The
channel requirement means there must be at least one set of
\texttt{minChannels} consecutive channels in the source for it to be
accepted.