source: tags/release-1.0.7/docs/executionFlow.tex

\secA{What \duchamp\ is doing}
\label{sec-flow}

The execution flow of \duchamp\ is detailed here, indicating the main
algorithmic steps that are used. The program is written in C/C++ and
makes use of the \textsc{cfitsio}, \textsc{wcslib} and \textsc{pgplot}
libraries.

\secB{Image input}
\label{sec-input}

The cube is read in using basic \textsc{cfitsio} commands, and stored
as an array in a special C++ class. This class keeps track of the list
of detected objects, as well as any reconstructed arrays that are made
(see \S\ref{sec-recon}). The World Coordinate System
(WCS)\footnote{This is the information necessary for translating the
pixel locations to quantities such as position on the sky, frequency,
velocity, and so on.} information for the cube is also obtained from
the FITS header by \textsc{wcslib} functions \citep{greisen02,
calabretta02}, and this information, in the form of a \texttt{wcsprm}
structure, is also stored in the same class.

A sub-section of an image can be requested via the \texttt{subsection}
parameter -- this can be a good idea if the cube has very noisy edges,
which may produce many spurious detections. The generalised form of
the subsection that is used by \textsc{cfitsio} is
\texttt{[x1:x2:dx,y1:y2:dy,z1:z2:dz,...]}, such that the x-coordinates run
from \texttt{x1} to \texttt{x2} (inclusive), with steps of
\texttt{dx}. The step value can be omitted (so a subsection of the
form \texttt{[2:50,2:50,10:1000]} is still valid). \duchamp\ does not
make use of any step value present in the subsection string, and any
that are present are removed before the file is opened.

If one wants the full range of a coordinate then replace the range
with an asterisk, \eg \texttt{[2:50,2:50,*]}. If one wants to use a
subsection, one must set \texttt{flagSubsection = 1}. A complete
description of the section syntax can be found at the \textsc{fitsio}
web site%
\footnote{%
\href%
{http://heasarc.gsfc.nasa.gov/docs/software/fitsio/c/c\_user/node90.html}%
{http://heasarc.gsfc.nasa.gov/docs/software/fitsio/c/c\_user/node90.html}}.
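
As a rough illustration of the step-value handling described above (a
sketch only, not \duchamp's actual parsing code; the function name is
hypothetical), a subsection string could be cleaned of any \texttt{:dx}
step fields like this:
\begin{verbatim}
// Illustrative sketch: remove any step values (":dx") from a cfitsio-style
// subsection string such as "[1:100:2,1:100,*]", leaving "[1:100,1:100,*]".
#include <sstream>
#include <string>

std::string removeStepValues(const std::string& section)
{
    std::string inner = section.substr(1, section.size() - 2);  // strip [ and ]
    std::istringstream axes(inner);
    std::string axis, result = "[";
    while (std::getline(axes, axis, ',')) {
        size_t first  = axis.find(':');
        size_t second = axis.find(':', first + 1);
        if (second != std::string::npos) axis.erase(second);    // drop ":dx"
        result += axis + ",";
    }
    result.back() = ']';                                         // close the section
    return result;
}
\end{verbatim}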

\secB{Image modification}
\label{sec-modify}

Several modifications to the cube can be made that improve the
execution and efficiency of \duchamp\ (their use is optional, governed
by the relevant flags in the parameter file).

\secC{BLANK pixel removal}

If the imaged area of a cube is non-rectangular (see the example in
Fig.~\ref{fig-moment}, a cube from the HIPASS survey), BLANK pixels are
used to pad it out to a rectangular shape. The value of these pixels
is given by the FITS header keywords BLANK, BSCALE and BZERO. While
these pixels make the image a nice shape, they will unnecessarily
interfere with the processing (as well as taking up needless
memory). The first step, then, is to trim them from the edge. This is
done when the parameter \texttt{flagBlankPix=true}. If the above
keywords are not present, the user can specify the BLANK value by the
parameter \texttt{blankPixValue}.
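
For reference, the physical value used for these BLANK pixels follows
the standard FITS scaling convention:
\[
\mathrm{value}_{\rm BLANK} = \mathrm{BZERO} + \mathrm{BSCALE}\times\mathrm{BLANK}.
\]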

Removing BLANK pixels is particularly important for the reconstruction
step, as large numbers of BLANK pixels on the edges will smooth out
features in the wavelet calculation stage. The trimming will also
reduce the size of the cube's array, speeding up the execution. The
amount of trimming is recorded, and these pixels are added back in once
the source-detection is completed (so that quoted pixel positions are
applicable to the original cube).

Rows and columns are trimmed one at a time until the first non-BLANK
pixel is reached, so that the image remains rectangular. In practice,
this means that there will be some BLANK pixels left in the trimmed
image (if the non-BLANK region is non-rectangular). However, these are
ignored in all further calculations done on the cube.

\secC{Baseline removal}

Second, the user may request the removal of baselines from the
spectra, via the parameter \texttt{flagBaseline}. This may be
necessary if there is a strong baseline ripple present, which can
result in spurious detections at the high points of the ripple. The
baseline is calculated from a wavelet reconstruction procedure (see
\S\ref{sec-recon}) that keeps only the two largest scales. This is
done separately for each spatial pixel (\ie for each spectrum in the
cube), and the baselines are stored and added back in before any
output is done. In this way the quoted fluxes and displayed spectra
are as one would see from the input cube itself -- even though the
detection (and reconstruction if applicable) is done on the
baseline-removed cube.

The presence of very strong signals (for instance, masers at several
hundred Jy) could affect the determination of the baseline, and would
lead to a large dip centred on the signal in the baseline-subtracted
spectrum. To prevent this, the signal is trimmed prior to the
reconstruction process at a standard threshold of $8\sigma$ above
the mean. The baseline determined should thus be representative of
the true, signal-free baseline. Note that this trimming is only a
temporary measure which does not affect the source-detection.
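
A minimal sketch of this clipping step (illustrative only; the function
name is hypothetical, and the robust mean and $\sigma$ are assumed to
have been estimated as described in the statistics note below) might
look like:
\begin{verbatim}
// Illustrative sketch: trim a spectrum at 8 sigma above the (robust) mean
// before the baseline is estimated, so that strong signals do not produce
// a dip in the baseline-subtracted spectrum.
#include <vector>

std::vector<float> clipForBaseline(const std::vector<float>& spectrum,
                                   float mean, float sigma)
{
    const float limit = mean + 8.0f * sigma;   // the 8-sigma trimming threshold
    std::vector<float> clipped = spectrum;
    for (float& value : clipped)
        if (value > limit) value = limit;      // temporary trim only
    return clipped;
}
\end{verbatim}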

\secC{Ignoring bright Milky Way emission}

Finally, a single set of contiguous channels can be ignored -- these
may exhibit very strong emission, such as that from the Milky Way as
seen in extragalactic \hi\ cubes (hence the references to ``Milky
Way'' in relation to this task -- apologies to Galactic
astronomers!). Such dominant channels will produce many detections
that are unnecessary, uninteresting (if one is interested in
extragalactic \hi) and large (in size and hence in memory usage), and
so will slow the program down and detract from the interesting
detections.

The use of this feature is controlled by the \texttt{flagMW}
parameter, and the exact channels concerned can be set by the
user (using \texttt{minMW} and \texttt{maxMW}, which give an
inclusive range of channels). When employed, these channels are
ignored for the searching, and the scaling of the spectral output (see
Fig.~\ref{fig-spect}) will not take them into account. They will be
present in the reconstructed array, however, and so will be included
in the saved FITS file (see \S\ref{sec-reconIO}). When the final
spectra are plotted, the range of channels covered by these parameters
is indicated by a green hashed box.
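
The channel test implied by these parameters is simply the following
(a trivial illustration; the function name is hypothetical):
\begin{verbatim}
// A channel z is ignored in the search when flagMW is set and z lies in
// the inclusive range [minMW, maxMW].
bool isMilkyWayChannel(int z, bool flagMW, int minMW, int maxMW)
{
    return flagMW && (z >= minMW) && (z <= maxMW);
}
\end{verbatim}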

\secB{Image reconstruction}
\label{sec-recon}

The user can direct \duchamp\ to reconstruct the data cube using the
\atrous\ wavelet procedure. A good description of the procedure can be
found in \citet{starck02:book}. The reconstruction is an effective way
of removing a lot of the noise in the image, allowing one to search
reliably to fainter levels, and reducing the number of spurious
detections. This is an optional step, but one that greatly enhances
the source-detection process, with the trade-off that it can be
relatively time- and memory-intensive.

\secC{Algorithm}

The steps in the \atrous\ reconstruction are as follows (a small
illustrative sketch is given after the list):
\begin{enumerate}
\item The reconstructed array is set to 0 everywhere.
\item The input array is discretely convolved with a given filter
  function. This is determined from the parameter file via the
  \texttt{filterCode} parameter -- see Appendix~\ref{app-param} for
  details on the filters available.
\item The wavelet coefficients are calculated by taking the difference
  between the convolved array and the input array.
\item If the wavelet coefficients at a given point are above the
  requested threshold (given by \texttt{snrRecon} as the number of
  $\sigma$ above the mean and adjusted to the current scale -- see
  Appendix~\ref{app-scaling}), add these to the reconstructed array.
\item The separation of the filter coefficients is doubled. (Note that
  this step provides the name of the procedure\footnote{\atrous\ means
  ``with holes'' in French.}, as gaps or holes are created in the
  filter coverage.)
\item The procedure is repeated from step 2, using the convolved array
  as the input array.
\item Continue until the required maximum number of scales is reached.
\item Add the final smoothed (\ie convolved) array to the
  reconstructed array. This provides the ``DC offset'', as each of the
  wavelet coefficient arrays will have zero mean.
\end{enumerate}
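
The following one-dimensional sketch shows the structure of this loop.
It is illustrative only: the $B_3$-spline filter coefficients, the
boundary treatment and the fixed threshold are assumptions, and in
\duchamp\ the threshold is recomputed for each scale from the robust
statistics described below.
\begin{verbatim}
// Illustrative 1-D "a trous" reconstruction loop (not Duchamp's actual code).
#include <cmath>
#include <vector>

std::vector<double> atrousReconstruct(const std::vector<double>& input,
                                      int numScales, double threshold)
{
    const double filter[5] = {1.0/16, 1.0/4, 3.0/8, 1.0/4, 1.0/16}; // B3 spline
    const int N = static_cast<int>(input.size());
    std::vector<double> recon(N, 0.0);       // step 1: reconstruction set to zero
    std::vector<double> current = input;     // array to be convolved
    int spacing = 1;                         // separation of filter coefficients

    for (int scale = 1; scale <= numScales; ++scale) {
        // Step 2: discrete convolution with the dilated filter.
        std::vector<double> smoothed(N, 0.0);
        for (int i = 0; i < N; ++i) {
            for (int j = -2; j <= 2; ++j) {
                int k = i + j * spacing;
                if (k < 0) k = 0;            // simplest boundary treatment: clamp
                if (k >= N) k = N - 1;
                smoothed[i] += filter[j + 2] * current[k];
            }
        }
        // Steps 3 & 4: wavelet coefficients, kept only above the threshold.
        for (int i = 0; i < N; ++i) {
            double w = current[i] - smoothed[i];
            if (std::fabs(w) > threshold) recon[i] += w;
        }
        spacing *= 2;                        // step 5: double the separation
        current = smoothed;                  // step 6: convolved array becomes input
    }                                        // step 7: loop over the scales
    for (int i = 0; i < N; ++i) recon[i] += current[i];  // step 8: add DC offset
    return recon;
}
\end{verbatim}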

The reconstruction has at least two iterations. The first iteration
makes an initial pass at the wavelet reconstruction (the process
outlined in the eight steps above), but the residual array will likely
have some structure still in it, so the wavelet filtering is done on
the residual, and any significant wavelet terms are added to the final
reconstruction. This step is repeated until the change in the measured
standard deviation of the background (see note below on the evaluation
of this quantity) is less than some fiducial amount.

It is important to note that the \atrous\ decomposition is an example
of a ``redundant'' transformation. If no thresholding is performed,
the sum of all the wavelet coefficient arrays and the final smoothed
array is identical to the input array. The thresholding thus removes
only the unwanted structure in the array.

Note that any BLANK pixels that are still in the cube will not be
altered by the reconstruction -- they will be left as BLANK so that
the shape of the valid part of the cube is preserved.

\secC{Note on Statistics}

The correct calculation of the reconstructed array needs good
estimators of the underlying mean and standard deviation of the
background noise distribution. These statistics are estimated using
robust methods, to avoid corruption by strong outlying points. The
mean of the distribution is actually estimated by the median, while
the median absolute deviation from the median (MADFM) is calculated
and corrected assuming Gaussianity to estimate the underlying standard
deviation $\sigma$. The Gaussianity (or Normality) assumption is
critical, as the MADFM does not give the same value as the usual rms
or standard deviation value -- for a normal distribution
$N(\mu,\sigma)$ we find MADFM$=0.6744888\sigma$. Since this ratio is
corrected for, the user need only think in the usual multiples of
$\sigma$ when setting \texttt{snrRecon}. See Appendix~\ref{app-madfm}
for a derivation of this value.
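
These estimators amount to the following short calculation (a rough
sketch rather than \duchamp's actual routines; the even-length median
here is approximate):
\begin{verbatim}
// Robust noise estimate: median for the mean, MADFM corrected by 0.6744888
// for the standard deviation.
#include <algorithm>
#include <cmath>
#include <vector>

double median(std::vector<double> values)
{
    std::nth_element(values.begin(), values.begin() + values.size()/2, values.end());
    return values[values.size()/2];          // approximate for even-length arrays
}

double robustSigma(const std::vector<double>& data)
{
    const double med = median(data);
    std::vector<double> absDev(data.size());
    for (size_t i = 0; i < data.size(); ++i)
        absDev[i] = std::fabs(data[i] - med);
    return median(absDev) / 0.6744888;       // convert MADFM to Gaussian sigma
}
\end{verbatim}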

When thresholding the different wavelet scales, the value of $\sigma$
as measured from the wavelet array needs to be scaled to account for
the increased amount of correlation between neighbouring pixels (due
to the convolution). See Appendix~\ref{app-scaling} for details on
this scaling.

\secC{User control of reconstruction parameters}

The most important parameter for the user to select in relation to the
reconstruction is the threshold for each wavelet array. This is set
using the \texttt{snrRecon} parameter, and is given as a multiple of
the rms (estimated by the MADFM) above the mean (which for the wavelet
arrays should be approximately zero). Several other parameters can
also be altered to affect the outcome of the reconstruction.

By default, the cube is reconstructed in three dimensions, using a
3-dimensional filter and 3-dimensional convolution. This can be
altered, however, using the parameter \texttt{reconDim}. If set to 1,
this means the cube is reconstructed by considering each spectrum
separately, whereas \texttt{reconDim=2} will mean the cube is
reconstructed by doing each channel map separately. The merits of
these choices are discussed in \S\ref{sec-notes}, but it should be
noted that a 2-dimensional reconstruction can be susceptible to edge
effects if the spatial shape of the pixel array is not rectangular.

The user can also select the minimum scale to be used in the
reconstruction. The first scale exhibits the highest frequency
variations, and so ignoring this one can sometimes be beneficial in
removing excess noise. The default is to use all scales
(\texttt{minscale = 1}).

Finally, the filter that is used for the convolution can be selected
by using \texttt{filterCode} and the relevant code number -- the
choices are listed in Appendix~\ref{app-param}. A larger filter will
give a better reconstruction, but take longer and use more memory when
executing. When multi-dimensional reconstruction is selected, this
filter is used to construct a 2- or 3-dimensional equivalent.

\secB{Input/Output of reconstructed arrays}
\label{sec-reconIO}

The reconstruction stage can be relatively time-consuming,
particularly for large cubes and reconstructions in 3-D. To get around
this, \duchamp\ provides a shortcut to allow users to perform multiple
searches (\eg with different thresholds) on the same reconstruction
without calculating the reconstruction each time.

The first step is to choose to save the reconstructed array as a FITS
file by setting \texttt{flagOutputRecon = true}. The file will be
saved in the same directory as the input image, so the user needs to
have write permissions for that directory.

The filename will be derived from the input filename, with extra
information detailing the reconstruction that has been done. For
example, suppose \texttt{image.fits} has been reconstructed using a
3-dimensional reconstruction with filter \#2, thresholded at $4\sigma$
using all scales. The output filename will then be
\texttt{image.RECON-3-2-4-1.fits} (\ie it uses the four parameters
relevant for the \atrous\ reconstruction as listed in
Appendix~\ref{app-param}). The new FITS file will also have these
parameters as header keywords. If a subsection of the input image has
been used (see \S\ref{sec-input}), the format of the output filename
will be \texttt{image.sub.RECON-3-2-4-1.fits}, and the subsection that
has been used is also stored in the FITS header.
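
As a purely illustrative sketch (the function and its arguments are
hypothetical, not \duchamp's code), the output name could be assembled
as follows:
\begin{verbatim}
// Hypothetical helper forming the RECON filename from the four atrous
// parameters: reconDim, filterCode, snrRecon and minscale.
#include <cstdio>
#include <string>

std::string reconFilename(const std::string& base, int reconDim, int filterCode,
                          float snrRecon, int minscale)
{
    char suffix[64];
    std::snprintf(suffix, sizeof(suffix), ".RECON-%d-%d-%g-%d.fits",
                  reconDim, filterCode, snrRecon, minscale);
    return base + suffix;   // e.g. "image" becomes "image.RECON-3-2-4-1.fits"
}
\end{verbatim}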

Likewise, the residual image, defined as the difference between the
input and reconstructed arrays, can also be saved in the same manner
by setting \texttt{flagOutputResid = true}. Its filename will be the
same as above, with \texttt{RESID} replacing \texttt{RECON}.

If a reconstructed image has been saved, it can be read in and used
instead of redoing the reconstruction. To do so, the user should set
\texttt{flagReconExists = true}. The user can indicate the name of the
reconstructed FITS file using the \texttt{reconFile} parameter, or, if
this is not specified, \duchamp\ searches for the file with the name
as defined above. If the file is not found, the reconstruction is
performed as normal. Note that to do this, the user needs to set
\texttt{flagAtrous = true} (obviously, if this is \texttt{false}, the
reconstruction is not needed).

\secB{Smoothing the cube}
\label{sec-smoothing}

An alternative to doing the wavelet reconstruction is to Hanning
smooth the cube. This technique can be useful in reducing the noise
level slightly (at the cost of making neighbouring pixels correlated
and blurring any signal present), and is particularly well suited to
the case where a particular signal width is believed to be present in
the data. It is also substantially faster than the wavelet
reconstruction.

The cube is smoothed only in the spectral domain. That is, each
spectrum is independently smoothed, and the smoothed spectra are then
put together to form the smoothed cube. This is then treated in the
same way as the reconstructed cube, and is used for the searching
algorithm (see below). Note that if both the reconstruction and
smoothing options are requested, the reconstruction will take
precedence and the smoothing will \emph{not} be done.

There is only one parameter necessary to define the degree of
smoothing -- the Hanning width $a$ (given by the user parameter
\texttt{hanningWidth}). The coefficients of the Hanning filter are
defined by
\[
\frac{1+\cos(\pi x/a)}{2},\quad -\frac{a+1}{2}\leq x \leq \frac{a+1}{2},
\]
and zero elsewhere. Note that the width specified must be an odd
integer (if the parameter provided is even, it is incremented by one).
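
A minimal sketch of these coefficients (illustrative only; the function
name is hypothetical, and whether the filter is normalised to unit sum
is not specified here) is:
\begin{verbatim}
// Hanning filter coefficients of odd width a, following the definition above:
// c(x) = (1 + cos(pi x / a)) / 2 for x = -(a-1)/2 ... (a-1)/2.
#include <cmath>
#include <vector>

std::vector<double> hanningCoefficients(int a)
{
    if (a % 2 == 0) ++a;                          // width must be an odd integer
    std::vector<double> coeffs(a);
    for (int i = 0; i < a; ++i) {
        double x = i - (a - 1) / 2.0;             // offset from the filter centre
        coeffs[i] = 0.5 * (1.0 + std::cos(M_PI * x / a));
    }
    return coeffs;
}
\end{verbatim}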

The user is able to save the smoothed array in exactly the same manner
as for the reconstructed array -- set \texttt{flagOutputSmooth =
true}, and then the smoothed array will be saved in
\texttt{image.SMOOTH-a.fits}, where \texttt{a} is replaced by the
Hanning width used. Similarly, a saved file can be read in by setting
\texttt{flagSmoothExists = true} and either specifying a file to be
read with the \texttt{smoothFile} parameter or relying on \duchamp\ to
find the file with the name as given above.

\secB{Searching the image}
\label{sec-detection}

The image is searched for detections in two ways: spectrally (a
1-dimensional search in the spectrum in each spatial pixel), and
spatially (a 2-dimensional search in the spatial image in each
channel). In both cases, the algorithm finds connected pixels that are
above the user-specified threshold. In the case of the spatial image
search, the algorithm of \citet{lutz80} is used to raster-scan through
the image and connect groups of pixels on neighbouring rows.

Note that this algorithm cannot be applied directly to a 3-dimensional
case, as it requires that objects are completely nested in a row: that
is, if you are scanning along a row, and one object finishes and
another starts, you know that you will not get back to the first one
(if at all) until the second is completely finished for that
row. Three-dimensional data does not have this property, which is why
we break up the searching into 1- and 2-dimensional cases.

The basic idea behind detection is to locate sets of contiguous voxels
that lie above some threshold. \duchamp\ now calculates one threshold
for the entire cube (previous versions calculated thresholds for each
spectrum and image). This enables calculation of signal-to-noise
ratios for each source (see Section~\ref{sec-output} for details). The
user can manually specify a value (using the parameter
\texttt{threshold}) for the threshold, which will override the
calculated value. Note that this only applies for the first of the two
cases discussed below -- the FDR case ignores any manually-set
threshold value.

The determination of the threshold is done in one of two ways. The
first way is a simple sigma-clipping, where a threshold is set at a
fixed number $n$ of standard deviations above the mean, and pixels
above this threshold are flagged as detected. The value of $n$ is set
with the parameter \texttt{snrCut}. As before, the value of the
standard deviation is estimated by the MADFM, and corrected by the
ratio derived in Appendix~\ref{app-madfm}.
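
That is, the threshold in this case takes the form
\[
T = \mu_{\rm est} + n\,\sigma_{\rm est},
\]
where $\mu_{\rm est}$ and $\sigma_{\rm est}$ are the robust estimates
described above (the median, and the MADFM divided by 0.6744888), and
$n$ is given by \texttt{snrCut}.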

The second method uses the False Discovery Rate (FDR) technique
\citep{miller01,hopkins02}, whose basis we briefly detail here. The
false discovery rate (given by the number of false detections divided
by the total number of detections) is fixed at a certain value
$\alpha$ (\eg $\alpha=0.05$ implies 5\% of detections are false
positives). In practice, an $\alpha$ value is chosen, and the ensemble
average FDR (\ie $\langle{\rm FDR}\rangle$) when the method is used will
be less than $\alpha$. One calculates $p$ -- the probability,
assuming the null hypothesis is true, of obtaining a test statistic as
extreme as the pixel value (the observed test statistic) -- for each
pixel, and sorts them in increasing order. One then calculates $d$
where
\[
d = \max_j \left\{ j : P_j < \frac{j\alpha}{c_N N} \right\},
\]
and then rejects all hypotheses whose $p$-values are less than or
equal to $P_d$. (So a pixel with $P_i<P_d$ will be rejected even if
$P_i \geq i\alpha/(c_N N)$.) Note that ``reject hypothesis'' here means
``accept the pixel as an object pixel'' (\ie we are rejecting the null
hypothesis that the pixel belongs to the background).

The $c_N$ values here are normalisation constants that depend on the
correlated nature of the pixel values. If all the pixels are
uncorrelated, then $c_N=1$. If $N$ pixels are correlated, then their
tests will be dependent on each other, and so $c_N = \sum_{i=1}^N
i^{-1}$. \citet{hopkins02} consider real radio data, where the pixels
are correlated over the beam. In this case the sum is made over the
$N$ pixels that make up the beam. The value of $N$ is calculated from
the FITS header (if the correct keywords -- BMAJ, BMIN -- are not
present, the size of the beam is taken from the parameter
\texttt{beamSize}).
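
The calculation of the cutoff $P_d$ can be sketched as follows (an
illustrative rendering of the formula above, not \duchamp's code;
\texttt{beamN} is assumed to be the number of beam pixels over which
the $c_N$ sum is made):
\begin{verbatim}
// Find the FDR p-value cutoff P_d: the largest j with P_j < j*alpha/(c_N * N).
// Pixels whose p-value is <= the returned cutoff are accepted as object pixels.
#include <algorithm>
#include <vector>

double fdrCutoff(std::vector<double> pvalues, double alpha, int beamN)
{
    std::sort(pvalues.begin(), pvalues.end());        // P_1 <= P_2 <= ... <= P_N
    double cN = 0.0;
    for (int i = 1; i <= beamN; ++i) cN += 1.0 / i;   // c_N = sum_{i=1}^{N} 1/i
    const double N = static_cast<double>(pvalues.size());
    double cutoff = 0.0;                              // this will become P_d
    for (std::size_t j = 1; j <= pvalues.size(); ++j)
        if (pvalues[j - 1] < j * alpha / (cN * N))
            cutoff = pvalues[j - 1];                  // keep the largest valid j
    return cutoff;
}
\end{verbatim}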

The theory behind the FDR method implies a direct connection between
the choice of $\alpha$ and the fraction of detections that will be
false positives. However, due to the merging process, this direct
connection is lost when looking at the final number of detections --
see discussion in \S\ref{sec-notes}. The effect is that the number of
false detections will be less than indicated by the $\alpha$ value
used.

If the cube has been reconstructed or smoothed, the residuals (defined
in the sense of original $-$ reconstruction) are used to estimate the
noise parameters of the cube. Otherwise they are estimated directly
from the cube itself. In both cases, robust estimators are used.

Detections must have a minimum number of pixels to be counted. This
minimum number is given by the input parameters \texttt{minPix} (for
2-dimensional searches) and \texttt{minChannels} (for 1-dimensional
searches).

Finally, the search only looks for positive features. If one is
interested instead in negative features (such as absorption lines),
set the parameter \texttt{flagNegative = true}. This will invert the
cube (\ie multiply all pixels by $-1$) prior to the search, and then
re-invert the cube (and the fluxes of any detections) after searching
is complete. All outputs are done in the same manner as normal, so
that fluxes of detections will be negative.

\secB{Merging detected objects}
\label{sec-merger}

The searching step produces a list of detected objects that will have
many repeated detections of a given object -- for instance, spectral
detections in adjacent pixels of the same object and/or spatial
detections in neighbouring channels. These are then combined in an
algorithm that matches all objects judged to be ``close'', according
to one of two criteria.

One criterion is to define two thresholds -- one spatial and one in
velocity -- and say that two objects should be merged if there is at
least one pair of pixels that lie within these threshold distances of
each other. These thresholds are specified by the parameters
\texttt{threshSpatial} and \texttt{threshVelocity} (in units of pixels
and channels respectively).

Alternatively, the spatial requirement can be changed to say that
there must be a pair of pixels that are \emph{adjacent} -- a stricter,
but perhaps more realistic requirement, particularly when the spatial
pixels have a large angular size (as is the case for \hi\
surveys). This method can be selected by setting the parameter
\texttt{flagAdjacent} to 1 (\ie \texttt{true}) in the parameter
file. The velocity thresholding is done in the same way as the first
option.
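
The ``closeness'' test for a pair of pixels can be sketched as below.
This is illustrative only: the function name is hypothetical, and
whether the spatial threshold is applied to the Euclidean separation or
to each axis separately is an assumption.
\begin{verbatim}
// Decide whether two pixels, separated by (dx, dy) spatially and dz channels,
// are close enough for their parent objects to be merged.
#include <cmath>

bool pixelsClose(double dx, double dy, double dz,
                 double threshSpatial, double threshVelocity, bool flagAdjacent)
{
    const bool spatialOK = flagAdjacent
        ? (std::fabs(dx) <= 1.0 && std::fabs(dy) <= 1.0)  // strictly adjacent
        : (std::hypot(dx, dy) <= threshSpatial);          // within spatial threshold
    const bool velocityOK = std::fabs(dz) <= threshVelocity;
    return spatialOK && velocityOK;
}
\end{verbatim}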

Once the detections have been merged, they may be ``grown''. This is a
process of increasing the size of the detection by adding adjacent
pixels that are above some secondary threshold. This threshold is
lower than the one used for the initial detection, but above the noise
level, so that faint pixels are only detected when they are close to a
bright pixel. The value of this threshold is set by the input
parameter \texttt{growthCut}, with a default value of
$1.5\sigma$. The use of the growth algorithm is controlled by the
\texttt{flagGrowth} parameter, whose default value is
\texttt{false}. If the detections are grown, they are sent through the
merging algorithm a second time, to pick up any detections that now
overlap or have grown over each other.

Finally, to be accepted, the detections must span \emph{both} a
minimum number of channels (to remove any spurious single-channel
spikes that may be present), and a minimum number of spatial
pixels. These numbers, as for the original detection step, are set
with the \texttt{minChannels} and \texttt{minPix} parameters. The
channel requirement means there must be at least one set of
\texttt{minChannels} consecutive channels in the source for it to be
accepted.