% -----------------------------------------------------------------------
% hints.tex: Section giving some tips & hints on how Duchamp is best
%            used.
% -----------------------------------------------------------------------
% Copyright (C) 2006, Matthew Whiting, ATNF
%
% This program is free software; you can redistribute it and/or modify it
% under the terms of the GNU General Public License as published by the
% Free Software Foundation; either version 2 of the License, or (at your
% option) any later version.
%
% Duchamp is distributed in the hope that it will be useful, but WITHOUT
% ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
% FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
% for more details.
%
% You should have received a copy of the GNU General Public License
% along with Duchamp; if not, write to the Free Software Foundation,
% Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA
%
% Correspondence concerning Duchamp may be directed to:
%    Internet email: Matthew.Whiting [at] atnf.csiro.au
%    Postal address: Dr. Matthew Whiting
%                    Australia Telescope National Facility, CSIRO
%                    PO Box 76
%                    Epping NSW 1710
%                    AUSTRALIA
% -----------------------------------------------------------------------
\secA{Notes and hints on the use of \duchamp}
\label{sec-notes}

In using \duchamp, the user has to make a number of decisions about
the way the program runs. This section is designed to give the user
some idea about what to choose.

\secB{Memory usage}

A lot of attention has been paid to memory usage in \duchamp,
recognising that data cubes will only increase in size with
new-generation correlators and wider fields of view. However, users
with large cubes should be aware of the likely usage for different
modes of operation and plan their \duchamp execution carefully.

At the start of the program, memory is allocated sufficient for:
\begin{itemize}
\item The entire pixel array (as requested, subject to any
subsection).
\item The spatial extent, which holds the map of detected pixels (for
output into the detection map).
\item If smoothing or reconstruction has been selected, another array
of the same size as the pixel array. This will hold the
smoothed/reconstructed array (the original must be kept for the
correct parameterisation of detected sources).
\item If baseline-subtraction has been selected, a further array of
the same size as the pixel array. This holds the baseline values,
which need to be added back in prior to parameterisation.
\end{itemize}
All of these are of float type, except for the detection map, which
is short.

There will, of course, be additional allocation during the course of
the program. The detection list will progressively grow, with each
detection having a memory footprint as described in
\S\ref{sec-scan}. Perhaps more important, however, and with a larger
impact, will be the temporary space allocated for various algorithms.

The largest of these will be the wavelet reconstruction. This will
require an additional allocation of twice the size of the array being
reconstructed -- one array for the coefficients and one for the
wavelets, with each scale overwriting the previous one. So, for the 1D
case, this means an additional allocation of twice the spectral
dimension (since we only reconstruct one spectrum at a time), but the
3D case will require an additional allocation of twice the cube
size. This means there needs to be available at least four times the
size of the input cube for 3D reconstruction, plus the additional
overheads of detections and so forth.

The smoothing has less of an impact, since it only operates on the
lower dimensions, but it will make an additional allocation of twice
the relevant size (the spectral dimension for spectral smoothing, or
the spatial image size for the spatial Gaussian smoothing).

The other large allocation of temporary space will be for calculating
robust statistics. The median-based calculations require at least
partial sorting of the data, and so cannot be done on the original
image cube. This is done for the entire cube, so the temporary
increase in memory can be large.

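As an indicative guide only, the following Python sketch adds up these
contributions for a given cube size and choice of options. The function
and its structure are purely illustrative -- they are not part of
\duchamp -- but the byte counts follow the description above.

\begin{verbatim}
# Illustrative sketch: rough peak memory estimate (in GB) for a
# Duchamp-style run, following the allocations described above.
# Not part of Duchamp itself.

def estimate_memory_gb(nx, ny, nz, recon_dim=0, smooth=False,
                       baseline=False):
    voxels = nx * ny * nz
    nbytes = voxels * 4              # input pixel array (float)
    nbytes += nx * ny * 2            # detection map over the spatial extent (short)
    if recon_dim > 0 or smooth:
        nbytes += voxels * 4         # smoothed/reconstructed array (float)
    if baseline:
        nbytes += voxels * 4         # baseline array (float)
    if recon_dim == 3:
        nbytes += 2 * voxels * 4     # coefficient + wavelet work arrays (3D)
    elif recon_dim == 1:
        nbytes += 2 * nz * 4         # work arrays for one spectrum at a time
    nbytes += voxels * 4             # temporary copy for robust (median) statistics
    return nbytes / 1024.0**3

# Example: a HIPASS-sized cube (170 x 160 x 1024) with 3D reconstruction
# and baseline subtraction peaks at roughly 0.6 GB.
print(round(estimate_memory_gb(170, 160, 1024, recon_dim=3, baseline=True), 2))
\end{verbatim}
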
\secB{Timing considerations}

Another interesting question from a user's perspective is how long one
can expect \duchamp to take. This is a difficult question to answer in
general, as different users will have different sized data sets, as
well as machines with different capabilities (in terms of CPU speed
and I/O \& memory bandwidth). Additionally, the time required will
depend slightly on the number of sources found and their size (very
large sources can take a while to fully parameterise).

Having said that, in \citet{whiting12} a brief analysis was done
looking at different modes of execution applied to a single HIPASS
cube (\#201) using a MacBook Pro (2.66~GHz, 8~GB RAM). Two sets of
thresholds were used: either $10^8$~Jy~beam$^{-1}$ (no sources will be
found, so that the time taken is dominated by preprocessing), or
35~mJy~beam$^{-1}$ ($\sim2.58\sigma$, chosen so that the time taken
includes that required to process sources). The basic searches,
with no pre-processing done, took less than a second for the
high-threshold search, but between 1 and 3~min for the low-threshold
case -- the number of sources detected ranged from 3000 (rejecting
sources with fewer than 3 channels and 2 spatial pixels) to 42000
(keeping all sources).

When smoothing, the raw time for the spectral smoothing was only a few
seconds, with a small dependence on the width of the smoothing
filter. Because the number of spurious sources is markedly
decreased (the final catalogues ranged from 17 to 174 sources,
depending on the width of the smoothing), searching with the low
threshold added little more than a second to the time. The spatial
smoothing was more computationally intensive, taking about 4 minutes
to complete the high-threshold search.

The wavelet reconstruction time depended primarily on the
dimensionality of the reconstruction, with the 1D case taking 20~s,
the 2D case taking 30--40~s and the 3D case taking 2--4~min. The
spread in times for a given dimensionality was caused by different
reconstruction thresholds, with lower thresholds taking longer (since
more pixels lie above the threshold and so need to be added to the
final spectrum). In all cases the reconstruction time dominated the
total time for the low-threshold search, since the number of sources
found was again smaller than in the basic searches.


\secB{Why do preprocessing?}

The preprocessing options provided by \duchamp, particularly the
ability to smooth or reconstruct via multi-resolution wavelet
decomposition, provide an opportunity to beat the effects of the
random noise that will be present in the data. This noise will
ultimately limit one's ability to detect objects and form a complete
and reliable catalogue. Two effects are important here. First, the
noise reduces the completeness of the final catalogue by suppressing
the flux of real sources such that they fall below the detection
threshold. Second, the noise produces false detections through noise
peaks that fall above the threshold, thereby reducing the reliability
of the catalogue.

\citet{whiting12} examined the effect on completeness and reliability
for the reconstruction and smoothing (1D cases only) when applied to a
simple simulated dataset. Both had the effect of reducing the number
of spurious sources, which means the searches can be done to fainter
thresholds. This led to completeness levels of about one flux unit
(equal to one standard deviation of the noise) fainter than searches
without pre-processing, with $>95\%$ reliability. The smoothing did
slightly better, with the completeness level nearly half a flux unit
fainter than the reconstruction, although this was helped by the
sources in the simulation all having the same spectral size.

\secB{Reconstruction considerations}

The \atrous wavelet reconstruction approach is designed to remove a
large amount of random noise while preserving as much structure as
possible on the full range of spatial and/or spectral scales present
in the data. While it is relatively more expensive in terms of memory
and CPU usage (see the previous sections), its effect on the
reliability of the final catalogue in particular makes it worth
investigating.

There are, however, a number of subtleties to it that need to be
considered by potential users. \citet{whiting12} shows a set of
examples of reconstruction applied to simulated and real data. The
real data, in this case a HIPASS cube, shows differences in the
quality of the reconstructed spectrum depending on the dimensionality
of the reconstruction. The two-dimensional reconstruction (where the
cube is reconstructed one channel map at a time) shows much larger
channel-to-channel noise, with a number of narrow peaks surviving the
reconstruction process. The problem here is that there are spatial
correlations between pixels due to the beam, which allow beam-sized
noise fluctuations to rise above the threshold more frequently than in
the one-dimensional case. In addition, unlike in the 1D reconstruction,
each channel is reconstructed independently and so takes no account of
its neighbouring channels. This is also why the 3D reconstruction
(which also suffers from the beam effects) has improved noise in the
output spectrum, since the information from neighbouring channels is
taken into account.

Caution is also advised when looking at subsections of a cube. Due to
the multi-scale nature of the algorithm, the wavelet coefficients at a
given pixel are influenced by pixels at very large separations,
particularly given that edges are dealt with by assuming reflection
(so the whole array is visible to all pixels). Also, if one decreases
the dimensions of the array being reconstructed, there may be fewer
scales used in the reconstruction. These points mean that the
reconstruction of a subsection of a cube will differ from the same
subsection of the reconstructed cube. The difference may be small
(depending on the relative size difference and the amount of structure
at large scales), but there will be differences at some level.
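
To make the boundary and scale behaviour concrete, the following Python
sketch performs a simplified 1D \atrous decomposition with the B3-spline
filter and reflective boundaries. It is purely illustrative -- it is not
\duchamp's implementation -- but it shows how every pixel can influence
distant coefficients at the larger scales, and how the number of scales
is set by the array length.

\begin{verbatim}
import numpy as np

# Simplified 1D "a trous" decomposition with a B3-spline filter and
# reflective boundaries. Illustrative only -- not Duchamp's code.

B3 = np.array([1, 4, 6, 4, 1]) / 16.0

def reflect(i, n):
    """Reflect an out-of-range index back into [0, n-1]."""
    while i < 0 or i >= n:
        i = -i - 1 if i < 0 else 2 * n - i - 1
    return i

def atrous_decompose(spectrum):
    n = len(spectrum)
    nscales = int(np.log2(n))          # a shorter array means fewer scales
    c = spectrum.astype(float)
    wavelets = []
    for s in range(1, nscales + 1):
        spacing = 2 ** (s - 1)         # filter gaps double at each scale
        smoothed = np.empty(n)
        for i in range(n):
            idx = [reflect(i + (k - 2) * spacing, n) for k in range(5)]
            smoothed[i] = np.dot(B3, c[idx])
        wavelets.append(c - smoothed)  # wavelet coefficients at this scale
        c = smoothed                   # carry the smoothed array forward
    return wavelets, c                 # sum(wavelets) + c recovers the input

spec = np.random.normal(size=256)
w, residual = atrous_decompose(spec)
print(np.allclose(sum(w) + residual, spec))   # True: exact decomposition
\end{verbatim}

Thresholding the wavelet coefficients scale by scale before summing them
back is what produces the reconstructed, noise-suppressed array.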

Note also that BLANK pixels have no effect on the reconstruction: they
remain BLANK in the output, and do not contribute to the discrete
convolution where they otherwise would. The Milky Way channel range, on
the other hand, is not taken into account by the reconstruction -- it
is applied only after the preprocessing, in either the searching or the
rejection stage.

\secB{Smoothing considerations}

The smoothing approach differs from the wavelet reconstruction in that
it has a single scale associated with it. The user has two choices to
make: which dimension to smooth in (spatially or spectrally), and
what size kernel to smooth with. \citet{whiting12} shows examples of
how different smoothing widths (in one dimension in this case) can
highlight sources of different sizes. If one has some \textit{a
  priori} idea of the typical size scale of the objects one wishes to
detect, then choosing a single smoothing scale can be quite
beneficial.
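
As a concrete illustration of the choice of kernel width, the short
Python sketch below applies spectral Hanning smoothing of a chosen odd
width to a single spectrum. The filter normalisation used here is the
usual Hanning convention and is only indicative of how \duchamp behaves,
not a copy of its code.

\begin{verbatim}
import numpy as np

# Minimal sketch of spectral Hanning smoothing (illustrative only).

def hanning_smooth(spectrum, width=3):
    """Smooth a 1D spectrum with a Hanning filter of odd width."""
    if width % 2 == 0:
        raise ValueError("width must be odd")
    kernel = np.hanning(width + 2)[1:-1]   # width=3 gives [0.5, 1.0, 0.5]
    kernel /= kernel.sum()                 # normalise to preserve total flux
    return np.convolve(spectrum, kernel, mode="same")

# Wider filters suppress more of the narrow, noise-like features:
spec = np.random.normal(size=512)
for w in (3, 5, 11):
    print(w, hanning_smooth(spec, w).std())   # noise rms drops as width grows
\end{verbatim}

A width of 3 is the smallest possible filter; matching the width to the
expected spectral extent of the sources gives the best gain in
signal-to-noise.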

Note that beam effects can be important here too, when smoothing
spatial data on scales close to that of the beam. This can enhance
beam-sized noise fluctuations and potentially introduce spurious
sources. As always, examining the smoothed array (after saving it via
\texttt{flagOutputSmooth}) is a good idea.
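
Because this effect is easy to underestimate, the following sketch
(using NumPy and SciPy, purely for illustration) smooths an image of
pure noise with a beam-sized Gaussian kernel: the peaks that survive a
high cut are now extended, beam-sized blobs rather than single pixels,
which is exactly what makes them look source-like.

\begin{verbatim}
import numpy as np
from scipy.ndimage import gaussian_filter

# Smooth pure noise with a beam-sized Gaussian kernel (illustrative only):
# the surviving high peaks are extended, beam-sized fluctuations.
rng = np.random.default_rng(42)
noise = rng.normal(size=(512, 512))
smoothed = gaussian_filter(noise, sigma=3.0)   # kernel scale ~ beam, in pixels
smoothed /= smoothed.std()                     # renormalise to unit rms
print((smoothed > 4.0).sum())                  # >4-sigma pixels now come in blobs
\end{verbatim}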


\secB{Threshold method}

When it comes to searching, the FDR method produces more reliable
results than simple sigma-clipping, particularly in the absence of
reconstruction. However, it does not work in exactly the way one
would expect for a given value of \texttt{alpha}. For instance,
setting fairly liberal values of \texttt{alpha} (say, 0.1) will often
lead to a much smaller fraction of false detections (\ie much less
than 10\%). This is the effect of the merging algorithms, which combine
the sources after the detection stage and reject detections not
meeting the minimum pixel or channel requirements. It is thus better
to aim for larger \texttt{alpha} values than those derived from a
straight conversion of the desired false detection rate.
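
The logic behind the FDR selection can be sketched with the generic
Benjamini-Hochberg procedure applied to per-pixel p-values, as below.
This is an illustration only: \duchamp's actual implementation also
includes a correction for the correlation between neighbouring pixels,
which is omitted here.

\begin{verbatim}
import numpy as np
from scipy.stats import norm

# Sketch of a Benjamini-Hochberg FDR threshold on per-pixel p-values.
# Illustrative only; the correction for correlated pixels is omitted.

def fdr_threshold(pixel_values, sigma, alpha=0.1):
    """Flux cutoff such that the expected fraction of false detections
    among the selected pixels is at most alpha."""
    p = norm.sf(pixel_values / sigma)      # one-sided p-value per pixel
    p_sorted = np.sort(p)
    n = p_sorted.size
    # largest k with p_(k) <= k * alpha / n
    ok = np.nonzero(p_sorted <= alpha * np.arange(1, n + 1) / n)[0]
    if ok.size == 0:
        return np.inf                      # nothing passes the test
    return sigma * norm.isf(p_sorted[ok[-1]])   # convert back to a flux

# On pure noise, even a liberal alpha usually selects nothing at all.
noise = np.random.normal(scale=1.0, size=100000)
print(fdr_threshold(noise, sigma=1.0, alpha=0.1))
\end{verbatim}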

If the FDR method is not used, caution is required when choosing the
S/N cutoff. Typical cubes have very large numbers of pixels, so even
an apparently large cutoff will still result in a not-insignificant
number of detections simply due to random fluctuations of the noise
background. For instance, a $4\sigma$ threshold on a cube of Gaussian
noise of size $100\times100\times1024$ will result in $\sim340$
single-pixel detections. This is where the minimum channel and pixel
requirements are important in rejecting spurious detections.
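
This estimate follows from a one-line calculation, assuming independent
Gaussian noise pixels:

\begin{verbatim}
from math import erfc, sqrt

# Expected number of single-pixel false detections above a 4-sigma cut,
# assuming independent Gaussian noise pixels (order-of-magnitude only).
npix = 100 * 100 * 1024            # number of voxels in the cube
tail = 0.5 * erfc(4 / sqrt(2))     # one-sided tail probability P(z > 4)
print(npix * tail)                 # ~320, the same order as quoted above
\end{verbatim}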


%  \secB{Preprocessing}

%  \secC{Should I do any preprocessing?}

%  The main choice is whether to alter the cube to try and enhance the
%  detectability of objects, by either smoothing or reconstructing via
%  the \atrous method. The main benefits of both methods are the marked
%  reduction in the noise level, leading to regularly-shaped detections,
%  and good reliability for faint sources.

%  The main drawback with the \atrous method is the long execution time:
%  to reconstruct a $170\times160\times1024$ (\hipass) cube often
%  requires three iterations and takes about 20--25 minutes to run
%  completely. Note that this is for the more complete three-dimensional
%  reconstruction: using \texttt{reconDim = 1} makes the reconstruction
%  quicker (the full program then takes less than 5 minutes), but it is
%  still the largest part of the time.

%  The smoothing procedure is computationally simpler, and thus quicker,
%  than the reconstruction. The spectral Hanning method adds only a very
%  small overhead on the execution, and the spatial Gaussian method,
%  while taking longer, will be done (for the above example) in less than
%  2 minutes. Note that these times will depend on the size of the
%  filter/kernel used: a larger filter means more calculations.

%  The searching part of the procedure is much quicker: searching an
%  un-reconstructed cube leads to execution times of less than a
%  minute. Alternatively, using the ability to read in previously-saved
%  reconstructed arrays makes running the reconstruction more than once a
%  more feasible prospect.

%  On the positive side, the shape of the detections in a cube that has
%  been reconstructed or smoothed will be much more regular and smooth --
%  the ragged edges that objects in the raw cube possess are smoothed by
%  the removal of most of the noise. This enables better determination of
%  the shapes and characteristics of objects.

%  \secC{Reconstruction vs Smoothing}

%  While the time overhead is larger for the reconstruction case, it will
%  potentially provide a better recovery of real sources than the
%  smoothing case. This is because it probes the full range of scales
%  present in the cube (or spectral domain), rather than the specific
%  scale determined by the Hanning filter or Gaussian kernel used in the
%  smoothing.

%  When considering the reconstruction method, note that the 2D
%  reconstruction (\texttt{reconDim = 2}) can be susceptible to edge
%  effects. If the valid area in the cube (\ie the part that is not
%  BLANK) has non-rectangular edges, the convolution can produce
%  artefacts in the reconstruction that mimic the edges and can lead
%  (depending on the selection threshold) to some spurious
%  sources. Caution is advised with such data -- the user is advised to
%  check carefully the reconstructed cube for the presence of such
%  artefacts.

%  A more important effect for 2D reconstructions
%  is the fact that the pixels in the spatial domain typically exhibit
%  some correlation due to the beam. Since each channel is reconstructed
%  independently, beam-sized noise fluctuations can rise above the
%  reconstruction threshold more frequently than in the 1D case, providing
%  a greater number of spurious single-channel spikes in a given
%  reconstructed spectrum. This effect will also be present in 3D
%  reconstructions, although to a lesser degree since information in the
%  spectral direction is also taken into account.

%  If one chooses the reconstruction method, a further decision is
%  required on the signal-to-noise cutoff used in determining acceptable
%  wavelet coefficients. A larger value will remove more noise from the
%  cube, at the expense of losing fainter sources, while a smaller value
%  will include more noise, which may produce spurious detections, but
%  will be more sensitive to faint sources. Values of less than about
%  $3\sigma$ tend not to reduce the noise a great deal and can lead to
%  many spurious sources (this depends, of course, on the cube itself).

%  The smoothing options have fewer parameters to consider: basically just
%  the size of the smoothing function or kernel. Spectrally smoothing
%  with a Hanning filter of width 3 (the smallest possible) is very
%  efficient at removing spurious one-channel objects that may result
%  just from statistical fluctuations of the noise. One may want to use
%  larger widths or kernels of larger size to look for features of a
%  particular scale in the cube.

%  When it comes to searching, the FDR method produces more reliable
%  results than simple sigma-clipping, particularly in the absence of
%  reconstruction.  However, it does not work in exactly the way one
%  would expect for a given value of \texttt{alpha}. For instance,
%  setting fairly liberal values of \texttt{alpha} (say, 0.1) will often
%  lead to a much smaller fraction of false detections (\ie much less
%  than 10\%). This is the effect of the merging algorithms, that combine
%  the sources after the detection stage, and reject detections not
%  meeting the minimum pixel or channel requirements.  It is thus better
%  to aim for larger \texttt{alpha} values than those derived from a
%  straight conversion of the desired false detection rate.

%  If the FDR method is not used, caution is required when choosing the
%  S/N cutoff. Typical cubes have very large numbers of pixels, so even
%  an apparently large cutoff will still result in a not-insignificant
%  number of detections simply due to random fluctuations of the noise
%  background. For instance, a $4\sigma$ threshold on a cube of Gaussian
%  noise of size $100\times100\times1024$ will result in $\sim340$
%  single-pixel detections. This is where the minimum channel and pixel
%  requirements are important in rejecting spurious detections.

%  %Finally, as \duchamp is still undergoing development, there are some
%  %elements that are not fully developed. In particular, it is not as
%  %clever as I would like at avoiding interference. The ability to place
%  %requirements on the minimum number of channels and pixels partially
%  %circumvents this problem, but work is being done to make \duchamp
%  %smarter at rejecting signals that are clearly (to a human eye at
%  %least) interference. See the following section for further
%  %improvements that are planned.