% -----------------------------------------------------------------------
% hints.tex: Section giving some tips & hints on how Duchamp is best
%            used.
% -----------------------------------------------------------------------
% Copyright (C) 2006, Matthew Whiting, ATNF
%
% This program is free software; you can redistribute it and/or modify it
% under the terms of the GNU General Public License as published by the
% Free Software Foundation; either version 2 of the License, or (at your
% option) any later version.
%
% Duchamp is distributed in the hope that it will be useful, but WITHOUT
% ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
% FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
% for more details.
%
% You should have received a copy of the GNU General Public License
% along with Duchamp; if not, write to the Free Software Foundation,
% Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA
%
% Correspondence concerning Duchamp may be directed to:
%    Internet email: Matthew.Whiting [at] atnf.csiro.au
%    Postal address: Dr. Matthew Whiting
%                    Australia Telescope National Facility, CSIRO
%                    PO Box 76
%                    Epping NSW 1710
%                    AUSTRALIA
% -----------------------------------------------------------------------
\secA{Notes and hints on the use of \duchamp}
\label{sec-notes}

In using \duchamp, the user has to make a number of decisions about
the way the program runs. This section is designed to give some
guidance on those choices.

\secB{Memory usage}

A lot of attention has been paid to memory usage in \duchamp,
recognising that data cubes will only increase in size with
new-generation correlators and wider fields of view. However, users
with large cubes should be aware of the likely memory requirements of
the different modes of operation and plan their \duchamp execution
accordingly. At the start of the program, sufficient memory is
allocated for:
\begin{itemize}
\item The entire pixel array (as requested, subject to any
  subsection).
\item A two-dimensional array matching the spatial extent of the
  cube, which holds the map of detected pixels (for output as the
  detection map).
\item If smoothing or reconstruction has been selected, another array
  of the same size as the pixel array. This holds the
  smoothed/reconstructed array (the original needs to be kept in
  order to correctly parameterise the detected sources).
\item If baseline-subtraction has been selected, a further array of
  the same size as the pixel array. This holds the baseline values,
  which need to be added back in prior to parameterisation.
\end{itemize}
All of these are of type float, except for the detection map, which
is short.

There will, of course, be additional allocation during the course of
the program. The detection list will progressively grow, with each
detection having a memory footprint as described in
\S\ref{sec-scan}. Of larger impact, however, will be the temporary
space allocated for various algorithms. The largest requirement comes
from the wavelet reconstruction, which needs an additional allocation
of twice the size of the array being reconstructed: one array for the
coefficients and one for the wavelets, with each scale overwriting
the previous one. For the 1D case this means an additional allocation
of twice the spectral dimension (since only one spectrum is
reconstructed at a time), but the 3D case requires an additional
allocation of twice the cube size. In other words, at least four
times the size of the input cube must be available for a 3D
reconstruction, plus the additional overheads of detections and so
forth. The smoothing has less of an impact, since it operates on
lower-dimensional slices, but it will make an additional allocation
of twice the relevant size (the spectral dimension for spectral
smoothing, or the spatial image size for the spatial Gaussian
smoothing).

The other large allocation of temporary space is for calculating
robust statistics. The median-based calculations require at least
partial sorting of the data, and so cannot be done in place on the
original image cube -- a copy must be made. Since this is done for
the entire cube, the temporary increase in memory usage can be
large.
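As a rough guide, the following sketch estimates the peak memory
footprint implied by the allocations described above. It is purely
illustrative -- the function and its arguments are not part of
\duchamp, and per-detection overheads, kernel sizes and the
statistics copy are ignored:
\begin{verbatim}
# Illustrative estimate of Duchamp's peak memory footprint in bytes,
# based only on the fixed allocations described in the text.

def peak_memory_bytes(nx, ny, nz, recon_dim=0, smooth=False,
                      baseline=False):
    npix = nx * ny * nz
    total = 4 * npix                # input pixel array (float)
    total += 2 * nx * ny            # detection map (short)
    if recon_dim > 0 or smooth:
        total += 4 * npix           # smoothed/reconstructed copy
    if baseline:
        total += 4 * npix           # baseline values
    # temporary work arrays for the wavelet reconstruction:
    if recon_dim == 1:
        total += 2 * 4 * nz         # two arrays, one spectrum long
    elif recon_dim == 2:
        total += 2 * 4 * nx * ny    # two arrays, one channel map
    elif recon_dim == 3:
        total += 2 * 4 * npix       # two arrays, full cube size
    return total

# A HIPASS-sized cube with a full 3D reconstruction:
print(peak_memory_bytes(170, 160, 1024, recon_dim=3) / 1e9, "GB")
\end{verbatim}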
\secB{Timing considerations}

Another interesting question from a user's perspective is how long a
given \duchamp run can be expected to take. This is a difficult
question to answer in general, as different users will have different
sized data sets, as well as machines with different capabilities (in
terms of CPU speed and I/O \& memory bandwidth). Additionally, the
time required will depend slightly on the number of sources found and
their size (very large sources can take a while to fully
parameterise).

Having said that, \citet{whiting12} presented a brief analysis of
different modes of execution applied to a single HIPASS cube (\#201)
using a MacBook Pro (2.66~GHz, 8~GB RAM). Two sets of thresholds were
used: either $10^8$~Jy~beam$^{-1}$ (no sources will be found, so that
the time taken is dominated by the preprocessing), or
35~mJy~beam$^{-1}$ (or $\sim2.58\sigma$, chosen so that the time
taken includes that required to process sources).

The basic searches, with no preprocessing done, took less than a
second for the high-threshold search, but between 1 and 3~min for the
low-threshold case -- the numbers of sources detected ranged from
3000 (rejecting sources with less than 3 channels and 2 spatial
pixels) to 42000 (keeping all sources).

When smoothing, the raw time for the spectral smoothing was only a
few seconds, with a small dependence on the width of the smoothing
filter. And because the number of spurious sources is markedly
decreased (the final catalogues ranged from 17 to 174 sources,
depending on the width of the smoothing), searching with the low
threshold did not add much more than a second to the time. The
spatial smoothing was more computationally intensive, taking about 4
minutes to complete the high-threshold search.

The wavelet reconstruction time depended primarily on the
dimensionality of the reconstruction, with the 1D case taking 20~s,
the 2D case 30--40~s and the 3D case 2--4~min. The spread in times
for a given dimensionality was caused by different reconstruction
thresholds, with lower thresholds taking longer (since more pixels
lie above the threshold and so need to be added to the final
spectrum). In all cases the reconstruction time dominated the total
time for the low-threshold search, since the number of sources found
was again smaller than in the basic searches.

\secB{Why do preprocessing?}

The preprocessing options provided by \duchamp, particularly the
ability to smooth or reconstruct via the multi-resolution wavelet
decomposition, provide an opportunity to beat the effects of the
random noise that will be present in the data. This noise will
ultimately limit one's ability to detect objects and form a complete
and reliable catalogue. Two effects are important here. First, the
noise reduces the completeness of the final catalogue by suppressing
the flux of real sources such that they fall below the detection
threshold. Second, the noise produces false detections through noise
peaks that rise above the threshold, thereby reducing the reliability
of the catalogue.

\citet{whiting12} examined the effect on completeness and reliability
of the reconstruction and smoothing (1D cases only) when applied to a
simple simulated dataset. Both had the effect of reducing the number
of spurious sources, which means the searches can be done to fainter
thresholds. This led to completeness levels about one flux unit
(equal to one standard deviation of the noise) fainter than searches
without preprocessing, with $>95\%$ reliability. The smoothing did
slightly better, with a completeness level nearly half a flux unit
fainter than the reconstruction, although this was helped by the
sources in the simulation all having the same spectral size.
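For reference, the two quantities can be computed from a cross-match
of the output catalogue against a set of known (for example,
simulated) input sources. A minimal sketch, with illustrative
function names that are not part of \duchamp:
\begin{verbatim}
# Completeness and reliability from a catalogue cross-match.
# n_real    : number of real (input) sources
# n_detected: number of catalogued detections
# n_matched : detections that correspond to a real source

def completeness(n_matched, n_real):
    """Fraction of real sources that were detected."""
    return n_matched / n_real

def reliability(n_matched, n_detected):
    """Fraction of detections that are real sources."""
    return n_matched / n_detected

# e.g. 95 of 100 simulated sources recovered in a 98-entry catalogue:
print(completeness(95, 100), reliability(95, 98))
\end{verbatim}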
\secB{Reconstruction considerations}

The \atrous wavelet reconstruction approach is designed to remove a
large amount of the random noise while preserving as much structure
as possible over the full range of spatial and/or spectral scales
present in the data. While it is more expensive than smoothing in
terms of memory and CPU usage (see the previous sections), its effect
on the reliability of the final catalogue, in particular, makes it
worth investigating. There are, however, a number of subtleties that
need to be considered by potential users.

\citet{whiting12} shows a set of examples of the reconstruction
applied to simulated and real data. The real data, in this case a
HIPASS cube, show differences in the quality of the reconstructed
spectrum depending on the dimensionality of the reconstruction. The
two-dimensional reconstruction (where the cube is reconstructed one
channel map at a time) shows much larger channel-to-channel noise,
with a number of narrow peaks surviving the reconstruction
process. The problem here is that there are spatial correlations
between pixels due to the beam, which allow beam-sized noise
fluctuations to rise above the threshold more frequently than in the
one-dimensional case. The other effect is that, unlike in the 1D
reconstruction, each channel is reconstructed independently of its
neighbouring channels. This is also why the 3D reconstruction (which
suffers from the same beam effects) produces lower noise in the
output spectrum: information from neighbouring channels is taken into
account.

Caution is also advised when looking at subsections of a cube. Due to
the multi-scale nature of the algorithm, the wavelet coefficients at
a given pixel are influenced by pixels at very large separations,
particularly since edges are dealt with by assuming reflection (so
the whole array is visible to all pixels). Also, if one decreases the
dimensions of the array being reconstructed, fewer scales may be used
in the reconstruction. These points mean that the reconstruction of a
subsection of a cube will differ from the same subsection of the
reconstructed cube. The difference may be small (depending on the
relative size difference and the amount of structure at large
scales), but there will be differences at some level.

Note also that BLANK pixels are not affected by the reconstruction:
they remain BLANK in the output, and do not contribute to the
discrete convolution where they otherwise would. The Milky Way
channel range, by contrast, has no effect on the reconstruction --
this flagging is applied after the preprocessing, at the searching or
rejection stage.
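To make the multi-scale nature of the algorithm concrete, here is a
minimal sketch of a one-dimensional \atrous decomposition. It assumes
the commonly-used $B_3$-spline filter $(1,4,6,4,1)/16$ and reflected
boundaries, and is illustrative only -- it is not \duchamp's actual
implementation:
\begin{verbatim}
import numpy as np

# One-dimensional 'a trous' decomposition: at each scale the signal
# is smoothed with an increasingly 'holey' filter, and the wavelet
# coefficients are the difference between successive smooth versions.

FILTER = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0

def atrous_decompose(signal, n_scales):
    assert 2 ** (n_scales + 1) < len(signal), "too many scales"
    smooth = np.asarray(signal, dtype=float)
    n = len(smooth)
    wavelets = []
    for scale in range(n_scales):
        step = 2 ** scale                 # gap between filter taps
        new_smooth = np.zeros(n)
        for i in range(n):
            for k, c in enumerate(FILTER):
                j = i + (k - 2) * step
                if j < 0:                 # reflect at the edges, so
                    j = -j                # every pixel sees the
                elif j >= n:              # whole array
                    j = 2 * (n - 1) - j
                new_smooth[i] += c * smooth[j]
        wavelets.append(smooth - new_smooth)
        smooth = new_smooth
    return wavelets, smooth
\end{verbatim}
Summing the wavelet arrays and the final smoothed array recovers the
input exactly; thresholding each wavelet array before summing gives
the reconstructed (denoised) spectrum.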
\secB{Smoothing considerations}

The smoothing approach differs from the wavelet reconstruction in
that it has a single scale associated with it. The user has two
choices to make: in which domain to smooth (spatially or spectrally),
and what size kernel to smooth with. \citet{whiting12} shows examples
of how different smoothing widths (in one dimension in this case) can
highlight sources of different sizes. If one has some \textit{a
priori} idea of the typical size scale of the objects one wishes to
detect, then choosing a single smoothing scale can be quite
beneficial. Note that beam effects can be important here too: when
smoothing spatial data on scales close to that of the beam,
beam-sized noise fluctuations can be enhanced, potentially
introducing spurious sources. As always, examining the smoothed array
(after saving it via \texttt{flagOutputSmooth}) is a good idea.
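As an illustration of the spectral case, the following sketch applies
Hanning smoothing to each spectrum of a cube. The kernel here is
built with \texttt{numpy.hanning} and normalised to unit sum;
\duchamp's exact coefficient definition may differ slightly:
\begin{verbatim}
import numpy as np

def hanning_smooth(spectrum, width):
    """Smooth a single spectrum with a Hanning filter of odd width."""
    assert width % 2 == 1, "width should be odd"
    kernel = np.hanning(width + 2)[1:-1]   # drop the zero end-points
    kernel /= kernel.sum()                 # preserve total flux
    return np.convolve(spectrum, kernel, mode="same")

def smooth_cube_spectrally(cube, width):
    """cube has shape (nz, ny, nx); smooth along the spectral axis."""
    out = np.empty(cube.shape)
    for y in range(cube.shape[1]):
        for x in range(cube.shape[2]):
            out[:, y, x] = hanning_smooth(cube[:, y, x], width)
    return out
\end{verbatim}
A width of 3 (the smallest possible) is very efficient at removing
spurious single-channel peaks due to noise; larger widths highlight
features of correspondingly larger spectral scale.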
\secB{Threshold method}

When it comes to searching, the FDR method produces more reliable
results than simple sigma-clipping, particularly in the absence of
reconstruction. However, it does not work in exactly the way one
would expect for a given value of \texttt{alpha}. For instance,
setting a fairly liberal value of \texttt{alpha} (say, 0.1) will
often lead to a much smaller fraction of false detections (\ie much
less than 10\%). This is the effect of the merging algorithms, which
combine the sources after the detection stage and reject detections
not meeting the minimum pixel or channel requirements. It is thus
better to aim for larger \texttt{alpha} values than those derived
from a straight conversion of the desired false detection rate.

If the FDR method is not used, caution is required when choosing the
S/N cutoff. Typical cubes have very large numbers of pixels, so even
an apparently large cutoff will still result in a not-insignificant
number of detections simply due to random fluctuations of the noise
background. For instance, a $4\sigma$ threshold on a cube of Gaussian
noise of size $100\times100\times1024$ will result in $\sim340$
single-pixel detections. This is where the minimum channel and pixel
requirements are important in rejecting spurious detections.
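To see where the threshold comes from in the FDR case, here is a
minimal sketch of the underlying Benjamini--Hochberg selection
applied to pixel fluxes, assuming independent Gaussian noise of known
standard deviation. \duchamp's implementation additionally corrects
for correlated pixels (due to the beam); that refinement is omitted
here:
\begin{verbatim}
import math
import numpy as np

def fdr_threshold(fluxes, sigma, alpha):
    """Flux threshold from the Benjamini-Hochberg procedure."""
    fluxes = np.asarray(fluxes, dtype=float)
    # one-sided p-value for each pixel under Gaussian noise
    pvals = np.array([0.5 * math.erfc(f / (sigma * math.sqrt(2.0)))
                      for f in fluxes])
    order = np.argsort(pvals)              # most significant first
    n = len(fluxes)
    # largest k such that p_(k) <= alpha * k / n
    ok = np.nonzero(np.sort(pvals) <= alpha * np.arange(1, n + 1) / n)[0]
    if len(ok) == 0:
        return np.inf                      # no pixel is accepted
    return float(fluxes[order[ok[-1]]])    # faintest accepted pixel
\end{verbatim}
Every pixel brighter than the returned threshold is accepted, which
is why a liberal \texttt{alpha} can still yield few false detections
once the minimum-size criteria are applied.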
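The $\sim340$ figure quoted above can be checked against the Gaussian
tail probability; the following illustrative calculation gives a
number of the same order:
\begin{verbatim}
import math

# Expected number of single-pixel false detections above an s-sigma
# threshold, for a cube of independent Gaussian noise pixels.

def expected_false_pixels(nx, ny, nz, s):
    tail = 0.5 * math.erfc(s / math.sqrt(2.0))   # P(X > s*sigma)
    return nx * ny * nz * tail

print(expected_false_pixels(100, 100, 1024, 4.0))   # about 324
\end{verbatim}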