% -----------------------------------------------------------------------
% hints.tex: Section giving some tips & hints on how Duchamp is best
%            used.
% -----------------------------------------------------------------------
% Copyright (C) 2006, Matthew Whiting, ATNF
%
% This program is free software; you can redistribute it and/or modify it
% under the terms of the GNU General Public License as published by the
% Free Software Foundation; either version 2 of the License, or (at your
% option) any later version.
%
% Duchamp is distributed in the hope that it will be useful, but WITHOUT
% ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
% FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
% for more details.
%
% You should have received a copy of the GNU General Public License
% along with Duchamp; if not, write to the Free Software Foundation,
% Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA
%
% Correspondence concerning Duchamp may be directed to:
%    Internet email: Matthew.Whiting [at] atnf.csiro.au
%    Postal address: Dr. Matthew Whiting
%                    Australia Telescope National Facility, CSIRO
%                    PO Box 76
%                    Epping NSW 1710
%                    AUSTRALIA
% -----------------------------------------------------------------------
\secA{Notes and hints on the use of \duchamp}
\label{sec-notes}

In using \duchamp, the user has to make a number of decisions about
the way the program runs. This section is designed to give the user
some idea of what to choose.

\secB{Memory usage}

A lot of attention has been paid to the memory usage in \duchamp,
recognising that data cubes are going to be increasing in size with
new-generation correlators and wider fields of view. However, users
with large cubes should be aware of the likely usage for different
modes of operation and plan their \duchamp execution carefully.

At the start of the program, memory is allocated sufficient for:
\begin{itemize}
\item The entire pixel array (as requested, subject to any
  subsection).
\item The spatial extent, which holds the map of detected pixels (for
  output into the detection map).
\item If smoothing or reconstruction has been selected, another array
  of the same size as the pixel array. This will hold the
  smoothed/reconstructed array (the original needs to be kept to do
  the correct parameterisation of detected sources).
\item If baseline-subtraction has been selected, a further array of
  the same size as the pixel array. This holds the baseline values,
  which need to be added back in prior to parameterisation.
\end{itemize}
All of these will be of float type, except for the detection map,
which is short.

There will, of course, be additional allocation during the course of
the program. The detection list will progressively grow, with each
detection having a memory footprint as described in
\S\ref{sec-scan}. But perhaps more important, and with a larger
impact, will be the temporary space allocated for various algorithms.

The largest of these will be for the wavelet reconstruction. This will
require an additional allocation of twice the size of the array being
reconstructed: one copy for the coefficients and one for the wavelets,
with each scale overwriting the previous one. So, for the 1D case,
this means an additional allocation of twice the spectral dimension
(since we only reconstruct one spectrum at a time), but the 3D case
will require an additional allocation of twice the cube size. This
means there needs to be available at least four times the size of the
input cube for 3D reconstruction, plus the additional overheads of
detections and so forth.
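
As a rough illustrative aid, the allocations described above can be
totalled for a given cube size. The helper below is not part of
\duchamp; the function and constant names are hypothetical, and the
accounting simply follows the text:

```python
# Rough peak-memory estimate for a Duchamp-style run, following the
# allocations listed in the text. Illustrative only; the names here
# are hypothetical, not part of Duchamp itself.

FLOAT_BYTES = 4   # pixel, reconstructed and baseline arrays are float
SHORT_BYTES = 2   # the detection map is short

def estimate_bytes(nx, ny, nz, reconstruct=False, baseline=False):
    npix = nx * ny * nz
    total = npix * FLOAT_BYTES            # the input pixel array
    total += nx * ny * SHORT_BYTES        # spatial map of detected pixels
    if reconstruct:
        total += npix * FLOAT_BYTES       # the reconstructed/smoothed array
        total += 2 * npix * FLOAT_BYTES   # temporary coefficient + wavelet space
    if baseline:
        total += npix * FLOAT_BYTES       # the baseline array
    return total

# A 512 x 512 x 4096 cube with 3D reconstruction needs at least four
# float copies of the cube (roughly 16 GiB), before detection overheads.
print(estimate_bytes(512, 512, 4096, reconstruct=True) / 2**30)
```

The factor of four for 3D reconstruction quoted above falls straight
out of this accounting.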

The smoothing has less of an impact, since it only operates on the
lower dimensions, but it will make an additional allocation of twice
the relevant size (the spectral dimension for spectral smoothing, or
the spatial image size for spatial Gaussian smoothing).

The other large allocation of temporary space will be for calculating
robust statistics. The median-based calculations require at least
partial sorting of the data, and so cannot be done on the original
image cube. This is done for the entire cube, so the temporary
increase in memory usage can be large.

\secB{Timing considerations}

Another interesting question from a user's perspective is how long one
can expect \duchamp to take. This is a difficult question to answer in
general, as different users will have different-sized data sets, as
well as machines with different capabilities (in terms of CPU speed
and I/O \& memory bandwidths). Additionally, the time required will
depend slightly on the number of sources found and their size (very
large sources can take a while to parameterise fully).

Having said that, \citet{whiting12} presents a brief analysis of
different modes of execution applied to a single HIPASS cube (\#201)
using a MacBook Pro (2.66~GHz, 8~GB RAM). Two thresholds were used:
either $10^8$~Jy~beam$^{-1}$ (no sources will be found, so that the
time taken is dominated by preprocessing), or 35~mJy~beam$^{-1}$
($\sim2.58\sigma$, chosen so that the time taken will include that
required to process sources). The basic searches, with no
pre-processing done, took less than a second for the high-threshold
search, but between 1 and 3~min for the low-threshold case -- the
number of sources detected ranged from 3000 (rejecting sources with
fewer than 3 channels and 2 spatial pixels) to 42000 (keeping all
sources).

When smoothing, the raw time for the spectral smoothing was only a few
seconds, with a small dependence on the width of the smoothing
filter. And because the number of spurious sources is markedly
decreased (the final catalogues ranged from 17 to 174 sources,
depending on the width of the smoothing), searching with the low
threshold added little more than a second to the time. The spatial
smoothing was more computationally intensive, taking about 4~min to
complete the high-threshold search.

The wavelet reconstruction time depended primarily on the
dimensionality of the reconstruction, with the 1D case taking 20~s,
the 2D case 30--40~s and the 3D case 2--4~min. The spread in times for
a given dimensionality was caused by different reconstruction
thresholds, with lower thresholds taking longer (since more pixels lie
above the threshold and so need to be added to the final spectrum). In
all cases the reconstruction time dominated the total time for the
low-threshold search, since the number of sources found was again
smaller than for the basic searches.


\secB{Why do preprocessing?}

The preprocessing options provided by \duchamp, particularly the
ability to smooth or reconstruct via multi-resolution wavelet
decomposition, provide an opportunity to beat the effects of the
random noise that will be present in the data. This noise will
ultimately limit one's ability to detect objects and form a complete
and reliable catalogue. Two effects are important here. First, the
noise reduces the completeness of the final catalogue by suppressing
the flux of real sources such that they fall below the detection
threshold. Second, the noise produces false detections through noise
peaks that fall above the threshold, thereby reducing the reliability
of the catalogue.

\citet{whiting12} examined the effect on completeness and reliability
of the reconstruction and smoothing (1D cases only) when applied to a
simple simulated dataset. Both had the effect of reducing the number
of spurious sources, which means the searches can be done to fainter
thresholds. This led to completeness levels about one flux unit (equal
to one standard deviation of the noise) fainter than searches without
pre-processing, with $>95\%$ reliability. The smoothing did slightly
better, with the completeness level nearly half a flux unit fainter
than the reconstruction, although this was helped by the sources in
the simulation all having the same spectral size.

\secB{Reconstruction considerations}

The \atrous wavelet reconstruction approach is designed to remove a
large amount of the random noise while preserving as much structure as
possible on the full range of spatial and/or spectral scales present
in the data. While it is relatively expensive in terms of memory and
CPU usage (see the previous sections), its effect on the reliability
of the final catalogue in particular makes it worth investigating.
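
To make the procedure concrete, here is a minimal 1D sketch in the
spirit of the \atrous algorithm: each scale smooths the data with an
increasingly "holey" kernel, and only significant wavelet coefficients
are kept. This is an illustration only, not \duchamp's code; the
B3-spline kernel and the simple fixed threshold are assumptions made
here, and edges are handled by reflection as described below:

```python
# Minimal 1D "a trous" wavelet denoising sketch. Illustrative only,
# not Duchamp's implementation; kernel and threshold are assumptions.

KERNEL = [1/16, 4/16, 6/16, 4/16, 1/16]   # B3-spline smoothing kernel

def reflect(i, n):
    # Map an out-of-range index back into [0, n) by reflection at the edges.
    while i < 0 or i >= n:
        if i < 0:
            i = -i
        if i >= n:
            i = 2 * (n - 1) - i
    return i

def atrous_reconstruct(data, nscales=3, threshold=0.0):
    n = len(data)
    smooth = list(data)
    output = [0.0] * n
    for scale in range(nscales):
        step = 2 ** scale                  # kernel "holes" double each scale
        new_smooth = [0.0] * n
        for x in range(n):
            s = 0.0
            for k, w in enumerate(KERNEL):
                s += w * smooth[reflect(x + (k - 2) * step, n)]
            new_smooth[x] = s
        for x in range(n):
            # wavelet coefficient = detail removed by this smoothing pass
            w = smooth[x] - new_smooth[x]
            if abs(w) > threshold:         # keep only significant coefficients
                output[x] += w
        smooth = new_smooth
    for x in range(n):
        output[x] += smooth[x]             # add back the final smooth array
    return output
```

With the threshold set to zero, the sum of the retained coefficients
and the final smoothed array recovers the input exactly, which is a
useful sanity check on any implementation.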

There are, however, a number of subtleties that need to be considered
by potential users. \citet{whiting12} shows a set of examples of the
reconstruction applied to simulated and real data. The real data, in
this case a HIPASS cube, show differences in the quality of the
reconstructed spectrum depending on the dimensionality of the
reconstruction. The two-dimensional reconstruction (where the cube is
reconstructed one channel map at a time) shows much larger
channel-to-channel noise, with a number of narrow peaks surviving the
reconstruction process. The problem here is that there are spatial
correlations between pixels due to the beam, which allow beam-sized
noise fluctuations to rise above the threshold more frequently than in
the one-dimensional case. The other effect is that each channel is
reconstructed independently of its neighbouring channels, unlike in
the 1D reconstruction of a spectrum. This is also why the 3D
reconstruction (which also suffers from the beam effects) has improved
noise in the output spectrum, since the information in neighbouring
channels is taken into account.

Caution is also advised when looking at subsections of a cube. Due to
the multi-scale nature of the algorithm, the wavelet coefficients at a
given pixel are influenced by pixels at very large separations,
particularly since edges are dealt with by assuming reflection (so the
whole array is visible to all pixels). Also, if one decreases the
dimensions of the array being reconstructed, there may be fewer scales
used in the reconstruction. These points mean that the reconstruction
of a subsection of a cube will differ from the same subsection of the
reconstructed full cube. The difference may be small (depending on the
relative size difference and the amount of structure at large scales),
but there will be differences at some level.

Note also that BLANK pixels are ignored by the reconstruction: they
remain as BLANK in the output, and do not contribute to the discrete
convolution when they otherwise would. Flagging channels with the
\texttt{flaggedChannels} parameter, however, has no effect on the
reconstruction -- these flags are applied after the preprocessing,
either in the searching or the rejection stage.

\secB{Smoothing considerations}

The smoothing approach differs from the wavelet reconstruction in that
it has a single scale associated with it. The user has two choices to
make: which dimension to smooth in (spatially or spectrally), and what
size kernel to smooth with. \citet{whiting12} shows examples of how
different smoothing widths (in one dimension in this case) can
highlight sources of different sizes. If one has some \textit{a
  priori} idea of the typical size scale of the objects one wishes to
detect, then choosing a single smoothing scale can be quite
beneficial.
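
As an illustration of that single-scale choice, a spectral smoothing
with a Hanning-style kernel might be sketched as follows. This is not
\duchamp's implementation; the kernel normalisation and the edge
clamping are assumptions made here for the example:

```python
# Illustrative 1D (spectral) smoothing with a Hanning-style kernel.
# Not Duchamp's code; normalisation and edge handling are assumptions.

import math

def hanning_kernel(width):
    # width = full kernel width in channels (odd); normalised to sum to 1
    half = width // 2
    coeffs = [0.5 * (1 + math.cos(math.pi * k / (half + 1)))
              for k in range(-half, half + 1)]
    norm = sum(coeffs)
    return [c / norm for c in coeffs]

def smooth_spectrum(spec, width=5):
    kern = hanning_kernel(width)
    half = len(kern) // 2
    n = len(spec)
    out = []
    for x in range(n):
        s = 0.0
        for k, w in enumerate(kern):
            j = min(max(x + k - half, 0), n - 1)   # clamp at the edges
            s += w * spec[j]
        out.append(s)
    return out
```

Broader kernels suppress narrow noise spikes more strongly, at the
cost of diluting sources narrower than the kernel -- which is why the
kernel width is worth matching to the expected source size.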

Note that beam effects can be important here too, when smoothing
spatial data on scales close to that of the beam. This can enhance
beam-sized noise fluctuations and potentially introduce spurious
sources. As always, examining the smoothed array (after saving it via
\texttt{flagOutputSmooth}) is a good idea.


\secB{Threshold method}

When it comes to searching, the FDR method produces more reliable
results than simple sigma-clipping, particularly in the absence of
reconstruction. However, it does not work in exactly the way one would
expect for a given value of \texttt{alpha}. For instance, setting a
fairly liberal value of \texttt{alpha} (say, 0.1) will often lead to a
much smaller fraction of false detections (\ie much less than
10\%). This is the effect of the merging algorithms, which combine the
sources after the detection stage and reject detections not meeting
the minimum pixel or channel requirements. It is thus better to aim
for larger \texttt{alpha} values than those derived from a straight
conversion of the desired false detection rate.
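
The core of the FDR technique is a Benjamini-Hochberg-style step: rank
the per-pixel p-values and find the largest one satisfying the
criterion. The sketch below is purely illustrative and ignores the
correction for correlated pixels (\eg due to the beam) that a full
treatment requires:

```python
# Benjamini-Hochberg style threshold selection, the step underlying
# the FDR method. Illustrative only: no correction for pixel
# correlation is applied here.

def fdr_threshold(pvals, alpha):
    n = len(pvals)
    cutoff = 0.0
    for j, p in enumerate(sorted(pvals), start=1):
        if p <= alpha * j / n:
            cutoff = p        # largest p-value meeting the criterion
    return cutoff

# Pixels with p-values at or below the returned cutoff are "detected".
```

Because the criterion adapts to the ranked p-values, the effective
flux threshold depends on the data themselves, which is part of why a
given \texttt{alpha} does not translate directly into a false
detection fraction in the final catalogue.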

If the FDR method is not used, caution is required when choosing the
S/N cutoff. Typical cubes have very large numbers of pixels, so even
an apparently high cutoff will still result in a non-negligible number
of detections simply due to random fluctuations of the noise
background. For instance, a $4\sigma$ threshold applied to a cube of
Gaussian noise of size $100\times100\times1024$ will result in
$\sim340$ single-pixel detections. This is where the minimum channel
and pixel requirements are important in rejecting spurious detections.
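
The figure quoted above can be checked with a couple of lines
(illustrative only; the function name is made up for this example):

```python
# Expected number of single-pixel false detections above an S/N
# cutoff, for a cube of pure Gaussian noise.

import math

def expected_false_positives(npix, snr):
    # one-tailed Gaussian tail probability P(z > snr), times pixel count
    p = 0.5 * math.erfc(snr / math.sqrt(2.0))
    return npix * p

# A 100 x 100 x 1024 cube at a 4-sigma cutoff:
print(expected_false_positives(100 * 100 * 1024, 4.0))  # ~324
```

This is in line with the $\sim340$ quoted above, and it grows quickly
as the cutoff is lowered: each half-sigma roughly quadruples the tail
probability in this regime.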


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "Guide"
%%% End: