1 | \secA{What \duchamp is doing} |
---|
2 | \label{sec-flow} |
---|
3 | |
---|
4 | The execution flow of \duchamp is detailed here, indicating the main |
---|
5 | algorithmic steps that are used. The program is written in C/C++ and |
---|
6 | makes use of the \textsc{cfitsio}, \textsc{wcslib} and \textsc{pgplot} |
---|
7 | libraries. |
---|
8 | |
---|
9 | \secB{Image input} |
---|
10 | \label{sec-input} |
---|
11 | |
---|
12 | The cube is read in using basic \textsc{cfitsio} commands, and stored |
---|
13 | as an array in a special C++ class. This class keeps track of the list |
---|
14 | of detected objects, as well as any reconstructed arrays that are made |
---|
15 | (see \S\ref{sec-recon}). The World Coordinate System |
---|
16 | (WCS)\footnote{This is the information necessary for translating the |
---|
17 | pixel locations to quantities such as position on the sky, frequency, |
---|
18 | velocity, and so on.} information for the cube is also obtained from |
---|
19 | the FITS header by \textsc{wcslib} functions \citep{greisen02, |
---|
20 | calabretta02}, and this information, in the form of a \texttt{wcsprm} |
---|
21 | structure, is also stored in the same class. |
---|
22 | |
---|
23 | A sub-section of a cube can be requested by defining the subsection |
---|
24 | with the \texttt{subsection} parameter and setting |
---|
25 | \texttt{flagSubsection=true} -- this can be a good idea if the cube |
---|
26 | has very noisy edges, which may produce many spurious detections. |
---|
27 | |
---|
28 | There are two ways of specifying the \texttt{subsection} string. The |
---|
29 | first is the generalised form |
---|
30 | \texttt{[x1:x2:dx,y1:y2:dy,z1:z2:dz,...]}, as used by the |
---|
31 | \textsc{cfitsio} library. This has one set of colon-separated numbers |
---|
32 | for each axis in the FITS file. In this manner, the x-coordinates run |
---|
33 | from \texttt{x1} to \texttt{x2} (inclusive), with steps of |
---|
34 | \texttt{dx}. The step value can be omitted, so a subsection of the |
---|
35 | form \texttt{[2:50,2:50,10:1000]} is still valid. In fact, \duchamp |
---|
36 | does not make use of any step value present in the subsection string, |
---|
37 | and any that are present are removed before the file is opened. |
---|
38 | |
---|
39 | If the entire range of a coordinate is required, one can replace the |
---|
40 | range with a single asterisk, \eg \texttt{[2:50,2:50,*]}. Thus, the |
---|
41 | subsection string \texttt{[*,*,*]} is simply the entire cube. A |
---|
42 | complete description of this section syntax can be found at the |
---|
43 | \textsc{fitsio} web site% |
---|
44 | \footnote{% |
---|
45 | \href% |
---|
46 | {http://heasarc.gsfc.nasa.gov/docs/software/fitsio/c/c\_user/node91.html}% |
---|
47 | {http://heasarc.gsfc.nasa.gov/docs/software/fitsio/c/c\_user/node91.html}}. |
---|
48 | |
---|
49 | |
---|
50 | Making full use of the subsection requires knowledge of the size of |
---|
51 | each of the dimensions. If one wants to, for instance, trim a certain |
---|
52 | number of pixels off the edges of the cube, without examining the cube |
---|
53 | to obtain the actual size, one can use the second form of the |
---|
54 | subsection string. This just gives a number for each axis, \eg |
---|
55 | \texttt{[5,5,5]} (which would trim 5 pixels from the start \emph{and} |
---|
56 | end of each axis). |
---|
57 | |
---|
58 | All types of subsections can be combined \eg \texttt{[5,2:98,*]}. |
---|
59 | |
---|
60 | |
---|
61 | %A sub-section of an image can be requested via the \texttt{subsection} |
---|
62 | %parameter -- this can be a good idea if the cube has very noisy edges, |
---|
63 | %which may produce many spurious detections. The generalised form of |
---|
64 | %the subsection that is used by \textsc{cfitsio} is |
---|
65 | %\texttt{[x1:x2:dx,y1:y2:dy,z1:z2:dz,...]}, such that the x-coordinates run |
---|
66 | %from \texttt{x1} to \texttt{x2} (inclusive), with steps of |
---|
67 | %\texttt{dx}. The step value can be omitted (so a subsection of the |
---|
68 | %form \texttt{[2:50,2:50,10:1000]} is still valid). \duchamp does not |
---|
69 | %make use of any step value present in the subsection string, and any |
---|
70 | %that are present are removed before the file is opened. |
---|
71 | % |
---|
72 | %If one wants the full range of a coordinate then replace the range |
---|
73 | %with an asterisk, \eg \texttt{[2:50,2:50,*]}. If one wants to use a |
---|
74 | %subsection, one must set \texttt{flagSubsection = 1}. A complete |
---|
75 | %description of the section syntax can be found at the \textsc{fitsio} |
---|
76 | %web site% |
---|
77 | %\footnote{% |
---|
78 | %\href% |
---|
79 | %{http://heasarc.gsfc.nasa.gov/docs/software/fitsio/c/c\_user/node91.html}% |
---|
80 | %{http://heasarc.gsfc.nasa.gov/docs/software/fitsio/c/c\_user/node91.html}}. |
---|
81 | % |
---|
82 | |
---|
83 | \secB{Image modification} |
---|
84 | \label{sec-modify} |
---|
85 | |
---|
86 | Several modifications to the cube can be made that improve the |
---|
87 | execution and efficiency of \duchamp (their use is optional, governed |
---|
88 | by the relevant flags in the parameter file). |
---|
89 | |
---|
90 | \secC{BLANK pixel removal} |
---|
91 | |
---|
92 | If the imaged area of a cube is non-rectangular (see the example in |
---|
93 | Fig.~\ref{fig-moment}, a cube from the HIPASS survey), BLANK pixels are |
---|
94 | used to pad it out to a rectangular shape. The value of these pixels |
---|
95 | is given by the FITS header keywords BLANK, BSCALE and BZERO. While |
---|
96 | these pixels make the image a nice shape, they will unnecessarily |
---|
97 | interfere with the processing (as well as taking up needless |
---|
98 | memory). The first step, then, is to trim them from the edge. This is |
---|
99 | done when the parameter \texttt{flagBlankPix=true}. If the above |
---|
100 | keywords are not present, the user can specify the BLANK value by the |
---|
101 | parameter \texttt{blankPixValue}. |
---|
102 | |
---|
103 | Removing BLANK pixels is particularly important for the reconstruction |
---|
104 | step, as lots of BLANK pixels on the edges will smooth out features in |
---|
105 | the wavelet calculation stage. The trimming will also reduce the size |
---|
106 | of the cube's array, speeding up the execution. The amount of trimming |
---|
107 | is recorded, and these pixels are added back in once the |
---|
108 | source-detection is completed (so that quoted pixel positions are |
---|
109 | applicable to the original cube). |
---|
110 | |
---|
111 | Rows and columns are trimmed one at a time until the first non-BLANK |
---|
112 | pixel is reached, so that the image remains rectangular. In practice, |
---|
113 | this means that there will be some BLANK pixels left in the trimmed |
---|
114 | image (if the non-BLANK region is non-rectangular). However, these are |
---|
115 | ignored in all further calculations done on the cube. |
---|
116 | |
---|
117 | \secC{Baseline removal} |
---|
118 | |
---|
119 | Second, the user may request the removal of baselines from the |
---|
120 | spectra, via the parameter \texttt{flagBaseline}. This may be |
---|
121 | necessary if there is a strong baseline ripple present, which can |
---|
122 | result in spurious detections at the high points of the ripple. The |
---|
123 | baseline is calculated from a wavelet reconstruction procedure (see |
---|
124 | \S\ref{sec-recon}) that keeps only the two largest scales. This is |
---|
125 | done separately for each spatial pixel (\ie for each spectrum in the |
---|
126 | cube), and the baselines are stored and added back in before any |
---|
127 | output is done. In this way the quoted fluxes and displayed spectra |
---|
128 | are as one would see from the input cube itself -- even though the |
---|
129 | detection (and reconstruction if applicable) is done on the |
---|
130 | baseline-removed cube. |
---|
131 | |
---|
132 | The presence of very strong signals (for instance, masers at several |
---|
133 | hundred Jy) could affect the determination of the baseline, and would |
---|
134 | lead to a large dip centred on the signal in the baseline-subtracted |
---|
135 | spectrum. To prevent this, the signal is trimmed prior to the |
---|
136 | reconstruction process at some standard threshold (at $8\sigma$ above |
---|
137 | the mean). The baseline determined should thus be representative of |
---|
138 | the true, signal-free baseline. Note that this trimming is only a |
---|
139 | temporary measure which does not affect the source-detection. |
---|
140 | |
---|
141 | \secC{Ignoring bright Milky Way emission} |
---|
142 | |
---|
143 | Finally, a single set of contiguous channels can be ignored -- these |
---|
144 | may exhibit very strong emission, such as that from the Milky Way as |
---|
145 | seen in extragalactic \hi cubes (hence the references to ``Milky |
---|
146 | Way'' in relation to this task -- apologies to Galactic |
---|
147 | astronomers!). Such dominant channels will produce many detections |
---|
148 | that are unnecessary, uninteresting (if one is interested in |
---|
149 | extragalactic \hi) and large (in size and hence in memory usage), and |
---|
150 | so will slow the program down and detract from the interesting |
---|
151 | detections. |
---|
152 | |
---|
153 | The use of this feature is controlled by the \texttt{flagMW} |
---|
154 | parameter, and the exact channels concerned are able to be set by the |
---|
155 | user (using \texttt{maxMW} and \texttt{minMW} -- these give an |
---|
156 | inclusive range of channels). When employed, these channels are |
---|
157 | ignored for the searching, and the scaling of the spectral output (see |
---|
158 | Fig.~\ref{fig-spect}) will not take them into account. They will be |
---|
159 | present in the reconstructed array, however, and so will be included |
---|
160 | in the saved FITS file (see \S\ref{sec-reconIO}). When the final |
---|
161 | spectra are plotted, the range of channels covered by these parameters |
---|
162 | is indicated by a green hashed box. |
---|
163 | |
---|
164 | \secB{Image reconstruction} |
---|
165 | \label{sec-recon} |
---|
166 | |
---|
167 | The user can direct \duchamp to reconstruct the data cube using the |
---|
168 | \atrous wavelet procedure. A good description of the procedure can be |
---|
169 | found in \citet{starck02:book}. The reconstruction is an effective way |
---|
170 | of removing a lot of the noise in the image, allowing one to search |
---|
171 | reliably to fainter levels, and reducing the number of spurious |
---|
172 | detections. This is an optional step, but one that greatly enhances |
---|
173 | the source-detection process, with the payoff that it can be |
---|
174 | relatively time- and memory-intensive. |
---|
175 | |
---|
176 | \secC{Algorithm} |
---|
177 | |
---|
178 | The steps in the \atrous reconstruction are as follows: |
---|
179 | \begin{enumerate} |
---|
180 | \item The reconstructed array is set to 0 everywhere. |
---|
181 | \item The input array is discretely convolved with a given filter |
---|
182 | function. This is determined from the parameter file via the |
---|
183 | \texttt{filterCode} parameter -- see Appendix~\ref{app-param} for |
---|
184 | details on the filters available. |
---|
185 | \item The wavelet coefficients are calculated by taking the difference |
---|
186 | between the convolved array and the input array. |
---|
187 | \item If the wavelet coefficients at a given point are above the |
---|
188 | requested threshold (given by \texttt{snrRecon} as the number of |
---|
189 | $\sigma$ above the mean and adjusted to the current scale -- see |
---|
190 | Appendix~\ref{app-scaling}), add these to the reconstructed array. |
---|
191 | \item The separation of the filter coefficients is doubled. (Note that |
---|
192 | this step provides the name of the procedure\footnote{\atrous means |
---|
193 | ``with holes'' in French.}, as gaps or holes are created in the |
---|
194 | filter coverage.) |
---|
195 | \item The procedure is repeated from step 2, using the convolved array |
---|
196 | as the input array. |
---|
197 | \item Continue until the required maximum number of scales is reached. |
---|
198 | \item Add the final smoothed (\ie convolved) array to the |
---|
199 | reconstructed array. This provides the ``DC offset'', as each of the |
---|
200 | wavelet coefficient arrays will have zero mean. |
---|
201 | \end{enumerate} |
---|
202 | |
---|
203 | The reconstruction has at least two iterations. The first iteration |
---|
204 | makes a first pass at the wavelet reconstruction (the process outlined |
---|
205 | in the 8 stages above), but the residual array will likely have some |
---|
206 | structure still in it, so the wavelet filtering is done on the |
---|
207 | residual, and any significant wavelet terms are added to the final |
---|
208 | reconstruction. This step is repeated until the change in the measured |
---|
209 | standard deviation of the background (see note below on the evaluation |
---|
210 | of this quantity) is less than some fiducial amount. |
---|
211 | |
---|
212 | It is important to note that the \atrous decomposition is an example |
---|
213 | of a ``redundant'' transformation. If no thresholding is performed, |
---|
214 | the sum of all the wavelet coefficient arrays and the final smoothed |
---|
215 | array is identical to the input array. The thresholding thus removes |
---|
216 | only the unwanted structure in the array. |
---|
217 | |
---|
218 | Note that any BLANK pixels that are still in the cube will not be |
---|
219 | altered by the reconstruction -- they will be left as BLANK so that |
---|
220 | the shape of the valid part of the cube is preserved. |
---|
221 | |
---|
222 | \secC{Note on Statistics} |
---|
223 | |
---|
224 | The correct calculation of the reconstructed array needs good |
---|
225 | estimators of the underlying mean and standard deviation of the |
---|
226 | background noise distribution. These statistics are estimated using |
---|
227 | robust methods, to avoid corruption by strong outlying points. The |
---|
228 | mean of the distribution is actually estimated by the median, while |
---|
229 | the median absolute deviation from the median (MADFM) is calculated |
---|
230 | and corrected assuming Gaussianity to estimate the underlying standard |
---|
231 | deviation $\sigma$. The Gaussianity (or Normality) assumption is |
---|
232 | critical, as the MADFM does not give the same value as the usual rms |
---|
233 | or standard deviation value -- for a Normal distribution |
---|
234 | $N(\mu,\sigma)$ we find MADFM$=0.6744888\sigma$. Since this ratio is |
---|
235 | corrected for, the user need only think in the usual multiples of |
---|
236 | $\sigma$ when setting \texttt{snrRecon}. See Appendix~\ref{app-madfm} |
---|
237 | for a derivation of this value. |
---|
238 | |
---|
239 | When thresholding the different wavelet scales, the value of $\sigma$ |
---|
240 | as measured from the wavelet array needs to be scaled to account for |
---|
241 | the increased amount of correlation between neighbouring pixels (due |
---|
242 | to the convolution). See Appendix~\ref{app-scaling} for details on |
---|
243 | this scaling. |
---|
244 | |
---|
245 | \secC{User control of reconstruction parameters} |
---|
246 | |
---|
247 | The most important parameter for the user to select in relation to the |
---|
248 | reconstruction is the threshold for each wavelet array. This is set |
---|
249 | using the \texttt{snrRecon} parameter, and is given as a multiple of |
---|
250 | the rms (estimated by the MADFM) above the mean (which for the wavelet |
---|
251 | arrays should be approximately zero). There are several other |
---|
252 | parameters that can be altered as well that affect the outcome of the |
---|
253 | reconstruction. |
---|
254 | |
---|
255 | By default, the cube is reconstructed in three dimensions, using a |
---|
256 | 3-dimensional filter and 3-dimensional convolution. This can be |
---|
257 | altered, however, using the parameter \texttt{reconDim}. If set to 1, |
---|
258 | this means the cube is reconstructed by considering each spectrum |
---|
259 | separately, whereas \texttt{reconDim=2} will mean the cube is |
---|
260 | reconstructed by doing each channel map separately. The merits of |
---|
261 | these choices are discussed in \S\ref{sec-notes}, but it should be |
---|
262 | noted that a 2-dimensional reconstruction can be susceptible to edge |
---|
263 | effects if the spatial shape of the pixel array is not rectangular. |
---|
264 | |
---|
265 | The user can also select the minimum scale to be used in the |
---|
266 | reconstruction. The first scale exhibits the highest frequency |
---|
267 | variations, and so ignoring this one can sometimes be beneficial in |
---|
268 | removing excess noise. The default is to use all scales |
---|
269 | (\texttt{minscale = 1}). |
---|
270 | |
---|
271 | Finally, the filter that is used for the convolution can be selected |
---|
272 | by using \texttt{filterCode} and the relevant code number -- the |
---|
273 | choices are listed in Appendix~\ref{app-param}. A larger filter will |
---|
274 | give a better reconstruction, but take longer and use more memory when |
---|
275 | executing. When multi-dimensional reconstruction is selected, this |
---|
276 | filter is used to construct a 2- or 3-dimensional equivalent. |
---|
277 | |
---|
278 | \secB{Input/Output of reconstructed arrays} |
---|
279 | \label{sec-reconIO} |
---|
280 | |
---|
281 | The reconstruction stage can be relatively time-consuming, |
---|
282 | particularly for large cubes and reconstructions in 3-D. To get around |
---|
283 | this, \duchamp provides a shortcut to allow users to perform multiple |
---|
284 | searches (\eg with different thresholds) on the same reconstruction |
---|
285 | without calculating the reconstruction each time. |
---|
286 | |
---|
287 | The first step is to choose to save the reconstructed array as a FITS |
---|
288 | file by setting \texttt{flagOutputRecon = true}. The file will be |
---|
289 | saved in the same directory as the input image, so the user needs to |
---|
290 | have write permissions for that directory. |
---|
291 | |
---|
292 | The filename will be derived from the input filename, with extra |
---|
293 | information detailing the reconstruction that has been done. For |
---|
294 | example, suppose \texttt{image.fits} has been reconstructed using a |
---|
295 | 3-dimensional reconstruction with filter \#2, thresholded at $4\sigma$ |
---|
296 | using all scales. The output filename will then be |
---|
297 | \texttt{image.RECON-3-2-4-1.fits} (\ie it uses the four parameters |
---|
298 | relevant for the \atrous reconstruction as listed in |
---|
299 | Appendix~\ref{app-param}). The new FITS file will also have these |
---|
300 | parameters as header keywords. If a subsection of the input image has |
---|
301 | been used (see \S\ref{sec-input}), the format of the output filename |
---|
302 | will be \texttt{image.sub.RECON-3-2-4-1.fits}, and the subsection that |
---|
303 | has been used is also stored in the FITS header. |
---|
304 | |
---|
305 | Likewise, the residual image, defined as the difference between the |
---|
306 | input and reconstructed arrays, can also be saved in the same manner |
---|
307 | by setting \texttt{flagOutputResid = true}. Its filename will be the |
---|
308 | same as above, with \texttt{RESID} replacing \texttt{RECON}. |
---|
309 | |
---|
310 | If a reconstructed image has been saved, it can be read in and used |
---|
311 | instead of redoing the reconstruction. To do so, the user should set |
---|
312 | the parameter \texttt{flagReconExists = true}. The user can indicate |
---|
313 | the name of the reconstructed FITS file using the \texttt{reconFile} |
---|
314 | parameter, or, if this is not specified, \duchamp searches for the |
---|
315 | file with the name as defined above. If the file is not found, the |
---|
316 | reconstruction is performed as normal. Note that to do this, the user |
---|
317 | needs to set |
---|
318 | \texttt{flagAtrous = true} (obviously, if this is \texttt{false}, the |
---|
319 | reconstruction is not needed). |
---|
320 | |
---|
321 | \secB{Smoothing the cube} |
---|
322 | \label{sec-smoothing} |
---|
323 | |
---|
324 | An alternative to doing the wavelet reconstruction is to Hanning |
---|
325 | smooth the cube. This technique can be useful in reducing the noise |
---|
326 | level slightly (at the cost of making neighbouring pixels correlated |
---|
327 | and blurring any signal present), and is particularly well suited to |
---|
328 | the case where a particular signal width is believed to be present in |
---|
329 | the data. It is also substantially faster than the wavelet |
---|
330 | reconstruction. |
---|
331 | |
---|
332 | The cube is smoothed only in the spectral domain. That is, each |
---|
333 | spectrum is independently smoothed, and then put back together to form |
---|
334 | the smoothed cube. This is then treated in the same way as the |
---|
335 | reconstructed cube, and is used for the searching algorithm (see |
---|
336 | below). Note that in the case of both the reconstruction and the |
---|
337 | smoothing options being requested, the reconstruction will take |
---|
338 | precedence and the smoothing will \emph{not} be done. |
---|
339 | |
---|
340 | There is only one parameter necessary to define the degree of |
---|
341 | smoothing -- the Hanning width $a$ (given by the user parameter |
---|
342 | \texttt{hanningWidth}). The coefficients $c(x)$ of the Hanning filter |
---|
343 | are defined by |
---|
344 | \[ |
---|
345 | c(x) = |
---|
346 | \begin{cases} |
---|
347 | \frac{1}{2}\left(1+\cos(\frac{\pi x}{a})\right) &|x| \leq (a+1)/2\\ |
---|
348 | 0 &|x| > (a+1)/2. |
---|
349 | \end{cases} |
---|
350 | \] |
---|
351 | where $a,x \in \mathbb{Z}$. Note that the width specified must be an |
---|
352 | odd integer (if the parameter provided is even, it is incremented by |
---|
353 | one). |
---|
354 | |
---|
355 | The user is able to save the smoothed array in exactly the same manner |
---|
356 | as for the reconstructed array -- set \texttt{flagOutputSmooth = |
---|
357 | true}, and then the smoothed array will be saved in |
---|
358 | \texttt{image.SMOOTH-a.fits}, where a is replaced by the Hanning width |
---|
359 | used. Similarly, a saved file can be read in by setting |
---|
360 | \texttt{flagSmoothExists = true} and either specifying a file to be |
---|
361 | read with the \texttt{smoothFile} parameter or relying on \duchamp to |
---|
362 | find the file with the name as given above. |
---|
363 | |
---|
364 | \secB{Searching the image} |
---|
365 | \label{sec-detection} |
---|
366 | |
---|
367 | The basic idea behind detection is to locate sets of contiguous voxels |
---|
368 | that lie above some threshold. \duchamp now calculates one threshold |
---|
369 | for the entire cube (previous versions calculated thresholds for each |
---|
370 | spectrum and image). This enables calculation of signal-to-noise |
---|
371 | ratios for each source (see Section~\ref{sec-output} for details). The |
---|
372 | user can manually specify a value (using the parameter |
---|
373 | \texttt{threshold}) for the threshold, which will override the |
---|
374 | calculated value. Note that this only applies for the first of the two |
---|
375 | cases discussed below -- the FDR case ignores any manually-set |
---|
376 | threshold value. |
---|
377 | |
---|
378 | The image is searched for detections in two ways: spectrally (\ie a |
---|
379 | 1-dimensional search in the spectrum in each spatial pixel), and |
---|
380 | spatially (a 2-dimensional search in the spatial image in each |
---|
381 | channel). In both cases, the algorithm finds connected pixels that are |
---|
382 | above the user-specified threshold. In the case of the spatial image |
---|
383 | search, the algorithm of \citet{lutz80} is used to raster-scan through |
---|
384 | the image and connect groups of pixels on neighbouring rows. |
---|
385 | |
---|
386 | Note that this algorithm cannot be applied directly to a 3-dimensional |
---|
387 | case, as it requires that objects are completely nested in a row: that |
---|
388 | is, if you are scanning along a row, and one object finishes and |
---|
389 | another starts, you know that you will not get back to the first one |
---|
390 | (if at all) until the second is completely finished for that |
---|
391 | row. Three-dimensional data does not have this property, which is why |
---|
392 | we break up the searching into 1- and 2-dimensional cases. |
---|
393 | |
---|
394 | Detections must have a minimum number of pixels to be counted. This |
---|
395 | number is given by the input parameters \texttt{minPix} (for |
---|
396 | 2-dimensional searches) and \texttt{minChannels} (for 1-dimensional |
---|
397 | searches). |
---|
398 | |
---|
399 | Finally, the search only looks for positive features. If one is |
---|
400 | interested instead in negative features (such as absorption lines), |
---|
401 | set the parameter \texttt{flagNegative = true}. This will invert the |
---|
402 | cube (\ie multiply all pixels by $-1$) prior to the search, and then |
---|
403 | re-invert the cube (and the fluxes of any detections) after searching |
---|
404 | is complete. All outputs are done in the same manner as normal, so |
---|
405 | that fluxes of detections will be negative. |
---|
406 | |
---|
407 | \secC{Calculating statistics} |
---|
408 | |
---|
409 | A crucial part of the detection process is estimating the statistics |
---|
410 | that define the detection threshold. To determine a threshold, we need |
---|
411 | to estimate from the data two parameters: the middle of the pixel |
---|
412 | distribution, and its spread. For both cases, we again use robust |
---|
413 | methods, using the median and MADFM. If the cube has been |
---|
414 | reconstructed or smoothed, the residuals (defined in the sense of |
---|
415 | original $-$ reconstruction) are used to estimate the noise parameters |
---|
416 | of the cube. Otherwise they are estimated directly from the cube |
---|
417 | itself. In both cases, robust estimators are used. |
---|
418 | |
---|
419 | The parameters that are estimated should be representative of the |
---|
420 | noise in the cube. For the case of small objects embedded in many |
---|
421 | noise pixels (\eg the case of \hi surveys), using the full cube will |
---|
422 | provide good estimators. It is possible, however, to use only a |
---|
423 | subsection of the cube by setting the parameter \texttt{flagStatSec = |
---|
424 | true} and providing the desired subsection to the \texttt{StatSec} |
---|
425 | parameter. This subsection works in exactly the same way as the pixel |
---|
426 | subsection discussed in \S\ref{sec-input}. |
---|
427 | |
---|
428 | \secC{Determining the threshold} |
---|
429 | |
---|
430 | Once the statistics have been calculated, the threshold is determined |
---|
431 | in one of two ways. The first way is a simple sigma-clipping, where a |
---|
432 | threshold is set at a fixed number $n$ of standard deviations above |
---|
433 | the mean, and pixels above this threshold are flagged as detected. The |
---|
434 | value of $n$ is set with the parameter \texttt{snrCut}. As before, the |
---|
435 | value of the standard deviation is estimated by the MADFM, and |
---|
436 | corrected by the ratio derived in Appendix~\ref{app-madfm}. |
---|
437 | |
---|
438 | The second method uses the False Discovery Rate (FDR) technique |
---|
439 | \citep{miller01,hopkins02}, whose basis we briefly detail here. The |
---|
440 | false discovery rate (given by the number of false detections divided |
---|
441 | by the total number of detections) is fixed at a certain value |
---|
442 | $\alpha$ (\eg $\alpha=0.05$ implies 5\% of detections are false |
---|
443 | positives). In practice, an $\alpha$ value is chosen, and the ensemble |
---|
444 | average FDR (\ie $\langle FDR \rangle$) when the method is used will |
---|
445 | be less than $\alpha$. One calculates $p$ -- the probability, |
---|
446 | assuming the null hypothesis is true, of obtaining a test statistic as |
---|
447 | extreme as the pixel value (the observed test statistic) -- for each |
---|
448 | pixel, and sorts them in increasing order. One then calculates $d$ |
---|
449 | where |
---|
450 | \[ |
---|
451 | d = \max_j \left\{ j : P_j < \frac{j\alpha}{c_N N} \right\}, |
---|
452 | \] |
---|
453 | and then rejects all hypotheses whose $p$-values are less than or |
---|
454 | equal to $P_d$. (So a $P_i<P_d$ will be rejected even if $P_i \geq |
---|
455 | j\alpha/c_N N$.) Note that ``reject hypothesis'' here means ``accept |
---|
456 | the pixel as an object pixel'' (\ie we are rejecting the null |
---|
457 | hypothesis that the pixel belongs to the background). |
---|
458 | |
---|
459 | The $c_N$ values here are normalisation constants that depend on the |
---|
460 | correlated nature of the pixel values. If all the pixels are |
---|
461 | uncorrelated, then $c_N=1$. If $N$ pixels are correlated, then their |
---|
462 | tests will be dependent on each other, and so $c_N = \sum_{i=1}^N |
---|
463 | i^{-1}$. \citet{hopkins02} consider real radio data, where the pixels |
---|
464 | are correlated over the beam. In this case the sum is made over the |
---|
465 | $N$ pixels that make up the beam. The value of $N$ is calculated from |
---|
466 | the FITS header (if the correct keywords -- BMAJ, BMIN -- are not |
---|
467 | present, the size of the beam is taken from the parameter |
---|
468 | \texttt{beamSize}). |
---|
469 | |
---|
470 | The theory behind the FDR method implies a direct connection between |
---|
471 | the choice of $\alpha$ and the fraction of detections that will be |
---|
472 | false positives. However, due to the merging process, this direct |
---|
473 | connection is lost when looking at the final number of detections -- |
---|
474 | see discussion in \S\ref{sec-notes}. The effect is that the number of |
---|
475 | false detections will be less than indicated by the $\alpha$ value |
---|
476 | used. |
---|
477 | |
---|
478 | |
---|
479 | \secB{Merging detected objects} |
---|
480 | \label{sec-merger} |
---|
481 | |
---|
482 | The searching step produces a list of detected objects that will have |
---|
483 | many repeated detections of a given object -- for instance, spectral |
---|
484 | detections in adjacent pixels of the same object and/or spatial |
---|
485 | detections in neighbouring channels. These are then combined in an |
---|
486 | algorithm that matches all objects judged to be ``close'', according |
---|
487 | to one of two criteria. |
---|
488 | |
---|
489 | One criterion is to define two thresholds -- one spatial and one in |
---|
490 | velocity -- and say that two objects should be merged if there is at |
---|
491 | least one pair of pixels that lie within these threshold distances of |
---|
492 | each other. These thresholds are specified by the parameters |
---|
493 | \texttt{threshSpatial} and \texttt{threshVelocity} (in units of pixels |
---|
494 | and channels respectively). |
---|
495 | |
---|
496 | Alternatively, the spatial requirement can be changed to say that |
---|
497 | there must be a pair of pixels that are \emph{adjacent} -- a stricter, |
---|
498 | but perhaps more realistic requirement, particularly when the spatial |
---|
499 | pixels have a large angular size (as is the case for |
---|
500 | \hi surveys). This |
---|
501 | method can be selected by setting the parameter |
---|
502 | \texttt{flagAdjacent} to 1 (\ie \texttt{true}) in the parameter |
---|
503 | file. The velocity thresholding is done in the same way as the first |
---|
504 | option. |
---|
505 | |
---|
506 | Once the detections have been merged, they may be ``grown''. This is a |
---|
507 | process of increasing the size of the detection by adding adjacent |
---|
508 | pixels that are above some secondary threshold. This threshold is |
---|
509 | lower than the one used for the initial detection, but above the noise |
---|
510 | level, so that faint pixels are only detected when they are close to a |
---|
511 | bright pixel. The value of this threshold is a possible input |
---|
512 | parameter (\texttt{growthCut}), with a default value of |
---|
513 | $1.5\sigma$. The use of the growth algorithm is controlled by the |
---|
514 | \texttt{flagGrowth} parameter -- the default value of which is |
---|
515 | \texttt{false}. If the detections are grown, they are sent through the |
---|
516 | merging algorithm a second time, to pick up any detections that now |
---|
517 | overlap or have grown over each other. |
---|
518 | |
---|
519 | Finally, to be accepted, the detections must span \emph{both} a |
---|
520 | minimum number of channels (to remove any spurious single-channel |
---|
521 | spikes that may be present), and a minimum number of spatial |
---|
522 | pixels. These numbers, as for the original detection step, are set |
---|
523 | with the \texttt{minChannels} and \texttt{minPix} parameters. The |
---|
524 | channel requirement means there must be at least one set of |
---|
525 | \texttt{minChannels} consecutive channels in the source for it to be |
---|
526 | accepted. |
---|