% -----------------------------------------------------------------------
% hints.tex: Section giving some tips & hints on how Duchamp is best
%            used.
% -----------------------------------------------------------------------
% Copyright (C) 2006, Matthew Whiting, ATNF
%
% This program is free software; you can redistribute it and/or modify it
% under the terms of the GNU General Public License as published by the
% Free Software Foundation; either version 2 of the License, or (at your
% option) any later version.
%
% Duchamp is distributed in the hope that it will be useful, but WITHOUT
% ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
% FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
% for more details.
%
% You should have received a copy of the GNU General Public License
% along with Duchamp; if not, write to the Free Software Foundation,
% Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA
%
% Correspondence concerning Duchamp may be directed to:
%    Internet email: Matthew.Whiting [at] atnf.csiro.au
%    Postal address: Dr. Matthew Whiting
%                    Australia Telescope National Facility, CSIRO
%                    PO Box 76
%                    Epping NSW 1710
%                    AUSTRALIA
% -----------------------------------------------------------------------
\secA{Notes and hints on the use of \duchamp}
\label{sec-notes}

In using \duchamp, the user has to make a number of decisions about
the way the program runs. This section is designed to give the user
some idea of what to choose.

\secB{Memory usage}

A lot of attention has been paid to the memory usage in \duchamp,
recognising that data cubes are going to be increasing in size with
new-generation correlators and wider fields of view. However, users
with large cubes should be aware of the likely usage for different
modes of operation and plan their \duchamp execution carefully.

At the start of the program, memory is allocated sufficient for:
\begin{itemize}
\item The entire pixel array (as requested, subject to any
  subsection).
\item The spatial extent, which holds the map of detected pixels (for
  output into the detection map).
\item If smoothing or reconstruction has been selected, another array
  of the same size as the pixel array. This will hold the
  smoothed/reconstructed array (the original needs to be kept to do
  the correct parameterisation of detected sources).
\item If baseline-subtraction has been selected, a further array of
  the same size as the pixel array. This holds the baseline values,
  which need to be added back in prior to parameterisation.
\end{itemize}
All of these will be of float type, except for the detection map,
which is short.

There will, of course, be additional allocation during the course of
the program. The detection list will progressively grow, with each
detection having a memory footprint as described in
\S\ref{sec-scan}. But perhaps more important, and with a larger
impact, will be the temporary space allocated for various algorithms.

The largest of these will be for the wavelet reconstruction. This will
require an additional allocation of twice the size of the array being
reconstructed: one copy for the coefficients and one for the wavelets,
with each scale overwriting the previous one. So, for the 1D case,
this means an additional allocation of twice the spectral dimension
(since we only reconstruct one spectrum at a time), but the 3D case
will require an additional allocation of twice the cube size. This
means there needs to be available at least four times the size of the
input cube for 3D reconstruction, plus the additional overheads of
detections and so forth.
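
As a rough illustrative aid, the allocations described above can be
totalled for a given cube size. The helper below is not part of
\duchamp; the function and constant names are hypothetical, and the
accounting simply follows the text:

```python
# Rough peak-memory estimate for a Duchamp-style run, following the
# allocations listed in the text. Illustrative only; the names here
# are hypothetical, not part of Duchamp itself.

FLOAT_BYTES = 4   # pixel, reconstructed and baseline arrays are float
SHORT_BYTES = 2   # the detection map is short

def estimate_bytes(nx, ny, nz, reconstruct=False, baseline=False):
    npix = nx * ny * nz
    total = npix * FLOAT_BYTES            # the input pixel array
    total += nx * ny * SHORT_BYTES        # spatial map of detected pixels
    if reconstruct:
        total += npix * FLOAT_BYTES       # the reconstructed/smoothed array
        total += 2 * npix * FLOAT_BYTES   # temporary coefficient + wavelet space
    if baseline:
        total += npix * FLOAT_BYTES       # the baseline array
    return total

# A 512 x 512 x 4096 cube with 3D reconstruction needs at least four
# float copies of the cube (roughly 16 GiB), before detection overheads.
print(estimate_bytes(512, 512, 4096, reconstruct=True) / 2**30)
```

The factor of four for 3D reconstruction quoted above falls straight
out of this accounting.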

The smoothing has less of an impact, since it only operates on the
lower dimensions, but it will make an additional allocation of twice
the relevant size (the spectral dimension for spectral smoothing, or
the spatial image size for spatial Gaussian smoothing).

The other large allocation of temporary space will be for calculating
robust statistics. The median-based calculations require at least
partial sorting of the data, and so cannot be done on the original
image cube. This is done for the entire cube, so the temporary
increase in memory usage can be large.

\secB{Timing considerations}

Another interesting question from a user's perspective is how long one
can expect \duchamp to take. This is a difficult question to answer in
general, as different users will have different-sized data sets, as
well as machines with different capabilities (in terms of CPU speed
and I/O \& memory bandwidths). Additionally, the time required will
depend slightly on the number of sources found and their size (very
large sources can take a while to parameterise fully).

Having said that, \citet{whiting12} presents a brief analysis of
different modes of execution applied to a single HIPASS cube (\#201)
using a MacBook Pro (2.66~GHz, 8~GB RAM). Two thresholds were used:
either $10^8$~Jy~beam$^{-1}$ (no sources will be found, so that the
time taken is dominated by preprocessing), or 35~mJy~beam$^{-1}$
($\sim2.58\sigma$, chosen so that the time taken will include that
required to process sources). The basic searches, with no
pre-processing done, took less than a second for the high-threshold
search, but between 1 and 3~min for the low-threshold case -- the
number of sources detected ranged from 3000 (rejecting sources with
fewer than 3 channels and 2 spatial pixels) to 42000 (keeping all
sources).

When smoothing, the raw time for the spectral smoothing was only a few
seconds, with a small dependence on the width of the smoothing
filter. And because the number of spurious sources is markedly
decreased (the final catalogues ranged from 17 to 174 sources,
depending on the width of the smoothing), searching with the low
threshold added little more than a second to the time. The spatial
smoothing was more computationally intensive, taking about 4~min to
complete the high-threshold search.

The wavelet reconstruction time depended primarily on the
dimensionality of the reconstruction, with the 1D case taking 20~s,
the 2D case 30--40~s and the 3D case 2--4~min. The spread in times for
a given dimensionality was caused by different reconstruction
thresholds, with lower thresholds taking longer (since more pixels lie
above the threshold and so need to be added to the final spectrum). In
all cases the reconstruction time dominated the total time for the
low-threshold search, since the number of sources found was again
smaller than for the basic searches.


\secB{Why do preprocessing?}

The preprocessing options provided by \duchamp, particularly the
ability to smooth or reconstruct via multi-resolution wavelet
decomposition, provide an opportunity to beat the effects of the
random noise that will be present in the data. This noise will
ultimately limit one's ability to detect objects and form a complete
and reliable catalogue. Two effects are important here. First, the
noise reduces the completeness of the final catalogue by suppressing
the flux of real sources such that they fall below the detection
threshold. Second, the noise produces false detections through noise
peaks that fall above the threshold, thereby reducing the reliability
of the catalogue.

\citet{whiting12} examined the effect on completeness and reliability
of the reconstruction and smoothing (1D cases only) when applied to a
simple simulated dataset. Both had the effect of reducing the number
of spurious sources, which means the searches can be done to fainter
thresholds. This led to completeness levels about one flux unit (equal
to one standard deviation of the noise) fainter than searches without
pre-processing, with $>95\%$ reliability. The smoothing did slightly
better, with the completeness level nearly half a flux unit fainter
than the reconstruction, although this was helped by the sources in
the simulation all having the same spectral size.

\secB{Reconstruction considerations}

The \atrous wavelet reconstruction approach is designed to remove a
large amount of the random noise while preserving as much structure as
possible on the full range of spatial and/or spectral scales present
in the data. While it is relatively expensive in terms of memory and
CPU usage (see the previous sections), its effect on the reliability
of the final catalogue in particular makes it worth investigating.
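
To make the procedure concrete, here is a minimal 1D sketch in the
spirit of the \atrous algorithm: each scale smooths the data with an
increasingly "holey" kernel, and only significant wavelet coefficients
are kept. This is an illustration only, not \duchamp's code; the
B3-spline kernel and the simple fixed threshold are assumptions made
here, and edges are handled by reflection as described below:

```python
# Minimal 1D "a trous" wavelet denoising sketch. Illustrative only,
# not Duchamp's implementation; kernel and threshold are assumptions.

KERNEL = [1/16, 4/16, 6/16, 4/16, 1/16]   # B3-spline smoothing kernel

def reflect(i, n):
    # Map an out-of-range index back into [0, n) by reflection at the edges.
    while i < 0 or i >= n:
        if i < 0:
            i = -i
        if i >= n:
            i = 2 * (n - 1) - i
    return i

def atrous_reconstruct(data, nscales=3, threshold=0.0):
    n = len(data)
    smooth = list(data)
    output = [0.0] * n
    for scale in range(nscales):
        step = 2 ** scale                  # kernel "holes" double each scale
        new_smooth = [0.0] * n
        for x in range(n):
            s = 0.0
            for k, w in enumerate(KERNEL):
                s += w * smooth[reflect(x + (k - 2) * step, n)]
            new_smooth[x] = s
        for x in range(n):
            # wavelet coefficient = detail removed by this smoothing pass
            w = smooth[x] - new_smooth[x]
            if abs(w) > threshold:         # keep only significant coefficients
                output[x] += w
        smooth = new_smooth
    for x in range(n):
        output[x] += smooth[x]             # add back the final smooth array
    return output
```

With the threshold set to zero, the sum of the retained coefficients
and the final smoothed array recovers the input exactly, which is a
useful sanity check on any implementation.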

There are, however, a number of subtleties that need to be considered
by potential users. \citet{whiting12} shows a set of examples of the
reconstruction applied to simulated and real data. The real data, in
this case a HIPASS cube, show differences in the quality of the
reconstructed spectrum depending on the dimensionality of the
reconstruction. The two-dimensional reconstruction (where the cube is
reconstructed one channel map at a time) shows much larger
channel-to-channel noise, with a number of narrow peaks surviving the
reconstruction process. The problem here is that there are spatial
correlations between pixels due to the beam, which allow beam-sized
noise fluctuations to rise above the threshold more frequently than in
the one-dimensional case. The other effect is that each channel is
reconstructed independently of its neighbouring channels, unlike in
the 1D reconstruction of a spectrum. This is also why the 3D
reconstruction (which also suffers from the beam effects) has improved
noise in the output spectrum, since the information in neighbouring
channels is taken into account.

Caution is also advised when looking at subsections of a cube. Due to
the multi-scale nature of the algorithm, the wavelet coefficients at a
given pixel are influenced by pixels at very large separations,
particularly since edges are dealt with by assuming reflection (so the
whole array is visible to all pixels). Also, if one decreases the
dimensions of the array being reconstructed, there may be fewer scales
used in the reconstruction. These points mean that the reconstruction
of a subsection of a cube will differ from the same subsection of the
reconstructed full cube. The difference may be small (depending on the
relative size difference and the amount of structure at large scales),
but there will be differences at some level.

Note also that BLANK pixels are ignored by the reconstruction: they
remain as BLANK in the output, and do not contribute to the discrete
convolution when they otherwise would. Flagging channels with the
\texttt{flaggedChannels} parameter, however, has no effect on the
reconstruction -- these flags are applied after the preprocessing,
either in the searching or the rejection stage.

\secB{Smoothing considerations}

The smoothing approach differs from the wavelet reconstruction in that
it has a single scale associated with it. The user has two choices to
make: which dimension to smooth in (spatially or spectrally), and what
size kernel to smooth with. \citet{whiting12} shows examples of how
different smoothing widths (in one dimension in this case) can
highlight sources of different sizes. If one has some \textit{a
  priori} idea of the typical size scale of the objects one wishes to
detect, then choosing a single smoothing scale can be quite
beneficial.
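
As an illustration of that single-scale choice, a spectral smoothing
with a Hanning-style kernel might be sketched as follows. This is not
\duchamp's implementation; the kernel normalisation and the edge
clamping are assumptions made here for the example:

```python
# Illustrative 1D (spectral) smoothing with a Hanning-style kernel.
# Not Duchamp's code; normalisation and edge handling are assumptions.

import math

def hanning_kernel(width):
    # width = full kernel width in channels (odd); normalised to sum to 1
    half = width // 2
    coeffs = [0.5 * (1 + math.cos(math.pi * k / (half + 1)))
              for k in range(-half, half + 1)]
    norm = sum(coeffs)
    return [c / norm for c in coeffs]

def smooth_spectrum(spec, width=5):
    kern = hanning_kernel(width)
    half = len(kern) // 2
    n = len(spec)
    out = []
    for x in range(n):
        s = 0.0
        for k, w in enumerate(kern):
            j = min(max(x + k - half, 0), n - 1)   # clamp at the edges
            s += w * spec[j]
        out.append(s)
    return out
```

Broader kernels suppress narrow noise spikes more strongly, at the
cost of diluting sources narrower than the kernel -- which is why the
kernel width is worth matching to the expected source size.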

Note that beam effects can be important here too, when smoothing
spatial data on scales close to that of the beam. This can enhance
beam-sized noise fluctuations and potentially introduce spurious
sources. As always, examining the smoothed array (after saving it via
\texttt{flagOutputSmooth}) is a good idea.


\secB{Threshold method}

When it comes to searching, the FDR method produces more reliable
results than simple sigma-clipping, particularly in the absence of
reconstruction. However, it does not work in exactly the way one would
expect for a given value of \texttt{alpha}. For instance, setting a
fairly liberal value of \texttt{alpha} (say, 0.1) will often lead to a
much smaller fraction of false detections (\ie much less than
10\%). This is the effect of the merging algorithms, which combine the
sources after the detection stage and reject detections not meeting
the minimum pixel or channel requirements. It is thus better to aim
for larger \texttt{alpha} values than those derived from a straight
conversion of the desired false detection rate.
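
The core of the FDR technique is a Benjamini-Hochberg-style step: rank
the per-pixel p-values and find the largest one satisfying the
criterion. The sketch below is purely illustrative and ignores the
correction for correlated pixels (\eg due to the beam) that a full
treatment requires:

```python
# Benjamini-Hochberg style threshold selection, the step underlying
# the FDR method. Illustrative only: no correction for pixel
# correlation is applied here.

def fdr_threshold(pvals, alpha):
    n = len(pvals)
    cutoff = 0.0
    for j, p in enumerate(sorted(pvals), start=1):
        if p <= alpha * j / n:
            cutoff = p        # largest p-value meeting the criterion
    return cutoff

# Pixels with p-values at or below the returned cutoff are "detected".
```

Because the criterion adapts to the ranked p-values, the effective
flux threshold depends on the data themselves, which is part of why a
given \texttt{alpha} does not translate directly into a false
detection fraction in the final catalogue.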

If the FDR method is not used, caution is required when choosing the
S/N cutoff. Typical cubes have very large numbers of pixels, so even
an apparently high cutoff will still result in a non-negligible number
of detections simply due to random fluctuations of the noise
background. For instance, a $4\sigma$ threshold applied to a cube of
Gaussian noise of size $100\times100\times1024$ will result in
$\sim340$ single-pixel detections. This is where the minimum channel
and pixel requirements are important in rejecting spurious detections.
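
The figure quoted above can be checked with a couple of lines
(illustrative only; the function name is made up for this example):

```python
# Expected number of single-pixel false detections above an S/N
# cutoff, for a cube of pure Gaussian noise.

import math

def expected_false_positives(npix, snr):
    # one-tailed Gaussian tail probability P(z > snr), times pixel count
    p = 0.5 * math.erfc(snr / math.sqrt(2.0))
    return npix * p

# A 100 x 100 x 1024 cube at a 4-sigma cutoff:
print(expected_false_positives(100 * 100 * 1024, 4.0))  # ~324
```

This is in line with the $\sim340$ quoted above, and it grows quickly
as the cutoff is lowered: each half-sigma roughly quadruples the tail
probability in this regime.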


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "Guide"
%%% End: