Binning

To be useful, the measured correlations need to be binned in some way to find the average correlation among many pairs of nearly the same separation. The different ways to bin the results may be specified using the bin_type parameter in Corr2.

“Log”

The default way to bin the results in TreeCorr is uniformly in log(r), where r is defined according to the specified metric (cf. Metrics). This corresponds to bin_type = "Log", although one normally omits this, as it is the default.

For most correlation functions, which tend to be approximately power laws, this binning is the most appropriate, since it naturally handles a large dynamic range in the separation.

The exact binning is specified using any 3 of the following 4 parameters:

  • nbins How many bins to use.

  • bin_size The width of the bins in log(r).

  • min_sep The minimum separation r to include.

  • max_sep The maximum separation r to include.

For a pair with a metric distance r, the index of the corresponding bin in the output array is int(log(r) - log(min_sep))/bin_size).

Note

If nbins is the omitted value, then bin_size might need to be decreased slightly to accommodate an integer number of bins with the given min_sep and max_sep.

“Linear”

For use cases where the scales of interest span only a relatively small range of distances, it may be more convenient to use linear binning rather than logarithmic. A notable example of this is BAO investigations, where the interesting region is near the BAO peak. In these cases, using bin_type = "Linear" may be preferred.

As with “Log”, the binning may be specified using any 3 of the following 4 parameters:

  • nbins How many bins to use.

  • bin_size The width of the bins in r.

  • min_sep The minimum separation r to include.

  • max_sep The maximum separation r to include.

For a pair with a metric distance r, the index of the corresponding bin in the output array is int((r - min_sep)/bin_size).

Note

If nbins is the omitted value, then bin_size might need to be decreased slightly to accommodate an integer number of bins with the given min_sep and max_sep.

“TwoD”

To bin the correlation in two dimensions, (x,y), you can use bin_type = "TwoD". This will keep track of not only the distance between two points, but also the direction. The results are then binned linearly in both the delta x and delta y values.

The exact binning is specified using any 2 of the following 3 parameters:

  • nbins How many bins to use in each direction.

  • bin_size The width of the bins in dx and dy.

  • max_sep The maximum absolute value of dx or dy to include.

For a pair with a directed separation (dx,dy), the indices of the corresponding bin in the 2-d output array are int((dx + max_sep)/bin_size), int((dy + max_sep)/bin_size).

The binning is symmetric around (0,0), so the minimum separation in either direction is -max_sep, and the maximum is +max_sep. If is also permissible to specify min_sep to exclude small separations from being accumulated, but the binning will still include a bin that crosses over (dx,dy) = (0,0) if nbins is odd, or four bins that touch (0,0) if nbins is even.

Note that this metric is only valid when the input positions are given as x,y (not ra, dec), and the metric is “Euclidean”. If you have a use case for other combinations, please open an issue with your specific case, and we can try to figure out how it should be implemented.

Output quantities

For all of the different binning options, the Correlation object will have the following attributes related to the locations of the bins:

  • rnom The separation at the nominal centers of the bins. For “Linear” binning, these will be spaced uniformly.

  • logr The log of the separation at the nominal centers of the bins. For “Log” binning, these will be spaced uniformly. This is always the (natural) log of rnom.

  • left_edges The separation at the left edges of the bins. For “Linear” binning, these are half-way between the rnom values of successive bins. For “Log” binning, these are the geometric mean of successive rnom values, rather than the arithmetic mean. For “TwoD” binning, these are like “Linear” but for the x separations only.

  • right_edges Analogously, the separation at the right edges of the bins.

  • meanr The mean separation of all the pairs of points that actually ended up falling in each bin.

  • meanlogr The mean log(separation) of all the pairs of points that actually ended up falling in each bin.

The last two quantities are only available after finishing a calculation (e.g. with process).

In addition to the above, “TwoD” binning also includes the following:

  • dxnom The x separation at the nominal centers of the 2-D bins.

  • dynom The y separation at the nominal centers of the 2-D bins.

  • bottom_edges The y separation at the bottom edges of the 2-D bins. Like left_edges, but for the y values rather than the x values.

  • top_edges The y separation at the top edges of the 2-D bins. Like right_edges, but for the y values rather than the x values.

There is some subtlety about which separation to use when comparing measured correlation functions to theoretical predictions. See Appendix D of Singh et al, 2020, who show that one can find percent level differences among the different options. (See their Figure D2 in particular.) The difference is smaller as the bin size decreases, although they point out that it is not always feasible to make the bin size very small, e.g. because of issues calculating the covariance matrix.

In most cases, if the true signal is expected to be locally well approximated by a power law, then using meanlogr is probably the most appropriate choice. This most closely approximates the signal-based weighting that they recommend, but if you are concerned about the percent level effects of this choice, you would be well-advised to investigate the different options with simulations to see exactly what impact the choice has on your science.

Other options for binning

There are a few other options that affect the binning, which can be set when constructing any of the Corr2 or Corr3 classes.

sep_units

The optional parameter sep_units lets you specify what units you want for the binned separations if the separations are angles.

Valid options are “arcsec”, “arcmin”, “degrees”, “hours”, or “radians”. The default if not specified is “radians”.

Note that this is only valid when the distance metric is an angle. E.g. if RA and Dec values are given for the positions, and no distance values are specified, then the default metric, “Euclidean”, is the angular separation on the sky. “Arc” similarly is always an angle.

If the distance metric is a physical distance, then this parameter is invalid, and the output separation will match the physical distance units in the input catalog. E.g. if the distance from Earth is given as r, then the output units will match the units of the r values. Or if positions are given as x, y (and maybe z), then the units will be whatever the units are for these values.

bin_slop

One of the main reasons that TreeCorr is able to compute correlation functions so quickly is that it allows the bin edges to be a little bit fuzzy. A pairs whose separation is very close to a dividing line between two bins might be placed in the next bin over from where an exact calculation would put it.

This is normally completely fine for any real-world application. Indeed, by deciding to bin your correlation function with some non-zero bin size, you have implicitly defined a resolution below which you don’t care about the exact separation values.

The approximation TreeCorr makes is to allow some additional imprecision that is a fraction of this level, which is quantified by the parameter bin_slop. Specifically, bin_slop specifies the maximum possible error that any pair can have, given as a fraction of the bin size.

Note

For logarithmic binning, this refers to the size of the bin in logarithmic units, so it specifies the maximum error allowed for log(r) relative to the size of the bin in logarithmic units, log(R) - log(L), where L and R are the left and right edges of the bin.

You can think of it as turning all of your rectangular bins into overlapping trapezoids, where bin_slop defines the ratio of the angled portion to the flat mean width. Larger bin_slop allows for more overlap (and is thus faster), while smaller bin_slop gets closer to putting each pair perfectly into the bin it belongs in.

The default bin_slop for the “Log” bin type is such that bin_slop * bin_size is 0.1. Or if bin_size < 0.1, then we use bin_slop = 1. This has been found to give fairly good accuracy across a variety of applications. However, for high precision measurements, it may be appropriate to use a smaller value than this. Especially if your bins are fairly large.

A typical test to perform on your data is to cut bin_slop in half and see if your results change significantly. If not, you are probably fine, but if they change by an appreciable amount (according to whatever you think that means for your science), then your original bin_slop was too large.

To understand the impact of the bin_slop parameter, it helps to start by thinking about when it is set to 0. If bin_slop = 0, then TreeCorr does essentially a brute-force calculation, where each pair of points is always placed into the correct bin.

But if bin_slop > 0, then any given pair is allowed to be placed in the wrong bin so long as the true separation is within this fraction of a bin from the edge. For example, if a bin nominally goes from 10 to 20 arcmin (with linear binning), then with bin_slop = 0.05, TreeCorr will accumulate pairs with separations ranging from 9.5 to 20.5 arcmin into this bin. (I.e. the slop is 0.05 of the bin width on each side.) Note that some of the pairs with separations from 9.5 to 10.5 would possibly fall into the lower bin instead. Likewise some from 19.5 to 20.5 would fall in the higher bin. So both edges are a little fuzzy.

Furthermore, for a given set of pairs that we accumulate as a group, we only allow a total of the given bin_slop on both sides of the bin combined. For instance, with the above example, if a group of pairs to accumulate had a mean separation of 15 arcmin, but the range was \(\pm 5.5\), then the total slop would be considered 0.5 + 0.5 = 1.0, which is too much. So that group would be further subdivided. However, if the range was only \(\pm 5.25\), then the total slop would be 0.25 + 0.25 = 0.5, which would be considered acceptable.

For large number of objects, the shifts up and down tend to cancel out, so there is typically very little bias in the results. Statistically, about as many pairs scatter up as scatter down, so the resulting counts come out pretty close to correct. Furthermore, the total number of pairs within the specified range is always correct, since each pair is placed in some bin.

angle_slop

For some calculations, the angular orientation of the line between two points is relevant to the correlation. For most complex quantities (e.g. shear), the value used in the correlation has to be projected onto the the line between the two points (for two-point correlations) or the centroid of the triangle (for three-point correlations). Even for scalar quantities, three-point correlations using the multipole method also use the angular direction of the lines between points.

In such cases, another optional parameter, called angle_slop provides a slightly different accuracy/speed trade off than the one provided by bin_slop. It sets the maximum allowed angular difference between the line connecting two cells to be accumulated and the lines connecting any two of their consituent points. If the line connecting some pair has a direction that is more than angle_slop radians different from the cell pair direction, then the cells will be split further, even if they would pass the bin_slop criterion.

This parameter allows one to use fairly large bins for shear (or other non-spin-0 complex field) correlations and tune the accuracy of the projections without worrying that some pairs of cells will have large projection errors because the range of separations for a pair of cells happen to fit precisely into a single bin.

brute

Sometimes, it can be useful to force the code to do the full brute force calculation, skipping all of the approximations that are inherent to the tree traversal algorithm. This of course is much slower, but this option can be useful for testing purposes especially. For instance, comparisons to brute force results have been invaluable in TreeCorr development of the faster algorithms. Some science cases also use comparison to brute force results to confirm that they are not significantly impacted by using non-zero bin_slop.

Setting brute = True is roughly equivalent to setting bin_slop = 0. However, there is a distinction between these two cases. Internally, the former will always traverse the tree all the way to the leaves. So every pair will be calculated individually. This really is the brute force calculation.

However, bin_slop = 0 will allow for the traversal to stop early if all possible pairs in a given pair of cells fall into the same bin. This can be quite a large speedup in some cases. And especially for NN correlations, there is no disadvantage to doing so.

For shear correlations, there can be a slight difference between using bin_slop = 0 and brute = True because the shear projections won’t be precisely equal in the two cases. Shear correlations require parallel transporting the shear values to the centers of the cells, and then when accumulating pairs, the shears are projected onto the line joining the two points. Both of these lead to slight differences in the results of a bin_slop = 0 calculation compared to the true brute force calculation. If the difference is seen to matter for you, then the above angle_slop parameter can be used to increase the accuracy of the projections, separate from the bin_slop considerations.

Additionally, there is one other way to use the brute parameter. If you set brute to 1 or 2, rather than True or False, then the forced traversal to the leaf cells will only apply to cat1 or cat2 respectively. The cells for the other catalog will use the normal criterion based on the bin_slop and angle_slop parameters to decide whether it is acceptable to use a non-leaf cell or to continue traversing the tree.