.. _pdfdistmet: ************ PDF Distance ************ See :ref:`the tutorial ` for a description of PDFs. There are multiple ways to define the distance between PDFs. Two of the metrics are non-parametric: 1. The `Hellinger distance `_ between the PDFs (computed over the same set of bins): .. math:: d_{\rm Hellinger}(p_1,p_2) = \frac{1}{\sqrt{2}}\left\{\sum_{\tilde{I}} \left[ \sqrt{p_1(\tilde{I})} - \sqrt{p_{2}(\tilde{I})} \right]^2\right\}^{1/2}. where :math:`p_i` are the histogram values at the bin :math:`\tilde{I}`. 2. The `Kolmogorov-Smirnov Distance `_ between the ECDFs of the PDFs: .. math:: d_{\rm KS}(P_1, P_2) = {\rm sup} \left| P_1(\tilde{I}) - P_2(\tilde{I}) \right| where :math:`P_i` is the ECDF at the value :math:`\tilde{I}`. There is also one parametric distance metric included in `~turbustat.statistics.PDF_Distance`: the t-statistic of the difference in the fitted log-normal widths: .. math:: d_{\rm LN} = \frac{\left| w_1 - w_2 \right|}{\sqrt{\sigma_{w_1}^2 + \sigma_{w_1}^2}} where :math:`w_i` is the width of the log-normal distribution fit. More information on the distance metric definitions can be found in `Koch et al. 2017 `_. Using ----- **The data in this tutorial are available** `here `_. We need to import the `~turbustat.statistics.PDF_Distance` class, along with a few other common packages: >>> from turbustat.statistics import PDF_Distance >>> from astropy.io import fits >>> import matplotlib.pyplot as plt And we load in the two data sets. `~turbustat.statistics.PDF_Distance` can be given two 2D images or cubes. For this example, we will use two integrated intensity images:: >>> moment0 = fits.open(osjoin(data_path, "Design4_flatrho_0021_00_radmc_moment0.fits"))[0] # doctest: +SKIP >>> moment0_fid = fits.open(osjoin(data_path, "Fiducial0_flatrho_0021_00_radmc_moment0.fits"))[0] # doctest: +SKIP These two images are given as the inputs to `~turbustat.statistics.PDF_Distance`. Other parameters can be set here, including the minimum images values to be included in the histograms (`min_val1`/`min_val2`), whether to fit a log-normal distribution (`do_fit`), and what type of normalization to use on the data (`normalization_type`; see the :ref:`PDF tutorial `): >>> pdf = PDF_Distance(moment0_fid, moment0, min_val1=0.0, min_val2=0.0, ... do_fit=True, normalization_type=None) # doctest: +SKIP This will create and run two `~turbustat.statistics.PDF` instances using a common set of bins for the histograms. These can be accessed as `~turbustat.statistics.PDF_Distance.pdf1` and `~turbustat.statistics.PDF_Distance.pdf2`. To calculate the distances, we run: >>> pdf.distance_metric(verbose=True) # doctest: +SKIP Optimization terminated successfully. Current function value: 6.335450 Iterations: 36 Function evaluations: 72 Optimization terminated successfully. Current function value: 6.007851 Iterations: 34 Function evaluations: 69 Likelihood Results ============================================================================== Dep. Variable: y Log-Likelihood: -1.0380e+05 Model: Likelihood AIC: 2.076e+05 Method: Maximum Likelihood BIC: 2.076e+05 Date: Wed, 14 Nov 2018 Time: 09:58:10 No. Observations: 16384 Df Residuals: 16382 Df Model: 2 ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ par0 0.4553 0.003 181.019 0.000 0.450 0.460 par1 299.8377 1.067 281.114 0.000 297.747 301.928 ============================================================================== Likelihood Results ============================================================================== Dep. Variable: y Log-Likelihood: -98433. Model: Likelihood AIC: 1.969e+05 Method: Maximum Likelihood BIC: 1.969e+05 Date: Wed, 14 Nov 2018 Time: 09:58:10 No. Observations: 16384 Df Residuals: 16382 Df Model: 2 ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ par0 0.4360 0.002 181.019 0.000 0.431 0.441 par1 225.6771 0.769 293.602 0.000 224.171 227.184 ============================================================================== .. image:: images/pdf_distmet.png This returns a summary of the log-normal fits (if `do_fit=True`) and a plot of the PDF and ECDF of each data set. The solid lines in the plot are the fitted distributions. By default, all three distance metrics are run. For these images, the distances are: >>> pdf.hellinger_distance # doctest: +SKIP 0.23007068347013115 >>> pdf.ks_distance # doctest: +SKIP 0.24285888671875 >>> pdf.lognormal_distance # doctest: +SKIP 5.561198154785891 Each distance metric can be run separately by running its function in `~turbustat.statistics.PDF_Distance`, or by setting the `statistic` keyword in `~turbustat.statistics.PDF_Distance.distance_metric`. Because of the Hellinger distance requires that the PDF histograms have the same bins, there is no input to give a pre-computed fiducial `~turbustat.statistics.PDF`, unlike most of the other distance metric classes. References ---------- `Boyden et al. 2016 `_ `Koch et al. 2017 `_ `Boyden et al. 2018 `_