Help for explaining the output #1

Open
ttsesm opened this issue Mar 13, 2021 · 3 comments

@ttsesm

ttsesm commented Mar 13, 2021

Hi, I was trying your code for heavy-tailed distributions. In my case I have a dataset that is heavily tailed, so I fitted a sample of it with the code from this repository and got the following output:

```
xmin=125	Power Law	1.44857249	116024
xmin=125	Exponential	0.00078316	106812
.../scipy/optimize/optimize.py:282: RuntimeWarning: Values in x were outside bounds during a minimize step, clipping to bounds
  warnings.warn("Values in x were outside bounds during a "
xmin=125	Log-Normal	7.01483688 1.01	106695
xmin=125	Pairwise Power Law	1e-06 3.5953322 0.07234651	105206
xmin=125	Power Law With Exp Cutoff	2.0 0.0001	122655
xmin=125	Yule	1.44910929	116018
xmin=125	Poisson	1401.41099181	3986280
.../heavytailed/poisson.py:49: RuntimeWarning: overflow encountered in exp
  normfactor = 1. / np.exp(mu + np.log(1 - sp_pois.cdf(xmin - 1, mu)))
.../heavytailed/poisson.py:52: RuntimeWarning: overflow encountered in exp
  total -= np.exp(x * np.log(mu) - gammaln(x + 1)) * normfactor
.../heavytailed/poisson.py:52: RuntimeWarning: invalid value encountered in double_scalars
  total -= np.exp(x * np.log(mu) - gammaln(x + 1)) * normfactor
```

and the following graphs:

[Survival plots for each fitted model: log_normal, exponential, pairwise_power_law, poisson, power_law_with_exp_cutoff, power_law, yule]

From what I understand, the pairwise power law seems to fit the input data best, but I am not sure what that means in practice. I would appreciate any feedback you could provide.

Thanks.

@XiangwenWang
Owner

XiangwenWang commented Mar 15, 2021

Hi there,

I haven't touched the code for roughly 2 years (as I'm no longer in academia), but the last time I ran it there were no such warnings. They are probably caused by NumPy/SciPy updates. I suspect some distributions/models did not reach their optimal fits, but for now let's assume they are fine.

The purpose of this package is to check the tail behavior of a dataset, particularly its distribution; in other words, to answer the question "among the candidate models, which one best describes the tail distribution of this dataset?" The general idea is to use statistical techniques to reach a conclusion. Here I relied on AIC (the Akaike information criterion) for model selection, as suggested in "Power-law distributions in empirical data" (a well-cited paper).
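
For reference, AIC trades off goodness of fit against model complexity. A minimal sketch of the standard formula (not code from this package):

```python
def aic(max_log_likelihood, n_params):
    """Akaike information criterion: AIC = 2*k - 2*ln(L_hat),
    where k is the number of fitted parameters and L_hat is the
    maximized likelihood of the model on the data."""
    return 2 * n_params - 2 * max_log_likelihood

# e.g. a two-parameter model whose maximized log-likelihood is -1234.5
print(aic(-1234.5, 2))  # -> 2473.0
```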

In the output table, the first column shows where the tail starts; the second column is the model name; the third through the second-to-last columns are the fitted parameters of the corresponding model (a model may have several parameters). The last column is the AIC, and during model comparison that is the only column we need to focus on: the model/distribution with the smallest AIC is the best-fitting one. In this demo, "Pairwise Power Law" has the smallest AIC, 105206, so we would call it the best model (among the candidates) for describing the tail of the sample dataset.
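
As a minimal illustration of that selection step, using the AIC values from the last column of your output table:

```python
# AIC values copied from the last column of the output table above
aics = {
    "Power Law": 116024,
    "Exponential": 106812,
    "Log-Normal": 106695,
    "Pairwise Power Law": 105206,
    "Power Law With Exp Cutoff": 122655,
    "Yule": 116018,
    "Poisson": 3986280,
}

best = min(aics, key=aics.get)
print(best, aics[best])  # -> Pairwise Power Law 105206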

But there is a crucial problem: what if none of the candidates is correct (or at least a good approximation)? That is why I also plot the fitted distributions against the data for verification. Usually we would draw histograms for this purpose, but since we focus on fat tails, histograms can be noisy and misleading. Instead, I plot the survival function, 1 - CDF(x), which is smoother and clearer. From these plots we can confirm that the pairwise power law is the best model among the candidates, and that it provides a good approximation based on visual inspection.
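
If you want to reproduce that kind of check yourself, here is a rough, generic sketch of plotting the empirical survival function on log-log axes (this is not the package's plotting code, and the Pareto sample is only a stand-in for your own tail data):

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in data: a Pareto-like sample; replace with your own tail (x >= xmin)
rng = np.random.default_rng(0)
sample = 125 * (1 + rng.pareto(1.45, size=10_000))

x = np.sort(sample)
survival = 1.0 - np.arange(len(x)) / len(x)   # empirical P(X >= x)

plt.loglog(x, survival, ".", markersize=2, label="data")
plt.xlabel("x")
plt.ylabel("1 - CDF(x)")
plt.legend()
plt.show()
```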

We could replace the visual inspection with statistical tests (such as the K-S test), but since I focused on analyzing empirical "big" data, which are influenced by many factors (resulting in complex behaviors such as small bumps in the distribution), the fits usually cannot pass such tests. So visual inspection remains the most practical option for the verification step.
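
For completeness, such a K-S check against a fitted model could look roughly like this (an illustrative, assumed example testing an exponential tail with SciPy, not part of this package's workflow; the generated data and the rate `lam` are placeholders):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
lam = 0.00078                                       # e.g. the fitted exponential rate above
tail = rng.exponential(scale=1 / lam, size=5_000)   # stand-in for the real tail data

stat, pvalue = stats.kstest(tail, "expon", args=(0, 1 / lam))
print(stat, pvalue)   # a tiny p-value would reject the fitted model
```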

If none of the candidate models passes this rough visual inspection, we probably want to add a customized model to the candidates.

Hope that helps.

BTW, I think some of the statements above are in the appendix of one of the papers listed in the readme (the first one, I believe), which is why I neglected to include them in the readme file. Sorry about that.

@ttsesm
Author

ttsesm commented Mar 19, 2021

@XiangwenWang thanks for the feedback, it helps a lot; the numbers and the output graphs are much easier to understand now. I will also take a look at the papers (thanks for the links). If I still have any questions, I will let you know.

@XiangwenWang
Owner

You are welcome!
