Help for explaining the output #1

Open
ttsesm opened this issue Mar 13, 2021 · 3 comments

@ttsesm

ttsesm commented Mar 13, 2021

Hi, I was trying your code for heavy-tailed distributions. In my case I have a dataset that is heavily tailed, so I fitted a sample of it with the code from this repository and got the following output:

```
xmin=125	Power Law	1.44857249	116024
xmin=125	Exponential	0.00078316	106812
.../scipy/optimize/optimize.py:282: RuntimeWarning: Values in x were outside bounds during a minimize step, clipping to bounds
  warnings.warn("Values in x were outside bounds during a "
xmin=125	Log-Normal	7.01483688 1.01	106695
xmin=125	Pairwise Power Law	1e-06 3.5953322 0.07234651	105206
xmin=125	Power Law With Exp Cutoff	2.0 0.0001	122655
xmin=125	Yule	1.44910929	116018
xmin=125	Poisson	1401.41099181	3986280
.../heavytailed/poisson.py:49: RuntimeWarning: overflow encountered in exp
  normfactor = 1. / np.exp(mu + np.log(1 - sp_pois.cdf(xmin - 1, mu)))
.../heavytailed/poisson.py:52: RuntimeWarning: overflow encountered in exp
  total -= np.exp(x * np.log(mu) - gammaln(x + 1)) * normfactor
.../heavytailed/poisson.py:52: RuntimeWarning: invalid value encountered in double_scalars
  total -= np.exp(x * np.log(mu) - gammaln(x + 1)) * normfactor
```

and the following graphs:

[Survival plots for each fitted model: log_normal, exponential, pairwise_power_law, poisson, power_law_with_exp_cutoff, power_law, yule]

From what I understand, the pairwise power law seems to fit the input data best, but I am not sure what that means in practice. I would appreciate any feedback you could provide.

Thanks.

@XiangwenWang
Owner

XiangwenWang commented Mar 15, 2021

Hi there,

I haven't touched the code for roughly 2 years (as I'm no longer in academia), but the last time I ran it there were no such warnings. They are probably caused by NumPy/SciPy updates. I suspect some distributions/models did not reach their optimal fits, but for now let's assume they are fine.

The purpose of this package is to check the tail behavior of a dataset, particularly its distribution; in other words, to answer the question "among the candidate models, which one best describes the tail distribution of this dataset?" The general idea is to use statistical techniques to reach a conclusion. Here I relied on AIC (the Akaike information criterion) for model selection, as suggested in "Power-law distributions in empirical data" (a well-cited paper).
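
For reference, AIC trades off goodness of fit against model complexity. A minimal sketch of the standard formula (not code from this package):

```python
def aic(max_log_likelihood, n_params):
    """Akaike information criterion: AIC = 2*k - 2*ln(L_hat),
    where k is the number of fitted parameters and L_hat is the
    maximized likelihood of the model on the data."""
    return 2 * n_params - 2 * max_log_likelihood

# e.g. a two-parameter model whose maximized log-likelihood is -1234.5
print(aic(-1234.5, 2))  # -> 2473.0
```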

In the output table, the first column shows where the tail starts; the second column is the model name; the third through the second-to-last columns are the fitted parameters of the corresponding model (a model may have several parameters). The last column is the AIC, and during model comparison that is the only column we need to focus on: the model/distribution with the smallest AIC is the best-fitting one. In this demo, "Pairwise Power Law" has the smallest AIC, 105206, so we would call it the best model (among the candidates) for describing the tail of the sample dataset.
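
As a minimal illustration of that selection step, using the AIC values from the last column of your output table:

```python
# AIC values copied from the last column of the output table above
aics = {
    "Power Law": 116024,
    "Exponential": 106812,
    "Log-Normal": 106695,
    "Pairwise Power Law": 105206,
    "Power Law With Exp Cutoff": 122655,
    "Yule": 116018,
    "Poisson": 3986280,
}

best = min(aics, key=aics.get)
print(best, aics[best])  # -> Pairwise Power Law 105206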

But there is a crucial problem: what if none of the candidates is correct (or at least a good approximation)? That is why I also plot the fitted distributions against the data for verification. Usually we would draw histograms for this purpose, but since we focus on fat tails, histograms can be noisy and misleading. Instead, I plot the survival function, 1 - CDF(x), which is smoother and clearer. From these plots we can confirm that the pairwise power law is the best model among the candidates, and that it provides a good approximation based on visual inspection.
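
If you want to reproduce that kind of check yourself, here is a rough, generic sketch of plotting the empirical survival function on log-log axes (this is not the package's plotting code, and the Pareto sample is only a stand-in for your own tail data):

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in data: a Pareto-like sample; replace with your own tail (x >= xmin)
rng = np.random.default_rng(0)
sample = 125 * (1 + rng.pareto(1.45, size=10_000))

x = np.sort(sample)
survival = 1.0 - np.arange(len(x)) / len(x)   # empirical P(X >= x)

plt.loglog(x, survival, ".", markersize=2, label="data")
plt.xlabel("x")
plt.ylabel("1 - CDF(x)")
plt.legend()
plt.show()
```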

We could replace the visual inspection with statistical tests (such as the K-S test), but since I focused on analyzing empirical "big" data, which are influenced by many factors (resulting in complex behaviors such as small bumps in the distribution), the fits usually cannot pass such tests. So visual inspection remains the most practical option for the verification step.
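
For completeness, such a K-S check against a fitted model could look roughly like this (an illustrative, assumed example testing an exponential tail with SciPy, not part of this package's workflow; the generated data and the rate `lam` are placeholders):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
lam = 0.00078                                       # e.g. the fitted exponential rate above
tail = rng.exponential(scale=1 / lam, size=5_000)   # stand-in for the real tail data

stat, pvalue = stats.kstest(tail, "expon", args=(0, 1 / lam))
print(stat, pvalue)   # a tiny p-value would reject the fitted model
```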

If none of the candidate models passes this rough visual inspection, we probably want to add a customized model to the candidates.

Hope that helps.

BTW, I think some of the statements above are in the appendix of one of the papers listed in the readme (the first one, I believe), which is why I neglected to include them in the readme file. Sorry about that.

@ttsesm
Author

ttsesm commented Mar 19, 2021

@XiangwenWang thanks for the feedback, it helps a lot; the numbers and the output graphs are much easier to understand now. I will also take a look at the papers (thanks for the links). If I still have any questions, I will let you know.

@XiangwenWang
Owner

You are welcome!
