Include SMCSMC #34

Open. Chris1221 wants to merge 1 commit into master.

Chris1221

Hello everyone,

I'm a PhD student working with Gerton Lunter, and here I've added our algorithm smcsmc to the analysis pipeline. As the input format is similar to msmc, there is not a lot of additional overhead needed.

Details about the algorithm may be found here and the implementation is (just released) here.

I have been able to successfully run through the whole pipeline with five replicates of the Gutenkunst model, with comparable accuracy to msmc. Here's a plot with the Gutenkunst model in black, looking at the 2nd (European-acting) population. I ran this through for both 2 and 4 haploids.

image

I'm working on getting smcsmc on conda-forge but for the moment the easiest way to install it is manually. Since I'm working on getting it up, I haven't included any other manual install scripts like msmc.

Changes Summary

  • Created a new set of rules for smcsmc, essentially duplicating the rules for the msmc analysis with different scripts. The structure is the same: the overall rule is compound_smcsmc, which creates a plot. I've added a guide to this plot, which is slightly different from the msmc one.
    • I've accordingly added fields to the config.json and put functional ones in the n_t/README.md
  • Added plotting scripts to n_t/plots.py for plotting replicates of smcsmc.
  • (Untested) Added smcsmc plots to plots.plot_all_ne_estimates. I haven't had a chance to run the entire pipeline, so this is not tested yet.

Other

  • I've created a separate submodule within our python package for PopSim related code. This is probably best packaged here, but I haven't figured out the best way to do this yet. One of these functions is found in the Snakefile and the submodule is just smcsmc.popsim. Right now it's pretty basic.
  • Working with a list of haplotype numbers works fine (like MSMC), as smcsmc can handle any number of haplotypes (though performance-wise, 8 is a soft cap).
  • I haven't tried this with any non-human simulations, nor have I adapted it for the 2 population case yet. These are on my TODO list.
  • I also fixed a very small bug in the Snakefile where the output was being put into the config file directory rather than the specified output directory in the config file. Just a typo.

Thanks for putting together such an excellent initiative. Gerton and I are hoping to join in the call today and discuss this with everyone.

Thank you all,

Chris

@Chris1221
Author

Nice to speak with everyone on the call today -- my apologies for not finding the developer documentation earlier. I will edit this to conform to the stdpopsim format.

@andrewkern
Member

no worries at all- thanks for joining the team!

@dortegadelv

Nice chatting with everyone! My homework for the next call will be to perform demographic inferences using SMC++ under the Gutenkunst model. These results will be helpful to compare the parameter estimates from SMC++ to the ones from fastsimcoal and dadi obtained by Chris Kyriazis. SMC++ assumes no gene flow after the populations split. Therefore, I will do the Gutenkunst model analysis including and not including gene flow to see how gene flow would bias the inferences.

@ckyriazis
Contributor

Sounds good @dortegadelv - let me know if you need any help adding it to the two-population pipeline that's already in place. I should have a more finalized version up on GitHub by the end of the week.

@dortegadelv

Using the pipeline you have already developed would be great @ckyriazis ! Please let me know once you have the finalized version. Thanks!

@Chris1221 force-pushed the master branch 5 times, most recently from 65f2f8e to e9a11c5 (June 24, 2019 23:28)
@Chris1221
Author

My apologies to anyone who received a bunch of emails from this thread -- cleaning up and squashing the history required considerably more git-fu than I am comfortable with. Everything is now squashed in, and smcsmc should be able to be installed from conda for testing. I read the developer documentation, rebased, and squashed all commits into a single one.

  • It doesn't seem like the analysis repo has unit tests, so I have not included any. Please let me know if this is not the case, and I will write them accordingly.
  • It also doesn't seem like we are documenting the methods used, only the models in stdpopsim, so if you would like some documentation, again, very happy to do it just let me know.

I also had some issues getting flake8 to work with Snakemake -- and it appears that the existing code is not fully PEP 8 compliant in any case, so I haven't tried too hard to make mine so.

I think that's all, but when you have a moment please let me know what the next steps would be. I'm looking to do a whole run with the whole pipeline (so far, I've only tested and confirmed that my smcsmc bits are working, and I assume that since the rest have worked for you that they will work together) just to make sure that everything plays together nicely.

Thanks everyone, best wishes.

@andrewkern
Member

Hey @Chris1221- I just merged @jradrion's PR with the changes to generation_time. That will probably break this PR... sorry for the headache....

@andrewkern andrewkern requested a review from jradrion July 9, 2019 17:00
@jradrion
Contributor

jradrion commented Jul 9, 2019

@Chris1221 I can go ahead and review this code and implement the minor changes needed for your PR to play with this most recent merge that @andrewkern just made. No need to edit your PR at this point.

@Chris1221
Author

Thanks @andrewkern and @jradrion -- I appreciate the heads up. I've actually made a few (very minor) changes to the smcsmc API right before putting it on conda. They shouldn't affect this at all, but please let me know if you run into any trouble and I'll do my best to help out however I can.

@Chris1221
Author

Hopefully you should be able to install smcsmc on linux just with

conda config --add channels conda-forge
conda config --add channels terhorst

conda install -c luntergroup smcsmc

And I have the above code set up to run on a qsub cluster (it's just the c option in the argument dictionary) -- if you let me know what architecture your cluster system uses I might be able to add to smcsmc to be able to use that as well.

@andrewkern
Member

so I believe the channels should be all set. we'll just need to add smcsmc to the requirements.txt file in the repo

@jradrion
Contributor

jradrion commented Jul 9, 2019

Actually, @Chris1221, if it's not too much trouble, would you be able to integrate the changes for smcsmc into the most recent PR merge that Andy just made? This way I can review smcsmc once rather than needing an additional re-review. Sorry for the confusion.

@Chris1221
Author

Yep, no problem @jradrion -- I'll also rerun it just to check and make sure that everything is working before you review.

@Chris1221
Author

Sorry for the delay, the changes are merged in now. I'm just rerunning to make sure that everything is working with the current smcsmc.

@jradrion
Contributor

@Chris1221 No worries! I'll give it a run on our machine this afternoon.

@Chris1221
Author

Perfect, thanks. I can confirm that everything is working fine for me with this test config. The number of particles is set to 10, so there won't be any accuracy in this run -- I usually use 5000 or 10000 for reference.

{
    "seed" : 12345,
    "population_id" : 0,
    "num_sampled_genomes_per_replicate" : 20,  
    "num_sampled_genomes_msmc" : "2,8",
    "num_sampled_genomes_smcsmc" : "4",
    "num_smcsmc_particles": 10,
    "num_msmc_iterations" : 20,
    "num_smcsmc_iterations": 1,
    "replicates" : 1,
    "species" : "homo_sapiens",
    "model" : "GutenkunstThreePopOutOfAfrica",
    "genetic_map" : "HapmapII_GRCh37",
    "chrm_list" : "all"
}
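
As a rough illustration of how a pipeline might consume fields like these, the sketch below loads a trimmed copy of the config with Python's standard json module and splits the sample-size strings into integers. The helper name parse_sample_sizes is hypothetical (it is not taken from the Snakefile), and the sketch assumes the smcsmc field is comma-separated like the msmc one. Note that json.loads rejects a trailing comma after the last entry.

```python
import json

# Trimmed copy of the test config; only the fields used below are kept.
CONFIG_TEXT = """
{
    "num_sampled_genomes_msmc": "2,8",
    "num_sampled_genomes_smcsmc": "4",
    "num_smcsmc_particles": 10,
    "num_smcsmc_iterations": 1
}
"""


def parse_sample_sizes(value):
    """Split a comma-separated string such as "2,8" into a list of ints.

    Hypothetical helper for illustration; not part of the analysis repo.
    """
    return [int(item) for item in value.split(",") if item.strip()]


config = json.loads(CONFIG_TEXT)
msmc_sizes = parse_sample_sizes(config["num_sampled_genomes_msmc"])
smcsmc_sizes = parse_sample_sizes(config["num_sampled_genomes_smcsmc"])
print(msmc_sizes, smcsmc_sizes)
```

Keeping both fields in the same format means one helper can serve both methods.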

Also, I think you guys are using a SLURM cluster, but smcsmc is currently only set up to run on SGE since that's what we have available. I could spend some time porting it over, but don't actually have a way to test it at the moment, so you may want to remove line 341 from the modified n_t/Snakefile (which is the 'use the cluster' directive to smcsmc). I'll put a comment there in the code to make it clearer. It's a goal of mine to be able to run on any cluster system, but unfortunately I haven't gotten around to it just yet.

n_t/Snakefile Outdated
'Np': str(num_smcsmc_particles),
# Submission Parameters
'chunks': '100',
'c': '',
Author

Delete this line to not use an SGE/qsub cluster.

@jradrion
Contributor

@Chris1221 Sorry, got sidetracked yesterday and I'm just getting to this today. Can you confirm that your conda install of smcsmc did not inadvertently remove smcpp? When I attempted the install via conda install -c luntergroup smcsmc, in addition to a bunch of downgrades conda wanted to remove the following:

The following packages will be REMOVED:

  ad-1.3.2-py36_0
  smcpp-1.15.2-py36h6bb024c_0

@Chris1221
Author

That's extremely strange, do you mind posting the full output and which packages it wants to downgrade? I wonder why it wants to get rid of smcpp -- I don't mention it anywhere in the recipe.

@jradrion
Contributor

No problem! I don't have a ton of experience using conda, so let me know if I did something incorrectly.

(stdpopsim) jadrion@sesame:~/soft/popgensims/analysis/n_t_slimValidation$ conda install -c luntergroup smcsmc
Collecting package metadata: done
Solving environment: done


==> WARNING: A newer version of conda exists. <==
  current version: 4.6.14
  latest version: 4.7.5

Please update conda by running

    $ conda update -n base conda



## Package Plan ##

  environment location: /home/jadrion/miniconda3/envs/stdpopsim

  added / updated specs:
    - smcsmc


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _libgcc_mutex-0.1          |             main           3 KB
    aioeasywebdav-2.4.0        |        py37_1000          18 KB  conda-forge
    aiohttp-3.5.4              |   py37h14c3975_0         597 KB  conda-forge
    asn1crypto-0.24.0          |        py37_1003         154 KB  conda-forge
    bcolz-1.2.1                |py37h637b7d7_1001         775 KB  conda-forge
    bcrypt-3.1.6               |   py37h516909a_1          42 KB  conda-forge
    bokeh-1.2.0                |           py37_0         3.9 MB  conda-forge
    certifi-2019.6.16          |           py37_1         149 KB  conda-forge
    cffi-1.12.3                |   py37h8022711_0         218 KB  conda-forge
    chardet-3.0.4              |        py37_1003         167 KB  conda-forge
    cryptography-2.7           |   py37h72c5cf5_0         610 KB  conda-forge
    cython-0.29.12             |   py37he1b5a44_0         2.2 MB  conda-forge
    cytoolz-0.9.0.1            |py37h14c3975_1001         414 KB  conda-forge
    datrie-0.7.1               |   py37h14c3975_0         156 KB  conda-forge
    distributed-1.28.1         |           py37_0         852 KB  conda-forge
    docutils-0.14              |        py37_1001         692 KB  conda-forge
    gobject-introspection-1.58.2|py37h5503ade_1001         1.2 MB  conda-forge
    google-api-core-1.13.0     |           py37_0          79 KB  conda-forge
    googleapis-common-protos-1.6.0|           py37_0          63 KB  conda-forge
    gperftools-2.7             |       h767d802_2         2.3 MB  conda-forge
    heapdict-1.0.0             |        py37_1000           7 KB  conda-forge
    idna-2.8                   |        py37_1000         100 KB  conda-forge
    idna_ssl-1.1.0             |        py37_1000           6 KB  conda-forge
    kiwisolver-1.1.0           |   py37hc9558a2_0          86 KB  conda-forge
    markupsafe-1.1.1           |   py37h14c3975_0          26 KB  conda-forge
    matplotlib-3.1.1           |           py37_0           6 KB  conda-forge
    matplotlib-base-3.1.1      |   py37hfd891ef_0         6.6 MB  conda-forge
    mock-3.0.5                 |           py37_0          44 KB  conda-forge
    msgpack-python-0.6.1       |   py37h6bb024c_0          88 KB  conda-forge
    msprime-0.7.1              |   py37hc8e6159_0         192 KB  conda-forge
    multidict-4.5.2            |py37h14c3975_1000         141 KB  conda-forge
    numcodecs-0.6.3            |   py37hf484d3e_0         931 KB  conda-forge
    numexpr-2.6.9              |py37h637b7d7_1000         194 KB  conda-forge
    openssl-1.1.1c             |       h516909a_0         2.1 MB  conda-forge
    pandas-0.24.2              |   py37hb3f55d8_0        11.1 MB  conda-forge
    paramiko-2.6.0             |           py37_0         268 KB  conda-forge
    perl-5.26.2                |    h516909a_1006        15.4 MB  conda-forge
    pillow-6.1.0               |   py37he7afcd5_0         634 KB  conda-forge
    pomegranate-0.11.0         |   py37h9de70de_0         3.5 MB  conda-forge
    protobuf-3.8.0             |   py37he1b5a44_1         683 KB  conda-forge
    psutil-5.6.3               |   py37h516909a_0         322 KB  conda-forge
    pycparser-2.19             |           py37_1         171 KB  conda-forge
    pygraphviz-1.5             |py37h14c3975_1000         117 KB  conda-forge
    pynacl-1.3.0               |py37h14c3975_1000         1.5 MB  conda-forge
    pyopenssl-19.0.0           |           py37_0          81 KB  conda-forge
    pyqt-5.9.2                 |   py37hcca6a23_0         5.7 MB  conda-forge
    pyrsistent-0.15.3          |   py37h516909a_0          89 KB  conda-forge
    pysam-0.15.2               |   py37h4b7d16d_3         2.2 MB  bioconda
    pysocks-1.7.0              |           py37_0          26 KB  conda-forge
    pytables-3.5.2             |   py37ha1aa75f_0         1.5 MB  conda-forge
    pyyaml-5.1.1               |   py37h516909a_0         184 KB  conda-forge
    ratelimiter-1.2.0          |        py37_1000          12 KB  conda-forge
    requests-2.22.0            |           py37_1          84 KB  conda-forge
    s3transfer-0.2.1           |           py37_0          91 KB  conda-forge
    scikit-allel-1.2.1         |   py37hb3f55d8_0         1.5 MB  conda-forge
    scikit-learn-0.21.2        |   py37hcdab131_1         6.7 MB  conda-forge
    scipy-1.3.0                |   py37h921218d_0        18.8 MB  conda-forge
    sip-4.19.8                 |py37hf484d3e_1000         290 KB  conda-forge
    smcsmc-1.0                 |   py37ha8d69ae_0         4.6 MB  luntergroup
    statsmodels-0.10.0         |   py37hc1659b7_0         9.5 MB  conda-forge
    tornado-6.0.3              |   py37h516909a_0         637 KB  conda-forge
    typing_extensions-3.7.4    |           py37_0          38 KB  conda-forge
    urllib3-1.25.3             |           py37_0         187 KB  conda-forge
    wrapt-1.11.2               |   py37h516909a_0          46 KB  conda-forge
    yarl-1.3.0                 |py37h14c3975_1000         132 KB  conda-forge
    zarr-2.3.2                 |           py37_0         220 KB  conda-forge
    ------------------------------------------------------------
                                           Total:       111.1 MB

The following NEW packages will be INSTALLED:

  _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main
  boost              conda-forge/linux-64::boost-1.70.0-py37h9de70de_1
  boost-cpp          conda-forge/linux-64::boost-cpp-1.70.0-ha2d47e9_0
  gperftools         conda-forge/linux-64::gperftools-2.7-h767d802_2
  perl               conda-forge/linux-64::perl-5.26.2-h516909a_1006
  smcsmc             luntergroup/linux-64::smcsmc-1.0-py37ha8d69ae_0

The following packages will be REMOVED:

  ad-1.3.2-py36_0
  smcpp-1.15.2-py36h6bb024c_0

The following packages will be UPDATED:

  aioeasywebdav                                2.2.0-py36_0 --> 2.4.0-py37_1000
  certifi                                  2019.6.16-py36_0 --> 2019.6.16-py37_1
  cython                             0.29.10-py36he1b5a44_0 --> 0.29.12-py37he1b5a44_0
  gobject-introspec~               1.58.2-py36h2da5eee_1000 --> 1.58.2-py37h5503ade_1001
  google-api-core                             1.11.0-py36_0 --> 1.13.0-py37_0
  matplotlib                                   3.1.0-py36_1 --> 3.1.1-py37_0
  matplotlib-base                      3.1.0-py36hfd891ef_1 --> 3.1.1-py37hfd891ef_0
  openssl                                 1.1.1b-h14c3975_1 --> 1.1.1c-h516909a_0
  paramiko                                     2.5.0-py36_0 --> 2.6.0-py37_0
  pillow                               6.0.0-py36he7afcd5_0 --> 6.1.0-py37he7afcd5_0
  pomegranate            bioconda::pomegranate-0.3.7-py36_2 --> conda-forge::pomegranate-0.11.0-py37h9de70de_0
  protobuf                             3.8.0-py36he1b5a44_0 --> 3.8.0-py37he1b5a44_1
  pyrsistent                          0.15.2-py36h516909a_0 --> 0.15.3-py37h516909a_0
  python                 pkgs/main::python-3.6.8-h0371630_0 --> conda-forge::python-3.7.3-h5b0a415_0
  requests                                    2.22.0-py36_0 --> 2.22.0-py37_1
  scikit-learn                        0.21.2-py36h627018c_0 --> 0.21.2-py37hcdab131_1
  statsmodels                       0.9.0-py36h3010b51_1000 --> 0.10.0-py37hc1659b7_0
  tornado                              6.0.2-py36h516909a_0 --> 6.0.3-py37h516909a_0
  typing_extensions                         3.7.2-py36_1000 --> 3.7.4-py37_0
  urllib3                                     1.24.3-py36_0 --> 1.25.3-py37_0
  wrapt                               1.11.1-py36h516909a_0 --> 1.11.2-py37h516909a_0

The following packages will be DOWNGRADED:

  aiohttp                              3.5.4-py36h14c3975_0 --> 3.5.4-py37h14c3975_0
  asn1crypto                               0.24.0-py36_1003 --> 0.24.0-py37_1003
  bcolz                             1.2.1-py36h637b7d7_1001 --> 1.2.1-py37h637b7d7_1001
  bcrypt                               3.1.6-py36h516909a_1 --> 3.1.6-py37h516909a_1
  bokeh                                        1.2.0-py36_0 --> 1.2.0-py37_0
  cffi                                1.12.3-py36h8022711_0 --> 1.12.3-py37h8022711_0
  chardet                                   3.0.4-py36_1003 --> 3.0.4-py37_1003
  cryptography                           2.7-py36h72c5cf5_0 --> 2.7-py37h72c5cf5_0
  cytoolz                         0.9.0.1-py36h14c3975_1001 --> 0.9.0.1-py37h14c3975_1001
  datrie                               0.7.1-py36h14c3975_0 --> 0.7.1-py37h14c3975_0
  distributed                                 1.28.1-py36_0 --> 1.28.1-py37_0
  docutils                                   0.14-py36_1001 --> 0.14-py37_1001
  googleapis-common~                           1.6.0-py36_0 --> 1.6.0-py37_0
  h5py                        2.9.0-nompi_py36hf008753_1102 --> 2.9.0-nompi_py37hf008753_1102
  heapdict                                  1.0.0-py36_1000 --> 1.0.0-py37_1000
  idna                                        2.8-py36_1000 --> 2.8-py37_1000
  idna_ssl                                  1.1.0-py36_1000 --> 1.1.0-py37_1000
  jsonschema                                   3.0.1-py36_0 --> 3.0.1-py37_0
  kiwisolver                           1.1.0-py36hc9558a2_0 --> 1.1.0-py37hc9558a2_0
  markupsafe                           1.1.1-py36h14c3975_0 --> 1.1.1-py37h14c3975_0
  mock                                         3.0.5-py36_0 --> 3.0.5-py37_0
  msgpack-python                       0.6.1-py36h6bb024c_0 --> 0.6.1-py37h6bb024c_0
  msprime                              0.7.1-py36hc8e6159_0 --> 0.7.1-py37hc8e6159_0
  multidict                         4.5.2-py36h14c3975_1000 --> 4.5.2-py37h14c3975_1000
  numcodecs                            0.6.3-py36hf484d3e_0 --> 0.6.3-py37hf484d3e_0
  numexpr                           2.6.9-py36h637b7d7_1000 --> 2.6.9-py37h637b7d7_1000
  numpy                               1.16.4-py36h95a1406_0 --> 1.16.4-py37h95a1406_0
  pandas                              0.24.2-py36hb3f55d8_0 --> 0.24.2-py37hb3f55d8_0
  pip                                         19.1.1-py36_0 --> 19.1.1-py37_0
  psutil                               5.6.3-py36h516909a_0 --> 5.6.3-py37h516909a_0
  pycparser                                     2.19-py36_1 --> 2.19-py37_1
  pygraphviz                          1.5-py36h14c3975_1000 --> 1.5-py37h14c3975_1000
  pynacl                            1.3.0-py36h14c3975_1000 --> 1.3.0-py37h14c3975_1000
  pyopenssl                                   19.0.0-py36_0 --> 19.0.0-py37_0
  pyqt                                 5.9.2-py36hcca6a23_0 --> 5.9.2-py37hcca6a23_0
  pysam                               0.15.2-py36h4b7d16d_3 --> 0.15.2-py37h4b7d16d_3
  pysocks                                      1.7.0-py36_0 --> 1.7.0-py37_0
  pytables                             3.5.2-py36ha1aa75f_0 --> 3.5.2-py37ha1aa75f_0
  pyyaml                               5.1.1-py36h516909a_0 --> 5.1.1-py37h516909a_0
  ratelimiter                               1.2.0-py36_1000 --> 1.2.0-py37_1000
  s3transfer                                   0.2.1-py36_0 --> 0.2.1-py37_0
  scikit-allel                         1.2.1-py36hb3f55d8_0 --> 1.2.1-py37hb3f55d8_0
  scipy                                1.3.0-py36h921218d_0 --> 1.3.0-py37h921218d_0
  setuptools                                  41.0.1-py36_0 --> 41.0.1-py37_0
  sip                              4.19.8-py36hf484d3e_1000 --> 4.19.8-py37hf484d3e_1000
  six                                      1.12.0-py36_1000 --> 1.12.0-py37_1000
  tskit                                0.1.5-py36hd352d35_0 --> 0.1.5-py37hd352d35_0
  wheel                                       0.33.4-py36_0 --> 0.33.4-py37_0
  yarl                              1.3.0-py36h14c3975_1000 --> 1.3.0-py37h14c3975_1000
  zarr                                         2.3.2-py36_0 --> 2.3.2-py37_0

@Chris1221
Author

Thanks, that's very helpful. Don't worry, it's definitely not you, I think I must have messed something up. Maybe I was too stringent with the version pinning... If you don't mind a delay, I'll take a look at this today and see if I can figure out what's going on, I'm very confused by the number of packages that it wants to downgrade, almost none of those are necessary. Sorry about this!

@jradrion
Contributor

@Chris1221 No worries, let me know when I should try the install again.

@Chris1221
Author

Would you mind trying now, @jradrion? I had not explicitly built an smcsmc conda package for Python 3.6.8, which we're all using for popsim, so I've done that now and uploaded it. When you tried to install smcsmc, conda only found a build for Python 3.7 and so switched your entire environment over to Python 3.7. You can see the Python version being changed in the update section above:

...
python                 pkgs/main::python-3.6.8-h0371630_0 --> conda-forge::python-3.7.3-h5b0a415_0
...

smcpp was being removed because there is no Python 3.7 build of smcpp, and I believe the rest of the weird changes were a consequence of this as well. It should be alright now -- I've bumped the version of smcsmc to 1.0.1 and put it up on conda with py36 and py37 variants to cover both cases. Sorry about the confusion, I should have thought of this beforehand!
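
For anyone debugging a similar install, a small sketch of how the tell-tale line could be picked out of a saved conda package plan is given below. It assumes the plan text has the same layout as the output pasted above; the function name python_minor_change is hypothetical, and this is not a conda API.

```python
import re

# Matches the version in specs such as "pkgs/main::python-3.6.8-h0371630_0".
PYTHON_SPEC = re.compile(r"python-(\d+)\.(\d+)\.\d+")


def python_minor_change(plan_line):
    """Return (old, new) "major.minor" strings if the line changes Python, else None.

    Hypothetical helper: it only understands plan lines of the form
    "python  <old spec> --> <new spec>" as printed in the log above.
    """
    fields = plan_line.split()
    if not fields or fields[0] != "python" or "-->" not in fields:
        return None
    versions = [".".join(m.groups()) for m in PYTHON_SPEC.finditer(plan_line)]
    if len(versions) == 2 and versions[0] != versions[1]:
        return versions[0], versions[1]
    return None


line = ("python                 pkgs/main::python-3.6.8-h0371630_0 "
        "--> conda-forge::python-3.7.3-h5b0a415_0")
change = python_minor_change(line)
print(change)
```

A change in the minor version here is a strong hint that every package with a py36 build string will be swapped for a py37 build, and that packages without one will be removed.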

@jradrion
Contributor

Looks great @Chris1221! This install worked and everything is looking good. I probably won't have the opportunity to fully test everything until I return from SMBE. I'll post an update when I get the chance.

@Chris1221
Author

Oh great, thanks @jradrion. Enjoy the conference, I'm attending via twitter ;)

@jradrion
Contributor

So I've been having some issues with the Snakefile hanging right around 80% completion, but it looks like the problem is with MSMC. I'll have to dig into this further, because my MSMC rules were previously working. @Chris1221 can you confirm that you were able to run the Snakefile in its entirety? The reason I ask is that I was also getting a wildcard error in the ne_files_smcsmc(wildcards) function. I needed to add samps=num_sampled_genomes_smcsmc to that function to get around this.

@Chris1221
Author

Hi @jradrion, I actually have not been able to run the entire Snakefile, though the issues I've been having are with smcpp; msmc seems to be working for me. Sorry about that -- yes, I think you're right that that needs to be added. I've been running it from the compound_smcsmc rule, which doesn't use the ne_files_smcsmc function.

Contributor

@jradrion jradrion left a comment

@Chris1221 I identified where my Snakefile was hanging, and it's actually at rule compound_smcsmc. The last message printed to the screen was INFO:smcsmc.model:Waiting for chunk 0 to appear.... Any thoughts?

@Chris1221
Author

Yep that sounds right. That message means that smcsmc is waiting for the chunks to finish running (we split it up and process in parallel). When the c flag (some detail above) is on, smcsmc tries to shoot out cluster jobs for each of the chunks, and waits until they produce the correct files for each iteration. If you turn off the c flag (by removing all of line 341 in n_t/Snakefile, I put a review next to it further up the chain here), it will run locally by multithreading instead. It may very well take a while though. Can I ask what kind of cluster you are using? I'd like to be able to have the script spitting off the jobs be smarter and detect cluster environments automatically but I don't really know that much about it.

@jradrion
Contributor

jradrion commented Aug 1, 2019

Ahh, yes I should have paid more attention to your comments. Removing line 341 does the trick. Here's the output from chr22.
smcsmc_estimated_Ne_4

@Chris1221
Author

Hooray! Okay, I'm very happy that it's working for you. The configuration file that I provided is very much a 'validating the software works' example. To get the accuracy shown in my first plots, the particle count needs to be increased to ~10k and the number of iterations to ~15-30. But I'm very happy that the software is at least running on the data and the pipeline is working on more than just my local setup. Perhaps I should take out the cluster directive, or recode it into the configuration file; would that be more appropriate? At least in that case it would run on anyone's computer, albeit slowly.
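
Recoding the cluster directive into the configuration could look roughly like the sketch below. The keys 'Np', 'chunks' and 'c' are the ones visible in the n_t/Snakefile excerpt earlier in this thread; the function name build_smcsmc_args and the use_cluster switch are hypothetical and would have to be wired to a new config.json field.

```python
def build_smcsmc_args(num_particles, use_cluster=False):
    """Build the smcsmc argument dictionary, adding the cluster flag on request.

    Hypothetical helper: only the keys 'Np', 'chunks' and 'c' come from the
    Snakefile excerpt; everything else is an assumption for illustration.
    """
    args = {
        "Np": str(num_particles),
        # Submission parameters
        "chunks": "100",
    }
    if use_cluster:
        # An empty value switches the flag on; leaving the key out means local execution.
        args["c"] = ""
    return args


local_args = build_smcsmc_args(10)
cluster_args = build_smcsmc_args(10000, use_cluster=True)
print(local_args)
print(cluster_args)
```

Defaulting to local execution keeps the pipeline runnable on any machine, and an SGE user can opt in explicitly.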

@jradrion
Contributor

jradrion commented Aug 1, 2019

It works! Of course, I understand that these are easy-to-test parameters, and that accuracy could likely be dramatically improved. I should have specified that I'm not running this on a cluster, just a Linux box with 80 cores! I think ultimately it would be nice to be able to run it on a cluster, but for now it might simplify things to remove the cluster directive.

@Chris1221 force-pushed the master branch 4 times, most recently from 197caf4 to c0b35c0 (August 1, 2019 20:01)
@Chris1221
Author

Neat! Makes sense, I've done that in the latest (squashed) commit. I also made the input format for the number of sampled genomes the same as msmc is now (separated by a comma rather than white space) and removed a comment that I left in by accident.

@jradrion
Contributor

jradrion commented Aug 1, 2019

Looks good! I'm also running into problems with the smcsmc implementation in plots.py. Are you able to debug that? I think you're going to need the output from smcpp and msmc as well. I can give it a pass if you are not currently able to generate the smcpp and msmc outputs. One issue I see is that the lines to split the sample size from your output file names will not work the same way as they do for the msmc output, as msmc files start with the sample size where yours are all labeled "results.out.csv" in separate directories that indicate the sample sizes.
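
The naming difference can be handled with two small helpers, sketched here under the assumption that msmc file names begin with the sample size and that each smcsmc results.out.csv sits in a directory whose name begins with it. The helper names and the example paths are hypothetical.

```python
import os


def msmc_sample_size(path):
    """msmc outputs: the sample size is the first dot-separated field of the file name."""
    return int(os.path.basename(path).split(".")[0])


def smcsmc_sample_size(path):
    """smcsmc outputs share one file name, so read the size from the parent directory name."""
    parent = os.path.basename(os.path.dirname(path))
    return int(parent.split(".")[0])


# Hypothetical example paths, for illustration only.
msmc_size = msmc_sample_size("output/msmc/8.trees.final.txt.csv")
smcsmc_size = smcsmc_sample_size("output/smcsmc/4.run/results.out.csv")
print(msmc_size, smcsmc_size)
```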

Contributor

@jradrion jradrion left a comment

@Chris1221 I couldn't find a way to modify your PR, but I have attached a correction to the function in question from plots.py. This now allows the plots for all methods to be output in the same pdf. This is working on my end but I'm happy to try it again when you commit.

import os

import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import msprime
import numpy as np
import pandas


def plot_all_ne_estimates(sp_infiles, smcpp_infiles, msmc_infiles, smcsmc_infiles, outfile,
                          model, n_samp, generation_time, species,
                          pop_id=0, steps=None):
    # Expected Ne trajectory, derived from the model's coalescence rates.
    ddb = msprime.DemographyDebugger(**model.asdict())
    if steps is None:
        end_time = ddb.epochs[-2].end_time + 10000
        # np.linspace needs an integer number of points.
        steps = np.linspace(1, end_time, int(end_time) + 1)
    num_samples = [0 for _ in range(ddb.num_populations)]
    num_samples[pop_id] = n_samp
    coal_rate, P = ddb.coalescence_rate_trajectory(steps=steps,
        num_samples=num_samples, double_step_validation=False)
    steps = steps * generation_time

    # msmc encodes the sample size in the file name; smcsmc encodes it in the
    # name of the directory that holds results.out.csv.
    num_msmc = sorted({int(os.path.basename(infile).split(".")[0]) for infile in msmc_infiles})
    num_smcsmc = sorted({int(infile.split("/")[-2].split(".")[0]) for infile in smcsmc_infiles})

    # One panel each for smc++ and stairwayplot, plus one per msmc/smcsmc sample size.
    n_panels = 2 + len(num_msmc) + len(num_smcsmc)
    f, ax = plt.subplots(1, n_panels, sharex=True, sharey=True, figsize=(14, 7))
    for infile in smcpp_infiles:
        nt = pandas.read_csv(infile, usecols=[1, 2], skiprows=0)
        ax[0].plot(nt['x'], nt['y'], alpha=0.8)
    ax[0].plot(steps, 1/(2*coal_rate), c="black")
    ax[0].set_title("smc++")
    for infile in sp_infiles:
        nt = pandas.read_csv(infile, sep="\t", skiprows=5)
        ax[1].plot(nt['year'], nt['Ne_median'], alpha=0.8)
    ax[1].plot(steps, 1/(2*coal_rate), c="black")
    ax[1].set_title("stairwayplot")

    plot_counter = 2
    for sample_size in num_msmc:
        for infile in msmc_infiles:
            samp = os.path.basename(infile).split(".")[0]
            if int(samp) == sample_size:
                nt = pandas.read_csv(infile, usecols=[1, 2], skiprows=0)
                ax[plot_counter].plot(nt['x'], nt['y'], alpha=0.8)
        ax[plot_counter].plot(steps, 1/(2*coal_rate), c="black")
        ax[plot_counter].set_title(f"msmc, ({sample_size} samples)")
        plot_counter += 1

    for sample_size in num_smcsmc:
        for infile in smcsmc_infiles:
            samp = infile.split("/")[-2].split(".")[0]
            if int(samp) == sample_size:
                nt = pandas.read_csv(infile, usecols=[1, 2], skiprows=0)
                ax[plot_counter].plot(nt['x'], nt['y'], alpha=0.8)
        ax[plot_counter].plot(steps, 1/(2*coal_rate), c="black")
        ax[plot_counter].set_title(f"smcsmc, ({sample_size} samples)")
        plot_counter += 1

    plt.suptitle(f"{species}, population id {pop_id}", fontsize=16)
    for i in range(n_panels):
        ax[i].set(xscale="log", yscale="log")
        ax[i].set_xlabel("time (years ago)")

    # Legend entry for the black reference line drawn in every panel.
    truth_patch = mpatches.Patch(color='black', label='Coalescence rate derived Ne')
    ax[0].legend(frameon=False, fontsize=10, handles=[truth_patch])
    ax[0].set_ylabel("population size")
    f.savefig(outfile, bbox_inches='tight')

@andrewkern
Member

a note for you guys-- you will want to git fetch upstream and then git rebase -i upstream/master as we have just merged new stuff into the master branch of analysis

@Chris1221
Author

Wow, thank you @jradrion! You're right that I'm unable to test it, but thank you for modifying the plotting code so that it works with the output. I'm sure that if it works for you then it would also work for me. I've put it in in the latest push and (@andrewkern) rebased against the current master (with mask). Everything is all squashed now as well.

Member

@andrewkern andrewkern left a comment

one minor nitpick above and a question: is SMCSMC set up to do masking in this PR? If not let's get it there before merging

"species" : "homo_sapiens",
"model" : "GutenkunstThreePopOutOfAfrica",
"genetic_map" : "HapmapII_GRCh37",
"chrm_list" : "chr22,chrX"
Member

minor nitpick: we are going to want to leave the mask_file variable in the readme

Author

Ah yes, sorry I was too zealous with the merging. I will put it back!

@Chris1221
Author

Right, sorry that slipped my mind. I think I can mostly use your msmc masking code, but I'll stare at it for a little while and try to get that working.

@andrewkern
Member

it would be great to get another set of eyes on what I implemented for msmc. here is the place to look.

@Chris1221
Author

Sorry for never coming back to this. I did actually fix the masking code shortly after our conversation; a modified and admittedly very hacky version is in the development version of smc2. I subsequently got very distracted by life and passing a PhD milestone. I'll do a little more testing and release the new version of smc2 on the weekend if that's okay, and then I believe we should be good to go here? (I saw in another PR that @andrewkern was looking to clean up the repo a little and this is very much outstanding, my apologies.)

@andrewkern
Member

no worries @Chris1221. the repo has moved a bit since you were working on this PR so be sure to fetch/merge upstream code.

@jeromekelleher
Member

Some changes upstream in stdpopsim mean that your branch won't work @Chris1221. Once the changes in #48 are merged, you should be able to rebase this branch and it should work. You haven't made any changes to the simulation code, so it should all work fine (after #48 is merged and you have rebased).

@Chris1221
Author

Thanks @jeromekelleher, sorry I missed your comment originally. I'll do my best to get to this soon, sorry for the massive delay.
