diff --git a/docs/source/user_manual/installation.rst b/docs/source/user_manual/installation.rst index ed39f9e9..6db9a55d 100644 --- a/docs/source/user_manual/installation.rst +++ b/docs/source/user_manual/installation.rst @@ -17,18 +17,18 @@ GADMA requires the following dependencies (`requirements/minimal.txt`): * Python3 * NumPy (>= 1.2.0) * Scipy (>= 0.6.0) -* ruamel.yaml +* ruamel.yaml (<0.18.0) * ``dadi`` (>= 1.7.0) * ``moments`` (>= 1.0.0) * ``momi`` -* ``moments.LD`` (manual installation of ``moments`` with `--ld-extension` flag) +* ``moments.LD`` (is installed alongside with ``moments``) * nlopt (for ``dadi``) * Cython (for ``moments``) * mpmath (for ``moments``) To draw demographic models one should also install the following packages (`requirements/minimal.txt`): -* matplotlib (>= 0.98.1) +* matplotlib (>= 0.98.1, <3.5) * Pillow (>= 4.2.1) - optional * ``moments`` (>= 1.0.0) @@ -51,11 +51,10 @@ To run Bayesian optimization `smac` of specified version is requered (`requireme ``momi`` package sometimes is not installed correctly for Windows and MacOS. If ``momi`` is not available please install it manually following the installation instructions in `momi's manual `_. .. note:: - ``momentsLD`` - the extension of ``moments``, should be installed manually following the installation instructions in `moments's manual `_. + ``momentsLD`` - the extension of ``moments``, it is installed together with ``moments``. Getting help for engine installation ------------------------------------ - If there are some troubles installing the engine, please, first of all check the table below for the ability to install this engine on your system. You are always welcome to `open an issue `_ on GitHub for getting help. GADMA has automatic tests on GitHUb for engines on different systems (Linux, Windows, MacOS). The following table indicates (according to our tests) if engine could be installed on specified system: @@ -88,7 +87,7 @@ GADMA has automatic tests on GitHUb for engines on different systems (Linux, Win Installing the latest release ------------------------------ -The latest release of GADMA is easily installed via ``pip`` or ``conda`` (``bioconda``): +The latest release of GADMA can be easily installed via ``pip`` or ``conda`` (``bioconda``): .. code-block:: console @@ -114,6 +113,35 @@ or $ conda install -c bioconda moments +Troubleshooting +--------------- + +If you experience problems with dependencies, we recommend to create an empty `conda environment `_: + +.. code-block:: console + + $ conda create -n gadma_env python=3.10 + $ conda activate gadma_env + +It is possible to install versions that are used for testing by downloading file `minimal.txt` from `here `_ and install requirements using: + +.. code-block:: console + + $ pip install -r minimal.txt + $ pip install gadma + +For **MacOS with M processor** we suggest the following recipe (credit to `@Enricobazzi `_): + +.. code-block:: console + + $ pip install git+https://github.com/MomentsLD/moments.git + $ conda install -c conda-forge dadi + $ conda install -c conda-forge scikit-allel + $ pip install gadma + $ pip uninstall ruamel.yaml + $ pip install "ruamel.yaml<0.18.0" + $ pip uninstall matplotlib + $ pip install "matplotlib<3.5" Manual installation ----------------------------- diff --git a/docs/source/user_manual/set_model/set_model.rst b/docs/source/user_manual/set_model/set_model.rst index 90dbfb22..46e82568 100644 --- a/docs/source/user_manual/set_model/set_model.rst +++ b/docs/source/user_manual/set_model/set_model.rst @@ -23,7 +23,7 @@ Types of demographic models GADMA could infer two base types of demographic models: -* `Demographic model with structire `__ (up to 3 populations). It is a more flexible model type as dynamics of population size change could be inferred and it has a lot of options for parameters. +* `Demographic model with structure `__ (up to 3 populations). It is a more flexible model type as dynamics of population size change could be inferred and it has a lot of options for parameters. .. admonition:: Related options @@ -42,8 +42,11 @@ GADMA could infer two base types of demographic models: * ``Inbreeding`` infers inbreeding coefficients (only for ``dadi`` engine). * ``Selection`` infers selection coefficients. * ``Ancestral size as parameter`` disables multinomial approach of ``dadi`` and ``moments`` engines when ancestral size is inferred implicitly. + * ``Lower bound of first split`` limits lower bound of the most ancient split. * ``Upper bound of first split`` limits upper bound of the most ancient split. - * ``Upper bound of second split`` limits upper bounds of next to the most ancient split. + * ``Lower bound of second split`` limits lower bound of the next to the most ancient split. + * ``Upper bound of second split`` limits upper bound of the next to the most ancient split. + * `Custom demographic model `__. It is a usual user-specified model like in ``dadi``, ``moments`` and other tools for demographic inference. Using such a model will give more control over parameters and could be used for inference of more than 3 populations but is less flexible. diff --git a/docs/source/user_manual/set_model/set_model_struct.rst b/docs/source/user_manual/set_model/set_model_struct.rst index d4e0fc8a..bda666fb 100644 --- a/docs/source/user_manual/set_model/set_model_struct.rst +++ b/docs/source/user_manual/set_model/set_model_struct.rst @@ -200,16 +200,28 @@ Split could be set in two ways: # param file Split fractions: True # for 1) point -Upper bound of split -_____________________ +Upper and lower bounds of splits +________________________________ -To limit time of some split one should specify an option in the parameter file. Splits are numbered from the most ancient. So split 1 is split that occurred with the ancient population and split 2 is the next division of the second population (exist only for three populations). There are two appropriate options: ``Upper bound of first split`` and ``Upper bound of second split``. +It is possible to limit time of split events in the demographic model with structure. In order to do that one should specify one or multiple options in the parameter file that refer to lower and upper bounds of split events. Splits are numbered from the most ancient, so split 1 is a split event that occurred with the ancient population and split 2 is the next division of the second population (exist only for three populations). There are three options corresponded to split times: ``Lower bound of first split``, ``Upper bound of first split``, ``Lower bound of second split`` and ``Upper bound of second split``. -One should translate time from years into genetic units, therefore divide it by ``2 * T_g``, where ``T_g`` is time (in years) for one generation. For example, one wants to limit the last split to 2000 years. Time for one generation is estimated as 24 years, then one should specify in the parameter file: +Bounds should be specified in GENERATIONS. In order to translate time from years to generations, divide it by ``T_g``, where ``T_g`` is time (in years) for one generation. For example, assume we want the last split to be between 1000 and 2000 years. Time for one generation is estimated to be 24 years. Therefore we construct the following parameter file: .. code-block:: none # param_file ... - Upper bound of second split : 41.666 + Lower bound of second split : 41.666 + Upper bound of second split : 83.333 + ... + + It is allowed to set any of those four options, just make sure they make sense. It is possible to set only one bound or one lower and one upper bounds for different splits: + + .. code-block:: none + + # param_file + ... + # In that particular case upper bound for the second split exists automatically + Upper bound of first split : 30 + Lower bound of second split : 10 ... diff --git a/example_params b/example_params index 88378912..79d1ceee 100644 --- a/example_params +++ b/example_params @@ -12,7 +12,7 @@ # If it is resumed from other directory and output directory # isn't set, GADMA will add '_resumed' for previous output # directory. -Output directory: my_example_run +Output directory: my_example_run_2 #!!! @@ -224,22 +224,31 @@ Inbreeding: False # Default: False Ancestral size as parameter: False -# It is possible to limit the time of splits. +# It is possible to limit the time of splits by bounds' specification. # Split 1 is the most ancient split. -# !Note that time is in genetic units (2 * time for 1 generation): +# !Note that time is in generations: # e.g. we want to limit by 150 kya, time for one generation is -# 25 years, then bound will be 150000 / (2*25) = 3000. +# 25 years, then bound will be 150000 / 25 = 6000. +# +# Lower bound for split 1 (in case of 2 or 3 populations). +# Default: None +Lower bound of first split: Null # # Upper bound for split 1 (in case of 2 or 3 populations). # Default: None Upper bound of first split: Null +# Lower bound for split 2 (in case of 3 populations). +# Default: None +Lower bound of second split: Null +# # Upper bound for split 2 (in case of 3 populations). # Default: None Upper bound of second split: Null + #!!! # Local optimization. # diff --git a/gadma/cli/params_template b/gadma/cli/params_template index 8146f4d1..67091d55 100644 --- a/gadma/cli/params_template +++ b/gadma/cli/params_template @@ -224,22 +224,31 @@ Inbreeding : # Default: False Ancestral size as parameter : -# It is possible to limit the time of splits. +# It is possible to limit the time of splits by bounds' specification. # Split 1 is the most ancient split. -# !Note that time is in genetic units (2 * time for 1 generation): +# !Note that time is in generations: # e.g. we want to limit by 150 kya, time for one generation is -# 25 years, then bound will be 150000 / (2*25) = 3000. +# 25 years, then bound will be 150000 / 25 = 6000. +# +# Lower bound for split 1 (in case of 2 or 3 populations). +# Default: None +Lower bound of first split : # # Upper bound for split 1 (in case of 2 or 3 populations). # Default: None Upper bound of first split : +# Lower bound for split 2 (in case of 3 populations). +# Default: None +Lower bound of second split : +# # Upper bound for split 2 (in case of 3 populations). # Default: None Upper bound of second split : + #!!! # Local optimization. # diff --git a/gadma/cli/settings.py b/gadma/cli/settings.py index 1221ac6e..265e97ef 100644 --- a/gadma/cli/settings.py +++ b/gadma/cli/settings.py @@ -66,7 +66,10 @@ # Time bounds upper_bound_of_first_split = None +lower_bound_of_first_split = None upper_bound_of_second_split = None +lower_bound_of_second_split = None + # Glocal optimizer global_optimizer = "Genetic_algorithm" diff --git a/gadma/cli/settings_storage.py b/gadma/cli/settings_storage.py index 26728d73..c301e9cc 100644 --- a/gadma/cli/settings_storage.py +++ b/gadma/cli/settings_storage.py @@ -198,7 +198,9 @@ class SettingsStorage(object): _float_attrs = ['theta0', 'time_for_generation', 'eps', 'const_of_time_in_drawing', 'vmin', 'min_n', 'max_n', 'min_t', 'max_t', 'min_m', 'max_m', + 'lower_bound_of_first_split', 'upper_bound_of_first_split', + 'lower_bound_of_second_split', 'upper_bound_of_second_split', 'const_for_mutation_strength', 'const_for_mutation_rate', @@ -575,7 +577,30 @@ def __setattr__(self, name, value): continue if len(getattr(self, attr_name)) != value: raise ValueError(f"Length of {attr_name} should be equal " - f"to {self.number_of_populations}.") + f"to {value}.") + if value < 2: + if self.lower_bound_of_first_split is not None: + raise ValueError( + "Do not specify `Lower bound of first split` as there" + f" is no split for 1 population" + ) + if self.upper_bound_of_first_split is not None: + raise ValueError( + "Do not specify `Upper bound of first split` as there" + f" is no split for 1 population" + ) + if value < 3: + if self.lower_bound_of_second_split is not None: + raise ValueError( + "Do not specify `Lower bound of second split` as there" + f" is no second split for {value} populations" + ) + if self.upper_bound_of_second_split is not None: + raise ValueError( + "Do not specify `Upper bound of second split` as there" + f" is no second split for {value} populations" + ) + # 3.7 If we set fractions or size of generation then we create/update # GA options that depend on these values. elif name == 'fractions' or name == 'size_of_generation': @@ -1240,24 +1265,45 @@ def get_engine_args(self, engine_id=None): def get_linear_constrain_for_model(self, model): """ - Returns linear constrain for model based of setted upper bound of - splits. NOT WORKING. + Returns linear constrain for model based of setted upper and lower + bounds of splits. """ if (self.upper_bound_of_first_split is None and - self.upper_bound_of_second_split is None): + self.lower_bound_of_first_split is None and + self.upper_bound_of_second_split is None and + self.lower_bound_of_second_split is None): return None A = list() ub = list() - if (self.upper_bound_of_first_split is not None): + lb = list() + # check for the first split + if (self.upper_bound_of_first_split is not None or + self.lower_bound_of_first_split is not None): A1, b1 = model.get_involved_for_split_time_vars(1) A.append(A1) - ub.append(self.upper_bound_of_first_split / 2 - b1) - if (self.upper_bound_of_second_split is not None): + if self.upper_bound_of_first_split is not None: + # we divide by 2 as time is in 2N units and we aware of this 2 + ub.append(self.upper_bound_of_first_split / 2 - b1) + else: + ub.append(np.inf) + if self.lower_bound_of_first_split is not None: + lb.append(self.lower_bound_of_first_split / 2 - b1) + else: + lb.append(-np.inf) + # check for the first split + if (self.upper_bound_of_second_split is not None or + self.lower_bound_of_second_split is not None): A2, b2 = model.get_involved_for_split_time_vars(2) A.append(A2) - ub.append(self.upper_bound_of_second_split / 2 - b2) - lb = - np.inf * np.ones(len(ub)) - return LinearConstrain(np.array(A), np.array(lb), np.array(ub)) + if self.upper_bound_of_second_split is not None: + ub.append(self.upper_bound_of_second_split / 2 - b2) + else: + ub.append(np.inf) + if self.lower_bound_of_second_split is not None: + lb.append(self.lower_bound_of_second_split / 2 - b2) + else: + lb.append(-np.inf) + return LinearConstrain(A, lb, ub) def get_model(self): """ @@ -1687,6 +1733,61 @@ def is_valid(self): labels) = check_and_return_projections_and_labels( self.data_holder) + # check for lower and upper bounds of population splits + # Check that bounds are specified when they are allowed + if (hasattr(self, "number_of_populations") and + self.number_of_populations < 2): + if self.lower_bound_of_first_split is not None: + raise ValueError( + "Do not specify `Lower bound of first split` as there is" + f" no split for 1 population" + ) + if self.upper_bound_of_first_split is not None: + raise ValueError( + "Do not specify `Upper bound of first split` as there is" + f" no split for 1 population" + ) + + if (hasattr(self, "number_of_populations") and + self.number_of_populations < 3): + if self.lower_bound_of_second_split is not None: + raise ValueError( + "Do not specify `Lower bound of second split` as there" + f" is no second split for {self.number_of_populations}" + " populations" + ) + if self.upper_bound_of_second_split is not None: + raise ValueError( + "Do not specify `Upper bound of second split` as there" + f" is no second split for {self.number_of_populations}" + " populations" + ) + # Check that values are not contraversal + if (self.lower_bound_of_first_split is not None and + self.upper_bound_of_first_split is not None): + if (self.lower_bound_of_first_split > + self.upper_bound_of_first_split): + raise ValueError( + "`Lower bound of first split` should be less than `Upper" + " bound of first split`" + ) + if (self.lower_bound_of_second_split is not None and + self.upper_bound_of_second_split is not None): + if (self.lower_bound_of_second_split > + self.upper_bound_of_second_split): + raise ValueError( + "`Lower bound of second split` should be less than `Upper" + " bound of second split`" + ) + if (self.upper_bound_of_first_split is not None and + self.lower_bound_of_second_split is not None): + if (self.upper_bound_of_first_split < + self.lower_bound_of_second_split): + raise ValueError( + "`Lower bound of second split` should be less than `Upper" + " bound of first split`" + ) + def print_citations(self): """ Prints neccessary citations according to current settings. diff --git a/gadma/code_generator/dadi_generator.py b/gadma/code_generator/dadi_generator.py index 1706fbb6..b1f90909 100644 --- a/gadma/code_generator/dadi_generator.py +++ b/gadma/code_generator/dadi_generator.py @@ -230,9 +230,15 @@ def _print_dadi_simulation(): return ret_str -def _print_ll(engine, mode='dadi'): +def _print_ll(engine, fixed_theta=None, mode='dadi'): if engine.multinom: - ret_str = f"ll_model = {mode}.Inference.ll_multinom(model, data)\n" + if fixed_theta is not None: + ret_str = "# we use fixed theta as we want to satisfy model "\ + "limitations (e.g. bounds on time splits)\n" + ret_str += f"theta = {fixed_theta}\n" + ret_str += f"ll_model = {mode}.Inference.ll(theta * model, data)\n" + else: + ret_str = f"ll_model = {mode}.Inference.ll_multinom(model, data)\n" else: ret_str = f"ll_model = {mode}.Inference.ll(theta * model, data)\n" ret_str += "print('Model log likelihood (LL(model, data)): "\ @@ -252,22 +258,39 @@ def _print_main(engine, values, mode='dadi', nanc=None): different to optimal. """ ret_str = "" + + mu_is_val = engine.model.mutation_rate is not None + data_holder_is_val = engine.data_holder is not None + seq_len_is_val = engine.data_holder.get_total_sequence_length() is not None + + have_theta0 = engine.model.theta0 is not None + mu_and_L = mu_is_val and data_holder_is_val and seq_len_is_val + if engine.multinom: - ret_str += _print_ll(engine, mode) - ret_str += f"\ntheta = {mode}.Inference.optimal_sfs_scaling"\ - "(model, data)\n" + fixed_theta = None + if nanc is not None and engine.model.linear_constrain is not None: + if have_theta0: + fixed_theta = nanc * engine.model.theta0 + else: + assert mu_and_L + L = engine.data_holder.get_total_sequence_length() + fixed_theta = nanc * 4 * engine.model.mutation_rate * L + ret_str += _print_ll(engine, fixed_theta, mode) + th_str = f"theta = {mode}.Inference.optimal_sfs_scaling"\ + "(model, data)\n" + if fixed_theta is not None: + ret_str += "\n#Uncomment the next line to obtain the true optimal"\ + f" value of theta\n#{th_str}" + else: + ret_str += f"\n{th_str}" ret_str += "print('Optimal value of theta: {0}'.format(theta))\n\n" if nanc is not None: ret_str += f"Nanc = {nanc}\n" - mu_is_val = engine.model.mutation_rate is not None - data_holder_is_val = engine.data_holder is not None - seq_len_is_val = engine.data_holder.get_total_sequence_length() is not None - mu_and_L = mu_is_val and data_holder_is_val and seq_len_is_val - if engine.model.theta0 is not None or mu_and_L: + if have_theta0 or mu_and_L: if engine.model.theta0 is not None: ret_str += f"theta0 = {engine.model.theta0}\n" ret_str += "Nanc = int(theta / theta0)\n" diff --git a/gadma/engines/dadi_moments_common.py b/gadma/engines/dadi_moments_common.py index 27fca696..12c17e7d 100644 --- a/gadma/engines/dadi_moments_common.py +++ b/gadma/engines/dadi_moments_common.py @@ -89,54 +89,29 @@ def _get_theta_from_sfs(self, values, model_sfs): theta_ub = theta0 * self.model.linear_constrain.ub / Ax if np.any(theta_lb > theta_ub): inds = np.where(theta_lb > theta_ub) - raise ValueError(f"Lower bounds for {inds} constrains in model" - f" are greater than upper bounds." - f"Please check linear constrain:\n" - f"{self.model.linear_constrain}\n" - f"and values:\n{values}.") + raise ValueError( + f"Lower bounds for {inds} constrains in model are greater " + f"than upper bounds. Please check linear constrain:\n" + f"{self.model.linear_constrain}\nand values:\n{values}." + ) theta_upper_lb = max(theta_lb) theta_lower_ub = min(theta_ub) if theta_upper_lb > theta_lower_ub: - raise ValueError(f"Upper lower bound ({upper_lb}) in greater than" - f" lower upper bound ({lower_ub}). Please check " - f" linear constrain:\n" - f"{self.model.linear_constrain}\n" - f"and values:\n{values}.") + # this means that there is no theta that will satisfy linear + # constrain, we return None + return None optimal_theta = max(optimal_theta, theta_upper_lb) optimal_theta = min(optimal_theta, theta_lower_ub) return optimal_theta -# for i, (lb, ub, val) in enumerate(zip(self.model.linear_constrain.lb, -# self.model.linear_constrain.ub, -# Ax): -# if lb > ub: -# raise ValueError(f"Lower bound for {i} constrain in model for" -# f" values {values} is greater than upper " -# f"bound: {val}, [{low_bound}, {upp_bound}]." -# f"Please check linear constrain:\n" -# f"{self.model.linear_constrain}\n" -# f"and values:\n{values}.") -# if lb is not None: -# if upper_lb is None: -# upper_lb = lb / val -# upper_lb = max(upper_lb, lb / val) -# if ub is not None: -# if lower_ub is None: -# lower_ub = ub / val -# lower_ub = max(lower_ub, ub / val) -# if upper_lb > lower_ub: -# raise ValueError(f"Upper lower bound ({upper_lb}) in greater than" -# f" lower upper bound ({lower_ub}). Please check " -# f" linear constrain:\n" -# f"{self.model.linear_constrain}\n" -# f"and values:\n{values}.") -# optimal_theta = max(optimal_theta, lower_ub) def get_theta(self, values, grid_sizes): key = self._get_key(values, grid_sizes) if key not in self.saved_add_info: - warnings.warn("Additional evaluation for theta. Nothing to worry " - "if this warning is seldom.") + warnings.warn( + "Additional evaluation for theta. " + "Nothing to worry if this warning is seldom." + ) self.evaluate(values, grid_sizes) theta = self.saved_add_info[key]["theta"] return theta @@ -278,6 +253,12 @@ def evaluate(self, values, grid_sizes): # If we have some constrains theta is chosen so that they are # satisfied, otherwise it is optimal_sfs_scaling theta = self._get_theta_from_sfs(values_gen, model_sfs) + if theta is None: + # we can get None if we failed to satisfy model constrain + assert self.model.linear_constrain is not None + key = self._get_key(values, grid_sizes) + self.saved_add_info[key] = {"theta": None, "failed_f": failed_f} + return None # The next line usually works like ll_multinom, but with theta # evaluated sometimes in a different way