Lcfit is a Python3 library for the robust regression of periodic time series. Its main intended purpose is to serve as a fast and versatile tool for the automated batch processing of a very large number of photometric light curves of periodic variable stars (e.g., pulsating stars) from large time-domain astronomical surveys such as Gaia, VVV, and OGLE. For example, this library was employed in the recent studies by Dekany et al. (2019, 2020, 2021).
If lcfit is used in an astronomical research project as provided here or in a modified form, the citation of any of the aforementioned papers in the resulting publication will be much appreciated.
-
Matern kernel is now available for Gaussian Process Regression (GPR) of the light curves. Various tests on noisy data with erratic sampling show that most of the times GPR using Matern kernel is more robust against phase gaps than the exponential sine-squared kernel. Matern is now the default kernel, but the expsine2 is still available (see the
--kernel
parameter). -
Phase shift values to reproduce the phase alignment of multiple light curves is written to the output. They can now also be read in and used as precomputed values (see the
--known_phaseshift
parameter). This feature can be used to easily phase-align multiple time series of different nature of the same object, e.g., light curves with radial velocity curves.
Simply use git clone <URL>
to clone this library into a local directory.
Subsequently, it can be added to your PYTHONPATH
either temporarily by
issuing the sys.path.append("/your/local/directory")
command in python,
or permanantly by exporting you directory into the PYTHONPATH
system variable.
For example, if using the bash shell, add the
export PYTHONPATH="${PYTHONPATH}:/your/local/directory"
in the ~/.bashrc
file.
Lcfit requires python >3.6 and the numpy, scipy, scikit-learn, scikit-optimize, matplotlib, and joblib python libraries for its full functionality. In addition, installation of the gls Fortran90 library of this repository is required for performing initial period search. The compiled Fortran binary is called by lcfit as a subprocess.
The latest version of lcfit was tested under the following python3 environment: numpy 1.18.5, scipy 1.6.3, scikit-learn 0.24.2, scikit-optimize 0.9.0, matplotlib 3.4.2, joblib 1.0.1.
Lcfit can be run using command-line arguments:
python lcfit.py [OPTION]
,
or by supplying a parameter file that includes the command-line arguments:
python lcfit.py @<parameter_file>
.
The full list of command-line options can be printed on the STDOUT by:
python lcfit.py --help
.
An example parameter file lcfit.par
is also provided for convenience.
Lcfit batch-processes a list of time series stored in separate files, specified
by the --input_list <listfile>
parameter. <listfile>
must contain the following
columns, separated by whitespaces:
identifier (string), [initial period (float)], [dataset (int)]. By default, each
time-series will be read from the file ./.dat. The subdirectory and the
file suffix can be customized by the --input_dir <dir>
and the
--input_lc_suffix <suffix>
parameters. If --do_gls
is specified, then an
initial value for the period will be computed by the
Generalized Lomb-Scargle algorithm
(using this implementation).
The time-series files should contain the following columns separated by whitespaces: time (float), value1 (float), [value_err1 (float)] [snr1 (float)] [flag1 (str)], [value2 (float), [value_err2 (float)] [snr2 (float)] [flag2 (str)]], ..., [ZP_error (float)]
The numbered columns refer to different alternative datasets (for example,
photometry of the same star acquired with different aperture sizes).
By default, a single dataset per input file is expected, this can be changed
with the --n_data_cols <number>
parameter. When the input files contain
multiple datasets, a subset of these to be considered by lcfit can be specified
by the --use_data_cols <list of numbers>
parameter.
By default, the input files are expected to contain a (time, value) column
pair for each dataset. An error column per dataset can be added by using the
--is_err_col
option; by which one can specify the statistical uncertainties
in the measurements of each dataset, and use it for sample weighing in the period
search and/or regression. In addition, a separate column with
the systematic zero-point (ZP) uncertainty (per dataset) can also be added using
the --is_zperr_col
option. This can be useful for example if the data have a
point-to-point calibration of the photometric ZP, such as in the case of
VVV. Finally, columns with the signal-to-noise ratio
of individual measurements and a measurement flag can also be added for each
dataset by using the --min_snr <snr>
and --flag_omit <flag>
options.
In this case, data with S/N<<snr>
and/or
flagged with <flag>
will be omitted from the analysis.
Lcfit performs robust non-linear regression of a truncated Fourier series in tandem with iterative outlier rejection. The optimal Fourier order (i.e., the optimal number of Fourier terms) is determined by phase-stratified k-fold cross-validation. Subsequently, Gaussian Process Regression (GPR) is performed on the resulting phase-folded time series, and the Fourier representation of the resulting best-fit GPR is provided. Errors in the derived model parameters can be estimated from the GPR model.
A simplified summary of lcfit in pseudocode:
for each datafile:
for each dataset:
initialize Period (P_init) [1]
repeat:
optimize Fourier order using P_init via CV [2]
non-linear regression with optimal Fourier order [3]
detect outliers around regression model [4]
if no outliers:
break
reassign P_init [5]
select best dataset (by highest S/N)
compute phase-folded time-series
compute GPR model and its Fourier representation [6]
compute model parameter errors from GPR model [7]
[1] The initial value for the period is either supplied by the user in the second
column of the input list file, or computed by the GLS algorithm (--do_gls
option).
[2] The number of folds can be specified by the --k_fold <number>
option.
The folds are stratified by the phase computed from the latest value of the
period. The range of possible orders can be specified by the
--fourier_order_min <min_order>
and
--fourier_order_max <max_order>
parameters.
[3] Regression is done using the trusted-region-reflective algorithm implemented
in the scipy library. The Huber loss is used by default, and can be changed with
the --loss <loss>
parameter. If the --weighted_fit
option is selected, then
sample weights proportional to the squared inverse of the uncertainties will
be used.
[4] Data points with residual values beyond n * sigma from the regression model
will be omitted from the dataset, where sigma is the residual standard deviation
of the latest model, and n can be set by the --sigma_clip_threshold <n>
parameter (default: None, i.e., no outlier rejection).
[5] If --redo_gls
is set, then the period will be reinitialized at the end of
every iteration with the GLS algorithm. This can be useful is the dataset contains
strong outliers that can severely bias the initial estimate of the period.
[6] The GPR is performed on the phase-folded time series using the implementation
provided by scikit-learn. A scaled periodic kernel (constant kernel * exponential
sine-squared kernel) is combined with a white kernel, and the noise in the dataset
is estimated from the latter. Periodic boundary conditions are created by padding
the folded time series with itself. Hyperparameter optimization is done either by
the standard method of maximizing the log-marginal likelihood
(--gpr_hparam_optimization mle
), or by maximizing the R2-score measured via
10-fold cross-validation (--gpr_hparam_optimization cv
). For the latter,
Bayesian optimization is performed for fast convergence, using the scikit-optimize
implementation. Cross-validation can be advantageous in case of low number of
noisy datapoints, that can otherwise yield to high-variance GPR models with the
standard approach. However, maximization of the log-marginal likelihood is
recommended in general, due to its fast computation.
[7] Errors in the Fourier model parameters are estimated (--compute_errors
)
by drawing random samples from the GPR model, and performing regression on
each realization. The standard deviations of the parameters cross the
realizations are returned as uncertainty estimates.
The derived regression parameters and descriptive statistics are written to
a file (specified by --output_param_file <filename>
, default: lcfit.out
,
one line per input file).
The outlier-free, phase-folded time series corresponding to the best dataset
per input file is written to a subdirecory specified by
--output_data_dir <directory_name>
. Synthetic time series (i.e., the
best-fitting regression model evaluated over an equidistant phase grid) are
written to the subdirectory specified by --output_syn_dir <directory_name>
.
If the option --output_gpr_dir <directory_name>
is provided, the GPR models
will be saved using joblib
and written to files the provided directory.
Figures showing the phase-folded time series, the rejected outliers, and the
best-fitting regression models are generated for each input time series and
written to a subdirectory specified by --plot_dir <directory_name>
.
A few example data files are provided in the test_photometry
subdirectory.
The input list file test1.lst
includes 3 RRab Lyrae stars from the OGLE-IV
survey, their corresponding data files contain a single I-band dataset with
an error column.
To analyze these with lcfit, the following options should be set in the
parameter file lcfit.par
:
--input_list test1.lst --input_dir test_photometry --n_data_cols 1 --use_data_cols 1 --is_err_col
The input list file test2.lst
comprises a single RRab Lyrae, and its data file
contains 5 Ks-band datasets from the VVV survey, corresponding to 5 photometric
apertures of different diameters. The object is temporally blended by a nearby
point-source, severely contaminating its photometric light curve. Each dataset
includes an error and a ZP-error column. To analyze the data of this object and
select the optimal one among the 3 smallest apertures, we should use the following
settings in lcfit.par
:
--input_list test2.lst --input_dir test_photometry --n_data_cols 5 --use_data_cols 1 2 3 --is_err_col --is_zperr_col
.
The results and output files will be written to the subdirectories provided
with this repository.
MIT, see LICENSE
.