- Clone the repo
git clone https://github.com/alercebroker/ztf_dr_downloader.git
- Install the package on your system
pip install .
You need an AWS instance with access to S3 to run this script. A c5a.large instance with 150 GB of disk worked without problems. As you increase the number of processes, you will need more disk space.
The fields of the data release are distributed among the processes. For each field, a process (sketched below):
- Verifies that the field file is already in S3; if it is, it skips to the next field, otherwise the file is downloaded.
- Uploads the file to S3.
- Deletes the file from disk.
- Continues with the next file.
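For reference, the per-field loop looks roughly like the following Python sketch. This is not the package's actual implementation: boto3 credentials are assumed to be configured, and the function and variable names (process_field, field_url, etc.) are illustrative.
# Hypothetical sketch of the per-field loop described above (not the package's actual code).
import os
import urllib.request

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def process_field(field_url: str, bucket: str, key: str) -> None:
    # 1. If the field file is already in S3, skip it and continue with the next one.
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return
    except ClientError:
        pass  # not in S3 yet, download it

    # 2. Download the field file from the data release.
    local_path = os.path.basename(key)
    urllib.request.urlretrieve(field_url, local_path)

    # 3. Upload it to S3, then 4. delete it from disk.
    s3.upload_file(local_path, bucket, key)
    os.remove(local_path)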
To run the code, follow the instructions below:
- Locate the data release that you need to download, e.g.: https://irsa.ipac.caltech.edu/data/ZTF/lc_dr5/
- Locate the checksum file, e.g.: https://irsa.ipac.caltech.edu/data/ZTF/lc_dr5/checksums.md5
- Execute for 5 parallel processes (arg -n):
dr download-data-release https://irsa.ipac.caltech.edu/data/ZTF/lc_dr5/ \
https://irsa.ipac.caltech.edu/data/ZTF/lc_dr5/checksums.md5 \
s3://<your-bucket>/<data-release>/<etc> \
-n 5
- Wait calmly because there is a lot of data!
If you want to get the Data Release metadata without the light curves (to do a crossmatch or another operation), you can get it with the following command (you must have the data stored somewhere, e.g. S3):
dr get-objects <your-bucket> <data-release>
You can also obtain features of the light curves of the Data Release (based on code from Sánchez-Sáez et al. 2020). This can be very expensive, but it can be executed on several machines at the same time with slurm (work in progress).
Run for only one field:
dr get-features <input-file> <output-file>
Or compute them in your own code:
import pandas as pd
from ztf_dr.extractors import DataReleaseExtractor

input_file = "path_to_pallet_town"    # parquet file with light curves from the data release
output_file = "path_to_drink_a_beer"  # where the computed features will be written

extractor = DataReleaseExtractor()
zone = pd.read_parquet(input_file)           # load the light curves of one field/zone
features = extractor.compute_features(zone)  # compute the features per object
features.to_parquet(output_file)             # save the features
If you have access to a slurm cluster, run this command in your terminal:
sbatch --array [0-499]%500 compute_features.slurm <s3-bucket-raw-data> <s3-bucket-output-data>
This code distributes 500 jobs across the whole cluster, so all the files of the data release are split among these jobs. It computes the features only for objects that meet the following conditions (illustrated in the sketch after this list):
- light curve points with catflags = 0 and magerr < 1
- ndets > 20 in fid 1 and 2, ndets > 5 in fid 3
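For illustration, the selection criteria above correspond roughly to the pandas sketch below. It assumes a long-format table with one row per detection and the column names objectid, fid, catflags and magerr; the actual data release files and extractor code may be organized differently.
# Hypothetical sketch of the selection criteria above (column names and layout are assumed).
import pandas as pd

def select_objects(detections: pd.DataFrame) -> pd.DataFrame:
    # Keep only good-quality points: catflags = 0 and magerr < 1.
    good = detections[(detections["catflags"] == 0) & (detections["magerr"] < 1)]
    # Count the remaining detections of each (single-band) object.
    ndets = good.groupby("objectid").size().rename("ndets")
    info = good[["objectid", "fid"]].drop_duplicates().set_index("objectid").join(ndets)
    # Require ndets > 20 in fid 1 and 2, ndets > 5 in fid 3.
    min_dets = info["fid"].map({1: 20, 2: 20, 3: 5})
    keep = info.index[info["ndets"] > min_dets]
    return good[good["objectid"].isin(keep)]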
Integration with ZTF DR API
Each time you want to update the data in the API database (when ZTF launches a new data release), you must follow this procedure:
- Launch a machine and install MongoDB; the disk should be similar in size to the total data release (e.g. DR5 weighs ~3.5 TB and the machine's disk is 3.4 TB).
- On the same instance, run:
dr load_mongo <host> <db-name> <collection-name> <input-s3-bucket> --n-cores <number-of-processes> --batch-size <objects-per-batch>
NOTE: If you want to drop all elements in the database, use the -d flag in the command.
The code filters the objects, converts each light curve to binary, and inserts into the database a document with the following structure:
{
    "_id" : NumberLong(1550215200000003),
    "filterid" : 2,
    "fieldid" : 1550,
    "rcid" : 57,
    "objra" : 35.6168823242188,
    "objdec" : 19.0562019348145,
    "nepochs" : 25,
    "hmjd" : { "$binary" : "avJjR2D1Y0eF/GNHawNkR3AJZEdsCmRHWzdkR0k5ZEdHOmRHTT1kR0w/ZEdNQGRHTUNkR01EZEdWRWRHZFFkRzRYZEcwWWRHYHJlR2ByZUdpc2VHaXNlR3Z7ZUdffGVHX3xlRw==", "$type" : "00" },
    "mag" : { "$binary" : "U/6EQc5ahUGrcYVBcG+FQfKUhUHSZYVBJliFQQBphUEcNIVB1AiFQRAqhUHOOoVBJA6FQUFahUEiMoVB9a2FQQ4UhUHWNoVBw0yFQQyYhUF2jIVBHCGFQWhchUHiQIVBWlyFQQ==", "$type" : "00" },
    "magerr" : { "$binary" : "U65nPEjEazyCz2w8TbVsPMdybjybRGw8dKVrPMppbDzXB2o8UCJoPA6WaTwEVGo8Ql1oPN69azxi8Wk8LKFvPBWfaDzWJmo82SFrPAiYbjwwDW48ODFpPOTWazx0mWo8O9ZrPA==", "$type" : "00" },
    "loc" : {
        "type" : "Point",
        "coordinates" : [
            -144.383117675781,
            19.0562019348145
        ]
    }
}
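The binary fields above are consistent with the per-epoch arrays (hmjd, mag, magerr) being stored as packed little-endian float32 bytes, and loc with a GeoJSON point whose longitude is objra - 180 so it falls in the [-180, 180] range MongoDB expects. The following sketch shows that assumed encoding; it is not the exact code of the loader.
# Hypothetical sketch of the document encoding shown above (the float32 packing and
# the objra - 180 shift are assumptions inferred from the example document).
import numpy as np
from bson.binary import Binary

def build_document(objectid, filterid, fieldid, rcid, objra, objdec, hmjd, mag, magerr):
    return {
        "_id": int(objectid),
        "filterid": int(filterid),
        "fieldid": int(fieldid),
        "rcid": int(rcid),
        "objra": float(objra),
        "objdec": float(objdec),
        "nepochs": len(hmjd),
        # Per-epoch arrays packed as raw float32 bytes.
        "hmjd": Binary(np.asarray(hmjd, dtype=np.float32).tobytes()),
        "mag": Binary(np.asarray(mag, dtype=np.float32).tobytes()),
        "magerr": Binary(np.asarray(magerr, dtype=np.float32).tobytes()),
        # GeoJSON point used for the spatial index; longitude shifted into [-180, 180].
        "loc": {"type": "Point", "coordinates": [float(objra) - 180.0, float(objdec)]},
    }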
Finally, the script creates a spatial index on the loc field, plus other indexes on nepochs, filterid, fieldid, among others.
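In pymongo, that index creation could look like the sketch below (the exact list of secondary indexes is an assumption based on the description above):
# Hypothetical sketch of the index creation described above.
import pymongo

client = pymongo.MongoClient("<host>")
collection = client["<db-name>"]["<collection-name>"]

# Spatial (2dsphere) index on the GeoJSON "loc" field, used for cone searches.
collection.create_index([("loc", pymongo.GEOSPHERE)])
# Secondary indexes on frequently queried fields.
collection.create_index("nepochs")
collection.create_index("filterid")
collection.create_index("fieldid")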
After that, in the AWS console:
- Go to Lambda.
- Click on ztd-dr-api.
- Go to Configuration, then click on Environment variables.
- Replace the old credentials with the new credentials.
NOTE: If you made any changes to ztf_dr_api, you must update the lambda function.
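If you prefer to update the environment variables from code instead of the console, a boto3 call like the following can do it; the variable names here are placeholders, adapt them to your deployment.
# Hypothetical sketch: update the Lambda environment variables with boto3.
# The variable names below are placeholders, not the API's actual configuration keys.
import boto3

lam = boto3.client("lambda")
lam.update_function_configuration(
    FunctionName="ztd-dr-api",
    Environment={
        "Variables": {
            "MONGO_HOST": "<new-host>",
            "MONGO_USER": "<new-user>",
            "MONGO_PASSWORD": "<new-password>",
        }
    },
)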
- In the data folder we save some data about the data releases (since DR5).
- In data/DR<X>_by_field.csv we save the total size and the number of files per field.