# Correct documentation file paths with new example project file structure
golmschenk committed May 19, 2024
1 parent d894755 commit 8691eca
Showing 2 changed files with 13 additions and 13 deletions.
```python
def get_positive_train_paths():
    return list(Path('data/spoc_transit_experiment/train/positives').glob('*.fits'))
```

This function creates a `Path` object for the directory at `data/spoc_transit_experiment/train/positives`, obtains all the files in it ending with the `.fits` extension, puts them in a list, and returns that list. In particular, `qusi` expects a function that takes no input parameters and outputs a list of `Path`s.

In our example code, we've split the data based on whether it's train, validation, or test data, and based on whether it's positive or negative data. We provide a function for each of the 6 permutations of this, each of which is almost identical to the one above. You can see the above function and the other 5 similar functions near the top of `scripts/dataset.py`.
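For instance, a sketch of the negative train counterpart differs only in the directory it points at (assuming the negative train light curves live in a parallel `negatives` directory):

```python
from pathlib import Path


def get_negative_train_paths():
    # Same pattern as above, pointed at the negative train directory.
    return list(Path('data/spoc_transit_experiment/train/negatives').glob('*.fits'))
```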

`qusi` is flexible in how the paths are provided, and this construction of having a separate function for each type of data is certainly not the only way of approaching this. Depending on your task, another option might serve better. In another tutorial, we will explore a few example alternatives. However, to better understand those alternatives, it's first useful to see the rest of this dataset construction.

```python
def load_times_and_fluxes_from_path(path):
    light_curve = TessMissionLightCurve.from_path(path)
    return light_curve.times, light_curve.fluxes
```

This uses a built-in class in `qusi` that is designed for loading light curves from TESS mission FITS files. However, the important thing is that your function takes a single `Path` object as input and returns two values: a NumPy array of the times and a NumPy array of the fluxes of your light curve. The `Path` object passed in will be one of the paths returned by the functions in the previous section, but you can write any code you need to get from a `Path` to the two arrays that represent times and fluxes. For example, if your file is a simple CSV file, it would be easy to use Pandas to load the CSV file and extract the time column and the flux column as the two arrays that are returned at the end of the function. You will see the above function in `scripts/dataset.py`.
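As a minimal sketch of that CSV case (the `time` and `flux` column names here are assumptions; substitute whatever your files actually use):

```python
import pandas as pd


def load_times_and_fluxes_from_csv_path(path):
    # Load the CSV and pull out the two columns as NumPy arrays.
    light_curve_data_frame = pd.read_csv(path)
    return light_curve_data_frame['time'].values, light_curve_data_frame['flux'].values
```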

## Creating a function to provide a label for the data

```python
def positive_label_function(path):
    return 1


def negative_label_function(path):
    return 0
```

Note that `qusi` expects the label functions to take a `Path` object as input, even if we don't end up using it, because this allows for more flexible configurations. For example, in a different situation, the data might not be split into positive and negative directories; instead, the label data might be contained within the user's data file itself. In other cases, the label can be something other than 0 and 1. The label is whatever the NN is attempting to predict for the input light curve, but for our binary classification case, 0 and 1 are what we want to use. Once again, you can see these functions in `scripts/dataset.py`.
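For example, a sketch of a label function that reads the label out of the data file itself might look like the following (assuming, hypothetically, CSV files with a `label` column; your files will differ):

```python
import pandas as pd


def label_function_from_file_contents(path):
    # Hypothetical: read the label stored inside the data file itself.
    light_curve_data_frame = pd.read_csv(path)
    return int(light_curve_data_frame['label'].iloc[0])
```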

## Creating a light curve collection

Now we're going to join the various functions we've just defined into `LightCurveObservationCollection`s. For the positive train data, this looks like:

```python
positive_train_light_curve_collection = LightCurveObservationCollection.new(
    get_paths_function=get_positive_train_paths,
    load_times_and_fluxes_from_path_function=load_times_and_fluxes_from_path,
    load_label_from_path_function=positive_label_function)
```

This defines a collection of labeled light curves where `qusi` knows how to obtain the paths, how to load the times and fluxes of the light curves, and how to load the labels. The `LightCurveObservationCollection.new(...)` function takes in the three pieces we just built. Note that you pass in the functions themselves, not the output of the functions. So for the `get_paths_function` parameter, we pass `get_positive_train_paths`, not `get_positive_train_paths()` (notice the absence of parentheses); `qusi` will call these functions internally. However, unlike the rest of the code in this tutorial, the above bit of code does not appear on its own in `scripts/dataset.py`. This is because `qusi` doesn't use this collection by itself; it uses it as part of a dataset. We will explain why there's this extra layer in a moment.

## Creating a dataset

```python
def get_transit_train_dataset():
    positive_train_light_curve_collection = LightCurveObservationCollection.new(
        get_paths_function=get_positive_train_paths,
        load_times_and_fluxes_from_path_function=load_times_and_fluxes_from_path,
        load_label_from_path_function=positive_label_function)
    negative_train_light_curve_collection = LightCurveObservationCollection.new(
        get_paths_function=get_negative_train_paths,
        load_times_and_fluxes_from_path_function=load_times_and_fluxes_from_path,
        load_label_from_path_function=negative_label_function)
    train_light_curve_dataset = LightCurveDataset.new(
        standard_light_curve_collections=[positive_train_light_curve_collection,
                                          negative_train_light_curve_collection])
    return train_light_curve_dataset
```

This is the function which generates the training dataset we called in the {doc}`/tutorials/basic_transit_identification_with_prebuilt_components` tutorial. The parts of this function are as follows. First, we create the `positive_train_light_curve_collection`. This is exactly what we just saw in the previous section. Next, we create a `negative_train_light_curve_collection`. This is almost identical to its positive counterpart, except now we pass the `get_negative_train_paths` and `negative_label_function` instead of the positive versions. Then there is the `train_light_curve_dataset = LightCurveDataset.new(` line. This creates a `qusi` dataset built from these two collections. The reason the collections are separate is that `LightCurveDataset` has several mechanisms working under the hood. Notably for this case, `LightCurveDataset` will balance the two light curve collections. We know of far more light curves without planet transits than light curves with them; in the real-world case, the imbalance is at least thousands to one. But for a NN, it's usually useful during the training process to show equal amounts of positives and negatives. `LightCurveDataset` will do this for us. You may have also noticed that we passed these collections in as the `standard_light_curve_collections` parameter. `LightCurveDataset` also allows for passing different types of collections. Notably, collections can be passed such that light curves from one collection will be injected into another. This is useful for injecting synthetic signals into real telescope data. However, we'll save the injection options for another tutorial.

You can see the above `get_transit_train_dataset` dataset creation function in the `scripts/dataset.py` file. The only part of that file we haven't yet looked at in detail is the `get_transit_validation_dataset` and `get_transit_finite_test_dataset` functions. However, these are nearly identical to the above `get_transit_train_dataset`, except that they use the validation and test path-obtaining functions instead of the train ones.

## Adjusting this for your own binary classification task

---

The remainder of the commands will assume you are running code from the project directory.

## Downloading the dataset

The next thing we'll do is download a dataset of light curves that include cases both with and without transiting planets. To do this, run the example script at `scripts/download_data.py`. For now, don't worry about how each part of the code works. You can run the script with

```sh
python scripts/download_data.py
```

The main thing to know is that this will create a `data` directory within the project directory, and within that will be a `spoc_transit_experiment` directory, referring to the data for the experiment of finding transiting planets within the TESS SPOC data. This will further contain 3 directories: one for train data, one for validation data, and one for test data. Within each of those, it will create a `positive` directory, which will hold the light curves with transits, and a `negative` directory, which will hold the light curves without transits. So the project directory tree now looks like
```
data
└── spoc_transit_experiment
    ├── test
    │   ├── negative
    │   └── positive
    ├── train
    │   ├── negative
    │   └── positive
    └── validation
        ├── negative
        └── positive
examples
```

Each of these `positive` and `negative` data directories will now contain a set of light curves. The reason the code in this script is not very important for you to know is that it's mostly irrelevant for future uses: when you're working on your own problem, you'll obtain your data some other way. And `qusi` is flexible about the data structure, so this directory structure is not required; it's just one way to structure the data. Note, this is a relatively small dataset, to make sure it doesn't take very long to get up and running. To get a better result, you'd want to download all known transiting light curves and a much larger collection of non-transiting light curves. To quickly visualize one of these light curves, you can use the script at `scripts/light_curve_visualization.py`. Because the light curves available on MAST are constantly updated, the random selection of light curves you downloaded might not include the light curve noted in this example file. Be sure to open the `scripts/light_curve_visualization.py` file and update the path to one of the light curves you downloaded. To see a transit case, be sure to select one from one of the `positive` directories. Then run
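That edit is just pointing the script at a file you actually have; for example, something of this shape (the variable name and the placeholder file name here are hypothetical, not the script's actual contents):

```python
from pathlib import Path

# Replace the placeholder with one of your own downloaded FITS files.
light_curve_path = Path('data/spoc_transit_experiment/train/positive/<your_light_curve>.fits')
```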

```sh
python scripts/light_curve_visualization.py
```

You should see a plot of the light curve.

This will only log runs locally. If you choose the offline route, you can later use `wandb sync` to upload the locally recorded runs when you want to view them on the Wandb website.

## Train the network

Next, we'll look at the `scripts/train.py` file. In this script is a `main` function which will train our neural network on our data. The training script has 3 main components:

1. Code to prepare our datasets.
2. Code to prepare the neural network model.
3. Code to run the training loop.
Since `qusi` provides both the models and the training loop code, the only one of these components that every user will be expected to deal with is preparing the dataset, since you'll eventually want to have `qusi` tackle the task you're interested in, which will require your own data. The `qusi` dataset component will help make your data more suitable for training a neural network. However, we're going to save how to set up your own dataset (and how these example datasets are created) for the next tutorial. For now, we'll just use the example datasets as is. So, in the example script, you will see the first couple of lines of the `main` function call other functions that produce an example train and validation dataset for us. Then we choose one of the neural network models `qusi` provides (in this case, the `Hadryss` model). Then finally, we start the training session. To run this training, simply run the script with:

```sh
python scripts/train.py
```
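For orientation, the `main` function has roughly this shape (a sketch only: the dataset functions come from `scripts/dataset.py`, and the import paths and exact `qusi` session call are assumptions that may differ between versions):

```python
from scripts.dataset import get_transit_train_dataset, get_transit_validation_dataset
from qusi.model import Hadryss  # assumed import path
from qusi.session import train_session  # assumed import path


def main():
    # 1. Prepare the datasets.
    train_dataset = get_transit_train_dataset()
    validation_dataset = get_transit_validation_dataset()
    # 2. Prepare the neural network model.
    model = Hadryss.new()
    # 3. Run the training loop (argument names here are assumptions).
    train_session(train_datasets=[train_dataset],
                  validation_datasets=[validation_dataset],
                  model=model)
```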

You should see some output in the terminal showing basic training statistics as it runs through the training loop. It will run for as many train cycles as were specified in the script. On every completed cycle, `qusi` will save the latest version of the fitted model to `sessions/<wandb_run_name>/latest_model`.
You can also go to your Wandb project to see the metrics over the course of the training.

## Test the fitted model

A "fitted model" is a model which has been trained, or fitted, on some training data. Next, we'll take the fitted model we produced during training, and test it on data it didn't see during the training process. This is what happens in the `scripts/transit_finite_dataset_test.py` script. The `main` function will look semi-similar to from the training script. Again, we'll defer how the dataset is produced until the next tutorial. Then we create the model as we did before, but this time we load the fitted parameters of the model from the saved file. Here, you will need to update the script to point to your saved model produced in the last section. Then we can run the script with
A "fitted model" is a model which has been trained, or fitted, on some training data. Next, we'll take the fitted model we produced during training, and test it on data it didn't see during the training process. This is what happens in the `scripts/finite_dataset_test.py` script. The `main` function will look semi-similar to from the training script. Again, we'll defer how the dataset is produced until the next tutorial. Then we create the model as we did before, but this time we load the fitted parameters of the model from the saved file. Here, you will need to update the script to point to your saved model produced in the last section. Then we can run the script with

```sh
python scripts/finite_dataset_test.py
```

This will run the network on the test data, producing the metrics that are requested in the file.
