106 commits
8959737
Update README.md
arapat Jun 2, 2020
9d83a27
Merge branch 'master' of github.com:arapat/bathymetry
arapat Jun 2, 2020
69605fb
Update README.md
arapat Jun 2, 2020
29fc334
Update config.json
arapat Jun 2, 2020
ff0348a
added notebook
arapat Jun 2, 2020
32a2ea9
added train-test-split code
arapat Jun 2, 2020
a6d67ff
Create README.md
arapat Jun 2, 2020
4c25f82
Update README.md
arapat Jun 2, 2020
49d8d51
Update README.md
arapat Jun 2, 2020
666f352
Update README.md
arapat Jun 4, 2020
2b6f4d9
Update README.md
arapat Jun 4, 2020
e96c114
Create DSE_README.md
arapat Jun 4, 2020
cb760b8
Update DSE_README.md
arapat Jun 4, 2020
a9d8755
updates
arapat Jun 5, 2020
73af1f4
updates
arapat Jun 5, 2020
7e1e45e
Create train-test.sh
arapat Jun 5, 2020
2cfbdb3
Create config.json
arapat Jun 5, 2020
ce7ad74
Create README.md
arapat Jun 5, 2020
739408b
Update DSE_README.md
arapat Jun 5, 2020
b953ab2
Rename README.md to Notes.md
arapat Jun 5, 2020
2968f48
Rename DSE_README.md to README.md
arapat Jun 5, 2020
3d6d401
fixed typo
arapat Jun 5, 2020
3453179
Merge branch 'master' of github.com:arapat/bathymetry
arapat Jun 5, 2020
bb590b8
Update and rename README.md to DSC291.md
arapat Jun 5, 2020
6345c68
Update DSC291.md
arapat Jun 5, 2020
fc1e2b2
Update and rename Notes.md to README.md
arapat Jun 5, 2020
44970a4
Update README.md
arapat Jun 5, 2020
b9ac5b8
updated examples
arapat Jun 5, 2020
978c003
updates
arapat Jun 5, 2020
a7db291
Merge branch 'master' of github.com:arapat/bathymetry
arapat Jun 5, 2020
f5f1ceb
fixed typo
arapat Jun 5, 2020
59fcb3b
Update README.md
arapat Jun 9, 2020
0c230cc
Update README.md
arapat Jun 9, 2020
50fcaf1
added cruise id
arapat Jun 16, 2020
652ed81
Test-USM2 task
Apr 13, 2021
76301d9
Test-USM2 task
Apr 13, 2021
a5ea332
Apply scores to CM files
Apr 13, 2021
6e2a294
Apply scores to CM files
Apr 13, 2021
ea00863
deleted old readme
Apr 13, 2021
3ec44f3
deleted old readme
Apr 13, 2021
741e35b
removed executable permissions
Apr 13, 2021
c766fe1
removed executable permissions
Apr 13, 2021
2167e15
removed unused variables
Apr 13, 2021
c9fce56
removed unused variables
Apr 13, 2021
558b914
updated readme
Apr 13, 2021
2cacd4a
updated readme
Apr 13, 2021
7929483
removed imports
Apr 13, 2021
98a91a6
removed imports
Apr 13, 2021
8e42390
fixed import path error
Apr 16, 2021
77f18b4
Edit CM to conform to Py-CMeditor input
Apr 23, 2021
e84e25a
reads from tsv instead of cm
Apr 26, 2021
f017782
keep predicted depth in cm file
hughaharper Apr 28, 2021
3bd7367
add notebooks to analysis dir
hughaharper May 6, 2021
fbfe9df
add notebooks to analysis dir
hughaharper May 6, 2021
410a22d
Simple analysis of new models
hughaharper May 7, 2021
612ea69
Simple analysis of new models
hughaharper May 7, 2021
57e5435
Removing old stuff
hughaharper May 7, 2021
7b41f0d
Removing old stuff
hughaharper May 7, 2021
1e3931b
added threshold labels to PRC
hughaharper May 10, 2021
43caca9
added threshold labels to PRC
hughaharper May 10, 2021
d125730
only clean files from spec database
hughaharper May 26, 2021
ad39486
simple visualization of cruise + scores
hughaharper May 26, 2021
79948ce
more notebooks
hughaharper May 26, 2021
79d1e4e
more notebooks
hughaharper May 26, 2021
98ec189
notebook for applying scores
hughaharper May 28, 2021
cb2902e
notebook for applying scores
hughaharper May 28, 2021
d32aa30
remove the kind and year features
hughaharper Jun 2, 2021
61f66a0
dont convert weights to list in read from text
hughaharper Jun 2, 2021
05f9946
Feature importance updates
hughaharper Jun 3, 2021
4fbd6bb
change num cpus in ray init
hughaharper Jun 4, 2021
fb198c6
move feature omission out of read data function, will now read/write …
hughaharper Jun 8, 2021
692ac18
see previous commit, now working
hughaharper Jun 8, 2021
a79ea18
removed debugging print statements
hughaharper Jun 8, 2021
072071b
fixed REMOVED_FEATURES list problem
hughaharper Jun 8, 2021
f8235d0
notebook for binary tree visualization
hughaharper Jun 21, 2021
35ee21b
added features back
hughaharper Jun 23, 2021
3c1bdbb
Edited for presentation
hughaharper Jun 23, 2021
8de9c6f
remove d10 feature
hughaharper Jun 29, 2021
b5084b7
remove d20 and NDP2
hughaharper Jun 29, 2021
0b9d382
move some stuff to a module
hughaharper Jun 30, 2021
0c46615
another metric to look at
hughaharper Jun 30, 2021
a11db9e
moved functions to module
hughaharper Jun 30, 2021
3304cd2
edit feature list
hughaharper Jul 13, 2021
34fd56f
remove d60,NDP10
hughaharper Jul 14, 2021
dfc4f4f
Remove NDP30
hughaharper Jul 15, 2021
5f0f217
add more regions
hughaharper Jul 19, 2021
a4b4f9a
Separate analyses by model tests
hughaharper Jul 21, 2021
1a4691f
remove old, cluttered notebooks
hughaharper Jul 21, 2021
4bf0467
renamed dirs
hughaharper Jul 21, 2021
562d5e0
cleaned up analysis notebooks
hughaharper Aug 2, 2021
3079f85
remove binary file prefix
hughaharper Aug 6, 2021
d147ba1
ignore a certain notebook
hughaharper Aug 9, 2021
d1e4ba7
fixing ignore stuff
hughaharper Aug 9, 2021
025ee69
ignore notebooks
hughaharper Aug 9, 2021
c998914
Merge branch 'analysis' of https://github.com/hughaharper/bathymetry …
hughaharper Aug 9, 2021
d927e27
remove old file from merge
hughaharper Aug 9, 2021
68944bd
remove old stuff from merge
hughaharper Aug 9, 2021
3012cd3
messing with new models
hughaharper Aug 9, 2021
56a7806
changed names of binary files
hughaharper Aug 9, 2021
827f6d9
Merge branch 'cm-clean'
hughaharper Sep 17, 2021
5181fcb
Merge branch 'master' of https://github.com/hughaharper/bathymetry
hughaharper Sep 17, 2021
f2e3898
Implement a "random" function
hughaharper Sep 30, 2021
27bdc93
merge analysis branch
hughaharper Sep 30, 2021
7ce037b
merge analysis branch
hughaharper Sep 30, 2021
960ceba
Merge branch 'master' of https://github.com/hughaharper/bathymetry
hughaharper Oct 1, 2021
d471e98
change pathname for data files
hughaharper Oct 11, 2021
3 changes: 3 additions & 0 deletions .gitignore
@@ -64,3 +64,6 @@ target/

# Editor
.vscode

# other
analysis/show*ipynb
78 changes: 0 additions & 78 deletions DSC291.md

This file was deleted.

19 changes: 14 additions & 5 deletions README.md
@@ -11,6 +11,7 @@ If the input format is tsv, it will be written to disk in pickle files so that n
* `test.py`: template code to be called by "__main__.py" proper functions for testing. It outputs a pickle file that
contains scores in addition to some meta information about examples, e.g. cruise ID, longitude, latitude
* `train.py`: template code to be called by "__main__.py" proper functions for training.
* `clean_CM.py`: for applying model predictions to CM files.
* `config.json`: config such as the input data path, and the directory to write the models

## Typical usage
@@ -49,7 +50,7 @@ Then specify these three files to the training program in `config.json`.

2. Run training with bootstrap

The bathymetry module is implemented to train the models in different conditions (see `task_type` below). Note that
The bathymetry module is implemented to train the models in different conditions (see `task_type` below). Note that
bootstrap is NOT implemented in this module.

```
@@ -64,7 +65,7 @@ python bathymetry <data_type> <task_type> <config_path>
* "test-all": test the model trained on all data on the dataset from research institutions (test n times)
* "train-instances": train a model using data that is split at the instance level (ignore for now)
* "test-instances": test a model using a test set that was split at the instance level (ignore for now)

3. Run testing

Testing is implemented in this module (see above).
@@ -79,19 +80,26 @@ Testing is implemented in this module (see above).

### Label

The label is derived from the column 04 (see below), `sigd`: the example is labeled 0 if sigd == “9999”, and labeled 1 otherwise.
Each row in the TSV file corresponds to one measurement. The descriptions of the columns can be found at
[README.md](README.md).

The learning task is binary classification: deciding whether a depth measurement is correct.
The label is 0 if the measurement is wrong (or corrupted), and 1 if it is accurate.
The human annotators put the value "9999" in the fifth column (index 04), `sigd`, if they think the measurement is wrong,
and put other values otherwise.
The program we provide gets the data label using a function of the form `lambda row: row[4] != "9999"`.
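
The labeling rule above can be sketched as follows. This is a minimal illustration, assuming each row has already been split on tabs into a list of strings; `label_row`, `SIGD_INDEX`, `BAD_MARKER`, and the two sample rows are hypothetical and are not taken from the repository.

```python
# Hypothetical helper illustrating the labeling rule; not the repository's code.
SIGD_INDEX = 4       # column 04, `sigd`
BAD_MARKER = "9999"  # value the human editors use to flag a bad measurement


def label_row(row):
    """Return 1 if the measurement is considered accurate, 0 if flagged bad."""
    return int(row[SIGD_INDEX] != BAD_MARKER)


# Two made-up rows containing only the first six columns.
good = "143.92639\t-43.99727\t-4637\t0\t-1\t10088".split("\t")
bad = "143.92639\t-43.99727\t-4637\t0\t9999\t10088".split("\t")
labels = [label_row(good), label_row(bad)]
print(labels)
```

The comparison is done on strings, as in the `lambda` quoted above, so a value such as `9999.0` would not be treated as bad under this sketch.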

### Description of all columns

Each line in the `.tsv` data files should contain 35 columns. The meanings of the columns are as follows.
Each line in the `.tsv` data files should contain 37 columns. The meanings of the columns are as follows.

```
index name Example Description
00 lon 143.92639 longitude of the location
01 lat -43.99727 latitude of the location
02 depth -4637 the depth measured by the crew
03 sigh 0 not sure what it means
04 sigd -1 state according to human editor: 9999 = bad (do not incorporate into atlas), all other values = Good (incorporate into atlas),
04 sigd -1 state according to human editor: 9999 = bad (do not incorporate into atlas), all other values = Good (incorporate into atlas),
05 SID 10088 Cruise ID, should not be used as features
06 pred -4633 the predicted depth with the gravity model
07 ID 1 not sure what it means
@@ -123,6 +131,7 @@ index name Example Description
33 D-MED30m/STD30m 0.0102018
34 year 2000 The year of the measurement
35 kind G Device type used for measurements
36 PRED-ABS(VGG_5m)
```
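
As a rough sketch of how a row in this layout might be consumed, the snippet below splits one tab-separated line, checks the column count, and pulls out a few of the fields named in the table. `parse_row`, `EXPECTED_COLUMNS`, and the sample line are hypothetical. Only the column indices come from the table above, and the columns not shown in the sample are filled with placeholder zeros.

```python
EXPECTED_COLUMNS = 37  # indices 00 through 36 in the table above


def parse_row(line):
    """Split one TSV line and return a few named fields (illustrative only)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != EXPECTED_COLUMNS:
        raise ValueError(
            "expected {} columns, got {}".format(EXPECTED_COLUMNS, len(fields)))
    return {
        "lon": float(fields[0]),
        "lat": float(fields[1]),
        "depth": float(fields[2]),
        "sigd": fields[4],   # kept as a string so it can be compared to "9999"
        "SID": fields[5],    # cruise ID, not to be used as a feature
        "pred": float(fields[6]),
    }


# Made-up line: real values for columns 00-06, placeholder zeros elsewhere.
head = ["143.92639", "-43.99727", "-4637", "0", "-1", "10088", "-4633"]
sample_line = "\t".join(head + ["0"] * (EXPECTED_COLUMNS - len(head))) + "\n"
row = parse_row(sample_line)
print(row["depth"] - row["pred"])
```

Failing loudly on an unexpected column count is one way to catch files written in the older 35-column layout.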

## Program output
3 changes: 0 additions & 3 deletions __init__.py
@@ -1,4 +1 @@
from .common import TRAINING_FILES_DESC
from .common import VALIDATION_FILES_DESC
from .common import TESTING_FILES_DESC

23 changes: 20 additions & 3 deletions __main__.py
@@ -8,16 +8,20 @@
from .train import run_training
from .train import run_training_all
from .train import run_training_specific_file
from .train import run_training_n_times
from .test import get_all_data
from .test import run_testing
from .test import run_testing_specific_file


regions = ['AGSO', 'JAMSTEC', 'JAMSTEC2', 'NGA', 'NGA2', 'NGDC', 'NOAA_geodas', 'SIO', 'US_multi']
regions = ['AGSO', 'JAMSTEC', 'JAMSTEC2', 'NGA', 'NGA2', 'NGDC', 'NOAA_geodas',
'SIO', 'US_multi', 'US_multi2']
#regions = ['NGDC','US_multi','US_multi2']
#regions = ['TEST-ATL','TEST-PAC']
param1 = ["tsv", "pickle"]
param2 = ["train", "train-all", "test-self", "test-cross", "test-all",
"train-instances", "test-instances"]
usage_msg = "Usage: ./lgb.py <{}> <{}> <config_path>".format("|".join(param1), "|".join(param2))
"train-instances", "test-instances", "train-random"]
usage_msg = "Usage: python -m bathymetry <{}> <{}> <config_path>".format("|".join(param1), "|".join(param2))


@ray.remote
@@ -70,6 +74,12 @@ def run_testing_instances(model_name, regions):
run_testing_specific_file(model_name, [filename], test_region_name, config, logger)
run_testing_specific_file(model_name, filenames, "all", config, logger)

@ray.remote
def run_training_random(regions):
    logger = Logger()
    logfile = os.path.join(config["base_dir"], "training_log_all.log")
    logger.set_file_handle(logfile)
    run_training_n_times(config, regions, is_read_text, logger)

def get_data():
logger = Logger()
@@ -98,13 +108,16 @@ def get_data():
init_setup(config["base_dir"])
task = sys.argv[2].lower()


ray.init(num_cpus=10)
result_ids = []
if task == "train":
    for region in regions:
        result_ids.append(run_training_one_region.remote(region))
elif task == "train-all":
    run_training_all_regions(regions)
elif task == "train-random":
    result_ids.append(run_training_random.remote(regions))
elif task == "test-cross":
    for region in regions:
        result_ids.append(run_test.remote(region, regions, task))
@@ -121,6 +134,10 @@ def get_data():
elif task == "test-self":
    for region in regions:
        result_ids.append(run_test.remote(region, [region], task))
elif task == "test-usm2":
    for region in regions:
        #result_ids.append(run_test(region, ['US_multi2'], "test-cross"))
        result_ids.append(run_test.remote(region, ['US_multi2'], "test-cross"))
else:
    assert(False)
results = ray.get(result_ids)
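
The dispatch above ends in `assert(False)` for an unrecognised task, which gives no hint about the valid choices. One possible alternative, sketched here with a hypothetical `parse_args` helper and no Ray dependency, validates both arguments up front and reports the usage message. The option lists mirror `param1` and `param2` from this diff, plus `test-usm2`, which the dispatch handles but the `param2` list shown here does not include. `DATA_TYPES`, `TASKS`, and `USAGE` are illustrative names, not the repository's.

```python
# Illustrative argument check; names below are assumptions, not repository code.
DATA_TYPES = ["tsv", "pickle"]
TASKS = ["train", "train-all", "test-self", "test-cross", "test-all",
         "train-instances", "test-instances", "train-random", "test-usm2"]
USAGE = "Usage: python -m bathymetry <{}> <{}> <config_path>".format(
    "|".join(DATA_TYPES), "|".join(TASKS))


def parse_args(argv):
    """Validate argv and return (data_type, task, config_path)."""
    if len(argv) != 4:
        raise SystemExit(USAGE)
    data_type, task, config_path = argv[1].lower(), argv[2].lower(), argv[3]
    if data_type not in DATA_TYPES or task not in TASKS:
        raise SystemExit(USAGE)
    return data_type, task, config_path


print(parse_args(["bathymetry", "tsv", "Train-Random", "config.json"]))
```

Raising `SystemExit` with the usage string prints the message and exits with a non-zero status, so a mistyped task name fails before any Ray workers are started.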
446 changes: 446 additions & 0 deletions analysis/01_feature_removal/PRC-ROC.ipynb

Large diffs are not rendered by default.

Empty file.
321 changes: 321 additions & 0 deletions analysis/01_feature_removal/feature-importance.ipynb

Large diffs are not rendered by default.
