Update README
rafelafrance committed Feb 22, 2024
1 parent 90d6060 commit 0a1221e
Showing 5 changed files with 54 additions and 22 deletions.
44 changes: 37 additions & 7 deletions README.md
@@ -7,9 +7,12 @@ This repository merges three older repositories:
- `traiter_plants`
- `traiter_efloras`
- `traiter_mimosa`
- Parts of `digi_leap`

More merging for other Traiter repositories for plant traits may occur.
Some functionality was also split out so that I can use it in other projects:
- `pdf_parsers`: Scripts for parsing PDFs to prepare them for information extraction.
- https://github.com/rafelafrance/pdf_parsers
- `LabelTraiter`: Parsing herbarium labels now lives in its own repository, separate from parsing treatments (this repo).
- https://github.com/rafelafrance/LabelTraiter

## All right, what's this all about then?
**Challenge**: Extract trait information from plant treatments. That is, if I'm given treatment text like: (Reformatted to emphasize targeted traits.)
@@ -28,9 +31,6 @@ Essentially, we are finding relevant terms in the text (NER) and then linking th
4. Sex: Plants exhibit sexual dimorphism, so we need to note which part/subpart/trait notation is associated with which sex.
5. Other text: Things like conjunctions, punctuation, etc. Although they are not recorded, they are often important for parsing and linking of terms.

## Multiple methods for parsing
1. Rule based parsing. Most machine learning models require a substantial training dataset. I use this method to bootstrap the training data. If machine learning methods fail, I can fall back to this.

## Rule-based parsing strategy
1. I label terms using spaCy's phrase and rule-based matchers.
2. Then I match terms using rule-based matchers repeatedly until I have built up a recognizable trait like: color, size, count, etc.
@@ -58,6 +58,38 @@ cd FloraTraiter
make install
```

### Extract traits

You'll need some treatment text files, one treatment per file.

Example:

```bash
parse-treatments --treatment-dir /path/to/treatments --json-dir /path/to/output/traits --html-file /path/to/traits.html
```

The output options `--json-dir` and `--html-file` are both optional. An example of the HTML output is shown above. Here is an example of the JSON output:

```json
{
    "dwc:scientificName": "Astragalus cobrensis A. Gray var. maguirei Kearney, | var. maguirei",
    "dwc:taxonRank": "variety",
    "dwc:scientificNameAuthorship": "A. Gray | Kearney",
    "dwc:dynamicProperties": {
        "leafletHairSurface": "pilosulous",
        "leafletHair": "hair",
        "leafletHairShape": "incurved-ascending",
        "leafletHairSize": "lengthLowInCentimeters: 0.06 ~ lengthHighInCentimeters: 0.08",
        "leafPart": "leaflet | leaf",
        "partLocation": "adaxial",
        "fruitPart": "legume",
        "legumeColor": "white",
        "legumeSurface": "villosulous"
    },
    "text": "..."
}
```
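
Downstream code can read these files with the standard library. The sketch below is only an illustration: it assumes, based solely on the example above, that multiple values in a field are joined with ` | `. The helper name `split_values` and the abbreviated inline sample are hypothetical and not part of FloraTraiter.

```python
import json

# Abbreviated sample that mirrors the shape of the JSON example above.
SAMPLE = """
{
    "dwc:taxonRank": "variety",
    "dwc:dynamicProperties": {
        "leafPart": "leaflet | leaf",
        "legumeColor": "white"
    }
}
"""


def split_values(value: str) -> list[str]:
    """Split a field on ' | ' (an assumption drawn from the example output)."""
    return [part.strip() for part in value.split(" | ")]


record = json.loads(SAMPLE)

# Map each trait name to a list of its values.
traits = {
    name: split_values(value)
    for name, value in record["dwc:dynamicProperties"].items()
}

print(traits["leafPart"])
```

For real output, replace `SAMPLE` with the contents of one file from the `--json-dir` directory.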

### Taxon database

A taxon database is included with the source code, but it may be out of date. I build a taxon database from 4 sources. The 3 primary sources each have various issues, but they complement each other well.
@@ -69,8 +101,6 @@ A taxon database is included with the source code, but it may be out of date. I

Download the first 3 sources and then use the `util_add_taxa.py` script to extract the taxa and put them into a form the parsers can use.

## Repository details

## Tests
There are tests which you can run like so:
```bash
Binary file modified assets/traits.png
Binary file modified assets/treatment.png
18 changes: 9 additions & 9 deletions flora/parse_treatments.py
@@ -15,26 +15,26 @@ def main():
     log.started()
     args = parse_args()

-    treatments: Treatments = Treatments(args)
+    treatments: Treatments = Treatments(args.treatment_dir, args.limit, args.offset)
     treatments.parse()

     if args.html_file:
         writer = HtmlWriter(args.html_file, args.spotlight)
         writer.write(treatments, args)

-    if args.traiter_dir:
-        args.traiter_dir.mkdir(parents=True, exist_ok=True)
-        write_json(treatments, args.traiter_dir)
+    if args.json_dir:
+        args.json_dir.mkdir(parents=True, exist_ok=True)
+        write_json(treatments, args.json_dir)

     log.finished()


-def write_json(treatments, traiter_dir):
+def write_json(treatments, json_dir):
     for treat in treatments.treatments:
         dwc = DarwinCore()
         _ = [t.to_dwc(dwc) for t in treat.traits]

-        path = traiter_dir / f"{treat.path.stem}.json"
+        path = json_dir / f"{treat.path.stem}.json"
         with path.open("w") as f:
             output = dwc.to_dict()
             output["text"] = treat.text
@@ -54,15 +54,15 @@ def parse_args() -> argparse.Namespace:
     )

     arg_parser.add_argument(
-        "--text-dir",
+        "--treatment-dir",
         metavar="PATH",
         type=Path,
         required=True,
-        help="""Directory containing the input text files.""",
+        help="""Directory containing the input treatment text files.""",
     )

     arg_parser.add_argument(
-        "--traiter-dir",
+        "--json-dir",
         metavar="PATH",
         type=Path,
         help="""Output JSON files holding traits, one for each input text file, in this
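
The rename matters to callers because `argparse` derives attribute names from option names. The following is a minimal sketch, not the project's actual `parse_args()`: `build_parser` is a hypothetical, reduced parser that keeps only the two renamed options, and the paths are made up for illustration.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reduced parser: only the two options renamed in this commit.
    parser = argparse.ArgumentParser()
    parser.add_argument("--treatment-dir", metavar="PATH", type=Path, required=True)
    parser.add_argument("--json-dir", metavar="PATH", type=Path)
    return parser


# argparse converts dashes in option names to underscores in attribute names,
# so --treatment-dir becomes args.treatment_dir and --json-dir becomes args.json_dir.
args = build_parser().parse_args(
    ["--treatment-dir", "treatments", "--json-dir", "out/traits"]
)
print(args.treatment_dir, args.json_dir)

# An optional Path argument that is omitted defaults to None, which is why
# main() can guard the JSON output with `if args.json_dir:`.
no_json = build_parser().parse_args(["--treatment-dir", "treatments"])
print(no_json.json_dir)
```

Any script that still passes `--text-dir` or `--traiter-dir` would be rejected by a parser defined this way, so invocations need updating along with the code.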
14 changes: 8 additions & 6 deletions flora/pylib/treatments.py
@@ -6,16 +6,18 @@


 class Treatments:
-    def __init__(self, args):
-        self.treatments: list[Treatment] = self.get_treatments(args)
+    def __init__(self, treatment_dir, limit, offset):
+        self.treatments: list[Treatment] = self.get_treatments(
+            treatment_dir, limit, offset
+        )
         self.nlp = pipeline.build()

     @staticmethod
-    def get_treatments(args):
-        labels = [Treatment(p) for p in sorted(args.text_dir.glob("*"))]
+    def get_treatments(treatment_dir, limit, offset):
+        labels = [Treatment(p) for p in sorted(treatment_dir.glob("*"))]

-        if args.limit:
-            labels = labels[args.offset : args.limit + args.offset]
+        if limit:
+            labels = labels[offset : limit + offset]

         return labels


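The `limit`/`offset` handling in `get_treatments()` can be tried in isolation. The sketch below copies only the slicing logic onto a plain list of strings; the function name `select`, the default values of `0`, and the file names are assumptions for illustration, since the definitions of `--limit` and `--offset` are not shown in this diff.

```python
def select(items: list[str], limit: int = 0, offset: int = 0) -> list[str]:
    """Apply the same windowing as get_treatments(): slice only when limit is truthy."""
    if limit:
        items = items[offset : limit + offset]
    return items


files = ["a.txt", "b.txt", "c.txt", "d.txt"]

# Skip one file, then take the next two.
window = select(files, limit=2, offset=1)

# With a falsy limit the guard is skipped, so the offset has no effect.
unbounded = select(files, limit=0, offset=1)

print(window)
print(unbounded)
```

One consequence of this design is that an offset passed without a limit is silently ignored, because the slice sits entirely inside the `if limit:` branch.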