Updated code for extraction #8

Merged
merged 79 commits into from
May 1, 2024
79 commits
f0d0fad
Citation extractions by Vishal for code review
venvis Feb 8, 2024
7868c30
cellar
venvis Feb 8, 2024
905d223
Updated code for extraction
venvis Feb 27, 2024
36cc239
Updated code
venvis Mar 6, 2024
c42ca98
Update Testing_file.py
venvis Mar 21, 2024
7cb83e4
Delete cellar/cellar_extractor/operative_extraction.py
venvis Mar 21, 2024
d5e1aa3
Add files via upload
venvis Mar 21, 2024
8d754a7
Delete cellar/cellar_extractor/Testing_file.py
venvis Mar 21, 2024
a6fd5b8
Add files via upload
venvis Mar 21, 2024
3664a9c
Added doc string to method extra_cellar
Mar 28, 2024
16af4bb
Added doc string and linted code for cellar_queries file
Mar 28, 2024
4e07a79
Linted code, added encoding for opening files, and removed unused lib…
Mar 28, 2024
5243673
Linted code and moved doc strings under methods rather than above for…
Mar 28, 2024
7db0f6a
Linted code for csv_extractor.py
Mar 28, 2024
2ddf0a6
Linted code, changed conditional statements for PEP8 conformity, and …
Mar 28, 2024
c355d00
Linted code, corrected variable names that are similar to inbuilt ref…
Mar 28, 2024
acd033e
Linted code for json_to_csv file
Mar 28, 2024
f541100
Linted code, changed for loop to use enumerate rather than range and …
Mar 28, 2024
14ffecc
Linted code, changed conditions from != None to is not None for code …
Mar 28, 2024
2886aef
code linting for sparql file
Mar 28, 2024
6ad2cab
Code linting for Testing file
Mar 28, 2024
db86889
Update gitignore file to not consider DS_Store files and venv directo…
Mar 28, 2024
c33835a
Merge branch 'extraction_operative_keywords' into 'cellar'
Mar 28, 2024
b1f24ee
Unittests for operative part
venvis Apr 10, 2024
e88255c
Update operative_extractions.py
venvis Apr 11, 2024
09865ff
Update __init__.py
venvis Apr 11, 2024
0d7c77c
Update tests.py
venvis Apr 11, 2024
01f7d84
Update tests.py
venvis Apr 11, 2024
642dbe3
Delete cellar/cellar_extractor/Testing_file.py
venvis Apr 11, 2024
8db810d
Update operative_extractions.py
venvis Apr 11, 2024
3c32aaa
Update tests.py
venvis Apr 11, 2024
28858a6
Update tests.py
venvis Apr 11, 2024
98ea00c
Update tests.py
venvis Apr 11, 2024
8cae2fa
Update tests.py
venvis Apr 11, 2024
8503d8d
Update tests.py
venvis Apr 11, 2024
87271b8
Update tests.py
venvis Apr 11, 2024
c3daf22
Update tests.py
venvis Apr 11, 2024
428f466
Update tests.py
venvis Apr 11, 2024
b75fe86
Updated variable name from current to current_dir
Apr 17, 2024
2537752
Removed conditional statements and updated assert False statements
Apr 17, 2024
f82a23f
Removed unnecessary colon
Apr 17, 2024
106675f
Removed duplicate __init__ method, reordered import libraries
Apr 17, 2024
0dfd386
Corrected methods being called, extra import from operative_extractions
Apr 17, 2024
55b1178
Changed from enumerate to range and len
Apr 17, 2024
abcd5ef
Updated setup.py file for finding operative_extractions
Apr 17, 2024
25cdcf1
Correcting path to include everything under cellar_extractor in setup…
Apr 17, 2024
cd6e55e
F string changes and url changes for eurex website
venvis Apr 17, 2024
d22849e
Removed import cellar_extractor.operative_extractions
Apr 17, 2024
d89aaec
os module for tests.py
venvis Apr 24, 2024
9aaf2e3
Include json & csv directory for outputs
venvis Apr 24, 2024
e33cc9c
os configuration to include cellar/ for tests.py
venvis Apr 24, 2024
483c419
Update tests.py
shashankmc Apr 24, 2024
f8887d7
Update setup.py
shashankmc Apr 24, 2024
56f7845
Change to len(celex_store)-1 to avoid index out of range error.
shashankmc Apr 24, 2024
19fccdc
Added additional import to test.py file
shashankmc Apr 24, 2024
d57f015
Create __init__.py
venvis May 1, 2024
8045e9c
Shorten imports after adding __init__.py to cellar folder
venvis May 1, 2024
26d305d
Add extra import command
venvis May 1, 2024
5334381
Update tests.py
venvis May 1, 2024
e56d328
Update __init__.py
venvis May 1, 2024
dc9aba3
Update __init__.py
venvis May 1, 2024
1ca331b
Update __init__.py
venvis May 1, 2024
75b2978
Update tests.py
venvis May 1, 2024
e4c7236
Update __init__.py
venvis May 1, 2024
a7455ee
Update tests.py
venvis May 1, 2024
78b7140
add pip install -e
venvis May 1, 2024
8da60dc
Update github-actions.yml
venvis May 1, 2024
25caaf8
Add pip install -e cellar/
venvis May 1, 2024
c7bac51
Update README.md
venvis May 1, 2024
7548054
Update README.md
venvis May 1, 2024
f15870c
Update README.md
venvis May 1, 2024
15b7b04
Update README.md
venvis May 1, 2024
44bf4b8
Update README.md
venvis May 1, 2024
aa09b6f
Update README.md
venvis May 1, 2024
908d5e6
Update README.md
venvis May 1, 2024
3f00adb
Add for Analyzer and Writing Classes
venvis May 1, 2024
c92fbef
Add code for Analyzer class
venvis May 1, 2024
868b14c
Update README.md
venvis May 1, 2024
6368651
Update README.md
venvis May 1, 2024
2 changes: 1 addition & 1 deletion .github/workflows/github-actions.yml
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
- pip install cellar-extractor
+ pip install -e cellar/
# pip install echr-extractor
- run: echo "💡 The ${{ github.repository }} repository has been cloned to the runner."
- run: echo "🖥️ The workflow is now ready to test your code on the runner."
Expand Down
6 changes: 4 additions & 2 deletions .gitignore
@@ -1,4 +1,4 @@
- venv
+ .venv*
.idea
data
rechtspraak/rechtspraak_extractor/tests/data
Expand All @@ -20,4 +20,6 @@ rechtspraak.zip
build.bat
echr_extractor-whl.zip
echr_extractor-whl
- echr_extractor.egg-info
+ echr_extractor.egg-info
+
+ .*DS_Store
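
The effect of the two patterns added to `.gitignore` above can be approximated with Python's `fnmatch` module. This is only a rough sketch: Git's matching rules are richer than `fnmatch` (directory handling, negation, slashes), so it is meaningful for bare names only, and the helper `is_ignored` is a hypothetical name introduced for illustration.

```python
from fnmatch import fnmatchcase

# Patterns taken from the updated .gitignore shown above.
patterns = [".venv*", ".*DS_Store"]


def is_ignored(name: str) -> bool:
    """Return True if a bare file or directory name matches any pattern."""
    return any(fnmatchcase(name, pattern) for pattern in patterns)


for name in [".venv", ".venv311", "venv", ".DS_Store", "._DS_Store", "notes.txt"]:
    print(f"{name!r}: {is_ignored(name)}")
```

Under these two patterns alone, a directory named plain `venv` is not matched, since the old `venv` entry was replaced by `.venv*`.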
49 changes: 48 additions & 1 deletion cellar/README.md
Expand Up @@ -37,6 +37,13 @@ Python 3.9
<sub><b>gijsvd</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/venvis">
<img src="https://avatars.githubusercontent.com/venvis" width="100;" alt="venvis"/>
<br />
<sub><b>venvis</b></sub>
</a>
</td>
</tr>
</table>
<!-- readme: contributors,gijsvd -end -->
Expand All @@ -59,6 +66,16 @@ Python 3.9
Allows the creation of a network graph of the citations. Can only be returned in-memory.
<li><code>filter_subject_matter</code></li>
Returns a dataframe of cases only containing a certain phrase in the column containing the subject of cases.
<li><code>Analyzer</code></li>
A class whose instance, when called, returns a list of all the text contained in the operative part of a judgment of the Court of Justice of the European Union (CJEU, formerly known as the European Court of Justice (ECJ)) (English only).
<li><code>Writing</code></li>
A class that writes the text of the operative part of a European case-law case (English only) into CSV, JSON and TXT files (generated upon initialization).<br>
The <code>Writing</code> class has three methods:<br><br>
<ul>
<li><code>to_csv()</code> - Writes the operative part along with celex id into a csv file</li>
<li><code>to_json()</code> - Writes the operative part along with celex id into a json file</li>
<li><code>to_txt()</code> - Writes the operative part along with celex id into a txt file</li>
</ul>
<br>
</ol>

Expand Down Expand Up @@ -115,11 +132,22 @@ Python 3.9
<li><strong>phrase: string, required, default None</strong></li>
The phrase which has to be present in the subject matter of cases. Case insensitive.
</ul>
<li><code>Analyzer</code></li>
<ul>
<li><strong>celex id: str, required</strong></li>
<li>Passed to the constructor when the class is initialized</li>
</ul>
<li><code>Writing</code></li>
<ul>
<li><strong>celex id: str, required</strong></li>
<li>Passed to the constructor when the class is initialized</li>
</ul>

</ol>


## Examples
```
```python
import cellar_extractor as cell

Below are examples for in-file saving:
Expand All @@ -132,7 +160,26 @@ Below are examples for in-memory saving:
df = cell.get_cellar(save_file='n', file_format='csv', sd='2022-01-01', max_ecli=1000)
df,json = cell.get_cellar_extra(save_file='n', max_ecli=100, sd='2022-01-01', threads=10)
```
<p>Initialize the class with a CELEX id, then call the resulting instance; the call returns a list.</p>

```python
import cellar_extractor as cell
celex_id = "<celex id>"  # placeholder; replace with a real CELEX identifier
instance = cell.Analyzer(celex_id)
output_list = instance()
print(output_list)  # prints the operative part of the case as a list
```
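
Calling an instance as if it were a function relies on Python's `__call__` method. The sketch below illustrates only that pattern; `OperativePartStub` is a hypothetical stand-in that performs no network access and is not the library's `Analyzer`, and `"EXAMPLE-ID"` is not a real CELEX id.

```python
class OperativePartStub:
    """Hypothetical stand-in for a callable class; not the real Analyzer."""

    def __init__(self, celex_id: str) -> None:
        self.celex_id = celex_id

    def __call__(self) -> list[str]:
        # A real implementation would fetch and parse the judgment here.
        # This stub returns placeholder text so the pattern stays runnable.
        return [f"placeholder operative text for {self.celex_id}"]


instance = OperativePartStub("EXAMPLE-ID")  # not a real CELEX id
output_list = instance()  # invokes __call__
print(output_list)
```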


<p>The <code>Writing</code> class also takes a CELEX id through its constructor and writes the content of the operative part into different files, depending on the method called.</p>

```python
import cellar_extractor as cell
celex_id = "<celex id>"  # placeholder; replace with a real CELEX identifier
instance = cell.Writing(celex_id)
output = instance.to_csv()   # for csv
output = instance.to_txt()   # for txt
output = instance.to_json()  # for json
```
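
To make the three output formats concrete, here is a minimal sketch of a writer with the same method names, built on the standard library only. Everything in it is an assumption for illustration: `OperativeWriterStub`, its constructor arguments, the column names and the file naming are hypothetical and may differ from what `Writing` actually does.

```python
import csv
import json
import tempfile
from pathlib import Path


class OperativeWriterStub:
    """Hypothetical writer; mirrors the to_csv/to_json/to_txt idea only."""

    def __init__(self, celex_id: str, operative_part: list[str], out_dir: Path) -> None:
        self.celex_id = celex_id
        self.operative_part = operative_part
        self.out_dir = out_dir

    def to_csv(self) -> Path:
        path = self.out_dir / f"{self.celex_id}.csv"
        # newline="" is recommended by the csv module docs for writer files.
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["celex", "operative_part"])
            for paragraph in self.operative_part:
                writer.writerow([self.celex_id, paragraph])
        return path

    def to_json(self) -> Path:
        path = self.out_dir / f"{self.celex_id}.json"
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"celex": self.celex_id, "operative_part": self.operative_part}, f)
        return path

    def to_txt(self) -> Path:
        path = self.out_dir / f"{self.celex_id}.txt"
        path.write_text("\n".join([self.celex_id, *self.operative_part]), encoding="utf-8")
        return path


with tempfile.TemporaryDirectory() as tmp:
    stub = OperativeWriterStub("EXAMPLE-ID", ["First point.", "Second point."], Path(tmp))
    for path in (stub.to_csv(), stub.to_json(), stub.to_txt()):
        print(path.name, path.stat().st_size, "bytes")
```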

## License
[![License: Apache 2.0](https://img.shields.io/github/license/maastrichtlawtech/extraction_libraries)](https://opensource.org/licenses/Apache-2.0)
Expand Down
2 changes: 2 additions & 0 deletions cellar/__init__.py
@@ -0,0 +1,2 @@
from cellar_extractor import *

30 changes: 0 additions & 30 deletions cellar/cellar_extractor/Testing_file.py

This file was deleted.

4 changes: 3 additions & 1 deletion cellar/cellar_extractor/__init__.py
Expand Up @@ -2,5 +2,7 @@
from cellar_extractor.cellar import get_cellar_extra
from cellar_extractor.cellar import get_nodes_and_edges_lists
from cellar_extractor.cellar import filter_subject_matter
+ from cellar_extractor.operative_extractions import Analyzer
+ from cellar_extractor.operative_extractions import Writing
import logging
- logging.basicConfig(level=logging.INFO)
+ logging.basicConfig(level=logging.INFO)
15 changes: 8 additions & 7 deletions cellar/cellar_extractor/cellar.py
Expand Up @@ -11,7 +11,6 @@
from cellar_extractor.json_to_csv import json_to_csv_main, json_to_csv_returning
from cellar_extractor.nodes_and_edges import get_nodes_and_edges


def get_cellar(ed=None, save_file='y', max_ecli=100, sd="2022-05-01", file_format='csv'):
if not ed:
ed = datetime.now().isoformat(timespec='seconds')
Expand Down Expand Up @@ -40,7 +39,7 @@ def get_cellar(ed=None, save_file='y', max_ecli=100, sd="2022-05-01", file_forma
json_to_csv_main(all_eclis, file_path)
else:
file_path = os.path.join('data', file_name + '.json')
- with open(file_path, "w") as f:
+ with open(file_path, "w", encoding="utf-8") as f:
json.dump(all_eclis, f)
else:
if file_format == 'csv':
Expand All @@ -51,7 +50,8 @@ def get_cellar(ed=None, save_file='y', max_ecli=100, sd="2022-05-01", file_forma
logging.info("\n--- DONE ---")


- def get_cellar_extra(ed=None, save_file='y', max_ecli=100, sd="2022-05-01", threads=10, username="", password=""):
+ def get_cellar_extra(ed=None, save_file='y', max_ecli=100, sd="2022-05-01",
+ threads=10, username="", password=""):
if not ed:
ed = datetime.now().isoformat(timespec='seconds')
data = get_cellar(ed=ed, save_file='n', max_ecli=max_ecli, sd=sd, file_format='csv')
Expand All @@ -64,15 +64,16 @@ def get_cellar_extra(ed=None, save_file='y', max_ecli=100, sd="2022-05-01", thre
file_path = os.path.join('data', file_name + '.csv')
if save_file == 'y':
Path('data').mkdir(parents=True, exist_ok=True)
- extra_cellar(data=data, filepath=file_path, threads=threads, username=username, password=password)
+ extra_cellar(data=data, filepath=file_path, threads=threads,
+ username=username, password=password)
logging.info("\n--- DONE ---")

else:
- data, json = extra_cellar(data=data, threads=threads, username=username, password=password)
+ data, json_data = extra_cellar(data=data, threads=threads,
+ username=username, password=password)
logging.info("\n--- DONE ---")

- return data, json
-
+ return data,json_data

def get_nodes_and_edges_lists(df=None, only_local=False):
if df is None:
Expand Down
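
Two of the changes in `cellar.py` are easy to illustrate in isolation: opening files with an explicit `encoding="utf-8"`, and naming a result `json_data` so that it does not shadow the imported `json` module. The sketch below is a hedged illustration, not code from the library; `save_records` and `load_records` are hypothetical helpers, and `ensure_ascii=False` is an assumption added here so that the encoding visibly matters.

```python
import json
import tempfile
from pathlib import Path


def save_records(records: list[dict], file_path: Path) -> None:
    """Write records as JSON with a pinned encoding, not the platform default."""
    with open(file_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps characters such as "ê" as-is in the file.
        json.dump(records, f, ensure_ascii=False)


def load_records(file_path: Path) -> list[dict]:
    """Read the records back using the same encoding."""
    with open(file_path, "r", encoding="utf-8") as f:
        # Named json_data rather than json, so the module is not shadowed.
        json_data = json.load(f)
    return json_data


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.json"
    save_records([{"title": "Arrêt de la Cour"}], path)
    print(load_records(path))
```

Binding a local variable called `json` inside a function makes that name local for the whole function body, so any use of the `json` module earlier in the same function would fail; the rename avoids that class of bug.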
31 changes: 29 additions & 2 deletions cellar/cellar_extractor/cellar_extra_extract.py
Expand Up @@ -4,17 +4,44 @@


def extra_cellar(data=None, filepath=None, threads=10, username="", password=""):
"""
Extracts information from a cellar dataset.

Args:
data (pandas.DataFrame, optional): The input dataset. If not provided,
it will be read from the specified filepath.
filepath (str, optional): The path to the input dataset file. If provided,
the data will be read from this file.
threads (int, optional): The number of threads to use for parallel
processing. Default is 10.
username (str, optional): The username for accessing a separate
webservice. Default is an empty string.
password (str, optional): The password for accessing a separate
webservice. Default is an empty string.

Returns:
tuple: A tuple containing the modified dataset and a JSON object.

If `data` is not provided, the dataset will be read from the specified
`filepath`.

If `username` and `password` are provided, the function will add
citations using a separate webservice.

The function will add sections to the dataset using the specified
number of `threads`. If `filepath` is provided,
the modified dataset will be saved to the same file. Otherwise, the
modified dataset and a JSON object will be returned.
"""
if data is None:
data = read_csv(filepath)
if filepath:
if username !="" and password !="":
add_citations_separate_webservice(data, username, password)
- #print("Citations successfully added. The rest of additional extraction will now happen.")
add_sections(data, threads, filepath.replace(".csv", "_fulltext.json"))
data.to_csv(filepath, index=False)
else:
if username != "" and password != "":
add_citations_separate_webservice(data, username, password)
- #print("Citations successfully added. The rest of additional extraction will now happen.")
json = add_sections(data, threads)
return data, json
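
The docstring above describes two modes: with a `filepath` the results are written to disk and nothing is returned, and without one the data and a JSON object are returned. A minimal sketch of just that branching follows; `run` and `fulltext_path` are hypothetical names, the rows are plain dictionaries rather than a DataFrame, and the placeholder list stands in for whatever `add_sections` produces.

```python
from typing import Optional


def fulltext_path(csv_path: str) -> str:
    """Derive the companion JSON path the same way the diff does."""
    return csv_path.replace(".csv", "_fulltext.json")


def run(rows: list[dict], filepath: Optional[str] = None):
    """Hypothetical stand-in that mirrors only the branching of extra_cellar."""
    sections = [{"row": i} for i, _ in enumerate(rows)]  # placeholder result
    if filepath:
        # File mode: a real implementation would write both files here.
        print("would write", filepath, "and", fulltext_path(filepath))
        return None
    # In-memory mode: return the data together with the extracted sections.
    return rows, sections


print(fulltext_path("data/cellar.csv"))
print(run([{"celex": "EXAMPLE-ID"}]))
```

Note that `str.replace` substitutes every occurrence of `.csv` in the path, not only the final extension, which is worth keeping in mind for unusual file names.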
24 changes: 15 additions & 9 deletions cellar/cellar_extractor/cellar_queries.py
Expand Up @@ -48,18 +48,23 @@ def get_all_eclis(starting_date=None, ending_date=None):
return eclis


- def get_raw_cellar_metadata(eclis, get_labels=True, force_readable_cols=True, force_readable_vals=False):
+ def get_raw_cellar_metadata(eclis, get_labels=True, force_readable_cols=True,
+ force_readable_vals=False):
"""Gets cellar metadata

:param eclis: The ECLIs for which to retrieve metadata
:type eclis: list[str]
- :param get_labels: Flag to get human-readable labels for the properties, defaults to True
+ :param get_labels: Flag to get human-readable labels for the properties,
+ defaults to True
:type get_labels: bool, optional
- :param force_readable_cols: Flag to remove any non-labelled properties from the resulting dict, defaults to True
+ :param force_readable_cols: Flag to remove any non-labelled properties
+ from the resulting dict, defaults to True
:type force_readable_cols: bool, optional
- :param force_readable_vals: Flag to remove any non-labelled values from the resulting dict, defaults to False
+ :param force_readable_vals: Flag to remove any non-labelled values from
+ the resulting dict, defaults to False
:type force_readable_vals: bool, optional
- :return: Dictionary containing metadata. Top-level keys are ECLIs, second level are property names
+ :return: Dictionary containing metadata. Top-level keys are ECLIs, second
+ level are property names
:rtype: Dict[str, Dict[str, list[str]]]
"""

Expand Down Expand Up @@ -100,8 +105,8 @@ def get_raw_cellar_metadata(eclis, get_labels=True, force_readable_cols=True, fo
for ecli in eclis:
metadata[ecli] = {}

- # Take each triple, check which source doc it belongs to, key/value pair into its dict derived from the p and o in
- # the query
+ # Take each triple, check which source doc it belongs to, key/value pair
+ # into its dict derived from the p and o in the query
for res in ret['results']['bindings']:
ecli = res['ecli']['value']
# We only want cdm predicates
Expand All @@ -125,8 +130,9 @@ def get_raw_cellar_metadata(eclis, get_labels=True, force_readable_cols=True, fo
else:
val = res['o']['value']

- # We store the values for each property in a list. For some properties this is not necessary,
- # but if a property can be assigned multiple times, this is important. Notable, for example is citations.b
+ # We store the values for each property in a list. For some properties
+ # this is not necessary, but if a property can be assigned multiple
+ # times, this is important. Notable, for example is citations.
if key in metadata[ecli]:
metadata[ecli][key].append(val)
else:
Expand Down
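
The comments in this hunk describe storing every property value in a list, keyed first by ECLI and then by property name. A compact sketch of that accumulation is shown here, using `dict.setdefault` in place of the `if key in ... else` branch. The function name `collect_metadata` and the sample bindings are made up for illustration; only the `ecli`/`p`/`o` shape is borrowed from the code above.

```python
def collect_metadata(eclis, bindings):
    """Group (property, value) pairs per ECLI, keeping every value in a list."""
    metadata = {ecli: {} for ecli in eclis}
    for res in bindings:
        ecli = res["ecli"]["value"]
        key = res["p"]["value"]
        val = res["o"]["value"]
        # setdefault creates the list the first time a key is seen.
        metadata[ecli].setdefault(key, []).append(val)
    return metadata


# Made-up bindings shaped like a SPARQL JSON result; none of the values are real.
sample = [
    {"ecli": {"value": "ECLI:X:1"}, "p": {"value": "cites"}, "o": {"value": "doc-a"}},
    {"ecli": {"value": "ECLI:X:1"}, "p": {"value": "cites"}, "o": {"value": "doc-b"}},
    {"ecli": {"value": "ECLI:X:2"}, "p": {"value": "date"}, "o": {"value": "2022-01-01"}},
]
print(collect_metadata(["ECLI:X:1", "ECLI:X:2"], sample))
```

A property that occurs more than once, such as the hypothetical `cites` key, ends up with all of its values in one list, which is the case the original comment singles out for citations.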