A Python package for parsing and analyzing Google Scholar case law and Google Patents data. It extracts, parses, and analyzes legal documents from Google Scholar's case law database, with additional support for patent information extraction and analysis.
- Extract comprehensive case information, including:
  - Full case name and citation
  - Court information and jurisdiction
  - Decision date
  - Judge names and personal opinions (concurrences and dissents)
  - Case numbers and docket information
  - Bluebook citations
  - Footnotes and references
  - Page numbers and structure
- Extract and analyze patent information:
  - Identify patents-in-suit
  - Parse patent claims with flexible output options:
    - Get just claim text and numbers
    - Process without file I/O
    - Full patent data extraction
  - Track claim citations
  - Handle patent application numbers
  - Link to USPTO data
  - Monitor patent transaction history
- Semantic parsing of legal documents
- Bluebook citation formatting and validation
- Court and jurisdiction identification
- Structured data extraction
- Support for multiple citation formats
- Footnote and reference management
- JSON serialization of parsed data
- CSV export capabilities
- Structured data organization
- Thread-safe concurrent processing
- Support for batch processing
- Caching and data persistence
- Clone the repository:

```bash
git clone https://github.com/alirezabehtash/gcl.git
cd gcl
```

- Create and activate a virtual environment:

```bash
python3 -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate
```

- Install the package in development mode:

```bash
pip install -e .
```
The package requires the following main dependencies:
- beautifulsoup4 (≥4.12.2) - HTML parsing
- requests (≥2.31.0) - HTTP requests
- selenium (≥4.0.0) - Web scraping
- reporters-db (≥3.2.56) - Legal reporter database
- Additional dependencies listed in requirements.txt
The package uses Selenium for web scraping. To start the Selenium container:

```bash
docker-compose up -d
```
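If you need to recreate the Compose file, a minimal standalone-Chrome service along these lines should work; note that the service name and image tag here are assumptions, not taken from the repository's `docker-compose.yml`:

```yaml
services:
  selenium:
    image: selenium/standalone-chrome:latest
    ports:
      - "4444:4444"  # WebDriver endpoint
    shm_size: "2g"   # Chrome needs more shared memory than the Docker default
```

Once the container is up, the WebDriver endpoint is available at `http://localhost:4444`.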
```python
from gcl import GCLParse
from pathlib import Path

# Initialize parser
data_dir = Path("data")
parser = GCLParse(data_dir=data_dir)

# Parse a case law
case_law_url = "https://scholar.google.com/scholar_case?case=9862061449582190482"
case_data = parser.gcl_parse(
    case_law_url,
    skip_patent=False,  # Include patent data
    return_data=True,   # Return the parsed data
)

# Access parsed data
print(f"Case Name: {case_data['full_case_name']}")
print(f"Court: {case_data['court']}")
print(f"Date: {case_data['date']}")
print(f"Judges: {case_data['judges']}")

# Get Bluebook citation
citation = parser.gcl_citor(case_law_url)
print(f"Bluebook Citation: {citation}")

# Get citation summary
summary = parser.gcl_citation_summary(case_data['id'])
print(f"Citation Summary: {summary}")
```
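Since the parsed output is a plain dictionary, persisting it yourself is ordinary `json` work. A minimal sketch follows; the record below is illustrative sample data (not real parser output), with field names taken from the quick-start example above:

```python
import json
from pathlib import Path

# Illustrative record using the field names shown above (sample data only)
case_data = {
    "id": "9862061449582190482",
    "full_case_name": "Example Corp. v. Sample LLC",
    "court": "United States Court of Appeals for the Federal Circuit",
    "date": "2020-01-15",
    "judges": ["Example"],
}

# Write one JSON file per case, keyed by case ID
out = Path("data") / "json" / f"{case_data['id']}.json"
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(case_data, indent=2))

# Round-trip check
restored = json.loads(out.read_text())
assert restored == case_data
```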
```python
from gcl import GooglePatents

# Initialize patent parser
gp = GooglePatents(data_dir="data")

# Get full patent data
found, patent_data = gp.patent_data(
    "US7654321",
    return_data=["patent_number", "title", "abstract", "claims"],
)

# Get just the claims
found, claims = gp.patent_data(
    "US7654321",
    just_claims=True,  # Returns only claim numbers and their text
)

# Process without saving to disk
found, claims = gp.patent_data(
    "US7654321",
    just_claims=True,
    no_save=True,  # Prevents reading from or writing to files
)
```
```python
# Parse a case with patent information
case_data = parser.gcl_parse(
    case_law_url,
    skip_patent=False,
    skip_application=False,
)

# Access patent information
for patent in case_data['patents_in_suit']:
    print(f"Patent Number: {patent['patent_number']}")
    print(f"Application Number: {patent['application_number']}")
    print(f"Cited Claims: {patent['cited_claims']}")
```
```python
# Create a list of case summaries
parser.gcl_make_list("case_summaries")

# Bundle citations from multiple cases
parser.gcl_bundle_cites(blue_citation=True)
```
Both the `GCLParse` and `GooglePatents` classes are thread-safe and support concurrent downloads. Each instance maintains its own thread-local storage to prevent data leakage between threads.
```python
import json
from concurrent.futures import ThreadPoolExecutor

from gcl import GCLParse

# List of case URLs or IDs to process
case_urls = [
    "https://scholar.google.com/scholar_case?case=9862061449582190482",
    "https://scholar.google.com/scholar_case?case=4398438352003003603",
    # ... more cases
]

# Initialize parser
parser = GCLParse(data_dir="data")

# Process cases in parallel; each worker thread gets its own
# isolated thread-local storage inside the parser
with ThreadPoolExecutor(max_workers=4) as executor:
    executor.map(parser.gcl_parse, case_urls)

# Access results (case IDs are the trailing `case=` values in the URLs)
case_ids = [url.split("case=")[-1] for url in case_urls]
for case_id in case_ids:
    with open(f"data/json/json_v1.3.3/{case_id}.json") as f:
        case_data = json.load(f)
    print(f"Case {case_id}: {case_data['full_case_name']}")
```
The thread-safe design ensures that:
- each thread has its own isolated storage,
- no data leaks between concurrent downloads,
- multiple cases can be processed safely in parallel, and
- file I/O operations are thread-safe.
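The isolation described above is the standard `threading.local` pattern from the Python standard library. A minimal, self-contained sketch (independent of gcl) of how per-thread state stays separate:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

local = threading.local()  # each thread sees its own attributes on this object
results = {}

def worker(case_id):
    # Assignments to `local` in one thread are invisible to the others,
    # so concurrent workers never clobber each other's state.
    local.current = case_id
    results[case_id] = (local.current == case_id)
    return local.current

ids = [f"case-{i}" for i in range(8)]
with ThreadPoolExecutor(max_workers=4) as executor:
    out = list(executor.map(worker, ids))

assert out == ids             # each worker returned its own value
assert all(results.values())  # no cross-thread leakage observed
```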
The package can identify and extract personal opinions (concurrences and dissents):
```python
if case_data['personal_opinions']['concur']:
    for opinion in case_data['personal_opinions']['concur']:
        print(f"Concurring Judge: {opinion['judge']}")
        print(f"Opinion Location: {opinion['index_span']}")
```
Extract and analyze citation networks within cases:
```python
# Get all citations in a case, keyed by cited case ID
citations = case_data['cites_to']
for cited_case_id, cites in citations.items():
    print(f"Cited Case: {cited_case_id}")
    for citation in cites:
        print(f"Citation Format: {citation['citation']}")
```
The package includes a comprehensive test suite:
- Ensure the package is installed in development mode:

```bash
pip install -e .
```

- Run the tests:

```bash
python -m unittest tests/test_main.py
```
The test suite covers:
- Case law parsing accuracy
- Citation formatting
- Patent data extraction
- Data structure validation
- JSON serialization
- Error handling
- Thread safety and concurrent processing
- Memory isolation between threads
```
gcl/
├── gcl/
│   ├── __init__.py
│   ├── main.py                    # Core parsing functionality
│   ├── google_patents_scrape.py   # Patent data extraction
│   ├── uspto_api.py               # USPTO API integration
│   ├── proxy.py                   # Proxy management
│   ├── regexes.py                 # Regular expressions
│   ├── settings.py                # Configuration
│   ├── version.py                 # Version information
│   └── utils.py                   # Utility functions
├── tests/
│   ├── test_main.py
│   └── test_files/                # Test case data
├── docker-compose.yml             # Docker configuration
└── requirements.txt               # Dependencies
```
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
Alireza Behtash
Copyright (c) 2025 Alireza Behtash