diff --git a/.github/workflows/claude-issue-comment.yml b/.github/workflows/claude-issue-comment.yml
deleted file mode 100644
index 52abfa2..0000000
--- a/.github/workflows/claude-issue-comment.yml
+++ /dev/null
@@ -1,29 +0,0 @@
-name: Claude Assistant - Pull Request Comment Review
-
-on:
-  issue_comment:
-    types: [created]
-
-permissions:
-  issues: write
-  pull-requests: write
-  contents: read
-
-jobs:
-  claude-response:
-    runs-on: ubuntu-latest
-    environment: prod
-    steps:
-      - name: Checkout repository
-        uses: actions/checkout@v4
-        with:
-          fetch-depth: 0 # 전체 히스토리 클론 (0으로 설정)
-
-      - name: Run Claude Code Action
-        uses: anthropics/claude-code-action@beta
-        with:
-          claude_code_oauth_token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
-          github_token: ${{ secrets.GITHUB_TOKEN }}
-          custom_instructions: |
-            당신은 한국어 코드 리뷰어입니다. 반드시 한국어로만 응답해야 합니다.
-            이전 대화 내용에 기반하여 체계적으로 정리해 주세요.
diff --git a/.github/workflows/claude-issue.yml b/.github/workflows/claude-issue.yml
deleted file mode 100644
index f4498f6..0000000
--- a/.github/workflows/claude-issue.yml
+++ /dev/null
@@ -1,34 +0,0 @@
-name: Claude Assistant - Issue Review
-
-on:
-  issues:
-    types: [labeled]
-
-permissions:
-  issues: write
-  pull-requests: write
-  contents: read
-
-jobs:
-  claude-response:
-    runs-on: ubuntu-latest
-    environment: prod
-    steps:
-      - name: Checkout repository
-        uses: actions/checkout@v4
-        with:
-          fetch-depth: 0 # 전체 히스토리 클론 (0으로 설정)
-
-      - name: Run Claude Code Action
-        uses: anthropics/claude-code-action@beta
-        with:
-          claude_code_oauth_token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
-          github_token: ${{ secrets.GITHUB_TOKEN }}
-          label_trigger: "claude-review"
-          custom_instructions: |
-            당신은 한국어 코드 리뷰어입니다. 반드시 한국어로만 응답해야 합니다.
-            이슈와 관련하여 다음 항목을 구조적으로 간략하게 정리해 주세요.
-
-            1. 기능 목적: 이 기능이 필요한 이유와 기대 효과를 간략하게 설명해 주세요.
-            2. 영향 범위: 개발 시 영향을 받는 주요 컴포넌트, 모듈, 또는 시스템 범위를 구체적으로 나열해 주세요. 읽기 쉽게 표나 목록으로 정리해 주세요.
-            3. 구현 전략: 실제 개발을 위한 단계별 구현 방안 및 전략을 제시해 주세요. 읽기 쉽게 표나 목록으로 정리해 주세요.
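The deploy workflows later in this diff guard against unset repository secrets with a shell check (`if [ -z "$DB_URL" ] || [ -z "$DB_USERNAME" ] || [ -z "$DB_PASSWORD" ]; then ... exit 1; fi`). A minimal sketch of the same guard in Python — `missing_env` is a hypothetical helper, not part of this repository:

```python
import os

# Hypothetical helper mirroring the workflow's shell guard:
#   if [ -z "$DB_URL" ] || [ -z "$DB_USERNAME" ] || [ -z "$DB_PASSWORD" ]; then exit 1; fi
REQUIRED_SECRETS = ["DB_URL", "DB_USERNAME", "DB_PASSWORD"]


def missing_env(names):
    """Return the names that are unset or empty in the environment."""
    return [n for n in names if not os.environ.get(n)]


missing = missing_env(REQUIRED_SECRETS)
if missing:
    print(f"Error: missing secrets: {', '.join(missing)}")
```

Failing fast with an explicit list of missing names makes a broken secrets configuration obvious in the job log instead of surfacing later as a malformed database URL.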
diff --git a/.github/workflows/claude-pr.yml b/.github/workflows/claude-pr.yml
deleted file mode 100644
index e656e5f..0000000
--- a/.github/workflows/claude-pr.yml
+++ /dev/null
@@ -1,83 +0,0 @@
-name: Claude Assistant - Pull Request Review
-
-on:
-  pull_request_review:
-    types: [submitted]
-  pull_request_review_comment:
-    types: [created]
-
-permissions:
-  issues: write
-  pull-requests: write
-  contents: read
-
-jobs:
-  claude-response:
-    runs-on: ubuntu-latest
-    environment: prod
-    steps:
-      - name: Checkout repository
-        uses: actions/checkout@v4
-        with:
-          fetch-depth: 0 # 전체 히스토리 클론 (0으로 설정)
-
-      - name: Run Claude Code Action
-        uses: anthropics/claude-code-action@beta
-        with:
-          claude_code_oauth_token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
-          github_token: ${{ secrets.GITHUB_TOKEN }}
-          custom_instructions: |
-            당신은 코드 전문가이며 PR 분석 전문 리뷰어입니다.
-            코드 변경 사항에 대해 철저하고 건설적이며 실행 가능한 피드백을 제공하는 것이 여러분의 역할입니다.
-            git branch checkout은 허용하지 않습니다.
-            gh pr 명령어를 사용하여 PR의 변경 사항을 분석하고, 코드 품질, 보안, 성능 등을 종합적으로 평가합니다.
-
-            1. **코드 품질 분석**
-            - 잠재적 버그, 예외 사례 또는 논리 오류를 식별
-            - 변경 사항이 성능에 미치는 영향을 평가
-            - 보안 취약성 또는 위험을 평가
-
-            2. **건설적인 피드백 제공**
-            - 잘된 부분에 대한 긍정적인 관찰로 시작
-            - 특정 줄 참조에서 발견된 문제점을 명확하게 설명
-            - 코드 예시를 통해 문제점을 해결할 수 있는 구체적인 개선 사항을 제안
-            - 변경이 필요한 부분에서는 심각도(🔴반드시 변경 필요, 🟡변경 권장, 🟢개선 제안)에 따라 피드백의 우선순위를 지정
-
-            3. **맥락 고려**
-            - 변경 사항의 목적과 범위를 이해
-            - 구현이 명시된 목표와 일치하는지 평가
-            - 코드베이스의 다른 부분에 대한 잠재적 영향을 확인
-
-            4. **리뷰 구성**
-            - ✅ 긍정적인 부분
-            - ⚠️ 변경이 필요한 부분
-              - [해당하는 코드 링크 포함]
-            - 🔍 상세 리뷰
-              - [중요도 순으로 파일별 상세 변경사항 리뷰와 해당하는 코드 링크]
-            - 💪 추가 권장 사항
-              - [pr 변경사항에 대해 변경 강조해야 할 부분 중요도 순으로 제시]
-
-            전체 코드베이스를 검토하라는 명시적인 요청을 받지 않는 한, 최근 수정된 코드에 집중
-            항상 전문적이고 도움이 되는 어조를 유지하여 학습과 개선을 장려
-            완전한 리뷰를 제공하기 위해 더 많은 맥락이나 정보가 필요한 경우, 사전에 명확한 설명을 요청
-
-            항상 리뷰 코멘트 형식을 따르고 리뷰 내용 전달은 `gh pr comment $ARGUMENTS --body "리뷰 내용"` 명령어를 사용
-            리뷰를 해당 PR에 게시해야 합니다.
-
-            ### 리뷰 코멘트 형식:
-            ```markdown
-            ## 🤖 자동 PR 리뷰
-
-            ### ✅ 긍정적인 부분
-            - [구체적인 장점들]
-
-            ### ⚠️ 변경이 필요한 부분
-            - [변경이 필요한 부분들]
-
-            ### 🔍 상세 리뷰
-            #### 파일별 리뷰
-            - **[파일명]**: [상세 리뷰 내용 및 파일 링크]
-
-            ### 💪 추가 권장 사항
-            - [pr 변경사항에 대해 변경 강조해야 할 부분 중요도 순으로 제시]
-            ```
diff --git a/.github/workflows/demigrate.yml b/.github/workflows/demigrate.yml
index f61f201..66e2016 100644
--- a/.github/workflows/demigrate.yml
+++ b/.github/workflows/demigrate.yml
@@ -24,8 +24,10 @@ jobs:
       - name: Deploy to EC2
         env:
           SSH_KEY: ${{ secrets.EC2_SSH_KEY }}
-          HOST: ${{ secrets.EC2_STG_HOST }}
-          DB_URL: ${{ secrets.DB_URL }}
+          HOST: ${{ secrets.EC2_STG_HOST }} # Assuming STG host is for RDS
+          DB_URL: ${{ secrets.DB_URL }} # DB_URL secret now holds the DB host
+          DB_USERNAME: ${{ secrets.DB_USERNAME }}
+          DB_PASSWORD: ${{ secrets.DB_PASSWORD }}
         run: |
           # Setup SSH
           mkdir -p ~/.ssh
@@ -33,37 +35,25 @@ jobs:
           echo "$SSH_KEY" > key.pem
           chmod 600 key.pem

-          # Validate DB_URL secret
-          if [ -z "$DB_URL" ]; then
-            echo "Error: DB_URL secret is not set or is empty. Please configure the DB_URL secret in your repository settings."
+          # Validate DB secrets
+          if [ -z "$DB_URL" ] || [ -z "$DB_USERNAME" ] || [ -z "$DB_PASSWORD" ]; then
+            echo "Error: One or more DB secrets (DB_URL, DB_USERNAME, DB_PASSWORD) are not set or are empty. Please configure them in your repository settings."
             exit 1
           fi

           # Define project path on EC2
           PROJECT_PATH="/home/ubuntu/BeyondU-Data"

-          # SSH to EC2 and run demigration script
-          ssh -i key.pem ubuntu@$HOST "
-            # Navigate to the project directory
-            cd $PROJECT_PATH
-
-            # Pull the latest changes
-            git pull
-
-            # Set up python virtual environment and install dependencies
-            python3 -m venv venv
-            source venv/bin/activate
-            pip install -r requirements.txt
+          # Construct the full DATABASE_URL for MySQL
-            if [ "${{ github.event.inputs.confirm_drop_db }}" != "true" ]; then
+          # Assuming default MySQL port 3306 and a database named 'beyondu' (or your application DB) on RDS
-            echo "Database drop not confirmed. Exiting."
+          FULL_DATABASE_URL="mysql+mysqlconnector://${DB_USERNAME}:${DB_PASSWORD}@${DB_URL}:3306/beyondu"
-            exit 1
-
-            fi
-
-            # Run the ETL script with --drop-db to clear all data, passing DATABASE_URL as environment variable
+          # SSH to EC2 and run demigration script
+          ssh -i key.pem ubuntu@$HOST "
+            # ... (inside SSH block) ...
-            DATABASE_URL="$DB_URL" python scripts/run_etl.py --drop-db
+            # Run the ETL script with --drop-db to clear all data, passing DATABASE_URL as environment variable
+            DATABASE_URL=\"$FULL_DATABASE_URL\" python scripts/run_etl.py --drop-db
           "
diff --git a/.github/workflows/migrate.yml b/.github/workflows/migrate.yml
index 7adbac2..037e59d 100644
--- a/.github/workflows/migrate.yml
+++ b/.github/workflows/migrate.yml
@@ -21,8 +21,10 @@ jobs:
       - name: Deploy to EC2
         env:
           SSH_KEY: ${{ secrets.EC2_SSH_KEY }}
-          HOST: ${{ secrets.EC2_STG_HOST }}
-          DB_URL: ${{ secrets.DB_URL }}
+          HOST: ${{ secrets.EC2_STG_HOST }} # Assuming STG host is for RDS
+          DB_URL: ${{ secrets.DB_URL }} # DB_URL secret now holds the DB host
+          DB_USERNAME: ${{ secrets.DB_USERNAME }}
+          DB_PASSWORD: ${{ secrets.DB_PASSWORD }}
         run: |
           # Setup SSH
           mkdir -p ~/.ssh
@@ -30,33 +32,104 @@ jobs:
           echo "$SSH_KEY" > key.pem
           chmod 600 key.pem

-          # Validate DB_URL secret
-          if [ -z "$DB_URL" ]; then
-            echo "Error: DB_URL secret is not set or is empty. Please configure the DB_URL secret in your repository settings."
-            exit 1
-          fi
+          # Validate DB secrets
-          # Define project path on EC2
-          PROJECT_PATH="/home/ubuntu/BeyondU-Data"
+          if [ -z "$DB_URL" ] || [ -z "$DB_USERNAME" ] || [ -z "$DB_PASSWORD" ]; then
-          # SSH to EC2 and run migration script
-          ssh -i key.pem ubuntu@$HOST "
+            echo "Error: One or more DB secrets (DB_URL, DB_USERNAME, DB_PASSWORD) are not set or are empty. Please configure them in your repository settings."
-            # Navigate to the project directory
-            cd $PROJECT_PATH
+            exit 1
-            # Pull the latest changes
-            git pull
+          fi
+
-            # Set up python virtual environment and install dependencies
-            python3 -m venv venv
-            source venv/bin/activate
-            pip install -r requirements.txt
-            if [ "${{ github.event.inputs.confirm_migration }}" != "true" ]; then
-              echo "Database migration not confirmed. Exiting."
-              exit 1
-            fi
-            # Run the ETL script with DATABASE_URL as environment variable
-            DATABASE_URL="$DB_URL" python scripts/run_etl.py --input data/raw --latest-only --init-db
-          "
\ No newline at end of file
+          # Define project path on EC2
+
+          PROJECT_PATH="/home/ubuntu/BeyondU-Data"
+
+
+          # Construct the full DATABASE_URL for MySQL
+          # Assuming default MySQL port 3306 and a database named 'beyondu' (or your application DB) on RDS
+          FULL_DATABASE_URL="mysql+mysqlconnector://${DB_USERNAME}:${DB_PASSWORD}@${DB_URL}:3306/beyondu"
+
+
+          # SSH to EC2 and run migration script
+
+          ssh -i key.pem ubuntu@$HOST "
+
+            # Check if project directory exists, if not clone it
+
+            if [ ! -d \"$PROJECT_PATH\" ]; then
+
+              echo \"Cloning repository to $PROJECT_PATH\"
+
+              git clone https://github.com/${{ github.repository }}.git \"$PROJECT_PATH\"
+
+            fi
+
+
+            # Navigate to the project directory
+
+            cd \"$PROJECT_PATH\"
+
+
+            # Pull the latest changes and ensure the correct branch is checked out
+
+            git fetch origin
+
+            git checkout "${{ github.ref_name }}" # Checkout the branch this workflow is running on
+
+            git reset --hard "origin/${{ github.ref_name }}" # Force update to the latest remote commit of the current branch
+
+            git clean -fdx # Clean up untracked files and directories, just in case
+
+
+            # Debug: Remove .pyc files and inspect src/config.py on EC2
+
+            find src -name "*.pyc" -delete
+
+            ls -la src/config.py
+
+            cat src/config.py
+
+
+            # Install python3-venv if not already installed (idempotent)
+
+            sudo apt-get update && sudo apt-get install -y python3-venv
+
+
+            # Set up python virtual environment and install dependencies
+
+            python3 -m venv venv
+
+            source venv/bin/activate
+
+            pip install -r requirements.txt
+
+
+            if [ \"${{ github.event.inputs.confirm_migration }}\" != \"true\" ]; then
+
+              echo \"Database migration not confirmed. Exiting.\"
+
+              exit 1
+
+            fi
+
+
+            # Run the ETL script with DATABASE_URL as environment variable
+
+            DATABASE_URL=\"$FULL_DATABASE_URL\" python scripts/run_etl.py --input data/raw --latest-only --init-db
+
+          "
+
\ No newline at end of file
diff --git a/pyproject.toml b/pyproject.toml
index 0356fb8..de5d1c9 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -6,14 +6,14 @@ requires-python = ">=3.11"

 [tool.ruff]
 line-length = 100
-target-version = "py311"
+target-version = "py38"

 [tool.ruff.lint]
 select = ["E", "F", "I", "N", "W", "UP"]
 ignore = ["E501"]

 [tool.mypy]
-python_version = "3.11"
+python_version = "3.8"
 strict = true
 ignore_missing_imports = true
diff --git a/requirements.txt b/requirements.txt
index 667c45f..3c602e2 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -4,7 +4,7 @@ openpyxl>=3.1.0
 numpy>=1.24.0

 # Database
-psycopg2-binary>=2.9.0
+mysql-connector-python>=8.0.0
 sqlalchemy>=2.0.0
 alembic>=1.12.0
diff --git a/scripts/run_etl.py b/scripts/run_etl.py
index 1e57764..84155f7 100644
--- a/scripts/run_etl.py
+++ b/scripts/run_etl.py
@@ -3,6 +3,7 @@
 import argparse
 import sys
 from pathlib import Path
+from typing import Dict, List, Optional, Union, Tuple

 # Add project root to path
 sys.path.insert(0, str(Path(__file__).parent.parent))
@@ -20,7 +21,7 @@ def process_file(
     file_path: Path,
     loader: DatabaseLoader,
     dry_run: bool = False,
-) -> dict[str, int]:
+) -> Dict[str, int]:
     """Process a single Excel file through the ETL pipeline."""
     logger.info(f"Processing: {file_path.name}")
diff --git a/src/config.py b/src/config.py
index cb6a6b6..882bf34 100644
--- a/src/config.py
+++ b/src/config.py
@@ -1,6 +1,7 @@
 """Configuration settings for the ETL pipeline."""

 from pathlib import Path
+from typing import List

 from pydantic_settings import BaseSettings

@@ -9,7 +10,7 @@ class Settings(BaseSettings):
     """Application settings loaded from environment variables."""

     # Database
-    database_url: str = "sqlite:///test.db"
+    database_url: str = "postgresql+psycopg2://user:password@host:5432/dbname"

     # AWS
     aws_access_key_id: str = ""
@@ -22,7 +23,7 @@ class Settings(BaseSettings):
     processed_data_dir: Path = Path("data/processed")

     # ETL Settings
-    excluded_institutions: list[str] = ["SAF", "ACUCA"]
+    excluded_institutions: List[str] = ["SAF", "ACUCA"]

     # Environment
     env: str = "development"
diff --git a/src/extract/excel_reader.py b/src/extract/excel_reader.py
index 8cfd65d..9dd1fa4 100644
--- a/src/extract/excel_reader.py
+++ b/src/extract/excel_reader.py
@@ -2,7 +2,7 @@

 import re
 from pathlib import Path
-from typing import Any, cast
+from typing import Any, Dict, List, Optional, Tuple, Union, cast

 import pandas as pd
 from openpyxl import load_workbook
@@ -47,11 +47,11 @@ class ExcelReader:
         "웹사이트": "website_url",
     }

-    def __init__(self, file_path: str | Path):
+    def __init__(self, file_path: Union[str, Path]):
         self.file_path = Path(file_path)
         self._workbook = None

-    def read(self, sheet_name: str | None = None) -> pd.DataFrame:
+    def read(self, sheet_name: Union[str, None] = None) -> pd.DataFrame:
         """
         Read Excel file and return DataFrame with merged cells resolved.
         """
@@ -101,31 +101,31 @@ def read(self, sheet_name: str | None = None) -> pd.DataFrame:
         df = self._clean_dataframe(df)
         return df

-    def _extract_with_merged_cells(self, ws: Any, merged_ranges: list[Any]) -> list[list[Any]]:
-        merged_cell_map: dict[tuple[int, int], Any] = {}
+    def _extract_with_merged_cells(self, ws: Any, merged_ranges: List[Any]) -> List[List[Any]]:
+        merged_cell_map: Dict[Tuple[int, int], Any] = {}
         for merged_range in merged_ranges:
             min_row, min_col = merged_range.min_row, merged_range.min_col
             value = ws.cell(row=min_row, column=min_col).value
             for row in range(merged_range.min_row, merged_range.max_row + 1):
                 for col in range(merged_range.min_col, merged_range.max_col + 1):
                     merged_cell_map[(row, col)] = value
-        data: list[list[Any]] = []
+        data: List[List[Any]] = []
         for row_idx, row in enumerate(ws.iter_rows(), start=1):
-            row_data: list[Any] = []
+            row_data: List[Any] = []
             for col_idx, cell in enumerate(row, start=1):
                 value = merged_cell_map.get((row_idx, col_idx)) if isinstance(cell, MergedCell) else cell.value
                 row_data.append(value)
             data.append(row_data)
         return data

-    def _find_header_row(self, data: list[list[Any]]) -> int | None:
+    def _find_header_row(self, data: List[List[Any]]) -> Union[int, None]:
         for i, row in enumerate(data[:10]):
             row_str = " ".join(str(x) for x in row if x)
             if any(keyword in row_str for keyword in self.HEADER_KEYWORDS):
                 return i
         return None

-    def _is_header_continuation(self, row: list[Any]) -> bool:
+    def _is_header_continuation(self, row: List[Any]) -> bool:
         non_empty = [x for x in row if x and str(x).strip()]
         if len(non_empty) <= 5:
             for val in non_empty:
@@ -133,7 +133,7 @@ def _is_header_continuation(self, row: list[Any]) -> bool:
                 return True
         return False

-    def _merge_headers(self, header1: list[Any], header2: list[Any]) -> list[Any]:
+    def _merge_headers(self, header1: List[Any], header2: List[Any]) -> List[Any]:
         merged = []
         for h1, h2 in zip(header1, header2):
             if h1 and str(h1).strip():
@@ -180,13 +180,13 @@ def _clean_dataframe(self, df: pd.DataFrame) -> pd.DataFrame:
             df[col] = df[col].apply(lambda x: str(x).strip() if pd.notna(x) else x)
         return df.reset_index(drop=True)

-    def get_sheet_names(self) -> list[str]:
+    def get_sheet_names(self) -> List[str]:
         wb = load_workbook(self.file_path, read_only=True)
         names = wb.sheetnames
         wb.close()
-        return cast(list[str], names)
+        return cast(List[str], names)

-    def extract_file_metadata(self) -> dict[str, str | None]:
+    def extract_file_metadata(self) -> Dict[str, Optional[str]]:
         filename = self.file_path.stem
         semester_match = re.search(r"(\d{4})-?(\d)", filename)
         semester = f"{semester_match.groups()[0]}-{semester_match.groups()[1]}" if semester_match else None
diff --git a/src/load/database.py b/src/load/database.py
index a61b74a..a6337d4 100644
--- a/src/load/database.py
+++ b/src/load/database.py
@@ -1,6 +1,6 @@
 """Database operations for loading processed data."""

-from typing import Any
+from typing import Any, Dict, List, Optional

 import pandas as pd
 from sqlalchemy import create_engine, delete, select
@@ -55,15 +55,18 @@ class DatabaseLoader:
         "칠레": "남미",
     }

-    def __init__(self, database_url: str | None = None):
+    def __init__(self, database_url: Optional[str] = None):
         self.database_url = database_url or settings.database_url
+        print(f"DEBUG: DATABASE_URL = {self.database_url}, type = {type(self.database_url)}")
+        if not self.database_url or "://" not in self.database_url:
+            raise ValueError(f"Invalid DATABASE_URL: {self.database_url}")
         self.engine = create_engine(self.database_url)
         self.SessionLocal = sessionmaker(bind=self.engine)
         self._language_parser = LanguageParser()
         self._gpa_parser = GPAParser()
         self._website_url_parser = WebsiteURLParser()

-    def get_region_from_nation(self, nation: str) -> str | None:
+    def get_region_from_nation(self, nation: str) -> Optional[str]:
         """Get region from nation using the mapping."""
         return self.COUNTRY_TO_REGION_MAP.get(nation)

@@ -96,7 +99,7 @@ def _load_language_requirements(
             session.add(record)
         return len(parsed_req.scores)

-    def load_universities_dataframe(self, df: pd.DataFrame) -> dict[str, int]:
+    def load_universities_dataframe(self, df: pd.DataFrame) -> Dict[str, int]:
         """Load a cleaned DataFrame into the database using an upsert strategy."""
         stats = {"inserted": 0, "updated": 0, "skipped": 0, "language_reqs": 0}
         with self.SessionLocal() as session:
@@ -202,20 +205,20 @@ def _get_field(self, row: pd.Series, field_name: str, default: Any = None) -> An
             return default
         return str(value).strip() if isinstance(value, str) else value

-    def get_all_universities(self) -> list[University]:
+    def get_all_universities(self) -> List[University]:
         with self.SessionLocal() as session:
             return list(session.execute(select(University).order_by(University.name_kor)).scalars().all())

-    def get_language_requirements(self, university_id: int) -> list[LanguageRequirement]:
+    def get_language_requirements(self, university_id: int) -> List[LanguageRequirement]:
         with self.SessionLocal() as session:
             stmt = select(LanguageRequirement).where(LanguageRequirement.university_id == university_id)
             return list(session.execute(stmt).scalars().all())

-    def get_all_language_requirements(self) -> list[LanguageRequirement]:
+    def get_all_language_requirements(self) -> List[LanguageRequirement]:
         with self.SessionLocal() as session:
             return list(session.execute(select(LanguageRequirement).order_by(LanguageRequirement.university_id, LanguageRequirement.exam_type)).scalars().all())

-    def search_universities_by_language(self, exam_type: str, user_score: float) -> list[University]:
+    def search_universities_by_language(self, exam_type: str, user_score: float) -> List[University]:
         with self.SessionLocal() as session:
             stmt = (
                 select(University)
diff --git a/src/load/models.py b/src/load/models.py
index 8ec306e..ed9770e 100644
--- a/src/load/models.py
+++ b/src/load/models.py
@@ -1,5 +1,7 @@
 """SQLAlchemy models for university exchange program data."""

+from typing import List, Optional
+
 from sqlalchemy import (
     Boolean,
     Float,
@@ -41,7 +43,7 @@ class LanguageRequirement(Base):
     min_score: Mapped[float] = mapped_column(
         Float, nullable=False, comment="요구되는 최소 점수"
     )
-    level_code: Mapped[str | None] = mapped_column(
+    level_code: Mapped[Optional[str]] = mapped_column(
         String(50), nullable=True, comment="레벨/등급 (예: B2, N2 등)"
     )

@@ -86,20 +88,20 @@ class University(Base):
         String(255), nullable=False, comment="대학교 영문 명칭"
     )
     min_gpa: Mapped[float] = mapped_column(Float, nullable=False, comment="지원을 위한 최소 GPA")
-    significant_note: Mapped[str | None] = mapped_column(
+    significant_note: Mapped[Optional[str]] = mapped_column(
         Text, nullable=True, comment="주요사항"
     )
     remark: Mapped[str] = mapped_column(Text, nullable=False, comment="기타 참고사항 (비고)")
-    available_majors: Mapped[str | None] = mapped_column(
+    available_majors: Mapped[Optional[str]] = mapped_column(
         Text, nullable=True, comment="교환학생 수강 가능한 전공 목록"
     )
-    website_url: Mapped[str | None] = mapped_column(
+    website_url: Mapped[Optional[str]] = mapped_column(
         Text, nullable=True, comment="공식 홈페이지 주소"
     )
-    thumbnail_url: Mapped[str | None] = mapped_column(
+    thumbnail_url: Mapped[Optional[str]] = mapped_column(
         Text, nullable=True, comment="학교 로고 또는 대표 이미지"
     )
-    available_semester: Mapped[str | None] = mapped_column(
+    available_semester: Mapped[Optional[str]] = mapped_column(
         String(100), nullable=True, comment="파견 가능한 학기 (예: Fall, Spring)"
     )
     is_exchange: Mapped[bool] = mapped_column(
@@ -110,7 +112,7 @@ class University(Base):
     )

     # Relationships
-    language_requirements: Mapped[list["LanguageRequirement"]] = relationship(
+    language_requirements: Mapped[List["LanguageRequirement"]] = relationship(
         back_populates="university",
         cascade="all, delete-orphan",
     )
diff --git a/src/transform/cleaner.py b/src/transform/cleaner.py
index 8b3956e..8e917cf 100644
--- a/src/transform/cleaner.py
+++ b/src/transform/cleaner.py
@@ -1,6 +1,7 @@
 """Data cleaning and normalization utilities."""

 import re
+from typing import Optional

 import pandas as pd

@@ -40,7 +41,7 @@ def _normalize_gpa(self) -> None:
         if "min_gpa" in self.df.columns:
             self.df["min_gpa"] = self.df["min_gpa"].apply(self._parse_gpa)

-    def _parse_gpa(self, value: str | None) -> str | None:
+    def _parse_gpa(self, value: Optional[str]) -> Optional[str]:
         """Parse GPA value to normalized format."""
         if value is None or pd.isna(value):
             return None
diff --git a/src/transform/parser.py b/src/transform/parser.py
index 7411d8c..7e5af15 100644
--- a/src/transform/parser.py
+++ b/src/transform/parser.py
@@ -1,9 +1,8 @@
 """Parse requirement fields from text to structured data."""

 import re
-from collections.abc import Callable
 from dataclasses import dataclass, field
-from typing import Any
+from typing import Any, Callable, Dict, List, Optional, Tuple


 @dataclass
@@ -11,7 +10,7 @@ class ParsedScoreInfo:
     """단일 어학 점수 파싱 결과."""
     exam_type: str
     min_score: float
-    level_code: str | None
+    level_code: Optional[str]
     language_group: str
     source: str


@@ -19,9 +18,9 @@ class ParsedLanguageRequirement:
 @dataclass
 class ParsedLanguageRequirement:
     """한 대학의 어학 요구사항 전체 파싱 결과."""
-    scores: list[ParsedScoreInfo] = field(default_factory=list)
+    scores: List[ParsedScoreInfo] = field(default_factory=list)
     is_optional: bool = False
-    excluded_tests: list[str] = field(default_factory=list)
+    excluded_tests: List[str] = field(default_factory=list)
     raw_text: str = ""


@@ -29,7 +28,7 @@ class ParsedLanguageRequirement:
 # 숭실대학교 공인 어학 성적 기준표 (2024년 기준)
 # ============================================================================

-LANGUAGE_STANDARDS: dict[str, Any] = {
+LANGUAGE_STANDARDS: Dict[str, Any] = {  # Added type hint for LANGUAGE_STANDARDS
     "A1": {"category": "ENGLISH", "scores": {"TOEFL": 85.0, "IELTS": 6.5, "TOEIC": 900.0, "TOEFL_ITP": 600.0}},
     "A2": {"category": "ENGLISH", "scores": {"TOEFL": 80.0, "IELTS": 6.0, "TOEIC": 850.0, "TOEFL_ITP": 560.0}},
     "A3": {"category": "ENGLISH", "scores": {"TOEFL": 75.0, "IELTS": 5.5, "TOEIC": 800.0, "TOEFL_ITP": 545.0}},
@@ -53,13 +52,13 @@ class ParsedLanguageRequirement:
     "C2": {"category": "JAPANESE", "scores": {"JLPT": 2.0, "JPT": 600.0}},
 }

-LEGACY_CODE_ALIASES: dict[str, str] = {
+LEGACY_CODE_ALIASES: Dict[str, str] = {  # Added type hint for LEGACY_CODE_ALIASES
     "A-1": "A1", "A-2": "A2", "A-3": "A3", "A-4": "A4", "A-5": "A5",
     "B-1": "B1", "B-2": "B2", "B-3": "B3", "C-1": "C1", "C-2": "C2",
     "D-1": "D1", "D-2": "D2", "D-3": "D3", "E-1": "E1", "E-2": "E2", "E-3": "E3",
 }

-TEST_TYPE_TO_LANGUAGE_GROUP: dict[str, str] = {
+TEST_TYPE_TO_LANGUAGE_GROUP: Dict[str, str] = {  # Added type hint for TEST_TYPE_TO_LANGUAGE_GROUP
     "TOEFL": "ENGLISH", "TOEFL_ITP": "ENGLISH", "TOEIC": "ENGLISH", "IELTS": "ENGLISH",
     "DUOLINGO": "ENGLISH", "HSK": "CHINESE",
@@ -78,7 +77,7 @@ def _cefr_to_float(level_str: str) -> float:

 class LanguageParser:
     """Parse language requirement text into structured data."""

-    SCORE_PATTERNS: list[tuple[str, str, Callable[[str], float]]] = [
+    SCORE_PATTERNS: List[Tuple[str, str, Callable[[str], float]]] = [  # Added type hint
         (r"TOEFL\s*(?:\(iBT\)|iBT|IBT|ibt)?\s*(\d+)", "TOEFL", lambda x: float(x.replace(',', ''))),
         (r"TOEFL\s*(?:ITP|itp|PBT|pbt)\s*(\d+)", "TOEFL_ITP", lambda x: float(x.replace(',', ''))),
         (r"토플\s*(?:IBT|iBT)?\s*(\d+)", "TOEFL", lambda x: float(x.replace(',', ''))),
@@ -96,10 +95,10 @@ class LanguageParser:
         (r"TOPIK\s*(\d+)급?", "TOPIK", lambda x: float(x.replace(',', ''))),
     ]

-    EXCLUDE_PATTERNS: list[str] = [r"TOEIC[^가-힣]*제외", r"ITP[^가-힣]*제외", r"토익[^가-힣]*제외"]
-    OPTIONAL_PATTERNS: list[str] = [r"어학\s*성적?\s*없음", r"면제", r"불필요", r"N/?A"]
+    EXCLUDE_PATTERNS: List[str] = [r"TOEIC[^가-힣]*제외", r"ITP[^가-힣]*제외", r"토익[^가-힣]*제외"]  # Added type hint
+    OPTIONAL_PATTERNS: List[str] = [r"어학\s*성적?\s*없음", r"면제", r"불필요", r"N/?A"]  # Added type hint

-    def parse(self, text: str | None, region: str | None = None) -> ParsedLanguageRequirement:
+    def parse(self, text: Optional[str], region: Optional[str] = None) -> ParsedLanguageRequirement:
         """
         Parse language requirement text using a direct-first, code-fallback strategy.
         This ensures that scores explicitly mentioned in the text take precedence.
         """
@@ -117,7 +116,7 @@ def parse(self, text: str | None, region: str | None = None) -> ParsedLanguageRe
             return ParsedLanguageRequirement(is_optional=True, raw_text="")
         result = ParsedLanguageRequirement(raw_text=text)
-        scores_map: dict[str, ParsedScoreInfo] = {}
+        scores_map: Dict[str, ParsedScoreInfo] = {}

         if any(re.search(p, text, re.IGNORECASE) for p in self.OPTIONAL_PATTERNS):
             result.is_optional = True
@@ -182,7 +181,7 @@ def parse(self, text: str | None, region: str | None = None) -> ParsedLanguageRe
         result.scores = list(scores_map.values())
         return result

-    def _match_standard_code(self, text: str, region: str | None = None) -> str | None:
+    def _match_standard_code(self, text: str, region: Optional[str] = None) -> Optional[str]:
         """Extracts a standard grade code from the text."""
         text_upper = text.upper().strip()

@@ -217,7 +216,7 @@ def _match_standard_code(self, text: str, region: str | None = None) -> str | No

 class GPAParser:
     """Parse GPA requirements."""

-    def parse(self, text: str | None) -> float | None:
+    def parse(self, text: Optional[str]) -> Optional[float]:
         """Parse GPA requirement text and return a float, assuming a 4.5 scale."""
         if text is None:
             return None
@@ -240,7 +239,7 @@ def parse(self, text: str | None) -> float | None:

 class WebsiteURLParser:
     """Parse and clean website URL strings."""

-    def parse(self, text: str | None) -> str | None:
+    def parse(self, text: Optional[str]) -> Optional[str]:
         """
         Parses a string to find and clean the first URL.
         Handles various formats including those without protocols.
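The workflows above splice `DB_USERNAME` and `DB_PASSWORD` straight into `FULL_DATABASE_URL` with shell interpolation, which silently produces an unparseable URL if the password contains `@`, `:`, or `/`. A minimal sketch of a safer builder — `build_mysql_url` is a hypothetical helper, and the `beyondu` database name and port 3306 are the values assumed in the diff — that percent-encodes credentials with the standard library:

```python
from urllib.parse import quote_plus


def build_mysql_url(user: str, password: str, host: str, db: str, port: int = 3306) -> str:
    """Build a SQLAlchemy MySQL URL, percent-encoding the credentials.

    Encoding means an '@', ':' or '/' inside the password cannot corrupt
    the URL the way plain shell/f-string interpolation would.
    """
    return (
        f"mysql+mysqlconnector://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{db}"
    )


print(build_mysql_url("etl_user", "p@ss:word", "my-rds-host", "beyondu"))
# → mysql+mysqlconnector://etl_user:p%40ss%3Aword@my-rds-host:3306/beyondu
```

SQLAlchemy also ships `sqlalchemy.engine.URL.create`, which performs this escaping itself; either approach keeps `create_engine` from misparsing the host boundary.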