A system that automatically generates a per-slide presentation script from presentation slides and reference materials, then tracks how well the generated script covers each slide's keywords.
- RAG-based reference document retrieval (LangChain + ChromaDB)
- Choice of LLM: IBM Watsonx or Upstage Solar
- Per-slide key keyword extraction (optional)
- Keyword coverage analysis of the generated script
- Three matching modes:
  - hybrid: combines token + sentence (default, best performance)
  - token: morpheme-based matching (fast)
  - sentence: sentence-similarity-based matching (accurate)
- Keyword normalization: hyphens/underscores are replaced with spaces (ON by default)
- Embedding options:
  - Transformer (local, fast, default)
  - Upstage API (accurate, requires API calls)
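The keyword normalization listed above can be illustrated with a minimal sketch. The function name `normalize_keyword` is hypothetical and is not the project's own `normalize_text()`; lowercasing and whitespace collapsing are added assumptions, since the source only states that hyphens and underscores become spaces.

```python
import re


def normalize_keyword(text: str) -> str:
    """Replace hyphens/underscores with spaces (per the README).

    Assumption: we also collapse repeated whitespace and lowercase,
    which the README does not state explicitly.
    """
    spaced = re.sub(r"[-_]+", " ", text)
    return re.sub(r"\s+", " ", spaced).strip().lower()


print(normalize_keyword("Fine-Tuning"))        # fine tuning
print(normalize_keyword("vector_db  search"))  # vector db search
```

With normalization on, a keyword such as "Fine-Tuning" can match a script that writes "fine tuning" as two words.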
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt

- Python 3.11.9
- langchain, langchain-openai, langchain-ibm
- chromadb
- sentence-transformers
- konlpy (Korean morphological analysis, optional)
# IBM Watsonx
API_KEY=your_ibm_api_key
PROJECT_ID=your_ibm_project_id
IBM_CLOUD_URL=https://us-south.ml.cloud.ibm.com
# Upstage
UPSTAGE_API_KEY=your_upstage_api_key
UPSTAGE_API_URL=https://api.upstage.ai/v1/solar

# Using IBM Watsonx
python main.py --api ibm --extractor y
# Using Upstage Solar
python main.py --api upstage --extractor y

Output:
- outputs/paper_scripts.md: per-slide scripts
- outputs/paper_keywords.txt: per-slide keywords
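Each `--api` choice relies on a different set of environment variables from the list earlier in this document. The following is a rough sketch of a pre-flight check, not the project's actual code; the helper `missing_env` and the `REQUIRED_ENV` mapping are hypothetical names introduced for illustration.

```python
import os

# Variable names are taken from the environment settings in this README;
# the grouping per backend is an assumption.
REQUIRED_ENV = {
    "ibm": ["API_KEY", "PROJECT_ID", "IBM_CLOUD_URL"],
    "upstage": ["UPSTAGE_API_KEY", "UPSTAGE_API_URL"],
}


def missing_env(api: str, env=None) -> list[str]:
    """Return the required variables that are unset or empty for `api`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV[api] if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
print(missing_env("upstage", {"UPSTAGE_API_KEY": "abc"}))  # ['UPSTAGE_API_URL']
```

Running a check like this before calling the LLM gives a clear error message instead of a failed API request.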
# Default (hybrid + normalization ON, Transformer embeddings)
python run_tracker.py
# token mode (example with normalization OFF)
python run_tracker.py -m token --normalize n
# sentence mode + Upstage API
python run_tracker.py -m sentence --api y

Output:
- outputs/paper_coverage_analysis.txt: per-slide coverage analysis
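The tracker options used in the commands above could be declared with `argparse` roughly as follows. This is a sketch under assumptions: the long option name `--mode` and the exact `choices` are guesses, and the defaults are inferred from the README (hybrid mode, normalization ON, local Transformer embeddings).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the README."""
    parser = argparse.ArgumentParser(prog="run_tracker.py")
    # "--mode" as the long form of -m is an assumption.
    parser.add_argument("-m", "--mode",
                        choices=["hybrid", "token", "sentence"],
                        default="hybrid")
    parser.add_argument("--normalize", choices=["y", "n"], default="y")
    # "y" switches embeddings to the Upstage API, "n" keeps the local model.
    parser.add_argument("--api", choices=["y", "n"], default="n")
    return parser


args = build_parser().parse_args(["-m", "token", "--normalize", "n"])
print(args.mode, args.normalize, args.api)  # token n n
```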
docs_root = "./docs"          # document folder
out_dir = "./outputs"         # output folder for results

persist_dir = "./chroma_db"   # vector DB storage path
chunk_size = 400              # chunk size
chunk_overlap = 50            # chunk overlap
top_k = 3                     # number of retrieved results

lexical_threshold = 0.5       # token matching threshold
semantic_threshold = 0.65     # sentence similarity threshold

# Transformer (local)
transformer_model = "paraphrase-multilingual-MiniLM-L12-v2"
# Upstage (API)
embedding_model = "solar-embedding-1-large"
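To show how `chunk_size` and `chunk_overlap` interact, here is a minimal sliding-window sketch. It assumes both values are measured in characters and uses a hypothetical helper `chunk_text`; the project itself may rely on a LangChain text splitter with different boundary rules.

```python
def chunk_text(text: str, chunk_size: int = 400, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so each new
    window starts `chunk_size - chunk_overlap` characters after the last.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print([len(c) for c in chunk_text("a" * 1000)])  # [400, 400, 300]
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.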
.
├── docs/                            # document folder
│   ├── paper.pdf                    # presentation slides
│   └── report.pdf                   # reference document
├── outputs/                         # result files
│   ├── paper_scripts.md             # generated scripts
│   ├── paper_keywords.txt           # extracted keywords
│   └── paper_coverage_analysis.txt  # coverage analysis
├── tracker/                         # keyword tracking module
│   ├── keyword_tracker.py           # token-based tracking
│   └── sentence_tracker.py          # sentence-similarity tracking
├── main.py                          # script generation
├── run_tracker.py                   # keyword tracking
├── config.py                        # settings
└── requirements.txt                 # dependencies

| Mode | Accuracy | Speed | Characteristics |
|---|---|---|---|
| token | 89% | Fast | Morpheme-based, exact word matching |
| sentence | 82% | Slow | Meaning-based, context-aware |
| hybrid | 95.9% | Medium | Combines token + sentence, normalization by default (space substitution) |

Recommended: for operational tracking, use python run_tracker.py (hybrid + normalization ON).
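One plausible way to combine the two signals in hybrid mode is sketched below. The source only says that hybrid combines token and sentence matching; the OR rule, the function names, and the toy vectors are assumptions, while the default thresholds come from the configuration values `lexical_threshold` and `semantic_threshold`.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def hybrid_match(lexical_score: float, semantic_score: float,
                 lexical_threshold: float = 0.5,
                 semantic_threshold: float = 0.65) -> bool:
    """Assumed rule: a keyword counts as covered if either signal passes."""
    return lexical_score >= lexical_threshold or semantic_score >= semantic_threshold


# Toy 2-d vectors stand in for real sentence embeddings.
semantic = cosine_similarity([1.0, 0.0], [1.0, 1.0])
print(hybrid_match(lexical_score=0.2, semantic_score=semantic))
```

An OR rule favors recall: a paraphrased keyword can still be counted through the semantic score even when no tokens overlap.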
- normalize_text(): text normalization
- tokenize_simple_ko_en(): Korean/English tokenization
- token_overlap_score(): computes the token overlap score
- parse_keywords_from_file(): parses the keyword file
- parse_scripts_by_slide(): parses the script file

- SentenceMatcher: sentence-similarity matching class
- find_matches(): matches keywords within sentences
- get_coverage_for_slide(): computes slide coverage

- main_analysis(): main keyword-tracking logic
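As an illustration of token-based coverage, the sketch below scores each keyword by the fraction of its tokens found in a slide's script, then reports the share of keywords that pass a threshold. The names `simple_tokens`, `overlap_score`, and `slide_coverage` are hypothetical and do not reproduce the project's functions; the regex tokenizer is a stand-in for morphological analysis, so Korean words with attached particles would need a real analyzer such as konlpy.

```python
import re


def simple_tokens(text: str) -> set[str]:
    """Lowercased runs of digits, Latin letters, or Hangul syllables."""
    return set(re.findall(r"[0-9A-Za-z가-힣]+", text.lower()))


def overlap_score(keyword: str, script: str) -> float:
    """Fraction of the keyword's tokens that also occur in the script."""
    keyword_tokens = simple_tokens(keyword)
    if not keyword_tokens:
        return 0.0
    return len(keyword_tokens & simple_tokens(script)) / len(keyword_tokens)


def slide_coverage(keywords: list[str], script: str, threshold: float = 0.5) -> float:
    """Share of keywords whose overlap score reaches the threshold."""
    if not keywords:
        return 0.0
    hits = sum(overlap_score(k, script) >= threshold for k in keywords)
    return hits / len(keywords)


script = "We fine tune the model with a vector database."
keywords = ["fine tune", "vector database", "prompt caching"]
print(round(slide_coverage(keywords, script), 3))  # 0.667
```

The default threshold of 0.5 mirrors `lexical_threshold` from the configuration, so a two-word keyword counts as covered when at least one of its words appears.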
MIT License