[Feat] Rag scrap notice and embedding for vectorDB #191

Merged: 30 commits, Jul 22, 2024
Commits
24400f2
setting: Configure Chroma Vector DB dependency
zbqmgldjfh Jul 15, 2024
4241964
feat: Update configuration files
zbqmgldjfh Jul 15, 2024
b571e81
feat(QueryVectorStoreAdapter): QueryVectorStoreAdapter to ChromaVectorS…
zbqmgldjfh Jul 15, 2024
4ed5577
feat(Notice): Add embedded boolean field to the Notice table
zbqmgldjfh Jul 16, 2024
d997740
feat(NoticeTextParserTemplate): Implement ParserTemplate that parses a notice's body, title, and ID
zbqmgldjfh Jul 16, 2024
937e233
test: Set up ChromaDB test container
zbqmgldjfh Jul 17, 2024
64d54af
feat(NoticeApiClient): Implement requestSinglePageWithUrl to scrape a single page
zbqmgldjfh Jul 17, 2024
d285e34
fix(NoticeJdbcRepository): For the embedded field added to Notice, bulk insert method…
zbqmgldjfh Jul 18, 2024
aa7c80b
feat(NoticeRepository): updateNoticeEmbeddingStatus, findNotYetEmbedd…
zbqmgldjfh Jul 18, 2024
4d65d24
fix(KuisHomepageNoticeTextParser): Add logic to parse additional tags that contain the body
zbqmgldjfh Jul 18, 2024
360b078
feat(KuisHomepageNoticeInfo): Add textParser dependency
zbqmgldjfh Jul 18, 2024
95dbbae
feat(ChromaVectorStoreAdapter): Implement ChromaVector
zbqmgldjfh Jul 18, 2024
7c71810
test(KuisHomepageNoticeScraperTemplateTest): Embedding test scrapForEmbeddin…
zbqmgldjfh Jul 18, 2024
91c6a93
feat(RAGConfiguration): Implement RAG configuration
zbqmgldjfh Jul 19, 2024
f808b15
feat(NoticeEmbeddingUpdater): Implement Updater for notice embedding
zbqmgldjfh Jul 19, 2024
3c0fca8
feat: Change the notice updater's run time
zbqmgldjfh Jul 19, 2024
6b4c10c
chore: Add collection-name to the configuration file
zbqmgldjfh Jul 19, 2024
81e0625
fix(ChromaVectorStoreAdapter): Fix the embedding method and add tests
zbqmgldjfh Jul 20, 2024
1889237
feat(ChromaVectorStoreAdapter): Remove the similarity threshold
zbqmgldjfh Jul 20, 2024
828fd4e
feat: Remove unused RestTemplateConfig
zbqmgldjfh Jul 20, 2024
63d68b3
chore: Remove public access modifier
zbqmgldjfh Jul 20, 2024
bc88621
feat(ChromaVectorStoreAdapter): Change Top-K to 2
zbqmgldjfh Jul 21, 2024
4e52373
feat(User): Change the monthly question limit to 3
zbqmgldjfh Jul 21, 2024
2aee9bf
feat(UserUpdater#questionCountReset): Implement a job that resets user question counts on the last day of each month
zbqmgldjfh Jul 21, 2024
10488b2
feat(UserRegisterNonChainingFilter): Log duplicate user registration exceptions
zbqmgldjfh Jul 21, 2024
d8944be
feat(UserUpdater): Suspend the user removal job
zbqmgldjfh Jul 21, 2024
8351982
setting: Change ai max token to 1000
zbqmgldjfh Jul 21, 2024
d00db85
feat(RAGQueryApiV2): Document RAGQueryApi
zbqmgldjfh Jul 21, 2024
736e835
refactor: Use constants in SecurityRequirement
zbqmgldjfh Jul 22, 2024
645cb77
feat(User): Limit the user question count to 2
zbqmgldjfh Jul 22, 2024
feat(NoticeEmbeddingUpdater): Implement Updater for notice embedding
zbqmgldjfh committed Jul 19, 2024

commit f808b155ae2d4dc2971ba67b9e4864baea24e983
@@ -2,15 +2,18 @@

import com.kustacks.kuring.common.exception.InternalLogicException;
import com.kustacks.kuring.common.exception.code.ErrorCode;
import com.kustacks.kuring.notice.application.port.out.dto.NoticeDto;
import com.kustacks.kuring.worker.dto.ComplexNoticeFormatDto;
import com.kustacks.kuring.worker.dto.ScrapingResultDto;
import com.kustacks.kuring.worker.scrap.noticeinfo.KuisHomepageNoticeInfo;
import com.kustacks.kuring.worker.parser.notice.PageTextDto;
import com.kustacks.kuring.worker.parser.notice.RowsDto;
import com.kustacks.kuring.worker.scrap.noticeinfo.KuisHomepageNoticeInfo;
import com.kustacks.kuring.worker.update.notice.dto.response.CommonNoticeFormatDto;
import lombok.extern.slf4j.Slf4j;
import org.jsoup.nodes.Document;
import org.springframework.stereotype.Component;

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.function.Function;
@@ -34,6 +37,19 @@ public List<ComplexNoticeFormatDto> scrap(
         return noticeDtoList;
     }

+    public List<PageTextDto> scrapForEmbedding(
+            List<NoticeDto> scrapResults,
+            KuisHomepageNoticeInfo noticeInfo
+    ) throws InternalLogicException {
+        List<ScrapingResultDto> requestResults = requestWithDeptInfoForEmbedding(scrapResults, noticeInfo);
+
+        log.debug("[{}] Text extract begin", noticeInfo.getCategoryName());
+        List<PageTextDto> noticeDtoList = htmlTextParsingFromScrapingResult(noticeInfo, requestResults);
+        log.debug("[{}] Text extract end", noticeInfo.getCategoryName());
+
+        return noticeDtoList;
+    }
+
     private void validateScrapedNoticeCountIsNotZero(List<ComplexNoticeFormatDto> noticeDtoList) {
         for (ComplexNoticeFormatDto complexNoticeFormatDto : noticeDtoList) {
             if (complexNoticeFormatDto.getNormalNoticeSize() == 0) {
@@ -48,16 +64,52 @@ private List<ScrapingResultDto> requestWithDeptInfo(
     ) {
         long startTime = System.currentTimeMillis();

-        log.debug("[{}] HTML request", kuisNoticeInfo.getCategoryName());
+        log.debug("[{}] HTML SCRAP request", kuisNoticeInfo.getCategoryName());
         List<ScrapingResultDto> reqResults = decisionMaker.apply(kuisNoticeInfo);
-        log.debug("[{}] HTML received", kuisNoticeInfo.getCategoryName());
+        log.debug("[{}] HTML SCRAP received", kuisNoticeInfo.getCategoryName());

         long endTime = System.currentTimeMillis();
         log.debug("[{}] seconds spent parsing = {}", kuisNoticeInfo.getCategoryName(), (endTime - startTime) / 1000.0);

         return reqResults;
     }

+    private List<ScrapingResultDto> requestWithDeptInfoForEmbedding(
+            List<NoticeDto> scrapResults,
+            KuisHomepageNoticeInfo noticeInfo
+    ) {
+        long startTime = System.currentTimeMillis();
+
+        List<ScrapingResultDto> scrapResultDtos = new LinkedList<>();
+        for (NoticeDto scrapResult : scrapResults) {
+            log.debug("[{}] HTML SCRAP request", noticeInfo.getCategoryName());
+            scrapResultDtos.add(noticeInfo.scrapSinglePageHtml(scrapResult.getUrl()));
+            log.debug("[{}] HTML SCRAP received", noticeInfo.getCategoryName());
+        }
+
+        long endTime = System.currentTimeMillis();
+        log.debug("[{}] seconds spent parsing = {}", noticeInfo.getCategoryName(), (endTime - startTime) / 1000.0);
+
+        return scrapResultDtos;
+    }
+
+    private List<PageTextDto> htmlTextParsingFromScrapingResult(
+            KuisHomepageNoticeInfo noticeInfo,
+            List<ScrapingResultDto> results
+    ) {
+        List<PageTextDto> parsedTexts = new ArrayList<>();
+        for (ScrapingResultDto result : results) {
+            try {
+                PageTextDto parsedText = noticeInfo.parseText(result.getDocument());
+                parsedTexts.add(parsedText);
+            } catch (InternalLogicException e) {
+                log.error("Exception extracting url: {}", result.getViewUrl(), e);
+            }
+        }
+
+        return parsedTexts;
+    }
+

     private List<ComplexNoticeFormatDto> htmlParsingFromScrapingResult(
             KuisHomepageNoticeInfo kuisNoticeInfo,
@@ -62,7 +62,7 @@ public List<ScrapingResultDto> requestAll(KuisHomepageNoticeInfo kuisHomepageNot
     @Override
     public ScrapingResultDto requestSinglePageWithUrl(KuisHomepageNoticeInfo noticeInfo, String url) {
         try {
-            Document document = jsoupClient.get(url, LATEST_SCRAP_ALL_TIMEOUT);
+            Document document = jsoupClient.get(url, LATEST_SCRAP_TIMEOUT);
             return new ScrapingResultDto(document, url);
         } catch (IOException e) {
             log.info("Notice Text Scrap IOException", e);
@@ -78,7 +78,7 @@ public int getTotalNoticeSize(String url) throws IOException, IndexOutOfBoundsEx

         Element totalNoticeSizeElement = document.selectFirst(".util-search strong");

-        if(totalNoticeSizeElement == null) { // assume 650 notices when the total notice count is missing
+        if (totalNoticeSizeElement == null) { // assume 650 notices when the total notice count is missing
             return TOTAL_KUIS_NOTICES_COUNT;
         }

@@ -0,0 +1,80 @@
+package com.kustacks.kuring.worker.update.notice;
+
+import com.kustacks.kuring.ai.application.port.out.CommandVectorStorePort;
+import com.kustacks.kuring.notice.application.port.out.NoticeCommandPort;
+import com.kustacks.kuring.notice.application.port.out.NoticeQueryPort;
+import com.kustacks.kuring.notice.application.port.out.dto.NoticeDto;
+import com.kustacks.kuring.notice.domain.CategoryName;
+import com.kustacks.kuring.worker.parser.notice.PageTextDto;
+import com.kustacks.kuring.worker.scrap.KuisHomepageNoticeScraperTemplate;
+import com.kustacks.kuring.worker.scrap.noticeinfo.KuisHomepageNoticeInfo;
+import lombok.RequiredArgsConstructor;
+import lombok.extern.slf4j.Slf4j;
+import org.springframework.scheduling.annotation.Scheduled;
+import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;
+import org.springframework.stereotype.Component;
+
+import java.time.LocalDateTime;
+import java.util.List;
+import java.util.concurrent.CompletableFuture;
+
+@Slf4j
+@Component
+@RequiredArgsConstructor
+public class NoticeEmbeddingUpdater {
+
+    private final ThreadPoolTaskExecutor noticeUpdaterThreadTaskExecutor;
+    private final KuisHomepageNoticeScraperTemplate scrapperTemplate;
+    private final List<KuisHomepageNoticeInfo> kuisNoticeInfoList;
+    private final CommandVectorStorePort commandVectorStorePort;
+    private final NoticeCommandPort noticeCommandPort;
+    private final NoticeQueryPort noticeQueryPort;
+
+    /*
+    Academic, scholarship, employment/startup, international, student, industry-academia, and general notice embedding
+     */
+    @Scheduled(cron = "0 0 21 * * *", zone = "Asia/Seoul") // run the embedding job every day at 9 PM
+    public void update() {
+        log.info("========== KUIS Homepage Embedding start ==========");
+
+        for (KuisHomepageNoticeInfo kuisNoticeInfo : kuisNoticeInfoList) {
+            CompletableFuture
+                    .supplyAsync(
+                            () -> lookupNotYetEmbeddingNotice(kuisNoticeInfo),
+                            noticeUpdaterThreadTaskExecutor
+                    ).thenApply(
+                            scrapResults -> scrapNoticeText(scrapResults, kuisNoticeInfo)
+                    ).thenAccept(
+                            scrapResults -> embeddingNotice(scrapResults, kuisNoticeInfo.getCategoryName())
+                    );
+        }
+    }

+    private List<NoticeDto> lookupNotYetEmbeddingNotice(KuisHomepageNoticeInfo noticeInfo) {
+        log.debug("lookupNotYetEmbeddingNotice {}", noticeInfo.getCategoryName());
+        LocalDateTime startDate = LocalDateTime.now().minusMonths(2);
+        return noticeQueryPort.findNotYetEmbeddingNotice(noticeInfo.getCategoryName(), startDate);
+    }
+
+    private List<PageTextDto> scrapNoticeText(
+            List<NoticeDto> scrapResults,
+            KuisHomepageNoticeInfo noticeInfo
+    ) {
+        return scrapperTemplate.scrapForEmbedding(scrapResults, noticeInfo);
+    }
+
+    private void embeddingNotice(List<PageTextDto> extractTextResults, CategoryName categoryName) {
+        if (extractTextResults.isEmpty()) {
+            log.debug("Embedding {} no more notice to embed", categoryName);
+            return;
+        }
+        log.info("Embedding {}, size = {}", categoryName, extractTextResults.size());
+
+        commandVectorStorePort.embedding(extractTextResults, categoryName);
+        List<String> articleIds = extractTextResults.stream()
+                .map(PageTextDto::articleId)
+                .toList();
+
+        noticeCommandPort.updateNoticeEmbeddingStatus(categoryName, articleIds);
+    }
+}
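
Review note on the `update()` pipeline in `NoticeEmbeddingUpdater`: the `supplyAsync(...).thenApply(...).thenAccept(...)` chain discards the returned future, so an exception thrown in any stage completes that future exceptionally without anything observing it. Below is a minimal, self-contained sketch of the same three-stage shape with a terminal `exceptionally` stage. It is not the project's code: `EmbeddingPipelineSketch`, `runCategory`, the `embedded` and `failures` lists, and the category names are all hypothetical stand-ins, and plain strings replace `NoticeDto` and `PageTextDto`.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.function.Supplier;

public class EmbeddingPipelineSketch {

    // Hypothetical stand-in for the vector store: records what the last stage received.
    static final List<String> embedded = new CopyOnWriteArrayList<>();
    // Records failures that would otherwise be lost with the discarded future.
    static final List<String> failures = new CopyOnWriteArrayList<>();

    // Hypothetical helper: lookup -> scrape -> embed for one category,
    // mirroring the supplyAsync/thenApply/thenAccept shape of update().
    static CompletableFuture<Void> runCategory(
            String category,
            Supplier<List<String>> lookup,
            Function<List<String>, List<String>> scrape,
            ExecutorService executor
    ) {
        return CompletableFuture
                .supplyAsync(lookup, executor)
                .thenApply(scrape)
                .thenAccept(texts -> embedded.addAll(texts))
                .exceptionally(ex -> {
                    // Runs if any earlier stage threw; ex is usually a CompletionException.
                    failures.add(category + ": " + ex);
                    return null;
                });
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        try {
            CompletableFuture<Void> ok = runCategory(
                    "category-a",
                    () -> List.of("a", "b"),
                    items -> items.stream().map(String::toUpperCase).toList(),
                    executor);
            CompletableFuture<Void> broken = runCategory(
                    "category-b",
                    () -> { throw new IllegalStateException("lookup failed"); },
                    items -> items,
                    executor);

            // Wait for both pipelines so the results below are complete.
            CompletableFuture.allOf(ok, broken).join();
            System.out.println("embedded = " + embedded);
            System.out.println("failures = " + failures);
        } finally {
            executor.shutdown();
        }
    }
}
```

In the updater itself, the equivalent change would be one extra `.exceptionally(...)` stage that logs the category name and the exception; whether that is preferable to catching inside each stage is a design choice for the maintainers.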