[DS][BUG] Request to fix ChatGPT response parsing #14

Open
1 of 2 tasks
robert-min opened this issue Feb 1, 2024 · 6 comments
Labels
Bug Not Working(ex. fixing errors in existing work) DataScience Data Science task(ex. model research, EDA, optimization work)

Comments

robert-min commented Feb 1, 2024

📌 Description

  • Depending on how ChatGPT formats its response, cases like the ones below need additional handling

Current handling code

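import re
from collections import defaultdict


def extract_coord_keyword(content: str):
    # Matches 'keyword':[ and captures the rest of the line, newline included
    pattern = r"'(.*?)':\[(.*?\n)"

    # TODO: fix the coordinate extraction code
    matches = re.findall(pattern, content)
    all_coords = defaultdict(list)
    for name, coords in matches:
        # Drop the newline, the leading '[' and one trailing ']', then split per box
        coords = coords.replace("\n", "")[1:-1].split("],[")
        for idx, coord in enumerate(coords):
            if idx == len(coords) - 1:
                # Strip any closing brackets still attached to the last box
                while coord[-1] == "]":
                    coord = coord[:-1]
            temp = list(map(int, coord.split(",")))
            all_coords[name].append(temp)
    return all_coords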

Exception cases

  1. Case the current code already handles
  • Each coordinate group sits in its own brackets, all wrapped in one outer list
- 'hot air balloons': [[143,38,225,119], [198,49,274,132], [348,69,394,113]]\n
- 'river': [[354,187,637,423]]\n
- 'terraced fields': [[43,241,802,707]]"
  2. Case that raises an error
  • Each coordinate group is sent wrapped in two pairs of brackets
- 'hot air balloons':[[58,31,139,85]], [[176,15,243,68]], [[282,7,324,39]]\n
- 'river':[[179,237,429,292]]\n
- 'terraced fields':[[88,308,394,600]]

🎈 Goal

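$\tiny{Describe\ the\ goal,\ including\ concrete\ deliverables.}$

  • Revise the prompt so that values are returned in the [[coord1], [coord2], [coord3]] format
  • Otherwise, change the code so that it can parse the error case above

Go with whichever of the two is easier!!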
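
For the second option, one possible direction is to stop relying on the bracket nesting and simply collect every four-integer group that follows a quoted keyword. The sketch below is only an illustration: `extract_boxes`, `KEYWORD_RE` and `BOX_RE` are hypothetical names that do not exist in the project, and it assumes keywords are wrapped in single quotes and every box has exactly four non-negative integers.

```python
import re
from collections import defaultdict

# Hypothetical helper, not part of the project.
# 'keyword' : <everything up to the next quote or line break>
KEYWORD_RE = re.compile(r"'([^']+)'\s*:\s*([^'\n]*)")
# A single [x0,y0,x1,y1] group, with optional spaces around the numbers
BOX_RE = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def extract_boxes(content: str) -> dict[str, list[list[int]]]:
    """Collect every 4-integer box after each quoted keyword, ignoring nesting depth."""
    all_coords = defaultdict(list)
    for name, rest in KEYWORD_RE.findall(content):
        for box in BOX_RE.findall(rest):
            all_coords[name].append([int(v) for v in box])
    # Keywords without any box never get an entry
    return dict(all_coords)
```

Because only the innermost `[x0,y0,x1,y1]` groups are matched, `[[a], [b]]` and `[[a]], [[b]]` should both produce the same list of boxes for a keyword.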


✏️ Todo

$\tiny{List\ in\ detail\ what\ needs\ to\ be\ done\ to\ reach\ the\ goal.}$

  • Report the error case
  • Resolve the problem
@robert-min robert-min added Bug Not Working(ex. fixing errors in existing work) DataScience Data Science task(ex. model research, EDA, optimization work) labels Feb 1, 2024
kimdoeon commented Feb 1, 2024

📌 Description

  • Revised the prompt. In 10 runs, all 10 responses used the [[coord1], [coord2], [coord3]] format.

Changes

  • Original prompt
    You are an expert art historian with vast knowledge about artists throughout history who revolutionized their craft. You will begin by briefly summarizing the personal life and achievements of the artist. Then you will go on to explain the medium, style, and influences of their works. Then you will provide short descriptions of what they depict and any notable characteristics they might have. Fianlly identify THREE keywords in the picture and provide each coordinate of the keywords in the last sentence. For example if the keyword is woman, the output must be 'woman':[[x0,y0,x1,y1]] 
  • Revised prompt
   You are an expert art historian with vast knowledge about artists throughout history who revolutionized their craft. You will begin by briefly summarizing the personal life and achievements of the artist. Then you will go on to explain the medium, style, and influences of their works. Then you will provide short descriptions of what they depict and any notable characteristics they might have. Fianlly identify THREE keywords in the picture and provide each coordinate of the keywords in the last sentence. For example if the keyword is woman, the output must be 'woman':[[x0,y0,x1,y1]] or 'woman':[[x0,y0,x1,y1], [x2,y2,x3,y3]] 

kimdoeon commented Feb 1, 2024

📌 Description

  1. Added a refine_ouput_first function that strips newlines/tabs from the raw content received in the response
  2. Revised the keyword/coordinate extraction function extract_coord_keyword
  • Change 1
  def refine_ouput_first(content: str) -> str:
      '''Replace newlines/tabs in the raw content with spaces.'''
      content = content.replace('\n', ' ').replace('\t', ' ').strip()

      return content
  • Change 2
  import json
  import re

  def extract_coord_keyword(content: str) -> dict[str, list[list[int]]]:
      chk = 'json'
      if chk in content:
          key_coord_dic = content.split(chk)[-1].strip()  # keep the part after the last 'json'
          match = re.search(r'\{.*\}', key_coord_dic)  # regex for the text inside { }
          if match:
              str_dict = match.group()
              return json.loads(str_dict)  # parse the JSON-formatted string into a dict

      return {}  # no 'json' marker or no { } block: return an empty dict

  • Simple test code
  import re
  import json
  
  content = '''As an AI, I do not have access to specific databases for identifying individual artworks or artists beyond my training data, which only goes up until April 2023. Therefore, I cannot provide a personal history or achievements of the artist of this specific painting since it requires identifying individual living or recent artists, which I cannot do. However, I can describe the visible characteristics of this image.\n\nThe artwork displayed is an idyllic landscape painting that appears to employ a stylized realism. The medium looks like it could be acrylic or oil on canvas, given the vibrancy of the colors and the smooth texture of the painted surface. The style presents a harmonized composition with vibrant colors, and there\'s a certain rhythm created by the patterns of the fields. This style is reminiscent of folk art or naive art, which often features simplified forms and a sense of serenity.\n\nThe painting depicts a lush green landscape with a meandering river leading towards a tranquil blue lake. Terraced fields, perhaps indicative of rice paddies or tea plantations, add a patterned texture to the rolling hills. Trees intermittently dot the landscape, and the presence of hot air balloons in the sky introduces a whimsical or fantastical element to the scene. There\'s a structure visible to the left, possibly part of a house or an outbuilding with a red brick chimney and a white parasol, suggesting a human presence without showing actual figures.\n\nNow, for the coordinates of three keywords within the image:\n\n1. \'hot air balloon\',\n2. \'river\',\n3. \'terraced fields\'.\n\n```json\n{\n  "hot air balloon": [[74,35,117,84], [200,29,236,66], [411,43,442,69]],\n  "river": [[223,285,400,406]],\n  "terraced fields": [[0,228,600,477]]\n}\n```'''
  
  def refine_ouput_first(content: str) -> str:
      '''Replace newlines/tabs in the raw content with spaces.'''
      content = content.replace('\n', ' ').replace('\t', ' ').strip()

      return content

  def extract_coord_keyword(content: str) -> dict[str, list[list[int]]]:
      chk = 'json'
      if chk in content:
          key_coord_dic = content.split(chk)[-1].strip()  # keep the part after the last 'json'
          match = re.search(r'\{.*\}', key_coord_dic)  # regex for the text inside { }
          if match:
              str_dict = match.group()
              return json.loads(str_dict)  # parse the JSON-formatted string into a dict

      return {}  # no 'json' marker or no { } block: return an empty dict

  ref_content = refine_ouput_first(content)
  key_coord_dic = extract_coord_keyword(ref_content)

  print(key_coord_dic)

kimdoeon commented Feb 2, 2024

📌 Description

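  • Found responses whose output text does not contain 'json', and responses that do not follow JSON format
    => 1. Revise extract_coord_keyword.
    => 2. Revise the prompt

1. Revise extract_coord_keyword

  • Changes:
    • Extract the { } block directly from content
    • Replace ' with " in str_dict
    • Always return an empty dict when the text is not valid JSON
  • Previous code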
  def extract_coord_keyword(content: str) -> dict[str, list[list[int]]]:
      chk = 'json'  # changed
      if chk in content:  # changed
          key_coord_dic = content.split(chk)[-1].strip()  # changed
          match = re.search(r'\{.*\}', key_coord_dic)
          if match:
              str_dict = match.group()
              return json.loads(str_dict)

      return {}  # no 'json' marker or no { } block
  • Revised code
  def extract_coord_keyword(content: str):
      match = re.search(r'\{.*\}', content)
      if match:
          str_dict = match.group()
          str_dict = str_dict.replace("'", '"')  # changed: JSON requires double quotes
          try:  # changed
              return json.loads(str_dict)
          except json.JSONDecodeError:
              return {}
      else:
          return {}
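
Even when `json.loads` succeeds, the result may not have the expected shape (for example `{"river": 1}` is valid JSON). A shape check along the following lines could run on the parsed result before it is used. This is only a sketch: `is_valid_keyword_coords` is a hypothetical helper, and it assumes every box must be exactly four integers and that an empty dict counts as a failure.

```python
from typing import Any


def is_valid_keyword_coords(parsed: Any) -> bool:
    """Return True only for non-empty {str: [[int, int, int, int], ...]} mappings."""
    if not isinstance(parsed, dict) or not parsed:
        return False
    for name, boxes in parsed.items():
        if not isinstance(name, str) or not isinstance(boxes, list) or not boxes:
            return False
        for box in boxes:
            if not (isinstance(box, list) and len(box) == 4):
                return False
            # bool is a subclass of int in Python, so exclude it explicitly
            if not all(isinstance(v, int) and not isinstance(v, bool) for v in box):
                return False
    return True
```

With this in place, the caller could treat a result that fails the check the same way as the empty dict returned on a parse error.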

2. Revise the prompt

  • Revised prompt
'''"You are an expert art historian with vast knowledge about artists throughout history who revolutionized their craft.",
            "You will begin by briefly summarizing the personal life and achievements of the artist.",
            "Then you will go on to explain the medium, style, and influences of their works.",
            "Then you will provide short descriptions of what they depict and any notable characteristics they might have.",
            "Fianlly identify THREE keywords in the picture and provide each coordinate of the keywords in the last sentence.",
            'For example, Give the coordinate value of the keywords in json format such as if the keyword is Pretty_woman, ```json{"pretty_woman", [[x0,y0,x1,y1]]}```, or if there are multiple coordinates, keyword coordinates in json format such as ```json{"pretty_woman":[[x0,y0,x1,y1], [x2,y2,x3,y3]]}`',
            "The values ​​entered in x0, y0, x1, y1 are unconditionally the coordinate values ​​of each keyword."'''

kimdoeon commented Feb 2, 2024

📌 Description

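The output format changed, so the refine_output function was revised.

  • Before: dropped everything after the first ':'
  • After: drops every sentence in the commentary that contains a digit
  • Previous code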
def refine_output(content: str) -> str:
    keyword = ':'
    if keyword in content:
        content = content[:content.find(keyword)].strip()
    content = content.replace('\n', ' ').strip()
    return content
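  • Revised code
import re

def refine_output(content: str) -> str:
    sentences = content.split(". ")
    # Keep only the sentences that contain no digits
    kept = [sentence for sentence in sentences if not re.search(r'\d', sentence)]

    # If every sentence was filtered out, fall back to the original text
    if not kept:
        return content

    # Rejoin with the separator that split() removed
    return ". ".join(kept)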

kimdoeon commented Feb 2, 2024

📌 Description

Revised the refine_output function.
It now removes the AI's excuses at the start of the commentary (I cannot, I do not ~) and any sentence containing json, JSON, {, or a digit.

import re

def refine_output(content: str) -> str:
    # Drop AI excuses and sentences that focus-pointing did not filter out
    words = ["cannot", "AI", "do not", "can't", "json", "JSON", "{"]

    sentences = content.split(". ")
    kept = [
        sentence for sentence in sentences
        if not any(word in sentence for word in words) and not re.search(r'\d', sentence)
    ]

    # If every sentence was filtered out, fall back to the original text
    if not kept:
        return content

    # Rejoin with the separator that split() removed
    return ". ".join(kept)
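
One caveat: `content.split(". ")` only splits on a period followed by a space, so sentences separated by a period and a line break stay glued together unless the newlines were replaced beforehand. If the raw response is passed in directly, a regex-based split might be more forgiving. The following is a sketch under that assumption; `split_sentences` and `refine_output_sketch` are hypothetical names, and the word list is copied from the function above.

```python
import re

# Filter words copied from refine_output above
WORDS = ["cannot", "AI", "do not", "can't", "json", "JSON", "{"]


def split_sentences(content: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by any whitespace, newlines included."""
    return [s for s in re.split(r"(?<=[.!?])\s+", content.strip()) if s]


def refine_output_sketch(content: str) -> str:
    # Keep sentences with no filter word and no digit
    kept = [
        s for s in split_sentences(content)
        if not any(w in s for w in WORDS) and not re.search(r"\d", s)
    ]
    # Fall back to the original text if nothing survives the filter
    return " ".join(kept) if kept else content
```

This split is still naive: abbreviations such as "e.g. " would be treated as sentence ends, so it is a starting point rather than a full sentence tokenizer.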

@robert-min

Prompt as of 240213

"You are an expert art historian with vast knowledge about artists throughout history who revolutionized their craft.",
"You will begin by briefly summarizing the personal life and achievements of the artist.",
"Then you will go on to explain the medium, style, and influences of their works.",
"Then you will provide short descriptions of what they depict and any notable characteristics they might have.",
"Fianlly identify THREE keywords in the picture and provide each coordinate of the keywords in the last sentence.",
"For example, Give the coordinate value of the keywords in json format.",
"if the keyword is pretty_woman and big_ball, value is  ```json{\"pretty_woman\", [[x0,y0,x1,y1]], \"big_ball\", [[x0,y0,x1,y1], [x2,y2,x3,y3]]}```",
"The values ​​entered in x0, y0, x1, y1 are unconditionally the coordinate values ​​of each keyword.",
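
It may be worth double-checking the example inside the prompt: it puts a comma between the keyword and its coordinate list, whereas JSON objects require a colon. If the model copies the example literally, `json.loads` would reject the output and `extract_coord_keyword` would return an empty dict. The check below illustrates the difference; the numbers are made-up stand-ins for x0, y0, x1, y1, and `parses_as_json` is a hypothetical helper.

```python
import json

# Shapes taken from the prompt example, with made-up numbers substituted
# for x0, y0, x1, y1 so that the text can be parsed.
comma_form = '{"pretty_woman", [[1, 2, 3, 4]]}'  # comma between key and value
colon_form = '{"pretty_woman": [[1, 2, 3, 4]]}'  # colon between key and value


def parses_as_json(text: str) -> bool:
    """Return True if json.loads accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses_as_json(comma_form))  # False
print(parses_as_json(colon_form))  # True
```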
