Commit

Update Apache Tika and fix corrupted build
Newer versions of Tika require the "Accept: text/plain" header to
return only the text content, because the default is to return HTML.
ogecece committed Sep 14, 2024
1 parent 4700489 commit 9353f36
Showing 2 changed files with 5 additions and 2 deletions.
5 changes: 4 additions & 1 deletion data_extraction/text_extraction.py
@@ -25,7 +25,10 @@ def _try_extract_text(self, filepath: str) -> str:
         if self.is_txt(filepath):
             return self._return_file_content(filepath)
         with open(filepath, "rb") as file:
-            headers = {"Content-Type": self._get_file_type(filepath)}
+            headers = {
+                "Content-Type": self._get_file_type(filepath),
+                "Accept": "text/plain",
+            }
             response = requests.put(f"{self._url}/tika", data=file, headers=headers)
             response.encoding = "UTF-8"
             return response.text
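
The header change can be tried in isolation with a minimal sketch like the one below. It is not the project's code: `build_tika_headers` and `extract_text` are hypothetical helper names, the content type is guessed with the standard-library `mimetypes` module instead of the project's `_get_file_type`, and `urllib.request` is swapped in for `requests` so the sketch has no third-party dependencies. The `tika_url` argument is assumed to point at a running Tika server.

```python
import mimetypes
import urllib.request


def build_tika_headers(filepath: str) -> dict:
    """Build request headers for a PUT to the Tika server's /tika endpoint.

    Hypothetical helper: guesses the content type from the file extension
    and falls back to a generic binary type when the guess fails.
    """
    content_type, _ = mimetypes.guess_type(filepath)
    return {
        "Content-Type": content_type or "application/octet-stream",
        # Ask for plain text explicitly; per the commit message, newer
        # Tika versions return HTML when this header is absent.
        "Accept": "text/plain",
    }


def extract_text(tika_url: str, filepath: str, timeout: float = 60.0) -> str:
    """Send a file to Tika and return the extracted text decoded as UTF-8."""
    with open(filepath, "rb") as file:
        request = urllib.request.Request(
            f"{tika_url}/tika",
            data=file.read(),
            headers=build_tika_headers(filepath),
            method="PUT",
        )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text` with the server's base URL and a file path should return plain text rather than HTML, assuming the server honors the `Accept` header as the commit message describes.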
2 changes: 1 addition & 1 deletion scripts/Dockerfile_apache_tika
@@ -6,7 +6,7 @@ RUN adduser --system gazette && \
     apt-get clean

 # install Apache Tika
-RUN curl -o /tika-server.jar http://archive.apache.org/dist/tika/tika-server-1.24.1.jar && \
+RUN curl -o /tika-server.jar https://archive.apache.org/dist/tika/2.9.2/tika-server-standard-2.9.2.jar && \
     chmod 755 /tika-server.jar

 USER gazette