Dev #4 · Open · wants to merge 10 commits into `main`
15 changes: 15 additions & 0 deletions .conveyor/project.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
docker:
path: .
id: bbc60685-7f80-4d7f-a9ce-967fdb43f317
ide:
filename: ide.yaml
name: capstone-smart_clean
resources:
path: resources
streaming:
filename: streaming.yaml
template: ""
version: "0.3"
workflows:
path: dags
version: "2"
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@ If this is going well, we will run it on a cloud platform, called [Conveyor](htt
To get started, we've set up a Gitpod environment containing all the tools required to complete this exercise (awscli, python, vscode, ...).
You can access this environment by clicking the button below:

[![Open in Gitpod](https://gitpod.io/button/open-in-gitpod.svg)](https://gitpod.io/#https://github.com/datamindedacademy/capstone-llm)
[![Open in Gitpod](https://gitpod.io/button/open-in-gitpod.svg)](https://gitpod.io/#https://github.com/SamrtCookie/capstone-llm)

NOTE: When you fork the code repo to your own remote make sure to change the Gitpod URL to reflect your account in this README!

Expand Down
1 change: 1 addition & 0 deletions answers.json

Large diffs are not rendered by default.

Binary file added bin/linux/amd64/conveyor
Binary file not shown.
Binary file added bin/linux/amd64/datafy
Binary file not shown.
6 changes: 5 additions & 1 deletion capstone_llm/Dockerfile
Original file line number Diff line number Diff line change
@@ -1,7 +1,11 @@
FROM public.ecr.aws/dataminded/spark-k8s-glue:v3.5.1-hadoop-3.3.6-v1

USER 0
ENV PYSPARK_PYTHON python3
WORKDIR /opt/spark/work-dir
COPY ./requirements.txt .
RUN pip install -r requirements.txt
COPY . .
RUN pip install -e .
CMD ["python3", "-m", "capstonellm.tasks.clean", "--env=prod"]
6 changes: 6 additions & 0 deletions capstone_llm/dev-requirements.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
#
# This file is autogenerated by pip-compile with Python 3.11
# by the following command:
#
# pip-compile dev-requirements.in
#
13 changes: 13 additions & 0 deletions capstone_llm/requirements.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
boto3==1.35.5 ; python_version >= "3.10" and python_version < "4.0"
botocore==1.35.5 ; python_version >= "3.10" and python_version < "4.0"
certifi==2024.7.4 ; python_version >= "3.10" and python_version < "4.0"
charset-normalizer==3.3.2 ; python_version >= "3.10" and python_version < "4.0"
idna==3.8 ; python_version >= "3.10" and python_version < "4.0"
jmespath==1.0.1 ; python_version >= "3.10" and python_version < "4.0"
py4j==0.10.9.7 ; python_version >= "3.10" and python_version < "4.0"
pyspark==3.5.1 ; python_version >= "3.10" and python_version < "4.0"
python-dateutil==2.9.0.post0 ; python_version >= "3.10" and python_version < "4.0"
requests==2.32.3 ; python_version >= "3.10" and python_version < "4.0"
s3transfer==0.10.2 ; python_version >= "3.10" and python_version < "4.0"
six==1.16.0 ; python_version >= "3.10" and python_version < "4.0"
urllib3==2.2.2 ; python_version >= "3.10" and python_version < "4.0"
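
Each pin above is guarded by a pip environment marker, `python_version >= "3.10" and python_version < "4.0"`. A minimal sketch of what that marker checks, written against `sys.version_info` (the helper name `marker_satisfied` is invented here for illustration):

```python
import sys

def marker_satisfied(version_info=sys.version_info):
    # Local equivalent of the pip marker
    # 'python_version >= "3.10" and python_version < "4.0"'.
    major, minor = version_info[0], version_info[1]
    return (major, minor) >= (3, 10) and major < 4
```

pip evaluates these markers itself at install time; this sketch is only a way to reason about which interpreters the pins target.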
2 changes: 1 addition & 1 deletion capstone_llm/setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@
name="capstonellm",
version="0.0.1",
description="Capstone llm project",
python_requires=">=3.11",
python_requires=">=3.10",
packages=find_packages("src"),
package_dir={"": "src"},
py_modules=[splitext(basename(path))[0] for path in glob("src/*.py")],
Expand Down
19 changes: 16 additions & 3 deletions capstone_llm/src/capstonellm/tasks/clean.py
Original file line number Diff line number Diff line change
@@ -1,14 +1,27 @@
import argparse
import logging
from pyspark.sql import SparkSession

from pyspark.sql.functions import explode, col
from capstonellm.common.spark import ClosableSparkSession

logger = logging.getLogger(__name__)

def clean(spark: SparkSession, environment: str, tag: str):
    qs_df = spark.read.json(f"s3a://dataminded-academy-capstone-llm-data-us/input/{tag}/questions.json")
    qs_df = qs_df.withColumn("items", explode("items"))
    qs_df = qs_df.select(
        col("items.question_id").alias("question_id_q"),
        col("items.body").alias("question_body"),
        col("items.tags").alias("tags"),
        col("items.accepted_answer_id").alias("answer_id"),
    )
    ans_df = spark.read.json(f"s3a://dataminded-academy-capstone-llm-data-us/input/{tag}/answers.json")
    ans_df = ans_df.withColumn("items", explode("items"))
    ans_df = ans_df.select(
        col("items.answer_id").alias("answer_id"),
        col("items.body").alias("answer_body"),
        col("items.question_id").alias("question_id"),
    )
    final_df = qs_df.join(ans_df, on="answer_id", how="inner") \
        .select(["question_id", "answer_id", "tags", "question_body", "answer_body"])
    final_df.write.mode("overwrite").json(f"s3a://dataminded-academy-capstone-llm-data-us/cleaned/{tag}")

def main():
    parser = argparse.ArgumentParser(description="capstone_llm")
    parser.add_argument(
Expand Down
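
The `clean` task above explodes the `items` array in each raw JSON dump and inner-joins questions to their accepted answers on `answer_id`. The same join logic can be sketched in plain Python on sample records (the records and the `clean_records` helper below are invented for illustration; the real job runs in PySpark against S3):

```python
def clean_records(questions, answers):
    """Join each question to its accepted answer on answer_id (inner join)."""
    answers_by_id = {a["answer_id"]: a for a in answers}
    cleaned = []
    for q in questions:
        ans = answers_by_id.get(q.get("accepted_answer_id"))
        if ans is None:
            continue  # inner join: drop questions with no matching accepted answer
        cleaned.append({
            "question_id": ans["question_id"],
            "answer_id": ans["answer_id"],
            "tags": q["tags"],
            "question_body": q["body"],
            "answer_body": ans["body"],
        })
    return cleaned
```

Questions without an `accepted_answer_id` drop out, just as in the Spark inner join.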
4 changes: 2 additions & 2 deletions capstone_llm/src/capstonellm/tasks/ingest.py
Original file line number Diff line number Diff line change
@@ -1,12 +1,12 @@
import argparse
from typing import List

import logging
logger = logging.getLogger(__name__)

def ingest(tag: str):
    pass



def main():
    parser = argparse.ArgumentParser(description="stackoverflow ingest")
    parser.add_argument(
Expand Down
Binary file added conveyor_linux_amd64.tar.gz
Binary file not shown.
59 changes: 0 additions & 59 deletions dags/docker_example.py

This file was deleted.

58 changes: 58 additions & 0 deletions dags/reza_task.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.empty import EmptyOperator
from conveyor.operators import ConveyorContainerOperatorV2


default_args = {
    "owner": "airflow",
    "description": "Use of the DockerOperator",
    "depends_on_past": False,
    "start_date": datetime(2021, 5, 1),
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=1),
}

with DAG(
    "reza_dag",
    default_args=default_args,
    schedule_interval="5 * * * *",
    catchup=False,
) as dag:
    start_dag = EmptyOperator(task_id="start_dag")

    end_dag = EmptyOperator(task_id="end_dag")

    t1 = BashOperator(task_id="print_current_date", bash_command="date")

    # Local alternative (requires `import os` and Docker socket access if uncommented):
    # t2 = DockerOperator(
    #     task_id="clean_data",
    #     image="clean:latest",
    #     container_name="cleaner",
    #     api_version="auto",
    #     auto_remove=True,
    #     environment={
    #         "AWS_ACCESS_KEY_ID": os.getenv("AWS_ACCESS_KEY_ID"),
    #         "AWS_SECRET_ACCESS_KEY": os.getenv("AWS_SECRET_ACCESS_KEY"),
    #         "AWS_SESSION_TOKEN": os.getenv("AWS_SESSION_TOKEN"),
    #     },
    #     command="python3 -m capstonellm.tasks.clean --env prod",
    #     docker_url="unix://var/run/docker.sock",
    #     network_mode="bridge",
    # )
    t2 = ConveyorContainerOperatorV2(
        task_id="clean_data",
        aws_role="capstone_conveyor_llm",
        instance_type="mx.micro",
    )

    t4 = BashOperator(task_id="print_hello", bash_command='echo "hello world"')

    start_dag >> t1

    t1 >> t2 >> t4

    t4 >> end_dag
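
The chained `>>` statements above declare a linear order: `start_dag >> t1 >> t2 >> t4 >> end_dag`. The same dependency graph can be sketched with the standard library's `graphlib`, no Airflow required (task names mirror the `task_id`s in the DAG):

```python
from graphlib import TopologicalSorter

# Map each task to its upstream dependencies, mirroring the >> chains above.
deps = {
    "print_current_date": {"start_dag"},   # t1
    "clean_data": {"print_current_date"},  # t2
    "print_hello": {"clean_data"},         # t4
    "end_dag": {"print_hello"},
}

order = list(TopologicalSorter(deps).static_order())
```

Airflow performs the equivalent ordering itself when it parses the DAG; this sketch only makes the resulting execution order explicit.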
1 change: 1 addition & 0 deletions questions.json

Large diffs are not rendered by default.

26 changes: 13 additions & 13 deletions requirements.txt
Original file line number Diff line number Diff line change
@@ -1,13 +1,13 @@
boto3==1.35.5 ; python_version >= "3.11" and python_version < "4.0"
botocore==1.35.5 ; python_version >= "3.11" and python_version < "4.0"
certifi==2024.7.4 ; python_version >= "3.11" and python_version < "4.0"
charset-normalizer==3.3.2 ; python_version >= "3.11" and python_version < "4.0"
idna==3.8 ; python_version >= "3.11" and python_version < "4.0"
jmespath==1.0.1 ; python_version >= "3.11" and python_version < "4.0"
py4j==0.10.9.7 ; python_version >= "3.11" and python_version < "4.0"
pyspark==3.5.1 ; python_version >= "3.11" and python_version < "4.0"
python-dateutil==2.9.0.post0 ; python_version >= "3.11" and python_version < "4.0"
requests==2.32.3 ; python_version >= "3.11" and python_version < "4.0"
s3transfer==0.10.2 ; python_version >= "3.11" and python_version < "4.0"
six==1.16.0 ; python_version >= "3.11" and python_version < "4.0"
urllib3==2.2.2 ; python_version >= "3.11" and python_version < "4.0"
boto3==1.35.5 ; python_version >= "3.10" and python_version < "4.0"
botocore==1.35.5 ; python_version >= "3.10" and python_version < "4.0"
certifi==2024.7.4 ; python_version >= "3.10" and python_version < "4.0"
charset-normalizer==3.3.2 ; python_version >= "3.10" and python_version < "4.0"
idna==3.8 ; python_version >= "3.10" and python_version < "4.0"
jmespath==1.0.1 ; python_version >= "3.10" and python_version < "4.0"
py4j==0.10.9.7 ; python_version >= "3.10" and python_version < "4.0"
pyspark==3.5.1 ; python_version >= "3.10" and python_version < "4.0"
python-dateutil==2.9.0.post0 ; python_version >= "3.10" and python_version < "4.0"
requests==2.32.3 ; python_version >= "3.10" and python_version < "4.0"
s3transfer==0.10.2 ; python_version >= "3.10" and python_version < "4.0"
six==1.16.0 ; python_version >= "3.10" and python_version < "4.0"
urllib3==2.2.2 ; python_version >= "3.10" and python_version < "4.0"