Show #6
base: main
Changes from 6 commits
669d0e4
23fa125
4cf059e
5d00f57
d0e8192
2da2b2b
3a5f74d
2237317
f26c0a5
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
source("renv/activate.R") |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -20,16 +20,27 @@ To begin exploring the integration of Dagster and R: | |
```bash | ||
cd dagster-and-r | ||
``` | ||
3. **Install Dependencies** | ||
Using [poetry](https://python-poetry.org/), install the package and its dependencies: | ||
```bash | ||
poetry install | ||
3. **Install Python Dependencies** | ||
# you'll need a version of python installed | ||
Using uv | ||
# install uv | ||
# curl -LsSf https://astral.sh/uv/install.sh | sh | ||
uv venv | ||
source .venv/bin/activate | ||
uv sync | ||
``` | ||
4. **Install R dependencies** | ||
``` | ||
# from R | ||
# if you haven't installed renv before | ||
# install.packages("renv") | ||
# renv::restore() | ||
``` | ||
|
||
4. **Set RETICULATE_PYTHON environment variable** | ||
Determine the path to the python binary associated with this project's poetry environment. | ||
```bash | ||
poetry run | ||
# from your virtual environment | ||
which python | ||
# /home/user/.cache/pypoetry/virtualenvs/dagster-and-r-kS5e8P_l-py3.10/bin/python | ||
Review comment: I needed to update my

    #.Renviron
    RETICULATE_PYTHON=.venv/bin/python

We should probably update the README to include this new path after migrating to |
||
``` | ||
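The `.Renviron` value discussed in the comment above can also be derived from whichever interpreter is currently active. The following is a minimal sketch, assuming it is run with the project's virtual environment activated; the helper name `renviron_line` is hypothetical and not part of the repository.

```python
import sys


def renviron_line(python_path: str = sys.executable) -> str:
    """Build the line that points reticulate at a given Python binary.

    `python_path` defaults to the interpreter running this script, which is
    the virtual environment's Python when the environment is activated.
    """
    return f"RETICULATE_PYTHON={python_path}"


if __name__ == "__main__":
    # Print the line so it can be pasted into .Renviron at the project root.
    print(renviron_line())
```

Printing the line rather than writing the file is a deliberate choice here: it avoids overwriting an existing `.Renviron` that may hold other settings.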
|
@@ -38,7 +49,7 @@ Create a new `.Renviron` file at the root of the project and set the `RETICULATE | |
5. **Launch the Dagster UI** | ||
Start the Dagster web server: | ||
```bash | ||
poetry run dagster dev | ||
dagster dev | ||
``` | ||
Access the UI at http://localhost:3000 in your browser. | ||
|
||
|
@@ -61,7 +72,7 @@ Create a new `.Renviron` file at the root of the project and set the `RETICULATE | |
Then, start the Dagster UI web server: | ||
|
||
```bash | ||
poetry run dagster dev -m dagster_and_r | ||
dagster dev -m dagster_and_r | ||
``` | ||
|
||
Open http://localhost:3000 with your browser to see the project. | ||
|
@@ -85,21 +96,21 @@ Open http://localhost:3000 with your browser to see the project. | |
### Adding Python Dependencies | ||
To add new Python packages to the project: | ||
```bash | ||
poetry add <pkg-name> | ||
uv add <pkg-name> | ||
``` | ||
|
||
### Unit Testing | ||
Unit tests are essential for ensuring code reliability and are currently being developed. Run existing tests using `pytest`: | ||
```bash | ||
poetry run pytest dagster_and_r_tests | ||
pytest dagster_and_r_tests | ||
``` | ||
> [!NOTE] | ||
> Unit tests are a work in progress. | ||
|
||
### Schedules and Sensors | ||
To enable [Schedules](https://docs.dagster.io/concepts/partitions-schedules-sensors/schedules) and [Sensors](https://docs.dagster.io/concepts/partitions-schedules-sensors/sensors), ensure the [Dagster Daemon](https://docs.dagster.io/deployment/dagster-daemon) is active: | ||
```bash | ||
poetry run dagster dev | ||
dagster dev | ||
``` | ||
With the Daemon running, you can start using schedules and sensors for your jobs. | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,4 @@ | ||
# make sure these packages are installed | ||
library(reticulate) | ||
library(readr) | ||
library(glue) | ||
|
@@ -7,6 +8,36 @@ library(magrittr) | |
reticulate::py_config() | ||
stopifnot(reticulate::py_module_available("dagster_pipes")) | ||
|
||
# Function to convert R types to Python types | ||
convert_r_to_python_types <- function(df) { | ||
# Get R types | ||
r_types <- sapply(df, class) | ||
|
||
# Define type mapping | ||
type_mapping <- list( | ||
"numeric" = "float", | ||
"integer" = "int", | ||
"character" = "str", | ||
"factor" = "str", | ||
"logical" = "bool", | ||
"Date" = "datetime.date", | ||
"POSIXct" = "datetime.datetime", | ||
"POSIXlt" = "datetime.datetime" | ||
) | ||
|
||
# Convert types | ||
python_types <- sapply(r_types, function(x) { | ||
if (x %in% names(type_mapping)) { | ||
type_mapping[[x]] | ||
} else { | ||
"object" # default type | ||
} | ||
}) | ||
|
||
return(reticulate::r_to_py(as.list(python_types))) | ||
} | ||
Comment on lines +12 to +38

Review comment: I'm surprised this

Reply: It did the floats fine but defaults to "object" for the strings. I'll check into it some more

Reply: Thanks for the info. I haven't had a chance to test your changes yet, but I'm hoping there's still a way that we can piggyback off reticulate's type conversion somehow. The r_to_py.data.frame() function should be able to handle most R data types. Maybe we can pass

It's possible that reticulate isn't recognizing that pandas is available, which could be causing

I'll try taking a closer look in the next couple of days.

Reply: Oh r_to_py.data.frame was the function I was missing after rifling through the reticulate docs. Yeah that might do it

Reply: Leaning on

    reticulate::r_to_py(df)

    {'Sepal.Length': dtype('float64'), 'Sepal.Width': dtype('float64'), 'Petal.Length': dtype('float64'), 'Petal.Width': dtype('float64'), 'Species': CategoricalDtype(categories=['setosa', 'versicolor', 'virginica'], ordered=False, categories_dtype=object)}

With a little more work, we can extract the name of each data type from its

    convert_r_to_python_types <- function(df) {
      df_pandas <- reticulate::r_to_py(df)
      dtypes_dict <- df_pandas$dtypes$to_dict()
      for (col_name in names(dtypes_dict)) {
        dtypes_dict[[col_name]] <- as.character(dtypes_dict[[col_name]]$name)
      }
      return(dtypes_dict)
    }

    {'Sepal.Length': 'float64', 'Sepal.Width': 'float64', 'Petal.Length': 'float64', 'Petal.Width': 'float64', 'Species': 'category'}

This will materialize successfully, however, reticulate's

I'm not really sure if there's an implicit way to rely on reticulate here. The hard-coded mapping dictionary may be the only way.

Reply: I think relying on reticulate and then manually overriding a few might make the most sense (like float64 vs. float). I'll take the code here and add it to my PR with an override |
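The override idea from the last reply (take the pandas dtype names and replace only a few of them) might look roughly like the sketch below. It is written in plain Python for illustration and is not the code that ended up in the PR; the table `DTYPE_OVERRIDES` and the function `normalize_dtypes` are hypothetical names, and the specific overrides are assumptions rather than an agreed mapping.

```python
# Hypothetical override table: pandas dtype name -> plain Python type name.
# Only the entries listed here are rewritten; everything else passes through.
DTYPE_OVERRIDES = {
    "float64": "float",
    "int64": "int",
    "bool": "bool",
    "object": "str",
    "category": "str",
}


def normalize_dtypes(dtype_names, overrides=DTYPE_OVERRIDES, default=None):
    """Return a new dict with overridden dtype names.

    Names without an override are kept unchanged, unless `default` is given,
    in which case they are replaced by `default`.
    """
    result = {}
    for column, name in dtype_names.items():
        fallback = name if default is None else default
        result[column] = overrides.get(name, fallback)
    return result


# Input mirrors the dtype-name dictionary shown in the thread above.
iris_dtypes = {
    "Sepal.Length": "float64",
    "Sepal.Width": "float64",
    "Petal.Length": "float64",
    "Petal.Width": "float64",
    "Species": "category",
}

print(normalize_dtypes(iris_dtypes))
```

Keeping unknown names unchanged by default means a new dtype shows up in the metadata as-is instead of being silently collapsed to a generic label.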
||
|
||
|
||
# Import Python modules | ||
# R doesn't support selective imports like Python, so you have to do this | ||
# to avoid typing the full namespace path repeatedly... | ||
|
@@ -23,9 +54,20 @@ with(open_dagster_pipes() %as% pipes, { | |
context$log$info(head(iris)) | ||
context$log$info(os$environ["MY_ENV_VAR_IN_SUBPROCESS"]) | ||
output_dir <- Sys.getenv("OUTPUT_DIR") | ||
iris_head <- head(iris) | ||
context$log$info(glue::glue("output_dir: {output_dir}")) | ||
context$report_asset_materialization() | ||
|
||
#python function to report back the materialization and metadata | ||
context$report_custom_message( | ||
payload = reticulate::r_to_py(list( | ||
"dagster/row_count" = nrow(iris), | ||
# if using report_asset_materialization | ||
#list( type = "md", "raw_value" = paste(knitr::kable(iris_head, format = "pipe"), collapse = "\n") ), | ||
"preview" = paste(knitr::kable(iris_head, format = "pipe"), collapse = "\n"), | ||
"iris_head_df" = reticulate::r_to_py(jsonlite::toJSON(x = iris_head, dataframe = "columns")), | ||
"column_types" = convert_r_to_python_types(iris_head) | ||
)) | ||
) | ||
context$log$info(glue::glue("got here!")) | ||
# Ensure that Sepal.Length field does not contain any NAs | ||
context$report_asset_check( | ||
asset_key="iris_r", | ||
|
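On the Python side, the payload sent by `report_custom_message` in the hunk above would need to be unpacked before it can be attached as metadata. The sketch below assumes the payload arrives as a dictionary with the same keys used in the R script and that `iris_head_df` is a column-oriented JSON string (as `jsonlite::toJSON(dataframe = "columns")` is expected to produce); the function `summarize_payload` and the sample values are hypothetical, and no Dagster API is used.

```python
import json


def summarize_payload(payload: dict) -> dict:
    """Unpack a custom-message payload into a few plain summary values."""
    # Column-oriented JSON: {"column name": [value, value, ...], ...}
    columns = json.loads(payload["iris_head_df"])
    column_types = payload.get("column_types", {})
    return {
        "row_count": payload["dagster/row_count"],
        # Number of rows in the preview, taken from the longest column.
        "preview_rows": max((len(v) for v in columns.values()), default=0),
        "columns": list(columns),
        # Columns present in the data but missing from the type mapping.
        "untyped_columns": [c for c in columns if c not in column_types],
    }


# Made-up sample payload with two preview rows, for illustration only.
sample = {
    "dagster/row_count": 150,
    "preview": "|Sepal.Length|Species|",
    "iris_head_df": '{"Sepal.Length":[5.1,4.9],"Species":["setosa","setosa"]}',
    "column_types": {"Sepal.Length": "float", "Species": "str"},
}

print(summarize_payload(sample))
```

Checking for untyped columns gives a cheap signal when the R-side type mapping and the serialized data drift apart.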
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,30 +0,0 @@ | ||
from dagster import ( | ||
Definitions, | ||
PipesSubprocessClient, | ||
) | ||
from . jobs import docker_container_op_r | ||
from . asset_checks import ( | ||
# no_missing_sepal_length_check_r, | ||
no_missing_sepal_length_check_py, | ||
) | ||
|
||
# python_assets = load_assets_from_modules([assets]) | ||
from . assets import ( | ||
hello_world_r, | ||
iris_r, | ||
iris_py, | ||
) | ||
|
||
defs = Definitions( | ||
assets=[ | ||
hello_world_r, | ||
iris_r, | ||
iris_py, | ||
], | ||
asset_checks=[ | ||
# no_missing_sepal_length_check_r, | ||
no_missing_sepal_length_check_py, | ||
], | ||
jobs=[docker_container_op_r], | ||
resources={"pipes_subprocess_client": PipesSubprocessClient()}, | ||
) | ||
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
from dagster import ( | ||
Definitions, | ||
PipesSubprocessClient, | ||
) | ||
from . jobs import docker_container_op_r | ||
from . asset_checks import ( | ||
# no_missing_sepal_length_check_r, | ||
no_missing_sepal_length_check_py, | ||
) | ||
|
||
# python_assets = load_assets_from_modules([assets]) | ||
from . assets import ( | ||
hello_world_r, | ||
iris_r, | ||
iris_py, | ||
) | ||
|
||
defs = Definitions( | ||
assets=[ | ||
hello_world_r, | ||
iris_r, | ||
iris_py, | ||
], | ||
asset_checks=[ | ||
# no_missing_sepal_length_check_r, | ||
no_missing_sepal_length_check_py, | ||
], | ||
jobs=[docker_container_op_r], | ||
resources={"pipes_subprocess_client": PipesSubprocessClient()}, | ||
) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
def main(): | ||
print("Hello from dagster-and-r!") | ||
|
||
|
||
if __name__ == "__main__": | ||
main() |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
def main(): | ||
print("Hello from dagster-and-r!") | ||
|
||
|
||
if __name__ == "__main__": | ||
main() |
Review comment: Missing an opening ``` here?