Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add postgres #49

Merged
merged 75 commits into from
Aug 24, 2020
Merged

Add postgres #49

merged 75 commits into from
Aug 24, 2020

Conversation

jlubken
Copy link
Member

@jlubken jlubken commented Aug 5, 2020

  • Remove rst cargo
  • Use pyproject.toml, then setup.cfg, then setup.py.
  • Use extras for optional dependencies.
  • Separate sql from python code.
  • Use configure contextmanagers to parse arguments and inject dependencies by surrounding the command to parse arguments.
  • Move argument parser setup from mixin to dependency class.
  • Use Messages classes to setup log messages for each dependency.
  • Add (Abstract)Persistor component classes.
  • Pull up main function into Service classmethod to simplify logging setup.
  • Update retry decorator retires and backoff to help with a slow postgres startup. Mongo's driver has a very high number of connect retries built-in (but no backoff).
  • Add tests for non-sqlalchemy configuration and secrets.
  • Simplify df_from_query methods.

Not Done:

  • Separate connect from commit and rollback contextmanagers to allow for temporary tables to be used for cohort without round tripping the patient ids through python. Use transaction isolation instead of client hacks/broken work-arounds to ensure that the cohort doesn't change during processing.
  • Pass an optional connection to Task.call(service, batch, con=None).
  • Add Composite Tasks. Allow the composite to wrap the multistep data load with a single mssql database connection.

Left not done:

  • SqlAlchemy cargo acceptance/rejection for ops pipelines.
  • The sql-injection protection done by psycopg2 and pymssql use identical parameter placeholders %(name)s without an additional layer of software.
  • Both drivers expand python list variables safely. Prototype code with SqlAlchemy has been doing this with additional sql-injection unsafe code. Just use the drivers directly, or find out how to do this safely through SqlAlchemy.
  • Most of SqlAlchemy's features are simply unexamined and unused: ORM, query expressions, sessions, dynamic sql help.
  • Careful management of the unit of work is being left incomplete during prototyping, and is obscured by SqlAlchemy's reconnect-on-demand feature. Features like this are convenient for prototyping, but are not appropriate for production pipelines.
  • Newer pandas, have DataFrame.read_sql(sql, con, parameters) that leverage SqlAlchemy, but non of these features were used during upstream prototyping. Rather, fallback to familiar copy and paste, highly inefficient code for marshaling, and un-marshaling records.
  • The choice is not between using the drivers or SqlAlchemy, but whether or not SqlAlchemy adds benefit for ops pipelines in addition to the underlying drivers; at this time, and with the current level of sql and SqlAlchemy experience.

@jlubken jlubken self-assigned this Aug 13, 2020
@jlubken jlubken requested a review from Mdraugelis August 13, 2020 20:24
src/dsdk/utils.py Outdated Show resolved Hide resolved
Copy link
Member

@mdbecker mdbecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments

@jlubken
Copy link
Member Author

jlubken commented Aug 19, 2020

Tested failure of the pytest to propagate and fail the workflow, then fixed the github workflow to pass.

@jlubken jlubken marked this pull request as ready for review August 24, 2020 17:25
@jlubken jlubken requested review from mdbecker and removed request for Mdraugelis August 24, 2020 17:27
@jlubken jlubken merged commit 3e53d59 into master Aug 24, 2020
@jlubken jlubken deleted the add_postgres branch August 24, 2020 20:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants