Here are some notes as I have looked at (and worked with) additions done to gocddash in the summer of 2016.
The most important aspect of the additions to gocddash is obviously that they do work and add value. Both of those conditions are true!
Most of the notes below point out things that could have been made differently or that I've changed, to make the code simpler, and easier to maintain and extend.
I suspect that the authors have thought about a lot of these things already, but didn't get the time to fix it. Maybe there is still a lesson or two to learn from these notes...
Parts of it were written when I was somewhat grumpy, so if you have an asbestos suit, you might want to wear it. ;-)
September 2016
Magnus Lyckå
Measuring code coverage helps you understand to what extent the code is getting tested (or used at all). My initial measurements showed about 50% code coverage, which makes it difficult to refactor the code without first adding a substantial amount of test code. E.g. graphs, email and claims were untested.
Profiling showed that most of the run time is spent waiting for two things:
- Network queries to the go server.
- Bokeh.
Solutions for these issues are presented below.
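(For reference, this kind of measurement is easy to repeat with the standard library's cProfile; a minimal sketch, where main() is just a placeholder for whatever entry point you want to measure:)

import cProfile
import pstats

def main():
    ...  # placeholder: call the sync run, a request handler, or whatever should be measured

profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Show the 20 functions with the highest cumulative time, which is where
# the waiting for the go server and for Bokeh shows up.
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)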
The inspect code function in e.g. PyCharm is very helpful, and fixing the warnings often reveals actual bugs or bugs waiting to happen, and it certainly makes the code more consistent and easier to read. (There are a few inspection bugs for Python 3 in PyCharm though.)
Some code was clearly not tested. For instance, running info in the cli tool without stating a pipeline caused a call to a method in the data_access module which didn't exist. (Maybe you renamed a method and didn't have tests...)

AttributeError: 'SQLConnection' object has no attribute 'get_synced_pipelines'
It's really simple to write generic code in Python, and I think some of the mistakes in the code were partly due to a lack of understanding of this. There are many more paths through the program than needed, presumably because the programmers hadn't seen the simple Pythonic way of writing more generic code.
Also, some diagnostic messages didn't match behaviour:
2016-10-07 14:48:59.546286 Will synchronize [...] from 39 onwards.
Requested pipelines = 10 (from 39 to 49)
Pipeline counter: 39
Pipeline counter: 38
[...]
Pipeline counter: 31
Pipeline counter: 30
Don't use global variables, and be very careful when messing with module global state. I'm sure there are other programming languages where it's very nifty to do things like this, but modules with global variables and functions are not like classes with members and methods in Python. As a concrete example, there were unit tests that replaced global objects in modules with mocks, and didn't clean up after themselves. This caused spurious errors in completely different unit tests which used the same modules when the tests were run with python3 -m unittest.
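If a test really needs to swap out a module-level object, something like unittest.mock.patch cleans up by itself (the exact import path below is an assumption):

from unittest import mock

# mock.patch restores the original attribute when the with-block ends, so other
# tests that import the same module see the real SQLConnection again afterwards.
with mock.patch('gocddash.analysis.data_access.SQLConnection') as fake_connection:
    ...  # run the code under test against fake_connection here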
More examples below in the sections about database code and OOP.
There is a lot of Connascence of Position in the code. For instance, the functions returning data from an SQL SELECT statement typically return a tuple, or a list of tuples, often even the result of a SELECT * FROM ..., which means that A) it's difficult to change the data model without breaking code, and B) it's time consuming and error prone to maintain the code, since you might need to go from a function call you analyse, to the data_access DB function, to a CREATE VIEW statement, to a CREATE TABLE statement, to understand what it is you get back from your function call.
Don't use SELECT * FROM ... in production code, and prefer a row factory etc, which gives you the ability to access columns by name. E.g. use something like conn.cursor(cursor_factory=psycopg2.extras.DictCursor) with psycopg2, or conn.row_factory = sqlite3.Row with sqlite3.
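A minimal sqlite3 sketch of what that buys you (the table and column names here are only examples):

import sqlite3

conn = sqlite3.connect('gocddash.db')
conn.row_factory = sqlite3.Row   # rows can now be indexed by column name

cursor = conn.cursor()
cursor.execute("SELECT pipeline_name, pipeline_counter FROM pipeline_instance")
for row in cursor:
    # No need to remember column positions, and no SELECT * needed.
    print(row['pipeline_name'], row['pipeline_counter'])
conn.close()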
For non-SQL code, prefer e.g. instances of a class, namedtuple, dict and named arguments to tuples and positional arguments.
Favour locality! The maintenance cost of e.g. Connascence of Position depends a lot on locality. It's no big deal to call a method with five arguments if it's defined five lines below in the code.
Parts of the code are really much more complex than the actual problem warrants. There are implicit assumptions which should be clearly declared once, but which are instead assumed in several different modules. E.g. the idea that we fetch data in chunks of 10 when we sync pipelines is implicit in at least three modules, which makes it difficult to change the chunk size.
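Declare that sort of thing once, and import it where it's needed; e.g. (the module and constant names are mine):

# e.g. in gocddash/analysis/config.py (hypothetical location)
SYNC_CHUNK_SIZE = 10   # how many pipeline instances we request per call to the go server

# and in the modules that need it:
# from gocddash.analysis.config import SYNC_CHUNK_SIZE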
Another example of structural confusion is the synchronize() function in the gocddash_sync script. It's obviously good that it abstracts away some things, and that e.g. the network calls to the Go server and the database interactions are abstracted away, but that's about how much abstraction you need in this case.
The way it's done, the bulk of the code is in analysis.pipeline_fetcher, but the gocddash_sync script only interacts with it through a thin wrapper in the analysis.actions module. On the other side of analysis.pipeline_fetcher, analysis.domain is used as an abstraction layer over analysis.data_access, which in turn is an abstraction layer over the SQL database.
This kind of layered architecture, though typically with three layers rather than five, was popular in large projects in the 1990s, before agile methodologies and test driven development became popular. The plan was to isolate the details in the different layers from each other, so that they could be developed independently without creating a mess when several dozen developers were developing the same application. It's debatable whether this is a good idea at all, and it's certainly overkill in this case. It's not consistently used here anyway. For instance, the gocddash_sync script makes calls directly to the bottom analysis.data_access layer, completely short-circuiting the layered approach.
Someone once said that there is no problem you can't solve by adding another layer of abstraction except the problem of having too many layers of abstraction. I'd say we ran into this problem here...
First of all, we don't need all the layers of abstraction seen here. It's a bit as if you've thought "hey, let's add another module", when you ought to have thought "is there a simpler way of solving this problem?"
One reason for structural confusion could be collaboration problems. It might be easier to add or change things in "my module" even if I know they really belong in "his module". The way to get around this is to integrate continuously: work with small chunks in the same branch (preferably master), make sure to push well tested code after each little chunk, and pull the collaborators' changes as soon as they have pushed.
I also think the solutions would have been simpler if there had been more object-oriented design. Classes and instances are good at organizing data and the operations on that data, if done right. I didn't really appreciate the inter-modular roller coaster I had to go through to follow some of the code flow...
If we use an approach with abstractions/layers, it's important to understand what belongs where, and how we actually add entropy to the code.
For instance, text strings should typically be created
close to where they are displayed in the code structure. Data
passed between layers is better passed as data structures with
descriptively labeled pristine values.
See Connascence of Position above. If it seems important to declare string formatting where the data is constructed rather than where it is consumed, use classes and the __str__ method, instead of just passing a string. Another day, someone who uses the function wants to use that number you embedded in the string...
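A sketch of what I mean (the class and its fields are invented for the example):

class FailureSummary:
    """Hypothetical value object passed between layers."""

    def __init__(self, pipeline_name, failure_count):
        self.pipeline_name = pipeline_name
        self.failure_count = failure_count

    def __str__(self):
        # The formatting lives here, but the raw number is still
        # available to any caller who wants to do arithmetic with it.
        return "{}: {} failures".format(self.pipeline_name, self.failure_count)

summary = FailureSummary("old-system-tests", 3)
print(summary)                  # old-system-tests: 3 failures
print(summary.failure_count)    # 3, still a number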
There were a number of functions and code sections that only returned an empty string, or never got any value (e.g. finished_pipelines). Work in progress adds no value, but it does add a significant cost in maintenance and in understanding the code.
Please refrain from adding any code which "might come in handy later", and make sure to do one thing at a time, and only that thing until that thing works. If code falls out of use, remove it entirely, even if you might want it again later. It's in the git history...
Use super(), e.g. super().__init__(args) to call __init__ of appropriate parent class. E.g:

class TestFailure(StageFailure):
    def __init__(self, stage, failure_signature, test_names):
        super().__init__(stage)
        ...

instead of

class TestFailure(StageFailure):
    def __init__(self, stage, failure_signature, test_names):
        StageFailure.__init__(self, stage)
        ...
A common motivation for super is to get correct method resolution order in case of multiple inheritance, but as I see it, the main advantage, besides being less cluttered, is that it reduces the risk of mistakes as the code evolves and we change the inheritance structure or copy and paste code.
A cursor must not be used beyond a single database transaction; the behaviour is undefined. (Marc-André Lemburg, author of PEP 249, the DB-API 2.0 spec, was very clear about this when he commented on my "Database Programming with Python" tutorial at EuroPython 2004...) This might have caused the odd bug you saw with psycopg2 queries.
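In other words: create a cursor, use it within one transaction, and let it go. A psycopg2-style sketch (the table and column names are invented):

def save_claim(conn, pipeline_name, pipeline_counter, responsible, description):
    # One cursor, one transaction: the outer with-block commits (or rolls back)
    # the transaction, and the cursor is not reused after it has been closed.
    with conn:
        with conn.cursor() as cursor:
            cursor.execute(
                "INSERT INTO claim (pipeline_name, pipeline_counter, responsible, description)"
                " VALUES (%s, %s, %s, %s)",
                (pipeline_name, pipeline_counter, responsible, description))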
Why the explicit create_connection followed by get_connection and a global connection object? (Same pattern is used with pipeline config.) I dropped create_connection and the global object, and made SQLConnection a Borg. That reduces get_connection() to just return SQLConnection().
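For reference, the Borg pattern just means that every instance shares the same state, so it doesn't matter how many times SQLConnection() is called; roughly like this (details of the real class left out):

class SQLConnection:
    _shared_state = {}   # every instance uses this dict as its __dict__

    def __init__(self):
        self.__dict__ = self._shared_state
        if not getattr(self, 'initialized', False):
            self._connect()          # first call does the real set-up
            self.initialized = True  # later calls just reuse the shared state

    def _connect(self):
        ...  # open the database, create missing tables, etc.


def get_connection():
    return SQLConnection()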
There seems to be a reluctance to use objects and methods implemented in other modules. E.g. the go_client module has two classes with the same interface, and then the same interface is repeated in its entirety as module-level functions, instead of simply using a factory function which returns an instance of the appropriate variant of the interface.
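Something along these lines (I'm guessing about the second variant; the names below are placeholders to show the idea, not the real classes):

class GoSource:
    """Placeholder for the network-backed variant in go_client."""
    def __init__(self, base_url, auth):
        self.base_url, self.auth = base_url, auth

class FileGoSource:
    """Placeholder for the other variant with the same interface."""
    def __init__(self, directory):
        self.directory = directory

def create_go_source(base_url=None, auth=None, file_directory=None):
    # One decision point; afterwards every caller just uses the common interface
    # instead of a parallel set of module-level wrapper functions.
    if file_directory is not None:
        return FileGoSource(file_directory)
    return GoSource(base_url, auth)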
There are a number of classes which are just used as plain C structs (e.g. in analysis.domain). They have an __init__ method which simply sets members, and basically no other methods; instead there are module global functions using these objects as values. Having methods on classes is not a bad idea, but if a non-OO approach is preferred, there are better containers in the Python 3 library than anorectic classes. collections.namedtuple for instance.
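E.g. something like this instead of a class with nothing but an __init__ (the field names are just examples):

from collections import namedtuple

# An immutable little record with named fields and a readable repr for free.
StageInstance = namedtuple('StageInstance',
                           ['pipeline_name', 'pipeline_counter', 'stage_name', 'result'])

stage = StageInstance('old-system-tests', 123, 'integration-test', 'Failed')
print(stage.result)   # access by name instead of by position
print(stage)          # StageInstance(pipeline_name='old-system-tests', ...)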
One important aspect of this is that the life cycles of instances and modules are completely different in Python. When you create an object, you get a new instance, which is not shared outside your control, and it is garbage collected when there are no more references to it. An imported module stays alive during the whole execution of the process, and its state is shared.
There is a lot of this kind of stuff:

some_log = some_log.split('Some peculiar text')[1]
some_log = some_log.split('Some other interesting text')[0]
...
This is not a very robust way of parsing text files. The errors we'll get when the file doesn't look like we expected will give the impression that there is a bug in the code rather than something wrong with the file, and I kind of agree... ;-)
I know regular expressions aren't a lot of fun, but I still think they'd be a better fit here. When the file is broken, you'd notice that you didn't get the data you expected and could act on that, rather than getting an unexpected IndexError all of a sudden.
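A sketch of the kind of thing I mean, reusing the made-up marker texts from above:

import re

FAILURE_SECTION = re.compile(
    r"Some peculiar text(?P<section>.*?)Some other interesting text",
    re.DOTALL)

def extract_failure_section(some_log):
    match = FAILURE_SECTION.search(some_log)
    if match is None:
        # The file didn't look like we expected: say so explicitly, instead of
        # blowing up with an IndexError somewhere in a chain of split() calls.
        raise ValueError("could not find the failure section in the log")
    return match.group('section')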
A maybe better option could be https://pypi.python.org/pypi/parse
sys.path.append(str(Path(abspath(getsourcefile(lambda: 0))).parents[1]))
It's probably better to install the package in the virtualenv with setup.py, or possibly to set PYTHONPATH in advance of running the program.
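A minimal setup.py sketch (most metadata left out) is enough for pip install -e . in the virtualenv, after which the package is importable without any sys.path surgery:

from setuptools import setup, find_packages

setup(
    name='gocddash',
    version='0.1',
    packages=find_packages(),
)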
except Exception as error

Ouch! In Python this means that we e.g. catch NameError and AttributeError, which are almost certainly programming errors!
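Catch only the narrow set of exceptions you actually expect and can handle; e.g. (the requests call is just an illustration):

import requests

def fetch_json(url, auth):
    # Hypothetical example: network trouble is handled, but a NameError or
    # AttributeError still surfaces as the programming error it is.
    try:
        response = requests.get(url, auth=auth, timeout=30)
        response.raise_for_status()
    except (requests.ConnectionError, requests.Timeout, requests.HTTPError) as error:
        print("Could not fetch {}: {}".format(url, error))
        return None
    return response.json()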
Flask Blueprints are a good way of partitioning Flask apps. It might be better to e.g. have the insights and graph parts of app.py as separate Blueprints, or in a common pipeline Blueprint.
Related to the data modeling mentioned below, it might be better to have URLs focused on the central resources. E.g.
- http://go/dash/pipelines/old-system-tests/insights instead of http://go/dash/insights/old-system-tests,
- http://go/dash/pipelines/old-system-tests/graphs instead of http://go/dash/graphs/old-system-tests, and
- http://go/dash/pipelines/old-system-tests/counter/123/claim instead of http://go/dash/claim with all the substance in the POST data.

The web server logs are certainly more useful with more explicit URLs, and I think testing gets easier too.
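A rough sketch of a common pipeline Blueprint (handler bodies and template names invented):

from flask import Blueprint, render_template

pipeline = Blueprint('pipeline', __name__, url_prefix='/pipelines')

@pipeline.route('/<pipeline_name>/insights')
def insights(pipeline_name):
    ...  # gather the insights data for this pipeline
    return render_template('insights.html', pipeline_name=pipeline_name)

@pipeline.route('/<pipeline_name>/graphs')
def graphs(pipeline_name):
    return render_template('graphs.html', pipeline_name=pipeline_name)

# In app.py:
#     app.register_blueprint(pipeline)
# which gives URLs like /pipelines/old-system-tests/insights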
Decorators such as @app.before_first_request help you run things at the right time, without having to run them when the module is imported (which might cause a mess for testing etc).
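For instance (the body of the function here is just a placeholder):

from flask import Flask

app = Flask(__name__)

@app.before_first_request
def prepare_database():
    # Runs once, right before the first request is served, instead of as a
    # side effect of importing the module, which makes testing much easier.
    ...  # e.g. get_connection() and create any missing tables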
We can certainly model our business objects in different ways, and there is no strict right and wrong, but depending on how we twist and turn things, we might end up with solutions that are fairly straightforward, or very complex, to use. Concretely:
- Is a claim best modeled as an entity which has a pipeline instance as a property, or as a couple of properties of a pipeline instance?
- Is a final stage an entity, or is it simply a property of a stage?
I think five or so of the database tables are better modeled as additional columns in the pipeline_instance or stage tables.
If I used an RDBMS for persistence of objects in an object oriented application, I'd certainly use abstract keys, but in the current situation, where the natural keys are derived from the go-server REST API, why not simply use these? In this case, abstract keys just seem like an awkward detour which forces us to join tables much more than we really need.
I haven't fully gotten my head around the views yet, but it seems to me that you're using the database for some things that might be easier and quicker to do in Python, particularly with your overly complex data model and lack of indices.
We use the database for two simple tasks:
- Caching some things we learn from the go server REST API.
- Bookkeeping of small amounts of non-critical data:
- claims: name and comment
- whether email has been sent
Using a full-blown SQL server such as PostgreSQL for this seems like overkill. Adding a migration tool (yoyo) to the mix when there are no requirements on persistence also seems extravagant.
As a start, I'm ditching PostgreSQL and yoyo, and replacing psycopg2 with the built-in sqlite3 module, which is sufficient for this task. This way, we drop third-party dependencies (docker-py as well) and remove configuration work.
I'm also adding memcached through werkzeug.contrib.cache.MemcachedCache, to cache all network calls to the go-server (in go_client.GoSource.base_request). This does add an extra dependency, but memcached is basically no maintenance: just start a server. No schema to maintain. No authentication etc.
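Roughly along these lines (the key naming, timeout and wrapper are just a sketch, and the base_request signature is assumed):

from werkzeug.contrib.cache import MemcachedCache

cache = MemcachedCache(['127.0.0.1:11211'])

def cached_base_request(go_source, path, timeout=5 * 60):
    # Sketch of a wrapper around go_client.GoSource.base_request: identical
    # requests within the timeout are answered straight from memcached.
    key = 'gocddash.' + path
    response = cache.get(key)
    if response is None:
        response = go_source.base_request(path)
        cache.set(key, response, timeout=timeout)
    return response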
Considering that Bokeh takes a lot of time (both CPU time and network time), I'm considering replacing it with a client-side JavaScript tool, such as plotly etc. Besides lowering the load on the go-server, it could also enable features such as zooming and panning in diagrams. (Someone likes bar charts with 500 bars in them...)
UPDATE: I replaced Bokeh with plotly.js. This reduced both the installation time (no numpy, pandas or bokeh) and the run time (server side CPU time, the network transfer of a 2 MB PNG, and client side rendering time) a lot, and allowed for zooming and panning etc. in the chart.