Investigate/Improve performance of `beval` (and `bgen`)

Currently, running `beval` on a non-trivial benchmark set is extremely slow. To a lesser extent, this is also true for `bgen`.
It would be nice if we could improve the runtime of these tools. 

As a first step, it would probably make sense to profile and document the major bottlenecks in evaluation, e.g. file/disk access vs. computational cost.
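
One lightweight way to get that split might be to wrap the relevant code paths in a small phase timer. This is only a sketch, assuming the tools can be instrumented from Python: `PhaseTimer` and the phase names `"io"` and `"compute"` are hypothetical and not part of `beval` or `bgen`, and the demo workload is a stand-in for the real evaluation loop.

```python
import os
import tempfile
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock time per named phase (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)  # phase name -> seconds spent
        self.counts = defaultdict(int)    # phase name -> number of entries

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the wrapped code raises.
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        """Return one line per phase, slowest phase first."""
        grand_total = sum(self.totals.values()) or 1.0
        ranked = sorted(self.totals.items(), key=lambda kv: kv[1], reverse=True)
        return "\n".join(
            f"{name:<10} {secs:8.3f}s {100 * secs / grand_total:5.1f}% "
            f"({self.counts[name]} calls)"
            for name, secs in ranked
        )


if __name__ == "__main__":
    # Stand-in workload: write and read a temp file ("io"), then do some
    # arithmetic ("compute"). Replace with the real evaluation steps.
    timer = PhaseTimer()
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        with timer.phase("io"):
            with open(path, "w") as fh:
                fh.write("x" * 1_000_000)
            with open(path) as fh:
                data = fh.read()
        with timer.phase("compute"):
            checksum = sum(ord(c) for c in data)
    print(timer.report())
```

For a function-level breakdown without touching the code, the standard-library profiler (`python -m cProfile -s cumulative <script>`) would be an alternative, again assuming the tools are run as Python scripts.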