Stat_fu is a plugin for managing statistics in Rails 2.xx as background jobs.
It is run in a production environment and does well, but as it’s pretty specific, you might want to think if you like its logic before applying it. The logic should allow you to create any kind of stats though.
If your stats are fairly simple and do not require much computation or db queries, you might think of calculating them on the fly instead of using the plugin. Otherwise this might be a solution for you.
As the code changes often, there’s no guarantee the current version works, please use tags instead.
Comments and feedback appreciated.
The idea is based on a few assumptions:
- Each statistic is separate
- Each statistic has one db table and one model which contains all the code, rake tasks etc.
- A statistic is a function that takes a set of arguments and returns a set of results
- One record is associated with just one set of arguments and one set of results
- We calculate one record at a time (the way the logic works)
- We optimize the amount of queries and computations by caching partial results, not by sharing the computations between records
- Each statistic has:
- a definition of input parameters
- a method for calculating a record
- a method for checking if a record is up to date
- a method for checking if a record is coherent (correct)
- rake tasks definitions
- output definitions (after read processing)
- Both input parameters and output vaues are stored as normal activerecord fields and are of standard types (Date, Integer, String, etc..)
- Checking for errors relies on redundant computations that sould show the same results
script/plugin install git://github.com/szarski/stat_fu.git
Let’s say we’ve got an app with models:
- User
- Project
There’s a named_scope Project.with_category(category_name* )*
User has_many *Project*s. Projects have a category_name field. We want to be able to retreive the amount of projects for a certain user for a certain date. Also, we want to be able to filter them by category name. Also, we want to be able to determine which user was the most productive and how many projects there were on a certain date.
We generate a model and a migration:
script/generate statistic_model users_projects_daily date:date user_id:integer category:string projects_count:integer
Open app/models/statistic/users_projects_daily.rb. We get a prepopulated model. We want to:
- set the default parameters
- fill in the count, check, up_to_date? methods (check will be a redundant form of count)
- define rake tasks
- define our output
After filling everything in we get a model that looks as follows:
class Statistic::UsersProjectsDaily < Statistic::Base
set_table_name "statistic_users_projects_dailies"
parameters :date, :user_id, :category, :optional => [:category]
def count
user = User.find self.user_id
if self.category
self.projects_count = user.projects.with_category(self.category).count
else
self.projects_count = user.projects.count
end
end
def check
if self.category
return (self.projects_count == Project.find(:all, :conditions => {:user_id => self.user_id, :category_name => self.category}).count)
else
return (self.projects_count == Project.find(:all, :conditions => {:user_id => self.user_id}).count)
end
end
def up_to_date?
return (self.updated_at.to_date - self.date > 1)
end
UPDATE_HASH = {:user_id => lambda {User.all.map(&:id)}, :date => lambda {(3.months.ago.to_date..Time.now.to_date).to_a}}
rake_tasks do |r|
r.namespace :users_projects_dailies do |n|
n.namespace :update do |u|
u.default_update :all, "update all stats for users_projects_dailies", UPDATE_HASH
u.namespace :all do |a|
a.default_update :force, "update all stats for users_projects_dailies, force up_to_date ones", UPDATE_HASH.merge({:force => true})
end
end
n.namespace :clear do |c|
c.default_clear :all, "clear all users_projects_dailies stats"
end
n.custom_task "Some Custom Task" do
# code here
end
end
end
describe_output :projects_count => :sum, :most_busy_user_id => lambda{|records| records.sort{|a,b| a.projects_count <=> b.projects_count}.last.user_id}
end
To prevent from doing too much db queries, say, when checking result, we can cache them as follows:
def check
projects = self.cache_in_batch "projects for user #{self.user_id}" do |records|
return Project.find(:all, :conditions => {:user_id => self.user_id})
end
if self.category
return (self.projects_count == projects.select {|p| p.category_name == self.category}.count)
else
return (self.projects_count == projects.count)
end
end
Off course we can cache many more things this way, we could, for instance cache projects for the whole stats collection (reffered here as records).
Now for the fun part. To run your stats, just add to your crontab:
rake stats:users_projects_dailies:update:all
Or to remove:
rake stats:users_projects_dailies:remove:all
To see all the rake tasks related to stats run:
rake stats:help
Now, to run in the console:
# Retrieve all available stats and check if they're up_to_date
batch = Statistic::UsersProjectsDaily.batch(:user_id => [1,2,3], :date => Time.now.to_date)
# Update if needed
unless batch.unsatisfied_combinations.empty?
batch.fill_up
end
batch.most_busy_user_id # gives us the most productive user
batch.projects_count # gives us the total amount of projects
The way we use the stats is we:
- retrieve a batch
- check if all the stats are up to date
- update the old ones
- grap the output (with methods defined with describe_output)
To update all the records in batch, including these that are already up_to_date?, we just pass:
:force => true
This might be helpfull if, for instance, we fixed a bug and some of the stats may be corrupted.
As we process stats in background, we don’t care that much about the performance of the count and check functions.
What we do care about though is the up_to_date? function as it will be run each time we retrieve a batch. Since it’s run while the user is waiting for stats to display, we want it to be as light as possible.
Also, the output functions should just perform simple transormations of the results we store in out db table.
To get a better idea of what methods are available, you can always take a look at the specs.
Copyright © 2010 Jacek Szarski (jacek@applicake.com), released under the MIT license