
Shape Performance Troubleshooting

Nicu Listana edited this page Mar 11, 2020 · 9 revisions

As Shape has grown from its infancy into its early childhood, it's also gotten more complicated, and sometimes gets a little cranky. This page is going to dive into some issues that we have encountered and what we have done (or might do) to troubleshoot those.

Heroku performance

The first place to start is the Heroku Metrics page, which shows response times and whether memory usage is spiking above the dyno's limit: Heroku Metrics

To date, we've had intermittent memory problems, usually without an obvious diagnosis. Autoscaling is enabled on Heroku and adds extra dynos when response times are slow, but in these cases it rarely helps: Heroku spins dynos up and down while memory continues to spike. A full dyno restart (through this UI or heroku restart -a shape-production) usually alleviates a memory issue temporarily. Note that while the dynos restart, the site may be unresponsive for 15-20 seconds and people's current work could be interrupted, so don't restart more often than necessary.

Tweaking advanced settings

  • Puma - web server
    • ENV['RAILS_MAX_THREADS'] (defaults to 5) determines both how many threads run per Puma process and how many Sidekiq workers run per Sidekiq process.
    • ENV['WEB_CONCURRENCY'] (defaults to 2 in prod) determines how many "puma workers" are started (totally unrelated to Sidekiq workers); each one is its own web server process. You can generally raise this number to 2-3+ after adding more RAM by upgrading the dyno type, and lower it again if the dynos are running out of memory. From puma:

    If using threads and workers together, the concurrency of the application would be max threads * workers

  • TuneMyGC has been used to add some ENV vars for tweaking the Ruby garbage collector, but perhaps this could be run again (?) and it's also unclear how much those settings are helping or not.
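
As a rough illustration of the Puma formula quoted above, the following sketch computes the total request concurrency per dyno from the two ENV vars. The helper name puma_concurrency is hypothetical (it is not part of Puma or of Shape), and the defaults are simply the ones mentioned in this section.

```ruby
# Hypothetical helper: estimates total Puma concurrency (threads * workers).
# Accepts any hash-like object so it can be tried without touching real ENV.
def puma_concurrency(env = ENV)
  threads = Integer(env.fetch('RAILS_MAX_THREADS', 5)) # threads per Puma process
  workers = Integer(env.fetch('WEB_CONCURRENCY', 2))   # Puma worker processes
  { threads: threads, workers: workers, total: threads * workers }
end

# With the defaults: 5 threads * 2 workers
puts puma_concurrency({})[:total] # prints 10

# After raising WEB_CONCURRENCY to 3 on a larger dyno
puts puma_concurrency({ 'WEB_CONCURRENCY' => '3' })[:total] # prints 15
```

Each extra Puma worker is a separate process with its own memory footprint, so raising WEB_CONCURRENCY increases memory use along with concurrency.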

Scout add-on

To diagnose individual requests, you can open the Scout add-on from the Heroku Resources page. From here you can dive into individual slow requests and even see what queries were generated, and how long they took.

Scout Dashboard

Unfortunately, once again, this may not reveal the root cause of a memory / timeout issue. You may see 15-20 second requests in which every query (even a simple lookup) is much slower than it should be, so Scout may be showing you the result of the issue rather than its source. Even so, it is a valuable resource for investigating individual requests and for spotting requests that are consistently slow or take up a lot of memory/resources.

Sidekiq (background worker) performance

To reach the Sidekiq web UI on production, log in as admin@shape.space (shared in 1pass); you will then have access to shape.space/sidekiq.

Sidekiq Dashboard

The main things to look for here are:

  • A high number in the "Busy" column, which is usually a bad sign that there are job failures/retries or that something has gone out of control. In general jobs get cleared out of the queue pretty quickly, so there should never be more than a handful going at once.
  • A high number in the "Retries" column, which indicates (usually) a problem in code, like a job is trying to run but getting some kind of error/exception that is causing it to fail and retry. This is generally easier to diagnose because the error column will tell you the exception and you can usually trace that back to what went wrong in the code. And then it's possible to redeploy with fixed code, and those workers will simply be able to retry, run again and successfully complete. Sidekiq is also smart about further delaying retries from failed jobs to longer and longer intervals so they don't keep hammering over and over.
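
To give a feel for the "longer and longer intervals" mentioned above, here is a minimal sketch of how the delay grows with each retry. It assumes the deterministic part of Sidekiq's default backoff is retry_count ** 4 + 15 seconds, as described in the Sidekiq wiki's error handling page; Sidekiq also adds a random jitter term that varies between versions and is left out here. The method name approx_retry_delay is hypothetical.

```ruby
# Approximate seconds before the given retry, ignoring Sidekiq's random jitter.
# Assumption: the default backoff's deterministic part is retry_count**4 + 15.
def approx_retry_delay(retry_count)
  (retry_count**4) + 15
end

(0..5).each do |n|
  puts "retry #{n}: ~#{approx_retry_delay(n)}s"
end
# Under this assumption the fifth retry waits roughly 640 seconds.
```

Because the delay grows with the fourth power of the retry count, a job that has failed many times may not run again for hours, which leaves time to deploy a fix.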

For the situation where jobs have gone out of control, you might see for example one particular worker like CollectionCardDuplicationWorker with hundreds of jobs. The first thing you can do to troubleshoot is open up a rails console:

heroku run console -a shape-production

From there you can inspect the queues. It's not obvious where to find this, but it is documented in the Sidekiq wiki under API:

default_queue = Sidekiq::Queue.new('default')
default_queue.size # => 150

default_queue.each do |job|
  job.klass # => 'MyWorker' 
  job.args # => [1, 2, 3]
  # As a very last resort you may need to kill some broken/out-of-control jobs
  job.delete if job.klass == 'CollectionCardDuplicationWorker'
end
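
Before deleting anything, it can help to get a read-only overview of which worker classes are filling the queue. The sketch below counts jobs per class without modifying anything. FakeJob and job_counts_by_class are hypothetical names used only so the example runs outside the console; on a real console you would presumably pass Sidekiq::Queue.new('default'), which is enumerable and yields jobs that respond to klass.

```ruby
# Stand-in for Sidekiq job records so this sketch runs without Sidekiq loaded.
FakeJob = Struct.new(:klass, :args)

# Count jobs per worker class, largest first. Read-only: nothing is deleted.
def job_counts_by_class(jobs)
  counts = Hash.new(0)
  jobs.each { |job| counts[job.klass] += 1 }
  counts.sort_by { |_klass, count| -count }.to_h
end

# Made-up sample data for illustration only.
sample = [
  FakeJob.new('MyWorker', [1, 2, 3]),
  FakeJob.new('CollectionCardDuplicationWorker', [1, [10, 11]]),
  FakeJob.new('CollectionCardDuplicationWorker', [2, [12]])
]

job_counts_by_class(sample).each { |klass, count| puts "#{klass}: #{count}" }
```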

Before just deleting jobs you should investigate the parameters of the job, for example CollectionCardDuplicationWorker receives batch_id, card_ids... and I could look up those card_ids to see where they are. In a couple of cases, this has shown collections where the breadcrumb has gone 100+ levels deep and if I inspect the parent names of the collection, I might see an obvious indicator of something being wrong like:

bad_collection.parents.map(&:name)
> ["Foo-org Templates", "My Design Research Plan", "My Research Planning", "My Design Research Plan", "My Research Planning", ...]
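
Scanning a 100+ level breadcrumb by eye is tedious, so a small helper can flag names that appear more than once in the parent chain. This is only a sketch: repeated_names is a hypothetical helper, and a repeated name is a hint rather than proof of a loop, since unrelated collections can share a name.

```ruby
# Return a hash of names that occur more than once, with their counts.
def repeated_names(names)
  counts = Hash.new(0)
  names.each { |name| counts[name] += 1 }
  counts.select { |_name, count| count > 1 }
end

# Sample chain taken from the example output above.
parent_names = [
  'Foo-org Templates', 'My Design Research Plan', 'My Research Planning',
  'My Design Research Plan', 'My Research Planning'
]

repeats = repeated_names(parent_names)
puts repeats.keys.join(', ') # prints My Design Research Plan, My Research Planning
```

On the console, the input would be the result of bad_collection.parents.map(&:name).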

In this case I did kill some of those duplication jobs, because there was an infinite loop issue and we had no reason to keep all those problematic collections (and their copies) around. I also looked for the nested instances-in-instances that weren't supposed to be there and deleted them from the DB as well. When resorting to these more drastic measures, you will probably also want to stop the Heroku worker dynos so that additional jobs aren't created while you diagnose. This applies, for example, when the site is already unresponsive: stopping the workers doesn't really harm the user experience because it's already pretty broken.

Database performance

Error capturing

ActionCable

More info: https://github.com/ideo/shape/wiki/Real-time-collaboration-overview
