Heroku Operational Components Overview

We don’t currently have “infrastructure as code” for our heroku setup: it’s just set up on the heroku system (and third-party systems) via GUI and/or CLIs, and there isn’t any kind of script to recreate our heroku setup from nothing.

So this serves as an overview of the major components and configuration points – anything that isn’t included in our source code in our git repo, we mean to mention here.

Inside the heroku system, we have config variables, buildpacks, dyno formation, and add-ons. Separate from heroku (and billed/invoiced separately), we use SearchStax.com to provide Solr-as-a-service, and HireFire.io to provide auto-scaling of our heroku dynos. Directly in AWS, we use S3, SES (email delivery), and Cloudfront (CDN for static assets). Some terraform config for our AWS resources is at https://github.com/sciencehistory/terraform_scihist_digicoll .

We use HoneyBadger for error tracking, and a few other features that come with it like uptime monitoring. We currently have a gratis HoneyBadger account for open source, set up separately from heroku.

We provide an overview of major configuration choices and the motivations behind our initial choices, for background – but this can easily get out of date; the best reference for current configuration choices is always the live system.

 

Heroku

The heroku dashboard is at http://dashboard.heroku.com. You’ll need an account that can log in to our scihist-digicoll team, and then you’ll see we have two apps: scihist-digicoll-production and scihist-digicoll-staging.

Dyno Formation

Heroku calls a host or VM/container a “dyno”. We have “web” dynos that run the Rails web app, and “worker” dynos that run our background task workers.

The size and number of these dynos is configured on the “Resources” tab at https://dashboard.heroku.com/apps/scihist-digicoll-production/resources

The formation can also be listed and configured from the heroku command line, with e.g. heroku ps and heroku ps:scale, among others.
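For example, a sketch of the kind of commands involved, using the production app name from above (double-check heroku help ps for current syntax):

  # List the current dyno formation for the production app
  heroku ps -a scihist-digicoll-production

  # Scale to (for example) 1 web dyno and 2 worker dynos
  heroku ps:scale web=1 worker=2 -a scihist-digicoll-production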

Web dynos

We run web dynos at a somewhat pricy “performance-m” size, not because we need that much RAM, but because we discovered the cheaper “standard” dynos had terrible performance characteristics leading to slow response times for our app even when not under load.

We run a single performance-m dyno normally – we may use auto-scaling to scale up under load, but keep in mind that if you are running two (or more) performance-m dynos for any length of time, a performance-l dyno costs the same as two performance-m dynos but is much more powerful! (But there is no way to autoscale between dyno types, only dyno count.)

Within a dyno, the number of puma workers/threads is configured by the heroku config variables WEB_CONCURRENCY (number of puma worker processes) and RAILS_MAX_THREADS (threads per worker). These vars are conventional, and take effect because they are referenced in our heroku_puma.rb, which is itself referenced in the Procfile that heroku uses to define what the different dynos do. (We may consolidate that file into the standard config/puma.rb in the future.)

Heroku docs recommend two puma processes with five threads each on a performance-m. Jrochkind doesn’t totally trust that, so in production on our performance-m we are trying three worker processes (WEB_CONCURRENCY=3) with three threads each (RAILS_MAX_THREADS=3), because we can afford the RAM and jrochkind feels this might be preferable. (We previously tried WEB_CONCURRENCY=5 and RAILS_MAX_THREADS=2, but wondered whether that was contributing to CPU contention under spikes.)
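For reference, a sketch of checking or changing those vars from the CLI (names and values are the ones described above; confirm against the live app first, and note that changing config vars restarts the dynos):

  # See what the puma sizing vars are currently set to
  heroku config:get WEB_CONCURRENCY -a scihist-digicoll-production
  heroku config:get RAILS_MAX_THREADS -a scihist-digicoll-production

  # Set them to our current choice of 3 puma workers with 3 threads each
  heroku config:set WEB_CONCURRENCY=3 RAILS_MAX_THREADS=3 -a scihist-digicoll-production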

Worker dynos

The performance problems with standard size heroku dynos aren’t really an issue for our asynchronous background jobs, so worker dynos use the standard-2x size.

In production we run 1-2 of them at minimum, but autoscale them with HireFire (see the HireFire.io section below). This is convenient, because our worker dynos are mostly only busy when we are ingesting, so they can scale up to ingest quicker when needed.

From experimentation we learned we can fit 3 resque workers in a standard-2x without exceeding the memory quota, even under load. (There are only 2 vCPUs on a standard-2x, so hopefully they aren’t starving each other out.) The number of resque workers running is configured with the heroku config vars ON_DEMAND_JOB_WORKER_COUNT (set to 1: a worker that will handle the on_demand_derivatives queue if needed, and otherwise can work on the standard mailers and default queues) and REGULAR_JOB_WORKER_COUNT (set to 2: workers that do not work on the on_demand_derivatives queue). These heroku config vars take effect because they are referenced in our resque_pool.yml, which is used by default by the resque-pool command in our Procfile.
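As a sketch, those worker-count vars can be checked and set from the CLI like any other config var (the values shown are the current ones described above):

  # Show the current resque worker count vars
  heroku config:get REGULAR_JOB_WORKER_COUNT -a scihist-digicoll-production
  heroku config:get ON_DEMAND_JOB_WORKER_COUNT -a scihist-digicoll-production

  # Set them to the values described above (2 regular + 1 on-demand-capable worker)
  heroku config:set REGULAR_JOB_WORKER_COUNT=2 ON_DEMAND_JOB_WORKER_COUNT=1 -a scihist-digicoll-production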

special_worker dynos

For computationally intensive jobs, we also maintain the option of assigning background jobs to the special_jobs queue. These are not managed by HireFire.

Jobs assigned to this queue are handled ONLY by dynos of type special_worker. When you start one of these dynos, it executes resque-pool --config config/resque-pool-special-worker.yml, which tells the dyno to fire up two resque workers (by default) and get to work.
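Since these dynos are not autoscaled, you start and stop them manually – either on the Resources page (as in the workflow below) or from the CLI. A sketch of the CLI version (10 standard-2x dynos is the example scale discussed below):

  # Spin up (for example) 10 special_worker dynos at standard-2x size
  heroku ps:scale special_worker=10:standard-2x -a scihist-digicoll-production

  # ...and scale them back to zero when the special_jobs queue is drained
  heroku ps:scale special_worker=0 -a scihist-digicoll-production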

Example special_worker workflow:

  • Delete all failed jobs in the resque admin pages.

  • Make a rake task to enqueue all the jobs to the special_jobs queue.

    • The task should be smart enough to skip items that have already been processed. That way, you can interrupt the task at any time, fix any problems, and run it again later without having to worry.

    • Make sure you have an easy way to run the task on individual items manually from the admin pages or the console.

    • The job that the task calls should print the IDs of any entities it’s working on to the Heroku logs.

    • It’s very helpful to be able to enqueue a limited number of items and run them first, before embarking on the full run. For instance you could add an extra boolean argument only_do_10 (defaulting to false) and add a variation on:

      scope = scope.first(10) if only_do_10
  • Test the rake task in staging with only_do_10 set to true.

  • Run the rake task in production, but with only_do_10 set to true, for a trial run.

  • Spin up a single special_jobs dyno and watch it process 10 items.

  • Run the rake task in production.

  • The jobs are now in the special_jobs queue, but no work will actually start until you spin up dedicated dynos.

  • 2 workers per special_jobs dyno is our default, which works nicely with standard-2x dynos, but if you want more, try setting the SPECIAL_JOB_WORKER_COUNT env variable to 3.

  • Our redis setup is capped at 80 connections, so be careful running more than 10 special_jobs dynos at once. You may want to monitor the redis statistics during the job.

  • Manually spin up a set of special_worker dynos of whatever type you want at Heroku's "resources" page for the application. Heroku will alert you to the cost. (10 standard-2x dynos cost roughly $1 per hour, for instance; with the worker count set to two, you’ll see up to 20 items being processed simultaneously).

  • Monitor the progress of the resulting workers. Work goes much faster than you are used to, so pay careful attention to the following:

  • If there are errors in any of the jobs, you can retry the jobs in the resque pages, or rerun them from the console.

  • Monitor the number of jobs still pending in the special_jobs queue. When that number goes to zero, it means the work will complete soon and you should start getting ready to turn off the dynos. It does NOT mean the work is complete, however!

  • When all the workers in the special_jobs queue complete their jobs and are idle:

    • rake scihist:resque:prune_expired_workers will get rid of any expired workers, if needed

    • Set the number of special_worker dynos back to zero.

    • Remove the special_jobs queue from the resque pages.

Config Variables

Heroku has a list of key/values that are provided to the app, called “config vars”. They can be seen and set in the Web GUI under the Settings tab, or via the heroku command line: heroku config, heroku config:set, heroku config:get, etc.

Note:

  • Some config variables are set by heroku itself/heroku add-ons, such as DATABASE_URL (set by postgres add-on), and REDIS_URL (set by Redis add-on). They should not be edited manually. Unfortunately there is no completely clear documentation of which is which.

  • Some config variables include sensitive information such as passwords. If you do a heroku config to list them all, you should be careful where you put/store them, if anywhere.
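A sketch of the CLI commands (SOME_VAR below is a made-up example name, not a real variable of ours):

  # List all config vars for production (careful where you paste the output – it contains secrets)
  heroku config -a scihist-digicoll-production

  # Read, set, or remove an individual var
  heroku config:get SOME_VAR -a scihist-digicoll-production
  heroku config:set SOME_VAR=some_value -a scihist-digicoll-production
  heroku config:unset SOME_VAR -a scihist-digicoll-production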

Buildpacks

Heroku lets you customize the software installed on a dyno via buildpacks.

These can be seen/set through the Web GUI on the Settings tab, or via the Heroku command line, e.g. heroku buildpacks, heroku buildpacks:add, heroku buildpacks:remove.
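A sketch of the CLI version (heroku/python is shown as the example since we use that buildpack, per the list below; ordering matters, hence the --index option):

  # List the buildpacks configured for production, in order
  heroku buildpacks -a scihist-digicoll-production

  # Add a buildpack at a specific position, or remove one
  heroku buildpacks:add --index 1 heroku/python -a scihist-digicoll-production
  heroku buildpacks:remove heroku/python -a scihist-digicoll-production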

Buildpacks can be provided by heroku, or maintained by third parties.

In addition to the standard heroku ruby buildpack, we use:

  • The Heroku node.js buildpack, heroku-buildpack-nodejs. See this ticket for context.

  • A buildpack to install the libvips image processing library/command-line tool.

  • An apt buildpack at https://buildpack-registry.s3.amazonaws.com/buildpacks/heroku-community/apt.tgz, which provides for installation via apt-get of multiple packages specified in an Aptfile in our repo. We install several other image/media tools this way.

    • (we could not get vips successfully installed that way, which is why we use a separate buildpack for it)

    • For tesseract (OCR) too, see

  • We need ffmpeg, and had a lot of trouble getting it built on heroku! It didn’t work via apt, and we didn’t find a buildpack that worked and gave us a recent ffmpeg version – until we discovered that, since ffmpeg is a requirement of Rails activestorage’s preview functionality, this heroku-maintained buildpack gives us ffmpeg: https://github.com/heroku/heroku-buildpack-activestorage-preview

    • That buildpack, and the fact that it installs ffmpeg, is mentioned at:

    • We don’t actually use activestorage or its preview feature, just use this buildpack to get ffmpeg installed.

    • If looking for an alternative in the future, you could try: (we haven’t tried that yet)

  • A buildpack to get the exiftool CLI installed – it installs the most recent exiftool available on every build, unless we configure a specific version.

  • The standard heroku python buildpack, so we can install python dependencies from requirements.txt (initially img2pdf). It is listed first, so the ruby buildpack will be “primary”. https://www.codementor.io/@inanc/how-to-run-python-and-ruby-on-heroku-with-multiple-buildpacks-kgy6g3b1e

 

We have a test suite you can run that is meant to ensure expected command-line tools are present; see: https://github.com/sciencehistory/scihist_digicoll/blob/master/system_env_spec/README.md

Add-ons

Heroku add-ons are basically plug-ins. They can provide entire software components (like a database), or features (like log preservation/searching). Add-ons can be provided by heroku itself or a third-party partnering with heroku; they can be free, or have a charge. Add-ons with a charge usually have multiple possible plan sizes, and are always billed pro-rated to the minute just like heroku itself and included in your single heroku invoice.

Add-ons are seen and configured via the Resources tab, or heroku command line commands including heroku addons, heroku addons:create, and heroku addons:destroy.
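For example, a sketch of listing and inspecting add-ons from the CLI (papertrail is used as the example since we have it installed):

  # List add-ons attached to production, with their plan names
  heroku addons -a scihist-digicoll-production

  # Inspect a particular add-on, or open its admin UI in the browser
  heroku addons:info papertrail -a scihist-digicoll-production
  heroku addons:open papertrail -a scihist-digicoll-production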

Add-ons we are using at launch include:

  • Heroku postgres (an RDBMS; the standard-0 size plan is enough for our needs)

  • Heroku redis (redis is a key/value store used for our bg job queue)

    • We are currently using the premium-1 plan – our needs are modest, but we seemed to be running out of redis connections when hirefire autoscaled up workers on the premium-0 plan.

      • Note that a “not enough connections” error in redis can actually show up as an OpenSSL::SSL::SSLError, we are pretty sure.

      • The numbers don’t quite add up for this; resque_pool may be temporarily using too many connections or something. But for now we just pay for premium-1 ($30/month).

  • Memcached via the Memcached Cloud add-on

    • Used for Rails.cache in general – the main thing we are using Rails.cache for initially is for rack-attack to track rate limits. Now that we have a cache store, we may use Rails.cache for other things.

    • In staging, we currently have a free memcached add-on; we could also just NOT have it in staging if the free one becomes unavailable.

    • In production we still have a pretty small memcached cloud plan; if we’re only using it for rack-attack we hardly need anything.

  • Heroku scheduler (used to schedule nightly jobs; free, although you pay for job minutes).

  • Papertrail – used for keeping heroku unified log history with a good UX. (otherwise from heroku you only get the most recent 1500 log lines, and not a very good UX for viewing them!). We aren’t sure what size papertrail plan we’ll end up needing for our actual log volume.

  • Heroku’s own “deployhooks” add-on, used to notify honeybadger to track deploys.

Some add-ons have an admin UI that can be accessed by finding the add-on in the list on the Resources tab and clicking on it – for instance, to view the papertrail logs. No additional credentials/logins are needed.

Some add-ons have configuration available via the admin UI, for instance the actual scheduled jobs with the Scheduler, or papertrail configuration. Add-on configuration is not generally readable or writeable using Heroku API or command line.

SearchStax – Solr

Solr is a search utility that our app has always used. SearchStax provides a managed “solr as a service” in the cloud. While there are some Solr providers available as heroku add-ons, we liked the feature/price of SearchStax.com better, and went with that.

We have a production and a staging instance via SearchStax.

Login credentials to searchstax can be found in our usual credentials storage.

SearchStax is invoiced separately from heroku – we currently pay annually, in advance, to get a significant discount over month-by-month.

Heroku doesn’t know about our searchstax Solrs automatically; we have to set the heroku config var SOLR_URL to point to our searchstax solr, so our app can find it. The SOLR_URL also currently includes user/password auth information.

With SOLR_URL set to a properly functioning solr (whether searchstax or not), the app can index things to solr, search solr indexes, and sync solr configuration on deploys.
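A sketch of checking/setting that var from the CLI – the URL shown is entirely made up; the real value (including its embedded user/password) lives in the live config and our credentials storage:

  # See what the app is currently pointed at
  heroku config:get SOLR_URL -a scihist-digicoll-production

  # Point it somewhere else (hypothetical example URL)
  heroku config:set SOLR_URL="https://user:pass@example.searchstax.com/solr/scihist" -a scihist-digicoll-production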

HireFire.io

HireFire.io can automatically scale our web and worker dyno counts up and down based on actual current usage. While there is an autoscaling utility built into Heroku for performance-type web dynos, and also the rails-autoscale add-on, we liked the feature/price of the separate HireFire.io better.

HireFire isn’t officially a heroku partner, but can still use the heroku API to provide customers with heroku autoscale services – this can make the integration a bit hacky. Our hirefire account talks to heroku via an OAuth authorization to a specific heroku account – currently jrochkind’s; if he were to leave, someone else would have to authorize and change that in settings. Additionally, there is a HIREFIRE_TOKEN you get from editing the hirefire “manager” (i.e., the “worker” manager specifically), which needs to be set as a heroku config variable.

We have multiple accounts authorized to our “Science History Institute” team on hirefire – each tech team member can have their own login.

For now, we use the hirefire “standard” plan at $15/month, which only checks for scale up/down once every 60 seconds. We could upgrade to $25/month “overclock” plan which can check once every 15 seconds.

You can log into hirefire to see some scaling history and current scale numbers etc.

You can also turn on or off the different “managers” (worker vs web), to temporarily disable autoscaling. Make sure the dyno counts are set at a number you want after turning off scaling!

Worker dynos

We auto-scale our worker dynos using the standard hirefire Job Queue (Worker) strategy. Its main configuration option is ratio, which is basically meant to be set to how many simultaneous jobs your heroku worker dyno can handle. Using resque with 3 workers, as we are at time of writing – that’s 3.

  • We are scaling between 2 workers minimum (so they will be immediately available for user-facing tasks), and for now a max of 8.

  • There are settings to notify you if it runs more than X dynos for more than X hours. We have currently set it to alert us if it’s using more than 2 workers for more than 6 hours, to get a sense of how often that happens.

Other settings can be a bit confusing and we took some guesses; you can see some hirefire docs at Manager Options, and they also respond very promptly to questions via their support channels.

The Heroku config var HIREFIRE_TOKEN needs to be set to the value you can find in the Hirefire manager settings (specifically, the “worker” manager).
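A sketch of setting it (the token value shown is made up – use the real one from the HireFire manager settings):

  heroku config:set HIREFIRE_TOKEN=abc123notreal -a scihist-digicoll-production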

Web dynos

TBD, not currently scaling web dynos.

HireFire documentation

 

Cloudfront

Per Heroku’s suggestion to use a CDN in front of Rails static assets, we do so with AWS Cloudfront.

We currently have two Cloudfront distributions, one for production and one for staging. The staging one might not be entirely necessary, but prod/staging parity and being able to test things in staging seem good.

Our cloudfront distributions ARE controlled by our terraform config. Here’s a blog post on our configuration choices.

The Heroku config var RAILS_ASSET_HOST should be set to the appropriate cloudfront hostname, e.g. dunssmud23sal.cloudfront.net. If you delete this heroku config var, the Rails app will just stop using the Cloudfront CDN and serve its assets directly. You can get the appropriate value from terraform output.
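A sketch of setting or removing that var (the hostname is the example one from above; get the real value from terraform output):

  # Point Rails asset URLs at the Cloudfront distribution
  heroku config:set RAILS_ASSET_HOST=dunssmud23sal.cloudfront.net -a scihist-digicoll-production

  # Or remove it, so the app serves its assets directly again
  heroku config:unset RAILS_ASSET_HOST -a scihist-digicoll-production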

 

Honeybadger

We have been using honeybadger since before heroku, and have an account set up separately from heroku. We currently get it gratis from Honeybadger as non-profit/open source.

We have set up heroku-specific deployment tracking and Heroku platform error monitoring as detailed in honeybadger heroku-specific documentation.

Scout

We use Scout to monitor the app’s performance and find problem spots in the code. The account is free, as we’re an open-source project, although billing information is maintained on the account.

Papertrail (logging)

Settings are here:
https://papertrailapp.com/account/settings

Notes re: tuning lograge (which controls the format of log messages) in our app:

Recipe for downloading all of a day's logs:

THE_DATE=$1      # formatted like '2023-12-21'
TOKEN="abc123"   # get this from https://papertrailapp.com/account/profile
URL='https://papertrailapp.com/api/v1/archives'

for HOUR in {00..23}; do
  DATE_AND_HOUR=$THE_DATE-$HOUR
  curl --no-include \
    -o $DATE_AND_HOUR.tsv.gz \
    -L \
    -H "X-Papertrail-Token: $TOKEN" \
    $URL/$DATE_AND_HOUR/download
done

# Remove files that aren't really compressed logs
rm `file * | grep XML | grep -o '.*.gz'`

# uncompress all the logs
gunzip *.gz

To separate logs into router and non-router files, resulting in smaller and more readable files:

mkdir router
mkdir nonrouter
ls *.tsv | gawk '{ print "grep -v 'heroku/router' " $1 " > nonrouter/" $1 }' | bash
ls *.tsv | gawk '{ print "grep 'heroku/router' " $1 " > router/" $1 }' | bash

History

We started out with the "Forsta" plan (~4.2¢/hour, max of $30 a month; 250MB max).

In late 2023 and early 2024, we noticed an increase in both the rate and the volume of our logging, resulting in both:

  A) Heroku L10 log messages, and

  B) running over the size limit of our Papertrail plan.

A) and B) don’t always co-occur: high rates per second cause the first; large storage requirements cause the second.

On Jan 10th we decided to try the "Volmar" plan (~9¢/hour; max of $65 a month; 550MB max) for a couple months, to see if this would ameliorate our increasingly frequent problems with running out of room in the Papertrail log limits. It’s important to note that the $65 plan, based on our current understanding, will not fix the L10 errors, but will likely give us more headroom on days when we get a lot of traffic spread out over the entire day.

After switching to the 550MB max log plan

Since switching to the new high-capacity plan on Jan 10th, we have had:

  • only one new instance of L10 messages (see A above), on March 20th at 3:55 am.

  • no instances of running over the size limit (see B above).

Avenues for further research

  • Confirm that the L10 warnings are caused by a surge in bot traffic, rather than a bug in our code or in someone else’s code. Several clues so far point to bots as the culprit.

    • If so, this is a good argument for putting cloudflare or equivalent in front of our app, which would screen out misbehaving bots

  • Consider logging fewer bytes: either by making some or all log lines more concise, or by asking Papertrail to drop certain lines that we’re not really interested in:

    • some postgresql messages?

    • do we really need to log all status 200 messages? (Probably.)

    • As a last resort, we could also decide not to log heroku/router messages (typically 40-60% of our messages), although those can be really helpful in the event of a catastrophe.