Muskie is implemented as a restify service, and is deployed highly redundantly, both within a zone and in many zones. In a typical production deployment, there are 16 muskie processes per zone, with an HAProxy instance sitting in front of them (in TCP mode). All logging from muskie processes in a production setting is done through bunyan to the service instance's respective SMF log (`svcs -L muskie`). The HAProxy listens on both ports 80 and 81, where 80 is meant for "secure" requests, meaning either non-TLS requests from the "internal" network (e.g., marlin) or TLS requests from the outside world. Port 81 is used to serve anonymous traffic, which muskie only allows in order to support anonymous GETs (signed URLs and `GET /:user/public`).
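For example, to follow those logs live from within a webapi zone, something like the following is typical (this assumes the bunyan CLI is on the PATH; if the zone runs several muskie SMF instances, `svcs -L muskie` prints one log path per instance):

```
$ tail -f `svcs -L muskie` | bunyan
```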
At runtime, muskie depends on mahi for authentication/authorization, the electric-moray stack for object traffic, and mako for writing objects.
One additional dependency in muskie is on the Moray shard that maintains the storage node `statvfs` information. There are two options for muskie to query this information:

- use the built-in picker, which periodically refreshes the data from Moray and selects storage nodes from the local cache on writes (the cache is purely in-memory, so every muskie process keeps its own copy)
- use the Storinfo service, which works in a similar way to the picker and avoids hitting Moray directly for every object write
By default, Muskie is configured to use Storinfo. If the Storinfo service is not part of the Manta deployment and you wish to eliminate the dependency on it, you can switch to the local picker by updating SAPI metadata:

```
$ sdc-sapi /services/$(sdc-sapi /services?name=webapi | json -Ha uuid) \
    -X PUT -d '{"action": "update", "metadata": {"WEBAPI_USE_PICKER": true}}'
```
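To confirm the change took effect, you can read the flag back out of SAPI with the same tools (this should print `true` once the update has been applied):

```
$ sdc-sapi /services?name=webapi | json -Ha metadata.WEBAPI_USE_PICKER
```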
There is a second server running in each muskie process to facilitate monitoring. This is accessible at `http_insecure_port + 800`. For example, the first (of 16 in production) muskie process within a zone usually runs on port 8081, so the monitoring server would be accessible on port 8881 from both the manta and admin networks. The monitoring server exposes Kang debugging information and Prometheus application metrics.

Kang debugging information is accessible from the route `GET /kang/snapshot`. For more information on Kang, see the documentation on GitHub.
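To pull a Kang snapshot by hand from within a webapi zone, something like this should work (the port assumes the first muskie process, whose monitoring server listens on 8881; the `json` tool is used only for pretty-printing):

```
$ curl -s http://localhost:8881/kang/snapshot | json
```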
Application metrics can be retrieved from the route `GET /metrics`. The metrics are returned in Prometheus v0.0.4 text format.
The following metrics are collected:
- Time-to-first-byte latency for all requests
- Count of requests completed
- Count of bytes streamed to and from storage
- Count of object bytes deleted
Each of the metrics returned includes the following metadata labels:
- Datacenter name (e.g., us-east-1)
- CN UUID
- Zone UUID
- PID
- Operation (e.g., 'putobject')
- Method (e.g., 'PUT')
- HTTP response status code (e.g., 203)
The metric collection facility provided is intended to be consumed by a monitoring service like a Prometheus or InfluxDB server.
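To spot-check the metrics endpoint manually, outside of any scraper (again assuming the first muskie process's monitoring port of 8881):

```
$ curl -s http://localhost:8881/metrics | head -20
```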
On a PutObject request, muskie will, in order:
- Look up the metadata for the parent directory and the current key (if this is an overwrite)
- Ensure the parent directory exists
- If an etag was supplied, enforce it
- Select a random set of mako storage nodes
- Stream the object bytes to those nodes
- Save the metadata back into electric-moray
Note that there is a slight race in which a user could be writing an object while deleting its parent directory, leaving an "orphaned" object; this can only happen if the directory contains no other objects and the upload is in flight at the moment the parent is deleted.
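As a concrete illustration of this write path, here is a hypothetical upload using the node-manta CLI (assumed to be installed and configured); `-c 2` requests two copies, which, as described below, determines how wide a stripe of storage nodes muskie selects:

```
$ mput -c 2 -f ./report.json /$MANTA_USER/stor/report.json
```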
Probably the most interesting aspect of PutObject that's hard to figure out from code inspection is how storage node selection works. In essence, we always select random "stripes" of hosts (where a stripe is as wide as the `durability-level`), across datacenters (in a multi-DC deployment), that have heartbeated in the last N seconds and have enough space. Muskie then synchronously starts requests to all of the servers in the stripe, and if any fail, it abandons that stripe and moves to the next one. Assuming one of the stripes has a 100% connect success rate, muskie completes the request. If all the stripes are exhausted, muskie returns an HTTP 503 error code. As a practical example, assume the following topology of storage nodes:
```
 DC1 | DC2 | DC3
-----+-----+-----
A,D,G|B,E,H|C,F,I
```
On a PUT request, muskie might select the following configurations (totally at random):
1. B,C
2. I,A
3. E,D
4. G,F

Supposing C and A are somehow down, muskie would end up writing the user data to the pair E,D.
On a GET request, muskie will go fetch the metadata for the record from electric-moray and then start requests to all of the storage nodes that hold the object. The first one to respond is the one that ultimately streams back to the user, and the other requests are abandoned.
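Reading the object back with the node-manta CLI exercises exactly this path (hypothetical path from the upload example above):

```
$ mget -o ./report.json /$MANTA_USER/stor/report.json
```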
Deletion of objects is actually very simple: muskie just deletes the metadata record in moray. Actual blobs of data are asynchronously garbage collected. See manta-mola for more information.
Creating or updating a directory in Manta can be thought of as a special case of creating an object where all aspects of streaming the user byte stream are ignored. Otherwise muskie does the same metadata lookups and enforcement, and then simply saves the directory record into Moray.
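For example, creating a directory tree with the node-manta CLI (hypothetical path; `-p` creates missing parents, so this results in one such metadata write per level):

```
$ mmkdir -p /$MANTA_USER/stor/a/b/c
```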
Additionally, directories are set up with postgres triggers to maintain a count of entries (this is returned in the `result-set-size` HTTP header). See `moray.js` in node-libmanta for the actual meat, but this allows muskie to avoid asking moray to run a `SELECT count(*)`, which is known to be very slow at scale in postgres.
Listing a directory in muskie is just a bounded listing of entries in moray (i.e., a `findobjects` request). The trigger-maintained `manta_directory_counts` bucket is used to fill in `result-set-size`.
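Both behaviors are easy to observe from a client with the node-manta CLI: `mls` performs a directory listing, and `minfo` shows the response headers, which should include `result-set-size`:

```
$ mls /$MANTA_USER/stor
$ minfo /$MANTA_USER/stor | grep -i result-set-size
```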
Deletion of a directory simply ensures there are no children and then drops the dirent; again there is a tiny race here as this spans shards.
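For instance, removing a directory with the node-manta CLI should fail if the directory still has entries, while an empty one is dropped as described (hypothetical path):

```
$ mrmdir /$MANTA_USER/stor/a/b/c
```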