Choosing A Service Discovery System

Drew Csillag
WorkMarket Engineering
Jan 2, 2017 · 10 min read


As part of the series on bootstrapping microservices at Work Market, we’re going to show you the decisions we made and share some of the rationale behind why we picked what we picked. Keep in mind that these decisions were made in mid-to-late 2015, so the world may have moved on since then.

This post is the result of an investigation we did to select a service discovery system back around October 2015.

Definition: CP store — a store that is consistent and can tolerate partitions at the expense of availability. Examples: etcd, ZooKeeper, and other systems that implement the Paxos or Raft consensus algorithms.

Why Client Java-Centricity Is A Disadvantage

Some discovery systems are Java-centric, but just because all the software we write happens to run on the JVM for the foreseeable future doesn’t mean that all the software we want discoverable is. We should be able to make bottom-tier services like Graphite, Redis, Kafka (which happens to be in Java already) or the database discoverable.

Why DNS Based Systems are a Disadvantage

DNS libraries, while ubiquitous, do have the “fun” side effect that their caching behavior is not always known or consistent. Also, if the system only supports A records, you have to make sure that things round-robin appropriately. If you use SRV records, your clients need to be aware of this and handle them appropriately. Additionally, DNS lookups are done in the request path of outbound requests, making DNS server latency part of whatever request latency you already have. DNS solutions also cannot always be used to discover or advertise services we didn’t write.

Rejected First Round Contenders

Straight ZooKeeper or Netflix ZooKeeper Curator Service Discovery Library

Basically, use Netflix Curator to register in ZK, and do regular polls or watches of ZK info.
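To give a feel for the registration side, here’s a minimal sketch using Curator’s curator-x-discovery module — the service name, addresses, base path, and ZooKeeper connection string are all illustrative, not what we actually deployed:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.curator.x.discovery.ServiceDiscovery;
import org.apache.curator.x.discovery.ServiceDiscoveryBuilder;
import org.apache.curator.x.discovery.ServiceInstance;

public class CuratorRegistrationSketch {
    public static void main(String[] args) throws Exception {
        // Connect to ZooKeeper (connection string is made up).
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Describe this instance of a hypothetical "orders" service.
        ServiceInstance<Void> instance = ServiceInstance.<Void>builder()
                .name("orders")
                .address("10.0.1.17")
                .port(8080)
                .build();

        // Register under a shared base path; the discovery client keeps the
        // registration alive for as long as the process and ZK session live.
        ServiceDiscovery<Void> discovery = ServiceDiscoveryBuilder.builder(Void.class)
                .client(client)
                .basePath("/services")
                .thisInstance(instance)
                .build();
        discovery.start();

        // A consumer does the inverse: query (and ideally cache) the instances.
        for (ServiceInstance<Void> si : discovery.queryForInstances("orders")) {
            System.out.println(si.getAddress() + ":" + si.getPort());
        }
    }
}
```

The consuming side is where the real work starts: you still have to cache that list, watch for changes, and cope with ZooKeeper being unreachable, which is what the disadvantages below are about.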

Advantages: Simplicity

Disadvantages:

  • Java-centric client, though Netflix Exhibitor mitigates this, as it provides a REST interface to ZooKeeper.
  • Service has to implement registration.
  • Client has to implement fetch logic, caching, and ZooKeeper failure handling. Could be mitigated by a custom Work Market library, but we’d still need one for every language we use.
  • Clients have to implement their own load-balancing, timeout, and retry logic. This is easy to get wrong.

Straight use of etcd

Could write a shell script to poll a health check URL and post relevant things to etcd on success/failure — probably a short shell script with curl, well under 80 lines. A secondary script could use watches on etcd paths to be notified when things change.
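We’d probably have done it as a shell script, but to make the idea concrete, here’s roughly the same loop sketched in Java against etcd’s v2 keys API — the health check URL, key path, and TTL are all made up for illustration:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class EtcdRegistrarSketch {
    // Illustrative endpoints/paths, not a prescribed layout.
    static final String HEALTH_URL = "http://localhost:8080/health";
    static final String ETCD_KEY_URL =
            "http://etcd.internal:2379/v2/keys/services/orders/host1";

    public static void main(String[] args) throws Exception {
        while (true) {
            if (isHealthy()) {
                // Re-register with a short TTL; if this loop dies, the key expires.
                put(ETCD_KEY_URL, "value=10.0.1.17:8080&ttl=30");
            }
            Thread.sleep(10_000);
        }
    }

    static boolean isHealthy() {
        try {
            HttpURLConnection c = (HttpURLConnection) new URL(HEALTH_URL).openConnection();
            c.setConnectTimeout(2_000);
            c.setReadTimeout(2_000);
            return c.getResponseCode() == 200;
        } catch (Exception e) {
            return false;
        }
    }

    static void put(String url, String form) throws Exception {
        HttpURLConnection c = (HttpURLConnection) new URL(url).openConnection();
        c.setRequestMethod("PUT");
        c.setDoOutput(true);
        try (OutputStream out = c.getOutputStream()) {
            out.write(form.getBytes(StandardCharsets.UTF_8));
        }
        c.getResponseCode(); // force the request; ignore the body in this sketch
    }
}
```

The TTL means a dead box’s key ages out on its own; the registrar just keeps refreshing it while the health check passes.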

Advantages: Simplicity. Etcd’s interface is RESTful, so it’s easy to write a client in any language.

Disadvantages:

  • Client has to implement fetch logic, caching, and etcd failure handling. Could be mitigated by a custom Work Market library, but we’d still need one for every language we use.
  • Clients have to implement their own load-balancing, timeout, and retry logic.

serf [serfdom.io]

Uses a gossip protocol to advertise, so it only needs to know the address of at least one other member. No central servers to configure. It’s more of a building block than a solution in and of itself — sorta like raw ZooKeeper and/or etcd, it’s not a discovery service on its own; you have to build one on top. By HashiCorp.

For discovery you’d set tags containing the service name and endpoint, and then query for a service using the command-line tool — or, more practically, use serf’s event handlers for membership changes to manage an on-disk file containing the service endpoints.
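As a very rough sketch of what such an event handler could look like: serf execs the handler and passes member details on stdin, but the exact line format below is an assumption (check the serf docs), and the output file is purely illustrative — a real handler would rewrite a per-service endpoints file rather than append to a log.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class SerfMemberHandlerSketch {
    public static void main(String[] args) throws Exception {
        // serf sets SERF_EVENT to e.g. "member-join" or "member-failed".
        String event = System.getenv("SERF_EVENT");
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            // Assumed member line layout: name, address, role, tags (tab-separated).
            String[] parts = line.split("\t");
            if (parts.length < 2) continue;
            String record = event + " " + parts[0] + " " + parts[1] + System.lineSeparator();
            Files.write(Paths.get("/var/run/discovered-endpoints.log"),
                    record.getBytes(), StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }
}
```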

Advantages:

  • Is always available, but contents aren’t necessarily correct
  • Simplicity. Setup is a breeze.

Disadvantages:

  • Serf’s remote protocol is MsgPack based over plain TCP. [ed: why this was considered a disadvantage is lost in the mists of time]
  • Service has to implement registration.
  • Would need to write event handlers to manage the discovered endpoints, plus code to maintain this list in a service. This could be mitigated by a Work Market custom library, but still one for every language we use.
  • Clients have to implement their own load-balancing, timeout, and retry logic.

bitly’s NSQ lookupd

Not really a general discovery service. Very specific to NSQ. You *could* build a discovery service on it, but I wouldn’t recommend it.

SkyDNS

Basically a DNS front end (on the discovery client side) to etcd.
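On the client side, discovery then just looks like a DNS query. A minimal sketch with dnsjava — the domain name below is illustrative; SkyDNS serves whatever domain it’s configured with, based on the keys written into etcd:

```java
import org.xbill.DNS.Lookup;
import org.xbill.DNS.Record;
import org.xbill.DNS.SRVRecord;
import org.xbill.DNS.Type;

public class SkyDnsLookupSketch {
    public static void main(String[] args) throws Exception {
        // Look up SRV records for a hypothetical "orders" service.
        Record[] records = new Lookup("orders.production.skydns.local", Type.SRV).run();
        if (records == null) {
            System.err.println("no records (lookup failed or empty)");
            return;
        }
        for (Record r : records) {
            SRVRecord srv = (SRVRecord) r;
            System.out.println(srv.getTarget() + ":" + srv.getPort()
                    + " prio=" + srv.getPriority() + " weight=" + srv.getWeight());
        }
    }
}
```

Note that SRV records carry priority and weight fields, which comes up again in the consul comparison below.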

Advantages:

  • See raw etcd summary.
  • Doesn’t require its own discovery service client, as you can use http://www.dnsjava.org/ or an equivalent library in your language of choice.

Disadvantages:

  • See raw etcd summary above. See also part about using DNS.
  • Clients have to implement their own load-balancing, timeout, and retry logic
  • For “one more thing that could break,” it delivers only a little value

Final Contestants

Puppet

Wins on the simplicity side, and we already have it, but fails when it comes to autoscaling, barring some cool puppet magic. If autoscaling is sufficiently rare for us, this is actually a workable solution.

Advantages:

  • Is always available, in that the configuration it drops doesn’t go away due to a network outage.
  • Simplicity. We already have it in-house, and it’s already a known thing.

Disadvantages:

  • Doesn’t handle autoscaling.
  • Clients have to implement their own load-balancing, timeout, and retry logic
  • Clients have to implement something to watch the file containing the discovered service endpoints.

consul

Built partly on top of serf (also by HashiCorp) for its gossip protocol, though that mostly seems to be so that the consul servers are discoverable by contacting any other machine that runs consul. It is built on the Raft consensus algorithm, like etcd. From HashiCorp (Vagrant, Packer, Atlas and Terraform — the latter two may be worth researching independently). Consul has a DNS front end and can use SRV records (can also use A), but has no support for the priority bits in the SRV records.
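To give a sense of the REST API, here’s a minimal sketch of asking the local consul agent for healthy instances of a hypothetical “orders” service; the /v1/health/service endpoint with ?passing filters to instances currently passing their health checks:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ConsulQuerySketch {
    public static void main(String[] args) throws Exception {
        // Ask the local consul agent for passing instances of "orders".
        URL url = new URL("http://localhost:8500/v1/health/service/orders?passing");
        HttpURLConnection c = (HttpURLConnection) url.openConnection();
        c.setConnectTimeout(2_000);
        c.setReadTimeout(2_000);

        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(c.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
        }
        // The response is a JSON array; each entry carries Service.Address and
        // Service.Port. A real client would parse it, cache the result, and
        // handle consul itself being unreachable.
        System.out.println(body);
    }
}
```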

Advantages:

  • Registration is done outside the service to be advertised
  • Externalized health checking used as criteria for registration, so a service cannot accidentally lie about its availability.
  • Can be used by services we didn’t write
  • Simple to set up, use, and configure.
  • Good documentation
  • Straightforward REST API if you don’t want to use DNS (we don’t).
  • DNS publishes A records, so if the port for a given service is well known, it can be used (without auto-retry) by clients we didn’t write.
  • Web UI available separately

Disadvantages:

  • Clients have to implement their own load-balancing, timeout, and retry logic
  • Client has to implement fetch logic, caching, and consul failure handling. Could be mitigated by a WM library, but still one for every language we use.
  • Unlike SkyDNS, there’s no way to set priorities, but this may not be a real problem.

Netflix Eureka

https://github.com/Netflix/eureka/wiki/

Servers are built as a servlet running on Tomcat. RESTful API with a Java library to register/unregister/query.
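A minimal sketch of what querying it over REST looks like — the base path here is an assumption modeled on the REST operations listed on the wiki, and depends on how the server war is deployed:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class EurekaQuerySketch {
    public static void main(String[] args) throws Exception {
        // Fetch all registered instances of a hypothetical "ORDERS" application.
        URL url = new URL("http://eureka.internal:8080/eureka/v2/apps/ORDERS");
        HttpURLConnection c = (HttpURLConnection) url.openConnection();
        c.setRequestProperty("Accept", "application/json");

        try (BufferedReader in = new BufferedReader(new InputStreamReader(c.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // instances with hostName, port, status, etc.
            }
        }
    }
}
```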

Advantages:

  • Just a servlet, running in a servlet container we are familiar with, so it is a known thing as far as deployment is concerned.
  • Its registration store is not CP, but AP. No quorum requirement.
  • Would merely be in a split-brain situation while the servers are partitioned. Registration/deregistration still works, the changes just might not be visible to everyone until the partition is healed.

Disadvantages:

  • Clients have to implement their own load-balancing, timeout, and retry logic (hence things like Ribbon).
  • Documentation is poor; it includes an example, but they don’t document what does what, and the example uses deprecated APIs. Javadoc is weak. APIs don’t describe well what they do.
  • Currently between major versions
  • Doesn’t support Jersey 2 (https://github.com/Netflix/eureka/issues/600), nor do they plan to until at least the next major version; how big a problem this turns out to be in practice is arguable.
  • Wants you to use Guice — how big a problem this is in practice is not known to me (I’m not a big fan of annotation-based DI frameworks).
  • Had a generally sloppy feel to it all. Might be clean inside Netflix, but what they published externally doesn’t look well maintained.
  • Eureka Web UI is Uuuuuug-lee.

Airbnb’s SmartStack

SmartStack is really a fully-baked solution. While the others can provide the publishing and retrieval of available service locations, SmartStack solves the problem you actually want to solve: I just want to make a request to an available service endpoint and have it “just work”.

Caveat: Synapse only runs with Ruby 1.9.3, but Nerve can run with any of the versions I tested (1.9.3, 2.0.0, 2.1.1, 2.1.2). Actual services and clients can be anything, though. A bug has been filed about Synapse only working on 1.9.3; it appears to actually be a bug in the ZooKeeper library for Ruby in the later versions.

Getting a service published is very simple and may require no code changes at all — it depends on whether there’s already a health check endpoint or an existing test that can be run to determine the liveness of the service.
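The only thing the service really has to offer is something nerve can poll. A minimal sketch of such an endpoint in Java — the port and path are whatever you point the nerve check at; nothing here is prescribed by SmartStack:

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;

public class HealthEndpointSketch {
    public static void main(String[] args) throws Exception {
        // Minimal /health endpoint for an external HTTP health check to poll.
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/health", exchange -> {
            byte[] ok = "OK".getBytes();
            exchange.sendResponseHeaders(200, ok.length);
            exchange.getResponseBody().write(ok);
            exchange.close();
        });
        server.start();
    }
}
```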

Advantages:

  • Retry (with all the timeout bits) and load balancing are done outside the service client code (via HAProxy).
  • Service publishing is done outside service code.
  • Externalized health checking used as criteria for registration, so a service cannot accidentally lie about its availability.
  • ZK outages are mitigated in that an outage only prevents updates to the existing data set; meanwhile, HAProxy will notice that any downed downstream machines have gone away and react accordingly. The practical upshot is that you just won’t see new machines brought up during the outage.
  • Can be used for clients and services we didn’t write
  • Provides a nice web gui for introspection as to what’s going where via HAProxy
  • Ability to debug servers and clients in situ while still preventing a box from receiving/sending traffic — i.e., it’s easy enough to “hide” it from the rest of the network just by stopping nerve on the affected box.
  • If you decided you didn’t want to use the HAProxy bits, synapse can be configured to dump and keep up to date a file with the current state of your downstream services.
  • Can be AWS-aware and do discovery via AWS tags in lieu of ZK; in that case the health checks can be configured in synapse instead and nerve is probably unnecessary, though it’s harder to “hide” a box as described above.

Disadvantages:

  • More moving pieces.
  • In keeping with the extra moving pieces, initial configuration may be trickier, but that can be mitigated by puppet classes containing pre-baked best practices. By trickier, I mean that synapse generates HAProxy configuration by way of its config file, so some of the options don’t make any sense unless you know HAProxy configuration.

Recommendation

SmartStack. It gives discoverability and handles all the timeout, retry, and load-balancing bits that can be the bane of writing good service clients, without sacrificing control for those times when you want it. It can handle autoscaling. It fails in a reasonable way in the face of a ZK outage. It can discover and provide reliable clients for things that know nothing about any of it, and not just HTTP. But more compellingly: at many companies I’ve worked at and some I’ve observed, they either used HAProxy or wound up writing its equivalent… more than once… and rarely did it nearly as well (Google’s GSLB is the lone exception).

When I first started working on the research for this document, I wanted to like Eureka, as the initial documents I read on it were promising, but at least where it is [again, written in mid-late 2015], I cannot recommend it. The more I dug into it, the worse it got. I really like its story about the way it deals with server outages, but when you get to the practical aspects of its API, the implementation and its documentation, it falls down.

When it comes to the more dynamic discovery options (consul, SkyDNS) built on ZK, etcd or serf, they don’t really give much in the way of value-add. Consul adds externalized health checks and a DNS facade over its internal CP store; SkyDNS just puts a DNS facade over etcd. Consul might be able to manage both ends, if the wind is blowing just the right way, for a service we didn’t write; SkyDNS only the client side, and again only if the wind blows just the right way.

So in the case of building directly on things like ZK, etcd or serf, the “Nerve” equivalent (that is, the externalized health check and service publisher part) would be fairly straightforward to write if it were necessary. In any case, Nerve supports a few different “reporter” backends, of which ZK is the best tested; etcd is still marked experimental. The fact that the reporter backend is pluggable means that whichever store we chose, we could just use Nerve even if we skip Synapse.

If we’re going for Simplicity as a first principle, and we’re not too concerned about our AWS bill (i.e. not worried about autoscaling), Puppet would be the way to go. We could then just add in the HAProxy bits if we wanted. If we skip HAProxy, clients would have to do all the load balancing, etc.

Thing That Should Be Mentioned

If we choose something ZK-based, running Netflix Exhibitor is recommended, as it provides a nice GUI to examine and/or change the contents of ZK, as well as a REST interface to ZK and ways to do other maintenance tasks.

The Drawing of SmartStack

A few details:

  • One ZooKeeper cluster for all of service discovery, a cluster of size 5 for prod. Can be the same cluster used for Kafka, though we may eventually want separate instances of ZooKeeper on those servers, depending on the failures we encounter, so as to separate the failure domains somewhat.
  • One HAProxy instance per client box irrespective of the number of microservices it talks to.
  • One instance of nerve per microservice box.
  • In this diagram, a microservice client may very well be itself a microservice.
  • One instance of Synapse per microservice box.

Why not use the ELB?

  1. The SmartStack system has no SPOFs. Any single component can fail and the impact is limited. If the ELB is down, you lose all the services to which it proxies.
  2. Using a local HAProxy means HTTP client implementations can be very stupid (with a capital DUH) — see the sketch after this list. If HAProxy is unavailable, the connect attempt fails immediately since it’s on localhost; no timeouts are required for the case where the network between the node and the ELB is slow or unavailable, or the ELB itself is down.
  3. ELBs cost extra money. Compare that to HAProxy, whose resource utilization is minuscule compared to the JVM, so it should not significantly affect the servers on which it runs, since the only thing that connects to it is the local service instance.
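Here’s what “stupid with a capital DUH” looks like in practice — a sketch of a client that only ever talks to its local HAProxy. The port is whatever synapse assigned that service in the generated HAProxy config; 3214 below is made up:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class LocalProxyClientSketch {
    public static void main(String[] args) throws Exception {
        // The client neither knows nor cares where the real instances live;
        // HAProxy on localhost does the balancing, retries, and failover.
        URL url = new URL("http://localhost:3214/api/orders/42");
        HttpURLConnection c = (HttpURLConnection) url.openConnection();
        c.setConnectTimeout(500);   // localhost: a down proxy refuses immediately anyway
        c.setReadTimeout(5_000);
        try (BufferedReader in = new BufferedReader(new InputStreamReader(c.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```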

How This Decision Has Aged

In the year-plus since we made this decision, it has aged very well. It has never been the source of an outage, and I don’t recall it ever being a significant problem in getting new services in — occasionally we’ll forget to add its info to puppet, but when it doesn’t show up on the client side, it’s pretty obvious what didn’t happen. Overall, if starting over, I would definitely pick this again.
