Interview: Seth Vargo (long version)

Seth Vargo, Director of Technical Advocacy at HashiCorp, will talk at ASAS about Graph Theory and Infrastructure as Code. Benny Cornelissen, Infrastructure Architect at Avisi and user of HashiCorp tools, sat down with Seth and asked him all about his work as Director of Technical Advocacy, creating workflow tools at HashiCorp, and the importance of guiding principles for a product company. Seth and Benny had a really interesting talk, so the interview is quite long. Therefore we made two versions: a short version (estimated reading time: 7 minutes) and a long version (estimated reading time: 25 minutes). Decide for yourself! 

This is the long version. Don't have time for that? Then head over to the short version!


Long version: You are Director of Technical Advocacy at HashiCorp. What exactly is that?

I have been at HashiCorp for 3 years now. I'm employee number 4, so I built a lot of these tools and still work on a lot of them. However, as the company evolved, we hired more people who are specialised in different areas. We have people working on Terraform full-time. Me being a generalist and working on everything didn't really scale.

At the same time, we're an Open Source company, and a lot of our early success was driven by community. Our founders, Mitchell (Hashimoto, ed.) and Armon (Dadgar, ed.), were very active in that community, speaking at conferences, et cetera. This evolved into a full-time advocacy role. What I do is about 20% engineering, 80% speaking at conferences, teaching workshops, meeting with customers, and working with our product teams to help prioritise and identify areas where the products should improve.

A big part of your job is traveling. How do you maintain a healthy work/life balance?

It's definitely challenging and it's certainly not for everyone. I would not recommend this job if, for instance, you have a family with small children. Even if you can get past the traveling itself, there are the last-minute delays and getting stuck at airports.

But at the same time it's really rewarding, too. I've met so many great people, learned about new technologies, interacted with people who are employee number 2 at companies that are now 500+ employees. It's really cool to see them grow. There's a community of people who do this. We have a dedicated Slack channel where we offer technical or emotional support. It's a very stressful job, basically not having a home.

You joined HashiCorp as employee number 4 in 2014. Today HashiCorp has over 150 employees. How did you experience the fast growth?

Things change. I think the biggest change is in how we consume and digest information. We went from 'everything is on Slack', to email, back to Slack, to internal Wiki/GDocs. There are different tools that suit organisations' needs at different sizes. When you're 10 people, it can be iMessage or SMS, but at 200 people that really starts to break down, especially when you're across multiple time zones.

When I joined we were all friends, almost like a family. I interacted with my co-workers more than with my family. That 'family feeling' hasn't gone away, but if I go to company meetings now, I see faces that I haven't seen before. We've gone from 'everybody knows everybody' to 'hey, you work here too?'. It's shocking at times, but in a good way.

Let's discuss product development. You built Consul Template shortly after joining HashiCorp. Was it difficult to build a new product so quickly after joining?

That's actually a pretty funny story. When I first joined I would have called myself a Ruby developer that does public speaking. I didn't know anything about Golang, the language HashiCorp writes all of their tools in. My introductory assignment was to write what is now Consul Template, in a language that I didn't know, with a tool that I wasn't familiar with, with basically no resources. They had never onboarded anyone before, they had no idea what they were doing (laughing). Nowadays we have very well-defined procedures.

It definitely was a bit stressful. I was dealing with multiple things. The language itself was something I wasn't familiar with. I didn't know where to look for things, or how to Google for a language called 'go'. If you Google for 'go array', you get no results for what you want. I learn through example, so that was quite frustrating. I asked for a lot of help, which was interesting as well. In my previous role, I was the expert and didn't really need to ask for help, and now I was in a position where I felt kind of dumb.

Consul Template became one of the main 'glue' tools between all of the HashiCorp tools. It's the de facto way to pull information from Consul or Vault onto the filesystem. It's kind of crazy to think that this tool that I wrote in a language that I didn't know, that interacted with another tool that I didn't know anything about, has actually become one of our flagship tools. That's really cool.
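For readers who haven't used Consul Template, a minimal configuration sketch might look like the following; the paths and the reload command are hypothetical.

```hcl
# Render a file from Consul/Vault data and reload the app when it changes.
# The source template could contain, e.g., {{ key "myapp/config/max_conns" }}
# to pull a value out of Consul's key/value store.
template {
  source      = "/etc/consul-template/app.conf.ctmpl" # hypothetical template
  destination = "/etc/myapp/app.conf"                 # file written to disk
  command     = "systemctl reload myapp"              # run after each render
}
```

Consul Template watches Consul and Vault and re-renders the destination file whenever the underlying data changes.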

HashiCorp is quite public about their philosophy, or Tao. Sceptics might even call it a marketing thing. Can you give some insight on how the Tao works in practice?

The Tao was actually something we were never going to make public. A few years ago, before we launched our first commercial product, we were all in Los Angeles. It was like an episode of Silicon Valley. We'd gotten a house, all coding, cooking, and living together as a family on the beaches of LA. We had press embargos. It was stressful, but relaxing. It was very odd.

One of the things that we read as employees as part of our onboarding was the Tao of HashiCorp. You can think of it as the ideology or the governing principles for the products we build and the interactions, internal or external. We read this document a number of times before someone suggested "Why don't we make this public, so other people can benefit from it?"

Also, by putting it out in the open, we can't go back on it. This is what we stand for, and it still holds true today. It's still required reading. Is it a marketing tool? Maybe it is today, but it was never intended as such.

With HashiCorp products attracting an increasingly big crowd, people are bound to be disappointed at times by the decisions you make. What do you think is the key to successfully dealing with that?

Let's start by looking at the monetisation model. Really, there are three main monetisation strategies in Open Source development. The first is to build an Open Source thing, get a bunch of users, and hope you get acquired by a bigger company that wants your flagship product. The second is support. This is where you see early-stage Puppet and Chef operating. Everything is Open Source, but you pay for professional services and support. The challenge with that is that it doesn't tend to scale well, and it actually hurts you to write good documentation, as people will then not pay you for support. That's a bit of a strange business ideology. As you get more customers, you have to bring on more support engineers, additional training, et cetera. Basically, you don't scale linearly, but exponentially instead.

The third model, which is the one that HashiCorp uses, is this notion of an open core with enterprise offerings. We make a bunch of Open Source tools that are on GitHub, that are licensed under the most liberal public license, and we have enterprise offerings. Some are a SaaS product or an on-premise web UI, and some are different binaries (from the Open Source tools) that provide specific enterprise functionality. Where we draw the line is what we think enterprises need. A good example of this is that we think most small to medium companies don't really care about Active Directory or LDAP integration. They either don't need it, or if they do, they can easily write a script and get it figured out. But when you get into the Fortune 500 companies, with hundreds of thousands of employees, distributed geographically across the world, they want Active Directory integration. They don't want another account management system, or they want SAML. The same goes for HSM integrations in Vault. No small to medium company has a hardware security module (HSM), but banks and financial companies do. That's why that is an enterprise feature.

We try to choose those enterprise features in such a way that we avoid as many of those conversations as possible, because we're trying to build add-ons that really benefit those enterprise customers while keeping the core open. That doesn't mean that we never have disputes or conflicts in the community. One of the pillars of the Tao is pragmatism. We believe we've made the best decisions with the information we have, and in the past we've been wrong. A really good example of that is Terraform State Locking. We decided that State Locking was going to be an enterprise feature, and there was pretty significant backlash from the community. "Why is this not Open Source? This affects everybody." We went back to the drawing board, and we thought about it, and we decided they were right. This wasn't an enterprise-only feature, so we moved it to Terraform Open Source. We really try to listen to our users, because we've all been there. We try not to lose sight of that. We're going to make mistakes as our company grows, but we really try to keep an open and honest conversation with our community.

Of course we cannot make everybody happy, and that's also the nature of business. We're going to do our best to make as many people as possible successful. But if a use case really doesn't fit, we're going to refer back to our Tao, to the pragmatism and honesty component, and just tell that user: "This isn't the right tool. We're sorry, but round peg, square hole, they're not going to fit together. Maybe consider some other tool."

At HashiConf in 2015, Otto was announced, basically as a successor to Vagrant. A year later, Otto was decommissioned. I can imagine this wasn't a decision taken lightly. Can you shed some light on how this unfolded behind the scenes?

For a little bit of background, Otto was a tool to abstract the complexity of cloud providers. Terraform is a tool for managing infrastructure, but it still exposes the complexity of the cloud. We wanted something like 'just give me an instance', 'give me a network', something a little more abstract. That essentially proved to be impossible. We severely underestimated the technical resources required. We underestimated the amount of investment and buy-in we would need to get from the cloud providers themselves, and the rate at which the cloud industry was changing. Back when Otto was announced, AWS (Amazon's cloud service, ed.) only had a couple of additional offerings, but the majority of things were 'run it on EC2' (the compute component of AWS, ed.). And now they have over 300 of these offerings. They have 'scratch your back as a service'. And other cloud providers are very similar. Even if we had pushed Otto's development further, seeing the direction the cloud providers were taking, it wasn't going to match. We were building an abstraction on top of something that inherently couldn't be abstracted.

We talked about Otto a lot internally, we had a lot of meetings, but it all comes back to the Tao. We had this novel idea, that could have worked, but the direction the industry was moving didn't align with the product we were building. We could try to force the round peg into the square hole, or we could just admit that we were wrong. We really had to make a decision. Do we let it go, keep it Open Source, let people submit issues, but not really devote resources to it, or do we terminate the project and explain why we were wrong? If you look again at the Tao, and it's really not a marketing thing, we build workflows, not technologies, and Otto was a technology, not a workflow. So basically, we built the wrong thing. We built something that was technically correct, but it solved the wrong problem. And we decided that leaving it around would be worse for the community than destroying it and leaving the source available for anyone to pick up.

There obviously was some backlash from the community, but I think we took a clear stance and explained why it was the wrong solution in the current situation. We also got questions from our enterprise customers. "How do we know you're not just going to kill Terraform, or Vault?" I think that answer is actually very easy. Otto received a bunch of popularity when we launched it. We launched Otto at about the same time as the Mars Rover landing, and Otto was above landing a machine on Mars in Google Trends. But we have pretty detailed anonymised statistics on our downloads, and it looked like in the weeks after the release nobody was using it. It wasn't popular, as it was the wrong abstraction. So it was very clear to us that by pulling it from the market we weren't really hurting anyone. With other tools like Terraform or Vagrant those numbers are very different, so there's no risk of us pulling any of those products. We just made a mistake (with Otto, ed.). We built the wrong product, at the wrong time.

HashiCorp tools are workflow tools. It's up to the user to build awesome things with them. Have you ever been amazed by what users created using HashiCorp products?

Every day. (Laughs) When I think about cool things people built in the past, I think of things like Instruqt, which is a tool that the people at Xebia made. That was then spun off into its own company. That was built primarily on top of the HashiCorp tools and Google Cloud, and it's very cool. I've used it a lot, I've run the leaderboards on it. That's cool. On the flip side, there are also very hard technical challenges that people are solving that aren't quite as lavish, but still impressive. There's a very large news organisation that uses Vault to secure all of their data. Encryption as a Service.

Another great example would be the Jenkins Autoscaler for Nomad, to scale CI/CD. Last week Under Armour, the clothing company, open sourced an autoscaler for Nomad, called Libra. The community was ecstatic about that. That's how we first learned they used Nomad (we didn't know) and that they had a need for autoscaling. Because our usage statistics are anonymised, we don't really know about the awesome things people are doing, unless they reach out and tell us.

HashiCorp tools are well known for being pretty safe for early adopters. A lot of HashiCorp's tools are still on a pre-1.0 release, while being completely production-ready. How do you translate that to enterprise customers, who would normally be very wary of running pre-1.0 software?

In the banking and financial industries, Vault is very popular. I think what it boils down to is being honest. We explain what our versioning means. Version 0.1 is an initial release, a thing you should download on your laptop, play around with, and that's it. When we hit 0.2, that's production-stable, but not production-ready. There's a low probability it will panic, but you're not going to have a lot of insight. We probably haven't finished the telemetry bits, the logging. It's operationally challenging. Then when we hit 0.3, that's really when we're telling customers it's production-ready, we feel confident you can run this in production. There might be some missing features, but the major bugs have been ironed out. At 0.5 or 0.6 it has seen significant production workloads. Versions 0.7 to 1.0 are mostly features. The reason we don't hit 1.0 very quickly is that for us 1.0 is a promise. It's a promise we won't change any APIs, internal or external. That's a very big deal to us.

We understand it's a big deal to customers, but in the end it's a version number, and some of our tools have been around a lot longer than other tools that are on version 3.0. Also, when we go to Bank B and tell them Bank A is already using our product, they're more likely to adopt it. And with the core being open, companies are also openly speaking about using Vault. It really changes the conversation.

Still, the version number comes up in every conversation. They will want to know why we're not on version 3.0, but then we have this honest conversation: "Why does this matter to you?" It often comes down to features, and if we can definitively prove that our current version offers those, the conversation usually ends. I think to a certain extent version numbers are a marketing ploy, and we're not a company that does smoke and mirrors. We want you to use our products because they solve your problems, not because they are version 5.0. It's silly to discount a tool because of its version number.

During your talk at ASAS we will probably get to see a bit of Terraform. For those people that don't know Terraform, can you explain how Terraform is different from other well-known tools like Puppet, Ansible, or Chef? What can people expect from your talk?

Terraform is a tool for managing infrastructure as code. When you think about infrastructure, Terraform can manage everything that has an API, and it provides lifecycle management for those resources. Puppet, Chef, Ansible, and Salt manage a machine: they install Apache, create a user, lay down some config files. That's really not Terraform's responsibility; it doesn't do that machine layer. It's managing your infrastructure. How do those 10 instances that you provisioned with Chef get connected to a load balancer? And how does that load balancer get connected to a DNS entry? That's what Terraform helps you model, in a single text file.
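For readers who haven't seen Terraform, a minimal sketch of such a text file might look like this; the AMI, hosted zone ID, and domain are hypothetical placeholders.

```hcl
# One instance, a load balancer in front of it, and a DNS record pointing at
# the load balancer, all described (and connected) in a single file.
resource "aws_instance" "web" {
  ami           = "ami-0123456789abcdef0" # hypothetical AMI
  instance_type = "t2.micro"
}

resource "aws_elb" "web" {
  name               = "web-lb"
  availability_zones = ["us-east-1a"]

  listener {
    instance_port     = 80
    instance_protocol = "http"
    lb_port           = 80
    lb_protocol       = "http"
  }

  # The provisioned instance gets attached to the load balancer here.
  instances = ["${aws_instance.web.id}"]
}

resource "aws_route53_record" "web" {
  zone_id = "Z0000000EXAMPLE" # hypothetical hosted zone
  name    = "www.example.com"
  type    = "A"

  # ...and the load balancer gets connected to a DNS entry here.
  alias {
    name                   = "${aws_elb.web.dns_name}"
    zone_id                = "${aws_elb.web.zone_id}"
    evaluate_target_health = true
  }
}
```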

And by capturing that in a text file, you can enable GitHub pull requests, code reviews, automatic reporting, a lot of features you'd associate with traditional software engineering. You can bring that to Terraform. Terraform also separates the planning phase from the apply phase. When you're dealing with thousands of machines and you want to change something, you want to see the result of that roll-out before you do it. Terraform's plan phase shows you what is going to happen, before it happens.

The last pillar that separates Terraform from the other solutions is the graph theory, which is something I'm actually going to spend a lot of time talking about in my talk. The graph theory is how Terraform uses the mathematical construct of a graph to model these complex relationships between infrastructure and infrastructure providers. And by using this graph, we get amazing parallelization. We can couple resources across cloud providers, and we see more and more users running across clouds. Terraform composes those resources across clouds, in a single text file.
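As a small illustrative sketch of that cross-provider composition (provider names and values are made up): when a resource from one provider references a resource from another, that reference becomes an edge in Terraform's graph, and resources without connecting edges can be created in parallel.

```hcl
# Two providers in one configuration. The Google Cloud DNS record references
# the AWS instance's public IP, so Terraform adds an edge between them and
# creates the instance first; unrelated resources are created in parallel.
resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0" # hypothetical AMI
  instance_type = "t2.micro"
}

resource "google_dns_record_set" "app" {
  managed_zone = "example-zone" # hypothetical Cloud DNS zone
  name         = "app.example.com."
  type         = "A"
  ttl          = 300
  rrdatas      = ["${aws_instance.app.public_ip}"]
}
```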

Also, because Terraform uses this graph plugin model, we can manage anything with an API. One of my favorite examples is that you can manage GitHub repositories and permissions using Terraform. You can manage your DNS. You can submit Kubernetes jobs. Anything that has an API. That's why I think Terraform really enables Infrastructure as Code: being able to compose all of these resources together, collaborate on these changes, and ultimately easily roll out production or production-like infrastructures.
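As a hypothetical illustration of that GitHub example (the repository name and username are made up), the GitHub provider turns repositories and permissions into Terraform resources just like any piece of infrastructure.

```hcl
# Terraform isn't limited to cloud infrastructure: the GitHub provider talks
# to the GitHub API, so repositories and access become code too.
resource "github_repository" "docs" {
  name        = "team-docs" # hypothetical repository
  description = "Internal documentation, managed by Terraform"
}

resource "github_repository_collaborator" "alice" {
  repository = "${github_repository.docs.name}"
  username   = "alice-example" # hypothetical collaborator
  permission = "push"
}
```

DNS records and Kubernetes resources can be expressed in the same way, which is the kind of composition described above.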