Posted 12 May 2023, 7:10 pm

Reliability Engineer at Pyth

Operating Pyth Network is a nontrivial challenge. Our price feeds run 24x7. DeFi applications depend on the accuracy and availability of these feeds; an inaccurate price or offline feed can cause serious financial losses. Each feed in turn depends on many different services, some of which are run by our data providers and some by us. It’s a complex system with many different failure modes, but it has to work correctly all the time.

We also run a variety of off-chain services, such as the backend for the pyth.network website, and tools for logging historical data. These services run in a Kubernetes cluster that is managed using Terraform. We also need to ensure these services are running and healthy at all times.

We’re looking for people to help us operate this system and improve its reliability over time. This job has many different aspects, including providing front-line support for incidents, developing automation to manage our infrastructure, and defining deployment plans for high availability.

About us and the Job

  • We are a small team. About half the team is technical; the other half manages relationships with data providers, developers, and the broader community. (Building a network requires talking to people!)
  • We are mostly remote. Team members live across the world, in the US, Europe, and Asia. We do have offices in some locations (Porto, Chicago, London, Amsterdam, Singapore) for those who prefer in-office work.
  • Our team communicates with each other and external developers in English. Strong spoken and written English skills are required.
  • We operate like a startup in the rapidly growing and changing DeFi ecosystem. To be successful, we must adapt to meet the current needs of the market. Good candidates will help our organization adapt; they are flexible problem solvers who are willing and able to take on whatever the occasion demands.
  • Most of our software development is open source. You can look at our GitHub repositories to understand what we typically work on.
  • We offer a competitive salary and generous benefits package. Furthermore, where applicable, employees may be eligible for token allocations as part of Pyth Network’s employee incentive program.

What You'll Do:

  • Provide front-line response to incidents and outages, such as unavailable price feeds or website downtime.
  • Develop automation tools to provision and manage our infrastructure, including cloud services and Kubernetes clusters. We currently use Terraform to manage our infrastructure, but we’re not married to it and may use different tools in the future. Some of our tools are written in Python and others in Go.
  • Design and implement operational plans to achieve high-availability guarantees for our price feeds and web services. Build redundant service deployments, monitoring solutions, dashboards, and alerting tools to ensure that critical services are running continuously. Support services in development and production environments, from pre-launch through launch. Benchmark application resource consumption to allocate capacity.
  • Measure and monitor application metrics (availability, latency, etc.) to understand the health of the system. Work with developers to add metrics and logging to their applications in order to facilitate Grafana dashboards and alerts. Develop logging practices and libraries to standardize metric reporting and alerting across multiple programming languages.

Skills You'll Need:

  • Comfortable developing software. Writing software is a big part of the job, as we write lots of tools to automate processes and monitor deployments.
  • Solid understanding of Linux fundamentals, such as processes and permissions, along with an understanding of containers (Docker) and cloud deployments.
  • Experience troubleshooting, monitoring, and debugging cloud-native applications and distributed systems.
  • Ability to handle shared operational and periodic on-call duties.
  • 1+ years of experience supporting critical production environments. Work in financial and crypto markets is a plus.
  • Predictable and reliable availability.



Source: Remote Ok