
Introduction

A distributed job scheduler is responsible for scheduling and executing jobs across multiple nodes in a distributed system. It is a crucial component in many large-scale applications, where tasks need to be executed on a timely basis and often require coordination across different services or systems. In this design, we will build a job scheduler that allows users to submit periodic jobs, similar to cron jobs.

Key operation flows

This system has two core flows: 1) users submit job configurations, and 2) the system executes jobs according to those configurations. We will not cover configuration updates, job status monitoring, job execution failures, logging, etc., since they are not the core function of a distributed job scheduler.

Job submission flow

[Diagram: job submission flow]

  1. Using a client, users submit new job configurations to our system. Besides bookkeeping data, the core configuration format is

<request_id, job_id, start_time, end_time, period, execution_routine>

     Period is the number of seconds between executions. Execution routine specifies how to execute the job, for example, by calling an API or running Hive queries.
  2. Our system processes the request and, if successful, returns an OK status to the user. This guarantees the job will be executed at the next specified time and repeated according to the specified period.
  3. If processing fails, our system returns a failure status to the client, and the user is responsible for retrying if appropriate.
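To make the configuration format concrete, here is a minimal sketch in Python. The field names come from the format above; the `validate` helper and its rules (non-empty time window, positive period) are illustrative assumptions, not part of the original specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobConfig:
    request_id: str
    job_id: str
    start_time: float        # epoch seconds of the first execution
    end_time: float          # epoch seconds after which the job stops
    period: int              # seconds between consecutive executions
    execution_routine: str   # opaque handle, e.g. an API endpoint or a query name

    def validate(self) -> bool:
        # Assumed sanity checks: the execution window must be non-empty
        # and the period must be positive.
        return self.start_time < self.end_time and self.period > 0
```

A submission handler would validate the config first and return the OK or failure status described in steps 2 and 3.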

Job execution flow

For each submitted job, our system executes it according to the specified start_time and period. To simplify the discussion without losing any key insights, execution is on a best-effort basis: in case of an outage or failure, we do not retry the job; recovery should be a separate design. If our system restarts and finds jobs that came due during the outage, it simply ignores them in this MVP design.
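The "ignore past-due runs" policy can be captured in a small helper that computes a job's next execution time. This is a sketch under the MVP assumptions above; `next_run` is a hypothetical function name, not part of the original text.

```python
import math


def next_run(start_time: float, period: int, end_time: float, now: float):
    """Return the next execution timestamp at or after `now`, or None if
    the job's window has ended. Runs missed during an outage are skipped,
    matching the best-effort MVP policy."""
    if now <= start_time:
        candidate = start_time
    else:
        # Number of whole periods elapsed since start_time, rounded up,
        # which lands on the first scheduled slot at or after `now`.
        elapsed = math.ceil((now - start_time) / period)
        candidate = start_time + elapsed * period
    return candidate if candidate <= end_time else None
```

For example, a job with `start_time=0`, `period=10`, and `end_time=100` restarted at `now=25` would next run at 30, silently skipping the executions at 10 and 20.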

Design challenges

There are three hard questions in this system.

  1. Since most of the time our system will be waiting for the next execution moment, how do we sleep correctly, saving CPU cycles without missing any scheduled jobs?
  2. Assuming the execution process sleeps most of the time and knows when to wake up for existing schedules, how do we wake it up when new schedules arrive?
  3. How do we scale job capacity indefinitely? Namely, how do we scale horizontally so that the system can store more job configurations than one machine can hold and execute more jobs than one machine can handle?
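Questions 1 and 2 are commonly answered together with a min-heap of due times plus a condition variable: the loop sleeps until the earliest due time, and a new submission notifies the condition variable to wake it early. Below is a minimal single-node sketch, assuming this approach; the `Scheduler` class and its method names are illustrative, not the design's final API.

```python
import heapq
import threading
import time


class Scheduler:
    """Single-node sketch: sleeps until the earliest due time (question 1)
    and is woken early by a condition variable when a new job arrives
    (question 2)."""

    def __init__(self):
        self._heap = []  # min-heap of (due_time, job_id)
        self._cond = threading.Condition()
        self._stopped = False

    def submit(self, due_time: float, job_id: str) -> None:
        with self._cond:
            heapq.heappush(self._heap, (due_time, job_id))
            self._cond.notify()  # wake the loop: the earliest due time may have changed

    def run(self, execute) -> None:
        with self._cond:
            while not self._stopped:
                now = time.monotonic()
                if self._heap and self._heap[0][0] <= now:
                    _, job_id = heapq.heappop(self._heap)
                    execute(job_id)
                    continue
                # Sleep until the next due time, or indefinitely if no jobs
                # are scheduled; wait() releases the lock while sleeping.
                timeout = self._heap[0][0] - now if self._heap else None
                self._cond.wait(timeout)

    def stop(self) -> None:
        with self._cond:
            self._stopped = True
            self._cond.notify()
```

Question 3 is not addressed here: horizontal scaling requires partitioning job configurations across machines, which the architecture below takes up.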

Now, let’s keep these three questions in mind while architecting the system.

Architecture

The job scheduler system consists of four components.