A distributed job scheduler is responsible for scheduling and executing jobs across multiple nodes in a distributed system. This is a crucial component in many large-scale applications, where tasks need to be executed on a timely basis and often require coordination across different services or systems. In this design, we will design a job scheduler that allow users to submit periodical jobs like cron job.
This system has two core flows: 1) users can submit job configurations 2) the system executes the jobs according to the submitted configurations. We will not cover details on configuration update, job status monitoring, job execution failure, loggings, etc. They are not the core function of a distributed job scheduler.
https://documents.lucid.app/documents/605e4398-a8c8-4d7b-8be0-0713484d6946/pages/0_0?a=254&x=513&y=-1196&w=951&h=332&store=1&accept=image%2F*&auth=LCA ad32a4bb75b3b598b6900aa50a03c2b45833d10ab20b261b6924d073177a14b0-ts%3D1719340742
<request_id, job_id, start_time, end_time, period, execution_routine>
Period is the number of seconds between each execution. Execution routine specifies how to execute the job, for example, calling some API, executing Hive queries, etc. 2. Our system process the request, if successful, return the OK status to users. That guarantees the job will be executed in the next specified time and repeat according to the specified period. 3. If the process failed, our system returns failure status to the client, the user is responsible for retrying if appropriate.
For each submitted job, our system execute it according to the specified start_time and period. To simplify the discussion without sacrificing any merits, the execution is on best-effort basis: in case of outage or failure, we won’t retry the job. The recovery should be a separate design. When our system restarted and found jobs past due during the outage, we will just ignore them in this MVP design.
There are three hard questions in this system.
Now, let’s keep the above three questions in mind while architect the system.
The job scheduler system consists of four components.