Apache Airflow Scheduler: The scheduler does not appear to be running. Last heartbeat was received % seconds ago.

Hello everyone,

Are you facing the same?

Well, after opening a few tasks to investigate our Apache Airflow test environment, I decided to go through the Apache Airflow configuration files to try to find whatever was causing this error. I noticed that every time the error happens, the Apache Airflow console shows a message like this:

The scheduler does not appear to be running. Last heartbeat was received 14 seconds ago.

The DAGs list may not update, and new tasks will not be scheduled.

In general, we see this message when the environment doesn’t have enough resources available to execute a DAG. But this case was different: CPU usage was at 2%, memory usage was at 50%, no swap was in use, and no disk was at 100% usage. I checked the DAG logs from the last few hours and found no errors. I also went through the airflow.cfg file and checked the database connection parameter, task memory, and the parallelism settings. Nothing wrong. Long story short: everything looked fine!

I then searched for the message in the Apache Airflow Git repository and found a very similar bug: AIRFLOW-1156 "BugFix: Unpausing a DAG with catchup=False creates an extra DAG run". In summary, it seems this situation happens when the parameter catchup_by_default is set to False in the airflow.cfg file.
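If you want to check this setting without opening the file by hand, here is a minimal sketch using Python's standard configparser module. The sample text, the section names ([core] and [scheduler]) and the values are assumptions for illustration only; the section a key lives in can differ between Airflow versions, so adjust them to your own airflow.cfg.

```python
import configparser

# Hypothetical excerpt of an airflow.cfg; values are made up for illustration.
SAMPLE_CFG = """
[core]
parallelism = 32

[scheduler]
catchup_by_default = False
"""


def read_settings(cfg_text, wanted):
    """Return {(section, key): value or None} for each requested setting."""
    # interpolation=None so literal '%' characters in the file are not expanded.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return {
        (section, key): parser.get(section, key, fallback=None)
        for section, key in wanted
    }


settings = read_settings(
    SAMPLE_CFG,
    [("core", "parallelism"), ("scheduler", "catchup_by_default")],
)
for (section, key), value in settings.items():
    print(f"[{section}] {key} = {value}")
```

To run it against a real file, read the file's contents into a string and pass that in place of SAMPLE_CFG. A missing section or key comes back as None instead of raising an error.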

This parameter tells Apache Airflow to ignore past execution dates and start scheduling from now. To confirm this was our case, I checked with change management whether anything had been changed in this environment. To my surprise, that very parameter had been changed one month ago.
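To make the effect of catchup more concrete, here is a simplified model in plain Python. It is not Airflow's scheduler code, only a sketch of the idea: the function name runs_to_schedule and the dates are hypothetical, and it assumes a fixed interval where a run becomes due once its interval has ended.

```python
from datetime import datetime, timedelta


def runs_to_schedule(start, interval, now, catchup):
    """Simplified model: list the interval start times that would get a run."""
    runs = []
    current = start
    # A run covers [current, current + interval) and is due once that interval has ended.
    while current + interval <= now:
        runs.append(current)
        current += interval
    # Without catchup, only the most recent completed interval is kept.
    return runs if catchup else runs[-1:]


start = datetime(2024, 1, 1)
now = datetime(2024, 1, 5, 12)
daily = timedelta(days=1)

print(runs_to_schedule(start, daily, now, catchup=True))
print(runs_to_schedule(start, daily, now, catchup=False))
```

With catchup enabled, this model returns all four completed daily intervals since the start date; with catchup disabled, it returns only the latest one.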

I then edited the Apache Airflow configuration file and set the parameter catchup_by_default back to True. The environment was released to the development team to check that everything was all right. One week later, no issues have been reported.

Conclusion?

This issue showed us that the development environment is a no man’s land. The change management process exists on its own, without an approval process to support it. The lack of an approval process led to a 4-hour outage and left 2 teams unable to work.

I hope you enjoy it!

And please be responsible with your environments!
