Many organizations using the popular open source Apache Airflow platform to schedule and manage workflows may be exposing credentials and other sensitive data to the Internet because of how they use the technology, researchers have found.
Security vendor Intezer this week said it recently discovered several misconfigured Airflow instances exposing sensitive information belonging to organizations across multiple industries, including manufacturing, media, financial services, information technology, biotech, and health.
The exposed data included user credentials for cloud hosting services, payment processors, and social media platforms, including Slack, AWS, and PayPal. Intezer found that at least some of the data exposed via misconfigured Airflow instances could allow threat actors to gain access to enterprise networks or execute malicious code and malware in production environments and on Apache Airflow itself.
“It is quite easy to find exposed instances,” says Ryan Robinson, security researcher at Intezer. To locate one, all a threat actor must do is scan IP addresses and check them for the expected HTML file. “It is trivial to find sensitive information on exposed instances, but to exploit it to run code is much harder and requires a solid understanding of each platform,” Robinson adds.
Organizations use Apache Airflow to create and schedule automated workflows, including those related to external services, such as AWS, Google Cloud Platform, Microsoft Azure, Hadoop, Spark, and other Apache software. A survey of its usage in 2020 showed most of its users are data engineers, scientists, or data analysts at midsize to large companies. More than three-quarters of organizations do little to no customization of the technology before using it.
Airflow allows users to orchestrate jobs that involve multiple tasks, Robinson says. For example, he says, a job might involve generating reports, then emailing them to clients; another job might involve collecting, processing, and uploading data to AWS buckets.
While Airflow gives users multiple options to use it securely, organizations can put data at risk through the way they use the platform.
Intezer, for instance, found insecure coding practices to be the most common cause of credential leaks in Airflow. Intezer’s research uncovered multiple Airflow instances in which passwords had been hardcoded either in the Python code that orchestrates tasks or in a feature that allows a user to define a variable value. In other instances, Intezer found users misusing an Airflow feature called Connections and storing passwords in plaintext instead of encrypting them.
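As a rough illustration of how a team might audit its own workflow code for the hardcoding problem described above, the following sketch parses Python source and flags credential-like names that are bound to string literals. The name pattern and the function `find_hardcoded_secrets` are hypothetical and are not taken from Intezer's research; a heuristic like this will miss some cases and flag some harmless ones.

```python
import ast
import re

# Hypothetical heuristic: identifiers that suggest a credential.
SUSPICIOUS_NAME = re.compile(r"(password|passwd|secret|token|api_?key)", re.IGNORECASE)


def find_hardcoded_secrets(source: str) -> list[tuple[int, str]]:
    """Return sorted (line number, name) pairs where a credential-like
    name is bound to a non-empty string literal."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            # Plain assignments such as: db_password = "hunter2"
            names = [t.id for t in node.targets if isinstance(t, ast.Name)]
            value = node.value
        elif isinstance(node, ast.keyword) and node.arg is not None:
            # Keyword arguments such as: connect(password="hunter2")
            names = [node.arg]
            value = node.value
        else:
            continue
        # Only non-empty string literals count; lookups such as
        # os.environ["TOKEN"] are not literals and are ignored.
        if isinstance(value, ast.Constant) and isinstance(value.value, str) and value.value:
            findings.extend((value.lineno, n) for n in names if SUSPICIOUS_NAME.search(n))
    return sorted(findings)
```

Run against a file's contents, the function returns the line numbers worth reviewing by hand; it only reads the source and never executes it.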
“Airflow gives good options to store sensitive information securely through their Connections feature,” Robinson says. The feature allows organizations to ensure passwords that are used to push and pull data from other systems are stored in encrypted fashion. “For example, a task will download data from one platform using an API key, then process this data in another task and store this data in a database using a password to connect. One workflow may need to interact with multiple remote systems,” Robinson says. Users often misuse the Connections feature or directly hardcode the credentials into the Python scripts, bypassing the feature altogether, he notes.
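One way to keep credentials out of workflow scripts, shown here only as a sketch, is to define a connection outside the code. Airflow can read a connection from an environment variable named `AIRFLOW_CONN_` followed by the upper-cased connection ID, with the value written as a URI. This is a separate mechanism from the encrypted storage Robinson describes, and the helper name, host, and credentials below are made up for illustration.

```python
from urllib.parse import quote


def connection_env_var(conn_id, conn_type, host, login, password, port=None, schema=""):
    """Build the (variable name, URI value) pair for an Airflow
    connection supplied through the environment.

    Hypothetical helper: it only formats strings and does not call Airflow.
    """
    name = f"AIRFLOW_CONN_{conn_id.upper()}"
    # Percent-encode the login and password so characters such as
    # "@" or "/" do not break the URI structure.
    netloc = f"{quote(login, safe='')}:{quote(password, safe='')}@{host}"
    if port is not None:
        netloc += f":{port}"
    return name, f"{conn_type}://{netloc}/{quote(schema, safe='')}"


# Example with made-up values; a real password would come from a
# secrets manager or deployment tooling, not from source code.
name, uri = connection_env_var(
    "reports_db", "postgres", "db.example.com", "etl", "p@ss/word", 5432, "reports"
)
```

The workflow code then refers to the connection only by its ID, so the password never appears in the Python scripts. Environment variables are not encrypted, however, so access to the host still needs to be restricted.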
Intezer found other ways in which users can put enterprise data at risk through insecure use of Airflow. One example involves a setting that controls access to an Airflow configuration file that often contains sensitive information, such as passwords and keys. If the setting is not secure, anyone can access the configuration file from the Web server user interface, Intezer said in its report. Similarly, a feature in older versions of Airflow that allows users to run ad hoc database queries is dangerous because it requires no authentication and allows anyone with server access to get information from the database.
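A quick self-check for the configuration-file exposure might look like the sketch below. It assumes the setting in question is the `expose_config` option in the `[webserver]` section of `airflow.cfg`; the function name is hypothetical, and only a literal true value is treated as exposed.

```python
import configparser


def config_is_exposed(cfg_text: str) -> bool:
    """Return True if the given airflow.cfg text enables expose_config.

    Assumes the option lives under [webserver]; a missing section or
    option is treated as not exposed.
    """
    # Interpolation is disabled because airflow.cfg values (for example
    # log formats) commonly contain literal "%" characters.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    value = parser.get("webserver", "expose_config", fallback="False")
    return value.strip().lower() == "true"
```

In practice the text would come from the deployed configuration file, for instance `config_is_exposed(open(path).read())`, where `path` points at the instance's `airflow.cfg`.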
Intezer recommends that all organizations using Apache Airflow update to the latest 2.0.0 version of the platform and make sure that only authorized users are allowed to connect to it.
“Version 2.0.0 has made great improvements in security,” Robinson says. The new version has a fully supported API, unlike the experimental API in previous versions. Other major improvements include enforcing authentication and removing sensitive information from logs, as well as changes to the structure of the main configuration file, he says. Some older — and dangerous — features such as Ad-Hoc Query have been deprecated in the new version of Airflow.
Robinson says it’s hard to know for sure if attackers are targeting insecurely configured Airflow platforms; however, he says it would be a reasonable assumption that Airflow instances have been targeted.