
Adyen's Journey with Apache Airflow to Create Reliability at Scale

  • Writer: Peter Johnson
  • Dec 14, 2023
  • 10 min read

This post was written by Jorrick Sleijster and Natasha Shroff.

Adyen is the technology platform that thousands of businesses around the world rely on for end-to-end payments capabilities, financial products, and data-driven insights. Payments used to run on a patchwork of legacy systems and infrastructure, but our ambition was bigger: we set out to build a financial technology platform for the modern era, from the first design to the finished product. Today, businesses everywhere connect to our platform and we process billions in transactions. It should come as no surprise that handling and persisting a high volume of payments every second requires systems that are fit for purpose, dependable, and fast. While we love building these intricate systems and architectures in-house, when it comes to orchestrating our data workflows we count on Apache Airflow to get the job done.

In this blog post we show how we strive for reliability at scale, the issues we encountered along the way, and how we solved them.

Why does reliability matter so much? Our data warehouse has to be fed with payment-processing data on time and as accurately as possible. If we fail to deliver the right data to our partners, they miss the intelligence they need to make sound decisions, such as putting better fraud protection in place to avoid future losses from criminal activity. Merchants rely on detailed payment reports to see how many payments they received and how many were settled. If those reports are not delivered on time, their ability to detect fraud or payment errors suffers, which can translate into a great deal of lost revenue. In the worst case, a merchant that cannot access capital to fund new production could even be forced to close.

We have more than 250 data professionals organized into 40+ teams focused on product analytics and ML models. Together, these teams own and monitor over 600 DAGs (Directed Acyclic Graphs) in Airflow, with more than 10,000 tasks that typically run hourly or daily. Airflow works alongside Spark and HDFS in our stack; see our earlier blog post about our Airflow setup to learn more.

Let us dig into some of the issues we ran into on this journey.

We are committed to keeping our Airflow cluster continuously available, dependable, and secure for our tenants: the product teams that deliver value to our merchants through scheduled insights, predictions, ML training pipelines, and more. Adyen runs entirely on-premise, meaning we manage the servers ourselves rather than hosting our software in the cloud, and we aim to rely exclusively on open-source tools, which makes Airflow an ideal fit.

High availability

High availability is an important goal for many companies, and for us it is essential. Our Airflow cluster consists of three Celery workers, three schedulers, and three webservers, with NGINX and Keepalived in front to route web traffic. On top of that, we run a highly available PostgreSQL setup across three separate servers to store the Airflow metadata. This arrangement keeps the cluster functional even when a single machine, or an entire rack, fails.
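To give an idea of what keeping an eye on such a setup can look like, here is a minimal sketch, with hypothetical hostnames and not our actual tooling, that polls the /health endpoint the Airflow webserver exposes and prints the status of the metadata database and the scheduler.

```python
# Minimal monitoring sketch (hypothetical hostnames, not Adyen's actual tooling):
# poll the /health endpoint of each Airflow webserver and report component status.
import requests

WEBSERVERS = [
    "https://airflow-web-1.example.internal:8080",
    "https://airflow-web-2.example.internal:8080",
    "https://airflow-web-3.example.internal:8080",
]

def check(base_url: str) -> None:
    response = requests.get(f"{base_url}/health", timeout=5)
    response.raise_for_status()
    # The payload lists the metadatabase status and the latest scheduler heartbeat.
    for component, details in response.json().items():
        print(f"{base_url}: {component} -> {details.get('status')}")

if __name__ == "__main__":
    for url in WEBSERVERS:
        try:
            check(url)
        except requests.RequestException as exc:
            print(f"{url}: unreachable ({exc})")
```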
Currently, our Airflow setup is installed directly on the machines, but we plan to containerize it and migrate to Kubernetes in a future iteration.

Multi-tenant permission structure

We tuned Airflow's permission management to our multi-tenant setup, because giving every Airflow user edit rights on all 600+ of our DAGs would not be appropriate. We started by integrating LDAP (Lightweight Directory Access Protocol) so that users are added to the groups of their respective teams. Whenever the DagBag is refreshed, we regenerate the edit permissions for the DAGs. As a result, everyone can read every DAG, but only the team that owns a DAG can modify it.

Cluster upgrades

We strive to keep Airflow up to date with the latest security fixes and features. For each upgrade we review the changelog for breaking changes and test locally to make sure our code is both backwards and forwards compatible. After testing on our sandbox cluster, we roll the upgrade out to our tenants. Because we have added several custom components to Airflow, this process is not always straightforward; we presented these contributions at the Airflow Summit in each of the past two years [1][2].

Hardware failures

Running more machines means seeing more hardware failures, and we have had our share. Airflow handles some of these situations on its own; for example, a healthy scheduler automatically adopts the tasks of a scheduler that has failed. That works well, but the database is a harder problem. Our on-premise, highly available PostgreSQL setup is made possible by Patroni. At one point the primary PostgreSQL server went down and one of the standby servers was promoted to take its place, a so-called fail-over. The service stayed up, but we noticed that some Airflow tasks were still marked as running even though they had finished before the fail-over. This made us worry that we were losing data around fail-over events.

Synchronous commits

The image below shows how an update from Airflow flows through a PostgreSQL cluster with two active standbys. Step 1 is an update action initiated by Airflow. With asynchronous commits, PostgreSQL acknowledges the update to Airflow right away (step 2); steps 3 and 4 replicate the update to the standbys, and steps 5 and 6 confirm that replication. With synchronous commits, the replication must be confirmed before PostgreSQL acknowledges the update to Airflow. If the primary PostgreSQL instance crashes right after an update, steps 3 to 6 may never happen under asynchronous commits; this is ordinary replication lag. Airflow is told the write succeeded on the primary, yet once a standby is promoted to become the new primary the update is gone, because it was never replicated to that node. Synchronous commits prevent this, since the acknowledgement is only sent after the replicas have written the data.

We therefore decided to use synchronous commits, so that every PostgreSQL node has written each change before the transaction is acknowledged to the application. This guarantees that two replicas with exactly the same data are ready to take over if the primary PostgreSQL server fails, which gives us extra peace of mind about hardware failures. The downside is that every write to the database becomes considerably slower, so Airflow components that perform many writes take a performance hit. We measured a drop of around 10% in scheduler performance. That may sound like a lot, but at our current scale it can easily be offset by adding another scheduler.
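For readers who want to see what this looks like on the database side, the sketch below reads the two PostgreSQL settings involved, synchronous_commit and synchronous_standby_names. The connection details are hypothetical, and in a Patroni-managed cluster these settings are normally controlled through Patroni's configuration rather than edited by hand.

```python
# Sketch only (hypothetical connection details): inspect whether the PostgreSQL
# primary behind Airflow waits for standby confirmation before acknowledging commits.
import psycopg2

conn = psycopg2.connect(
    host="airflow-db-primary.example.internal",  # hypothetical hostname
    dbname="airflow",
    user="airflow",
    password="***",
)
with conn, conn.cursor() as cur:
    # 'on', 'remote_write' or 'remote_apply' means commits wait for the standbys.
    cur.execute("SHOW synchronous_commit;")
    print("synchronous_commit =", cur.fetchone()[0])

    # Which standbys must confirm a commit, e.g. 'ANY 1 (node2, node3)'.
    cur.execute("SHOW synchronous_standby_names;")
    print("synchronous_standby_names =", cur.fetchone()[0])
```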
Machine capability matters

When setting up PostgreSQL standby nodes, you cannot simply pick any server. The standbys need to be at least as powerful as the primary and have the same hardware profile: disks with comparable latency, so that replication does not add extra write latency, and CPUs that match the primary's so that performance holds up after a fail-over. The nodes should be close enough to keep network latency low, for example in the same data center but not in the same rack, so that a rack outage cannot take out the whole cluster. Also, try not to run other applications on these machines. In one incident, other applications on one of the standby nodes were using 100% of the CPU and leaving little for PostgreSQL, which, because of the synchronous commit setting, made every write sluggish. We could only schedule a handful of tasks per minute when we expected to handle more than a thousand in the same time frame.

Priority weights

The number of our DAGs grew over time, and with it the number of dependencies between them. To express these cross-DAG dependencies we built a modified version of the ExternalTaskSensor, which checks the database every 5 minutes to see whether the corresponding tasks in the upstream DAGs have completed. Thousands of these sensors run every day, most of them starting at the beginning of the day. The resulting load on our system is significant, especially on our less powerful non-production clusters, where the sensors take up a large share of the available resources.

We investigated several ways to improve this, including the priority weight of tasks. The weight rule determines what a task's priority is based on: the number of downstream tasks, the number of upstream tasks, or an absolute value set by the user. The default is the number of downstream tasks, which in our case means that sensors get a higher priority than any of our ETL tasks. The image below shows a DAG in which the order of execution is marked by colored squares; with the default rule the order is the one shown in red, which guarantees that all sensors complete before any ETL task runs.

Unfortunately, this behaves like a waterfall of delays: ETL tasks have to wait until all sensors have run, and because the ETLs finish later, the sensors in other DAGs that wait on those ETLs keep poking for even longer. To improve this, we changed the priority mechanism in the Airflow configuration file to the upstream rule. Now, as soon as the first sensor succeeds, its downstream ETL task runs right away instead of waiting for the results of the other sensors. Prioritizing ETL tasks this way reduced the amount of poking done by sensors and made the cluster noticeably more efficient under heavy load.
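To make the change concrete, here is a simplified sketch with hypothetical DAG and task names, not one of our production DAGs. It shows the per-task equivalent of the configuration change described above; on recent Airflow versions the same effect can be applied globally by setting default_task_weight_rule = upstream in the [core] section of airflow.cfg.

```python
# Simplified sketch (hypothetical DAG and task names): prioritize downstream ETL
# work over still-pending sensors by using the "upstream" weight rule.
import pendulum
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.sensors.external_task import ExternalTaskSensor
from airflow.utils.weight_rule import WeightRule

with DAG(
    dag_id="example_etl_with_upstream_weights",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule="@daily",
    catchup=False,
) as dag:
    # Waits for a task in another DAG; poke_interval=300 matches a 5-minute check.
    wait_for_payments = ExternalTaskSensor(
        task_id="wait_for_payments_dag",
        external_dag_id="payments_ingestion",
        external_task_id="publish_payments",
        poke_interval=300,
        mode="reschedule",  # frees the worker slot between checks
        weight_rule=WeightRule.UPSTREAM,
    )

    # Placeholder for the real ETL work. With the upstream rule, this task's
    # priority includes its upstream tasks, so once its sensor has succeeded it
    # is scheduled ahead of other DAGs' sensors that are still waiting.
    run_etl = EmptyOperator(task_id="run_etl", weight_rule=WeightRule.UPSTREAM)

    wait_for_payments >> run_etl
```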
At Adyen we have many DAGs with a large number of tasks that run daily or even more frequently, so who is responsible for keeping an eye on all of them? We give our engineers the freedom to build ETLs and machine learning models, which means domain teams have the autonomy to create and manage their own DAGs for their data needs. That autonomy comes with the responsibility of keeping those pipelines healthy, and our platform teams support this by applying best practices for automation and testing.

Pre-merge tests

To minimize the number of errors in our DAGs, our data developers have to complete a few steps before merging any change, starting with pre-merge tests that keep bugs out of the codebase. These tests take the DAG's sensors and operator dependencies into account, and they also check whether the resources assigned to a task are sufficient or need to be increased for the task to succeed. We run a range of tests to catch potential issues.

Unit tests are a good way to exercise a workflow before it touches any real data. After the unit tests pass, we check the DAG itself by starting the Airflow webserver on a local machine and creating a DAG run of the modified DAG; this frequently surfaces DAG import errors caused by misconfigured sensors and tasks. To make this easy, we use our own CLI tool to launch the Airflow services locally: it starts the PostgreSQL database, the Airflow scheduler, the DAG processor, a Celery worker, and the webserver, so our DAGs can run on a developer's machine.

Testing DAGs in test environments

Tracking DAGs in test environments is not straightforward. Before rolling out any change to our DAGs, we first deploy it to our beta and test environments. Running the tasks there for several days lets us catch and fix anything our pre-merge tests missed, so that we can deploy the change to our production environment safely.

Data validation

Data validation is an essential part of the data pipeline. Our engineers built a capable, user-friendly in-house data validation framework to ensure data accuracy across all pipelines. It lets us verify requirements such as column values being unique or restricted to a specific set. We also monitor relative deviations in data volume over time, to make sure the amount of data written is exactly what it should be.
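To make the idea concrete, here is an illustrative sketch with hypothetical column and table names, written in plain PySpark rather than our internal framework, showing the two kinds of checks mentioned above.

```python
# Illustrative sketch only (hypothetical columns, statuses and paths), not Adyen's
# internal validation framework: a uniqueness check and an allowed-values check.
from pyspark.sql import DataFrame, SparkSession, functions as F

ALLOWED_STATUSES = {"Authorised", "Settled", "Refused", "Refunded"}  # hypothetical

def assert_unique(df: DataFrame, column: str) -> None:
    """Fail if any value in `column` appears more than once."""
    duplicates = df.groupBy(column).count().filter(F.col("count") > 1).count()
    if duplicates:
        raise ValueError(f"{duplicates} duplicated values found in column '{column}'")

def assert_in_allowed_set(df: DataFrame, column: str, allowed: set) -> None:
    """Fail if `column` contains values outside the allowed set."""
    unexpected = df.filter(~F.col(column).isin(list(allowed))).count()
    if unexpected:
        raise ValueError(f"{unexpected} rows with unexpected values in '{column}'")

if __name__ == "__main__":
    spark = SparkSession.builder.appName("data_validation_sketch").getOrCreate()
    payments = spark.read.parquet("/warehouse/payments/dt=2023-12-14")  # hypothetical
    assert_unique(payments, "payment_id")
    assert_in_allowed_set(payments, "status", ALLOWED_STATUSES)
```

A production framework would also track row counts across runs to catch the relative deviations in data volume mentioned above.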
Our data reporting team is currently working on making this validation framework publicly available.

Challenges and future ideas

The main pain point in our current setup is how long the Airflow unit tests take. The DAG tests described above are thorough and take roughly 20 to 30 minutes per DAG. Most of that time goes into the schema tests, which need a Spark session to create empty dummy tables. Since these tests catch the majority of errors, they have to stay in the suite, so we are instead exploring ways to speed the process up. We recently discovered the built-in dag.test() utility, which can surface errors much faster because it does not require the Airflow services to be running and it works with custom operators.
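As a small illustration, here is a toy example, a hypothetical DAG rather than one of our pipelines, that runs end to end with dag.test().

```python
# Toy example (hypothetical DAG, not a production pipeline): dag.test() executes
# every task in a single local process, with no scheduler, executor or webserver,
# so failures show up quickly.
import pendulum
from airflow.decorators import dag, task

@dag(schedule=None, start_date=pendulum.datetime(2023, 1, 1, tz="UTC"), catchup=False)
def toy_payments_pipeline():
    @task
    def extract() -> list[int]:
        return [1, 2, 3]

    @task
    def load(rows: list[int]) -> None:
        print(f"Loaded {len(rows)} rows")

    load(extract())

toy_dag = toy_payments_pipeline()

if __name__ == "__main__":
    # Run `python this_file.py` to execute the whole DAG locally.
    toy_dag.test()
```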

 
 
 

Conclusion

In this blog post we walked through a number of challenges we faced while striving for reliability at scale with Airflow at Adyen. To keep our data reliable, we use synchronous commits in our database setup. To keep the cluster efficient, we changed our configuration so that ETL tasks are prioritized over sensors. We also built a robust suite of tests that makes our Airflow users more productive, whether they test locally or as part of our CI/CD pipelines, and we keep looking for ways to make testing faster and less time-consuming for our data experts. On top of all that, we run every Airflow component on at least three servers and keep our Airflow installation constantly up to date.

