Light up the Spark in Catalyst by avoiding UDFs
Processing data at scale usually means struggling with performance, strict SLAs, and limited hardware. I struggled to cut Spark SQL query run time and found the culprit! I would like to share both the culprit and the solution with you. In today's world of Big Data and Spark, we process high volumes of transactions. Catalyst is the Spark SQL query optimizer, and in this talk you will learn how to fully utilize its optimization power to make your queries as fast as possible by pushing operations such as filters down toward the data source and avoiding UDFs wherever you can.
Adi Polak is a senior software engineer and a developer advocate at Microsoft working on Azure, where she focuses on microservices architecture, distributed systems, real-time processing, high scale and performance, big data analysis, machine learning at scale, and functional programming. As a developer advocate, Adi brings her vast experience in tech and helps both startups and enterprises design, architect, and build their software and infrastructure with cost-effectiveness, scalability, team knowledge, and business market in mind. Adi was nominated to be one of 25 influential women in Software De