Demystifying Joins in Apache Spark

Hello, Habr. This translation was prepared for prospective students of the "Hadoop, Spark, Hive Ecosystem" course.

We also invite everyone to the webinar "Testing Spark Applications". In this open lesson we will look at the problems of testing Spark applications: static data, partial verification, and starting and stopping heavyweight systems. We will study libraries that address these problems and write tests.






This article focuses exclusively on the Join operation in Apache Spark and gives an overview of the foundations on which Spark's Join technology is built.

Joins are commonly used in typical data analytics flows to correlate two datasets. Apache Spark, as a unified analytics engine, also provides a solid foundation for executing a wide variety of Join scenarios.





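At a very high level, a Join operates on two input datasets by matching each record of one dataset against every record of the other. On a match or a non-match (according to a given condition), the Join outputs either an individual record from one of the two datasets or a Joined record. A Joined record is the combination of the matched records from both datasets.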





Important aspects of a Join:

Let us look at the three aspects that affect how a Join executes in Apache Spark. They are:





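1) Size of the input datasets: The size of the input datasets directly affects the efficiency and reliability of the Join. In addition, the relative size of the two datasets influences which Join mechanism is selected, which in turn affects the efficiency and reliability of the Join.

2) Join condition: The condition, or clause, on which the input datasets are joined is called the Join Condition. It usually consists of one or more logical comparisons between attributes of the input datasets. Depending on the condition, Joins fall into two broad categories: Equi Joins and Non-Equi Joins.

Equi Joins use only equality comparisons, all of which must hold simultaneously. Two examples are (A.x == B.x) and ((A.x == B.x) AND (A.y == B.y)), which are Equi Join conditions on attributes x, y of two datasets A and B participating in a Join.

Non-Equi Joins use a comparison other than equality, or several equality comparisons that need not hold simultaneously. Two examples are (A.x < B.x) and ((A.x == B.x) OR (A.y == B.y)), which are Non-Equi Join conditions on attributes x, y of two datasets A and B participating in a Join.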
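The distinction between Equi and Non-Equi conditions can be illustrated with a small plain-Python sketch. The datasets `A` and `B` and the helper `matching_pairs` are hypothetical and exist only for illustration; this is not Spark code.

```python
# Two tiny datasets with attributes x and y (illustrative data only).
A = [{"x": 1, "y": 10}, {"x": 2, "y": 20}]
B = [{"x": 1, "y": 10}, {"x": 3, "y": 20}]

def equi_condition(a, b):
    # Only equality comparisons, all required at once.
    return a["x"] == b["x"] and a["y"] == b["y"]

def non_equi_condition(a, b):
    # An inequality comparison makes this a Non-Equi condition.
    return a["x"] < b["x"]

def matching_pairs(condition):
    # Compare every record of A with every record of B.
    return [(a["x"], b["x"]) for a in A for b in B if condition(a, b)]

equi_pairs = matching_pairs(equi_condition)
non_equi_pairs = matching_pairs(non_equi_condition)
```

An Equi condition lets an engine extract a Join key and use hashing or sorting on it; a Non-Equi condition generally forces pairwise comparison, which is why the two categories are served by different mechanisms.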





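3) Join type: The Join type determines what the Join outputs once the Join condition has been applied to the records of the input datasets. Join types fall into the following broad classes:

Inner Join: An Inner Join outputs only the matched Joined records (those satisfying the Join condition) from the input datasets.

Outer Join: An Outer Join outputs the matched Joined records and also the non-matched records. It is further divided into left, right, and full outer Joins, depending on which input dataset(s) contribute the non-matched records.

Semi Join: A Semi Join outputs individual records from only one of the two input datasets, either on a match or on a non-match. When records are output on a non-match, the Semi Join is also called an Anti Join.

Cross Join: A Cross Join outputs every Joined record that can be formed by combining each record of one input dataset with each record of the other.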
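To make the Join types concrete, here is a minimal plain-Python sketch over two small lists of `(key, value)` records. The data and function names are assumptions made for illustration; Spark's own implementation is distributed and considerably more involved.

```python
left = [(1, "a"), (2, "b"), (3, "c")]
right = [(2, "x"), (3, "y"), (3, "z"), (4, "w")]

def matches(key, rows):
    """Values of all records in `rows` whose key equals `key`."""
    return [v for k, v in rows if k == key]

def inner_join(l, r):
    # Only matched, combined records.
    return [(k, v, rv) for k, v in l for rv in matches(k, r)]

def left_outer_join(l, r):
    # Matched records plus non-matched left records padded with None.
    out = []
    for k, v in l:
        found = matches(k, r)
        out.extend((k, v, rv) for rv in (found or [None]))
    return out

def left_semi_join(l, r):
    # Left records that have at least one match, emitted once.
    return [(k, v) for k, v in l if matches(k, r)]

def left_anti_join(l, r):
    # Left records that have no match at all.
    return [(k, v) for k, v in l if not matches(k, r)]

def cross_join(l, r):
    # Every left record combined with every right record.
    return [(lrow, rrow) for lrow in l for rrow in r]
```

Note that the semi and anti variants return records of one dataset only, while the other variants return combined records.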





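Having covered the important aspects of a Join, let us turn to how Apache Spark executes one.

Join execution mechanisms

Apache Spark provides five mechanisms for executing a Join: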





  • Shuffle Hash Join

  • Broadcast Hash Join

  • Sort Merge Join

  • Cartesian Join

  • Broadcast Nested Loop Join





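Broadcast Hash Join: In the 'Broadcast Hash Join' mechanism, one of the two input datasets (participating in the Join) is broadcast to all executors. A hash table is built from the broadcast dataset on every executor, after which each partition of the non-broadcast dataset is joined independently against that local hash table.

'Broadcast Hash Join' needs neither a shuffle nor a sort, which makes it very efficient. However, it is reliable only when the broadcast dataset is small enough to fit in memory. Spark therefore selects it only for datasets below a configurable size threshold, or when a hint requests it.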
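The idea behind a Broadcast Hash Join can be sketched in plain Python. Partitions are modelled as lists and the "broadcast" as a dictionary shared by all of them; the function names are hypothetical and this is only an illustration of the technique, not Spark's implementation.

```python
from collections import defaultdict

def build_hash_table(rows, key):
    """Group the rows of the small (broadcast) dataset by Join key."""
    table = defaultdict(list)
    for row in rows:
        table[key(row)].append(row)
    return table

def broadcast_hash_join(large_partitions, small_rows, large_key, small_key):
    # In Spark the table would be shipped to every executor; here it is
    # simply shared by all partitions of the large dataset.
    table = build_hash_table(small_rows, small_key)
    joined = []
    for partition in large_partitions:
        # Each partition is probed independently; no shuffle is needed.
        joined.append([(row, match)
                       for row in partition
                       for match in table.get(large_key(row), [])])
    return joined

large = [[(1, "a"), (2, "b")], [(2, "c"), (5, "d")]]
small = [(2, "X"), (3, "Y")]
result = broadcast_hash_join(large, small, lambda r: r[0], lambda r: r[0])
```

The output keeps the partitioning of the large dataset, which reflects why this mechanism avoids moving the large side across the network.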





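Shuffle Hash Join: In the 'Shuffle Hash Join' mechanism, the two input datasets are first aligned to a chosen output partitioning scheme (for more on how it is chosen, see "Guide to Spark Partitioning"). If one or both input datasets do not conform to the chosen partitioning, a shuffle is executed before the actual Join to bring them into conformity.

Once both datasets conform to the output partitioning, Shuffle Hash Join performs a classic single-node Hash Join in each output partition. A hash table is built from the partition of one input dataset, and the partition of the other dataset is joined to it by lookups in that hash table.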
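A Shuffle Hash Join can be sketched in plain Python as two steps: hash-partition both datasets on the Join key, then run a hash join inside each partition. The function names and the partition count `n` are assumptions made for this illustration, not Spark APIs.

```python
def shuffle(rows, key, n):
    """Redistribute rows into n partitions by hash of the Join key."""
    partitions = [[] for _ in range(n)]
    for row in rows:
        partitions[hash(key(row)) % n].append(row)
    return partitions

def shuffle_hash_join(left, right, left_key, right_key, n=4):
    joined = []
    # After the shuffle, equal keys from both sides share a partition index.
    for lpart, rpart in zip(shuffle(left, left_key, n),
                            shuffle(right, right_key, n)):
        # Build a hash table on one side of the partition...
        table = {}
        for row in rpart:
            table.setdefault(right_key(row), []).append(row)
        # ...and probe it with the other side.
        joined.extend((lrow, rrow)
                      for lrow in lpart
                      for rrow in table.get(left_key(lrow), []))
    return joined

left = [(1, "a"), (2, "b"), (6, "c")]
right = [(2, "x"), (6, "y"), (7, "z")]
result = sorted(shuffle_hash_join(left, right, lambda r: r[0], lambda r: r[0]))
```

Because each per-partition hash table must fit in memory, the size of the build side in every partition matters for this mechanism.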





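'Shuffle Hash Join' is more expensive than 'Broadcast Hash Join', because it may need to shuffle the input datasets. It is also less reliable, since the hash table built in each partition has to fit in memory. When the input datasets already conform to the output partitioning, the shuffle is skipped and the Join reduces to the per-partition Hash Join of 'Shuffle Hash Join'. As with 'Broadcast Hash Join', Spark lets you request this mechanism explicitly.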





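Sort Merge Join: The first part of 'Sort Merge Join' is similar to 'Shuffle Hash Join'. Here too, the two input datasets are first aligned to a chosen output partitioning scheme. If one or both of them do not conform to it, a shuffle is executed before the actual Join.

Once both datasets conform to the output partitioning, Sort Merge Join sorts each partition on the Join keys and then runs a classic single-node Sort Merge Join in each output partition.

'Sort Merge Join' is computationally less efficient than 'Shuffle Hash Join' and 'Broadcast Hash Join'; however, the memory required by 'Sort Merge Join' is significantly lower than that of 'Shuffle Hash' and 'Broadcast Hash'. As with 'Shuffle Hash Join', when the input datasets already conform to the output partitioning, the shuffle is avoided, which makes 'Sort Merge Join' cheaper.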
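The per-partition step of a Sort Merge Join can be sketched in plain Python: sort both sides on the Join key, then advance two cursors in step. The function name is hypothetical and the sketch covers only an inner Equi Join within a single partition.

```python
def sort_merge_join(left, right, left_key, right_key):
    left = sorted(left, key=left_key)
    right = sorted(right, key=right_key)
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left_key(left[i]), right_key(right[j])
        if lk < rk:
            i += 1          # left key too small, advance left cursor
        elif lk > rk:
            j += 1          # right key too small, advance right cursor
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right_key(right[j_end]) == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left_key(left[i]) == lk:
                joined.extend((left[i], r) for r in right[j:j_end])
                i += 1
            j = j_end
    return joined

left = [(3, "c"), (1, "a"), (2, "b"), (2, "bb")]
right = [(2, "x"), (4, "w"), (2, "y"), (3, "z")]
result = sort_merge_join(left, right, lambda r: r[0], lambda r: r[0])
```

No hash table is built, so only the current run of equal keys has to be held at once, which is the source of the lower memory requirement.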





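Cartesian Join: Cartesian Join is used to compute the cross product of the two input datasets. The number of output partitions equals the product of the partition counts of the two inputs. Each output partition corresponds to a unique pair of partitions, one from each input dataset, and within it every record of the first partition is combined with every record of the second.

Cartesian Join is very expensive when the input datasets have many partitions, because the number of output partitions grows as their product. Among the Join mechanisms, Cartesian is therefore one to use with care.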





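Broadcast Nested Loop Join: In 'Broadcast Nested Loop Join', one of the input datasets is broadcast to all executors. A classic Nested Loop Join is then executed between each partition of the non-broadcast dataset and the broadcast dataset.

'Broadcast Nested Loop Join' is slow, because every record of one dataset is compared with every record of the other. It is also reliable only when the broadcast dataset is small, since that dataset must fit in memory on every executor.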
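A Broadcast Nested Loop Join can be sketched in plain Python as a double loop that accepts an arbitrary condition, which is what lets it serve Non-Equi conditions. The function and parameter names are hypothetical; the `outer` flag is an assumption added to show how non-matched records could be emitted.

```python
def broadcast_nested_loop_join(stream_partitions, broadcast_rows,
                               condition, outer=False):
    joined = []
    for partition in stream_partitions:
        for srow in partition:
            matched = False
            # Every streamed row is compared with every broadcast row.
            for brow in broadcast_rows:
                if condition(srow, brow):
                    joined.append((srow, brow))
                    matched = True
            if outer and not matched:
                # Outer variant: keep the non-matched streamed row.
                joined.append((srow, None))
    return joined

stream = [[1, 5], [9]]
broadcast = [4, 6]
inner = broadcast_nested_loop_join(stream, broadcast, lambda s, b: s < b)
outer = broadcast_nested_loop_join(stream, broadcast, lambda s, b: s < b,
                                   outer=True)
```

The cost is proportional to the product of the two dataset sizes, which is why this mechanism is usually the last resort.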





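How does Spark select a Join mechanism?

With several Join mechanisms available, it helps to know how Spark chooses among them.

Spark selects a Join mechanism based on the following factors:

  • Configuration parameters

  • Join hints

  • Size of the input datasets

  • Join type

  • Join condition (Equi or Non-Equi Join)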





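The Spark API lets you attach Join hints to the datasets participating in a Join. The available hints, 'broadcast', 'merge', 'shuffle_hash', and 'shuffle_replicate_nl', ask Spark to prefer the corresponding Join mechanism.

With that in mind, Spark selects each Join mechanism under the following conditions: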





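'Broadcast Hash Join'

Mandatory conditions:

  • Applicable only to Equi Join conditions

  • Not applicable to the 'Full Outer' Join type

In addition, one of the following must hold:

  • The left dataset carries the 'Broadcast' hint, and the Join type is 'Right Outer', 'Right Semi', or 'Inner'.

  • No hint is given, but the left dataset is smaller than 'spark.sql.autoBroadcastJoinThreshold' (default 10 MB), and the Join type is 'Right Outer', 'Right Semi', or 'Inner'.

  • The right dataset carries the 'Broadcast' hint, and the Join type is 'Left Outer', 'Left Semi', or 'Inner'.

  • No hint is given, but the right dataset is smaller than 'spark.sql.autoBroadcastJoinThreshold' (default 10 MB), and the Join type is 'Left Outer', 'Left Semi', or 'Inner'.

  • Both datasets carry the 'Broadcast' hint, and the Join type is 'Left Outer', 'Left Semi', 'Right Outer', 'Right Semi', or 'Inner'.

  • No hint is given, but both datasets are smaller than 'spark.sql.autoBroadcastJoinThreshold' (default 10 MB), and the Join type is 'Left Outer', 'Left Semi', 'Right Outer', 'Right Semi', or 'Inner'.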





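'Shuffle Hash Join'

Mandatory conditions:

  • Applicable only to Equi Join conditions

  • Not applicable to the 'Full Outer' Join type

  • 'spark.sql.join.preferSortMergeJoin' (default true) is set to false

In addition, one of the following must hold:

  • The left dataset carries the 'shuffle_hash' hint, and the Join type is 'Right Outer', 'Right Semi', or 'Inner'.

  • No hint is given, but the left dataset is significantly smaller than the right dataset, and the Join type is 'Right Outer', 'Right Semi', or 'Inner'.

  • The right dataset carries the 'shuffle_hash' hint, and the Join type is 'Left Outer', 'Left Semi', or 'Inner'.

  • No hint is given, but the right dataset is significantly smaller than the left dataset, and the Join type is 'Left Outer', 'Left Semi', or 'Inner'.

  • Both datasets carry the 'shuffle_hash' hint, and the Join type is 'Left Outer', 'Left Semi', 'Right Outer', 'Right Semi', or 'Inner'.

  • No hint is given, but both datasets are sufficiently small, and the Join type is 'Left Outer', 'Left Semi', 'Right Outer', 'Right Semi', or 'Inner'.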





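'Sort Merge Join'

Mandatory conditions:

  • Applicable only to Equi Join conditions

  • The Join keys used in the Equi Join condition must be sortable

  • 'spark.sql.join.preferSortMergeJoin' (default true) is set to true

In addition, one of the following must hold:

  • Either dataset carries the 'merge' hint; the Join type can be any.

  • No hint is given; the Join type can be any.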





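'Cartesian Join'

Mandatory conditions:

  • The Join type is 'Inner'

In addition, one of the following must hold:

  • Either dataset carries the 'shuffle_replicate_nl' hint; the Join condition can be Equi or Non-Equi.

  • No hint is given; the Join condition can be Equi or Non-Equi.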





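'Broadcast Nested Loop Join'

'Broadcast Nested Loop Join' is the default Join mechanism; when no other mechanism can be selected, 'Broadcast Nested Loop Join' is chosen to execute any Join condition and any Join type.

When more than one Join mechanism is eligible, the order of preference is 'Broadcast Hash Join', 'Sort Merge Join', 'Shuffle Hash Join', 'Cartesian Join'.

Between Cartesian and Broadcast Nested Loop Join, Broadcast Nested Loop takes precedence for Inner, Non-Equi Joins over Cartesian Join when one of the datasets can be broadcast; otherwise, Cartesian Join is selected.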
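The order of preference among the mechanisms can be summarised in a deliberately simplified sketch. The function `choose_join_mechanism` and its parameters are hypothetical; the sketch ignores hints, which side is allowed to be broadcast for which Join type, and the relative-size checks for Shuffle Hash Join, so it should be read as an aid to intuition and not as Spark's planner logic.

```python
DEFAULT_BROADCAST_THRESHOLD = 10 * 1024 * 1024  # 10 MB, Spark's default

def choose_join_mechanism(is_equi, join_type, smaller_side_bytes,
                          keys_sortable=True, prefer_sort_merge=True,
                          broadcast_threshold=DEFAULT_BROADCAST_THRESHOLD):
    """Simplified, hint-free selection among the five mechanisms."""
    broadcastable = smaller_side_bytes <= broadcast_threshold
    if is_equi:
        # Hash-based mechanisms do not serve Full Outer Joins here.
        if broadcastable and join_type != "full_outer":
            return "broadcast_hash"
        # Shuffle Hash is considered only when Sort Merge is not preferred.
        if not prefer_sort_merge and join_type != "full_outer":
            return "shuffle_hash"
        if keys_sortable:
            return "sort_merge"
    # Non-Equi conditions, or no Equi mechanism applicable.
    if join_type == "inner" and not broadcastable:
        return "cartesian"
    return "broadcast_nested_loop"

MB = 1024 * 1024
small_equi = choose_join_mechanism(True, "inner", 1 * MB)
large_equi = choose_join_mechanism(True, "inner", 1024 * MB)
large_non_equi_inner = choose_join_mechanism(False, "inner", 1024 * MB)
```

Inspecting the physical plan of a real query (for example with `explain`) remains the reliable way to see which mechanism Spark actually chose.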





, : Join. , .





, Join Apache Spark. - , , .





