Hello.
At the end of last year, GlowByte and Gazprombank gave a large joint talk at the Big Data Days conference on building a modern analytical data warehouse on the Cloudera Hadoop ecosystem. In this article, we describe our experience building the system and the difficulties and challenges we had to face and overcome to make the project a success.
Hadoop . β Β« ?Β». . - , - , , , , , Hadoop.
β Cloudera , ββ . .
ββ β . -3 .
, 2017 β β .
, , data driven .
. , : , . . .
:
( , );
;
;
-;
;
Self-service ;
Data Science .
. :
-
-: CRM, Real Time Offer, Next Best Offer, ;
- as is ( Data Lake);
;
;
;
( );
;
;
.
;
;
SLA;
ELT ;
Enterprise (, SAP Business Objects, SAS);
.
, , open source , β \ .
Hadoop Cloudera Data Hub
.
Cloudera Data Hub.
1.
. ETL . ββ . .
Hadoop 40- - t-1 t-15 batch , real-time . :
CRM;
;
;
;
Collection;
MDM;
;
;
BI
2. β β
, , , . . Disaster Recovery .
science , , - . . , . . .
, , .
, , K8S, GPU .
, , ETL, , Cloudera.
CDH 5.16.1. .
Data : CPU 2x22 cores, 768 GB RAM, SAS HDD 12x4 TB. HPE DL380 Cloudera Enterprise Reference Architecture for Bare Metal Deployments. "", - , ETL . . , "100500" , , "".
, , .
Hadoop;
(ETL);
«- -> Hadoop» «Hadoop -> Hadoop»;
;
;
.
Hadoop 1.0 , Java , , , « » « ». , , SQL.
, , β SQL SQL. . SQL- Β« , Β».
«» SQL Hadoop. Impala . Impala Cloudera Hadoop .
Impala ?
Impala - , HDFS, MapReduce, Tez Spark.
Impala β .
Impala Parquet, (bloom , ), . Impala , MPP Teradata Greenplum.
Impala , , ETL .
Hadoop YARN . .
SQL , , SQL , 3-4 .
Hadoop :
- Hue, Cloudera. , SQL Excel.
Cloudera, β Impala ETL , ad-hoc BI ? - Impala Β« Β» Hive. E , .
β ETL .
ETL :
;
;
jobβ .
- , , Hadoop , . Hadoop - SQL. β β ( , ), Hadoop β β.
, . metadata driven E-L-T ETL , SQL . SQL . ETL , SQL. SAS Data Integration.
ETL metadata-driven ELT. Airflow!
;
lineage ETL , API;
.. jobβ ETL .
CI/CD
SAS DI API .
β .
β Data Replicator. Hadoop.
;
;
.. , ( ), ..
, , . , SLA Hadoop.
Data Replicator - Hadoop DR . , - , API. ETL , API . , DR , , «» .
, Hadoop ( Hadoop ) , , Kafka, Flume, ETL tool.
Hadoop . , , ( Hive) ( Impala).
- , . 24/7 . .. \ , ( , ..). .
, Hive 3 ACID , , Hive ( MapReduce), ACID Impala Hadoop .
HDFS snapshot VIEW.
HDFS, , VIEW.
VIEW, , .
β VIEW HDFS , Hadoop. UNDO Oracle, retention .
, HDFS , DDL VIEW .. metastore. .. VIEW .
HDFS Snapshot .
Data Replicator. , , ETL API. , ETL API VIEW.
, 24/7 . HDFS HDFS. , 25%.
β .
;
;
, ;
Hadoop cgroups;
Hadoop;
Hadoop, YARN Impala;
Impala β .
β ETL Cloudera.
. SQL , .
900 SQL . .
, . 1,5 2 . .
, , , . Hadoop , , , open source ( Apache Bigtop) .
Cloudera :
Active Directory (AD) ;
AD Sentry;
Sentry Impala HDFS;
Target VIEW ;
;
SSL . .
Hadoop ( )
;
ETL;
Hadoop ;
, , .
β .
Hadoop ( ) β , . .
. , Hadoop, , , .
ad-hoc , , .
, :
;
;
;
;
;
;
MDM;
;
;
;
;
;
;
;
;
;
.
, 177 2350 -. snappy 20 ( 100 RAW).
2010 . , . , . , , . . , , .
- -, . 40 , 550 13200 .
, Hadoop. Cloudera Data Hub - , . , .
, metastore ( ).
Impala. ββ . β ( , ETL, , ) , . sqoop export. Impala .
, , decommission , , .
. 36 500 .
Cloudera Data Impact 2020 Data For Enterprise AI.
, Hadoop Cloudera . - . β β. β β , .
ββ, ββ, ββ . . , , . «» .
time to market , data driven .
. ββ , t - 3-5 - . , , CRM. , , . . - !
Hadoop. Hadoop . SQL MPP, ββ , β β .
Cloudera Data Platform 7.1. , CDP . , , , , Impala 3.4, Parquet, Zstd . Atlas Cloudera Data Flow « ». Cloudera BI - Cloudera Data Visualization.
Hadoop:
Real-time Kudu (real-time , ). Kudu, Parquet, «» SQL Impala. - .
ODS
ODS Oracle GoldenGate , Hadoop «» «» .
property Hadoop;
Arango;
Arango;
( );
( , , );
,
-
, ;
, . - , β β.
K8S
, . , .
:
, .
, ().