Spark is an essential skill for big data development. A question often asked in interviews is "What is Spark?" or "Please introduce Spark." Today's article focuses on answering this question. Many people's answers are not accurate enough; the most accurate description of Spark can be found on the official website.
1. Overall introduction
Opening the official website, you will see an eye-catching slogan:
Unified engine for large-scale data analytics
That is, a unified engine for large-scale data analytics. Reading further:
What is Apache Spark™?
Apache Spark™ is a multi-language engine for executing data engineering, data
science, and machine learning on single-node machines or clusters.
This is the answer to our question: Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
To summarize the key points: Spark is a computing engine for large-scale data processing, and it supports multiple programming languages.
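To make this concrete, here is a minimal word count in Scala, one of the supported languages. This is only a sketch: the input path `data/words.txt` is an assumed placeholder, and `local[*]` runs the engine on a single node, while pointing the master at a cluster URL scales the same code out.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // "local[*]" uses all cores of the local machine; a cluster URL
    // (e.g. spark://host:7077) would run the same job on a cluster.
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")
      .getOrCreate()

    // Read a text file, split each line into words, and count each word.
    spark.read.textFile("data/words.txt")
      .selectExpr("explode(split(value, ' ')) AS word")
      .groupBy("word")
      .count()
      .show()

    spark.stop()
  }
}
```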
2. Features
The above is a general description. As for more specific features, the official website also provides an answer:
Key features
Simple. Fast. Scalable. Unified.
Spark's features are summarized in four words: simple, fast, scalable, unified. The official website also gives more specific descriptions:
Batch/streaming data
Unify the processing of your data in batches and real-time streaming, using
your preferred language: Python, SQL, Scala, Java or R.
Batch/streaming data: unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java, or R.
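The point is that one API covers both modes. The sketch below (assumptions: an input file at `data/events.txt` and a text stream on `localhost:9999`) applies the same word-count transformation to a static file and to an unbounded socket source via Structured Streaming.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("unified").master("local[*]").getOrCreate()
import spark.implicits._

// Batch: read a finite file once and aggregate it.
spark.read.textFile("data/events.txt")
  .select(explode(split($"value", " ")).as("word"))
  .groupBy("word").count()
  .show()

// Streaming: the same transformation over an unbounded socket source.
val counts = spark.readStream
  .format("socket")
  .option("host", "localhost").option("port", "9999")
  .load()
  .select(explode(split($"value", " ")).as("word"))
  .groupBy("word").count()

counts.writeStream
  .outputMode("complete") // streaming aggregations emit the full result table
  .format("console")
  .start()
  .awaitTermination()
```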
SQL analytics
Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc
reporting. Runs faster than most data warehouses.
SQL analytics: execute fast, distributed ANSI SQL queries for dashboards and ad-hoc reporting; runs faster than most data warehouses.
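As a sketch of what an ad-hoc query looks like, the snippet below registers a Parquet file as a temporary view and queries it with plain SQL; the file path, table name, and columns are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-demo").master("local[*]").getOrCreate()

// Expose the file to SQL under the name "sales", then query it directly.
spark.read.parquet("data/sales.parquet").createOrReplaceTempView("sales")

spark.sql("""
  SELECT region, SUM(amount) AS total
  FROM sales
  GROUP BY region
  ORDER BY total DESC
""").show()
```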
Data science at scale
Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having
to resort to downsampling
Data science at scale: perform exploratory data analysis (EDA) on petabyte-scale data without having to resort to downsampling.
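A small sketch of what EDA over a full dataset can look like; the S3 path and column names are assumptions, and the same calls work whether the data is megabytes or petabytes.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("eda").master("local[*]").getOrCreate()

// In practice this could point at petabytes of Parquet files.
val df = spark.read.parquet("s3a://my-bucket/events/")

df.printSchema()
df.describe("latency_ms").show() // count, mean, stddev, min, max
df.groupBy("country").count().orderBy(desc("count")).show(20)
```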
Machine learning
Train machine learning algorithms on a laptop and use the same code to scale
to fault-tolerant clusters of thousands of machines.
Machine learning: train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.
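The sketch below illustrates this claim with MLlib: only the master setting decides whether the pipeline runs on a laptop or on a cluster. The CSV path, the feature columns `f1`..`f3`, and the `label` column are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

val spark = SparkSession.builder()
  .appName("mllib-demo")
  .master("local[*]") // swap for a cluster URL or yarn to scale out
  .getOrCreate()

val raw = spark.read
  .option("header", "true").option("inferSchema", "true")
  .csv("data/train.csv")

// MLlib expects the inputs assembled into a single feature vector column.
val features = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3"))
  .setOutputCol("features")
  .transform(raw)

val model = new LogisticRegression().setLabelCol("label").fit(features)
println(s"Coefficients: ${model.coefficients}")
```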
3. Ecosystem
Apache Spark™ integrates with your favorite frameworks, helping to scale them
to thousands of machines.
Data science and Machine learning
SQL analytics and BI
Storage and Infrastructure
Spark integrates with many popular frameworks and helps scale them to thousands of machines. These frameworks include:
* Data science and machine learning: scikit-learn, pandas, TensorFlow, PyTorch, MLflow, R
* SQL analytics and BI: Superset, Power BI, Looker, Redash, Tableau, dbt
* Storage and infrastructure: Elasticsearch, MongoDB, Kafka, Delta Lake, Kubernetes, Airflow, Parquet, SQL Server, Cassandra, ORC
4. Core modules
Spark Core: provides the most basic and core functionality of Spark. Other Spark components such as Spark SQL, Spark Streaming, GraphX, and MLlib are all built on top of Spark Core.
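For a feel of the Spark Core layer, here is a minimal sketch that works with the RDD abstraction through SparkContext directly; the numbers are toy data.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("core-demo").setMaster("local[*]")
val sc = new SparkContext(conf)

// Distribute a small collection as an RDD, then transform and aggregate it.
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
val sumOfSquares = rdd.map(x => x * x).reduce(_ + _) // 55
println(sumOfSquares)

sc.stop()
```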
Spark SQL: the Spark component for working with structured data. Through Spark SQL, users can query data using SQL or the Apache Hive dialect of SQL (HQL).
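As a sketch of the Hive-dialect path: enabling Hive support lets Spark read Hive tables and accept HQL syntax. This assumes a configured Hive metastore, and the `logs` table is made up.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hql-demo")
  .enableHiveSupport() // requires a Hive metastore to be configured
  .getOrCreate()

// Query an existing Hive table with SQL/HQL.
spark.sql("SELECT dt, COUNT(*) AS pv FROM logs GROUP BY dt").show()
```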
Spark Streaming: the Spark component for streaming computation over real-time data, providing a rich API for processing data streams.
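A sketch using the classic DStream API of this module (newer code often uses Structured Streaming instead); the host and port are assumptions, e.g. a stream fed by `nc -lk 9999`.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("stream-demo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

// Count words arriving on a TCP socket, batch by batch.
val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()
```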
Spark MLlib: MLlib is the machine learning algorithm library provided by Spark. Besides the algorithms themselves, it provides utilities such as model evaluation and data import, as well as some lower-level machine learning primitives.
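A sketch of the evaluation and data-import utilities mentioned above: load a LibSVM file, train a classifier, and score it. The path assumes the sample file shipped with the Spark distribution.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val spark = SparkSession.builder().appName("eval-demo").master("local[*]").getOrCreate()

// LibSVM is one of the input formats MLlib supports out of the box.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)

val model = new LogisticRegression().fit(train)
val auc = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderROC")
  .evaluate(model.transform(test))
println(f"AUC = $auc%.3f")
```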
Spark GraphX: GraphX is Spark's framework and algorithm library for graph computing.
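A minimal GraphX sketch: build a tiny graph from toy vertex and edge RDDs and run PageRank over it.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

val sc = new SparkContext(new SparkConf().setAppName("graphx-demo").setMaster("local[*]"))

// Three vertices connected in a cycle; edge attributes are unused here.
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))

val graph = Graph(vertices, edges)
graph.pageRank(tol = 0.001).vertices.collect().foreach(println)

sc.stop()
```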
5. Summary
To close, a summary of the question "What is Spark?":
* Spark is a fast, general-purpose, scalable, memory-based engine for big data analysis and computation.
* Spark Core provides the most basic and core functionality of Spark.
* Spark SQL is the Spark component for working with structured data; through it, users can query data using SQL or the Apache Hive dialect of SQL (HQL).
* Spark Streaming is the Spark component for streaming computation over real-time data, providing a rich API for processing data streams.