Hadoop vs. Spark: How to Choose Between the Two?

Sasha Andrieiev
2 min read · Jun 3, 2020


Hadoop and Spark are distinct platforms, each built from its own set of technologies, and they can be used separately or together.

Hadoop is an open-source distributed framework that manages data processing and storage for big data applications. It is mostly used for big data analytics: batch MapReduce jobs over huge amounts of data rather than real-time processing.
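
To make the MapReduce style concrete, here is a minimal word-count sketch written for Hadoop Streaming, which lets you implement the mapper and reducer in Python. This is only a sketch: the streaming jar path and the input/output paths in the comment are placeholders, not a definitive setup.

```python
#!/usr/bin/env python3
# wordcount_streaming.py -- a minimal MapReduce word count for Hadoop Streaming.
# One possible submission (paths are placeholders):
#   hadoop jar /path/to/hadoop-streaming.jar \
#     -files wordcount_streaming.py \
#     -mapper "python3 wordcount_streaming.py map" \
#     -reducer "python3 wordcount_streaming.py reduce" \
#     -input hdfs:///data/books -output hdfs:///data/wordcounts
import sys


def mapper():
    # Emit one (word, 1) pair per word; Hadoop groups and sorts pairs by key
    # between the map and reduce phases.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")


def reducer():
    # Input arrives sorted by key, so all counts for a word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "map":
        mapper()
    else:
        reducer()
```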

The Hadoop ecosystem also includes tools such as Sqoop, Hive, and Mahout. In addition to using HDFS for file storage, Hadoop can be configured to read input from Amazon S3 buckets or Azure Blob Storage.
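
Any engine that speaks the Hadoop FileSystem API, MapReduce jobs and Spark alike, can read from S3 or Azure once the relevant connector is configured; only the URI scheme changes. Below is a small PySpark sketch of this, assuming the hadoop-aws connector is on the classpath; the bucket name and credentials are placeholders.

```python
from pyspark.sql import SparkSession

# Sketch: reading input from S3 instead of HDFS through the Hadoop FileSystem
# layer. Assumes the hadoop-aws connector (and its AWS SDK dependency) is
# available; the bucket name and credentials below are placeholders.
spark = (
    SparkSession.builder
    .appName("s3a-input-sketch")
    .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
    .getOrCreate()
)

# The same read API works whether the path is hdfs://, s3a://, or an Azure
# wasbs:// / abfss:// URI -- only the filesystem connector changes.
logs = spark.read.text("s3a://example-bucket/raw-logs/")
print(logs.count())
```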

Use cases for Hadoop:

  • Risk forecasting;
  • Industrial planning and predictive maintenance;
  • Fraud identification;
  • Stock market analysis.

Companies using Hadoop: Amazon Web Services, Cloudera, IBM, Microsoft, CERN.

Spark is a fast, easy-to-use, and general-purpose engine for big data processing. It is built from several components: Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX, and the scheduler. Spark is mostly used for ETL and SQL batch jobs across large datasets, for processing streaming data from sensors, IoT devices, or financial systems, and for machine learning tasks.
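
Here is a minimal sketch of the ETL-plus-SQL style of batch job mentioned above. The input path, schema, and column names (ts, device_id, reading) are illustrative assumptions, not a prescribed layout.

```python
from pyspark.sql import SparkSession, functions as F

# A small ETL + SQL batch job in PySpark. Paths and column names are
# assumptions for illustration only.
spark = SparkSession.builder.appName("etl-sql-sketch").getOrCreate()

# Extract: read raw CSV sensor readings.
raw = spark.read.csv("hdfs:///data/sensor_readings/", header=True, inferSchema=True)

# Transform: drop invalid readings and add a date column for partitioning.
clean = (
    raw.filter(F.col("reading").isNotNull())
       .withColumn("day", F.to_date("ts"))
)

# Query with Spark SQL: average reading per device per day.
clean.createOrReplaceTempView("readings")
daily = spark.sql("""
    SELECT device_id, day, AVG(reading) AS avg_reading
    FROM readings
    GROUP BY device_id, day
""")

# Load: write the aggregate back out as partitioned Parquet.
daily.write.mode("overwrite").partitionBy("day").parquet("hdfs:///data/daily_avg/")
```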

Use cases for Spark:

  • Real-time Big Data processing (see the streaming sketch after this list);
  • Machine learning software;
  • Interactive queries;
  • Fog and edge computing.
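
For the real-time case, Spark's Structured Streaming runs the same DataFrame operations continuously over incoming data. The sketch below uses the built-in rate source purely to generate test rows; a real job would read from Kafka, sockets, or files, and the one-minute window is an arbitrary example.

```python
from pyspark.sql import SparkSession, functions as F

# A minimal Structured Streaming sketch. The rate source just generates test
# rows; the source choice and window size are illustrative only.
spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Count events in one-minute windows as they arrive.
counts = events.groupBy(F.window("timestamp", "1 minute")).count()

query = (
    counts.writeStream
          .outputMode("complete")
          .format("console")
          .start()
)
query.awaitTermination()
```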

Organizations using Spark: UC Berkeley AMPLab (where Spark was created), Baidu, TripAdvisor, Alibaba Taobao.

Both Hadoop and Spark have their benefits and challenges. If you want to find out which one best fits your needs, check out our recent blog post:

https://jelvix.com/blog/hadoop-vs-spark-what-to-choose-to-process-big-data
