| Hadoop MapReduce | Spark |
| --- | --- |
| For iterative processes and interactive use, MapReduce's mandatory dumping of output to disk proved to be a huge bottleneck. In ML, for example, users rely on iterative train-test-retrain cycles. | Spark uses an in-memory processing paradigm, which lowers disk I/O substantially. Spark uses a DAG to record each transformation applied to a parallelized dataset and does not process it until a result is required (lazy evaluation); see the sketch after this table. |
| Traditional Hadoop applications first had to copy the data to HDFS (or another distributed filesystem) and then do the processing. | Spark works equally well with HDFS or any POSIX-style filesystem, although parallel Spark still needs the data to be distributed. |
| Mappers needed a data-localization phase in which the data was written to the local filesystem to provide resilience. | Resilience in Spark comes from the DAG: a missing RDD is recalculated by replaying the path from which it was created. |
| Hadoop is built on Java, and you must use Java to take advantage of all of its capabilities. Although you can run non-Java scripts with Hadoop Streaming, they still run inside a Java framework. | Spark is developed in Scala and has a unified API, so you can use Spark from Scala, Java, R, and Python. |
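To make the lazy-evaluation and lineage rows of the table concrete, here is a minimal PySpark sketch (the data and names are illustrative, not from a real workload): transformations such as `map` and `filter` only extend the DAG, an action such as `count` triggers the actual computation, and `toDebugString` shows the lineage Spark would replay to rebuild a lost RDD partition.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-evaluation-sketch")

# A parallelized dataset (RDD); the numbers are made up for illustration.
numbers = sc.parallelize(range(1, 1001))

# Transformations are only recorded in the DAG; nothing is computed yet.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# An action forces Spark to execute the DAG and materialize a result.
print(evens.count())

# The lineage Spark would replay to recreate a missing partition of `evens`.
print(evens.toDebugString())
```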
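The storage row can be illustrated the same way. The snippet below reuses the `sc` from the sketch above; the local path and the HDFS namenode address are placeholder values, not real endpoints.

```python
# Both calls return RDDs with the same API; only the storage layer differs.
local_lines = sc.textFile("file:///tmp/measurements.csv")                # POSIX-style local filesystem
hdfs_lines = sc.textFile("hdfs://namenode:9000/data/measurements.csv")   # HDFS

# Again, nothing is read from disk until an action runs.
print(local_lines.take(5))

sc.stop()
```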