ramiichinnam.blogspot.co.at - Apache Spark Blog


Checkpointing saves an RDD to a reliable storage system (e.g. HDFS, S3) and discards the RDD's lineage completely. When the driver restarts, recovery takes place from the checkpointed data.
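To make "forgetting the lineage" concrete, here is a minimal toy sketch in plain Python. It is not Spark's implementation: `ToyRDD`, `lineage_depth` and the pickle file layout are hypothetical names invented for illustration, and a local temporary directory stands in for HDFS or S3. In Spark itself the corresponding calls are `SparkContext.setCheckpointDir` and `RDD.checkpoint`.

```python
import os
import pickle
import tempfile


class ToyRDD:
    """Toy stand-in for an RDD: a dataset defined by a parent plus a function."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source          # root data, only set on the first dataset
        self.parent = parent          # lineage: the dataset this one derives from
        self.fn = fn                  # function applied to each parent element
        self.checkpoint_file = None   # set once the data has been saved

    def map(self, fn):
        # Transformations are lazy: only the lineage is recorded here.
        return ToyRDD(parent=self, fn=fn)

    def collect(self):
        if self.checkpoint_file is not None:
            # Checkpointed: read the saved data instead of recomputing.
            with open(self.checkpoint_file, "rb") as f:
                return pickle.load(f)
        if self.parent is None:
            return list(self.source)
        return [self.fn(x) for x in self.parent.collect()]

    def lineage_depth(self):
        # Number of ancestors that would be needed to recompute this dataset.
        return 0 if self.parent is None else 1 + self.parent.lineage_depth()

    def checkpoint(self, directory):
        # Materialize the data, write it to storage, then drop the lineage.
        data = self.collect()
        path = os.path.join(directory, "rdd-%d.pkl" % id(self))
        with open(path, "wb") as f:
            pickle.dump(data, f)
        self.checkpoint_file = path
        self.parent = None
        self.fn = None


checkpoint_dir = tempfile.mkdtemp()   # stand-in for reliable storage
base = ToyRDD(source=[1, 2, 3])
derived = base.map(lambda x: x * 2).map(lambda x: x + 1)

depth_before = derived.lineage_depth()
derived.checkpoint(checkpoint_dir)
depth_after = derived.lineage_depth()
result = derived.collect()
```

After the checkpoint, `derived` no longer refers to its parents, so reading it back depends only on the saved file. The trade-off is that the data can no longer be recomputed from its ancestors, which is why the storage has to be reliable.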

There are two types of data that we checkpoint in Spark:

Metadata Checkpointing – Metadata is data about data. Metadata checkpointing saves the information that defines the streaming computation to fault-tolerant storage such as HDFS. The metadata includes the configuration, the DStream operations, and incomplete batches. The configuration is the configuration used to create the streaming application. DStream operations are the operations that define the streaming application. Incomplete batches are batches that are in the queue but have not yet completed.
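The recover-or-create pattern behind metadata checkpointing can be sketched in plain Python. This is a toy model under stated assumptions, not Spark code: `create_context`, `save_metadata`, `get_or_create` and the `metadata.json` file are hypothetical, and a local temporary directory stands in for HDFS. In Spark Streaming the equivalent entry point is `StreamingContext.getOrCreate`.

```python
import json
import os
import tempfile


def create_context():
    """Build a fresh (toy) description of a streaming application."""
    return {
        "configuration": {"app_name": "toy-stream", "batch_interval_s": 5},
        "dstream_operations": ["flatMap(split)", "map(pair)", "reduceByKey(add)"],
        "incomplete_batches": [],
    }


def save_metadata(context, checkpoint_dir):
    """Write the metadata to a temp file first, then rename it into place."""
    path = os.path.join(checkpoint_dir, "metadata.json")
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(context, f)
    os.replace(tmp_path, path)


def get_or_create(checkpoint_dir, create_fn):
    """Return (context, recovered): load saved metadata if present, else create."""
    path = os.path.join(checkpoint_dir, "metadata.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f), True
    return create_fn(), False


checkpoint_dir = tempfile.mkdtemp()   # stand-in for fault-tolerant storage

# First start: nothing has been saved yet, so a fresh context is created.
context, recovered = get_or_create(checkpoint_dir, create_context)

# A batch is queued but not finished when the metadata is checkpointed.
context["incomplete_batches"].append({"batch_time": 10, "status": "queued"})
save_metadata(context, checkpoint_dir)

# Simulated driver restart: the saved metadata is found and loaded.
restored, recovered_after_restart = get_or_create(checkpoint_dir, create_context)
```

On the simulated restart the application gets back its configuration, its operations, and the batch that was still queued, so it can pick up where it left off instead of starting from scratch.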