Skip to content

Block Manager

ap35 edited this page Sep 6, 2016 · 13 revisions

Block Manager

I still need to do more research on the different functions in the BlockManager.java class

Overview

The Block Manager controls block placement and replication in the cluster, retrieving lists of blocks from the datanodes, and fault tolerance code to get rid of corrupted blocks and restore the cluster if there is a failure. It also records stats about storage info. To handle all these tasks asynchronously, there are different daemons for each of the different tasks that are created at startup. There are locks on each blocks to synchronize threads accessing the same blocks and it seems like each thread is persistent.

Main Data Structures

  • BlockInfo
  • This Class holds information on each individual block
  • It holds the id of its block collection and a list of DataNodeStorageInfo objects
  • There are two different types of Blocks
    • Contiguous Block
      • This a normal block that represents a file chunk
    • Striped Block
      • This is a collection of blocks that is used for storing chunks of erasure coding files
  • Block Collection
    • A list of a file's BlockInfo objects
    • Used to get last Block of a file efficiently and to determine if the file is under Construction
  • BlockMap
    • A mapping of a BlockInfo to its corresponding block collection and DataStorageInfo objects
  • DataNodeDescriptor
    • Main communication interface between the namenode and the datanodes
    • Stores all relevant information of a single data node
      • Network info so that the namenode can send data to the data nodes
      • Lists of blocks that either need to be replicated, erasure coded, recovered, and invalidated in next cycle
      • Storage info of blocks on the data node
      • A cache of blocks which speeds up lookups
  • DataNodeStorageInfo
    • Main class for storage info in a datanode
    • A datanode can have multiple storage info objects at a time
    • What it stores:
      • Storage type for all its blocks
      • Stats used in making storage reports
      • Blocks that are scheduled for storage in this location
      • Blocks that are stored in TreeSet for fast lookups
  • BlockPlacementPolicy
    • class for calculating placement of block and its replicas based on the specific policy
    • 4 different policies:
      • Default- This is just places original in different rack and the replicas on different nodes on same rack
        • Node Group- refines default to ensure two blocks are on different node groups of same rack
        • Upgrade Domain- refines default to ensure all three blocks have unique upgrade domains
      • Rack Fault Tolerant- places all blocks on different racks
    • public DatanodeStorageInfo[] chooseTarget() does the calculations
  • BlockStoragePolicy
    • It defines how a specific block should be stored on a data node
    • It is basically a wrapper of constants that the data node processes to determine where to store the block

Block and Replica States

  • A block can be marked one of many states
    • Under Construction- a new block has been allocated but it has not been filled up and sufficiently replicated
      • Complete- a block has recently been replicated sufficiently and will not be modified again
        • It will be not be considered under construction at this point
      • Commited- the client has reported all bytes have been written but not enough replicas have completed
      • Under Recovery- when a lease to file expires and the block is not complete, it goes through a recovery process to synchronize the contents among all replicas
      • Replicas for Under Construction blocks have different states too
        • Finalized- the replica and its contents have been verified and will not be modified again
          • It will be considered fully constructed at this point
        • RBW- the replica is being written to by the client
        • RWR- the replica is waiting to be recovered
        • RUR- the replica is undergoing recovery
        • Temporary- it is a temporary replica created only for replication and relocation purposes
    • After the block and its replicas have been fully constructed, it can be any of these States
      • Complete or Finalized- discussed above
      • Corrupted- a block that has been written to but the timestamps on the block and its BlockInfo object do not match up nor do the checksums on the block verify it.
        • a block and all its replicas must have this problem to be considered corrupt
      • Stale- a block whose datanodes havent sent a recent block report
      • Invalid- a block that is set to be deleted by the namenode
      • Low Redundancy- a block that is under replicated
      • PendingReconstruction- an under replicated block that is set to be restored to standard replication factor
        • a block that is under construction will not be considered for reconstruction
      • Live- the block is on a fully functioning data node
      • Read Only- blocks are initially this state during startup
      • Excess- an extra replication of the block that will be scheduled for deletion soon
      • Decommissioned- block is on a data node that is decommissioned or about to be so.
  • Each of these states has a corresponding block class or enum to signify it

##Block Reports

See Block Reports

Replication/Reconstruction

private class ReplicationMonitor

  • This thread maintains that each block meets at least the minimum replication factor, if not it will add a replication job on the queue and handle when it is popped off. it is mostly managed with the Replication monitor, which periodically checks the replication factor of blocks.

public BlocksWithLocations getBlocksWithLocations

  • it gets blocks and their locations with this method mostly, where it polls each datanode with a request to send their block list. Then it calculates the replication factor of each block and sees if any are under replicated

int computeDatanodeWork()
private void processPendingReconstructions() void rescanPostponedMisreplicatedBlocks()

  • these methods compute the replication work for the data nodes and then add the work to their queues stored in their DataNodeDescriptor.
  • They are periodically called by the monitor

Corrupted Blocks

private BlockToMarkCorrupt checkReplicaCorrupt

  • This verifies if a specific block is corrupt or is under construction, corrupted blocks are then removed when the data nodes process their block work for the cycle.

Storage Fragmentation Handler

private class StorageInfoDefragmenter

  • this class compacts Block TreeSet that do not meet the minimum fill factor
    • The tree is in DataNodeStorageInfo
    • It compacts it by remaking the tree
  • it runs in a separate thread

Comments

  • How does the block manager interact with the DataNodes...it seems like there is a field datanodeManager, which I am guessing is used to poll each datanode?
    • It also seems there is a bunch more interaction that you should probably talk to NameNode people about
  • Look into completing/committing blocks and what that entails
    • Probably should document the different statuses that a block can have (Under Construction/Complete/etc)
  • What is a LocatedBlock? (It seems like this will be a shared data structure amongst different groups and it seems like the BlockManager often creates it)
  • How do we know whether a block is corrupt?
  • What is block reconstruction?

Prudhvi's Comments:

  • Should enumerate all the possible replica states, as you did with block states, see here
    • Current, Corrupt, On a failed DataNode, Out of date, Under construction
  • Discuss what parts of block management are done by the NameNode and what parts are done by the DataNode
  • How does block recovery work?
  • I've seen different sources indicate different possible block and replica states. For example this page lists: under construction, under recovery, committed, and complete as the block states. It also lists: finalized, replica being written, replica waiting to be recovered, replica under recovery, and temporary as replica states. This differs from the the states in this wiki, so we should try to figure out a single consolidated list for all the possible states.

Clone this wiki locally