Client Reads
##Overview##
To read a file from HDFS, a client first contacts the NameNode, which returns the addresses of the DataNodes that hold the blocks of the requested file. The client then reads the data directly from those DataNodes.
##Main Use Case##
###Client reads from file###
- In order for a client to read a file from HDFS, the client first contacts the NameNode by calling open() on an instance of the DistributedFileSystem class.
- The DistributedFileSystem object talks to the NameNode through a DFSClient object, using the ClientNamenodeProtocol (not yet confirmed), to determine the locations of the file's blocks.
- The DistributedFileSystem then returns an FSDataInputStream to the client. This stream holds the addresses of all the DataNodes that store a copy of the first few blocks of the file; the number of blocks is set by the prefetch size in DfsClientConfig. The client uses the FSDataInputStream to interact directly with those DataNodes and complete the read.
- The client then calls read() on the FSDataInputStream, which connects to the closest DataNode holding the first block of the file. The data is streamed from that DataNode directly back to the client.
- When the end of a block is reached, the FSDataInputStream finds the closest DataNode holding the next block, asking the NameNode for further block locations if they were not among those prefetched, and continues streaming data directly back to the client as described above.
- When the client is finished reading, the client calls close() on the FSDataInputStream.
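The steps above can be sketched with a small in-memory model. This is an illustration under stated assumptions, not Hadoop code: `ReadFlowSketch`, `ToyNameNode`, `ToyCluster`, `ToyLocatedBlock`, `locateBlocks`, and `readFile` are all hypothetical names, replicas are assumed to be listed closest first, and whole blocks are copied at once rather than streamed.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy, in-memory model of the read path. All types here are hypothetical.
final class ReadFlowSketch {

    // A block ID plus the DataNode addresses holding a replica (assumed sorted closest first).
    static final class ToyLocatedBlock {
        final long blockId;
        final List<String> dataNodes;

        ToyLocatedBlock(long blockId, List<String> dataNodes) {
            this.blockId = blockId;
            this.dataNodes = dataNodes;
        }
    }

    // Stands in for the NameNode: it knows where blocks live but never serves file data.
    static final class ToyNameNode {
        final Map<String, List<ToyLocatedBlock>> files = new HashMap<>();
        int lookups = 0; // counts metadata requests made by the client

        // Returns locations for at most `prefetch` blocks, starting at block index `firstBlock`.
        List<ToyLocatedBlock> locateBlocks(String path, int firstBlock, int prefetch) {
            if (firstBlock < 0 || prefetch < 1) {
                throw new IllegalArgumentException("firstBlock must be >= 0 and prefetch >= 1");
            }
            List<ToyLocatedBlock> all = files.get(path);
            if (all == null) {
                throw new IllegalArgumentException("no such file: " + path);
            }
            lookups++;
            int start = Math.min(firstBlock, all.size());
            int end = Math.min(all.size(), start + prefetch);
            return new ArrayList<>(all.subList(start, end));
        }
    }

    // Stands in for the DataNodes: address -> (block ID -> block bytes).
    static final class ToyCluster {
        final Map<String, Map<Long, byte[]>> dataNodes = new HashMap<>();

        void store(String address, long blockId, byte[] data) {
            dataNodes.computeIfAbsent(address, a -> new HashMap<>()).put(blockId, data);
        }

        byte[] readBlock(String address, long blockId) {
            return dataNodes.get(address).get(blockId);
        }
    }

    // Reads a whole file: fetch a batch of locations, read those blocks, then ask again.
    static byte[] readFile(ToyNameNode nameNode, ToyCluster cluster, String path, int prefetch) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int nextBlock = 0;
        List<ToyLocatedBlock> located = nameNode.locateBlocks(path, nextBlock, prefetch);
        while (!located.isEmpty()) {
            for (ToyLocatedBlock block : located) {
                String closest = block.dataNodes.get(0); // first replica is assumed closest
                byte[] data = cluster.readBlock(closest, block.blockId);
                out.write(data, 0, data.length);         // data comes from the DataNode, not the NameNode
            }
            nextBlock += located.size();
            located = nameNode.locateBlocks(path, nextBlock, prefetch);
        }
        return out.toByteArray();
    }

    // Builds a three-block file with two replicas per block and reads it back.
    static String demo() {
        ToyNameNode nameNode = new ToyNameNode();
        ToyCluster cluster = new ToyCluster();
        String[] parts = {"hello ", "hdfs ", "reads"};
        List<ToyLocatedBlock> blocks = new ArrayList<>();
        for (int i = 0; i < parts.length; i++) {
            byte[] data = parts[i].getBytes(StandardCharsets.UTF_8);
            String near = "dn" + i;
            String far = "dn" + (i + 1);
            cluster.store(near, i, data);
            cluster.store(far, i, data);
            blocks.add(new ToyLocatedBlock(i, Arrays.asList(near, far)));
        }
        nameNode.files.put("/demo.txt", blocks);
        byte[] bytes = readFile(nameNode, cluster, "/demo.txt", 2);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "hello hdfs reads"
    }
}
```

In this sketch the NameNode stand-in is consulted only for block locations, in batches of `prefetch` blocks, while every byte of file data comes from a DataNode stand-in. That mirrors the division of labor described in the steps above.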
##Main Data Structures/Relevant Classes##
- DistributedFileSystem - Implementation of the abstract FileSystem for the DFS system. This object is the way end-user code interacts with a Hadoop DistributedFileSystem.
  - public FSDataInputStream open(Path f, final int bufferSize) throws IOException
- DFSClient - DFSClient can connect to a Hadoop Filesystem and perform basic file tasks. It uses the ClientProtocol to communicate with a NameNode daemon, and connects directly to DataNodes to read/write block data. Hadoop DFS users should obtain an instance of DistributedFileSystem, which uses DFSClient to handle filesystem tasks.
  - public DFSInputStream open(String src, int buffersize, boolean verifyChecksum) throws IOException - Creates an input stream that obtains a node list from the NameNode and then reads from all the right places. Creates an inner subclass of InputStream that does the right out-of-band work.
- ClientNamenodeProtocol (still need to find exactly which methods used)
- DfsClientConfig - Contains client configuration properties (more detail needed)
- FSDataInputStream - Object client uses to communicate with DataNodes during reads
  - public int read(long position, byte[] buffer, int offset, int length) throws IOException
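To illustrate what a positional read with this parameter list has to do, here is a minimal sketch that maps a file offset onto a block index and an offset within that block. It is not the FSDataInputStream implementation: `PositionalReadSketch` is a hypothetical class, blocks are plain in-memory byte arrays, and every block except the last is assumed to be exactly `blockSize` bytes long.

```java
import java.util.List;

// Hypothetical helper showing the offset arithmetic behind a positional read.
final class PositionalReadSketch {

    // Copies up to `length` bytes, starting at file offset `position`, into
    // buffer[offset..]. Returns the number of bytes copied, or -1 at end of file.
    static int read(List<byte[]> blocks, int blockSize,
                    long position, byte[] buffer, int offset, int length) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        if (position < 0 || offset < 0 || length < 0 || length > buffer.length - offset) {
            throw new IndexOutOfBoundsException("bad position, offset, or length");
        }
        long fileLength = 0;
        for (byte[] block : blocks) {
            fileLength += block.length;
        }
        if (length == 0) {
            return 0;
        }
        if (position >= fileLength) {
            return -1;
        }
        int copied = 0;
        while (copied < length && position < fileLength) {
            int blockIndex = (int) (position / blockSize); // which block holds this offset
            int within = (int) (position % blockSize);     // offset inside that block
            byte[] block = blocks.get(blockIndex);
            int n = Math.min(length - copied, block.length - within);
            System.arraycopy(block, within, buffer, offset + copied, n);
            copied += n;
            position += n; // may cross into the next block on the next iteration
        }
        return copied;
    }
}
```

With blocks "abcd", "efgh", "ij" and a block size of 4, reading 5 bytes at position 2 spans the first two blocks and yields "cdefg". In the real client, each block boundary crossed would also mean switching to a DataNode that holds the next block.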