Client Writes
##Overview##
HDFS employs a "write-once-read-many" model to optimize for throughput. A file in HDFS is written once and has strictly one writer at a time. HDFS also supports append(), which lets a client add data to the end of an existing file. This section first describes how a client performs the initial write and then how a client may append to a pre-existing file.
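The single-writer rule can be illustrated with a small toy model. This is only a sketch of the idea, not HDFS code: the class SingleWriterSketch, its methods, and the client identifiers are all hypothetical names introduced for illustration, and the real NameNode bookkeeping is more involved.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of "strictly one writer at a time" (hypothetical, not Hadoop code). */
class SingleWriterSketch {
    // Maps each path to the id of the client currently writing to it.
    private final Map<String, String> writerByPath = new HashMap<>();

    /** Returns true if clientId holds the write on path after this call. */
    synchronized boolean tryOpenForWrite(String path, String clientId) {
        String holder = writerByPath.putIfAbsent(path, clientId);
        return holder == null || holder.equals(clientId);
    }

    /** Releases the write, but only if clientId is the current holder. */
    synchronized void close(String path, String clientId) {
        writerByPath.remove(path, clientId);
    }

    public static void main(String[] args) {
        SingleWriterSketch files = new SingleWriterSketch();
        System.out.println("A opens: " + files.tryOpenForWrite("/data/log", "clientA"));
        System.out.println("B opens: " + files.tryOpenForWrite("/data/log", "clientB"));
        files.close("/data/log", "clientA");
        System.out.println("B retries: " + files.tryOpenForWrite("/data/log", "clientB"));
    }
}
```

In this model a second client is refused until the first one closes the file, which is the behavior the overview describes.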
##Major Use Cases##
###Creating a File###
- To write data to HDFS, the client must first create a file. To do so, it calls create() on an instance of the DistributedFileSystem class and provides a pathname. The full create() signature also takes an FsPermission object specifying the permissions for the file.
- The DistributedFileSystem object uses a DFSClient object to ask the NameNode to create the new file. However, the NameNode is not contacted immediately. Rather, the data the client writes is first staged in a temporary local file instead of being streamed out through the FSDataOutputStream.
- When the accumulated data exceeds one HDFS block (typically 64 MB, although the client can pass a different block size to create()), a DataStreamer asks the NameNode to insert the filename into the file system hierarchy and allocate a data block for it.
- The client then flushes the locally staged block to the allocated data block through the FSDataOutputStream. As more data is written to the FSDataOutputStream, the NameNode allocates more blocks for the file.
- When the client has completed the write, it calls close() on the FSDataOutputStream.
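The staging behavior in the steps above can be sketched with a toy model that uses only the Java standard library. Everything here is an assumption made for illustration: StagedWriterSketch is a hypothetical class, an in-memory buffer stands in for the temporary local file, and "flushing a block" simply records the bytes instead of contacting a NameNode or DataNode.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Toy model of client-side staging (hypothetical, not Hadoop code). */
class StagedWriterSketch {
    private final int blockSize;
    // Stand-in for the temporary local file that holds unflushed data.
    private final ByteArrayOutputStream staging = new ByteArrayOutputStream();
    // Stand-in for blocks that were allocated and flushed.
    private final List<byte[]> flushedBlocks = new ArrayList<>();
    private boolean closed = false;

    StagedWriterSketch(int blockSize) {
        if (blockSize <= 0) throw new IllegalArgumentException("blockSize must be positive");
        this.blockSize = blockSize;
    }

    void write(byte[] data) {
        if (closed) throw new IllegalStateException("stream is closed");
        int offset = 0;
        while (offset < data.length) {
            // A full staged block is flushed only once more data arrives,
            // i.e. when the accumulated data would exceed one block.
            if (staging.size() == blockSize) flushBlock();
            int n = Math.min(blockSize - staging.size(), data.length - offset);
            staging.write(data, offset, n);
            offset += n;
        }
    }

    /** Flushes whatever is still staged, then marks the stream closed. */
    void close() {
        if (closed) return;
        if (staging.size() > 0) flushBlock();
        closed = true;
    }

    private void flushBlock() {
        flushedBlocks.add(staging.toByteArray());
        staging.reset();
    }

    int flushedBlockCount() { return flushedBlocks.size(); }

    int blockLength(int index) { return flushedBlocks.get(index).length; }

    public static void main(String[] args) {
        StagedWriterSketch out = new StagedWriterSketch(64); // tiny block size for the demo
        out.write(new byte[150]);
        System.out.println("blocks flushed before close: " + out.flushedBlockCount());
        out.close();
        System.out.println("blocks flushed after close: " + out.flushedBlockCount());
    }
}
```

The design point the sketch tries to capture is that no block is handed off until the client has more than one block's worth of data, and that close() is what pushes out the final partial block.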
###Append Writes (still researching/in progress)###
- To append data to an existing file, the client calls append() on an instance of the DistributedFileSystem class and provides a pathname.
- The call contacts the NameNode to verify that the file exists and is open for writing, and to request the block locations (the addresses of the DataNodes that hold copies of the blocks allocated to the file). If the file exists, the call returns an FSDataOutputStream to which the client writes the data to append.
- As more data is written to the FSDataOutputStream, the NameNode allocates more blocks for the file.
- When the client has completed the write, it calls close() on the FSDataOutputStream.
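As a rough illustration of when an append causes new blocks to be allocated, the arithmetic can be sketched as follows. This assumes that appended data first fills the free space in the file's last partial block before any new block is requested; that assumption, the class name AppendBlockMath, and its methods are illustrative and not taken from the Hadoop source.

```java
/** Block-count arithmetic for an append (hypothetical helper, not Hadoop code). */
class AppendBlockMath {
    /** Number of blocks needed to hold len bytes with the given block size. */
    static long blocksFor(long len, long blockSize) {
        return (len + blockSize - 1) / blockSize; // ceiling division
    }

    /**
     * Additional blocks needed when appendLen bytes are appended to a file
     * that already holds existingLen bytes, assuming the last partial block
     * is filled before a new block is allocated.
     */
    static long newBlocksNeeded(long existingLen, long appendLen, long blockSize) {
        if (existingLen < 0 || appendLen < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("lengths must be >= 0 and blockSize > 0");
        }
        return blocksFor(existingLen + appendLen, blockSize) - blocksFor(existingLen, blockSize);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 100 MB file with 64 MB blocks, then a 50 MB append.
        System.out.println(newBlocksNeeded(100 * mb, 50 * mb, 64 * mb));
    }
}
```

Under this assumption a small append may need no new block at all, while an append to a file whose last block is already full needs one immediately.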
##Main Data Structures/Relevant Classes##
- DistributedFileSystem - Implementation of the abstract FileSystem class for HDFS. This object is how end-user code interacts with a Hadoop distributed file system.
  - public FSDataOutputStream create(Path f, FsPermission permission, boolean overwrite, int bufferSize, short replication, long blockSize, Progressable progress) throws IOException
- DFSClient - DFSClient can connect to a Hadoop file system and perform basic file tasks. It uses the ClientProtocol to communicate with a NameNode daemon, and connects directly to DataNodes to read/write block data. Hadoop DFS users should obtain an instance of DistributedFileSystem, which uses DFSClient to handle file system tasks.
  - public DFSOutputStream create(String src, FsPermission permission, EnumSet<CreateFlag> flag, boolean createParent, short replication, long blockSize, Progressable progress, int buffersize, ChecksumOpt checksumOpt) throws IOException
    Creates a new DFS file with the specified block replication and write-progress reporting, and returns an output stream for writing into the file.
- ClientNamenodeProtocol (still need to find exactly which methods are used)
- DfsClientConfig - Contains client configuration properties (more detail needed)
- FSDataOutputStream - Object the client uses to send data to DataNodes during writes
##Reference materials##
- http://data-flair.training/blogs/hadoop-hdfs-data-read-and-write-operations/
- https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
- http://itm-vm.shidler.hawaii.edu/HDFS/ArchDocCommunication.html