
HDFS Read/Write Process Analysis


1. Opening a File

1.1 The Client

To open a file, an HDFS client calls DistributedFileSystem.open(Path f, int bufferSize), which is implemented as:

public FSDataInputStream open(Path f, int bufferSize) throws IOException {
  return new DFSClient.DFSDataInputStream(
      dfs.open(getPathName(f), bufferSize, verifyChecksum, statistics));
}

Here dfs is DistributedFileSystem's member variable of type DFSClient. Its open function creates a DFSInputStream(src, buffersize, verifyChecksum) and returns it.

The DFSInputStream constructor calls openInfo, which obtains from the NameNode the information about the blocks that make up the file being opened:

synchronized void openInfo() throws IOException {
  LocatedBlocks newInfo = callGetBlockLocations(namenode, src, 0, prefetchSize);
  this.locatedBlocks = newInfo;
  this.currentNode = null;
}

private static LocatedBlocks callGetBlockLocations(ClientProtocol namenode,
    String src, long start, long length) throws IOException {
  return namenode.getBlockLocations(src, start, length);
}

LocatedBlocks essentially holds a List<LocatedBlock> blocks, where each LocatedBlock records:

  • Block b: the block itself
  • long offset: the block's offset within the file
  • DatanodeInfo[] locs: the DataNodes on which this block resides

The namenode.getBlockLocations call above is an RPC that ultimately lands in NameNode.getBlockLocations.

1.2 The NameNode

NameNode.getBlockLocations is implemented as:

public LocatedBlocks getBlockLocations(String src,
                                       long offset,
                                       long length) throws IOException {
  return namesystem.getBlockLocations(getClientMachine(),
                                      src, offset, length);
}

namesystem is a member variable of NameNode, of type FSNamesystem, and holds the NameNode's namespace tree. One of its important members is FSDirectory dir.

FSDirectory has nothing to do with Lucene's class of the same name. Its main member is FSImage fsImage, which reads and writes the on-disk fsimage file; FSImage in turn has a member FSEditLog editLog, which reads and writes the on-disk edits file. The relationship between these two files was explained in the previous article.

FSDirectory also has an important member, INodeDirectoryWithQuota rootDir, whose parent class INodeDirectory is implemented as:

public class INodeDirectory extends INode {
  ……
  private List<INode> children;
  ……

So INodeDirectory is itself an INode containing a list of child INodes: a child that is a directory is again an INodeDirectory, while a child that is a file is an INodeFile, whose member BlockInfo blocks[] records the blocks making up that file. This is clearly a tree structure.
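To make the tree concrete, here is a minimal sketch of how a path could be resolved against this structure. INode, INodeDirectory and INodeFile are the classes quoted above; the resolve helper and the getChild lookup are hypothetical simplifications for illustration, not the real FSDirectory code (which walks pre-split byte[][] path components):

// Hypothetical sketch: resolving a path such as "/user/foo/bar.txt" against the INode tree.
// getChild() is an assumed helper that searches the children list shown above.
static INodeFile resolve(INodeDirectory root, String path) {
  INode cur = root;
  for (String component : path.split("/")) {
    if (component.isEmpty()) continue;                  // skip the leading "/"
    if (!(cur instanceof INodeDirectory)) return null;  // hit a file mid-path
    cur = ((INodeDirectory) cur).getChild(component);   // look up the next component
    if (cur == null) return null;                       // path does not exist
  }
  return (cur instanceof INodeFile) ? (INodeFile) cur : null;
}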

FSNamesystem.getBlockLocations is as follows:

public LocatedBlocks getBlockLocations(String src, long offset, long length,
    boolean doAccessTime) throws IOException {
  final LocatedBlocks ret = getBlockLocationsInternal(src, dir.getFileINode(src),
      offset, length, Integer.MAX_VALUE, doAccessTime);
  return ret;
}

dir.getFileINode(src) resolves the path name against the namespace tree and returns the INodeFile holding the INode information of the file being opened.

getBlockLocationsInternal is implemented as:

private synchronized LocatedBlocks getBlockLocationsInternal(String src,
                                                     INodeFile inode,
                                                     long offset,
                                                     long length,
                                                     int nrBlocksToReturn,
                                                     boolean doAccessTime)
                                                     throws IOException {
  // Get this file's block information.
  Block[] blocks = inode.getBlocks();
  List<LocatedBlock> results = new ArrayList<LocatedBlock>(blocks.length);

  // Find the blocks covered by the range [offset, offset + length).
  int curBlk = 0;
  long curPos = 0, blkSize = 0;
  int nrBlocks = (blocks[0].getNumBytes() == 0) ? 0 : blocks.length;
  for (curBlk = 0; curBlk < nrBlocks; curBlk++) {
    blkSize = blocks[curBlk].getNumBytes();
    if (curPos + blkSize > offset) {
      // offset lies between curPos and curPos + blkSize, so curBlk is the block containing offset.
      break;
    }
    curPos += blkSize;
  }

  long endOff = offset + length;

  // Walk the blocks starting at curBlk until curPos passes endOff.
  do {
    int numNodes = blocksMap.numNodes(blocks[curBlk]);
    int numCorruptNodes = countNodes(blocks[curBlk]).corruptReplicas();
    int numCorruptReplicas = corruptReplicas.numCorruptReplicas(blocks[curBlk]);
    boolean blockCorrupt = (numCorruptNodes == numNodes);
    int numMachineSet = blockCorrupt ? numNodes :
                          (numNodes - numCorruptNodes);

    // Collect the DataNodes holding this block, keeping only uncorrupted replicas in machineSet.
    DatanodeDescriptor[] machineSet = new DatanodeDescriptor[numMachineSet];
    if (numMachineSet > 0) {
      numNodes = 0;
      for (Iterator<DatanodeDescriptor> it =
          blocksMap.nodeIterator(blocks[curBlk]); it.hasNext();) {
        DatanodeDescriptor dn = it.next();
        boolean replicaCorrupt = corruptReplicas.isReplicaCorrupt(blocks[curBlk], dn);
        if (blockCorrupt || (!blockCorrupt && !replicaCorrupt))
          machineSet[numNodes++] = dn;
      }
    }

    // Build a LocatedBlock from this machineSet and the current block.
    results.add(new LocatedBlock(blocks[curBlk], machineSet, curPos,
                blockCorrupt));
    curPos += blocks[curBlk].getNumBytes();
    curBlk++;
  } while (curPos < endOff
        && curBlk < blocks.length
        && results.size() < nrBlocksToReturn);

  // Wrap the LocatedBlock list in a LocatedBlocks object and return it.
  return inode.createLocatedBlocks(results);
}

1.3 The Client

The LocatedBlocks object obtained from the NameNode via RPC becomes a member of a DFSInputStream, which is then wrapped in an FSDataInputStream and handed back to the user.
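For orientation, this is roughly how user code drives the path just described. It is a hedged sketch against the public FileSystem API; the path is invented:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);  // a DistributedFileSystem when configured for HDFS
    // open() performs the RPC chain above and caches the file's LocatedBlocks.
    FSDataInputStream in = fs.open(new Path("/tmp/example.txt")); // hypothetical path
    byte[] buf = new byte[4096];
    // The positional read overload analyzed in section 2 below.
    int n = in.read(0L, buf, 0, buf.length);
    System.out.println("read " + n + " bytes");
    in.close();
    fs.close();
  }
}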


2. Reading a File

2.1 The Client

To read the file, the client calls FSDataInputStream.read(long position, byte[] buffer, int offset, int length) on the stream obtained when the file was opened.

FSDataInputStream delegates to the read(long position, byte[] buffer, int offset, int length) of the DFSInputStream it wraps, implemented as:

public int read(long position, byte[] buffer, int offset, int length)
  throws IOException {
  long filelen = getFileLength();
  int realLen = length;
  if ((position + length) > filelen) {
    realLen = (int)(filelen - position);
  }

  // First get the list of blocks covering [position, position + length).
  // For example, with 64MB blocks, reading 128MB starting at offset 100MB
  // involves the 2nd, 3rd and 4th blocks of the file.
  List<LocatedBlock> blockRange = getBlockRange(position, realLen);
  int remaining = realLen;

  // Read the needed range from each block in turn.
  // Continuing the example: 28MB of block 2 starting 36MB into it,
  // all 64MB of block 3, and the first 36MB of block 4 -- 128MB in total.
  for (LocatedBlock blk : blockRange) {
    long targetStart = position - blk.getStartOffset();
    long bytesToRead = Math.min(remaining, blk.getBlockSize() - targetStart);
    fetchBlockByteRange(blk, targetStart,
                        targetStart + bytesToRead - 1, buffer, offset);
    remaining -= bytesToRead;
    position += bytesToRead;
    offset += bytesToRead;
  }
  assert remaining == 0 : "Wrong number of bytes read.";
  if (stats != null) {
    stats.incrementBytesRead(realLen);
  }
  return realLen;
}

getBlockRange is as follows:

private synchronized List<LocatedBlock> getBlockRange(long offset,
                                                      long length)
                                                    throws IOException {
  List<LocatedBlock> blockRange = new ArrayList<LocatedBlock>();

  // Look up, in the cached locatedBlocks list, the position of the block containing offset.
  int blockIdx = locatedBlocks.findBlock(offset);
  if (blockIdx < 0) { // block is not cached
    blockIdx = LocatedBlocks.getInsertIndex(blockIdx);
  }
  long remaining = length;
  long curOff = offset;
  while(remaining > 0) {
    LocatedBlock blk = null;

    // Fetch the block at position blockIdx.
    if(blockIdx < locatedBlocks.locatedBlockCount())
      blk = locatedBlocks.get(blockIdx);

    // If blk is null the block is not in the cache: ask the NameNode for the
    // missing blocks directly and insert them into the cache.
    if (blk == null || curOff < blk.getStartOffset()) {
      LocatedBlocks newBlocks;
      newBlocks = callGetBlockLocations(namenode, src, curOff, remaining);
      locatedBlocks.insertRange(blockIdx, newBlocks.getLocatedBlocks());
      continue;
    }

    // The block was found; add it to the result set.
    blockRange.add(blk);
    long bytesRead = blk.getStartOffset() + blk.getBlockSize() - curOff;
    remaining -= bytesRead;
    curOff += bytesRead;

    // Move on to the next block.
    blockIdx++;
  }
  return blockRange;
}

fetchBlockByteRange is implemented as:

private void fetchBlockByteRange(LocatedBlock block, long start,
                                 long end, byte[] buf, int offset) throws IOException {
  Socket dn = null;
  int numAttempts = block.getLocations().length;

  // This loop retries the read against another replica after a failure.
  while (dn == null && numAttempts-- > 0 ) {

    // Choose a DataNode to read from.
    DNAddrPair retval = chooseDataNode(block);
    DatanodeInfo chosenNode = retval.info;
    InetSocketAddress targetAddr = retval.addr;
    BlockReader reader = null;
    try {
      // Open a socket connection to the DataNode.
      dn = socketFactory.createSocket();
      dn.connect(targetAddr, socketTimeout);
      dn.setSoTimeout(socketTimeout);
      int len = (int) (end - start + 1);

      // Build a reader over the established socket for pulling the data from the DataNode.
      reader = BlockReader.newBlockReader(dn, src,
                                          block.getBlock().getBlockId(),
                                          block.getBlock().getGenerationStamp(),
                                          start, len, buffersize,
                                          verifyChecksum, clientName);
      // Read the data.
      int nread = reader.readAll(buf, offset, len);
      return;
    } catch (IOException e) {
      // The read failed: mark this DataNode as dead so the next attempt picks another replica.
      addToDeadNodes(chosenNode);
    } finally {
      IOUtils.closeStream(reader);
      IOUtils.closeSocket(dn);
      dn = null;
    }
  }
}

BlockReader.newBlockReader is implemented as:

public static BlockReader newBlockReader( Socket sock, String file,
                                   long blockId,
                                   long genStamp,
                                   long startOffset, long len,
                                   int bufferSize, boolean verifyChecksum,
                                   String clientName)
                                   throws IOException {
  // Open an output stream over the socket and send the read request to the DataNode.
  DataOutputStream out = new DataOutputStream(
    new BufferedOutputStream(NetUtils.getOutputStream(sock, HdfsConstants.WRITE_TIMEOUT)));
  out.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
  out.write( DataTransferProtocol.OP_READ_BLOCK );
  out.writeLong( blockId );
  out.writeLong( genStamp );
  out.writeLong( startOffset );
  out.writeLong( len );
  Text.writeString(out, clientName);
  out.flush();

  // Open an input stream over the socket for reading the data back from the DataNode.
  DataInputStream in = new DataInputStream(
      new BufferedInputStream(NetUtils.getInputStream(sock),
                              bufferSize));
  DataChecksum checksum = DataChecksum.newDataChecksum( in );
  long firstChunkOffset = in.readLong();

  // Build a reader around the input stream, which will be used to pull the data.
  return new BlockReader( file, blockId, in, checksum, verifyChecksum,
                          startOffset, firstChunkOffset, sock );
}

BlockReader's readAll function then reads the data off the DataInputStream created above.
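Read back from the writes above, the request frame the client sends for a block read is laid out as follows. This is just the field order from newBlockReader restated, not additional protocol detail:

// OP_READ_BLOCK request frame, in the order newBlockReader writes it:
//   short  DATA_TRANSFER_VERSION   protocol version
//   byte   OP_READ_BLOCK           opcode
//   long   blockId                 which block to read
//   long   genStamp                the block's generation stamp
//   long   startOffset             byte offset within the block
//   long   len                     number of bytes requested
//   Text   clientName              length-prefixed client identifier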

2.2 The DataNode

When a DataNode starts up, startDataNode is invoked. The part of it relevant to serving reads is:

void startDataNode(Configuration conf,
                   AbstractList<File> dataDirs
                   ) throws IOException {
  ……
  // Create a ServerSocket and a DataXceiverServer to listen for client connections.
  ServerSocket ss = (socketWriteTimeout > 0) ?
        ServerSocketChannel.open().socket() : new ServerSocket();
  Server.bind(ss, socAddr, 0);
  ss.setReceiveBufferSize(DEFAULT_DATA_SOCKET_SIZE);
  // adjust machine name with the actual port
  tmpPort = ss.getLocalPort();
  selfAddr = new InetSocketAddress(ss.getInetAddress().getHostAddress(),
                                   tmpPort);
  this.dnRegistration.setName(machineName + ":" + tmpPort);
  this.threadGroup = new ThreadGroup("dataXceiverServer");
  this.dataXceiverServer = new Daemon(threadGroup,
      new DataXceiverServer(ss, conf, this));
  this.threadGroup.setDaemon(true); // auto destroy when empty
  ……
}

DataXceiverServer.run() is as follows:

public void run() {
  while (datanode.shouldRun) {
      // Accept a client connection.
      Socket s = ss.accept();
      s.setTcpNoDelay(true);
      // Spawn a DataXceiver thread to serve the connection.
      new Daemon(datanode.threadGroup,
          new DataXceiver(s, datanode, this)).start();
  }
  try {
    ss.close();
  } catch (IOException ie) {
    LOG.warn(datanode.dnRegistration + ":DataXceiveServer: "
                          + StringUtils.stringifyException(ie));
  }
}

DataXceiver.run() is as follows:

public void run() {
  DataInputStream in = null;
  try {
    // Open an input stream to read the command sent by the client.
    in = new DataInputStream(
        new BufferedInputStream(NetUtils.getInputStream(s),
                                SMALL_BUFFER_SIZE));
    short version = in.readShort();
    boolean local = s.getInetAddress().equals(s.getLocalAddress());
    byte op = in.readByte();
    // Make sure the xciver count is not exceeded
    int curXceiverCount = datanode.getXceiverCount();
    long startTime = DataNode.now();
    switch ( op ) {
    // Read request.
    case DataTransferProtocol.OP_READ_BLOCK:
      // Perform the actual read.
      readBlock( in );
      datanode.myMetrics.readBlockOp.inc(DataNode.now() - startTime);
      if (local)
        datanode.myMetrics.readsFromLocalClient.inc();
      else
        datanode.myMetrics.readsFromRemoteClient.inc();
      break;
    // Write request.
    case DataTransferProtocol.OP_WRITE_BLOCK:
      // Perform the actual write.
      writeBlock( in );
      datanode.myMetrics.writeBlockOp.inc(DataNode.now() - startTime);
      if (local)
        datanode.myMetrics.writesFromLocalClient.inc();
      else
        datanode.myMetrics.writesFromRemoteClient.inc();
      break;
    // Other opcodes.
    ……
    }
  } catch (Throwable t) {
    LOG.error(datanode.dnRegistration + ":DataXceiver", t);
  } finally {
    IOUtils.closeStream(in);
    IOUtils.closeSocket(s);
    dataXceiverServer.childSockets.remove(s);
  }
}

readBlock is implemented as:

private void readBlock(DataInputStream in) throws IOException {
  // Parse the read request.
  long blockId = in.readLong();
  Block block = new Block( blockId, 0, in.readLong());
  long startOffset = in.readLong();
  long length = in.readLong();
  String clientName = Text.readString(in);

  // Open an output stream for sending data back to the client.
  OutputStream baseStream = NetUtils.getOutputStream(s,
      datanode.socketWriteTimeout);
  DataOutputStream out = new DataOutputStream(
               new BufferedOutputStream(baseStream, SMALL_BUFFER_SIZE));

  // Build a BlockSender, which reads the local block data and sends it to the client.
  // BlockSender has a member InputStream blockIn for reading the local block.
  BlockSender blockSender = new BlockSender(block, startOffset, length,
          true, true, false, datanode, clientTraceFmt);
  try {
    out.writeShort(DataTransferProtocol.OP_STATUS_SUCCESS); // send op status

    // Stream the data to the client.
    long read = blockSender.sendBlock(out, baseStream, null);
    ……
  } finally {
    IOUtils.closeStream(out);
    IOUtils.closeStream(blockSender);
  }
}

3. Writing a File

The following walks through uploading a file to HDFS.

3.1 The Client

Uploading a file to HDFS normally starts with DistributedFileSystem.create, implemented as:

public FSDataOutputStream create(Path f, FsPermission permission,
    boolean overwrite,
    int bufferSize, short replication, long blockSize,
    Progressable progress) throws IOException {
  return new FSDataOutputStream
     (dfs.create(getPathName(f), permission,
                 overwrite, replication, blockSize, progress, bufferSize),
      statistics);
}

This ultimately produces an FSDataOutputStream through which data is written into the newly created file. The member dfs is of type DFSClient, whose create function is:

public OutputStream create(String src,
                           FsPermission permission,
                           boolean overwrite,
                           short replication,
                           long blockSize,
                           Progressable progress,
                           int buffersize
                           ) throws IOException {
  checkOpen();
  if (permission == null) {
    permission = FsPermission.getDefault();
  }
  FsPermission masked = permission.applyUMask(FsPermission.getUMask(conf));
  OutputStream result = new DFSOutputStream(src, masked,
      overwrite, replication, blockSize, progress, buffersize,
      conf.getInt("io.bytes.per.checksum", 512));
  leasechecker.put(src, result);
  return result;
}

This constructs a DFSOutputStream, whose constructor creates the file on the NameNode through an RPC call to create. The constructor also does one other important thing: it calls streamer.start(), launching the thread that drives the write pipeline. We will analyze it in detail when we look at writing the data.

DFSOutputStream(String src, FsPermission masked, boolean overwrite,
    short replication, long blockSize, Progressable progress,
    int buffersize, int bytesPerChecksum) throws IOException {
  this(src, blockSize, progress, bytesPerChecksum);
  computePacketChunkSize(writePacketSize, bytesPerChecksum);
  try {
    namenode.create(
        src, masked, clientName, overwrite, replication, blockSize);
  } catch(RemoteException re) {
    throw re.unwrapRemoteException(AccessControlException.class,
                                   QuotaExceededException.class);
  }
  streamer.start();
}


3.2 The NameNode

NameNode.create calls namesystem.startFile, which in turn calls startFileInternal, implemented as:

private synchronized void startFileInternal(String src,
                                            PermissionStatus permissions,
                                            String holder,
                                            String clientMachine,
                                            boolean overwrite,
                                            boolean append,
                                            short replication,
                                            long blockSize
                                            ) throws IOException {
  ......
  // Create a new file in the under-construction state, with no data blocks yet.
  long genstamp = nextGenerationStamp();
  INodeFileUnderConstruction newNode = dir.addFile(src, permissions,
      replication, blockSize, holder, clientMachine, clientNode, genstamp);
  ......
}

3.3 The Client

Now the client writes data into the newly created file, normally through FSDataOutputStream's write, which eventually reaches DFSOutputStream.writeChunk (listed below, after a description of the write pipeline).

By design, HDFS writes block data through a pipeline: the data is split into packets. With a replication factor of three, writing to DataNodes 1, 2 and 3 proceeds as follows (a toy schedule of this overlap appears right after the list):

  • First the client writes packet 1 to DataNode 1.
  • DataNode 1 then forwards packet 1 to DataNode 2, while the client writes packet 2 to DataNode 1.
  • DataNode 2 forwards packet 1 to DataNode 3, while the client writes packet 3 to DataNode 1 and DataNode 1 forwards packet 2 to DataNode 2.
  • Packets keep flowing down the line this way until all the data has been written and replicated.
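To make the overlap concrete, here is a toy model of the pipeline schedule; it is purely illustrative and uses no Hadoop code. Under the simplifying assumption that each stage forwards one packet per tick, packet p reaches DataNode d at tick p + d - 1, so P packets drain through D DataNodes in P + D - 1 ticks rather than the P × D a strictly sequential scheme would need:

public class PipelineToy {
  public static void main(String[] args) {
    int packets = 4, datanodes = 3;
    for (int tick = 1; tick <= packets + datanodes - 1; tick++) {
      StringBuilder line = new StringBuilder("tick " + tick + ":");
      for (int d = 1; d <= datanodes; d++) {
        int p = tick - d + 1;  // the packet arriving at DataNode d on this tick
        if (p >= 1 && p <= packets)
          line.append("  DN").append(d).append("<-pkt").append(p);
      }
      System.out.println(line);
    }
  }
}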

protected synchronized void writeChunk(byte[] b, int offset, int len, byte[] checksum)
                                                      throws IOException {
    // Create a packet and write the chunk into it.
    currentPacket = new Packet(packetSize, chunksPerPacket,
                               bytesCurBlock);
    currentPacket.writeChecksum(checksum, 0, cklen);
    currentPacket.writeData(b, offset, len);
    currentPacket.numChunks++;
    bytesCurBlock += len;

    // If the packet is full, queue it for sending.
    if (currentPacket.numChunks == currentPacket.maxChunks ||
        bytesCurBlock == blockSize) {
        ......
        dataQueue.addLast(currentPacket);
        // Wake the thread waiting on dataQueue, i.e. the DataStreamer.
        dataQueue.notifyAll();
        currentPacket = null;
        ......
    }
}


DataStreamer's run function is as follows:

public void run() {
  while (!closed && clientRunning) {
    Packet one = null;
    synchronized (dataQueue) {

      // Wait while the queue is empty.
      while ((!closed && !hasError && clientRunning
             && dataQueue.size() == 0) || doSleep) {
        try {
          dataQueue.wait(1000);
        } catch (InterruptedException e) {
        }
        doSleep = false;
      }
      try {
        // Take the first packet off the queue.
        one = dataQueue.getFirst();
        long offsetInBlock = one.offsetInBlock;

        // Have the NameNode allocate a block, and open an output stream to it.
        if (blockStream == null) {
          nodes = nextBlockOutputStream(src);
          response = new ResponseProcessor(nodes);
          response.start();
        }
        ByteBuffer buf = one.getBuffer();

        // Move the packet from dataQueue to ackQueue, where it waits to be acknowledged.
        dataQueue.removeFirst();
        dataQueue.notifyAll();
        synchronized (ackQueue) {
          ackQueue.addLast(one);
          ackQueue.notifyAll();
        }

        // Write the packet into the DataNode's block through the stream just opened.
        blockStream.write(buf.array(), buf.position(), buf.remaining());
        if (one.lastPacketInBlock) {
          blockStream.writeInt(0); // marks the end of the block
        }
        blockStream.flush();
      } catch (Throwable e) {
      }
    }
    ......
}

An important function here is nextBlockOutputStream, implemented as:

private DatanodeInfo[] nextBlockOutputStream(String client) throws IOException {
  LocatedBlock lb = null;
  boolean retry = false;
  DatanodeInfo[] nodes;
  int count = conf.getInt("dfs.client.block.write.retries", 3);
  boolean success;
  do {
    ......
    // Ask the NameNode to allocate a block and choose its target DataNodes.
    lb = locateFollowingBlock(startTime);
    block = lb.getBlock();
    nodes = lb.getLocations();

    // Open the output stream towards the DataNodes.
    success = createBlockOutputStream(nodes, clientName, false);
    ......
  } while (retry && --count >= 0);
  return nodes;
}

locateFollowingBlock issues the RPC namenode.addBlock(src, clientName).

3.4 The NameNode

NameNode.addBlock is implemented as:

public LocatedBlock addBlock(String src,
                             String clientName) throws IOException {
  LocatedBlock locatedBlock = namesystem.getAdditionalBlock(src, clientName);
  return locatedBlock;
}

FSNamesystem.getAdditionalBlock is implemented as:

public LocatedBlock getAdditionalBlock(String src,
                                       String clientName
                                       ) throws IOException {
  long fileLength, blockSize;
  int replication;
  DatanodeDescriptor clientNode = null;
  Block newBlock = null;
  ......

  // Choose the target DataNodes for the new block.
  DatanodeDescriptor targets[] = replicator.chooseTarget(replication,
                                                         clientNode,
                                                         null,
                                                         blockSize);
  ......

  // Get the INodes of every path component; the last one is the INode of the
  // newly added file, still in the under-construction state.
  INode[] pathINodes = dir.getExistingPathINodes(src);
  int inodesLen = pathINodes.length;
  INodeFileUnderConstruction pendingFile = (INodeFileUnderConstruction)
                                              pathINodes[inodesLen - 1];

  // Allocate a block for the file, and record which DataNodes it should be written to.
  newBlock = allocateBlock(src, pathINodes);
  pendingFile.setTargets(targets);
  ......
  return new LocatedBlock(newBlock, targets, fileLength);
}

3.5 The Client

With the block and its DataNodes allocated, createBlockOutputStream sets up the write:

private boolean createBlockOutputStream(DatanodeInfo[] nodes, String client,
                  boolean recoveryFlag) {
      // Open a socket to the first DataNode.
      InetSocketAddress target = NetUtils.createSocketAddr(nodes[0].getName());
      s = socketFactory.createSocket();
      int timeoutValue = 3000 * nodes.length + socketTimeout;
      s.connect(target, timeoutValue);
      s.setSoTimeout(timeoutValue);
      s.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);
      long writeTimeout = HdfsConstants.WRITE_TIMEOUT_EXTENSION * nodes.length +
                          datanodeWriteTimeout;
      DataOutputStream out = new DataOutputStream(
          new BufferedOutputStream(NetUtils.getOutputStream(s, writeTimeout),
                                   DataNode.SMALL_BUFFER_SIZE));
      blockReplyStream = new DataInputStream(NetUtils.getInputStream(s));

      // Send the write request.
      out.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
      out.write( DataTransferProtocol.OP_WRITE_BLOCK );
      out.writeLong( block.getBlockId() );
      out.writeLong( block.getGenerationStamp() );
      out.writeInt( nodes.length );
      out.writeBoolean( recoveryFlag );
      Text.writeString( out, client );
      out.writeBoolean(false);
      out.writeInt( nodes.length - 1 );

      // Note: this loop starts at 1, not 0. It sends the info of every DataNode
      // except the first to the first DataNode, which uses it to forward the
      // data to the remaining two.
      for (int i = 1; i < nodes.length; i++) {
        nodes[i].write(out);
      }
      checksum.writeHeader( out );
      out.flush();
      firstBadLink = Text.readString(blockReplyStream);
      if (firstBadLink.length() != 0) {
        throw new IOException("Bad connect ack with firstBadLink " + firstBadLink);
      }
      blockStream = out;
}

Once DataStreamer.run has created this output stream, the client writes the data to the DataNode through blockStream.write.

3.6 The DataNode

When the DataNode's DataXceiver receives the DataTransferProtocol.OP_WRITE_BLOCK opcode, it calls writeBlock:

private void writeBlock(DataInputStream in) throws IOException {
  DatanodeInfo srcDataNode = null;

  // Parse the request header.
  Block block = new Block(in.readLong(),
      dataXceiverServer.estimateBlockSize, in.readLong());
  int pipelineSize = in.readInt(); // num of datanodes in entire pipeline
  boolean isRecovery = in.readBoolean(); // is this part of recovery?
  String client = Text.readString(in); // working on behalf of this client
  boolean hasSrcDataNode = in.readBoolean(); // is src node info present
  if (hasSrcDataNode) {
    srcDataNode = new DatanodeInfo();
    srcDataNode.readFields(in);
  }
  int numTargets = in.readInt();
  if (numTargets < 0) {
    throw new IOException("Mislabelled incoming datastream.");
  }

  // Read the list of remaining DataNodes. On the first DataNode this holds the
  // second and third DataNodes; on the second it holds only the third.
  DatanodeInfo targets[] = new DatanodeInfo[numTargets];
  for (int i = 0; i < targets.length; i++) {
    DatanodeInfo tmp = new DatanodeInfo();
    tmp.readFields(in);
    targets[i] = tmp;
  }
  DataOutputStream mirrorOut = null;  // stream to next target
  DataInputStream mirrorIn = null;    // reply from next target
  DataOutputStream replyOut = null;   // stream to prev target
  Socket mirrorSock = null;           // socket to next target
  BlockReceiver blockReceiver = null; // responsible for data handling
  String mirrorNode = null;           // the name:port of next target
  String firstBadLink = "";           // first datanode that failed in connection setup
  try {
    // Build a BlockReceiver. Its DataInputStream in reads data from the client
    // or the previous DataNode; its DataOutputStream mirrorOut forwards data to
    // the next DataNode; its OutputStream out writes the data to local disk.
    blockReceiver = new BlockReceiver(block, in,
        s.getRemoteSocketAddress().toString(),
        s.getLocalSocketAddress().toString(),
        isRecovery, client, srcDataNode, datanode);
    // get a connection back to the previous target
    replyOut = new DataOutputStream(
                 NetUtils.getOutputStream(s, datanode.socketWriteTimeout));

    // If this is not the last DataNode, open a socket to the next one.
    if (targets.length > 0) {
      InetSocketAddress mirrorTarget = null;
      // Connect to backup machine
      mirrorNode = targets[0].getName();
      mirrorTarget = NetUtils.createSocketAddr(mirrorNode);
      mirrorSock = datanode.newSocket();
      int timeoutValue = numTargets * datanode.socketTimeout;
      int writeTimeout = datanode.socketWriteTimeout +
                           (HdfsConstants.WRITE_TIMEOUT_EXTENSION * numTargets);
      mirrorSock.connect(mirrorTarget, timeoutValue);
      mirrorSock.setSoTimeout(timeoutValue);
      mirrorSock.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);

      // Open the output stream to the next DataNode.
      mirrorOut = new DataOutputStream(
           new BufferedOutputStream(
                        NetUtils.getOutputStream(mirrorSock, writeTimeout),
                        SMALL_BUFFER_SIZE));
      mirrorIn = new DataInputStream(NetUtils.getInputStream(mirrorSock));
      mirrorOut.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
      mirrorOut.write( DataTransferProtocol.OP_WRITE_BLOCK );
      mirrorOut.writeLong( block.getBlockId() );
      mirrorOut.writeLong( block.getGenerationStamp() );
      mirrorOut.writeInt( pipelineSize );
      mirrorOut.writeBoolean( isRecovery );
      Text.writeString( mirrorOut, client );
      mirrorOut.writeBoolean(hasSrcDataNode);
      if (hasSrcDataNode) { // pass src node information
        srcDataNode.write(mirrorOut);
      }
      mirrorOut.writeInt( targets.length - 1 );

      // Again starting at 1: forward to the next DataNode the info of every
      // DataNode after it.
      for ( int i = 1; i < targets.length; i++ ) {
        targets[i].write( mirrorOut );
      }
      blockReceiver.writeChecksumHeader(mirrorOut);
      mirrorOut.flush();
    }

    // Receive the block with the BlockReceiver.
    String mirrorAddr = (mirrorSock == null) ? null : mirrorNode;
    blockReceiver.receiveBlock(mirrorOut, mirrorIn, replyOut,
                               mirrorAddr, null, targets.length);
    ......
  } finally {
    // close all opened streams
    IOUtils.closeStream(mirrorOut);
    IOUtils.closeStream(mirrorIn);
    IOUtils.closeStream(replyOut);
    IOUtils.closeSocket(mirrorSock);
    IOUtils.closeStream(blockReceiver);
  }
}

An important piece of logic in BlockReceiver.receiveBlock is:

void receiveBlock(
    DataOutputStream mirrOut, // output to next datanode
    DataInputStream mirrIn,   // input from next datanode
    DataOutputStream replyOut,  // output to previous datanode
    String mirrAddr, BlockTransferThrottler throttlerArg,
    int numTargets) throws IOException {
    ......
    // Keep receiving packets until the block is complete.
    while (receivePacket() > 0) {}
    if (mirrorOut != null) {
      try {
        mirrorOut.writeInt(0); // mark the end of the block
        mirrorOut.flush();
      } catch (IOException e) {
        handleMirrorOutError(e);
      }
    }
    ......
}

BlockReceiver.receivePacket is as follows:

private int receivePacket() throws IOException {
  // Receive one packet from the client or the previous DataNode.
  int payloadLen = readNextPacket();
  buf.mark();
  //read the header
  buf.getInt(); // packet length
  offsetInBlock = buf.getLong(); // get offset of packet in block
  long seqno = buf.getLong();    // get seqno
  boolean lastPacketInBlock = (buf.get() != 0);
  int endOfHeader = buf.position();
  buf.reset();
  setBlockPosition(offsetInBlock);

  // Forward the packet to the next DataNode.
  if (mirrorOut != null) {
    try {
      mirrorOut.write(buf.array(), buf.position(), buf.remaining());
      mirrorOut.flush();
    } catch (IOException e) {
      handleMirrorOutError(e);
    }
  }
  buf.position(endOfHeader);
  int len = buf.getInt();
  offsetInBlock += len;
  // One checksum per bytesPerChecksum data bytes, rounded up.
  int checksumLen = ((len + bytesPerChecksum - 1)/bytesPerChecksum)*
                                                            checksumSize;
  int checksumOff = buf.position();
  int dataOff = checksumOff + checksumLen;
  byte pktBuf[] = buf.array();
  buf.position(buf.limit()); // move to the end of the data.
  ......

  // Write the data into the local block file.
  out.write(pktBuf, dataOff, len);
  /// flush entire packet before sending ack
  flush();
  // put in queue for pending acks
  if (responder != null) {
    ((PacketResponder)responder.getRunnable()).enqueue(seqno,
                                      lastPacketInBlock);
  }
  return payloadLen;
}
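As a quick sanity check on the checksumLen formula above, with values assumed purely for illustration (512-byte chunks and a 4-byte CRC checksum):

int bytesPerChecksum = 512, checksumSize = 4;  // illustrative values only
int len = 65000;                               // data bytes in this packet
int chunks = (len + bytesPerChecksum - 1) / bytesPerChecksum; // ceiling division: 127 chunks
int checksumLen = chunks * checksumSize;       // 127 * 4 = 508 bytes of checksums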

