Hadoop Core Programming: HDFS File Operations

Posted: 2025/3/20


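Preface

This article does not cover the HDFS read and write flow, although that is an important topic. If you are interested, look up material on it; if one read-through does not make it clear, read it again.
The focus here is on code, with short notes that explain the logic and point out a few things to watch for. You can wrap the code in this article in a utility class so that you can reuse it whenever you need it.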

Copyright notice

Copyright belongs to the author.
For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source.
Author: Q-WHai
Published: June 21, 2016
Link: https://qwhai.blog.csdn.net/article/details/51728359
Source: CSDN
More: Category >> Big Data: Hadoop

HDFS Read/Write API

Upload a local file to HDFS

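// Imports shared by the snippets in this article.
import java.io.*;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.IOUtils;

public static void uploadFileFromLocal(String localPath, String hdfsPath) throws IOException {
    InputStream in = new BufferedInputStream(new FileInputStream(localPath));
    FileSystem fileSystem = FileSystem.get(URI.create(hdfsPath), new Configuration());
    OutputStream out = fileSystem.create(new Path(hdfsPath));
    // Copy in 4 KB chunks; the last argument (true) closes both streams when done.
    IOUtils.copyBytes(in, out, 4096, true);
    fileSystem.close();
}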

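This uses the very handy IOUtils.copyBytes() method. It writes an input stream to an output stream for you, so you do not have to manage a buffer or write the read loop yourself. The fourth parameter of IOUtils.copyBytes() says whether to close the stream objects, meaning both the input stream and the output stream, once the copy is done. In most cases you can simply pass true.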
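To make clear what that one call saves you, here is a rough sketch of the buffer-and-loop code you would otherwise write by hand. It uses only the JDK and in-memory streams so it runs without a cluster. The class ManualCopySketch and its copy() helper are illustrative names, not part of Hadoop, and this is not a copy of Hadoop's actual implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class ManualCopySketch {

    // Hypothetical helper: copies everything from in to out through a
    // fixed-size buffer, optionally closing both streams afterwards.
    static void copy(InputStream in, OutputStream out, int bufferSize, boolean close)
            throws IOException {
        try {
            byte[] buffer = new byte[bufferSize];
            int read;
            // read() returns -1 once the input is exhausted.
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } finally {
            if (close) {
                try {
                    in.close();
                } finally {
                    out.close();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello hdfs".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // A 4-byte buffer forces several loop iterations for this input.
        copy(new ByteArrayInputStream(data), out, 4, true);
        System.out.println(out.toString("UTF-8")); // prints hello hdfs
    }
}
```

With a real upload, the ByteArrayInputStream would be the local FileInputStream and the ByteArrayOutputStream would be the stream returned by fileSystem.create().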

Download a file from HDFS to local disk

With the upload example above in hand, the download code is easy to write:

public static void downloadFileToLocal(String hdfsPath, String localPath) throws IOException {
    FileSystem fileSystem = FileSystem.get(URI.create(hdfsPath), new Configuration());
    FSDataInputStream in = fileSystem.open(new Path(hdfsPath));
    OutputStream out = new FileOutputStream(localPath);
    // true closes both streams once the copy finishes.
    IOUtils.copyBytes(in, out, 4096, true);
    fileSystem.close();
}

Download a file from HDFS to local disk with FileUtil.copy()

The download method above already works well. Here is another way to download a file, this time by calling FileUtil.copy().

public static void downloadFileToLocalNew(String hdfsSourceFileFullName, String localFileFullName) throws IOException {
    Configuration config = new Configuration();
    FileSystem fileSystem = FileSystem.get(URI.create(hdfsSourceFileFullName), config);
    // false: keep the source file in HDFS after copying.
    FileUtil.copy(fileSystem, new Path(hdfsSourceFileFullName), new File(localFileFullName), false, config);
    fileSystem.close();
}

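Read an HDFS file line by line

HDFS does not appear to provide an API for reading a file line by line directly (if it turns out there is one, this article will be updated), but the JDK does: BufferedReader. You may remember it from when you first learned Java and read user input from the console. Back then the choices were Scanner or BufferedReader, and it is just as convenient here.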

import java.util.ArrayList;
import java.util.List;

public static List<String> readFileHDFSByLine(Configuration config, String hdfsFileFullName) throws IOException {
    List<String> result = new ArrayList<>();
    FileSystem fileSystem = FileSystem.get(URI.create(hdfsFileFullName), config);
    FSDataInputStream dataInputStream = fileSystem.open(new Path(hdfsFileFullName));
    // try-with-resources closes the reader, and with it the underlying HDFS stream.
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(dataInputStream, "UTF-8"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            result.add(line);
        }
    }
    return result;
}
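The same BufferedReader loop works on any InputStream, so it can be tried without a cluster. The following is a minimal sketch, assuming an in-memory stream stands in for the FSDataInputStream. The class LineReaderSketch and its readLines() helper are illustrative names only, and try-with-resources is used here to close the reader.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LineReaderSketch {

    // Hypothetical helper: reads every line of a UTF-8 stream into a list.
    static List<String> readLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            // readLine() strips the line terminator and returns null at end of stream.
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
                "first\nsecond\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(readLines(in)); // prints [first, second]
    }
}
```

Because the whole file is collected into a list, this pattern suits small files; for large HDFS files it would be safer to process each line inside the loop instead.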

Append content to a file in HDFS

import java.nio.charset.StandardCharsets;

public static void appendLabelToHDFS(String hdfsPath, String content) throws IOException {
    Configuration config = new Configuration();
    // NEVER: do not add a replacement DataNode to the write pipeline when one fails.
    config.set("dfs.client.block.write.replace-datanode-on-failure.policy", "NEVER");
    config.set("dfs.client.block.write.replace-datanode-on-failure.enable", "true");
    FileSystem fileSystem = FileSystem.get(URI.create(hdfsPath), config);
    FSDataOutputStream out = fileSystem.append(new Path(hdfsPath));
    // Encode explicitly; getBytes() without a charset uses the platform default.
    byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
    out.write(bytes, 0, bytes.length);
    out.close();
    fileSystem.close();
}

If you do not want to set these two properties on the Configuration at run time, you need to set them in the configuration file instead.

Additional note

If you need to append to files, the following property must be set in the hdfs-site.xml configuration file.

<property>
    <name>dfs.support.append</name>
    <value>true</value>
</property>

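Append a file to a file in HDFS

Given the string-append operation above, you might think of reading the file's contents into a string first and then appending that string. That does work. Notice, however, that the target above is an output stream, so there is no need to go through a string at all: a file can be fed straight into the stream. Appending a local file to an HDFS file therefore looks like this: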

public static void appendFileToHDFS(String hdfsPath, String localFilePath) throws IOException {
    Configuration config = new Configuration();
    config.set("dfs.client.block.write.replace-datanode-on-failure.policy", "NEVER");
    config.set("dfs.client.block.write.replace-datanode-on-failure.enable", "true");
    FileSystem fileSystem = FileSystem.get(URI.create(hdfsPath), config);
    InputStream in = new BufferedInputStream(new FileInputStream(localFilePath));
    FSDataOutputStream out = fileSystem.append(new Path(hdfsPath));
    // Stream the local file onto the end of the HDFS file; true closes both streams.
    IOUtils.copyBytes(in, out, 4096, true);
    fileSystem.close();
}

Write content to an HDFS file

This write overwrites the HDFS file: all of its previous content is removed.

import java.nio.charset.StandardCharsets;

public static void writeLabelToHDFS(String hdfsPath, String content) throws IOException {
    FileSystem fileSystem = FileSystem.get(URI.create(hdfsPath), new Configuration());
    // create(Path) overwrites an existing file.
    FSDataOutputStream out = fileSystem.create(new Path(hdfsPath));
    // Encode explicitly; getBytes() without a charset uses the platform default.
    byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
    out.write(bytes, 0, bytes.length);
    out.close();
    fileSystem.close();
}

Delete a file in HDFS

This operation deletes an existing file, as the method name suggests. If the file does not exist in HDFS, however, no exception is thrown.

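public static void deleteFileFromHDFS(String hdfsPath) throws IOException {
    FileSystem fileSystem = FileSystem.get(URI.create(hdfsPath), new Configuration());
    // deleteOnExit() only marks the path; the delete happens when the FileSystem is closed below.
    // fileSystem.delete(new Path(hdfsPath), false) would delete the file immediately instead.
    fileSystem.deleteOnExit(new Path(hdfsPath));
    fileSystem.close();
}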

List the files directly under an HDFS directory

public static void readFilesOnlyInDirectoryFromHDFS(String hdfsFolderName) throws IOException {
    FileSystem fileSystem = FileSystem.get(URI.create(hdfsFolderName), new Configuration());
    FileStatus[] fileList = fileSystem.listStatus(new Path(hdfsFolderName));
    for (FileStatus fileStatus : fileList) {
        // Skip subdirectories; only files directly in this directory are printed.
        if (fileStatus.isDirectory()) {
            continue;
        }
        System.out.println("FileName: " + fileStatus.getPath().getName() + "\t\tSize: " + fileStatus.getLen());
    }
    fileSystem.close();
}

List all files under an HDFS directory, including subdirectories

This method builds on readFilesOnlyInDirectoryFromHDFS() above, but it also reads the files in every subdirectory, so it is recursive. To keep it well encapsulated, the recursive logic is separated from the public entry point. The point of that split is to avoid creating a new Configuration (and opening a new FileSystem) at every level of the recursion.

public static void listHDFSFiles(String hdfsFileFullName) throws IOException {
    Configuration config = new Configuration();
    // FileSystem.get() returns a cached, shared instance, so open and close it only here.
    FileSystem fileSystem = FileSystem.get(URI.create(hdfsFileFullName), config);
    try {
        listHDFSFiles(fileSystem, new Path(hdfsFileFullName));
    } finally {
        // Close once, after the whole tree has been walked.
        fileSystem.close();
    }
}

private static void listHDFSFiles(FileSystem fileSystem, Path path) throws IOException {
    for (FileStatus statusItem : fileSystem.listStatus(path)) {
        if (statusItem.isDirectory()) {
            listHDFSFiles(fileSystem, statusItem.getPath());
        }
        // Printed for directories as well as files.
        System.out.println("FileName: " + statusItem.getPath() + "\t\tSize: " + statusItem.getLen());
    }
}
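The split between an entry point and a private recursive helper can be tried without a cluster. The sketch below applies the same shape to a local directory with java.io.File, as a stand-in for the HDFS calls. The class RecursiveListingSketch, its method names, and the temporary directory layout are all made up for illustration. Unlike the HDFS listing, this version records files only and returns the lines instead of printing them.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

public class RecursiveListingSketch {

    // Entry point: creates the shared result list once, then delegates.
    static List<String> listFiles(File root) {
        List<String> result = new ArrayList<>();
        collect(root, result);
        return result;
    }

    // Recursive helper: reuses the caller's list instead of allocating one per level.
    private static void collect(File dir, List<String> result) {
        File[] children = dir.listFiles();
        if (children == null) {
            return; // not a directory, or not readable
        }
        for (File child : children) {
            if (child.isDirectory()) {
                collect(child, result);
            } else {
                result.add(child.getName() + "\t" + child.length());
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny tree: root/a.txt (3 bytes) and root/sub/b.txt (5 bytes).
        File root = Files.createTempDirectory("listing-sketch").toFile();
        File sub = new File(root, "sub");
        sub.mkdir();
        Files.write(new File(root, "a.txt").toPath(), new byte[3]);
        Files.write(new File(sub, "b.txt").toPath(), new byte[5]);
        for (String line : listFiles(root)) {
            System.out.println(line);
        }
    }
}
```

The order of the printed lines depends on how the file system returns directory entries, so it should not be relied on.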

Find the nodes that actually store a file in HDFS

This method shows every DataNode on which a given HDFS file is stored.

public static void getFileLocal(String hdfsFileFullName) throws IOException {
    FileSystem fileSystem = FileSystem.get(URI.create(hdfsFileFullName), new Configuration());
    FileStatus status = fileSystem.getFileStatus(new Path(hdfsFileFullName));
    // One BlockLocation per block, covering the whole file (offset 0 to its length).
    BlockLocation[] locations = fileSystem.getFileBlockLocations(status, 0, status.getLen());
    for (int i = 0; i < locations.length; i++) {
        // Each block is replicated, so it can report several hosts.
        String[] hosts = locations[i].getHosts();
        for (String host : hosts) {
            System.out.println("block_" + i + "_location:" + host);
        }
    }
    fileSystem.close();
}

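Get information about all nodes in HDFS

If you do not know which files are in HDFS and simply want to know which DataNodes the file system has, you can replace the hdfsFileFullName used above with the HDFS root directory. In my setup that looks like this: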

import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public static void getHDFSNode() throws IOException {
    FileSystem fileSystem = FileSystem.get(URI.create("hdfs://master:9000/"), new Configuration());
    // The cast only works when the URI resolves to a real HDFS cluster.
    DistributedFileSystem distributedFileSystem = (DistributedFileSystem) fileSystem;
    DatanodeInfo[] dataNodeStats = distributedFileSystem.getDataNodeStats();
    for (int i = 0; i < dataNodeStats.length; i++) {
        System.out.println("DataNode_" + i + "_Node:" + dataNodeStats[i].getHostName());
    }
    fileSystem.close();
}

總結(jié)

以上是生活随笔為你收集整理的Hadoop 核心编程之 HDFS 的文件操作的全部內(nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯(cuò)，歡迎將生活随笔推薦給好友。

上一篇： MapReduce进阶：多路径输入输出
下一篇： MapReduce 进阶：Partiti