[Repost] Hadoop in Depth (11): Serialization and Writable Implementations

Published: 2024/1/17

Original article: http://blog.csdn.net/lastsweetop/article/details/9249411

All source code is on GitHub: https://github.com/lastsweetop/styhadoop

Introduction

Hadoop ships a large family of Writable implementations. This article briefly introduces the ones most commonly used for serialization.

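Java primitive types

Every Java primitive type except char has a corresponding Writable class, and each of them exposes get and set methods for reading and changing the wrapped value. IntWritable and LongWritable also have variable-length counterparts, VIntWritable and VLongWritable. Choosing between the fixed-length and variable-length encodings is much like choosing between char and varchar columns in a database, so it is not covered in detail here.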
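The size trade-off between fixed and variable length can be made concrete with a small sketch that uses only the Java standard library. It re-implements, purely for illustration, the zero-compressed layout that VIntWritable and VLongWritable are understood to use: small values take one byte, larger ones take a marker byte followed by a big-endian payload. The class VLongSketch and its methods are hypothetical names, not part of Hadoop, and the exact byte layout should be treated as an assumption to check against the Hadoop source.

```java
import java.io.ByteArrayOutputStream;

// Illustrative re-implementation; this is not a Hadoop class.
class VLongSketch {

    // Encodes a long in a zero-compressed, variable-length form.
    // Assumption: this mirrors the layout used by VIntWritable/VLongWritable.
    static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (value >= -112 && value <= 127) {
            out.write((int) value);              // small values fit in a single byte
            return out.toByteArray();
        }
        int marker = -112;                       // marker base for positive values
        if (value < 0) {
            value = ~value;                      // negatives are stored as their one's complement
            marker = -120;                       // marker base for negative values
        }
        int payloadBytes = 0;
        for (long tmp = value; tmp != 0; tmp >>= 8) {
            payloadBytes++;                      // count the bytes needed for the magnitude
        }
        out.write(marker - payloadBytes);        // first byte encodes sign and payload size
        for (int i = payloadBytes - 1; i >= 0; i--) {
            out.write((int) (value >> (8 * i))); // payload, most significant byte first
        }
        return out.toByteArray();
    }

    // Lower-case hex dump of a byte array.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (long v : new long[] {1, 127, 128, 163, 1_000_000}) {
            byte[] encoded = encode(v);
            System.out.println(v + " -> " + hex(encoded) + " (" + encoded.length
                    + " bytes; a fixed-length int always takes 4)");
        }
    }
}
```

Under this layout the variable-length form wins when most values are small and loses by one byte when values regularly need the full width, which is the same reasoning as for varchar versus char.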

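Text type

Text stores its byte length as a variable-length int, so a Text value can hold at most 2 GB. Text uses standard UTF-8 encoding, which makes it easy to exchange data with other text tools, but it also means that Text behaves quite differently from Java's String.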

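Differences in indexing

Text's charAt method takes a byte offset and returns an int holding the Unicode code point found at that position, not a char as String's charAt does. It returns -1 when the offset is out of bounds.

```java
// Assumes JUnit 4 (org.junit.Test, org.junit.Assert) and org.apache.hadoop.io.Text are imported.
@Test
public void testTextIndex() {
    Text text = new Text("hadoop");
    Assert.assertEquals(6, text.getLength());
    Assert.assertEquals(6, text.getBytes().length);
    // charAt returns the code point as an int
    Assert.assertEquals((int) 'd', text.charAt(2));
    Assert.assertEquals("Out of bounds", -1, text.charAt(100));
}
```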

Text also has a find method, similar to String's indexOf:

```java
@Test
public void testTextFind() {
    Text text = new Text("hadoop");
    Assert.assertEquals("find a substring", 2, text.find("do"));
    Assert.assertEquals("Find first 'o'", 3, text.find("o"));
    Assert.assertEquals("Find 'o' from position 4 or later", 4, text.find("o", 4));
    Assert.assertEquals("No match", -1, text.find("pig"));
}
```

Differences in Unicode handling

The difference between Text and String becomes clearer once characters need more than one byte in UTF-8, because String counts in Java chars (UTF-16 code units) while Text counts in bytes. Consider four Unicode characters, U+0041, U+00DF, U+6771 and U+10400, which take 1, 2, 3 and 4 bytes in UTF-8 respectively. In a Java String, U+10400 takes two chars (a surrogate pair), while each of the first three takes one char. The following code shows how String and Text differ:

```java
@Test
public void string() throws UnsupportedEncodingException {
    String str = "\u0041\u00DF\u6771\uD801\uDC00";
    // Lengths: 5 chars, 10 UTF-8 bytes
    Assert.assertEquals(5, str.length());
    Assert.assertEquals(10, str.getBytes("UTF-8").length);
    // indexOf works in char offsets
    Assert.assertEquals(0, str.indexOf("\u0041"));
    Assert.assertEquals(1, str.indexOf("\u00DF"));
    Assert.assertEquals(2, str.indexOf("\u6771"));
    Assert.assertEquals(3, str.indexOf("\uD801\uDC00"));
    // charAt returns single chars, so the surrogate pair is split
    Assert.assertEquals('\u0041', str.charAt(0));
    Assert.assertEquals('\u00DF', str.charAt(1));
    Assert.assertEquals('\u6771', str.charAt(2));
    Assert.assertEquals('\uD801', str.charAt(3));
    Assert.assertEquals('\uDC00', str.charAt(4));
    // codePointAt returns whole code points
    Assert.assertEquals(0x0041, str.codePointAt(0));
    Assert.assertEquals(0x00DF, str.codePointAt(1));
    Assert.assertEquals(0x6771, str.codePointAt(2));
    Assert.assertEquals(0x10400, str.codePointAt(3));
}

@Test
public void text() {
    Text text = new Text("\u0041\u00DF\u6771\uD801\uDC00");
    // Length in bytes
    Assert.assertEquals(10, text.getLength());
    // find works in byte offsets
    Assert.assertEquals(0, text.find("\u0041"));
    Assert.assertEquals(1, text.find("\u00DF"));
    Assert.assertEquals(3, text.find("\u6771"));
    Assert.assertEquals(6, text.find("\uD801\uDC00"));
    // charAt takes a byte offset and returns a whole code point
    Assert.assertEquals(0x0041, text.charAt(0));
    Assert.assertEquals(0x00DF, text.charAt(1));
    Assert.assertEquals(0x6771, text.charAt(3));
    Assert.assertEquals(0x10400, text.charAt(6));
}
```

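The comparison makes the differences clear:

1. String's length() returns the number of chars; Text's getLength() returns the number of bytes.
2. String's indexOf() returns an offset counted in chars; Text's find() returns an offset counted in bytes.
3. String's charAt() returns a single Java char, which is not necessarily a whole Unicode character.
4. String's codePointAt() is the closest match to Text's charAt(), but the former takes a char offset and the latter takes a byte offset.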
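The relationship between the two kinds of offset can be sketched without Hadoop. The example below, which assumes nothing beyond the Java standard library, converts a char index in a String into the byte offset that the same character would have in its UTF-8 encoding. Utf8Offsets and byteOffset are hypothetical names introduced for illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf8Offsets {
    // The four sample characters: U+0041, U+00DF, U+6771, U+10400.
    static final String SAMPLE = "\u0041\u00DF\u6771\uD801\uDC00";

    // Number of UTF-8 bytes that precede the char at charIndex.
    // charIndex must not fall between the two halves of a surrogate pair.
    static int byteOffset(String s, int charIndex) {
        return s.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Step through the string one code point at a time.
        for (int i = 0; i < SAMPLE.length(); i += Character.charCount(SAMPLE.codePointAt(i))) {
            System.out.printf("U+%04X  char index %d  byte offset %d%n",
                    SAMPLE.codePointAt(i), i, byteOffset(SAMPLE, i));
        }
    }
}
```

For the sample string the char indexes 0, 1, 2, 3 map to byte offsets 0, 1, 3, 6, which are the positions the Text test passes to find() and charAt().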

Iterating over Text

Iterating over the Unicode characters in a Text is fairly involved, because each character may occupy a different number of bytes, so you cannot simply increment an index. First wrap the Text's bytes in a ByteBuffer, then call the static method Text.bytesToCodePoint repeatedly on the buffer; each call returns the next code point and advances the buffer position. Sample code:

```java
package com.sweetop.styhadoop;

import org.apache.hadoop.io.Text;

import java.nio.ByteBuffer;

/**
 * Prints every code point of a Text value in hex.
 * Author: lastsweetop, 13-7-9
 */
public class TextIterator {
    public static void main(String[] args) {
        Text text = new Text("\u0041\u00DF\u6771\uD801\udc00");
        // Wrap only the valid bytes: getBytes() may return a longer backing array
        ByteBuffer buffer = ByteBuffer.wrap(text.getBytes(), 0, text.getLength());
        int cp;
        // bytesToCodePoint reads one code point and advances the buffer
        while (buffer.hasRemaining() && (cp = Text.bytesToCodePoint(buffer)) != -1) {
            System.out.println(Integer.toHexString(cp));
        }
    }
}
```
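To see what such a decoding step has to do, here is a minimal stand-in written with the Java standard library only. It is a sketch, not Hadoop's implementation: Utf8CodePoints and nextCodePoint are hypothetical names, and the decoder assumes the input is well-formed UTF-8 and performs no validation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class Utf8CodePoints {

    // Decodes the next code point and advances the buffer position.
    // Assumption: the buffer holds well-formed UTF-8 and is positioned on a lead byte.
    static int nextCodePoint(ByteBuffer buf) {
        int first = buf.get() & 0xFF;
        int extra;                       // number of continuation bytes that follow
        int cp;                          // code point bits collected so far
        if (first < 0x80) {              // 0xxxxxxx: 1-byte sequence
            extra = 0;
            cp = first;
        } else if (first >= 0xF0) {      // 11110xxx: 4-byte sequence
            extra = 3;
            cp = first & 0x07;
        } else if (first >= 0xE0) {      // 1110xxxx: 3-byte sequence
            extra = 2;
            cp = first & 0x0F;
        } else {                         // 110xxxxx: 2-byte sequence
            extra = 1;
            cp = first & 0x1F;
        }
        for (int i = 0; i < extra; i++) {
            cp = (cp << 6) | (buf.get() & 0x3F);   // each continuation byte adds 6 bits
        }
        return cp;
    }

    // Returns the code points of s as lower-case hex strings.
    static List<String> hexCodePoints(String s) {
        ByteBuffer buf = ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
        List<String> result = new ArrayList<>();
        while (buf.hasRemaining()) {
            result.add(Integer.toHexString(nextCodePoint(buf)));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(hexCodePoints("\u0041\u00DF\u6771\uD801\uDC00"));
    }
}
```

The loop has the same shape as the Text version: the buffer position, not a manually incremented index, tracks where the next character starts.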

Modifying Text

Apart from NullWritable, which is immutable, the Writable types are mutable. You can reuse a Text instance by changing its value through one of its set methods:

```java
@Test
public void testTextMutability() {
    Text text = new Text("hadoop");
    text.set("pig");
    Assert.assertEquals(3, text.getLength());
    Assert.assertEquals(3, text.getBytes().length);
}
```

Be aware, however, that in some situations the byte array returned by getBytes() is longer than the value returned by getLength(). Whenever you call getBytes(), call getLength() as well so that you know how many bytes of the array are valid:

```java
@Test
public void testTextMutability2() {
    Text text = new Text("hadoop");
    text.set(new Text("pig"));
    // Only the first 3 bytes are valid, but the backing array still has 6
    Assert.assertEquals(3, text.getLength());
    Assert.assertEquals(6, text.getBytes().length);
}
```
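One plausible way this mismatch arises is that the backing array is reused rather than reallocated when the new content is shorter. The sketch below models that behavior with a hypothetical ReusableBuffer class built on the standard library; it is an assumption about the mechanism, not Hadoop's code, and it also shows the safe pattern of always pairing the array with its valid length.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for a reusable byte container such as Text; not Hadoop code.
class ReusableBuffer {
    private byte[] bytes = new byte[0];
    private int length;

    // Copies new content in, growing the backing array only when it is too small.
    void set(byte[] src) {
        if (bytes.length < src.length) {
            bytes = new byte[src.length];
        }
        System.arraycopy(src, 0, bytes, 0, src.length);
        length = src.length;
    }

    // Backing array; anything past getLength() is stale data.
    byte[] getBytes() {
        return bytes;
    }

    // Number of valid bytes.
    int getLength() {
        return length;
    }

    // Safe copy that contains only the valid bytes.
    byte[] copyBytes() {
        return Arrays.copyOf(bytes, length);
    }

    // Decodes only the valid bytes.
    String asString() {
        return new String(bytes, 0, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ReusableBuffer buffer = new ReusableBuffer();
        buffer.set("hadoop".getBytes(StandardCharsets.UTF_8));
        buffer.set("pig".getBytes(StandardCharsets.UTF_8));
        System.out.println("valid length: " + buffer.getLength());
        System.out.println("array length: " + buffer.getBytes().length);
        System.out.println("content:      " + buffer.asString());
    }
}
```

Decoding the whole array instead of the first getLength() bytes would yield "pigoop" here, which is exactly the kind of bug the advice above is meant to prevent.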

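BytesWritable type

BytesWritable wraps an array of binary data. Its serialized form starts with a 4-byte integer giving the length of the array (unlike Text, which starts with a variable-length int), followed by the bytes themselves. For example:

```java
// SerializeUtils.serialize is the author's helper; it is not defined in this article.
@Test
public void testBytesWritableSerializedFormat() throws IOException {
    BytesWritable bytesWritable = new BytesWritable(new byte[]{3, 5});
    byte[] bytes = SerializeUtils.serialize(bytesWritable);
    // 00000002 is the length, 03 05 is the payload
    Assert.assertEquals("000000020305", StringUtils.byteToHexString(bytes));
}
```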
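The length-prefixed layout described above is easy to reproduce by hand. The following sketch uses only java.nio; LengthPrefixedBytes is a hypothetical name, and the code illustrates the format rather than showing how BytesWritable is implemented.

```java
import java.nio.ByteBuffer;

class LengthPrefixedBytes {

    // Writes a 4-byte big-endian length followed by the payload itself.
    static byte[] serialize(byte[] payload) {
        return ByteBuffer.allocate(4 + payload.length)
                .putInt(payload.length)   // ByteBuffer is big-endian by default
                .put(payload)
                .array();
    }

    // Lower-case hex dump of a byte array.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(serialize(new byte[] {3, 5})));
    }
}
```

The fixed 4-byte prefix costs more than a variable-length int for small arrays, but it makes the payload offset constant, which is convenient for raw byte comparison.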

Like Text, BytesWritable can be modified through its set method. getLength() returns the actual size of the data, whereas the length of the array returned by getBytes() may be larger:

```java
// Continues with the bytesWritable instance from the previous test
bytesWritable.setCapacity(11);
bytesWritable.setSize(4);
Assert.assertEquals(4, bytesWritable.getLength());
Assert.assertEquals(11, bytesWritable.getBytes().length);
```

NullWritable type

NullWritable is a very special Writable: its serialized form contains no bytes at all, so it acts purely as a placeholder. In MapReduce, you can declare the key or the value as NullWritable when you do not need it.

```java
package com.sweetop.styhadoop;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.util.StringUtils;

import java.io.IOException;

/**
 * Prints the serialized form of NullWritable as hex.
 * Author: lastsweetop, 13-7-16
 */
public class TestNullWritable {
    public static void main(String[] args) throws IOException {
        NullWritable nullWritable = NullWritable.get();
        // NullWritable writes no bytes, so the hex string is expected to be empty
        System.out.println(StringUtils.byteToHexString(SerializeUtils.serialize(nullWritable)));
    }
}
```

ObjectWritable type

ObjectWritable is a general-purpose wrapper for Java primitives, String, enum, Writable, null, and arrays of any of these types. It is useful when a field can hold more than one type. The drawback is that it takes up a lot of space, even when you store a single letter, because it has to record the type of the wrapped value. Here is an example:

```java
package com.sweetop.styhadoop;

import org.apache.hadoop.io.ObjectWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.StringUtils;

import java.io.IOException;

/**
 * Prints the serialized form of an ObjectWritable wrapping a one-letter Text.
 * Author: lastsweetop, 13-7-17
 */
public class TestObjectWritable {
    public static void main(String[] args) throws IOException {
        Text text = new Text("\u0041");
        ObjectWritable objectWritable = new ObjectWritable(text);
        System.out.println(StringUtils.byteToHexString(SerializeUtils.serialize(objectWritable)));
    }
}
```

Only a single letter is stored, so let's look at the serialized result:

```
00196f72672e6170616368652e6861646f6f702e696f2e5465787400196f72672e6170616368652e6861646f6f702e696f2e546578740141
```
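The hex dump above can be taken apart with the standard library to see where the space goes. The sketch below assumes the layout is two class names, each preceded by a 2-byte length, followed by the Text payload with a 1-byte length; as far as can be told these are the declared class and the class of the actual instance, but that interpretation is an assumption. ObjectWritableDump is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ObjectWritableDump {
    // The serialized bytes printed by the example above.
    static final String HEX =
            "00196f72672e6170616368652e6861646f6f702e696f2e54657874"
          + "00196f72672e6170616368652e6861646f6f702e696f2e54657874"
          + "0141";

    // Converts a hex string into bytes.
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Splits the dump into its three fields: class name, class name, payload.
    static List<String> fields(String hex) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(fromHex(hex)))) {
            List<String> fields = new ArrayList<>();
            fields.add(in.readUTF());            // 2-byte length + class name
            fields.add(in.readUTF());            // 2-byte length + class name again
            int length = in.readUnsignedByte();  // Text length; one byte for short values
            byte[] payload = new byte[length];
            in.readFully(payload);
            fields.add(new String(payload, StandardCharsets.UTF_8));
            return fields;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fromHex(HEX).length + " bytes in total");
        for (String field : fields(HEX)) {
            System.out.println(field);
        }
    }
}
```

Of the 56 bytes, 54 are spent on the two copies of the class name and only 2 on the actual value.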

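That wastes a lot of space. The set of possible types is usually known in advance and small, which is where the alternative described in the next section comes in.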

GenericWritable type

To use GenericWritable, extend it and override getTypes to declare which types it should support. Usage looks like this:

```java
package com.sweetop.styhadoop;

import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

class MyWritable extends GenericWritable {

    MyWritable(Writable writable) {
        set(writable);
    }

    // The index of a class in this array is the type tag written during serialization
    @SuppressWarnings("unchecked")
    public static final Class<? extends Writable>[] CLASSES =
            (Class<? extends Writable>[]) new Class[]{
                    Text.class
            };

    @Override
    protected Class<? extends Writable>[] getTypes() {
        return CLASSES;
    }
}
```

Then print the serialized results:

```java
package com.sweetop.styhadoop;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.StringUtils;

import java.io.IOException;

/**
 * Compares the serialized form of a Text with the same Text wrapped in MyWritable.
 * Author: lastsweetop, 13-7-17
 */
public class TestGenericWritable {
    public static void main(String[] args) throws IOException {
        Text text = new Text("\u0041\u0071");
        MyWritable myWritable = new MyWritable(text);
        // Plain Text first, then the wrapped version
        System.out.println(StringUtils.byteToHexString(SerializeUtils.serialize(text)));
        System.out.println(StringUtils.byteToHexString(SerializeUtils.serialize(myWritable)));
    }
}
```

The output is:

```
024171
00024171
```

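GenericWritable's serialized form simply prepends the index of the value's type in the types array. That saves a lot of space compared with ObjectWritable, so GenericWritable is the recommended choice.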
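The idea of replacing a class name with a one-byte index can be sketched without Hadoop. In the example below, TaggedUnionSketch, TYPES and serialize are hypothetical names, String and Integer stand in for Writable types, and the 1-byte string length is a simplification that only holds for payloads shorter than 128 bytes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class TaggedUnionSketch {
    // Hypothetical type registry standing in for getTypes(); the list index is the tag.
    static final List<Class<?>> TYPES = Arrays.asList(String.class, Integer.class);

    // Writes a 1-byte type index followed by the value itself.
    static byte[] serialize(Object value) {
        int tag = TYPES.indexOf(value.getClass());
        if (tag < 0) {
            throw new IllegalArgumentException("Unregistered type: " + value.getClass());
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(tag);
        if (value instanceof String) {
            byte[] utf8 = ((String) value).getBytes(StandardCharsets.UTF_8);
            if (utf8.length > 127) {
                throw new IllegalArgumentException("Sketch only handles short strings");
            }
            out.write(utf8.length);              // simplification: 1-byte length
            out.write(utf8, 0, utf8.length);
        } else {
            byte[] fixed = ByteBuffer.allocate(4).putInt((Integer) value).array();
            out.write(fixed, 0, fixed.length);   // 4-byte big-endian int
        }
        return out.toByteArray();
    }

    // Lower-case hex dump of a byte array.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(serialize("Aq")));
        System.out.println(hex(serialize(7)));
    }
}
```

The price of the compact tag is that writer and reader must agree on the registry and its order; reordering the array changes the meaning of data that has already been written.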

Collection Writables

ArrayWritable and TwoDArrayWritable

ArrayWritable and TwoDArrayWritable are the Writable types for arrays and two-dimensional arrays. There are two ways to specify the element type: pass it to the constructor, or subclass ArrayWritable. The same applies to TwoDArrayWritable.

```java
package com.sweetop.styhadoop;

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.StringUtils;

import java.io.IOException;

/**
 * Prints the serialized form of an ArrayWritable holding two Text elements.
 * Author: lastsweetop, 13-7-17
 */
public class TestArrayWritable {
    public static void main(String[] args) throws IOException {
        // The element type is passed to the constructor
        ArrayWritable arrayWritable = new ArrayWritable(Text.class);
        arrayWritable.set(new Writable[]{new Text("\u0071"), new Text("\u0041")});
        System.out.println(StringUtils.byteToHexString(SerializeUtils.serialize(arrayWritable)));
    }
}
```

The output is:

```
0000000201710141
```

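As the output shows, ArrayWritable starts with an int giving the array length, followed by the elements one after another. ArrayPrimitiveWritable is similar, except that there is no need to subclass ArrayWritable.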

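MapWritable and SortedMapWritable

MapWritable corresponds to Map and SortedMapWritable corresponds to SortedMap. The serialized form starts with 4 bytes holding the number of entries. Each element is then preceded by one byte holding the index of its type (similar to GenericWritable, so the total number of types can only go up to 127), followed by the element itself, key first and then value, one pair after another. These two Writables are used throughout Hadoop, so no example is given here.

You may have noticed that there are no Writables for sets or lists. They can be emulated. Use MapWritable in place of a set and SortedMapWritable in place of a sorted set by setting the values to NullWritable, which takes up no space. A list of elements of the same type can be represented with ArrayWritable. For a list of mixed types, implement the element type with GenericWritable and then wrap the elements in an ArrayWritable. MapWritable can also serve as a list, with the index as the key and the list elements as the values.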

Reposted from: https://www.cnblogs.com/ihongyan/p/5137275.html
