CUDA Learning Series (5) Basic GPU Algorithms: Reduce, Scan, Histogram

Published: 2025/3/21
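Meow~ Before we knew it, we have reached lecture 5 of the CUDA learning series. The previous lectures covered the basics of GPU software and hardware structure, memory management, task types and so on. This lecture introduces three basic GPU algorithms: reduce, scan and histogram. They come up all the time in parallel algorithms, and for each of them we discuss what it is for and how to implement it serially and in parallel.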
———-

1. Task complexity

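Task complexity has two parts: step complexity (how many steps remain once independent operations run in parallel, i.e. the length of the longest dependency chain) and work complexity (the total number of operations to perform).
e.g. In the tree-structured figure for this example, each node is an operand and each edge is an operation; edges on the same level represent the same operation. What are the step complexity and the work complexity of the task in the figure?

Ans:
step complexity: 3;
work complexity: 6.
More concrete examples follow below.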




2. Reduce

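Motivation: consider the task 1+2+3+4+…
1) The simplest serial evaluation order is ((1+2)+3)+4…
2) Because addition is associative, the operations can be regrouped so that they no longer depend on one another. Reduce exploits this to cut the number of steps compared with the serial implementation.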


2.1 What is reduce?

Reduce input:

  • set of elements
  • reduction operation, which must be
    • binary: two inputs, one output
    • associative: (a@b)@c = a@(b@c), where @ denotes the operator
      e.g. + and bitwise AND qualify; a^b (exponentiation) and subtraction do not
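To see why subtraction and exponentiation fail the test, here is a quick worked check. The numbers are chosen purely for illustration; regrouping changes the result in the first two lines but not in the third:

```latex
\[
\begin{aligned}
(8 - 4) - 2 = 2 \quad &\neq \quad 8 - (4 - 2) = 6 \\
(2^3)^2 = 64 \quad &\neq \quad 2^{(3^2)} = 512 \\
(1 + 2) + 3 = 6 \quad &= \quad 1 + (2 + 3) = 6
\end{aligned}
\]
```

A parallel reduce regroups its operands freely, so only operators that behave like the third line give the same answer as the serial order.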



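2.1.1 Serial implementation of Reduce:

In the serial version every operation depends on the result of the one before it. For the example above, adding n numbers, both the work complexity and the step complexity are O(n): there are n-1 additions and each has to wait for the previous one. Our goal is to parallelize the operations and bring the step complexity down, e.g. going from a serial add reduce to a parallel add reduce.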


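2.1.2 Parallel implementation of Reduce:

The parallel version adds the elements pairwise in a tree, which brings the step complexity down to log₂ n.

For example, a parallel add reduce over 2^10 numbers has a step complexity of 10. Its first step needs 512 additions, which is no big deal for a modern GPU. But what if we want to add 2^20 numbers? Then we have to take the number of available processors into account. If the GPU can do at most 512 operations in parallel, we should split the 2^20 numbers as 1024*1024 (1024 groups) and add 2^10 numbers at a time. The theory that relates task size to the number of processors is called Brent's theorem. Let's look at the details.

That is, we proceed in two stages: the first stage launches 1024 blocks, each of which sums its own 1024 elements; the second stage sums the resulting 1024 partial results with a single block of 1024 threads. Kernel code: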

```cuda
__global__ void parallel_reduce_kernel(float *d_out, float *d_in)
{
    int myID = threadIdx.x + blockIdx.x * blockDim.x;
    int tid = threadIdx.x;

    // Split the active threads into two halves by thread ID and add the right
    // half onto the left half, which halves the number of elements (one
    // iteration); iterate until only one element is left.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            d_in[myID] += d_in[myID + s];
        }
        __syncthreads(); // ensure all adds of this iteration are done
    }
    if (tid == 0) {
        d_out[blockIdx.x] = d_in[myID];
    }
}
```



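Quiz: where can the code above be optimized?

Ans: the previous lecture compared the speed of global, shared and local memory. Here the repeated global memory accesses can be moved to shared memory, which speeds things up: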

```cuda
__global__ void parallel_shared_reduce_kernel(float *d_out, float *d_in)
{
    int myID = threadIdx.x + blockIdx.x * blockDim.x;
    int tid = threadIdx.x;

    // Stage this block's slice of the input in shared memory.
    extern __shared__ float sdata[];
    sdata[tid] = d_in[myID];
    __syncthreads();

    // Same halving loop as before, but every access now hits shared memory.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads(); // ensure all adds of this iteration are done
    }
    if (tid == 0) {
        d_out[blockIdx.x] = sdata[0]; // sdata is indexed by tid, not by myID
    }
}
```


One more thing to note about the optimized code: remember the general kernel launch form from lecture 3:

```
kernel<<<grid of blocks, block of threads, shmem>>>
```

The last item, the shared memory size in bytes, has to be given when the kernel is called, because sdata is declared extern. That is:

```cuda
parallel_shared_reduce_kernel<<<blocks, threads, threads * sizeof(float)>>>(data_out, data_in);
```
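The two-stage scheme described earlier (1024 blocks first, then one block over the partial sums) can be sketched as a host-side driver. This is a minimal sketch, not the author's code: the names `shared_reduce_sketch` and `reduce_two_stage` are hypothetical, the input is assumed to hold exactly 1024 * 1024 floats, and error checking of the CUDA calls is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Block-level sum in shared memory; writes one partial sum per block.
// Assumes blockDim.x is a power of two and the input length is an exact
// multiple of blockDim.x (no bounds check).
__global__ void shared_reduce_sketch(float* d_out, const float* d_in) {
    extern __shared__ float sdata[];
    int myID = threadIdx.x + blockIdx.x * blockDim.x;
    int tid = threadIdx.x;
    sdata[tid] = d_in[myID];
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) d_out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: sums exactly 1024 * 1024 floats in two launches.
float reduce_two_stage(const std::vector<float>& h_in) {
    const int threads = 1024;
    const int blocks = 1024;
    assert(h_in.size() == static_cast<size_t>(threads) * blocks);

    float *d_in, *d_partial, *d_out;
    cudaMalloc(&d_in, h_in.size() * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    // Stage 1: 1024 blocks, each reducing 1024 elements to one partial sum.
    shared_reduce_sketch<<<blocks, threads, threads * sizeof(float)>>>(d_partial, d_in);
    // Stage 2: a single block of 1024 threads reduces the 1024 partial sums.
    shared_reduce_sketch<<<1, blocks, blocks * sizeof(float)>>>(d_out, d_partial);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_out);
    return result;
}
```

Both launches run on the default stream, so the second one only starts after the first has finished, and the final `cudaMemcpy` waits for both.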

Now a question: comparing the two versions (parallel_reduce_kernel and parallel_shared_reduce_kernel), how many times more global memory bandwidth does parallel_reduce_kernel use than parallel_shared_reduce_kernel?
Ans: count the global memory reads and writes of each version for one block of 1024 elements:

parallel_reduce_kernel

Times   Read Ops   Write Ops
1       1024       512
2       512        256
3       256        128
…       …          …
n       1          1

parallel_shared_reduce_kernel

Times   Read Ops   Write Ops
1       1024       1

So parallel_reduce_kernel needs about 3 times the global memory bandwidth of parallel_shared_reduce_kernel.
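The factor of 3 follows from summing the two columns. A back-of-the-envelope check for one block of 1024 elements, counting the final copy of the block result as one read and one write:

```latex
\[
\underbrace{(1024 + 512 + \dots + 2) + 1}_{\text{reads}}
+ \underbrace{(512 + 256 + \dots + 1) + 1}_{\text{writes}}
= 2047 + 1024 = 3071
\]
\[
\frac{3071}{1024 + 1} \approx 3.0
\]
```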


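3. Scan

3.1 What is scan?

• Example:

  • input: 1,2,3,4
  • operation: Add
  • output: 1,3,6,10 (out[i] = in[0] + … + in[i])
• Purpose: to handle problems that look hard to parallelize

Off the top of your head, an O(n) solution to the problem above is out[i] = out[i-1] + in[i]. Now let's introduce scan properly.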

Inputs to scan:

  • input array
  • operation: binary and associative (same as for reduce)
  • identity element I, such that I op a = a
    quiz: what is the identity for addition, multiplication, logical AND, logical OR?
    Ans:

    op                Identity
    addition          0
    multiplication    1
    logical OR  ||    False
    logical AND &&    True



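3.2 What does scan do?

I/O      content
input    [a0, a1, a2, …, an]
output   [I, a0, a0⊕a1, …, a0⊕a1⊕…⊕a(n-1)]

where ⊕ is the scan operator and I is the identity element of ⊕. Note that this is the exclusive form of scan (output i leaves out input i), whereas the example in 3.1 is the inclusive form.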




3.2.1 Serial implementation of Scan

Very simple:

```c
// Inclusive serial scan over n elements. op is the binary associative
// operator and identity is its identity element (e.g. + and 0).
int acc = identity;
for (int i = 0; i < n; i++) {
    acc = op(acc, elements[i]);
    out[i] = acc;
}
```

work complexity: O(n)
step complexity: O(n)
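The serial loop above produces the inclusive form of scan, while the I/O table in 3.2 shows the exclusive form. The two differ only in whether the output is written after or before the accumulator is updated. A minimal host-side sketch for addition, with hypothetical function names:

```cuda
#include <cassert>

// Inclusive add scan: out[i] = in[0] + ... + in[i].
void inclusive_scan_add(const int* in, int* out, int n) {
    assert(in != nullptr && out != nullptr && n >= 0);
    int acc = 0;                 // identity element of +
    for (int i = 0; i < n; ++i) {
        acc += in[i];            // update first ...
        out[i] = acc;            // ... then write
    }
}

// Exclusive add scan: out[0] = 0, out[i] = in[0] + ... + in[i-1].
void exclusive_scan_add(const int* in, int* out, int n) {
    assert(in != nullptr && out != nullptr && n >= 0);
    int acc = 0;                 // identity element of +
    for (int i = 0; i < n; ++i) {
        out[i] = acc;            // write first ...
        acc += in[i];            // ... then update
    }
}
```

For the input 1,2,3,4 the inclusive version yields 1,3,6,10 and the exclusive version yields 0,1,3,6.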

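So how do we parallelize scan?



3.2.2 Parallel implementation of Scan

A first idea is to compute the n outputs in parallel: output element i equals a0⊕a1⊕…⊕ai, which is itself a reduce operation.

Q: What do the work complexity and step complexity become?
Ans:

• step complexity:
  determined by the slowest of the n reductions, i.e. O(log₂ n)
• work complexity:
  computing every output element separately costs 0+1+2+…+(n-1) operations in total, so the complexity is O(n²).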
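Written out, the total is an arithmetic series. The value n = 1024 is picked here only to give a feel for the size:

```latex
\[
\sum_{i=0}^{n-1} i = \frac{n(n-1)}{2} = O(n^2),
\qquad
n = 1024:\ \frac{1024 \cdot 1023}{2} = 523\,776
\ \text{operations, versus } 1023 \text{ for the serial scan.}
\]
```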

So the step complexity went down, but unfortunately the work complexity went up. How do we fix that? There are two scan algorithms:

                          more step efficient   more work efficient
Hillis + Steele (1986)    ✓
Blelloch (1990)                                 ✓




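  • Hillis + Steele

    For the add scan, the Hillis + Steele algorithm works as follows, where every step takes the output of the previous step as its input:
    step 0: out[i] = in[i] + in[i-1];
    step 1: out[i] = in[i] + in[i-2];
    step 2: out[i] = in[i] + in[i-4];
    An element that does not exist (index below 0) counts as 0. For an input of up to 8 elements, the output of step 2 is already the result of the add scan (think about why; we will analyze it in a moment).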
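The steps above can be sketched as a single-block kernel. This is a minimal sketch under stated assumptions, not a tuned implementation: the names `hillis_steele_scan` and `scan_inclusive_gpu` are hypothetical, the whole input must fit into one block (at most 1024 elements), and two shared buffers are used so that a step never reads a value that was already overwritten in the same step.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Inclusive Hillis + Steele add scan for one block with blockDim.x == n.
__global__ void hillis_steele_scan(int* d_out, const int* d_in, int n) {
    extern __shared__ int buf[];          // 2 * blockDim.x ints
    int tid = threadIdx.x;
    int pin = 0, pout = 1;                // which half is read / written
    buf[tid] = d_in[tid];
    __syncthreads();
    for (int offset = 1; offset < n; offset <<= 1) {
        const int* in = buf + pin * blockDim.x;
        int* out = buf + pout * blockDim.x;
        // out[i] = in[i] + in[i - offset]; a missing element counts as 0.
        out[tid] = (tid >= offset) ? in[tid] + in[tid - offset] : in[tid];
        __syncthreads();
        pin = 1 - pin;                    // the half just written becomes the input
        pout = 1 - pout;
    }
    d_out[tid] = buf[pin * blockDim.x + tid];
}

// Hypothetical host helper; assumes 1 <= n <= 1024 so one block suffices.
std::vector<int> scan_inclusive_gpu(const std::vector<int>& h_in) {
    int n = static_cast<int>(h_in.size());
    assert(n >= 1 && n <= 1024);
    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    hillis_steele_scan<<<1, n, 2 * n * sizeof(int)>>>(d_out, d_in, n);

    std::vector<int> h_out(n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Every thread stays busy in every step, which is exactly why this algorithm is step efficient but not work efficient.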

So here comes the question...
Q: What are the work complexity and step complexity of the Hillis + Steele algorithm?

Hillis + Steele algorithm complexity

                   O(log n)   O(√n)   O(n)   O(n log n)   O(n²)
work complexity                              ✓
step complexity    ✓

Explanation:

  • step complexity:
    step i takes the output of the previous step as its input and computes out[idx] = in[idx] + in[idx - 2^i]. The offset doubles with every step, so the step complexity is O(log(n))
  • work complexity:
    workload = (n-1)+(n-2)+(n-4)+..., a sum of log(n) terms. It can be pictured roughly as a matrix that is log(n) long and n wide, so the complexity is O(n log(n))
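The matrix picture can be made exact by summing the series, assuming n is a power of two:

```latex
\[
\sum_{d=0}^{\log_2 n - 1} \left( n - 2^d \right)
= n \log_2 n - (n - 1)
= O(n \log n),
\qquad
n = 8:\ 7 + 6 + 4 = 17 = 8 \cdot 3 - 7 .
\]
```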



  • Blelloch

    Basic idea: reduce + downsweep

    As before, the procedure comes first. The Blelloch algorithm consists of two parts, reduce and downsweep.



  • reduce part:
    every step sums pairs of adjacent elements, and each element is used as an input only once, i.e. a sum with window size 2 and stride 2.
    Q: What are the step complexity and work complexity of the reduce part?
    Ans:

    Reduce part in Blelloch

                       O(log n)   O(√n)   O(n)   O(n log n)   O(n²)
    work complexity                       ✓
    step complexity    ✓

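  • downsweep part:
    Put simply, the inputs of the downsweep part are the mirror image of the reduce results. Every pair of inputs in1 & in2 produces two outputs: the left output out1 = in2 and the right output out2 = in1 op in2 (op is the same op as in the reduce part).

    With addition as op, for example: in1 = 11, in2 = 10 gives out1 = in2 = 10 and out2 = in1 + in2 = 21. All values of the downsweep part can be derived this way.
    The circled elements in the original figure are the ones that "drop down" (are mirrored) directly from the reduce part. Note that each position only receives the final reduce result of that position, and because of the mirroring, the results of later reduce steps drop down earlier. The order in which values drop down from the reduce part is:

    36
    10
    3, 11
    1, 3, 5, 7

    Q: What are the step complexity and work complexity of the downsweep part?
    Ans: downsweep is the mirror of the reduce part, so its complexities are of course the same as those of the reduce part.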
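The reduce and downsweep parts can be sketched together as a single-block kernel. This is a minimal sketch with hypothetical names (`blelloch_scan`, `scan_exclusive_blelloch`), assuming n is a power of two no larger than 2048 so that n/2 threads in one block cover the array. It relies on one detail of the standard formulation of the algorithm: before the downsweep starts, the last element (the total, 36 in the example) is replaced by the identity, which makes the result an exclusive scan.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Exclusive Blelloch add scan for one block of n / 2 threads.
__global__ void blelloch_scan(int* d_out, const int* d_in, int n) {
    extern __shared__ int temp[];          // n ints
    int tid = threadIdx.x;
    temp[2 * tid] = d_in[2 * tid];
    temp[2 * tid + 1] = d_in[2 * tid + 1];

    int offset = 1;
    // Reduce part: pairwise sums with stride 2, 4, 8, ...
    for (int d = n >> 1; d > 0; d >>= 1) {
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset <<= 1;
    }
    if (tid == 0) temp[n - 1] = 0;         // replace the total by the identity

    // Downsweep part: left output = right input,
    //                 right output = left input + right input.
    for (int d = 1; d < n; d <<= 1) {
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            int t = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();
    d_out[2 * tid] = temp[2 * tid];
    d_out[2 * tid + 1] = temp[2 * tid + 1];
}

// Hypothetical host helper; n must be a power of two with 2 <= n <= 2048.
std::vector<int> scan_exclusive_blelloch(const std::vector<int>& h_in) {
    int n = static_cast<int>(h_in.size());
    assert(n >= 2 && n <= 2048 && (n & (n - 1)) == 0);
    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    blelloch_scan<<<1, n / 2, n * sizeof(int)>>>(d_out, d_in, n);

    std::vector<int> h_out(n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

For the input 1..8 the reduce part leaves 1, 3, 3, 10, 5, 11, 7, 36 in shared memory, which matches the drop-down values listed above, and the downsweep turns it into 0, 1, 3, 6, 10, 15, 21, 28.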

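To sum up, the Blelloch method has a work complexity of O(n) and takes 2*log(n) steps. Compared with Hillis + Steele, Blelloch does less total work but needs twice as many steps. So which of the two is faster?

ANS: It depends on the GPU, the problem size and the optimizations used in the implementation. It can even change while a program runs: at the start there is a lot of data (work > processors), which suits a work-efficient parallel algorithm (e.g. Blelloch); as the program proceeds the amount of work shrinks (processors > work), and a step-efficient parallel algorithm (e.g. Hillis + Steele) becomes the better fit; later the data may grow again, and the work-efficient algorithm is the right choice once more…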


To summarize, the table below gives the complexity of each method and the kind of problem each one suits:

                                            serial   Hillis + Steele   Blelloch
work                                        O(n)     O(n log n)        O(n)
step                                        n        log(n)            2*log(n)
512-element vector, 512 processors                   ✓
1-million-element vector, 512 processors                               ✓
128k-element vector, 1 processor            ✓





4. Histogram

4.1. What is a histogram?

As the name suggests, a histogram displays a statistic as bars: the data range is split into bins and each bin records how many items fall into it.

4.2. Serial implementation of histogram:

Two parts: 1. initialization, 2. counting

```c
// 1. Initialization: set every bin to zero.
for (int i = 0; i < BIN_COUNT; i++)
    res[i] = 0;
// 2. Counting: computeBin maps element i to the index of its bin.
for (int i = 0; i < nElements; i++)
    res[computeBin(i)]++;
```

4.3. Parallel implementation of histogram:

  • Naive implementation:
  • kernel:

```cuda
__global__ void naive_histo(int *d_bins, const int *d_in, const int BIN_COUNT)
{
    int myID = threadIdx.x + blockDim.x * blockIdx.x;
    int myItem = d_in[myID];
    int myBin = myItem % BIN_COUNT;
    d_bins[myBin]++; // unsynchronized read-modify-write: updates can get lost
}
```

What is wrong with this? It is the read-modify-write problem from last lecture again: several threads may read the same bin value, increment it and write it back, so increments get lost. The serial implementation does not have this problem. So how can a parallel histogram be computed correctly?

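Method 1. accumulate using atomics
That is, change the last statement to
atomicAdd(&(d_bins[myBin]), 1);
With atomics, however, no matter how good the GPU is, the usable parallelism is capped by the number of histogram bins N: updates to the same bin are serialized, so at most N threads make progress at the same time.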
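Put together, Method 1 might look like the sketch below. The names `atomic_histo` and `histogram_atomic`, the bounds check and the block size of 256 are illustrative choices rather than part of the original code, and the input items are assumed to be non-negative.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

__global__ void atomic_histo(int* d_bins, const int* d_in, int n, int bin_count) {
    int myID = threadIdx.x + blockDim.x * blockIdx.x;
    if (myID >= n) return;                 // guard the last, partially filled block
    int myBin = d_in[myID] % bin_count;    // assumes non-negative items
    atomicAdd(&d_bins[myBin], 1);          // serialized read-modify-write per bin
}

// Hypothetical host helper returning the bin counts.
std::vector<int> histogram_atomic(const std::vector<int>& h_in, int bin_count) {
    assert(bin_count > 0 && !h_in.empty());
    int n = static_cast<int>(h_in.size());
    int *d_in, *d_bins;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_bins, bin_count * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, bin_count * sizeof(int));   // initialization step

    const int threads = 256;
    int blocks = (n + threads - 1) / threads;
    atomic_histo<<<blocks, threads>>>(d_bins, d_in, n, bin_count);

    std::vector<int> h_bins(bin_count);
    cudaMemcpy(h_bins.data(), d_bins, bin_count * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_bins);
    return h_bins;
}
```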


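Method 2. local memory + reduce
Launch n parallel threads, each with its own local histogram (a vector with one entry per bin). Each local histogram is accessed sequentially by a single thread, so no memory is shared between threads and the read-modify-write problem cannot occur even without atomics.
Then merge (i.e. sum) the n histograms, which can be done with reduce.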
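One way to sketch Method 2 is with two kernels: the first builds the private histograms, the second runs one tree reduce per bin. Everything here is an illustrative assumption: the names `local_histo`, `merge_histo` and `histogram_local`, the fixed bin count `kBins = 3`, the 64 threads, and non-negative input items.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int kBins = 3;                   // hypothetical compile-time bin count

// Each thread fills a private histogram, then stores it in bin-major layout.
__global__ void local_histo(int* d_local, const int* d_in, int n) {
    int t = threadIdx.x + blockDim.x * blockIdx.x;
    int total = blockDim.x * gridDim.x;
    int hist[kBins] = {0};                 // private to this thread
    for (int i = t; i < n; i += total)     // strided walk over the input
        hist[d_in[i] % kBins]++;
    for (int b = 0; b < kBins; ++b)
        d_local[b * total + t] = hist[b];
}

// One block per bin: tree reduce over the per-thread counts of that bin.
// Assumes blockDim.x is a power of two equal to the number of local histograms.
__global__ void merge_histo(int* d_bins, const int* d_local) {
    extern __shared__ int s[];
    int tid = threadIdx.x;
    s[tid] = d_local[blockIdx.x * blockDim.x + tid];
    __syncthreads();
    for (unsigned int k = blockDim.x / 2; k > 0; k >>= 1) {
        if (tid < k) s[tid] += s[tid + k];
        __syncthreads();
    }
    if (tid == 0) d_bins[blockIdx.x] = s[0];
}

// Hypothetical host helper returning kBins counts.
std::vector<int> histogram_local(const std::vector<int>& h_in) {
    assert(!h_in.empty());
    const int T = 64;                      // threads = local histograms
    int n = static_cast<int>(h_in.size());
    int *d_in, *d_local, *d_bins;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_local, kBins * T * sizeof(int));
    cudaMalloc(&d_bins, kBins * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    local_histo<<<1, T>>>(d_local, d_in, n);
    merge_histo<<<kBins, T, T * sizeof(int)>>>(d_bins, d_local);

    std::vector<int> h_bins(kBins);
    cudaMemcpy(h_bins.data(), d_bins, kBins * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_local);
    cudaFree(d_bins);
    return h_bins;
}
```

The price of avoiding atomics is memory: the intermediate buffer holds one full histogram per thread, so this approach only pays off when the number of bins is small.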

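Method 3. sort then reduce by key
Organize the data as key-value pairs, where the key is the histogram bin and the value is 1, i.e.

key     2 1 1 2 1 0 2 2
value   1 1 1 1 1 1 1 1

Sort them by key to get:

key     0 1 1 1 2 2 2 2
value   1 1 1 1 1 1 1 1

Then reduce (sum) the values of equal keys to obtain the total count of each histogram bin.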
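The original text does not name a library for this step. One option is Thrust, which ships with the CUDA toolkit and provides `thrust::sort` and `thrust::reduce_by_key`. In the sketch below the helper name `histogram_sort_reduce` is hypothetical, keys are assumed to lie in [0, bin_count), and the constant value 1 is supplied by a `thrust::constant_iterator` instead of a stored array.

```cuda
#include <thrust/device_vector.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>
#include <cassert>
#include <vector>

// Hypothetical helper: h_keys[i] is the bin index of item i.
std::vector<int> histogram_sort_reduce(const std::vector<int>& h_keys, int bin_count) {
    assert(bin_count > 0);
    thrust::device_vector<int> keys(h_keys.begin(), h_keys.end());

    // Step 1: sort by key so that equal bins become adjacent.
    thrust::sort(keys.begin(), keys.end());

    // Step 2: sum the value 1 over every run of equal keys.
    thrust::device_vector<int> out_keys(bin_count);
    thrust::device_vector<int> out_counts(bin_count);
    auto ends = thrust::reduce_by_key(keys.begin(), keys.end(),
                                      thrust::constant_iterator<int>(1),
                                      out_keys.begin(), out_counts.begin());
    int found = static_cast<int>(ends.first - out_keys.begin());

    // Bins that never occur produce no key, so scatter into a zeroed result.
    std::vector<int> hist(bin_count, 0);
    for (int i = 0; i < found; ++i) {
        int key = out_keys[i];
        int count = out_counts[i];
        hist[key] = count;
    }
    return hist;
}
```

For the keys 2 1 1 2 1 0 2 2 from the example above, this yields the counts 1, 3, 4 for bins 0, 1, 2.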


To sum up, there are three ways to implement a parallel histogram:
1. atomics
2. per-thread histogram, then reduce
3. sort, then reduce by key


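5. Summary:

This article introduced the serial and parallel implementations of three basic GPU algorithms: reduce, scan and histogram, and put the GPU memory knowledge from earlier lectures into practice.


from: http://blog.csdn.net/abcjennifer/article/details/43528407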
