The Evolution of Trajectory Prediction (Part 1/2)
Self-Driving Cars
“If you recognize that self-driving cars are going to prevent car accidents, AI will be responsible for reducing one of the leading causes of death in the world.” — Mark Zuckerberg
Whenever we think about the AI world, the auto industry immediately comes to mind. Self-driving cars are one of those fascinating visions of the future that no longer seem like a distant reality: we just sit inside the car and watch a movie while it takes us to the destination.
But is it really that easy for cars to drive fully autonomously and pay attention to every context in an environment? In the past few years, many papers have been published on predicting socially acceptable future trajectories for cars and pedestrians.
Question: What will be one of the biggest challenges of self-driving cars?
Answer: Understanding pedestrian behavior and their future trajectory.
Human motion can be described as multimodal, i.e. humans can move in multiple directions at any given instant of time. This behavior is one of the biggest challenges for self-driving cars, since they have to navigate through a human-centric world.
In this first part, I will briefly discuss three papers whose main aim is to predict possible future trajectories of pedestrians.
Social GAN
This is one of the initial papers to use a GAN to predict the possible trajectories of humans.
This paper tries to solve the problem by predicting the socially plausible future trajectories of humans that will help self-driving cars in making the right decision.
Aim
The paper aims at resolving two major challenges:
Method
Fig 1. Screenshot from the Social GAN paper

This paper presented a GAN-based encoder-decoder network that consists of an LSTM network for each person and a Pooling Module that models the interactions among them.
The whole model (shown in Fig 1.) can be represented by 3 different components:
Generator
The generator consists of an encoder and a decoder. For each person i, the encoder takes the observed trajectory X_i as input. It embeds the location of each person and provides it as a fixed-length vector to the LSTM cell at time t.
The LSTM weights are shared among all the people in a scene, which helps the pooling module model the interactions among them.
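A minimal sketch of this per-person encoding step, not the authors' code: a plain tanh RNN cell stands in for the LSTM, the dimensions (EMB, HID) are made up, and the same shared weights encode every person in the scene.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared weights: one set for everyone in the scene (hypothetical sizes).
EMB, HID = 16, 32
W_emb = rng.standard_normal((2, EMB)) * 0.1    # (x, y) -> fixed-length embedding
W_in = rng.standard_normal((EMB, HID)) * 0.1   # recurrent cell input weights
W_hh = rng.standard_normal((HID, HID)) * 0.1   # recurrent cell hidden weights

def encode(trajectory):
    """Encode one person's past trajectory (T, 2) into a hidden state (HID,)."""
    h = np.zeros(HID)
    for xy in trajectory:
        e = xy @ W_emb                          # embed the (x, y) location
        h = np.tanh(e @ W_in + h @ W_hh)        # shared RNN cell (LSTM stand-in)
    return h

# Three pedestrians, 8 observed timesteps each; the SAME weights encode all of them.
scene = rng.standard_normal((3, 8, 2))
hidden = np.stack([encode(p) for p in scene])
print(hidden.shape)  # (3, 32)
```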
Unlike prior work, this paper uses the following 2 approaches:
a) For easier training during backpropagation, instead of predicting a bivariate Gaussian distribution, the decoder directly produces the (x, y) coordinates of the person's location.
b) Instead of providing the social context directly as input to the encoder, they provide it once as input to the decoder. This led to a speed-up of up to 16×.
Discriminator
The discriminator consists of an encoder with LSTM layers for each person. The idea of the discriminator is to distinguish real trajectories from fake ones.
Ideally, it should classify the trajectories as “fake” if they are not socially acceptable or possible.
Pooling Module
Fig 2. Screenshot from the Social GAN paper

The basic idea of this approach is shown in Fig 2. The method computes the relative positions of person 1 (shown in red) with respect to all other people (shown in blue and green). Each relative position is concatenated with the corresponding hidden state and processed independently through an MLP (multi-layer perceptron).
Eventually, the resulting vectors are max-pooled elementwise to compute person 1's pooling vector P1.
This method removes the limitation of only considering the people inside a particular grid (the S-Pool grid in Fig 2).
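The pooling step can be sketched as follows. This is a simplified, hypothetical NumPy version (one-layer MLP, invented dimensions): relative positions are concatenated with hidden states, passed through the MLP, and max-pooled into a single vector per person.

```python
import numpy as np

rng = np.random.default_rng(1)
HID, OUT = 32, 24

positions = rng.standard_normal((3, 2))      # (x, y) of 3 people in the scene
hidden = rng.standard_normal((3, HID))       # their encoder hidden states
W_mlp = rng.standard_normal((2 + HID, OUT)) * 0.1

def pooling_vector(i):
    """Compute P_i for person i (simplified: one-layer ReLU MLP, elementwise max pool)."""
    rows = []
    for j in range(len(positions)):
        if j == i:
            continue
        rel = positions[j] - positions[i]            # relative position to person j
        feat = np.concatenate([rel, hidden[j]])      # concatenate with hidden state
        rows.append(np.maximum(feat @ W_mlp, 0.0))   # MLP with ReLU
    return np.max(rows, axis=0)                      # elementwise max pool over others

P1 = pooling_vector(0)
print(P1.shape)  # (24,)
```

Because the max pool is symmetric, the result does not depend on the order of the other people, which is what lets the model handle a variable number of neighbours.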
Losses
3 different losses are used in this paper:
Adversarial loss: This loss is a typical GAN loss that helps in differentiating real and fake trajectories.
L2 Loss: This loss takes the distance between the predicted and ground-truth trajectories and measures how far the generated samples are from the real ones.
Variety Loss: This loss helps in generating multiple different trajectories, i.e. multimodal trajectories. The idea is very simple: for each input, N different possible outcomes are predicted by randomly sampling 'z' from N(0, 1), and the best trajectory, the one with the minimum L2 value, is selected.
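A small sketch of the variety loss under these assumptions: a toy linear decoder W stands in for the real generator, and the trajectory shapes are invented.

```python
import numpy as np

rng = np.random.default_rng(2)

def variety_loss(ground_truth, samples):
    """Best-of-k L2 loss: among k generated trajectories, keep only the closest one."""
    return min(np.linalg.norm(s - ground_truth) for s in samples)

# For each input, draw k noise vectors z ~ N(0, 1) and decode each one (toy decoder W).
k = 20
W = rng.standard_normal((8, 24)) * 0.5                 # hypothetical decoder weights
samples = [(rng.standard_normal(8) @ W).reshape(12, 2) for _ in range(k)]
gt = rng.standard_normal((12, 2))                      # ground-truth future trajectory

print(variety_loss(gt, samples))
```

Only the best sample is penalized, so the generator is free to spread its other samples over different plausible modes instead of collapsing them onto one average path.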
SoPhie: An Attentive GAN
This paper has extended the work of Social GAN and tried to predict a future path for an agent with the help of both physical and social information.
Although the aim is still the same as Social GAN, this paper also adds scene information, with the help of an image of each frame.
The network learns two types of attention components:
Physical Attention: This attention component helps in attending to and processing the local and global spatial information of the surroundings. As mentioned in the paper: "For example, when reaching a curved path, we focus more on the curve rather than other constraints in the environment."
Social Attention: In this component, the idea is to give more attention to the movements and decisions of the other agents in the surrounding environment. For example: "when walking in a corridor, we pay more attention to people in front of us rather than the ones behind us."
Method
Fig 3. Screenshot from the SoPhie paper

The proposed approach is divided into 3 modules (as shown in Fig 3).
Feature Extractor Module
This module extracts features from the input in 2 different forms: first, as an image of each frame, and second, as the state of each agent in each frame at time 't'.
To extract visual features from the image, they use VGGnet-19 as the CNN, with weights initialized from ImageNet pretraining. To extract features from the past trajectories of all agents, they follow a similar approach to Social GAN and use an LSTM as the encoder.
To understand the interactions between agents and capture the influence of each agent's trajectory on the others, Social GAN used the pooling module. This paper mentions 2 limitations of that method:
Because of these limitations, they define an ordering structure. Here, they use sort as the permutation-invariant function instead of max (used in Social GAN), sorting the agents by the Euclidean distance between the target agent and the other agents.
Attention Module
With the help of physical or social attention, this module highlights the important information in the input for the next module.
The idea is that, just as humans pay more attention to certain obstacles or objects in an environment, like an upcoming turn or people walking towards them, a similar kind of attention needs to be learned.
As mentioned before, this network learns 2 different kinds of attention.
In physical attention, the hidden states of the LSTM from the GAN module and the features learned from the visual context are provided as input. This helps the network learn more about physical constraints, such as whether the path is straight or curved, the current movement direction, the position, and more.
In social attention, the LSTM features learned in the feature extractor module, together with the hidden states of the LSTM from the GAN module, are provided as input. This helps the network focus on all the agents that are important for predicting the future trajectory.
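Both attention components boil down to the same soft-attention pattern. The sketch below uses made-up dimensions and random features; the real model learns the scoring function rather than using a raw dot product as here.

```python
import numpy as np

rng = np.random.default_rng(3)
HID = 32

def soft_attention(query, features):
    """Weight each feature row by its scalar-product score with the query."""
    scores = features @ query                  # one score per agent (or image region)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over agents/regions
    return weights, weights @ features         # attention-weighted context vector

decoder_hidden = rng.standard_normal(HID)      # hidden state from the GAN module
agent_features = rng.standard_normal((4, HID)) # per-agent features (the social case)

w, context = soft_attention(decoder_hidden, agent_features)
print(w.sum(), context.shape)
```

For physical attention, the rows of `features` would instead be the CNN features of image regions; the mechanism is the same.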
GAN Module
This module takes the highlighted input features and generates a realistic future path for each agent that satisfies all the social and physical norms.
This GAN module is majorly inspired by the Social GAN with almost no further changes.
The input to the generator is the selected features from the attention module as well as white noise ‘z’ sampled from a multivariate normal distribution.
Losses
This approach has used 2 losses which are also similar to Social GAN.
L2 Loss: This loss is similar to “variety loss” used in Social GAN.
Social Ways
In this paper, they are also trying to predict the pedestrian’s trajectories and their interaction. However, they are also aiming to solve one problem from all previous approaches: Mode Collapse.
Mode collapse is the opposite of multimodality: the generator produces similar samples, or the same set of samples, leading to similar modes in the output.
To solve the mode collapse problem, this paper uses an InfoGAN-style objective instead of the L2 or variety loss.
Method
Fig 4. Screenshot from the Social Ways paper

This method comprises 3 different components:
Generator
The generator consists of an encoder-decoder network. The past trajectory of each agent is fed into its own LSTM-E (encoder), which encodes the agent's history. The output of each LSTM-E is fed into the attention pooling as well as into the decoder.
To decode the future trajectories, the hidden states from LSTM-E, a noise vector 'z', a latent code 'c', and the important interacting-agent features from the attention pooling are fed into the decoder.
The latent code ‘c’ helps in maximizing a lower bound of the mutual information between the distribution of generated output and ‘c’.
Attention Pooling
This paper uses a similar kind of approach to the one in SoPhie: An Attentive GAN.
However, in addition to the Euclidean distance between agents (used in SoPhie), 2 more features are used:
Bearing angle: “the angle between the velocity vector of agent 1 and vectors joining agents 1 and agent 2.”
The distance of closest approach: “the smallest distance, 2 agents would reach in the future if both maintain their current velocity.”
Instead of sorting, the attention weights are obtained by a scalar product and a softmax operation between the hidden states and the three features mentioned above.
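A hedged sketch of these three interaction features and the resulting softmax weights. The real model also mixes in the LSTM hidden states, which are omitted here, and the projection `W` is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def interaction_features(p_i, v_i, p_j, v_j):
    """Euclidean distance, bearing angle, and distance of closest approach (DCA)."""
    d = p_j - p_i
    dist = np.linalg.norm(d)
    # Bearing: angle between agent i's velocity and the vector joining i to j.
    bearing = np.arctan2(v_i[0] * d[1] - v_i[1] * d[0], np.dot(v_i, d))
    # DCA: smallest future separation if both agents keep their current velocity.
    dv = v_j - v_i
    t_star = max(0.0, -np.dot(d, dv) / (np.dot(dv, dv) + 1e-9))
    dca = np.linalg.norm(d + t_star * dv)
    return np.array([dist, bearing, dca])

# Toy scene: target agent 0 and two neighbours (positions p, velocities v).
p = np.array([[0.0, 0.0], [2.0, 1.0], [-1.0, 3.0]])
v = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, -1.0]])
W = rng.standard_normal(3) * 0.1               # hypothetical feature projection

scores = np.array([interaction_features(p[0], v[0], p[j], v[j]) @ W for j in (1, 2)])
weights = np.exp(scores - scores.max())
weights /= weights.sum()                       # softmax attention weights
print(weights)
```

For two agents walking head-on towards each other, the DCA is 0, so such a neighbour can receive a very different attention weight than one merely standing nearby.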
Discriminator
The discriminator consists of an LSTM-based encoder with multiple dense layers. The trajectories generated by the generator for each agent, along with the ground-truth trajectories, are fed into the discriminator.
As output, it provides the probability that the generated trajectories are real.
Losses
There are 2 losses used in this process:
Adversarial Loss: This is the normal GAN loss that helps in differentiating between real and fake samples.
Information Loss: The basic idea of this loss is to maximize mutual information, which is achieved by minimizing the negative log-likelihood of the latent code 'c'.
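Assuming a Gaussian approximate posterior Q(c | trajectory), which is one common InfoGAN choice rather than necessarily the paper's exact head, the information loss reduces to a negative log-likelihood:

```python
import numpy as np

def info_loss(c_true, c_mean_pred, sigma=1.0):
    """Negative log-likelihood of the latent code c under a Gaussian Q(c | trajectory).

    Minimizing this surrogate maximizes a lower bound on the mutual information
    between c and the generated trajectory (the InfoGAN argument)."""
    c_true = np.asarray(c_true, dtype=float)
    c_mean_pred = np.asarray(c_mean_pred, dtype=float)
    k = c_true.size
    sq = np.sum((c_true - c_mean_pred) ** 2) / (2 * sigma**2)
    return sq + 0.5 * k * np.log(2 * np.pi * sigma**2)

# Recovering c exactly gives the smallest possible loss for a fixed sigma.
print(info_loss([0.5, -0.5], [0.5, -0.5]) < info_loss([0.5, -0.5], [0.0, 0.0]))  # True
```

Intuitively, the generator is rewarded for making trajectories from which 'c' can be decoded, so distinct values of 'c' must map to distinct modes, which counteracts mode collapse.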
Results
All three papers have tried to learn from previous approaches and have gained some new insights.
There are 2 metrics that are used to evaluate this application:
Average Displacement Error (ADE): The average L2 distance, over all predicted timesteps, between the generated trajectory and the ground-truth trajectory.
Final Displacement Error (FDE): The L2 distance between the generated trajectory and the ground-truth trajectory at the final predicted timestep (in best-of-N evaluation, the smallest such distance over the samples).
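The two metrics are straightforward to compute; a minimal sketch (best-of-N variants simply take the minimum of these values over N generated samples):

```python
import numpy as np

def ade(pred, gt):
    """Average L2 distance over all predicted timesteps."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def fde(pred, gt):
    """L2 distance at the final predicted timestep."""
    return float(np.linalg.norm(pred[-1] - gt[-1]))

gt = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])    # true path along the x-axis
pred = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])  # prediction drifting upward

print(ade(pred, gt))  # (0 + 1 + 2) / 3 = 1.0
print(fde(pred, gt))  # 2.0
```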
Five datasets are used for benchmarking this application. All three approaches have been tested on all of them, providing valuable, comparable results.
Fig 5. Screenshot of results from the Social Ways paper

From Fig 5, it can be seen that all three approaches show promising results on some of the datasets. However, I think that with hyperparameter tuning and small adjustments, the rankings might shuffle as well. I believe all three have certain advantages and could be used for further research in this area.
Conclusion
The problem of modeling human motion prediction in a scene, along with human-human interaction, is challenging yet vital for self-driving cars. Without modeling this behavior, self-driving cars cannot be fully operational.
How to model human-human interaction is the major difference between the approaches above. From the theory and the suggested methods, I believe attention over the distance and the bearing angle between two agents is one of the most promising ways forward.
But that is entirely my perspective. There could be multiple ways this could be implemented and enhanced, and we will see that in Part 2 too.
With self-driving cars as the focus, I will continue with more approaches in Part 2, concentrating on the trajectory prediction of cars.
I am happy for any further discussion on this paper and in this area. You can leave a comment here or reach out to me on my LinkedIn profile.
對(duì)于本文和該領(lǐng)域的任何進(jìn)一步討論,我感到很高興。 您可以在此處發(fā)表評(píng)論,也可以通過(guò)我的LinkedIn個(gè)人資料與我聯(lián)系。
翻譯自: https://towardsdatascience.com/trajectory-prediction-self-driving-cars-ai-40a7c6eecb4c
總結(jié)
以上是生活随笔為你收集整理的轨迹预测演变(第1/2部分)的全部?jī)?nèi)容,希望文章能夠幫你解決所遇到的問(wèn)題。
- 上一篇: i53350p性能
- 下一篇: 人口预测和阻尼-增长模型_使用分类模型预