
How Deep Learning Can Keep You Safe with Real-Time Crime Alerts

Citizen scans thousands of public first responder radio frequencies 24 hours a day in major cities across the US. The collected information is used to provide real-time safety alerts about incidents like fires, robberies, and missing persons to more than 5M users. Having humans listen to 1000+ hours of audio daily made it very challenging for the company to launch new cities. To continue scaling, we built ML models that could discover critical safety incidents from audio.


Our custom software-defined radios (SDRs) capture large swathes of radio frequency (RF) and create optimized audio clips that are sent to an ML model to flag relevant clips. The flagged clips are sent to operations analysts to create incidents in the app, and finally, users near the incidents are notified.


Figure 1. Safety alerts workflow (Image by Author)

Adapting a Public Speech-to-Text Engine to Our Problem Domain

Figure 2. Clip classifier using a public speech-to-text engine (Image by Author)

We started with a top-performing speech-to-text engine, selected based on word error rate (WER). There are a lot of special codes used by police that are not part of the normal vernacular. For example, an NYPD officer requests backup units by transmitting a "Signal 13". We customized the vocabulary to our domain using speech contexts.


We also boosted some words to fit our domain. For example, "assault" isn't used much colloquially but is very common in our use case, so we had to bias our models towards detecting "assault" over "a salt".

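As a rough illustration, the sketch below shows how domain phrases and boosts can be passed to a public engine through speech contexts. It assumes Google Cloud Speech-to-Text purely as an example provider; the phrases, boost value, and file name are stand-ins, not our actual vocabulary or pipeline.

```python
# Minimal sketch: biasing a public speech-to-text engine toward domain terms.
# Provider, phrases, and boost value are illustrative assumptions.
from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()

domain_context = speech.SpeechContext(
    phrases=["Signal 13", "assault", "robbery in progress"],
    boost=15.0,  # bias recognition toward these phrases over homophones like "a salt"
)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    speech_contexts=[domain_context],
)

with open("radio_clip.wav", "rb") as f:  # hypothetical optimized audio clip
    audio = speech.RecognitionAudio(content=f.read())

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```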

After tuning the parameters, we were able to get reasonable accuracy for transcriptions in some cities. The next step was to use the transcribed data of the audio clips and figure out which ones were relevant to Citizen.


Binary Classifier Based on Transcriptions and Audio Features

We modeled a binary classification problem with the transcriptions as input and a confidence level as output. XGBoost gave us the best performance on our dataset.

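As a minimal sketch (not our production code), a transcription-based relevance classifier can be put together from TF-IDF text features and XGBoost; the toy transcripts, labels, and hyperparameters below are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Toy examples standing in for real transcripts and analyst-provided labels.
transcripts = [
    "signal 13 shots fired riverside and 145th",
    "robbery in progress two males fled eastbound",
    "routine traffic stop running plates",
    "meal break requested at the precinct",
]
labels = [1, 1, 0, 0]  # 1 = relevant safety incident

vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X = vectorizer.fit_transform(transcripts)

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=0)

clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1, eval_metric="logloss")
clf.fit(X_train, y_train)

# predict_proba gives the confidence level used to flag clips for analysts.
print(clf.predict_proba(X_test)[:, 1])
```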

We had insight from someone who previously worked in law enforcement that radio transmissions about major incidents in some cities are preceded by special alert tones to get the attention of police on the ground. This extra feature helped make our model more reliable, especially in cases of bad transcriptions. Some other useful features we found were the police channel and transmission IDs.

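Continuing the toy sketch above, extra signals such as an alert-tone flag and the police channel can be appended to the text-feature matrix X before training; the tone detector, feature names, and channel values here are hypothetical.

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.preprocessing import OneHotEncoder

# Hypothetical extra features per clip: whether an alert tone preceded the
# transmission (from a separate tone detector), and which channel it came from.
alert_tone = np.array([[1], [1], [0], [0]])
channels = np.array([["NYPD-Citywide-1"], ["NYPD-Bronx"], ["NYPD-Citywide-1"], ["NYPD-Bronx"]])

channel_onehot = OneHotEncoder(handle_unknown="ignore").fit_transform(channels)

# Concatenate the TF-IDF features (X from the previous sketch) with the extra signals.
X_full = hstack([X, csr_matrix(alert_tone), channel_onehot])
```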

We A/B tested the ML model in the operations workflow. After a few days of running the test, we saw no degradation in the incidents created by analysts who were using only the model-flagged clips.


We launched the model in a few cities. Now a single analyst could handle multiple cities at once, which wasn’t previously possible! With the new spare capacity on operations, we were able to launch multiple new cities.


Figure 3. Model rollout leading to a significant reduction in audio for analysts (Image by Author)

Beyond a Public Speech-to-Text Engine

The model didn't turn out to be a panacea for all our problems. We could only use it in a few cities that had good-quality audio. Public speech-to-text engines are trained on phone audio, which has a different acoustic profile than radio; as a result, the transcription quality was sometimes unreliable. Transcriptions were completely unusable for the older analog systems, which were very noisy.


We tried multiple models from multiple providers, but none of them were trained on an acoustic profile similar to our dataset, and none could handle noisy audio.


We explored replacing the speech-to-text engine with one trained on our own data while keeping the rest of the pipeline the same. However, we would have needed several hundred hours of transcribed audio, which was very slow and expensive to generate. We could have optimized the process by transcribing only the "important" words defined in our vocabulary and leaving blanks for the irrelevant words, but that was still just an incremental reduction in effort.


Eventually, we decided to build a custom speech processing pipeline for our problem domain.


Convolutional Neural Network for Keyword Spotting

Since we only care about the presence of keywords, we didn’t need to find the right order of words and could reduce our problem to keyword spotting. That was a much easier problem to solve! We decided to do so using a convolutional neural network (CNN) trained on our dataset.


Using CNNs over recurrent neural networks (RNNs) or long short-term memory (LSTM) models meant that we could train much faster and iterate more quickly. We also evaluated the Transformer architecture, which is massively parallel but requires a lot of hardware to run. Since we were only looking for short-term dependencies between audio segments to detect words, a computationally simple CNN seemed a better choice than a Transformer, and it freed up hardware for us to be more aggressive with hyperparameter tuning.


Figure 4. Clip flagging model with a CNN for keyword spotting (Image by Author)

We split the audio clips into fixed-duration subclips. We gave a positive label to a subclip if a vocabulary word was present, and then marked an audio clip as useful if any such subclip was found in it. During the training process, we tested how varying the duration of the subclips affected convergence. Long subclips made it much harder for the model to figure out which portion of the clip was useful, and also harder to debug. Short subclips meant that words partially appeared across multiple clips, which made it harder for the model to identify them. We were able to tune this hyperparameter and find a reasonable duration.

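A minimal sketch of that subclip labeling logic is below; the 2-second subclip duration and the annotation format are assumptions, since we tuned the actual duration as a hyperparameter.

```python
from dataclasses import dataclass

@dataclass
class WordSpan:
    word: str
    start: float  # seconds from the beginning of the clip
    end: float

def label_subclips(clip_duration, word_spans, subclip_seconds=2.0):
    """Split a clip into fixed-duration subclips and give a positive label to
    any subclip that overlaps an annotated vocabulary word."""
    labels = []
    t = 0.0
    while t < clip_duration:
        t_end = min(t + subclip_seconds, clip_duration)
        positive = any(span.start < t_end and span.end > t for span in word_spans)
        labels.append(((t, t_end), int(positive)))
        t += subclip_seconds
    return labels

# A clip is marked useful if any of its subclips is positive.
spans = [WordSpan("assault", 3.1, 3.6)]
subclips = label_subclips(clip_duration=10.0, word_spans=spans)
clip_is_useful = any(lbl for (_, lbl) in subclips)
```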

For each subclip, we convert the audio into MFCC coefficients and add the first- and second-order derivatives. The features are generated with a frame size of 25 ms and a stride of 10 ms. They are then fed into a neural network built with the Keras Sequential API on a TensorFlow backend. The first layer is a GaussianNoise layer, which makes the model more robust to noise differences between different radio channels. We tried an alternative approach of artificially overlaying real noise onto clips, but that slowed down training significantly with no meaningful performance gains.

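A sketch of that feature extraction step, using librosa to compute MFCCs with first- and second-order deltas at 25 ms frames and a 10 ms stride; the number of coefficients and the sample rate are assumptions.

```python
import numpy as np
import librosa

def subclip_features(audio_path, n_mfcc=13, sr=16000):
    """MFCCs plus first- and second-order deltas, 25 ms frames with a 10 ms stride."""
    y, sr = librosa.load(audio_path, sr=sr)
    n_fft = int(0.025 * sr)       # 25 ms frame
    hop_length = int(0.010 * sr)  # 10 ms stride
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=n_fft, hop_length=hop_length)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    feats = np.concatenate([mfcc, delta, delta2], axis=0)  # (3 * n_mfcc, frames)
    return feats.T  # (frames, 3 * n_mfcc), ready for a Conv1D over time
```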

We then added Conv1D, BatchNormalization, and MaxPooling1D layers. Batch normalization helped with model convergence, and max pooling made the model more robust to minor variations in speech and to channel noise. We also tried adding dropout layers, but those didn't improve the model meaningfully. Finally, we added a densely connected layer feeding into a single-unit output layer with sigmoid activation.

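A sketch of the architecture just described, written with the Keras Sequential API; the filter counts, kernel sizes, noise level, and the pooling choice before the dense layer are assumptions rather than our exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_keyword_model(frames, n_features):
    """GaussianNoise, Conv1D/BatchNorm/MaxPool blocks, a dense layer, sigmoid output."""
    model = models.Sequential([
        layers.Input(shape=(frames, n_features)),
        layers.GaussianNoise(0.05),  # robustness to per-channel noise differences
        layers.Conv1D(64, kernel_size=5, activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(128, kernel_size=3, activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=2),
        layers.GlobalMaxPooling1D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # does the subclip contain a vocabulary word?
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.Recall(), tf.keras.metrics.Precision()])
    return model

# e.g. 2 s subclips at a 10 ms stride, 13 MFCCs plus deltas
model = build_keyword_model(frames=200, n_features=39)
```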

Generating Labeled Data

Figure 5. Labeling process for audio clips (Image by Author)

To label the training data, we gave annotators the list of keywords for our domain and asked them to mark the start and end positions within a clip, along with the word label, whenever any of the vocabulary words were present.


To ensure the annotations were reliable, we had a 10% overlap across annotators and measured how they performed on the overlapping clips. Once we had ~50 hours of labeled data, we started the training process, and we kept collecting more data while iterating on it.

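As an illustration, agreement on the overlapping clips can be scored at the clip level; the metric (Cohen's kappa) and the toy labels below are assumptions, since the exact scoring method isn't covered here.

```python
from sklearn.metrics import cohen_kappa_score

# Clip-level labels from two annotators on the overlapping 10% of clips
# (1 = clip contains a vocabulary word).
annotator_a = [1, 0, 1, 1, 0, 0, 1]
annotator_b = [1, 0, 1, 0, 0, 0, 1]
print("agreement (Cohen's kappa):", cohen_kappa_score(annotator_a, annotator_b))
```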

Since some words in our vocabulary were much more common than others, our model performed reasonably on common words but struggled with rarer words that had fewer examples. We tried creating artificial examples of those words by overlaying the word utterances onto other clips. However, the performance gains were not commensurate with those from actually collecting labeled data for those words. Eventually, as our model improved on common words, we ran it on unlabeled audio clips and excluded the ones where the model already found those words. That helped us reduce the redundant common words in our future labeling.

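A sketch of that pre-screening step is below; it assumes per-word confidence scores from the model, and the word list and threshold are made up for illustration.

```python
# Words the model already handles well (assumed list).
COMMON_WORDS_LEARNED = {"robbery", "assault"}

def worth_labeling(word_confidences, threshold=0.8):
    """Skip a clip if the model confidently finds only already-common words in it."""
    found = {w for w, p in word_confidences.items() if p >= threshold}
    return not (found and found <= COMMON_WORDS_LEARNED)

# Example: the model is confident this clip only contains "robbery", so exclude it.
print(worth_labeling({"robbery": 0.93, "missing person": 0.12}))  # False -> skip labeling
```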

Model Launch

After several iterations of data collection and hyperparameter tuning, we were able to train a model with high recall on our vocabulary words and reasonable precision. High recall was very important to capture critical safety alerts. The flagged clips are always listened to before an alert is sent, so false positives were not a huge concern.

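Because recall mattered most, the flagging threshold can be chosen on a validation set to meet a recall target, as in the sketch below; the target value and toy scores are illustrative.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_recall(y_true, y_scores, target_recall=0.95):
    """Pick the highest flagging threshold whose recall still meets the target."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    ok = recall[:-1] >= target_recall  # recall[:-1] aligns with thresholds
    if not ok.any():
        return thresholds.min()
    return thresholds[ok].max()

# y_true / y_scores would come from a held-out validation set of labeled clips.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_scores = np.array([0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1])
print(threshold_for_recall(y_true, y_scores, target_recall=0.9))
```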

We A/B tested the model in some boroughs of New York City. The model was able to cut down audio volume by 50–75% (depending on the channel). It also clearly outperformed our model built on the public speech-to-text engine, since NYC has very noisy audio due to its analog systems.


Somewhat surprisingly, we then found that the model transferred well to audio from Chicago, even though it was trained on NYC data. After collecting a few hours of Chicago clips, we were able to transfer-learn from the NYC model to get reasonable performance in Chicago.

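A sketch of that transfer-learning step with Keras; the model path, the number of frozen layers, and the learning rate are assumptions.

```python
import tensorflow as tf

# Load the NYC-trained keyword model (path is hypothetical).
nyc_model = tf.keras.models.load_model("keyword_spotter_nyc.h5")

# Freeze the early convolutional blocks and fine-tune the rest on Chicago clips.
for layer in nyc_model.layers[:-3]:
    layer.trainable = False

nyc_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.Recall()])

# chicago_train / chicago_val would be tf.data.Dataset objects built from the
# few hours of labeled Chicago subclips:
# nyc_model.fit(chicago_train, validation_data=chicago_val, epochs=5)
```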

Conclusion

Our speech processing pipeline with the custom deep neural network was broadly applicable to police audio from major US cities. It discovered critical safety incidents from the audio, allowing Citizen to expand rapidly into cities across the country and serve the mission of keeping communities safe.


Picking a computationally simple CNN architecture over an RNN, LSTM, or Transformer, and simplifying our labeling process, were the major breakthroughs that allowed us to outperform public speech-to-text models in a very short time and with limited resources.


Original article: https://towardsdatascience.com/how-deep-learning-can-keep-you-safe-with-real-time-crime-alerts-95778aca5e8a
