Reading and Categorizing Scanned Documents Using Deep Learning
To many people's dismay, there is still a giant wealth of paper documents floating around out there in the world. Tucked into corner drawers, stashed in filing cabinets, overflowing from cubicle shelves: these are headaches to keep track of, keep updated, and simply store. What if there existed a system where you could scan these documents, generate plain-text files from their contents, and automatically categorize them into high-level topics? Well, the technology to do all of this exists, and it's simply a matter of stitching it all together into a cohesive system, which is what we'll be going through in this article. The main technologies used will be OCR (Optical Character Recognition) and topic modeling. Let's get started!
(Image: Telegraph UK)

Collecting Data
The first thing we're going to do is create a simple dataset so that we can test each portion of our workflow and make sure it's doing what it's supposed to. Ideally, our dataset will contain scanned documents of various levels of legibility and from various time periods, along with the high-level topic each document belongs to. I couldn't locate a dataset with these exact specifications, so I got to work building my own. The high-level topics I decided on were government, letters, smoking, and patents. Random? Well, these were mainly chosen because a good variety of scanned documents was available for each of these areas. The wonderful sources below were used to extract the scanned documents for each of these topics:
Government/Historical: OurDocuments
Letters: LettersofNote
Patents: The Portal to Texas History (University of North Texas)
Smoking: Tobacco 800 Dataset
From each of these sources I picked 20 or so documents that were of a good size and legible to me, and put them into individual folders named for the topic.
After almost a full day of searching for and cataloging all the images, I resized them all to 600x800 and converted them into .PNG format. The finished dataset is available for download here.
(Figure: some of the scanned documents we will be analyzing.)

The simple resizing and conversion script is below:
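The original script was an embedded gist, so here is a minimal sketch of that step, assuming Pillow is installed; the function and folder names are illustrative:

```python
# Minimal sketch of the resize-and-convert step (function and folder names
# are illustrative). Walks the input folder, resizes every readable image
# to 600x800, and writes it out as a .png. Requires Pillow.
import os
from PIL import Image

def resize_and_convert(input_dir, output_dir, size=(600, 800)):
    os.makedirs(output_dir, exist_ok=True)
    for name in os.listdir(input_dir):
        try:
            img = Image.open(os.path.join(input_dir, name))
        except IOError:
            continue  # skip anything that isn't an image
        base, _ = os.path.splitext(name)
        img.convert("RGB").resize(size).save(
            os.path.join(output_dir, base + ".png"), "PNG")
```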
Building the OCR Pipeline
Optical Character Recognition is the process of extracting written text from images. This is usually done via machine learning models, most often through pipelines incorporating convolutional neural networks. While we could train a custom OCR model for our application, it would require far more training data and computation resources. We will instead utilize the fantastic Microsoft Computer Vision API, which includes a module built specifically for OCR. You will need to register for a free-tier account (sufficient for use with document scanning); the API call will consume an image (as a PIL image) and output several bits of information, including the location/orientation of the text on the image as well as the text itself. The following function will take in a list of PIL images and output an equally sized list of extracted texts:
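A hedged sketch of what such a function might look like against the v2.0 REST endpoint; the region in the URL and the subscription key are placeholders, and the parsing helper assumes the documented regions/lines/words JSON shape:

```python
# Hedged sketch of OCR over a list of PIL images via the Microsoft Computer
# Vision REST API. The region in OCR_URL and the subscription key are
# placeholders; the JSON shape assumed here is regions -> lines -> words.
import io
import requests

SUBSCRIPTION_KEY = "YOUR_KEY_HERE"  # placeholder: use your own key
OCR_URL = "https://westus.api.cognitive.microsoft.com/vision/v2.0/ocr"

def extract_lines(ocr_json):
    """Flatten the nested OCR response into a plain text string."""
    lines = []
    for region in ocr_json.get("regions", []):
        for line in region.get("lines", []):
            lines.append(" ".join(w["text"] for w in line.get("words", [])))
    return "\n".join(lines)

def images_to_text(pil_images):
    """Return one extracted-text string per input PIL image."""
    texts = []
    for img in pil_images:
        buf = io.BytesIO()
        img.save(buf, format="PNG")
        resp = requests.post(
            OCR_URL,
            headers={
                "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
                "Content-Type": "application/octet-stream",
            },
            data=buf.getvalue(),
        )
        resp.raise_for_status()
        texts.append(extract_lines(resp.json()))
    return texts
```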
Post-processing
Since we might want to end our workflow here in some instances, instead of just holding onto the extracted text as a giant list in memory, we can also write the extracted texts out into individual .txt files with the same names as the original input files. While the OCR technology from Microsoft is good, it occasionally makes mistakes. We can mitigate some of these mistakes using the SpellChecker module. The following script accepts an input and an output folder, reads in all the scanned documents in the input folder, runs them through our OCR script, runs a spell check to correct misspelled words, and finally writes the raw .txt files into the output folder.
Preparing Text for Topic Modeling
If our set of scanned documents is large enough, writing them all into one large folder can make them hard to sort through, and we likely already have some kind of implicit grouping in the documents (especially if they came from something like a filing cabinet). If we have a rough idea of how many different “types” or topics of documents we have, we can use topic modeling to help identify these automatically. This will give us the infrastructure to split the text identified by OCR into individual folders based on document content. The topic model we will be using is called LDA, for Latent Dirichlet Allocation, and there's a great introduction to this type of model here. To run this model we will need a bit more pre-processing and organizing of our data, so to prevent our scripts from getting too long and congested, we will assume the scanned documents have already been read and converted to .txt files using the above workflow. The topic model will then read in these .txt files, classify them into however many topics we specify, and place them into the appropriate folders.
We'll start off with a simple function to read all the outputted .txt files in our folder into a list of (filename, text) tuples. This will help us keep track of the original filenames after we categorize the documents into topics.
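That function could be as simple as the following, using only the standard library (the folder layout is the one produced by the post-processing step above):

```python
# Read every .txt file in a folder into (filename, text) tuples so the
# original filenames survive the topic-assignment step.
import os

def load_texts(folder):
    docs = []
    for name in sorted(os.listdir(folder)):
        if name.endswith(".txt"):
            with open(os.path.join(folder, name)) as f:
                docs.append((name, f.read()))
    return docs
```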
Next, we will need to make sure that all useless words (ones that don't help us distinguish the topic of a particular document) are removed. We will do this using three different methods:
To achieve all of this (and our topic model) we will use the Gensim package. The script below will run the necessary pre-processing steps on a list of text (output from the function above) and train an LDA model.
Using the Topic Model to Categorize Documents
Once we have our LDA model trained, we can use it to categorize our set of training documents (and future documents that might come in) into topics and then place them into the appropriate folders.
Using the trained LDA model against a new text string requires some fiddling (in fact, I needed some help figuring it out myself; thank god for SO). All of the complication is contained in the function below:
Finally, we’ll need another method to get the actual name of the topic based on the topic index.
Putting it All Together
Now, we can stick all of the functions we wrote above into a single script that accepts an input folder, output folder, and topic count. The script will read all the scanned document images in the input folder, write them into .txt files, build an LDA model to find high level topics in the documents, and organize the outputted .txt files into folders based on document topic.
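The folder-organizing half of that driver can be sketched with the standard library alone; `assignments` maps each .txt filename to its predicted topic index (the classification step's output), and the sub-folder naming is illustrative:

```python
# Move each .txt file into a per-topic sub-folder of the output directory,
# given a mapping of filename -> predicted topic index.
import os
import shutil

def organize_by_topic(txt_dir, output_dir, assignments):
    for name, topic_idx in assignments.items():
        dest = os.path.join(output_dir, "topic_{}".format(topic_idx))
        os.makedirs(dest, exist_ok=True)
        shutil.move(os.path.join(txt_dir, name), os.path.join(dest, name))
```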
Demo
To prove all of the above wasn't just long-winded gibberish, here's a video demo of the system. There are many things that can be improved (most notably keeping track of line breaks from the scanned documents, handling special characters and languages other than English, and making requests to the computer vision API in batch instead of one by one), but we have a solid foundation to build improvements on. For more information, check out the associated GitHub repo.
Thanks for reading!
Original article: https://medium.com/@shairozsohail/reading-and-categorizing-scanned-documents-using-deep-learning-4ab2c0e3f34c