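A step-by-step example: using regular expressions in Python to extract specific content and write it to csv
Background and goal
This time I want to process a txt file that contains the output of one machine pinging another at regular intervals. I want to extract the time and the rtt values from it, and then write the results to a csv file.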
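1. Identify the content to extract and write the regular expression
The text to extract is the ping output; a sample of it is embedded in the test code in this section.
The first step is to write the regular expression, and there is no need to read the data file yet. Copy part of the data into a string first so the expression is easy to test.
The regular expression is written with the re module. Everyone's text is different, so you need to learn the basic usage yourself. For details on how to use re, see this article: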
https://zhuanlan.zhihu.com/p/139596371
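The key is to understand what .*? (non-greedy matching) and {} (repetition counts) do, and that the re.S flag lets . match newline characters as well. With those three pieces, writing the correct expression is fairly easy.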
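As a rough illustration of those three constructs, here is a minimal sketch. The two-line `sample` string and the variable names are made up for this demonstration and are not taken from the real ping log.

```python
import re

# Hypothetical two-line sample, invented for this illustration.
sample = "2022-03-11 15:21:48\nrtt = 1.5 ms, rtt = 2.5 ms"

# {n} repeats the preceding token exactly n times.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", sample)

# .* is greedy and runs to the last " ms"; .*? stops at the first one.
greedy = re.findall(r"rtt = (.*) ms", sample)
lazy = re.findall(r"rtt = (.*?) ms", sample)

# Without re.S, "." does not match "\n", so the pattern cannot span lines.
no_flag = re.findall(r"(\d{2}:\d{2}).*?rtt = (.*?) ms", sample)
with_flag = re.findall(r"(\d{2}:\d{2}).*?rtt = (.*?) ms", sample, re.S)

print(dates)      # ['2022-03-11']
print(greedy)     # ['1.5 ms, rtt = 2.5']
print(lazy)       # ['1.5', '2.5']
print(no_flag)    # []
print(with_flag)  # [('15:21', '1.5')]
```

When a pattern contains more than one capture group, re.findall returns a list of tuples, which is the shape the rest of this walkthrough relies on.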
```python
import re

# For easy testing, put part of the text into a string first.
# The variable is named data so that it does not shadow the built-in str.
data = '''
2022-03-11 15:21:48 1
PING 81.71.51.181 (81.71.51.181) 56(84) bytes of data.
64 bytes from 81.71.51.181: icmp_seq=1 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=2 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=3 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=4 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=5 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=6 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=7 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=8 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=9 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=10 ttl=45 time=253 ms

--- 81.71.51.181 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9000ms
rtt min/avg/max/mdev = 250.203/250.563/253.202/0.961 ms
2022-03-11 15:22:40 2
PING 81.71.51.181 (81.71.51.181) 56(84) bytes of data.
64 bytes from 81.71.51.181: icmp_seq=1 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=2 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=3 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=4 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=5 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=6 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=7 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=8 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=9 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=10 ttl=45 time=250 ms

--- 81.71.51.181 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9009ms
rtt min/avg/max/mdev = 250.181/250.256/250.434/0.636 ms
2022-03-11 15:23:44 3
PING 81.71.51.181 (81.71.51.181) 56(84) bytes of data.
64 bytes from 81.71.51.181: icmp_seq=1 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=2 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=3 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=4 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=5 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=6 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=7 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=8 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=9 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=10 ttl=45 time=250 ms

--- 81.71.51.181 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9009ms
rtt min/avg/max/mdev = 250.209/250.320/250.658/0.563 ms
'''

# Extract the time (date plus hours and minutes)
print(re.findall(r'(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2})', data))
# Extract the rtt
print(re.findall(r'mdev = (.*?) ms', data))
# Extract the time and the rtt together; re.S lets the match span newlines
print(re.findall(r'(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}).*?mdev = (.*?) ms', data, re.S))
```

Output:
```
D:\python37\python.exe D:/test/data_process.py
['2022-03-11 15:21', '2022-03-11 15:22', '2022-03-11 15:23']
['250.203/250.563/253.202/0.961', '250.181/250.256/250.434/0.636', '250.209/250.320/250.658/0.563']
[('2022-03-11 15:21', '250.203/250.563/253.202/0.961'), ('2022-03-11 15:22', '250.181/250.256/250.434/0.636'), ('2022-03-11 15:23', '250.209/250.320/250.658/0.563')]

Process finished with exit code 0
```

2. Read the data from the file
Once the regular expression is correct, the data can be read from the file.
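```python
import re

# Read the file; the with statement closes it automatically,
# so no explicit close() call is needed
with open("ping/ping_flkf_gz.txt", "r") as input_file:
    data = input_file.read()

# Extract the time and the latency; re.S lets the match span newlines
print(re.findall(r'(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}).*?mdev = (.*?) ms', data, re.S))
```

The output is long, so only part of it is shown: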
```
D:\python37\python.exe D:/test/data_process.py
[('2022-03-11 15:21', '250.203/250.563/253.202/0.961'), ('2022-03-11 15:22', '250.181/250.256/250.434/0.636'), ('2022-03-11 15:23', '250.209/250.320/250.658/0.563'), ('2022-03-11 15:25', '250.183/250.240/250.275/0.225'), ('2022-03-11 15:26', '250.217/250.240/250.300/0.592'), ('2022-03-11 15:27', '250.166/250.362/250.956/0.683'), ('2022-03-11 15:28', '250.186/250.256/250.343/0.319'), ('2022-03-11 15:29', '250.181/250.435/252.077/0.776'), ('2022-03-11 15:30', '250.177/250.249/250.401/0.673'), ('2022-03-11 15:31', '250.210/250.436/251.498/0.376'), ('2022-03-11 15:32', '250.207/250.280/250.588/0.401'), ('2022-03-11 15:33', '250.237/250.336/250.747/0.568'), ('2022-03-11 15:34', '250.217/250.283/250.437/0.675'), ('2022-03-11 15:35', '250.254/250.456/251.092/0.623'), ('2022-03-11 15:36', '250.167/250.236/250.308/0.226'), ('2022-03-11 15:37', '250.162/250.399/251.032/0.667'), ('2022-03-11 15:38', '250.207/250.261/250.406/0.053'), ('2022-03-11 15:39', '250.219/250.657/252.056/0.878')]
```

The result is a list, and each tuple in it holds one extracted time and its rtt.
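3. Write to csv
Once the input file is read and the data extracted correctly, the next step is to write the results to a csv file, which is what the csv module is for.
A for loop iterates over the list, and csv_writer.writerow writes the rows to the csv file one at a time.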
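The loop described above could look roughly like the following sketch. The `rows` list stands in for the list returned by re.findall, and the output file name `ping_result.csv` is an assumption made for this example; the header names follow the csv output shown in this section.

```python
import csv

# Stand-in for the list of (time, rtt) tuples returned by re.findall.
rows = [
    ("2022-03-11 15:21", "250.203/250.563/253.202/0.961"),
    ("2022-03-11 15:22", "250.181/250.256/250.434/0.636"),
]

# Assumed output file name. newline="" keeps the csv module from
# producing blank lines between rows on Windows.
with open("ping_result.csv", "w", newline="", encoding="utf-8") as output_file:
    csv_writer = csv.writer(output_file)
    csv_writer.writerow(["time", "latency"])  # header row
    for row in rows:
        csv_writer.writerow(row)  # one (time, latency) tuple per line
```

Because every tuple already has the same two fields, `csv_writer.writerows(rows)` would write them all in one call; the explicit loop is kept here to mirror the description.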
The results are now written to the csv file:
```
time,latency
2022-03-11 15:21,250.203/250.563/253.202/0.961
2022-03-11 15:22,250.181/250.256/250.434/0.636
2022-03-11 15:23,250.209/250.320/250.658/0.563
2022-03-11 15:25,250.183/250.240/250.275/0.225
2022-03-11 15:26,250.217/250.240/250.300/0.592
2022-03-11 15:27,250.166/250.362/250.956/0.683
2022-03-11 15:28,250.186/250.256/250.343/0.319
2022-03-11 15:29,250.181/250.435/252.077/0.776
2022-03-11 15:30,250.177/250.249/250.401/0.673
2022-03-11 15:31,250.210/250.436/251.498/0.376
```

4. Storing each value separately
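At this point the latency column still holds all four values in one field, like this: 250.203/250.563/253.202/0.961
To make later processing easier, each value should go into its own column, so the regular expression has to be modified.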
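One way to make that change is to give each of the four values its own capture group, as in the sketch below. The exact pattern, the shortened sample, the column names (taken from ping's min/avg/max/mdev label) and the file name `ping_result_split.csv` are assumptions for this illustration, not necessarily what the original code used.

```python
import csv
import re

# One ping run copied from the earlier sample, shortened for illustration.
data = """2022-03-11 15:21:48 1
PING 81.71.51.181 (81.71.51.181) 56(84) bytes of data.
64 bytes from 81.71.51.181: icmp_seq=1 ttl=45 time=250 ms

--- 81.71.51.181 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9000ms
rtt min/avg/max/mdev = 250.203/250.563/253.202/0.961 ms
"""

# Four capture groups separated by "/" replace the single rtt group.
pattern = r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}).*?mdev = (.*?)/(.*?)/(.*?)/(.*?) ms"
rows = re.findall(pattern, data, re.S)
print(rows)
# [('2022-03-11 15:21', '250.203', '250.563', '253.202', '0.961')]

# Assumed file name and column names.
with open("ping_result_split.csv", "w", newline="", encoding="utf-8") as output_file:
    csv_writer = csv.writer(output_file)
    csv_writer.writerow(["time", "min", "avg", "max", "mdev"])
    for row in rows:
        csv_writer.writerow(row)
```

Each tuple now has five fields, so the same writerow loop produces five columns without any further changes.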
In the resulting csv file, each value then has its own column.
That completes the task.