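centos7 python3 crawler email login: using a crawler to scrape homework deadlines from Chaoxing Xuexitong and send reminders by email
Introduction
I forget to do my homework all the time, so I wanted a crawler that scrapes the homework deadlines from Chaoxing's Xuexitong and sends me reminders on a schedule.
Environment
Alibaba Cloud lightweight server
Centos7
Anaconda3
First, a look at the result
Crawling process
Sending the email
Process
Background knowledge required
How HTTP requests work
What cookies and sessions are for
The captcha login flow
How the page works
Several Python packages
requests (HTTP request package)
lxml, etree (HTML parsing package)
email, smtplib (email packages)
muggle-ocr (captcha recognition)
datetime and time (time packages)
Using Anaconda on Centos
The SMTP mail protocol
How to run scheduled tasks on Centos
Ports used for sending mail on Centos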
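As a rough illustration of the email and smtplib items in the list above, the sketch below builds a plain-text reminder entirely offline. The addresses, the subject and the helper name `build_reminder` are made up for the example; actually sending the message would additionally need an SMTP login, for instance over SSL on port 465 as the script in this post does.

```python
from email.header import Header
from email.mime.text import MIMEText


def build_reminder(sender, recipient, body):
    """Build a UTF-8 plain-text email (hypothetical helper, nothing is sent)."""
    msg = MIMEText(body, "plain", "utf-8")
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = Header("Homework deadline approaching", "utf-8")
    return msg


# Example addresses are placeholders, not real accounts
msg = build_reminder(
    "sender@example.com",
    "student@example.com",
    "Course A\n\t\tHomework 1, 2 days left\n",
)
raw = msg.as_string()  # this string is what smtplib's sendmail() would transmit
```

Keeping message construction separate from sending makes it possible to check the headers and body without touching a mail server.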
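Getting started
(Some of these topics are already covered very well by other blog posts, so I will not go into detail on them, but I will provide links.)
1. How HTTP requests work
Understanding the HTTP protocol in one article (with illustrations)
2. What cookies and sessions are for
The difference between cookie and session
The difference between session and cookies
3. The captcha login flow
How captchas work and what they are for
In short (taking Chaoxing as the example):
Every time the login page is opened, it first requests a captcha URL. (Do not worry about what the code?*** part is. It is only a timestamp produced by a JS date function; look it up if that is unfamiliar. Do not dwell on it: at first I went down the wrong path and tried to derive the captcha from this value, which was the wrong idea entirely. Learn from my mistake.)
Following that link returns an image.
On the surface the response is just an image. In fact, the server also generates cookies that match the image and returns them along with it. From then on, the client has to send these cookies together with the account and password when it logs in.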
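To make the cookie round trip above concrete, here is a small offline sketch using only the standard library. The `Set-Cookie` value and the helper name `cookie_header_from_set_cookie` are invented for illustration; the real cookie names Chaoxing uses are not shown in this post. A `requests` session performs the same bookkeeping automatically, which is why the script fetches the captcha and posts the login through one session object.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_value):
    """Turn a Set-Cookie header value into the Cookie header the client sends back."""
    parsed = SimpleCookie()
    parsed.load(set_cookie_value)
    # Attributes such as Path belong to the morsel and are not echoed back
    return "; ".join(f"{name}={morsel.value}" for name, morsel in parsed.items())


# Hypothetical header returned together with the captcha image
captcha_set_cookie = "JSESSIONID=ABC123; Path=/"
login_cookie_header = cookie_header_from_set_cookie(captcha_set_cookie)
print(login_cookie_header)
```

If the login request omits this header, the server has no way to tell which captcha image the submitted code belongs to.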
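4. How the page works
I used the Chrome browser; pressing F12 shows everything that happens as the page loads and changes. There are two main parts:
Static HTML pages (easy to fetch)
JS functions working in the background (they load and render parts of the page dynamically)
The captcha image here is one example. It is not loaded statically, so the HTML you fetch directly does not contain a link to the image; a function loads it dynamically instead.
After pressing F12 the main panels look like this:
The form data in the last figure is the information submitted for this login (account, password (Base64-encoded), captcha, and so on).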
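The form data described above can be sketched as follows. This is only an illustration: the field names `uname`, `password`, `numcode` and `t` are the ones used in the code later in this post, while the helper name `build_login_form` and the sample values are made up. Base64 is an encoding rather than encryption, so it only changes how the password looks on the wire.

```python
import base64
from urllib.parse import urlencode


def build_login_form(username, password, captcha):
    """Assemble the login form fields, with the password Base64-encoded."""
    encoded_password = base64.b64encode(password.encode("utf-8")).decode("ascii")
    return {
        "uname": username,
        "password": encoded_password,
        "numcode": captcha,
        "t": "true",
    }


# Sample values, not real credentials
form = build_login_form("20180001", "secret", "1234")
body = urlencode(form)  # the application/x-www-form-urlencoded request body
print(body)
```

Anyone who sees the request can decode the password again with `base64.b64decode`, so the login still depends on HTTPS for confidentiality.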
Code
#coding=UTF-8
#File name: Chaoxing crawler
#Author: 龍文漢
#Date: 2020.10.16
#Description: use a crawler to scrape the homework details from Chaoxing and get each assignment's deadline
import time
import json
import requests
from lxml import etree
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import os
import base64
import datetime
import smtplib
from email.header import Header
from email.mime.text import MIMEText
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
import threading
import sys
import muggle_ocr
class PaChaongxin():
    #main_function
    def __init__(self, username, password, emai):
        self.username = username
        self.password = password
        self.to = emai
        self.code = 0
        self.code_status = None  # True while the captcha is being rejected
        self.user_pas_status = None  # True while the username or password is being rejected
        self.sender_mail = 'xxxx@xxx'  # sender's email address
        self.sender_pass = 'xxxxxxx'  # SMTP password of the sender's mailbox
        self.session = requests.session()
        self.header = {
            'User-Agent': 'xxxxxxxx'  # your own User-Agent
        }

    def User_Pas(self):
        # Enter the account, password and email address
        # self.username = input("Enter your student ID: ")
        # self.password = input("Enter your password: ")
        # self.to = input("Enter your email address: ")
        #self.username = xxxxxxx
        #self.password = 'xxxxx'
        #self.to = 'xxxxxx'
        return
    def Get_code(self):
        # Fetch the captcha; the session keeps the cookies that come with it
        code_url = 'https://passport2.chaoxing.com/num/code'  # Chaoxing captcha URL
        path_path = 'vari_code.png'
        code_response = self.session.get(code_url)
        # Save the captcha image
        with open(path_path, 'wb') as img:
            img.write(code_response.content)
        # Show the captcha and type it in by hand instead of using the recognizer
        # img_open = Image.open('vari_code.png')
        # img = mpimg.imread('vari_code.png',0)
        # plt.imshow(img)  # show the image
        # plt.axis('off')  # hide the axes
        # plt.show()
        # self.code = input("Enter the captcha: ")
        print("before:", self.code)
        self.code = self.Code_Verifed()
        print("after:", self.code)
        os.remove(path_path)
    def Load_Page(self):
        # Log in through the session, which already carries the captcha cookies
        param = {
            'fid': 'xxxx',
            'uname': self.username,
            'numcode': self.code,
            # b64encode() needs bytes; self.password itself stays plain text for retries
            'password': base64.b64encode(self.password.encode("utf-8")),
            'refer': 'http%3A%2F%2Fi.chaoxing.com',
            't': 'true'
        }
        # Content-Length is left out: requests computes it from the body
        header = {
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,und;q=0.7',
            'Connection': 'keep-alive',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'Host': 'passport2.chaoxing.com',
            'Origin': 'https://passport2.chaoxing.com',
            'Referer': 'https://passport2.chaoxing.com/login?loginType=3&newversion=true&fid=-1&refer=http%3A%2F%2Fi.chaoxing.com',
            'Sec-Fetch-Dest': 'empty',
            'Sec-Fetch-Mode': 'cors',
            'Sec-Fetch-Site': 'same-origin',
            'User-Agent': 'xxxxxxxx',  # change this to your own
            'X-Requested-With': 'XMLHttpRequest',
        }
        load_url = 'https://passport2.chaoxing.com/unitlogin?'
        load_response = self.session.post(load_url, headers=header, data=param)
        load_response_msg = json.loads(load_response.text)
        print(load_response_msg, load_response_msg.keys())
        # '验证码错误' is the server's "wrong captcha" message
        if load_response_msg.get('mes') == '验证码错误':
            print('Wrong captcha')
            self.code_status = True
            while self.code_status:
                self.Get_code()
                param['numcode'] = self.code
                load_response = self.session.post(load_url, headers=header, data=param)
                load_response_msg = json.loads(load_response.text)
                if load_response_msg.get('mes') != '验证码错误':
                    self.code_status = False
        # '用户名或密码错误' is the server's "wrong username or password" message
        if load_response_msg.get('mes') == '用户名或密码错误':
            print('Wrong username or password')
            self.user_pas_status = True
            while self.user_pas_status:
                self.User_Pas()
                self.Get_code()
                param['numcode'] = self.code
                param['uname'] = self.username
                param['password'] = base64.b64encode(self.password.encode("utf-8"))
                load_response = self.session.post(load_url, headers=header, data=param)
                load_response_msg = json.loads(load_response.text)
                if load_response_msg.get('mes') != '用户名或密码错误':
                    self.user_pas_status = False
        # The response above should now be a successful login; visit the home page with the new cookies
        self_page_url = 'http://i.mooc.chaoxing.com'
        self_page_response = self.session.get(url=self_page_url, headers=self.header)
        # This is the home page of the personal space. To extract the courses, find the matching link among the buttons on the left.
        # The course page is embedded in this page, so it can also be visited directly with the session
        # self_page_html = etree.HTML(self_page.text)
        # return self_page_html
    def Get_Class_View(self):
        # Open the page that lists all courses
        class_view_url = 'http://mooc1-2.chaoxing.com/visit/courses'
        class_view = self.session.get(class_view_url, headers=self.header)
        class_view_html = etree.HTML(class_view.text)
        # Return the whole parsed page so later lookups are easy
        return class_view_html

    def Go_to_work(self, Singel_Class_Url_after):
        # Jump straight to the course URL passed in from outside
        # and open the homework module of that single course
        Singel_Class_Url = 'https://mooc1-2.chaoxing.com' + Singel_Class_Url_after
        single_class_response = self.session.get(Singel_Class_Url, headers=self.header)
        single_class_page = etree.HTML(single_class_response.text)
        # The homework link sits under div[4] on some course pages and under div[2] on others
        url_after = single_class_page.xpath("/html/body/div[4]/div/div/div[2]/ul/li[6]/a/@data")
        if len(url_after) == 0:
            url_after = single_class_page.xpath("/html/body/div[2]/div/div/div[2]/ul/li[6]/a/@data")
        open_zuoye_url = 'https://mooc1-2.chaoxing.com' + url_after[0]
        work_xml_response = self.session.get(open_zuoye_url, headers=self.header)
        work_xml = etree.HTML(work_xml_response.text)
        # open_zuoye_url leads directly to the homework page
        # Call the per-course extraction function; keeping it separate makes multithreading easier
        single_class_text = self.Get_work_time(work_xml)
        return single_class_text
    def Get_work_time(self, work_xml):
        # The functions above handle page navigation; this one extracts the homework details
        work_num = len(work_xml.xpath('//*[@id="RightCon"]/div/div/div[2]/ul/li'))
        class_name = work_xml.xpath('/html/body/div[2]/div/h1/span[1]/@title')[0]
        # Returned list: [course name, [assignment name, time left], ...]
        # Courses without homework return immediately
        if work_num == 0:
            return None
        work_text_info = []  # a course that has homework
        work_text_info.append(class_name)  # add the course name
        # Process each assignment
        for i in range(1, work_num+1):
            work_name = work_xml.xpath('//*[@id="RightCon"]/div/div/div[2]/ul/li['+str(i)+']/div[1]/p/a/text()')[0].strip()
            work_status = work_xml.xpath('/html/body/div[3]/div[1]/div/div/div/div[2]/ul/li['+str(i)+']/div[1]/span[3]/strong/text()')[0].strip()
            work_end_time = work_xml.xpath('//*[@id="RightCon"]/div/div/div[2]/ul/li['+str(i)+']/div[1]/span[2]/text()')
            # print(work_end_time)
            if len(work_end_time) == 0:  # no deadline
                break
            else:
                work_end_time = work_end_time[0]
            print(class_name, work_name, work_end_time)
            # '已完成' means completed, '待批阅' means awaiting grading
            if work_status != '已完成' and work_status != "待批阅":  # an unfinished assignment
                work_text_info.append([])
                # Add the assignment name
                work_text_info[-1].append(work_name)
                # Use datetime to compute the time difference
                current_time = time.strftime("%Y-%m-%d %H:%M", time.localtime())
                # Convert both times to datetime objects
                current_time_tran = datetime.datetime.strptime(current_time, "%Y-%m-%d %H:%M")
                work_end_time_tran = datetime.datetime.strptime(work_end_time, "%Y-%m-%d %H:%M")
                time_mul = work_end_time_tran - current_time_tran
                time_mul_day = time_mul.days
                if time_mul.total_seconds() < 0:
                    # The deadline has already passed
                    lea_time = 'overdue'
                elif time_mul_day <= 1:
                    time_m, time_s = divmod(time_mul.seconds, 60)
                    time_h, time_m = divmod(time_m, 60)
                    # Use the real day count instead of a hard-coded 0
                    lea_time = str(time_mul_day)+' days: '+str(time_h)+' hours: '+str(time_m)+' minutes'
                else:
                    lea_time = str(time_mul_day)+' days'
                # Add the time left
                work_text_info[-1].append(lea_time)
        print(class_name, "finished")
        # Return the homework information of this single course
        return work_text_info
    def Code_Verifed(self):
        # Recognize the saved captcha image with muggle_ocr
        sdk = muggle_ocr.SDK(model_type=muggle_ocr.ModelType.Captcha)
        with open(r'vari_code.png', 'rb') as f:
            captcha_bytes = f.read()
        code = sdk.predict(image_bytes=captcha_bytes)
        return code
    def send_email_by_qq(self, text):
        # Send the reminder by email
        # Top-level message object of type "mixed"
        msg_root = MIMEMultipart('mixed')
        # Message headers
        msg_root['From'] = self.sender_mail
        msg_root['To'] = self.to
        # Subject, shown in the recipient's inbox preview
        subject = 'Homework deadline approaching!'
        msg_root['Subject'] = Header(subject, 'utf-8')
        # Build the plain-text body
        text_inf = "Unfinished homework overview:\n"
        for i in range(0, len(text)):
            text_inf += text[i][0] + "\n"  # course name as the heading
            for x in range(1, len(text[i])):
                text_inf += '\t\t'
                text_inf += str(text[i][x])
                text_inf += "\n\n"
        text_sub = MIMEText(text_inf, 'plain', 'utf-8')
        print(text_inf)
        msg_root.attach(text_sub)
        # # Build an HTML part
        # url = "https://blog.csdn.net/chinesepython"
        # html_info = """
        # <p>Click the link below and you will head to a bigger world</p>
        # <p><a href="%s">click me</a></p>
        # <p>i am very galsses for you</p>
        # """ % url
        # html_sub = MIMEText(html_info, 'html', 'utf-8')
        # # Without the next line the text above is not displayed properly; the HTML is shown as plain text
        # html_sub["Content-Disposition"] = 'attachment; filename="csdn.html"'
        # # Add the part to the message
        # msg_root.attach(html_sub)
        # # Build an image part
        # image_file = open(r'D:\python_files\images\test.png', 'rb').read()
        # image = MIMEImage(image_file)
        # image.add_header('Content-ID', '')
        # # Without the next line the recipient sees a garbled .bin file that cannot be opened after download
        # image["Content-Disposition"] = 'attachment; filename="red_people.png"'
        # msg_root.attach(image)
        # # Build a file attachment
        # txt_file = open(r'D:\python_files\files\hello_world.txt', 'rb').read()
        # txt = MIMEText(txt_file, 'base64', 'utf-8')
        # txt["Content-Type"] = 'application/octet-stream'
        # # The next line renames the attachment to hello_world.txt
        # txt.add_header('Content-Disposition', 'attachment', filename='hello_world.txt')
        # msg_root.attach(txt)
        try:
            smtp_obj = smtplib.SMTP_SSL('smtp.qq.com', 465)
            smtp_obj.login(self.sender_mail, self.sender_pass)
            smtp_obj.sendmail(self.sender_mail, self.to, msg_root.as_string())
            smtp_obj.quit()
            print('sendemail successful!')
        except Exception as e:
            print('sendemail failed, the reason follows')
            print(e)
    def Begin(self):
        self.User_Pas()  # get the account details
        self.Get_code()  # fetch the captcha and its cookies, which the session keeps
        self.Load_Page()  # log in to the personal space and update the cookies
        class_view_html = self.Get_Class_View()  # open the page listing all courses and return it
        len_class = len(class_view_html.xpath('/html/body/div/div[2]/div[3]/ul/li'))  # count the courses
        all_text = []  # overall email content
        for i in range(1, len_class):  # go through each course
            print('Starting course', i)
            Singel_Class_Url_after = class_view_html.xpath('/html/body/div/div[2]/div[3]/ul/li['+str(i)+']/div[2]/h3/a/@href')[0]
            # print(Singel_Class_Url_after)
            # all_text.append([threading.Thread(target=self.Go_to_work,args=(Singel_Class_Url_after)).start()])
            work_text = self.Go_to_work(Singel_Class_Url_after)
            # Leave out courses that have no unfinished homework
            if work_text is None or len(work_text) <= 1:
                continue
            else:
                all_text.append(work_text)
        return all_text
if __name__ == '__main__':
    # Alternative to a scheduled task: poll the clock and run once at 08:00:00
    #while 1:
    #    time_list = time.strftime("%H:%M:%S", time.localtime()).split(":")
    #    if time_list[0] == "08" and time_list[1] == "00" and time_list[2] == "00":
    #        pachong = PaChaongxin(xxxxxxx, "xxxxxxx", "xxxxx@xxxxx")
    #        work_email = pachong.Begin()
    #        pachong.send_email_by_qq(work_email)
    # Arguments: student ID, password, email address to notify
    pachong = PaChaongxin("xxxxxxx", "xxxxxxx", "xxxxx@xxxxx")
    work_email = pachong.Begin()
    pachong.send_email_by_qq(work_email)
    # A second account is handled the same way
    pachong = PaChaongxin("xxxxxxx", "xxxxxxx", "xxxxx@xxxxx")
    work_email = pachong.Begin()
    pachong.send_email_by_qq(work_email)
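The time-left arithmetic inside Get_work_time can also be isolated into a small function that is easy to check by hand. The sketch below is one possible way to write it, not the code the script runs: the name `format_time_left` and the output wording are made up, and it assumes the deadline text has the same `%Y-%m-%d %H:%M` format the script parses.

```python
from datetime import datetime


def format_time_left(deadline_text, now):
    """Describe how much time remains until the deadline (hypothetical helper)."""
    deadline = datetime.strptime(deadline_text, "%Y-%m-%d %H:%M")
    delta = deadline - now
    if delta.total_seconds() <= 0:
        return "overdue"
    # timedelta stores whole days separately from the leftover seconds
    hours, remainder = divmod(delta.seconds, 3600)
    minutes = remainder // 60
    return f"{delta.days} days {hours} hours {minutes} minutes"


now = datetime(2020, 10, 16, 8, 0)
print(format_time_left("2020-10-17 20:30", now))  # 1 days 12 hours 30 minutes
print(format_time_left("2020-10-16 07:00", now))  # overdue
```

Passing `now` in as a parameter, rather than reading the clock inside the function, is what makes the two sample calls reproducible.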
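Summary
Learning that is driven by a concrete task and by personal interest really does broaden what you know. Through this example I picked up crawling, networking, cryptography, using a cloud server, email, and Python. Overall I gained a great deal, and I hope we can all keep improving together.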