

Scraping web page content with Python, parsing it with regular expressions, and storing the results in a MySQL database


1. Logging in via a POST request

Firefox plus the HttpFox extension can be used to inspect the POST form. First, enter a username and password at http://www.renren.com/; after submission, the credentials are POSTed to the following address:

http://www.renren.com/PLogin.do

# renren.py
import urllib
import urllib2
import cookielib

# Keep the session cookie returned after login so later requests stay authenticated.
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

# Form fields observed with HttpFox for the PLogin.do endpoint.
postdata = urllib.urlencode({
    'email': '',     # your account
    'password': ''   # your password
})

req = urllib2.Request(
    url='http://www.renren.com/PLogin.do',
    data=postdata
)

result = opener.open(req)
print result.read()

At this point the script has logged in to renren.com; what gets printed is the HTML source of the logged-in page.
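Because the CookieJar is attached to the opener, the same opener can then fetch pages that require the login session. A minimal sketch, assuming a hypothetical post-login URL (not taken from the original article):

# Reuse the logged-in opener; HTTPCookieProcessor replays the stored session cookie.
profile = opener.open('http://www.renren.com/home')  # hypothetical post-login page
print profile.read()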

2. Fetching a page and extracting the needed information

Here the stock site Seeking Alpha is used as an example (no offense intended). Open SA and fetch the page:

import urllib2

content = urllib2.urlopen('http://seekingalpha.com/symbol/GOOGL?s=googl').read()
print content

This prints the page for the GOOGL ticker.

*Note that no POST is used here, because this site can be read without logging in.

Next, the regular expression. Write the pattern: pattern=re.compile(r'href="/article.*sasource')

This matches every link pointing to an article page. Printing the matches for GOOG gives the links below (a short sketch of trimming the raw matches into clean URLs follows the list):

http://seekingalpha.com/article/2250373-energetic-moves-for-google

http://seekingalpha.com/article/2249173-google-bringing-satellite-internet-to-the-world

http://seekingalpha.com/article/2247383-what-googles-self-driving-car-says-about-the-company

http://seekingalpha.com/article/2238623-europe-tries-to-censor-google

http://seekingalpha.com/article/2236283-google-is-reportedly-mulling-expansion-in-outer-space

http://seekingalpha.com/article/2234863-what-will-googles-30-billion-in-foreign-acquisitions-do

http://seekingalpha.com/article/2229953-in-defense-of-google-glass

http://seekingalpha.com/article/2229163-android-fragmentation-and-the-cloud

http://seekingalpha.com/article/2227963-everything-you-need-to-know-about-twitch-tv-and-why-company-could-be-a-steal-for-google

http://seekingalpha.com/article/2226203-google-adds-quest-visual-to-its-portfolio-m-and-a-overview

http://seekingalpha.com/article/2223103-goog-vs-googl-a-classic-pairs-trade

http://seekingalpha.com/article/2222373-google-or-apple-which-is-the-better-long-term-bet

http://seekingalpha.com/article/2220023-a-look-at-everything-thats-wrong-with-google-glass

http://seekingalpha.com/article/2198683-analysis-of-oral-argument-in-vringo-vs-google-patent-infringement-appeal

http://seekingalpha.com/article/2193673-google-investors-can-expect-upside-potential

http://seekingalpha.com/article/2191843-google-is-a-stock-to-own-for-the-long-term

http://seekingalpha.com/article/2187033-google-7-different-insiders-have-sold-shares-during-the-last-30-days

http://seekingalpha.com/article/2169973-google-facing-some-problems-in-the-mobile-advertising-market

http://seekingalpha.com/article/2168773-google-strikes-deal-with-buffett-backed-wind-generator

http://seekingalpha.com/article/2165243-why-google-has-upside-to-nearly-650

http://seekingalpha.com/article/2251473-what-wwdc-says-about-apples-new-products

http://seekingalpha.com/article/2251063-how-apples-iphones-might-become-an-indispensable-piece-of-equipment-again

http://seekingalpha.com/article/2250973-will-apple-outsmart-google-in-the-internet-of-things

http://seekingalpha.com/article/2249683-demand-medias-c-and-m-business-prospects-boosted-by-new-google-search-algorithm-changes

http://seekingalpha.com/article/2248843-googles-satellites-pose-threat-to-sirius-xm

http://seekingalpha.com/article/2248193-facebook-battling-google-for-eyeballs

http://seekingalpha.com/article/2248143-wall-street-breakfast-must-know-news

http://seekingalpha.com/article/2246013-apple-something-extraordinary-is-certain

http://seekingalpha.com/article/2245693-why-you-shouldnt-believe-the-himax-google-break-up-rumor

http://seekingalpha.com/article/2244133-dividends-role-in-wealth-creation-sector-analysis

http://seekingalpha.com/article/2242083-the-defensive-portfolio-focusing-on-competitive-advantage

http://seekingalpha.com/article/2242023-vringos-q1-report-shows-mixed-results-is-a-secondary-offering-just-around-the-corner

http://seekingalpha.com/article/2241533-is-facebook-at-the-peak-of-its-share-price

http://seekingalpha.com/article/2240663-wall-street-breakfast-must-know-news

http://seekingalpha.com/article/2240493-blackberry-z3-seems-too-late-to-the-party

http://seekingalpha.com/article/2238893-why-apple-beats-partnership-will-change-competitive-landscape-for-music-streaming

http://seekingalpha.com/article/2238073-apples-split-what-you-need-to-know

http://seekingalpha.com/article/2236983-lady-liberty-rescues-vringo-google-royalty-tab-to-exceed-1_8-billion

http://seekingalpha.com/article/2236893-high-time-for-investors-to-buy-into-samsung

http://seekingalpha.com/article/2231733-lenovo-making-the-right-strategic-moves-to-build-value
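Before the full script, here is a minimal sketch of how a raw match is trimmed into an absolute URL, mirroring the slicing used in the complete code below. The sample markup is an assumption about the page's HTML inferred from that slicing, not copied from Seeking Alpha:

import re

# Assumed shape of an article link in the page source (illustrative only).
sample = '<a href="/article/2250373-energetic-moves-for-google" sasource="on_the_move">Energetic Moves For Google</a>'
for url in re.findall(r'href="/article.*sasource', sample):
    ct = len(url)
    url = url[6:(ct - 10)]                    # strip the leading href=" and the trailing " sasource
    url = 'http://seekingalpha.com' + url     # make the link absolute
    print url                                 # -> http://seekingalpha.com/article/2250373-energetic-moves-for-google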

Here is the complete Python code, with the MySQL table definition included as comments at the top:

# table commenturl
# CREATE TABLE `commenturl` (
#   `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
#   `object` varchar(30) DEFAULT NULL,
#   `url` varchar(1024) DEFAULT NULL,
#   PRIMARY KEY (`id`)
# ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
# truncate table commenturl  -- resets AUTO_INCREMENT to 1

import sys
import re
import MySQLdb
import urllib2

# Pretend to be a regular browser so the site does not answer with 403 Forbidden.
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}

req = urllib2.Request(url='http://seekingalpha.com/symbol/GOOG?s=goog', headers=headers)
content = urllib2.urlopen(req).read()

# Every article link runs up to the sasource attribute, so match through it.
links = re.findall(r'href="/article.*sasource', content)

try:
    conn = MySQLdb.connect(host='localhost', user='root', passwd='', port=3306)
    cur = conn.cursor()
    conn.select_db('usr')
except MySQLdb.Error, e:
    print "Mysql Error %d: %s" % (e.args[0], e.args[1])
    sys.exit(1)  # no point continuing without a database connection

for url in links:
    ct = len(url)
    url = url[6:(ct - 10)]                    # strip the leading href=" and the trailing " sasource
    url = 'http://seekingalpha.com' + url     # make the link absolute
    print url
    cur.execute("INSERT INTO commenturl(object,url) VALUES('GOOG',%s)", (url,))

conn.commit()
cur.close()
conn.close()

Note: the site may return Error 403 Forbidden to block crawlers. In that case the request has to imitate a browser by sending headers, as in: req = urllib2.Request(url='http://seekingalpha.com/symbol/GOOG?s=goog', headers=headers)

That is the complete source code, together with the MySQL table-creation statement.
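As a quick check that the rows actually landed in the table, here is a minimal sketch reusing the connection settings from the script above (database 'usr', root with an empty password):

import MySQLdb

conn = MySQLdb.connect(host='localhost', user='root', passwd='', port=3306, db='usr')
cur = conn.cursor()
cur.execute("SELECT object, url FROM commenturl ORDER BY id DESC LIMIT 5")
for obj, link in cur.fetchall():
    print obj, link
cur.close()
conn.close()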
