How I automated my job search by building a web crawler from scratch
The story of how it began
It was midnight on a Friday, my friends were out having a good time, and yet I was nailed to my computer screen typing away.
Oddly, I didn’t feel left out.
I was working on something that I thought was genuinely interesting and awesome.
I was right out of college, and I needed a job. When I left for Seattle, I had a backpack full of college textbooks and some clothes. I could fit everything I owned in the trunk of my 2002 Honda Civic.
I didn’t like to socialize much back then, so I decided to tackle this job-finding problem the best way I knew how. I tried to build an app to do it for me, and this article is about how I did it.
Getting started with Craigslist
I was in my room, furiously building some software that would help me collect, and respond to, people who were looking for software engineers on Craigslist. Craigslist is essentially the marketplace of the Internet, where you can go and find things for sale, services, community posts, and so on.
At that point in time, I had never built a fully fledged application. Most of the things I worked on in college were academic projects that involved building and parsing binary trees, computer graphics, and simple language processing models.
I was quite the “newb.”
That said, I had always heard about this new “hot” programming language called Python. I didn’t know much Python, but I wanted to get my hands dirty and learn more about it.
So I put two and two together, and decided to build a small application using this new programming language.
The journey to build a (working) prototype
For development, I used a secondhand BenQ laptop that my brother had given me when I left for college.
It wasn’t the best development environment by any measure. I was using Python 2.4 and an older version of Sublime Text, yet the process of writing an application from scratch was truly an exhilarating experience.
I didn’t know what I needed to do yet. I was trying various things out to see what stuck, and my first approach was to find out how I could access Craigslist data easily.
I looked up Craigslist to find out if they had a publicly available REST API. To my dismay, they didn’t.
However, I found the next best thing.
Craigslist had an RSS feed that was publicly available for personal use. An RSS feed is essentially a computer-readable summary of updates that a website sends out. In this case, the RSS feed would allow me to pick up new job listings whenever they were posted. This was perfect for my needs.
Next, I needed a way to read these RSS feeds. I didn’t want to go through the RSS feeds manually myself, because that would be a time-sink and that would be no different than browsing Craigslist.
Around this time, I started to realize the power of Google. There’s a running joke that software engineers spend most of their time Googling for answers. I think there’s definitely some truth to that.
After a little bit of Googling, I found this useful post on Stack Overflow that described how to search through a Craigslist RSS feed. It was a kind of filtering functionality that Craigslist provided for free. All I had to do was pass in a specific query parameter with the keyword I was interested in.
I was focused on searching for software-related jobs in Seattle. With that, I typed up this specific URL to look for listings in Seattle that contained the keyword “software”.
https://seattle.craigslist.org/search/sss?format=rss&query=software
And voilà! It worked beautifully.
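For readers who want to try the same idea, here is a minimal sketch of how a search URL like the one above could be assembled in Python. The function name `build_feed_url` and its parameters are illustrative assumptions, not the original script, and the feed format may have changed since this was written.

```python
from urllib.parse import urlencode


def build_feed_url(city, keyword, category="sss"):
    """Build a Craigslist-style RSS search URL (hypothetical helper).

    The URL layout mirrors the one quoted in the article; "sss" is the
    category segment that appears in that URL.
    """
    # urlencode escapes spaces and special characters in the keyword.
    params = urlencode({"format": "rss", "query": keyword})
    return f"https://{city}.craigslist.org/search/{category}?{params}"


print(build_feed_url("seattle", "software"))
# https://seattle.craigslist.org/search/sss?format=rss&query=software
```

Building the query string with `urlencode` rather than by hand keeps multi-word keywords such as "software engineer" correctly escaped.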
The most beautiful soup I’ve ever tasted
I wasn’t convinced, however, that my approach would work.
First, the number of listings was limited. My data didn’t contain all the available job postings in Seattle. The returned results were merely a subset of the whole. I was looking to cast as wide a net as possible, so I needed to know all the available job listings.
Second, I realized that the RSS feed didn’t include any contact information. That was a bummer. I could find the listings, but I couldn’t contact the posters unless I manually filtered through these listings.
I’m a person of many skills and interests, but doing repetitive manual work isn’t one of them. I could’ve hired someone to do it for me, but I was barely scraping by with 1-dollar ramen cup noodles. I couldn’t splurge on this side project.
That was a dead-end. But it wasn’t the end.
Continuous iteration
From my first failed attempt, I learned that Craigslist had an RSS feed that I could filter on, and each posting had a link to the actual posting itself.
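To make the "each posting had a link" point concrete, here is a rough sketch of pulling titles and links out of an RSS feed with Python's standard library. The sample feed below is a simplified, made-up RSS 2.0 document; the real Craigslist feed may use a different layout and XML namespaces, so treat the element names as assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified feed used only for illustration.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>hypothetical search results</title>
    <item>
      <title>Software engineer wanted</title>
      <link>https://example.org/post/1.html</link>
    </item>
    <item>
      <title>Python developer (contract)</title>
      <link>https://example.org/post/2.html</link>
    </item>
  </channel>
</rss>"""


def parse_listings(feed_xml):
    """Return one dict per <item>, holding its title and link."""
    root = ET.fromstring(feed_xml)
    listings = []
    for item in root.iter("item"):
        listings.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
        })
    return listings


for listing in parse_listings(SAMPLE_FEED):
    print(listing["title"], "->", listing["link"])
```

Each `link` is the URL of the full posting, which is the page a scraper would fetch next.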
Well, if I could access the actual posting, then maybe I could scrape the email address off of it? 🧐 That meant I needed to find a way to grab email addresses from the original postings.
Once again, I pulled up my trusted Google, and searched for “ways to parse a website.”
With a little Googling, I found a cool little Python tool called Beautiful Soup. It’s essentially a nifty tool that allows you to parse an entire DOM tree and helps you make sense of how a web page is structured.
My needs were simple: I needed a tool that was easy to use and would let me collect data from a webpage. BeautifulSoup checked off both boxes, and rather than spending more time picking out the best tool, I picked a tool that worked and moved on. Here’s a list of alternatives that do something similar.
Side note: I found this awesome tutorial that talks about how to scrape websites using Python and BeautifulSoup. If you’re interested in learning how to scrape, then I recommend reading it.
With this new tool, my workflow was all set.
I was now ready to tackle the next task: scraping email addresses from the actual postings.
Now, here’s the cool thing about open-source technologies. They’re free and work great! It’s like getting free ice-cream on a hot summer day, and a freshly baked chocolate-chip cookie to go.
BeautifulSoup lets you search for specific HTML tags, or markers, on a web page. And Craigslist had structured their listings in such a way that it was a breeze to find email addresses. The tag was something along the lines of “email-reply-link,” which basically points out that an email link is available.
From then on, everything was easy. I relied on the built-in functionality BeautifulSoup provided, and with just some simple manipulation, I was able to pick out email addresses from Craigslist posts quite easily.
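The lookup described above can be sketched as follows. To keep the example free of third-party installs, it uses Python's built-in `html.parser` as a stand-in for BeautifulSoup; with BeautifulSoup the equivalent search would be along the lines of `soup.find_all("a", class_="email-reply-link")`. The class name comes from the "something along the lines of" description above, and the sample page and its `mailto:` link are invented for illustration, so the real markup may well differ.

```python
from html.parser import HTMLParser


class ReplyLinkParser(HTMLParser):
    """Collect addresses from <a class="email-reply-link" href="mailto:...">."""

    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        href = attr.get("href") or ""
        if "email-reply-link" in classes and href.startswith("mailto:"):
            # Drop the scheme and any "?subject=..." suffix.
            self.emails.append(href[len("mailto:"):].split("?", 1)[0])


def extract_emails(html):
    parser = ReplyLinkParser()
    parser.feed(html)
    return parser.emails


# Hypothetical posting page, not real Craigslist markup.
SAMPLE_POSTING = """
<html><body>
  <h1>Software engineer wanted</h1>
  <a class="email-reply-link" href="mailto:jobs@example.org?subject=Hello">reply</a>
  <a href="https://example.org/about">about</a>
</body></html>
"""

print(extract_emails(SAMPLE_POSTING))
# ['jobs@example.org']
```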
Putting things together
Within an hour or so, I had my first MVP. I had built a web scraper that could collect email addresses and respond to people looking for software engineers within a 100-mile radius of Seattle.
I added various add-ons on top of the original script to make life much easier. For example, I saved the results into a CSV and HTML page so that I could parse them quickly.
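As a rough illustration of that add-on, the sketch below writes scraped results to CSV and to a bare-bones HTML table using only the standard library. The field names (`title`, `link`, `email`) and the sample row are assumptions made for the example; the original script's output format isn't shown in this article.

```python
import csv
import io
from html import escape

FIELDS = ["title", "link", "email"]


def write_csv(listings, out):
    """Write listings (dicts with the FIELDS keys) to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(listings)


def render_html(listings):
    """Render listings as a simple HTML table, escaping every value."""
    rows = "\n".join(
        f'<tr><td>{escape(row["title"])}</td>'
        f'<td><a href="{escape(row["link"])}">{escape(row["link"])}</a></td>'
        f'<td>{escape(row["email"])}</td></tr>'
        for row in listings
    )
    return f"<table>\n{rows}\n</table>"


# Made-up sample data.
listings = [{
    "title": "Software engineer wanted",
    "link": "https://example.org/post/1.html",
    "email": "jobs@example.org",
}]

buffer = io.StringIO()
write_csv(listings, buffer)
print(buffer.getvalue())
print(render_html(listings))
```

To write a real file instead of an in-memory buffer, pass `open("listings.csv", "w", newline="")` as `out`; the `newline=""` argument is what the `csv` module's documentation recommends.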
Of course, there were many other notable features lacking, such as:
- the ability to log the email addresses I had sent messages to
- fatigue rules to prevent over-sending emails to people I’d already reached out to
- special cases, such as some emails requiring a Captcha before they’re displayed to deter automated bots (which I was)
- Craigslist didn’t allow scrapers on their platform, so I would get banned if I ran the script too often. (I tried to switch between various VPNs to try to “trick” Craigslist, but that didn’t work), and
- I still couldn’t retrieve all postings on Craigslist
The last one was a kicker. But I figured if a posting had been sitting for a while, then maybe the person who posted it was not even looking anymore. It was a trade-off I was OK with.
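The first two missing features in the list above, a send log and fatigue rules, could be approached with something like the sketch below. This is only one possible design under stated assumptions: the `ContactLog` class, its method names, and the one-week default cooldown are all hypothetical, and the log lives in memory rather than on disk.

```python
import time


class ContactLog:
    """Remember when each address was last emailed and enforce a cooldown."""

    def __init__(self, cooldown_seconds=7 * 24 * 3600):
        # Assumed default: wait one week before emailing the same person again.
        self.cooldown_seconds = cooldown_seconds
        self.last_sent = {}  # lowercased email address -> timestamp of last send

    def can_send(self, email, now=None):
        """True if the address was never contacted or the cooldown has passed."""
        now = time.time() if now is None else now
        last = self.last_sent.get(email.lower())
        return last is None or now - last >= self.cooldown_seconds

    def record(self, email, now=None):
        """Log that a message was sent to this address."""
        self.last_sent[email.lower()] = time.time() if now is None else now


log = ContactLog()
if log.can_send("jobs@example.org"):
    # ... send the email here ...
    log.record("jobs@example.org")
print(log.can_send("jobs@example.org"))  # False until the cooldown expires
```

Keying the log on the lowercased address avoids treating `Jobs@Example.org` and `jobs@example.org` as two different people. Persisting `last_sent` to a file would be needed for the log to survive between runs.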
The whole experience felt like a game of Tetris. I knew what my end goal was, and my real challenge was fitting the right pieces together to achieve that specific end goal. Each piece of the puzzle brought me on a different journey. It was challenging, but enjoyable nonetheless and I learned something new each step of the way.
Lessons learned
It was an eye-opening experience, and I ended up learning a little bit more about how the Internet (and Craigslist) works, how various different tools can work together to solve a problem, plus I got a cool little story I can share with friends.
In a way, that’s a lot like how technologies work these days. You find a big, hairy problem that you need to solve, and you don’t see any immediate, obvious solution to it. You break down the big hairy problem into multiple different manageable chunks, and then you solve them one chunk at a time.
Looking back, my problem was this: how can I use this awesome directory on the Internet to reach people with specific interests quickly? There were no known products or solutions available to me at the time, so I broke it down into multiple pieces:
That’s all there was to it. Technology merely acted as a means to the end. If I could’ve used an Excel spreadsheet to do it for me, I would’ve opted for that instead. However, I’m no Excel guru, and so I went with the approach that made most sense to me at the time.
Areas of Improvement
There were many areas in which I could improve:
- I picked a language I wasn’t very familiar with to start, and there was a learning curve in the beginning. It wasn’t too awful, because Python is very easy to pick up. I highly recommend that any beginning software enthusiast use it as a first language.
- Relying too heavily on open-source technologies. Open source software has its own set of problems, too. There were multiple libraries I used that were no longer in active development, so I ran into issues early on. I could not import a library, or the library would fail for seemingly innocuous reasons.
- Tackling a project by yourself can be fun, but can also cause a lot of stress. You need a lot of momentum to ship something. This project was quick and easy, but it did take me a few weekends to add in the improvements. As the project went on, I started to lose motivation and momentum. After I found a job, I completely ditched the project.
Resources and Tools I used
The Hitchhiker’s Guide to Python: Great book for learning Python in general. I recommend Python as a beginner’s first programming language, and I talk about how I used it to land offers from multiple top-tier companies in my article here.
DailyCodingProblem: It’s a service that sends out daily coding problems to your email, and has some of the most recent programming problems from top-tier tech companies. Use my coupon code, zhiachong, to get $10 off!
BeautifulSoup — The nifty utility tool I used to build my web crawler
Web Scraping with Python — A useful guide to learning how web scraping with Python works.
Lean Startup - I learned about rapid prototyping and creating an MVP to test an idea from this book. I think the ideas in here are applicable across many different fields and also helped drive me to complete the project.
Evernote — I used Evernote to compile my thoughts together for this post. Highly recommend it — I use this for basically _everything_ I do.
My laptop: This is my current at-home laptop, set up as a workstation. It’s much, much easier to work with than an old BenQ laptop, but both would work for general programming work.
Credits:
Brandon O’brien, my mentor and good friend, for proof-reading and providing valuable feedback on how to improve this article.
Leon Tager, my coworker and friend who proofreads and showers me with much-needed financial wisdom.
You can sign up for industry news, random tidbits and be the first to know when I publish new articles by signing up here.
Zhia Chong is a software engineer at Twitter. He works on the Ads Measurement team in Seattle, measuring ads impact and ROI for advertisers. The team is hiring!
You can find him on Twitter and LinkedIn.
Source: https://www.freecodecamp.org/news/how-i-built-a-web-crawler-to-automate-my-job-search-f825fb5af718/