How to scrape websites with Python

by Devanshu Jain

It is that time of the year when the air is filled with the claps and cheers of 4 and 6 runs during the Indian Premier League Cricket T20 tournament followed by the ICC Cricket World Cup in England. And how can we forget the election results of the world’s largest democratic country, India, that will be out in the next few weeks?

To stay updated on who will be getting this year’s IPL title or which country is going to get the ICC World Cup in 2019 or how the country’s future will look in the next 5 years, we constantly need to be glued to the Internet.

But if you’re like me and cannot spare much time on the Internet, but have a strong desire to stay updated with all these titles, then this article is for you. So without wasting any time, let’s get started!

There are two ways in which we can access the updated information. One is through APIs provided by these media websites, and the other is through web/content scraping.

The API way is simple, and it is probably the best way to get updated information: just call the associated programming interface. But sadly, not all websites provide publicly accessible APIs. So the other route left to us is web scraping.

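For illustration, a minimal sketch of the API approach might look like the following, using the requests library against a hypothetical JSON endpoint (the URL and field names are placeholders, not a real API):

import requests

# call a (hypothetical) live-scores API and read its JSON response
response = requests.get('https://api.example.com/live-scores')
response.raise_for_status()

for match in response.json().get('matches', []):
    print(match['teams'], match['score'])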

Web Scraping

Web scraping is a technique to extract information from websites. This technique mostly focuses on the transformation of unstructured data (HTML format) on the web into structured data (database or spreadsheet). Web scraping may involve accessing the web directly using HTTP, or through a web browser.

In this article, we’ll be using Python to create a bot for scraping content from the websites.

Process Workflow

  • Get the URL of the page from which we want to extract/scrape data
  • Copy/download the HTML content of the page
  • Parse the HTML content and get the required data

The above flow helps us to navigate to the URL of the required page, get its HTML content, and parse the required data. But sometimes we first have to log in to the website and then navigate to a specific location to get the required data. In that case, the workflow gains one more step: logging into the website.

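To make the basic, no-login flow concrete, here is a minimal sketch using the requests library together with Beautiful Soup (the URL is only a placeholder):

import requests
from bs4 import BeautifulSoup

# Step 1: the URL of the page we want to scrape (placeholder)
url = 'https://example.com/scores'

# Step 2: download the HTML content of the page
html = requests.get(url).text

# Step 3: parse the HTML and extract the required data
soup = BeautifulSoup(html, 'html.parser')
for row in soup.find_all('tr'):
    print(row.get_text(strip=True))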

Packages

For parsing the HTML content and getting the required data, we use the Beautiful Soup library. It’s an amazing Python package for parsing HTML and XML documents. Do check it out here.

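As a quick taste of the Beautiful Soup API, a toy snippet (not part of the bot itself) looks like this:

from bs4 import BeautifulSoup

html = '<html><body><h1>Scores</h1><p class="team">India</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.h1.text)                         # Scores
print(soup.find('p', class_='team').text)   # India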

For logging into the website, navigating to the required URL within the same session, and downloading the HTML content, we'll be using the Selenium library. Selenium with Python helps with clicking buttons, entering text into form fields, and much more.

Dive right into the code

First we are importing all the libraries that we are going to use.

# importing libraries
from selenium import webdriver
from bs4 import BeautifulSoup

Next, we need to give Selenium the path to the browser driver so it can launch our web browser (Google Chrome). And if we don't want our bot to display the browser's GUI, we can add the headless option to Selenium. Headless browsers provide automated control of a web page in an environment similar to popular web browsers, but they are driven via a command-line interface or over network communications.

# chrome driver path
chromedriver = '/usr/local/bin/chromedriver'
options = webdriver.ChromeOptions()
options.add_argument('headless')  # for opening headless browser
browser = webdriver.Chrome(executable_path=chromedriver, chrome_options=options)

After the environment has been set up by defining the browser and installing the libraries, we'll be getting our hands on the HTML. Navigate to the login page and find the id, class, or name of the email, password, and submit-button fields so that we can enter our content into the page structure.

# Navigating to the login page
browser.get('http://playsports365.com/default.aspx')

# Finding the tags by name
email = browser.find_element_by_name('ctl00$MainContent$ctlLogin$_UserName')
password = browser.find_element_by_name('ctl00$MainContent$ctlLogin$_Password')
login = browser.find_element_by_name('ctl00$MainContent$ctlLogin$BtnSubmit')

Next, we'll send the credentials into these HTML fields and then click the submit button to enter our content into the page structure.

# appending login credentials
email.send_keys('********')
password.send_keys('*******')

# clicking submit button
login.click()

Once the login is successful, navigate to the required page and get the page's HTML content.

# After successful login, navigating to Open Bets Page
browser.get('http://playsports365.com/wager/OpenBets.aspx')

# Getting HTML content and parsing it
requiredHtml = browser.page_source

Now, we’ve received the HTML content and the only thing that is left is parsing this content. We’ll parse the content using the Beautiful Soup and html5lib libraries. html5lib is a Python package that implements the HTML5 parsing algorithm which is heavily influenced by current browsers. As soon as we get the normalized structure of the parsed content, we can find our data present in any child tag of the HTML tag. Our data is present in the table tag and that’s why we’re searching for that tag.

soup = BeautifulSoup(requiredHtml, 'html5lib')
table = soup.findChildren('table')
my_table = table[0]

Once we find the parent tag, we just need to recursively traverse within its children and print the values.

# fetching tags and printing values
rows = my_table.findChildren(['th', 'tr'])
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        value = cell.text
        print(value)

To execute the above program, install the Selenium, Beautiful Soup, and html5lib libraries using pip. After installing the libraries, running python <program name> will print the values to the console.

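For reference, the install and run steps would look roughly like this (the script name is just a placeholder; note that Beautiful Soup is published on PyPI as beautifulsoup4):

pip install selenium beautifulsoup4 html5lib
python scraper.py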

In this way, we can scrape and find data from any website.

Now, if we are scraping a website which changes its content very frequently, like cricket scores or live election results, we can run this program in a cron job and set an interval for the cron job.

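For example, a crontab entry along these lines would run the scraper every 5 minutes (the interpreter and script paths are placeholders and need to match your system):

*/5 * * * * /usr/bin/python3 /home/user/scraper.py >> /home/user/scraper.log 2>&1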

Apart from that, we can also have the results displayed right on our screen instead of in the console, by printing them in a notification that pops up on the desktop after a particular time interval. We can even push these values to a messaging client. Python has rich libraries that can help us with all of that.

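One possible sketch, assuming a Linux desktop where the notify-send command is available (other platforms would need a different notification mechanism):

import subprocess

def notify(title, message):
    # show a desktop notification via the notify-send command (Linux only)
    subprocess.run(['notify-send', title, message], check=False)

# example usage with placeholder text
notify('Live score', 'IND 245/3 (42 overs)')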

If you want me to explain how to set up a cron job and get notifications to appear on the desktop, feel free to ask me in comment section.

Until next time, bye and I hope you liked the article.

Translated from: https://www.freecodecamp.org/news/scrap-websites-using-python-c0c7ad41d2dd/
