Real-time Crawling of Pull Request Count
In this lesson, we will crawl the number of Pull Requests from the Django repository page on GitHub and display it on the screen.
Note that a Pull Request is a proposal to merge a set of changes into a repository, often one owned by another user or organization.
Step 1
Fetching HTML from the Web Page
response = requests.get(url)
html_content = response.text
- requests.get(url): Retrieves data from the web page at the given URL. Here, it is the URL of the Django GitHub repository page.
- response.text: Extracts the HTML content as a string from the response returned by the requests.get function.
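A slightly more defensive version of this step might look like the sketch below. The fetch_html helper name, the 10-second timeout, and the example URL in the comment are assumptions made for illustration; they are not part of the lesson's code.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the HTML of the page at `url` as a string (hypothetical helper)."""
    # A timeout keeps the script from waiting forever if the server never answers.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text


# Example usage (performs a real network request, so it is left commented out):
# html_content = fetch_html("https://github.com/django/django")
```

Checking the status before reading response.text means a failed request surfaces as an exception at the fetch step rather than as a confusing parsing error later.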
Step 2
Parsing HTML
soup = BeautifulSoup(html_content, "html.parser")
- BeautifulSoup(html_content, "html.parser"): Uses BeautifulSoup to parse the obtained HTML content (html_content) with "html.parser", the parser included in Python's standard library. This allows easy access to various elements within the HTML document.
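To see what the parsed object makes possible, here is a minimal sketch that parses a small, made-up HTML string instead of a downloaded page. The sample markup and the variable names page_title, heading, and intro are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML document standing in for a downloaded page.
html_content = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1 id="main-title">Hello</h1>
    <p class="intro">First paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html_content, "html.parser")

# Access the <title> tag directly as an attribute of the soup.
page_title = soup.title.get_text()
# Look up a single element by its id attribute.
heading = soup.find(id="main-title").get_text()
# Look up a single element by tag name and CSS class.
intro = soup.find("p", class_="intro").get_text()

print(page_title)
print(heading)
print(intro)
```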
Step 3
Extracting Information
count = soup.find(id="pull-requests-repo-tab-count").get_text()
- soup.find(id="pull-requests-repo-tab-count"): Searches for an element with the ID pull-requests-repo-tab-count in the parsed HTML content. This ID corresponds to the element that displays the number of pull requests on the GitHub repository page.
- .get_text(): Extracts the text content (in this case, the number of pull requests) from the found element.
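One thing to keep in mind is that soup.find returns None when no matching element exists, so calling .get_text() on the result would raise an AttributeError if the page layout changes. The sketch below guards against that case. The extract_pr_count helper and the sample HTML strings are hypothetical, written only to illustrate the idea.

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_pr_count(html_content: str) -> Optional[str]:
    """Return the pull request count text, or None if the element is missing."""
    soup = BeautifulSoup(html_content, "html.parser")
    element = soup.find(id="pull-requests-repo-tab-count")
    if element is None:
        # The ID was not found, for example because the page markup changed.
        return None
    # strip=True removes surrounding whitespace from the extracted text.
    return element.get_text(strip=True)


# Made-up HTML fragments used in place of a real GitHub page.
sample_html = '<span id="pull-requests-repo-tab-count"> 123 </span>'
print(extract_pr_count(sample_html))               # prints 123
print(extract_pr_count("<p>no counter here</p>"))  # prints None
```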
Note: When performing crawling, make sure to check the robots.txt file and terms of service of the target website to ensure compliance with their regulations.
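As a rough illustration of the robots.txt check, the sketch below uses Python's standard urllib.robotparser module on a made-up set of rules. The rules, the example.com URLs, and the "my-crawler" user agent name are assumptions for illustration; a real site's robots.txt will differ.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules, used here so the example runs without network access.
sample_rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() accepts the lines of a robots.txt file.
parser.parse(sample_rules.splitlines())

# can_fetch() reports whether the given user agent may request the URL.
allowed_public = parser.can_fetch("my-crawler", "https://example.com/public/page")
allowed_private = parser.can_fetch("my-crawler", "https://example.com/private/page")

print(allowed_public)   # prints True
print(allowed_private)  # prints False

# For a live site, set_url() followed by read() downloads the real file instead:
# parser.set_url("https://example.com/robots.txt")
# parser.read()
```

This only covers robots.txt; the terms of service still need to be read separately.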
Practice Exercise
- Execute the above code using various repository URLs from GitHub.
- Practice targeting different HTML tags and extracting data from those tags.
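As a possible starting point for the second exercise, the sketch below collects several elements at once with find_all and select. The HTML fragment and the variable names are made up for illustration and do not reflect GitHub's actual markup.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment with a few links and a heading.
html_content = """
<h2 class="section">Repositories</h2>
<ul>
  <li><a href="/django/django">django</a></li>
  <li><a href="/pallets/flask">flask</a></li>
</ul>
"""

soup = BeautifulSoup(html_content, "html.parser")

# find_all returns every matching tag; here, each <a> becomes a (text, href) pair.
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

# select takes a CSS selector; "h2.section" matches <h2> tags with class "section".
headings = [h.get_text() for h in soup.select("h2.section")]

print(links)
print(headings)
```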