Commit 8e753a32 authored by Athokshay Ashok, committed by GitHub
Add files via upload

parent dfdaed24
%% Cell type:markdown id: tags:
# IBM Capstone Project
%% Cell type:code id: tags:
``` python
import numpy as np
import pandas as pd
!pip install BeautifulSoup4
!pip install requests
!pip install lxml
```
%% Cell type:code id: tags:
``` python
print('Hello Capstone Project Course!')
```
%% Cell type:markdown id: tags:
## Week 3: Segmenting and Clustering the Neighborhoods in the City of Toronto, Canada
%% Cell type:code id: tags:
``` python
from bs4 import BeautifulSoup
import requests
# Use Beautiful Soup to extract the page text
source = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(source.text, 'html.parser')
# Find the table in the HTML and extract all data into rows
data = []
columns = []
table = soup.find(class_='wikitable')
for index, tr in enumerate(table.find_all('tr')):
    section = []
    for td in tr.find_all(['th', 'td']):
        section.append(td.text.rstrip())
    if index == 0:
        # The first row holds the column headers
        columns = section
    else:
        data.append(section)
canada_df = pd.DataFrame(data=data, columns=columns)
canada_df.head()
```
%%%% Output: execute_result
   Postal Code  Borough           Neighbourhood
0  M1A          Not assigned      Not assigned
1  M2A          Not assigned      Not assigned
2  M3A          North York        Parkwoods
3  M4A          North York        Victoria Village
4  M5A          Downtown Toronto  Regent Park, Harbourfront
%% Cell type:code id: tags:
``` python
# Remove all rows where the borough is not assigned
canada_df = canada_df[canada_df['Borough'] != 'Not assigned']
canada_df.head()
```
%%%% Output: execute_result
   Postal Code  Borough           Neighbourhood
2  M3A          North York        Parkwoods
3  M4A          North York        Victoria Village
4  M5A          Downtown Toronto  Regent Park, Harbourfront
5  M6A          North York        Lawrence Manor, Lawrence Heights
6  M7A          Downtown Toronto  Queen's Park, Ontario Provincial Government
%% Cell type:code id: tags:
``` python
# More than one neighborhood can exist in one postal code area. For example, a postal code such as M5A
# may be listed in two rows, one per neighborhood (Harbourfront and Regent Park).
# Such rows should be combined into one row with the neighborhoods separated by a comma.
# No grouping was needed here: the scraped table already has one row per postal code
# with all the corresponding neighborhoods, as in the M5A row above.
```
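%% Cell type:markdown id: tags:
If the scraped table had listed a postal code once per neighborhood, the combining step described above could be sketched roughly as follows. This is an illustration only, not part of the original notebook: `raw_df` and its three rows are hypothetical sample data in the ungrouped layout, and `grouped_df` is a name chosen for this example.
%% Cell type:code id: tags:
```python
import pandas as pd

# Hypothetical sample in the ungrouped layout: M5A appears once per neighbourhood.
raw_df = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Collapse to one row per postal code, joining neighbourhood names with ", ".
# sort=False keeps the postal codes in order of first appearance.
grouped_df = (
    raw_df.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(grouped_df)
```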
%% Cell type:code id: tags:
``` python
# Set the postal code as the index (skipped if it already is the index)
if canada_df.index.name != 'Postal Code':
    canada_df = canada_df.set_index('Postal Code')
canada_df.head()
```
%%%% Output: execute_result
             Borough           Neighbourhood
Postal Code
M3A          North York        Parkwoods
M4A          North York        Victoria Village
M5A          Downtown Toronto  Regent Park, Harbourfront
M6A          North York        Lawrence Manor, Lawrence Heights
M7A          Downtown Toronto  Queen's Park, Ontario Provincial Government
%% Cell type:code id: tags:
``` python
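# If a row has a borough but a "Not assigned" neighborhood, use the borough as the neighborhood.
# Series.mask swaps in the borough row by row and avoids an in-place replace on a column selection.
not_assigned = canada_df['Neighbourhood'] == 'Not assigned'
canada_df['Neighbourhood'] = canada_df['Neighbourhood'].mask(not_assigned, canada_df['Borough'])
canada_df.head()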
```
%%%% Output: execute_result
             Borough           Neighbourhood
Postal Code
M3A          North York        Parkwoods
M4A          North York        Victoria Village
M5A          Downtown Toronto  Regent Park, Harbourfront
M6A          North York        Lawrence Manor, Lawrence Heights
M7A          Downtown Toronto  Queen's Park, Ontario Provincial Government
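%% Cell type:markdown id: tags:
The first five rows shown above contain no "Not assigned" neighborhood, so the output does not make the fallback rule visible. A minimal sketch on a small frame, assuming the same column names, shows the intended behavior. `sample_df` is a hypothetical frame, and `M9Z` is a made-up postal code used only to exercise the rule.
%% Cell type:code id: tags:
```python
import pandas as pd

# Hypothetical rows: M9Z is a made-up code with a borough but no neighbourhood.
sample_df = pd.DataFrame(
    {
        "Borough": ["North York", "Queen's Park"],
        "Neighbourhood": ["Parkwoods", "Not assigned"],
    },
    index=pd.Index(["M3A", "M9Z"], name="Postal Code"),
)

# Where the neighbourhood is "Not assigned", take the borough from the same row.
missing = sample_df["Neighbourhood"] == "Not assigned"
sample_df["Neighbourhood"] = sample_df["Neighbourhood"].mask(missing, sample_df["Borough"])
print(sample_df)
```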
%% Cell type:code id: tags:
``` python
canada_df.shape
```
%%%% Output: execute_result
(103, 2)
%% Cell type:code id: tags:
``` python
```