Best Location To Start an Asian Restaurant in Jozi: A data science perspective
1. Introduction
a. City Background
The city of Johannesburg is located in Gauteng province (one of 9 provinces in the country). It is often referred to as Jozi, Joburg or Egoli and is the largest city in South Africa. As of 2021, the population of Joburg is estimated at 5.7 million, of whom 76.4% are Black African, 12.3% are White, 5.6% are Coloured, and 4.9% are Indian/Asian.
Johannesburg has the largest man-made forest in the world. It is also one of the 50 largest urban agglomerations in the world and the largest city in the world that is not located on or near a body of water.
The city is divided into 7 distinct regions or municipalities:
- Regions or municipalities can be viewed as boroughs
- Each municipality is constructed from locations or suburbs
- Locations or suburbs can be viewed as neighbourhoods
b. Problem Description
Joburg is very diverse in its population distribution and there is a fair mix of ethnic groups within each location/suburb. We will use clustering and segmentation methods to investigate and find a good location for establishing an Asian restaurant within the Midrand municipality.
Special emphasis will be placed on areas where the restaurant would be the first of its kind or would add to a currently small footprint of Asian restaurants.
2. Analytic Approach
Data Description
The analytic approach means identifying what type of patterns are needed to address the question most effectively. If the question is to determine the probability of an action, a predictive model might be used. If the question is to show relationships, a descriptive approach may be required.
This project will look into the municipality/Region data and the locations/suburbs within each municipality in Joburg.
For each location in Joburg, we will look at the venues (i.e. restaurants) that exist within that location.
a. Locations within Joburg:
This data is retrieved from the Johannesburg postal code listing (https://www.southafricapostcode.com/location/gauteng/city-of-johannesburg/):
- This data will be scraped from the 57 pages of listings at the above URL
- This data does not contain the Region/Municipality of each location, which will be retrieved from other sources
b. Region/Municipality data in Joburg:
Municipality data is retrieved from (https://www.joburg.org.za/about_/regions/Pages/City-of-Johannesburg-regions.aspx):
- Data is arranged in 7 different Regions (A to G)
- Data is scraped from each Region; however, the format is not consistent across pages, as some pages present data in HTML while others use PDF
- This will be explained further in Section 3
c. Venues data within each location in Joburg:
This data will be retrieved from Foursquare using predefined credentials for the Foursquare API.
3. Methodology
This section is the main component of the report. It describes the exploratory data analysis, any inferential statistical testing that was performed, and the machine learning techniques that were used and why.
a. Data Requirements
- The data required consists of location/suburb information that contains postal codes
- Postal codes are translated into latitude and longitude coordinates
- The Region/Municipality of each location is then merged with the location data to provide the final dataset as shown below
b. Data Collection and preparation
i) Locations within Joburg are scraped from https://www.southafricapostcode.com/location/gauteng/city-of-johannesburg/ over 57 pages of content.
Here we construct a scraper function that retrieves data from each page and stores it in a dataframe called Jozi_df:
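A minimal sketch of such a scraper, using only the Python standard library. The source does not show the page markup or the pagination scheme, so this assumes each page lists locations as HTML table rows with the suburb name in the first cell and the postal code in the second. The names `LocationTableParser`, `parse_locations` and `scrape_locations` are hypothetical, as is the `fetch` callable that returns the HTML of one page.

```python
from html.parser import HTMLParser


class LocationTableParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:   # skip header rows, which only hold <th> cells
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_locations(html):
    """Return [{'Location': ..., 'PostalCode': ...}, ...] for one page."""
    parser = LocationTableParser()
    parser.feed(html)
    return [
        {"Location": row[0], "PostalCode": row[1]}
        for row in parser.rows
        if len(row) >= 2    # assumed layout: name first, postal code second
    ]


def scrape_locations(page_urls, fetch):
    """Parse every page; `fetch(url)` must return that page's HTML text."""
    records = []
    for url in page_urls:
        records.extend(parse_locations(fetch(url)))
    return records
```

Passing the resulting list of records to `pandas.DataFrame` would yield a dataframe like Jozi_df. For live pages, `fetch` could be a small wrapper around `urllib.request.urlopen`.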
ii) Assigning coordinates to Joburg locations requires the geocoder library. Here we use the postal codes in Jozi_df to look up latitude and longitude data.
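The lookup can be sketched as below. Geocoding services occasionally return no result, so the sketch retries a few times before giving up. The `geocode` argument stands in for a geocoder call such as `geocoder.arcgis`, whose result exposes a `latlng` attribute; which provider the project used is not stated, and the function name `postal_code_to_latlng` and the query format are assumptions.

```python
def postal_code_to_latlng(postal_code, geocode, max_tries=3):
    """Return (lat, lng) for a postal code, or None if every try fails.

    `geocode` is any callable that takes a query string and returns an
    object with a `latlng` attribute (for example `geocoder.arcgis`).
    """
    # Assumed query format: adding city and country narrows the match.
    query = f"{postal_code}, Johannesburg, South Africa"
    for _ in range(max_tries):
        result = geocode(query)
        latlng = getattr(result, "latlng", None)
        if latlng:                  # None or empty means "not found"
            return tuple(latlng)
    return None
```

Taking the geocoder as a parameter keeps the retry logic independent of any one provider and makes it easy to try out without network access.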
iii) Region/Municipality data is retrieved from (https://www.joburg.org.za/about_/regions/Pages/City-of-Johannesburg-regions.aspx). The data for each Region is retrieved with a specialized function, as the format of the data varies from Region to Region.
a) For Regions A, C, D and G the format is similar and a single function, Regioner, is used to retrieve the locations within each municipality:
b) For Region F a special function (RegionerF) is created to retrieve the data, as its format differs from that of the preceding regions:
c) For Region B, we create a special function (RegionerB) that retrieves the data for all locations that fall within Region B:
d) Region E data is presented in PDF format. For this we use the camelot library to scrape data from the Region E URL:
All the location data within each municipality is then gathered into the Municipality_df dataframe:
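One way to gather the per-region results, sketched with plain Python structures. It assumes each region function returns a list of suburb names. The helper name `build_municipality_rows`, the column names and the name normalization (trimming and title-casing, so names can later be matched against the location data) are assumptions, not taken from the source.

```python
def build_municipality_rows(suburbs_by_region):
    """Flatten {region: [suburb, ...]} into one row per suburb.

    Names are whitespace-trimmed and title-cased, and repeated
    (region, suburb) pairs are dropped while keeping the input order.
    """
    rows, seen = [], set()
    for region, suburbs in suburbs_by_region.items():
        for name in suburbs:
            location = " ".join(name.split()).title()
            key = (region, location)
            if location and key not in seen:   # skip blanks and repeats
                seen.add(key)
                rows.append({"Location": location, "Municipality": region})
    return rows
```

Passing the returned rows to `pandas.DataFrame` would give a dataframe with the same shape as Municipality_df.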
We merge the location and municipality dataframes to generate the final dataset described in the Data Requirements section:
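The merge can be sketched as a left join on the location name. The sketch uses plain dictionaries to stay dependency-free; with pandas the same step would be `Jozi_df.merge(Municipality_df, on="Location", how="left")`. The join key `Location` and the column name `Municipality` are assumptions, since the source does not list the column names.

```python
def merge_locations(locations, municipality_rows):
    """Left-join location records with municipality rows on 'Location'.

    Locations without a matching municipality are kept, with
    'Municipality' set to None so they can be filtered out later.
    If a suburb is listed under several regions, the last one wins.
    """
    region_of = {row["Location"]: row["Municipality"] for row in municipality_rows}
    return [
        {**loc, "Municipality": region_of.get(loc["Location"])}
        for loc in locations
    ]
```

Unmatched entries can then be dropped with a simple filter such as `[r for r in merged if r["Municipality"] is not None]`.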
c. Data Understanding
To better understand the locations and distribution of suburbs within Joburg, we use the folium library to display the data on a map:
First we filter the Joburg municipality and locations data such that all entries that were not matched with a municipality are discarded:
After filtering, the location data shown on the map is reduced to:
4. Results
We pick the Midrand municipality as our area of interest and filter all locations whose municipality is Midrand into the dataframe Midrand_Municipality:
We then visualize all locations within Midrand:
The Foursquare API is used to retrieve all venues within each location in Midrand and the data is stored in Midrand_venues_sorted dataframe:
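A hedged sketch of the venue lookup follows. The source does not say which endpoint was called; this assumes the legacy Foursquare v2 `venues/explore` endpoint and its response layout (`response.groups[0].items[].venue`). The radius, limit, version date and function names are placeholders, and the client ID and secret must be supplied by the reader.

```python
from urllib.parse import urlencode

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"  # assumed endpoint


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20210101", radius=1000, limit=100):
    """Build the request URL for venues around one location."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,           # API version date, YYYYMMDD (placeholder)
        "ll": f"{lat},{lng}",
        "radius": radius,       # metres (placeholder value)
        "limit": limit,
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


def extract_venues(location, payload):
    """Flatten one decoded JSON response into a list of venue records."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    venues = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or [{}]
        venues.append({
            "Location": location,
            "Venue": venue["name"],
            "Category": categories[0].get("name"),  # None if uncategorised
        })
    return venues
```

Calling `build_explore_url` once per Midrand location, fetching each URL, and concatenating the `extract_venues` results gives one record per venue, which is the raw material for a table like Midrand_venues_sorted.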
a. Modelling
The segmentation of the locations in Midrand will be conducted using an unsupervised clustering algorithm called k-means.
For our cluster model of the Midrand data, we use 3 clusters. Each observation (location) is assigned to the cluster whose mean is nearest, so that locations with similar venue profiles are grouped together.
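To make the clustering step concrete, here is a minimal pure-Python sketch of Lloyd's algorithm, the standard k-means procedure. It is an illustration only: the source does not show which implementation was used, and the feature vectors (one row per location, one column per venue category) and the explicit starting centroids are assumptions.

```python
def kmeans(points, centroids, max_iter=100):
    """Cluster `points` with Lloyd's algorithm, starting from `centroids`.

    Returns (labels, centroids); labels[i] is the index of the centroid
    nearest to points[i] by squared Euclidean distance.
    """
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:    # assignments stable: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:             # an empty cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

With three starting centroids this yields the three-cluster segmentation described above. Because the result depends on the starting centroids, library implementations typically run several random initializations and keep the best one.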
We visualize the clustered data on a folium map:
b. Evaluation
Out of the 3 clusters, the distribution per cluster is shown below:
Within cluster 0, the most common restaurant types are African, Indian and Italian.
Within cluster 1, the most common restaurant types are Italian, Indian and Mexican.
Within cluster 2, the most common restaurant types are African, Indian and Italian.
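A per-cluster ranking like the one above can be produced by counting venue categories within each cluster. A minimal sketch, assuming each venue record carries a `Cluster` label and a `Category` name (both field names, and the function name `top_categories`, are hypothetical):

```python
from collections import Counter, defaultdict


def top_categories(venues, n=3):
    """Return {cluster: [the n most frequent categories]}."""
    counts = defaultdict(Counter)
    for venue in venues:
        counts[venue["Cluster"]][venue["Category"]] += 1
    return {
        cluster: [name for name, _ in counter.most_common(n)]
        for cluster, counter in sorted(counts.items())
    }
```

Ranking by raw counts favours clusters and categories with many venues; comparing shares within each cluster is a reasonable alternative when cluster sizes differ widely.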
5. Discussion and Conclusion
Cluster 1 (purple dots) contains most of the establishments within the Midrand municipality and is the most saturated cluster. It would not be the best area in which to establish a new restaurant.
Cluster 0 (red dots) and Cluster 2 (green dots) are unsaturated markets and would form a good base for an Asian restaurant. Cluster 2, especially the Brendavere suburb, would be the most suitable location to establish an Asian restaurant, as it draws traffic from a wider mix of nearby venues, including office buildings, outdoor sites and historic areas.
6. References
[2]: https://en.wikipedia.org/wiki/Suburbs_of_Johannesburg
[3]: https://www.southafricapostcode.com/location/gauteng/city-of-johannesburg/
[4]: https://www.joburg.org.za/about_/regions/Pages/City-of-Johannesburg-regions.aspx