Tuesday, December 10, 2019

Advanced Data Analysis Report

Questions: Case Study Baycoast is a (fictitious) local government area (called a 'city') within greater Melbourne, Australia. It consists of a number of different suburbs, all with their own history of development. The city grew indifferent stages, with new suburbs gradually emerging. It covers some wealthy suburbs and some not so wealthy. As the name would indicate, the city is located on the Bay.The city stretches for several kilometres along the Bay's lovely beaches, and for several kilometres inland. About 60,000 people live in the suburbs of Baycoast.The main objective is to conduct exploratory, descriptive and causal analysis is to gain a comprehensive understanding of house prices in the Baycoast region and an understanding of themost important factors that impact prices. Your analysis will be based on a random sample of 120 houses from the city. Note that for the purpose of the assignment the unit of analysis is a House. Itis defined as a stand-alone dwelling. That is, flats, apartments, etc are n ot included in the database. The assignment requires five separate tasks: Summary of House Prices Factors influencing house prices Development of a multiple regression model Time Series analysis Critique the Business Research Approach Answers: Description of House Price - This study data includes 24 variables. Out of these House no. is an identifying variable same as serial number. Price ($000) is the dependent variable of this study. And Rest 22 variables; Rooms, Lot Size(sq m), Age, Area (sq m), Material, To Train(km), To Bus(km), To Shops(km), Street, Storeys, Style, Bedrooms, Bathrooms, Kitchen, Heating, AirCon, Bay Views, Suburb, Weekly Rent $, Rental Return %, Condition and Rental Status, are independent variables. Out of these indpendent variables; Lot Size(sq m), Age, Area (sq m), To Train(km), To Bus(km), To Shops(km), Bay Views, Weekly Rent $ and Rental Return % are continuous variables while rest are categorical variables. The dependent variable price is a continuous variable. So, it described descriptively by calculating mean and standard deviation. To understand the distribution of the data skewness and kurtosis were also calculated. Mean for Price is 886.57 and median is 852, median is little smaller than mean implying the positive skewed or skewed to the right which is also supported by the positive value of skewness (0.426). Negative value of kurtosis (-0.1478) implies the slight flat peaks of the normal curve as can be observed in the histogram plotted below. Standard Deviation is 324.95. Variable price seem to follow almost normal distribution with slight right skewness and flat peak as observed in the histogram presented below. Only one outlier with a value of 1500 was found as an outlier in the boxplot. Factors affecting House Price To find the factors affecting House Price, first correlation was run for all the independent variables with variable price. Amongst these correlation coefficients, any four values of correlation coefficient greater than 0.4 were selected and these factors were studied. The cut-off value of 0.4 was chosen so as to have variables with at least moderately good correlation with the variable price. So, based on the correlation coefficient the variables selected are Rooms, Street, Bay Views and Weekly Rent ($). By common sense if we think then we will find that these variables really affect the price of a house. Correlation coefficients are reported below - Correlation coefficient Rooms 0.505469 Lot Size(sq m) 0.41083 Age -0.36346 Area (sq m) 0.567769 Material 0.164491 To Train(km) 0.003012 To Bus(km) -0.02383 To Shops(km) -0.08373 Street 0.72257 Storeys 0.565098 Style 0.364688 Bedrooms 0.539745 Bathrooms 0.331223 Kitchen -0.05608 Heating 0.243694 AirCon 0.122417 Bay Views 0.676194 Suburb 0.373759 Weekly Rent $ 0.665591 Rental Return % -0.40384 Condition 0.354435 Rental Status 0.068507 The correlation coefficients for the factors selected are Correlation coefficient Rooms 0.505469 Street 0.72257 Bay Views 0.676194 Weekly Rent $ 0.665591 Scatter plot between Weekly Rent and Price was plotted to visually inspect the relation between these two variables. Scatter plot was plotted between these two variables only because these are continuous variables. Slight linear relationship is observed between Weekly Rent and Selling Price. To compare the two continuous variables descriptively, descriptive statistics described below Price($'000) Weekly Rent $ Mean 886.575 Mean 604.3333333 Standard Error 29.66343575 Standard Error 20.67073241 Median 852 Median 595 Mode 811 Mode 835 Standard Deviation 324.9466579 Standard Deviation 226.4365284 Sample Variance 105590.3305 Sample Variance 51273.5014 Kurtosis -0.14778497 Kurtosis 0.225284528 Skewness 0.426005063 Skewness 0.523856086 Range 1569 Range 1170 Minimum 192 Minimum 200 Maximum 1761 Maximum 1370 Sum 106389 Sum 72520 Count 120 Count 120 Confidence Level(95.0%) 58.73656476 Confidence Level(95.0%) 40.9301142 The standard error and skewness of the two variables is quite comparable. Weekly rent has positive kurtosis as compared to negative kurtosis for Price implying steeper peaks than Price. The other three variables are categorical variables so mean Price against each category was calculated to view the effect of price of each category. Average of Price($'000) Rooms Total 5 744.86 6 721.00 7 808.00 8 1137.32 9 1182.69 Grand Total 886.58 Average of Price($'000) Street Total 3 520.92 4 684.29 5 866.93 6 884.40 7 882.60 8 1103.10 9 1359.88 10 1499.00 Grand Total 886.58 Average of Price($'000) Bay Views Total 0 753.12 0.1 817.50 0.2 867.25 0.3 991.17 0.4 1041.29 0.5 1079.40 0.6 1110.00 0.7 1176.80 0.8 1362.50 0.9 1455.00 1 1510.00 Grand Total 886.58 From the above tables we can observe that as the category increases, i.e number of rooms increases or number of street increases or the proportion of Bay views increases then Price of the House also increases with only one exception of price for 7 Street is less than price for Street 6. Multiple Regression As the data contains 22 independent variables and Excel has limitation on the number of dependent variables to be used, two multiple regression with 11 variables each were run sing Data analysis toolpak of excel. Amongst these two multiple regression whichever variables showed significant p-value (p 0.05) for t-test for slope were included in the model. The model included 9 variables out of which variable Bedroom showed not significant p-value and was excluded. This leads to final multiple regression model with 8 independent variables influencing Price of a house. These variables are Lot size, Age, Area, Street, Storeys, Bay Views, Weekly Rent and Rental rent. The final model is Price ($ 000) = 116.0136 + 0.0642 * Lot size - 2.0876 *Age + 1.0457 * Area + 39.3334 * Street + 92.9236 * Storeys + 314.9985483 * Bay Views + 0.4481 * Weekly Rent - 66.6201 Rental rent The variables Age and Rental rent are negatively linearly related to Price meaning if age of the house increases or the rental rate increases than Price decrease or vice versa. All other variables; Lot size, Area, Street, Storeys, Bay Views and Weekly Rent are positively linearly related meaning increase in any of these variables are associated with increase in house price. The constant value of 116.0136 implies that if all the variables are zero still the price of house is 116.0136 ($ 000) which indirectly means its land price coz without anything there could be no house practically. The coefficients of variables imply that if the variable is increased by unit then Price of the house will increase by units equal to coefficient value. Eg: If the lot size increases by one unit then Price will increase by 0.0642 units. The R squared value obtained for this model is 0.9804. R squared value represents goodness of fit of the model. So 98.04% value of goodness of fit implies very good fit of the model. This also explains that 98.04% of variance in the model is explained by the independent variables included in the model and only approx. 2% remains unexplained. Adjusted R square value is 97.90 which also imply very good fit of the model. The p value for Analysis of Variance obtained is very less which implies the model to be highly significant at 5% level of significance. If we compare the positive coefficients then Bay views has the highest coefficient followed by Storeys followed by Street followed by Area followed by Weekly Rent followed by Lot size. Lot size influences the price of a house by least strength and Bay views by highest strength. Time series analysis Time series data from 4th quarter of 2009 to 3rd quarter of 2013 was used to seasonalize the trend using the software Megastat an add-in of Excel. Deseasonalization was carried out using multiplicative model. Forecast equation derived through regression is y = 25.42t+504.5 where 504.5 is constant and 25.42 is a coefficient or slope of the line and t is the number of serial order. Using this equation Median House Price was forecasted from 4th quarter of 2013 to 3rd Quarter of 2014 using t from 17 to 20 as below t Median House price 17 936.64 18 962.06 19 987.48 20 1012.9 Using the observed median house price provided we calculated Mean Absolute Percentage Error (MAPE) by following formula MAPE = 100/n*(Actual Forecast)/Actual Where n = number of actual observations provided t Forecasted Actual ratio abs ratio 17 936.64 980.00 0.04 0.04 18 962.06 1062.00 0.09 0.09 19 987.48 1206.00 0.18 0.18 20 1012.90 954.00 -0.06 0.06 sum 0.38 MAPE 9.53 The value of MAPE obtained for this data is 9.53 which imply good fit of the model and there is very less difference between forecasted values and observed values. Discussion The data in this study includes 120 observations and 24 variables out of which one, Price, is a dependent variable. Rest 23 variables, except House no., are independent variable. House No. is identifying variable. Amongst all these variables few variables were more correlated with Price than other few. Few variables showed positive correlation and few negative. Most of the negative correlations observed were very weak. Out of positive correlation variables Street, Bay views and Weekly Rent show strong correlation. Out of the 22 variables, 8 variables; Lot size, Age, Area, Street, Storeys, Bay Views, Weekly Rent and Rental rent were observed to determine the price of the house significantly. Thus, these are the factors which contribute to the estimation of House price significantly. Out of these Age and Rental rent are inversely related to House Price. The model showed a good fit with R squared value of approx. 98% implying that approx. 98% of variance in House Price is explained by these independent variables and model. Only approx. 2% of variance in House price remained unexplained. The model obtained was significant with highly significant p value at 5% level of significance implying that these results should be considered and are not just obtained by chance. Bay views most strongly affect the House price and lot size with the least strength. Using time series data from 4th Quarter of 2009 to 3rd quarter of 2013 was deseasonalised using Multiplicative model. Forecast equation was obtained using regression. Then values of median house price were forecasted from 4th quarter of 2013 to 3rd quarter of 2014. These values were then compared with the observed values by calculating Mean Absolute Percentage Error determining the fit of the forecasted values. Mean Absolute Percentage Error of 9.53 indicates good fit. Thus, the results of multiple regression and time series analysis implies that this data provides a good fit. Thus, the model fit is able to forecast the values with high accuracy. This data has high potential.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.