
Predicting the Value of Your House

Posted by Phuc Duong on Sep 22, 2016 12:32:20 PM

[av_section min_height='' min_height_px='500px' padding='default' shadow='no-shadow' bottom_border='no-border-styling' scroll_down='' id='' color='main_color' custom_bg='' src='' attach='scroll' position='top left' repeat='no-repeat' video='' video_ratio='16:9' video_mobile_disabled='' overlay_enable='' overlay_opacity='0.5' overlay_color='' overlay_pattern='' overlay_custom_pattern='']

[av_heading heading='Predicting the Value of Your House' tag='h1' style='blockquote modern-quote modern-centered' size='40' subheading_active='subheading_below' subheading_size='25' padding='30' color='custom-color-heading' custom_font='#0a0a0a']
A Step-By-Step Tutorial Using Azure ML
[/av_heading]

[/av_section][av_section color='main_color' custom_bg='#ffffff' src='' attachment='' attachment_size='' attach='scroll' position='top left' repeat='no-repeat' video='' video_ratio='16:9' min_height='' min_height_px='500px' padding='default' shadow='no-shadow' bottom_border='no-border-styling' id='']
[av_three_fifth first min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_heading tag='h3' padding='0' heading='Overview' color='' style='blockquote modern-quote modern-centered' custom_font='#ffffff' size='40' subheading_active='' subheading_size='15' custom_class=''][/av_heading]

[av_textblock size='' font_color='' color='']

Learn how companies like Zillow might predict the value of your home. In this tutorial, you will build a model that predicts the sale price of a house from historical features of the house and of the sales transaction.

[/av_textblock]

[av_heading tag='h3' padding='0' heading='About the Data' color='' style='blockquote modern-quote modern-centered' custom_font='#ffffff' size='40' subheading_active='' subheading_size='15' custom_class=''][/av_heading]

[av_textblock size='' font_color='' color='']

The Ames housing dataset includes 81 features and 1460 observations. Each observation represents the sale of a home, and each feature is an attribute describing the house or the circumstances of the sale. Click here for the full list of feature descriptions.

[/av_textblock]

[/av_three_fifth][av_two_fifth min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_image src='http://datasciencedojo.com/wp-content/uploads/2016/09/Steves-House-01.png' attachment='27952' attachment_size='full' align='center' styling='' hover='' link='' target='' caption='' font_size='' appearance='' overlay_opacity='0.4' overlay_color='#000000' overlay_text_color='#ffffff' animation='no-animation'][/av_image]

[av_hr class='default' height='50' shadow='no-shadow' position='center' custom_border='av-border-thin' custom_width='50px' custom_border_color='' custom_margin_top='30px' custom_margin_bottom='30px' icon_select='yes' custom_icon_color='' icon='ue808']


[/av_two_fifth]
[/av_section]

[av_section color='main_color' custom_bg='#ffffff' src='' attachment='' attachment_size='' attach='scroll' position='top left' repeat='no-repeat' video='' video_ratio='16:9' min_height='' min_height_px='500px' padding='default' shadow='no-shadow' bottom_border='no-border-styling' id='']
[av_two_fifth first min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_image src='http://datasciencedojo.com/wp-content/uploads/2016/09/regression.jpg' attachment='27967' attachment_size='full' align='center' styling='' hover='' link='manually,https://gallery.cortanaintelligence.com/Experiment/Building-a-Regression-Model-to-Predict-Real-Estate-Sales-Price-1#' target='_blank' caption='' font_size='' appearance='' overlay_opacity='0.4' overlay_color='#000000' overlay_text_color='#ffffff' animation='no-animation'][/av_image]

[/av_two_fifth][av_three_fifth min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_heading tag='h3' padding='0' heading='Follow Along, Clone this Experiment' color='' style='blockquote modern-quote modern-centered' custom_font='#ffffff' size='40' subheading_active='' subheading_size='15' custom_class=''][/av_heading]

[av_textblock size='' font_color='' color='']

A full copy of this experiment has been posted to the Cortana Intelligence Gallery. Follow the link and click "Open in Studio".

[/av_textblock]

[av_button label='Clone this Experiment' link='manually,https://gallery.cortanaintelligence.com/Experiment/Building-a-Regression-Model-to-Predict-Real-Estate-Sales-Price-1#' link_target='_blank' size='large' position='center' icon_select='yes' icon='ue857' font='entypo-fontello' color='theme-color' custom_bg='#444444' custom_font='#ffffff']

[/av_three_fifth]
[/av_section]

[av_section min_height='' min_height_px='500px' padding='default' shadow='no-shadow' bottom_border='no-border-styling' id='' color='main_color' custom_bg='#ffffff' src='' attachment='' attachment_size='' attach='scroll' position='top left' repeat='no-repeat' video='' video_ratio='16:9']
[av_heading tag='h3' padding='10' heading='Preprocessing & Data Exploration' color='' style='blockquote modern-quote modern-centered' custom_font='#ffffff' size='40' subheading_active='' subheading_size='15' custom_class=''][/av_heading]

[av_hr class='invisible' height='50' shadow='no-shadow' position='center' custom_border='av-border-thin' custom_width='50px' custom_border_color='' custom_margin_top='30px' custom_margin_bottom='30px' icon_select='yes' custom_icon_color='' icon='ue808' font='entypo-fontello']

[av_one_third first min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_image src='http://datasciencedojo.com/wp-content/uploads/2016/09/exclude-low-quality-features.png' attachment='27964' attachment_size='full' align='center' styling='' hover='' link='lightbox' target='' caption='' font_size='' appearance='' overlay_opacity='0.4' overlay_color='#000000' overlay_text_color='#ffffff' animation='no-animation'][/av_image]

[/av_one_third][av_two_third min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_icon_box position='left' boxed='' icon='ue847' font='entypo-fontello' title='Drop Low Value Columns' link='' linktarget='' linkelement='' font_color='' custom_title='' custom_content='' color='' custom_bg='' custom_font='' custom_border='']
Begin by identifying features (columns) that add little to no value for predictive modeling. These columns will be dropped using the "Select Columns in Dataset" module.

The following columns were excluded from the dataset:

Id, Street, Alley, PoolQC, Utilities, Condition2, RoofMatl, MiscVal, PoolArea, 3SsnPorch, LowQualFinSF, MiscFeature, LandSlope, Functional, BsmtHalfBath, ScreenPorch, BsmtFinSF2, EnclosedPorch.

These low-quality features were removed to improve the model's performance. Low quality here means a lack of representative categories, too many missing values, or noise.
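
Outside of Azure ML Studio, the same exclusion step could be sketched in plain Python. This is a minimal illustration, not the module's implementation: it assumes each observation is a dictionary keyed by column name, and the two sample rows are made up for the example.

```python
# Columns the tutorial excludes (copied from the list above).
LOW_VALUE_COLUMNS = {
    "Id", "Street", "Alley", "PoolQC", "Utilities", "Condition2",
    "RoofMatl", "MiscVal", "PoolArea", "3SsnPorch", "LowQualFinSF",
    "MiscFeature", "LandSlope", "Functional", "BsmtHalfBath",
    "ScreenPorch", "BsmtFinSF2", "EnclosedPorch",
}


def drop_columns(rows, excluded=LOW_VALUE_COLUMNS):
    """Return copies of the rows without the excluded columns."""
    return [
        {name: value for name, value in row.items() if name not in excluded}
        for row in rows
    ]


# Illustrative sample only; the real dataset has 1460 rows.
sample_rows = [
    {"Id": 1, "Street": "Pave", "LotArea": 8000, "SalePrice": 200000},
    {"Id": 2, "Street": "Pave", "LotArea": 9500, "SalePrice": 180000},
]
cleaned_rows = drop_columns(sample_rows)
print(cleaned_rows[0])
```

The input rows are left untouched, so the excluded columns remain available if a later feature-engineering step wants them back.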
[/av_icon_box]

[/av_two_third][av_hr class='short' height='50' shadow='no-shadow' position='center' custom_border='av-border-thin' custom_width='50px' custom_border_color='' custom_margin_top='30px' custom_margin_bottom='30px' icon_select='yes' custom_icon_color='' icon='ue808' font='entypo-fontello']

[av_two_third first min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_icon_box position='left' boxed='' icon='ue85c' font='entypo-fontello' title='Define Categorical Variables' link='' linktarget='' linkelement='' font_color='' custom_title='' custom_content='' color='' custom_bg='' custom_font='' custom_border='']
We must now define which values are non-continuous by casting them as categorical, because the mathematical treatment of continuous and non-continuous values differs greatly. Nominal categorical features were identified and cast to categorical data types using the Edit Metadata module to ensure proper mathematical treatment by the machine learning algorithm.

The first Edit Metadata module casts all string columns. The column "MSSubClass" uses integer codes to represent the type of building, so it should be treated as a categorical feature rather than a continuous numeric value. We use a second Edit Metadata module to cast it to a category.
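
To make the idea concrete, here is a small sketch in plain Python rather than Azure ML. It assumes rows are dictionaries, treats any all-string column as categorical, and converts the integer codes in "MSSubClass" to string labels so they are no longer read as quantities. The helper names and sample values are illustrative.

```python
def cast_categorical(rows, numeric_coded=("MSSubClass",)):
    """Turn numeric category codes into string labels (copies the rows)."""
    result = []
    for row in rows:
        new_row = dict(row)
        for column in numeric_coded:
            if new_row.get(column) is not None:
                new_row[column] = str(new_row[column])
        result.append(new_row)
    return result


def categorical_columns(rows):
    """List the columns whose non-missing values are all strings."""
    names = []
    for column in rows[0]:
        values = [row[column] for row in rows if row[column] is not None]
        if values and all(isinstance(value, str) for value in values):
            names.append(column)
    return names


# Illustrative sample rows.
sample_rows = [
    {"MSSubClass": 60, "MSZoning": "RL", "LotArea": 8000},
    {"MSSubClass": 20, "MSZoning": "RM", "LotArea": 9500},
]
print(categorical_columns(sample_rows))
print(categorical_columns(cast_categorical(sample_rows)))
```

Before the cast only "MSZoning" is detected as categorical; afterwards "MSSubClass" is as well, which mirrors the two-step casting described above.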
[/av_icon_box]

[/av_two_third][av_one_third min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_image src='http://datasciencedojo.com/wp-content/uploads/2016/09/meta-data-casting.png' attachment='27965' attachment_size='full' align='center' styling='' hover='' link='lightbox' target='' caption='' font_size='' appearance='' overlay_opacity='0.4' overlay_color='#000000' overlay_text_color='#ffffff' animation='no-animation'][/av_image]

[/av_one_third][av_hr class='short' height='50' shadow='no-shadow' position='center' custom_border='av-border-thin' custom_width='50px' custom_border_color='' custom_margin_top='30px' custom_margin_bottom='30px' icon_select='yes' custom_icon_color='' icon='ue808' font='entypo-fontello']

[av_one_third first min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_image src='http://datasciencedojo.com/wp-content/uploads/2016/09/clean-missing-data.png' attachment='27966' attachment_size='full' align='center' styling='' hover='' link='lightbox' target='' caption='' font_size='' appearance='' overlay_opacity='0.4' overlay_color='#000000' overlay_text_color='#ffffff' animation='no-animation'][/av_image]

[/av_one_third][av_two_third min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_icon_box position='left' boxed='' icon='ue8d5' font='entypo-fontello' title='Clean Missing Data' link='' linktarget='' linkelement='' font_color='' custom_title='' custom_content='' color='' custom_bg='' custom_font='' custom_border='']
Most algorithms cannot handle missing values, and those that can often treat them differently from one another. To address this, we must make sure our dataset contains no missing, “null”, or “NA” values.

Replacement of missing values is the most versatile and preferred method because it allows us to keep our data. It also minimizes collateral damage: dropping a row because of one bad cell would throw away the good values in its other columns. Numerical values can easily be replaced with a statistic such as the mean, median, or mode, while categories are commonly replaced with the mode or with a separate categorical value for unknowns.

For simplicity, all missing categorical values were replaced with the mode and all missing numeric values with the median. To further improve the model's performance, try a custom cleaning function for each individual feature rather than a blanket transformation of all columns.
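
The blanket rule (median for numeric columns, mode for categorical ones) can be sketched with the Python standard library. This is an illustration under two assumptions: missing cells are represented as `None`, and a column counts as categorical when its present values are all strings. The sample rows are made up.

```python
from statistics import median, mode


def fill_missing(rows):
    """Replace None with the column median (numeric) or mode (categorical)."""
    filled = [dict(row) for row in rows]
    for column in rows[0]:
        present = [row[column] for row in rows if row[column] is not None]
        if not present:
            continue  # nothing to derive a replacement from
        if all(isinstance(value, str) for value in present):
            replacement = mode(present)
        else:
            replacement = median(present)
        for row in filled:
            if row[column] is None:
                row[column] = replacement
    return filled


# Illustrative sample with one missing value per column.
sample_rows = [
    {"LotFrontage": 65, "GarageType": "Attchd"},
    {"LotFrontage": None, "GarageType": "Attchd"},
    {"LotFrontage": 80, "GarageType": None},
    {"LotFrontage": 60, "GarageType": "Detchd"},
]
filled_rows = fill_missing(sample_rows)
print(filled_rows[1]["LotFrontage"], filled_rows[2]["GarageType"])
```

A per-feature cleaning strategy, as suggested above, would replace the single `if`/`else` with a lookup from column name to cleaning function.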
[/av_icon_box]

[/av_two_third]
[/av_section]

[av_section color='main_color' custom_bg='#ffffff' src='' attachment='' attachment_size='' attach='scroll' position='top left' repeat='no-repeat' video='' video_ratio='16:9' min_height='' min_height_px='500px' padding='default' shadow='no-shadow' bottom_border='no-border-styling' id='']
[av_heading tag='h3' padding='10' heading='Model Building' color='' style='blockquote modern-quote modern-centered' custom_font='#ffffff' size='40' subheading_active='' subheading_size='15' custom_class=''][/av_heading]

[av_hr class='invisible' height='50' shadow='no-shadow' position='center' custom_border='av-border-thin' custom_width='50px' custom_border_color='' custom_margin_top='30px' custom_margin_bottom='30px' icon_select='yes' custom_icon_color='' icon='ue808' font='entypo-fontello']

[av_two_third first min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_icon_box position='left' boxed='' icon='ue812' font='entypo-fontello' title='Statistical Feature Selection' link='' linktarget='' linkelement='' font_color='' custom_title='' custom_content='' color='' custom_bg='' custom_font='' custom_border='']
Not every feature in its current form is expected to carry predictive value, and some may mislead the model or add noise. To filter these out, we compute the Pearson correlation of each feature against the response variable (sales price) as a quick measure of predictive strength. Only the top X strongest features are kept; the remaining features are left behind. X can be tuned for further gains in model performance.
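
As a rough sketch of this filter in plain Python (not the Azure ML module itself), the snippet below scores each numeric feature by the absolute value of its Pearson correlation with the target and keeps the top `k`. The column values are invented for illustration, and categorical features would need a numeric encoding before they could be scored this way.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return covariance / sqrt(var_x * var_y)


def top_k_features(columns, target, k):
    """Rank features by |correlation| with the target and keep the top k."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Illustrative values only.
columns = {
    "GrLivArea": [1000, 1500, 2000, 2500],
    "YrSold": [2008, 2007, 2008, 2007],
    "OverallQual": [5, 6, 8, 7],
}
sale_price = [100000, 150000, 200000, 250000]
print(top_k_features(columns, sale_price, k=2))
```

Using the absolute value keeps strongly negative correlations, which are just as informative to a linear model as strongly positive ones.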
[/av_icon_box]

[/av_two_third][av_one_third min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_image src='http://datasciencedojo.com/wp-content/uploads/2016/09/filter-based-feature-selection.png' attachment='27969' attachment_size='full' align='center' styling='' hover='' link='lightbox' target='' caption='' font_size='' appearance='' overlay_opacity='0.4' overlay_color='#000000' overlay_text_color='#ffffff' animation='no-animation'][/av_image]

[/av_one_third][av_hr class='invisible' height='50' shadow='no-shadow' position='center' custom_border='av-border-thin' custom_width='50px' custom_border_color='' custom_margin_top='30px' custom_margin_bottom='30px' icon_select='yes' custom_icon_color='' icon='ue808' font='entypo-fontello']

[av_one_third first min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_image src='http://datasciencedojo.com/wp-content/uploads/2016/09/Reguarlization.png' attachment='27973' attachment_size='full' align='center' styling='' hover='' link='lightbox' target='' caption='' font_size='' appearance='' overlay_opacity='0.4' overlay_color='#000000' overlay_text_color='#ffffff' animation='no-animation'][/av_image]

[/av_one_third][av_two_third min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_icon_box position='left' boxed='' icon='ue8c2' font='entypo-fontello' title='Select an Algorithm' link='' linktarget='' linkelement='' font_color='' custom_title='' custom_content='' color='' custom_bg='' custom_font='' custom_border='']
First, we must identify what kind of machine learning problem this is: classification, regression, clustering, etc. Since the response variable (sales price) is a continuous numeric value, this is a regression problem. We will use a linear regression model with regularization to reduce over-fitting.

  • To ensure stable convergence of the weights and bias, all features except the response variable must be normalized so that they fall in the same range.
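
The sketch below shows one way normalization and L2-regularized linear regression fit together, written in plain Python. It assumes min-max scaling and batch gradient descent, which may differ from what the Azure ML module does internally; the parameter names `l2_weight`, `learning_rate`, and `epochs` are chosen for this example, and the data is invented.

```python
def min_max_scale(column):
    """Rescale a numeric column to the range [0, 1]."""
    low, high = min(column), max(column)
    if high == low:
        return [0.0 for _ in column]
    return [(value - low) / (high - low) for value in column]


def train_ridge(X, y, l2_weight=0.01, learning_rate=0.1, epochs=2000):
    """Minimize mean squared error plus l2_weight * sum(w^2) by gradient descent."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        # Prediction errors under the current weights and bias.
        errors = [
            sum(w * x for w, x in zip(weights, row)) + bias - target
            for row, target in zip(X, y)
        ]
        for j in range(d):
            gradient = 2 / n * sum(e * row[j] for e, row in zip(errors, X))
            gradient += 2 * l2_weight * weights[j]  # L2 penalty (bias excluded)
            weights[j] -= learning_rate * gradient
        bias -= learning_rate * 2 / n * sum(errors)
    return weights, bias


def predict(X, weights, bias):
    return [sum(w * x for w, x in zip(weights, row)) + bias for row in X]


# Illustrative data: living area in square feet, price in dollars.
living_area = [1000, 1500, 2000, 2500]
y = [100000, 150000, 200000, 250000]
X = [[value] for value in min_max_scale(living_area)]  # response is not scaled
weights, bias = train_ridge(X, y)
print(predict(X, weights, bias))
```

Only the features are scaled; the response stays in dollars so that the error metrics remain interpretable. Setting `l2_weight=0.0` recovers ordinary least squares on this data.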

[/av_icon_box]

[av_image src='http://datasciencedojo.com/wp-content/uploads/2016/09/linear-regression-300x71.png' attachment='27974' attachment_size='medium' align='center' styling='' hover='' link='lightbox' target='' caption='' font_size='' appearance='' overlay_opacity='0.4' overlay_color='#000000' overlay_text_color='#ffffff' animation='no-animation'][/av_image]

[/av_two_third][av_hr class='short' height='50' shadow='no-shadow' position='center' custom_border='av-border-thin' custom_width='50px' custom_border_color='' custom_margin_top='30px' custom_margin_bottom='30px' icon_select='yes' custom_icon_color='' icon='ue808' font='entypo-fontello']

[av_icon_box position='left' boxed='' icon='ue8da' font='entypo-fontello' title='Model Training and Evaluation' link='' linktarget='' linkelement='' font_color='' custom_title='' custom_content='' color='' custom_bg='' custom_font='' custom_border='']
Cross validation will be used to evaluate the predictive performance of the model and the stability of that performance on new data. Ten-fold cross validation builds ten models with the same algorithm, each trained on a different subset of the dataset and evaluated on the held-out fold, with no observation held out more than once. The evaluation metrics of the ten models are averaged, and their standard deviation indicates the stability of the average performance.
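
A minimal sketch of k-fold cross validation in plain Python follows. To stay short it uses a one-feature least-squares line as a stand-in model, which is an assumption of this example rather than the model used in the experiment, and it runs on synthetic data generated with a fixed seed.

```python
import random
from math import sqrt
from statistics import mean, stdev


def k_fold_indices(n, k, seed=0):
    """Shuffle row indices and split them into k non-overlapping folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = mean(xs), mean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope /= sum((x - mean_x) ** 2 for x in xs)
    return slope, mean_y - slope * mean_x


def rmse(actual, predicted):
    return sqrt(mean([(a - p) ** 2 for a, p in zip(actual, predicted)]))


def cross_validate(xs, ys, k=10):
    """Return the mean and standard deviation of RMSE across k folds."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        predictions = [slope * xs[i] + intercept for i in fold]
        scores.append(rmse([ys[i] for i in fold], predictions))
    return mean(scores), stdev(scores)


# Synthetic data for illustration only.
rng = random.Random(42)
areas = [rng.uniform(800, 3000) for _ in range(100)]
prices = [100 * a + 50000 + rng.gauss(0, 20000) for a in areas]
mean_rmse, std_rmse = cross_validate(areas, prices, k=10)
print(f"mean RMSE: {mean_rmse:.0f}, standard deviation: {std_rmse:.0f}")
```

Each observation lands in exactly one held-out fold, so every row is used for validation once and for training nine times.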
[/av_icon_box]

[av_two_fifth first min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_image src='http://datasciencedojo.com/wp-content/uploads/2016/09/cross-validation.png' attachment='27976' attachment_size='full' align='center' styling='' hover='' link='lightbox' target='' caption='' font_size='' appearance='' overlay_opacity='0.4' overlay_color='#000000' overlay_text_color='#ffffff' animation='no-animation'][/av_image]

[/av_two_fifth][av_three_fifth min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_image src='http://datasciencedojo.com/wp-content/uploads/2016/09/Cross-validation.jpg' attachment='27977' attachment_size='full' align='center' styling='' hover='' link='lightbox' target='' caption='' font_size='' appearance='' overlay_opacity='0.4' overlay_color='#000000' overlay_text_color='#ffffff' animation='no-animation'][/av_image]

[/av_three_fifth][av_one_full first min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_icon_box position='left' boxed='' icon='ue8d8' font='entypo-fontello' title='Parameter Tuning' link='' linktarget='' linkelement='' font_color='' custom_title='' custom_content='' color='' custom_bg='' custom_font='' custom_border='']
This experiment will build a regression model that minimizes the mean RMSE of the cross validation results with the lowest variance possible, while keeping the bias-variance trade-off in mind.

  1. The first regression model was built using default parameters. It was very inaccurate ($124,942 mean RMSE) and very unstable ($11,699 standard deviation).
  2. The high bias and high variance of the first model suggest that it is over-fitting to the outliers and under-fitting the general population. Decreasing the L2 regularization weight lowers the penalty on large coefficients. After this change, the model is more accurate, with a mean cross validation RMSE of $42,366.
  3. The second model is still quite unstable, with a standard deviation of $8,121. Since this dataset has a small number of observations (1460), it may help to increase the number of training epochs so that the algorithm has more passes to reach convergence. This increases training time but also increases stability. The third model, trained with more epochs, achieved a better mean cross validation RMSE of $36,684 and a much more stable standard deviation of $3,849.
  4. The final model used a slightly higher learning rate, which improved both the mean cross validation RMSE and the standard deviation.
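
The comparison between runs can be written down as a small bookkeeping script. The numbers below are the ones reported in steps 1 to 3; the short labels are shorthand introduced for this sketch, and the final model is omitted because its figures are not stated.

```python
# (label, mean RMSE in dollars, standard deviation in dollars)
reported_runs = [
    ("default parameters", 124_942, 11_699),
    ("lower L2 weight", 42_366, 8_121),
    ("more training epochs", 36_684, 3_849),
]


def pick_best(runs):
    """Prefer the lowest mean RMSE; break ties with the lowest standard deviation."""
    return min(runs, key=lambda run: (run[1], run[2]))


def improvement(before, after):
    """Fractional reduction in mean RMSE from one run to the next."""
    return (before[1] - after[1]) / before[1]


best = pick_best(reported_runs)
print("best so far:", best[0])
for previous, current in zip(reported_runs, reported_runs[1:]):
    change = improvement(previous, current)
    print(f"{previous[0]} -> {current[0]}: mean RMSE down {change:.1%}")
```

Keeping every run in one table like this makes it easier to see whether a parameter change helped accuracy, stability, or both.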

[/av_icon_box]

[av_image src='http://datasciencedojo.com/wp-content/uploads/2016/09/parameter-tuning.png' attachment='27979' attachment_size='full' align='center' styling='' hover='' link='lightbox' target='' caption='' font_size='' appearance='' overlay_opacity='0.4' overlay_color='#000000' overlay_text_color='#ffffff' animation='no-animation'][/av_image]

[/av_one_full][av_hr class='short' height='50' shadow='no-shadow' position='center' custom_border='av-border-thin' custom_width='50px' custom_border_color='' custom_margin_top='30px' custom_margin_bottom='30px' icon_select='yes' custom_icon_color='' icon='ue808' font='entypo-fontello']

[av_two_third first min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_icon_box position='left' boxed='' icon='ue82b' font='entypo-fontello' title='Deployment' link='' linktarget='' linkelement='' font_color='' custom_title='' custom_content='' color='' custom_bg='' custom_font='' custom_border='']
The algorithm parameters that yielded the best results will be the ones that are shipped. The best configuration (the last one) will be retrained on 100% of the data, since each cross validation model leaves 10% out for validation.
[/av_icon_box]

[/av_two_third][av_one_third min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_image src='http://datasciencedojo.com/wp-content/uploads/2016/09/retrain.png' attachment='27980' attachment_size='full' align='center' styling='' hover='' link='lightbox' target='' caption='' font_size='' appearance='' overlay_opacity='0.4' overlay_color='#000000' overlay_text_color='#ffffff' animation='no-animation'][/av_image]

[/av_one_third]
[/av_section]

[av_section min_height='' min_height_px='500px' padding='default' shadow='no-shadow' bottom_border='no-border-styling' id='' color='main_color' custom_bg='' src='' attachment='' attachment_size='' attach='scroll' position='top left' repeat='no-repeat' video='' video_ratio='16:9' overlay_opacity='0.5' overlay_color='' overlay_pattern='' overlay_custom_pattern='']
[av_one_full first min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_icon_box position='left' boxed='' icon='ue8c9' font='entypo-fontello' title='Further Improve this Model' link='' linktarget='' linkelement='' font_color='' custom_title='' custom_content='' color='' custom_bg='' custom_font='' custom_border='']
Feature engineering was left out of this experiment entirely. Try engineering more features from the existing dataset to see if the model improves. Some columns that were originally dropped may become useful when combined with other features. For example, try bucketing the year in which the house was built by decade. Clustering the data may also yield some hidden insights.
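
The decade bucketing idea could look like the following in plain Python. It assumes the construction year is stored in a column named "YearBuilt"; the output column name "DecadeBuilt" is made up for this sketch, as are the sample rows.

```python
def decade_bucket(year):
    """Map a four-digit year to a decade label such as '1980s'."""
    return f"{year // 10 * 10}s"


def add_decade(rows, year_column="YearBuilt", new_column="DecadeBuilt"):
    """Return copies of the rows with a categorical decade column added."""
    return [{**row, new_column: decade_bucket(row[year_column])} for row in rows]


# Illustrative sample rows.
sample_rows = [{"YearBuilt": 1987}, {"YearBuilt": 2003}]
print(add_decade(sample_rows))
```

The result is a string label, so it would be treated as a categorical feature rather than a continuous one.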
[/av_icon_box]

[/av_one_full]
[/av_section]

[av_section color='main_color' custom_bg='#ffffff' src='' attachment='' attachment_size='' attach='scroll' position='top left' repeat='no-repeat' video='' video_ratio='16:9' min_height='' min_height_px='500px' padding='default' shadow='no-shadow' bottom_border='no-border-styling' id='']
[av_one_third first min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_heading tag='h3' padding='10' heading='Data Science and Data Engineering Bootcamp' color='' style='blockquote modern-quote modern-centered' custom_font='#ffffff' size='' subheading_active='' subheading_size='15' custom_class=''][/av_heading]

[av_textblock size='' font_color='' color='']

Like what you're learning in tutorials but want more? Register for our 5-Day Data Science and Data Engineering Bootcamp! Learn everything you need to know and participate in a Hack Day.

[/av_textblock]

[av_button label='Learn More' link='page,19727' link_target='' size='medium' position='center' icon_select='no' icon='ue8b9' font='entypo-fontello' color='theme-color' custom_bg='#444444' custom_font='#ffffff']

[/av_one_third][av_one_third min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_heading tag='h3' padding='10' heading='Working Demo: Titanic Survival Predictor Web App' color='' style='blockquote modern-quote modern-centered' custom_font='#ffffff' size='' subheading_active='' subheading_size='15' custom_class=''][/av_heading]

[av_textblock size='' font_color='' color='']

Deployed a model? What's next? Build a web app! We have a working web app demo set up here, where we wrapped a front-end UI around our Titanic predictive model.

[/av_textblock]

[av_button label='Titanic Demo' link='manually,http://demos.datasciencedojo.com/demo/titanic/' link_target='' size='medium' position='center' icon_select='no' icon='ue8b9' font='entypo-fontello' color='theme-color' custom_bg='#444444' custom_font='#ffffff']

[/av_one_third][av_one_third min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_heading tag='h3' padding='10' heading='Intro to Text Processing' color='' style='blockquote modern-quote modern-centered' custom_font='#ffffff' size='' subheading_active='' subheading_size='15' custom_class=''][/av_heading]

[av_textblock size='' font_color='' color='']

Interested in other tutorials from Data Science Dojo? Check out this tutorial mini series where we teach you how to deal with data that is textual and unstructured.

[/av_textblock]

[av_button label='View Tutorial' link='post,25634' link_target='' size='medium' position='center' icon_select='no' icon='ue8b9' font='entypo-fontello' color='theme-color' custom_bg='#444444' custom_font='#ffffff']

[/av_one_third]
[/av_section]

[av_section min_height='' min_height_px='500px' padding='default' shadow='no-shadow' bottom_border='no-border-styling' id='' color='main_color' custom_bg='' src='' attachment='' attachment_size='' attach='scroll' position='top left' repeat='no-repeat' video='' video_ratio='16:9' overlay_opacity='0.5' overlay_color='' overlay_pattern='' overlay_custom_pattern='']
[av_one_half first min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_image src='http://datasciencedojo.com/wp-content/uploads/2015/02/12-WEEKS.jpg' attachment='25362' attachment_size='full' align='center' styling='' hover='' link='page,19727' target='' caption='' font_size='' appearance='' overlay_opacity='0.4' overlay_color='#000000' overlay_text_color='#ffffff' animation='no-animation'][/av_image]

[/av_one_half][av_one_half min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_hr class='invisible' height='100' shadow='no-shadow' position='center' custom_border='av-border-thin' custom_width='50px' custom_border_color='' custom_margin_top='30px' custom_margin_bottom='30px' icon_select='yes' custom_icon_color='' icon='ue808' font='entypo-fontello']

[av_textblock size='' font_color='' color='']

Want to take your data science skills to the next level? Check out our 5-day Data Science & Data Engineering bootcamp.

[/av_textblock]

[av_hr class='invisible' height='25' shadow='no-shadow' position='center' custom_border='av-border-thin' custom_width='50px' custom_border_color='' custom_margin_top='30px' custom_margin_bottom='30px' icon_select='yes' custom_icon_color='' icon='ue808' font='entypo-fontello']

[av_button label='Learn More' link='page,19727' link_target='' size='large' position='center' icon_select='no' icon='ue8b9' font='entypo-fontello' color='theme-color' custom_bg='#444444' custom_font='#ffffff']

[/av_one_half]
[/av_section]

[av_section min_height='' min_height_px='500px' padding='default' shadow='no-shadow' bottom_border='no-border-styling' id='' color='main_color' custom_bg='' src='' attach='scroll' position='top left' repeat='no-repeat' video='' video_ratio='16:9' video_mobile_disabled='']
[av_one_full first]

[av_heading tag='h2' padding='20' heading='Discussion' color='' style='blockquote modern-quote modern-centered' custom_font='' size='' subheading_active='' subheading_size='12' custom_class=''][/av_heading]

[av_comments_list]

[/av_one_full]
[/av_section]

Topics: Business, Data Science & Engineering, Predictive Modeling