Breathing Structure to the Unstructured

Written by Phuc Duong | Mar 23, 2016 12:28:30 AM

[av_four_fifth first min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']
[av_heading tag='h2' padding='20' heading='Breathing Structure to the Unstructured' color='' style='blockquote modern-quote' custom_font='' size='' subheading_active='subheading_below' subheading_size='15' custom_class='']
Text Analytics for Machine Learning: Part 1
[/av_heading]
[/av_four_fifth]

[av_one_fifth min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']
[av_codeblock wrapper_element='' wrapper_element_attributes='']


[/av_codeblock]
[/av_one_fifth]

[av_two_third first min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']
[av_textblock size='' font_color='' color='']
Have you ever wondered how Siri can understand English? How you can type a question into Google and get what you want?

Over the next week, we will release a five-part blog series on text analytics that will give you a glimpse into the complexities and importance of text analytics and natural language processing.

This first section discusses how text is converted to numerical data.
[/av_textblock]
[/av_two_third]

[av_one_third min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']
[av_image src='http://datasciencedojo.com/wp-content/uploads/2016/03/definitions-300x203.jpg' attachment='25646' attachment_size='medium' align='center' styling='' hover='' link='' target='' caption='' font_size='' appearance='' overlay_opacity='0.4' overlay_color='#000000' overlay_text_color='#ffffff' animation='no-animation'][/av_image]
[/av_one_third]

[av_hr class='short' height='50' shadow='no-shadow' position='center' custom_border='av-border-thin' custom_width='50px' custom_border_color='' custom_margin_top='30px' custom_margin_bottom='30px' icon_select='yes' custom_icon_color='' icon='ue808' font='entypo-fontello']

[av_heading tag='h3' padding='10' heading='Make Words Usable for Machine Learning' color='' style='blockquote modern-quote' custom_font='' size='' subheading_active='' subheading_size='15' custom_class=''][/av_heading]

[av_textblock size='' font_color='' color='']
In the past, we have talked about how to build machine learning models on structured datasets. However, life does not always give us data that is clean and structured. Much of the information generated by humans has little or no formal structure: emails, tweets, blogs, reviews, status updates, surveys, legal documents, and so much more. There is a wealth of knowledge stored in these kinds of documents that data scientists and analysts want access to. "Text analytics" is the process of extracting that useful information from text.

All of these written texts are unstructured, yet machine learning algorithms and techniques work best (and often work only) on structured data. So before our machine learning models can operate on these documents, we must convert the unstructured text into a structured matrix. Usually this is done by transforming each document into a sparse matrix (a really big but mostly empty table). Each word gets its own column in the dataset, which tracks either whether a word appears in the text (binary) or how often it appears (term frequency). For example, consider the two statements below. They have been transformed into a simple term-frequency matrix: each word gets a distinct column, and its frequency of occurrence is tracked. If this were a binary matrix, there would be only ones and zeros instead of counts.
[/av_textblock]

[av_table purpose='tabular' pricing_table_design='avia_pricing_default' pricing_hidden_cells='' caption='' responsive_styling='avia_responsive_table']
[av_row row_style='avia-heading-row'][av_cell col_style='']Document[/av_cell][av_cell col_style='']twinkle[/av_cell][av_cell col_style='']little[/av_cell][av_cell col_style='']star[/av_cell][av_cell col_style='']all[/av_cell][av_cell col_style='']the[/av_cell][av_cell col_style='']night[/av_cell][/av_row]
[av_row row_style=''][av_cell col_style='']Twinkle, twinkle, little star.[/av_cell][av_cell col_style='']2[/av_cell][av_cell col_style='']1[/av_cell][av_cell col_style='']1[/av_cell][av_cell col_style=''][/av_cell][av_cell col_style=''][/av_cell][av_cell col_style=''][/av_cell][/av_row]
[av_row row_style=''][av_cell col_style='']Twinkle, twinkle, all the night.[/av_cell][av_cell col_style='']2[/av_cell][av_cell col_style=''][/av_cell][av_cell col_style=''][/av_cell][av_cell col_style='']1[/av_cell][av_cell col_style='']1[/av_cell][av_cell col_style='']1[/av_cell][/av_row]
[/av_table]
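[av_textblock size='' font_color='' color='']
The table above can be built in a few lines of plain Python. This is only a sketch, not the code behind the example: the function name and the simple letters-only tokenizer are our own choices.
[/av_textblock]

[av_codeblock wrapper_element='' wrapper_element_attributes='']
```python
import re
from collections import Counter

def term_frequency_matrix(docs):
    """One row per document, one column per distinct word (sorted)."""
    # Lowercase and keep only runs of letters, so "Twinkle," -> "twinkle".
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    # Counter returns 0 for words a document does not contain.
    rows = [[Counter(doc)[w] for w in vocab] for doc in tokenized]
    return vocab, rows

docs = ["Twinkle, twinkle, little star.",
        "Twinkle, twinkle, all the night."]
vocab, rows = term_frequency_matrix(docs)
# vocab: ['all', 'little', 'night', 'star', 'the', 'twinkle']
# rows:  [[0, 1, 0, 1, 0, 2], [1, 0, 1, 0, 1, 2]]
```
[/av_codeblock]

[av_textblock size='' font_color='' color='']
For a binary matrix, replace each count with 1 if it is nonzero and 0 otherwise.
[/av_textblock]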

[av_one_fourth first min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']
[av_image src='http://datasciencedojo.com/wp-content/uploads/2016/03/Cluster-2-300x201.png' attachment='25658' attachment_size='medium' align='center' styling='' hover='' link='' target='' caption='' font_size='' appearance='' overlay_opacity='0.4' overlay_color='#000000' overlay_text_color='#ffffff' animation='no-animation'][/av_image]
[/av_one_fourth]

[av_three_fourth min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']
[av_textblock size='' font_color='' color='']
Why do we want numbers instead of text? Most machine learning algorithms and data analysis techniques assume numerical data (or data that can be ranked or categorized). Similarity between documents is then calculated by measuring the distance between their word-frequency vectors. For example, if the word "team" appears 4 times in one document and 5 times in a second document, those two will be judged more similar than a third document where the word "team" appears only once.
[/av_textblock]
[/av_three_fourth]
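[av_textblock size='' font_color='' color='']
The "team" comparison works like this in practice. A minimal sketch using Euclidean distance (one of several distance measures one could choose); the one-column vectors below are hypothetical, matching the counts in the example.
[/av_textblock]

[av_codeblock wrapper_element='' wrapper_element_attributes='']
```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical single-term vectors: how often "team" appears in each document.
doc_a, doc_b, doc_c = [4], [5], [1]

euclidean(doc_a, doc_b)  # 1.0 -> a and b are close (similar)
euclidean(doc_a, doc_c)  # 3.0 -> a and c are farther apart
```
[/av_codeblock]

[av_textblock size='' font_color='' color='']
Real documents have one dimension per vocabulary word, but the same distance formula applies: smaller distance means more similar documents.
[/av_textblock]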

[av_hr class='short' height='50' shadow='no-shadow' position='center' custom_border='av-border-thin' custom_width='50px' custom_border_color='' custom_margin_top='30px' custom_margin_bottom='30px' icon_select='yes' custom_icon_color='' icon='ue808' font='entypo-fontello']

[av_heading tag='h3' padding='10' heading='Build a Matrix' color='' style='blockquote modern-quote' custom_font='' size='' subheading_active='' subheading_size='15' custom_class=''][/av_heading]

[av_textblock size='' font_color='' color='']
While our example was simple (only six words), term-frequency matrices on larger datasets can be tricky.

Imagine turning every word in the Oxford English Dictionary into a matrix: that’s 171,476 columns. Now imagine adding everyone’s names and every corporation, product, or street name that ever existed. Now feed it slang. Feed it every rap song. Feed it fantasy novels like Lord of the Rings or Harry Potter, so that our model knows what to do when it encounters “The Shire” or “Hogwarts”. Good; now that’s just English. Do the same thing again for Russian, Mandarin, and every other language.

After this is accomplished, we are approaching a matrix with several billion columns, and two problems arise. First, it becomes computationally infeasible and memory intensive to perform calculations over this matrix. Second, the curse of dimensionality kicks in: distance measurements become so absurdly large in scale that they all seem the same. Much of the research and time that goes into natural language processing is less about the syntax of language (important as that is) and more about how to reduce the size of this matrix.
[/av_textblock]
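[av_textblock size='' font_color='' color='']
The "mostly empty table" observation is also what makes the memory problem manageable: instead of storing billions of zeros per row, a sparse representation stores only the words that actually occur. A toy sketch of the idea (the function name and tokenizer are our own):
[/av_textblock]

[av_codeblock wrapper_element='' wrapper_element_attributes='']
```python
import re
from collections import Counter

def sparse_row(doc):
    """Store only the nonzero counts -- the sparse-matrix idea in miniature."""
    return dict(Counter(re.findall(r"[a-z]+", doc.lower())))

sparse_row("Twinkle, twinkle, little star.")
# {'twinkle': 2, 'little': 1, 'star': 1}
```
[/av_codeblock]

[av_textblock size='' font_color='' color='']
A document touching 100 of a billion possible words stores 100 entries, not a billion. This helps with storage, but not with the curse of dimensionality, which is why shrinking the vocabulary itself matters so much.
[/av_textblock]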

[av_textblock size='' font_color='' color='']
Now we know what we must do, and the challenges we must face to reach our desired result. The next three blogs in the series will address these problems directly, introducing three concepts: conforming, stemming, and stop-word removal.
[/av_textblock]

[av_hr class='short' height='50' shadow='no-shadow' position='center' custom_border='av-border-thin' custom_width='50px' custom_border_color='' custom_margin_top='30px' custom_margin_bottom='30px' icon_select='yes' custom_icon_color='' icon='ue808' font='entypo-fontello']

[av_textblock size='18' font_color='' color='']

Want to learn more about text analytics? Check out the short video on our curriculum page, or watch our video on tweet sentiment analysis.

Part two is available here.

[/av_textblock]

[av_one_half first min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']
[av_social_share title='Share this entry' style='' buttons='' share_facebook='' share_twitter='' share_pinterest='' share_gplus='' share_reddit='' share_linkedin='' share_tumblr='' share_vk='' share_mail=''][/av_social_share]
[/av_one_half]

[av_hr class='invisible' height='100' shadow='no-shadow' position='center' custom_border='av-border-thin' custom_width='50px' custom_border_color='' custom_margin_top='30px' custom_margin_bottom='30px' icon_select='yes' custom_icon_color='' icon='ue808' font='entypo-fontello']

[av_comments_list]