Module 2 What’s data science?

2.1 Before class #1

Required external reading for this module: What’s data science? (4,660 words, approx. 20 minutes of reading time)

Watch the YouTube video Angry Hiring Manager Panel from 10:18 to 16:48 (6.5 minutes) and list the skills they mention as important to have in a data science position.

Fill out Survey 1 (10 min)

2.2 What’s data science?

Data science is among the fields in highest demand, and that demand is projected to keep growing over the next decade (Kross et al. 2020; Hadavand, Gooding, and Leek 2018). Interestingly, the data scientist title was invented in 2008, and the median base salary for a data scientist surpassed $100,000 in the United States in 2019 (Robinson and Nolis 2020).

CHALLENGE

Based on your own experience and on your reading for this module, in your groups discuss the following question:

  • What is data science?

2.3 What does a data scientist do?

Data science is an interdisciplinary field, so data science jobs call for a broad range of skills, from statistics to communication. A quick search of data science job postings reveals a long list of such skills. However, no single data scientist has every skill listed across different data science jobs. Instead, each data scientist specializes in a subset of these skills (Robinson and Nolis 2020).

CHALLENGE

Make a list of skills listed on data science job announcements and in the video you just watched. Based on these, discuss the following questions in your group:

  1. Which skills do you already have? At what level of proficiency?

  2. Which skills are you interested in developing further?

  3. Based on the skills you already have, and the skills you want to acquire, what type of job in data science would you be interested in?

2.4 Data Science Workflow

The basic data science workflow involves five main parts:

  1. The Question: Form the question you want to answer. Many times you will be given a question, and you have to “translate” it so you can answer it with your data analysis.

  2. Data Acquisition: get the data from a data file, a database, or a web API

  3. Data Wrangling: import + tidy data + transform (Grolemund and Wickham 2018)

  4. Data Exploration: transform + visualize + model + repeat (Grolemund and Wickham 2018)

  5. Results Communication: visualize + write + knit (Grolemund and Wickham 2018)

Figure 2.1: typical data science project model (Grolemund and Wickham, 2018)
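The five steps above can be sketched end to end on a tiny data set. The example below is only an illustration, written in Python rather than the R used in the readings. The question ("How does mean tuition compare between Nevada and Florida?"), the inline CSV, and the function names wrangle, explore, and communicate are all assumptions made for this sketch. The numbers are four rows taken from the tuition example in section 2.6.

```python
import csv
import io
from collections import defaultdict

# Step 1, The Question (hypothetical): how does mean tuition compare
# between Nevada and Florida?

# Step 2, Data Acquisition: in practice a data file, database, or web API.
# Here an inline CSV stands in for the data source.
RAW_CSV = """State,School Year,Average Tuition
Nevada,2004-05,3621.392
Nevada,2005-06,3687.290
Florida,2004-05,3848.201
Florida,2005-06,3924.234
"""


def wrangle(raw_text):
    """Step 3, Data Wrangling: import the text and store tuition as a number."""
    rows = list(csv.DictReader(io.StringIO(raw_text)))
    for row in rows:
        row["Average Tuition"] = float(row["Average Tuition"])
    return rows


def explore(rows):
    """Step 4, Data Exploration: compute the mean tuition for each state."""
    by_state = defaultdict(list)
    for row in rows:
        by_state[row["State"]].append(row["Average Tuition"])
    return {state: sum(vals) / len(vals) for state, vals in by_state.items()}


def communicate(summary):
    """Step 5, Results Communication: turn the summary into readable lines."""
    return [
        f"{state}: mean tuition ${mean:,.2f}"
        for state, mean in sorted(summary.items())
    ]


rows = wrangle(RAW_CSV)
summary = explore(rows)
report = communicate(summary)
for line in report:
    print(line)
```

Each function maps to one workflow step, so the order in which the steps run is visible in the last few lines. A real project would loop back through wrangling and exploration several times before anything is communicated.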

CHALLENGE

In your groups, based on your own intuition and experience, and based on the Introduction to R for Data Science book (Grolemund and Wickham 2018), summarize what each of the following steps means:

  1. Tidy

  2. Transform

  3. Visualize

  4. Model

  5. Communicate

We will cover the data science workflow in this course, though not necessarily every step listed and not in this order. We start with steps 3 (Data Wrangling) and 4 (Data Exploration) before we address step 2 (Data Acquisition) and step 5 (Results Communication).

CHALLENGE

Go back to the list of skills and job positions we discussed (based on the reading and the video):

  1. Which steps in the data science workflow correspond to the job skills we talked about?

2.5 Before class #2

Please fill out Survey 1 (10 min).

Reading: Data Science examples (1,033 words, 8 min)

Reading: Data Intake (1,686 words, 12 min)

2.6 What’s data?

CHALLENGE

In your small group, discuss the examples provided in the excerpt from “Executive Data Science” (Caffo, Peng, and Leek 2016).

  1. Is data science about “data”? Why or why not?

  2. Why did Netflix end up not implementing the best solution from the Netflix prize challenge?

  3. What data was used in each of the examples provided in the reading?

  4. What is data? (come up with a definition).

 

Examples of what data might look like.

  • Structured data (rare):
State   | School Year | Average Tuition
Nevada  | 2004-05     | 3621.392
Nevada  | 2005-06     | 3687.290
Florida | 2004-05     | 3848.201
Florida | 2007-08     | 3879.416
Florida | 2006-07     | 3887.656
Florida | 2005-06     | 3924.234
Wyoming | 2008-09     | 3928.671
Wyoming | 2007-08     | 4071.898
Wyoming | 2004-05     | 4086.351
Wyoming | 2006-07     | 4122.205

CHALLENGE

Which of the columns (or variables) in the data frame above are categorical, and which are quantitative?

 

  • Structured, but messy data (more common):
State       |   2004-05 |   2005-06 |   2006-07 |   2007-08 |   2008-09 |   2009-10 |   2010-11 |   2011-12 |   2012-13 |   2013-14 |   2014-15 |   2015-16
Alabama     |  5682.838 |  5840.550 |  5753.496 |  6008.169 |  6475.092 |  7188.954 |  8071.134 |  8451.902 |  9098.069 |  9358.929 |  9496.084 |  9751.101
Alaska      |  4328.281 |  4632.623 |  4918.501 |  5069.822 |  5075.482 |  5454.607 |  5759.153 |  5762.421 |  6026.143 |  6012.445 |  6148.808 |  6571.340
Arizona     |  5138.495 |  5415.516 |  5481.419 |  5681.638 |  6058.464 |  7263.204 |  8839.605 |  9966.716 | 10133.503 | 10296.200 | 10413.844 | 10646.278
Arkansas    |  5772.302 |  6082.379 |  6231.977 |  6414.900 |  6416.503 |  6627.092 |  6900.912 |  7028.991 |  7286.580 |  7408.495 |  7606.410 |  7867.297
California  |  5285.921 |  5527.881 |  5334.826 |  5672.472 |  5897.888 |  7258.771 |  8193.739 |  9436.426 |  9360.574 |  9274.193 |  9186.824 |  9269.844
Colorado    |  4703.777 |  5406.967 |  5596.348 |  6227.002 |  6284.137 |  6948.473 |  7748.201 |  8315.632 |  8792.856 |  9292.954 |  9298.599 |  9748.188
Connecticut |  7983.695 |  8249.074 |  8367.549 |  8677.702 |  8720.976 |  9371.019 |  9827.013 |  9736.431 | 10036.627 | 10453.110 | 10663.995 | 11397.337
Delaware    |  8352.890 |  8610.597 |  8681.846 |  8945.801 |  8995.473 |  9987.183 | 10534.181 | 11026.241 | 11362.690 | 11502.524 | 11514.660 | 11676.216
Florida     |  3848.201 |  3924.234 |  3887.656 |  3879.416 |  4150.004 |  4783.032 |  5510.659 |  5940.945 |  6494.901 |  6451.664 |  6345.000 |  6360.159
Georgia     |  4298.040 |  4492.167 |  4584.268 |  4790.266 |  4831.365 |  5549.913 |  6428.007 |  7709.284 |  7853.257 |  7992.390 |  8063.014 |  8446.961
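One reason the wide table above counts as messy is that the school year, which is a variable, is stored in the column names instead of in a column of its own. The sketch below shows one way to reshape a few of its cells into the one-row-per-observation layout of the first example. It is written in plain Python as an illustration only; the helper wide_to_long and its parameter names are hypothetical, loosely modeled on the names_to and values_to arguments of tidyr's pivot_longer in R.

```python
# Two year columns for three states, copied from the wide table above.
wide_rows = [
    {"State": "Alabama", "2004-05": 5682.838, "2005-06": 5840.550},
    {"State": "Alaska", "2004-05": 4328.281, "2005-06": 4632.623},
    {"State": "Florida", "2004-05": 3848.201, "2005-06": 3924.234},
]


def wide_to_long(rows, id_column, names_to, values_to):
    """Turn every (state, year-column) cell into a row of its own.

    Hypothetical helper: id_column is kept as is, each remaining column
    name goes into names_to and its cell value into values_to.
    """
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            long_rows.append(
                {id_column: row[id_column], names_to: column, values_to: value}
            )
    return long_rows


long_rows = wide_to_long(wide_rows, "State", "School Year", "Average Tuition")
for row in long_rows:
    print(row)
```

After reshaping, each row holds one state, one school year, and one tuition value, so grouping or filtering by year no longer requires knowing the column names in advance.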

user_id | screen_name | text | reply_to_screen_name
6.331283e+07 | blagogirl | (???) The illiterate calling Iran out? 80 million bounty on Trumps head? | realTuckFrumper
6.331283e+07 | blagogirl | (???) Iran does NOT fear Trump. They realize what OUR country is dealing with. “The White House is inflicted with mental retardation” | JonHutson
1.125104e+18 | dl_kirkwood | I’m afraid 11 soldiers had to be shipped out from the Iran hit after all with traumatic brain injuries. Seems the Military does not notify homeland unless a soldiers is shipped out for the injury. So, Trump did not know for a week. https://t.co/HdBNbKClBl | NA
2.820552e+07 | djbarro | (???) Are you going to carry a sign supporting the women in Iran brave enough to remove their hijabs and go to prison? | GloriaAllred
1.506314e+08 | kizu91 | US…Special…Representative…Hold…Press…Briefing…Situation in…Iran…Video…first…week…January…saw…drastic…spike…tensions…Washington…Tehran…President…Donald Trump…order…assassination…elite…Quds…Force…commander…Qasem…Soleimani…Iraq | NA
1.506314e+08 | kizu91 | crash…land…collide…plane…aircraft…all…176…people…on board…Iran…missile…attack…US…base…Iraq…rocket…Western…Sahara…Suriname…Colombia…Dominica…Australia…Anguilla…Guadeloupe…Uruguay…Cyprus…Namibia…Brazil…Paraguay…Denmark…55 | NA
1.506314e+08 | kizu91 | Iran…MP…Urge…Gov’t…Expel…UK…Envoy…Consider…Downgrading…Diplomatic…Ties…Alleged…Meddling…envoy…Robert Macaire…detained…days…ago…alleged…participation…unsanctioned…protest…Tehran…down…Ukraine…Boeing…737…release…15…minutes | NA
1.506314e+08 | kizu91 | Government…Supporter…Gather…Tehran…13….Friday…Prayer…Video…Iran…gather…rally…commemorate…kill…fatal…crash…land…collide…Ukraine…Boeing…plane…aircraft…shot…down…missile…rocket…January…Imam…Khomeini…International…Airport…16 | NA
1.506314e+08 | kizu91 | British…Treasury…Expand…Hezbollah…Asset…Freeze…UK…government…approved…measure…follow…heat…conflict…United States…Islamic…Republic…Iran…Trump…Administration…target…assassination…high-profile…military…general…early…January…film | NA
7.297365e+17 | SwmpladySH | Hackers Are Coming for the 2020 Election — And We’re Not Ready https://t.co/q82kNu9gMd via (???) | NA

  • Textual Data (always messy):
##  [1] "CHAPTER I"                                                               
##  [2] ""                                                                        
##  [3] ""                                                                        
##  [4] "Emma Woodhouse, handsome, clever, and rich, with a comfortable home"     
##  [5] "and happy disposition, seemed to unite some of the best blessings of"    
##  [6] "existence; and had lived nearly twenty-one years in the world with very" 
##  [7] "little to distress or vex her."                                          
##  [8] ""                                                                        
##  [9] "She was the youngest of the two daughters of a most affectionate,"       
## [10] "indulgent father; and had, in consequence of her sister's marriage, been"
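Text like the lines above usually has to be broken into smaller units before it can be counted or modeled. The sketch below is one minimal way to do that, assuming the text arrives as a list of strings, one per line. It is written in Python as an illustration only, and the function name tokenize and the word pattern are choices made for this sketch, not a standard.

```python
import re
from collections import Counter

# The first seven lines of the text shown above, one string per line.
lines = [
    "CHAPTER I",
    "",
    "",
    "Emma Woodhouse, handsome, clever, and rich, with a comfortable home",
    "and happy disposition, seemed to unite some of the best blessings of",
    "existence; and had lived nearly twenty-one years in the world with very",
    "little to distress or vex her.",
]


def tokenize(text_lines):
    """Lowercase each line and extract words.

    A word here is a run of letters, optionally joined by inner hyphens
    or apostrophes (so "twenty-one" stays one token). Punctuation and
    empty lines are dropped.
    """
    words = []
    for line in text_lines:
        words.extend(re.findall(r"[a-z]+(?:[-'][a-z]+)*", line.lower()))
    return words


words = tokenize(lines)
counts = Counter(words)
print(counts.most_common(3))
```

The result is a structured table of words and counts, which is the kind of categorical and quantitative data shown in the first example. Decisions such as lowercasing or keeping hyphenated words together change the counts, which is part of why textual data is described as always messy.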

CHALLENGE

What data formats are out there in the world? Create a list based on your experience and the excerpt from “Modern Data Science with R” (Baumer, Kaplan, and Horton 2017).

2.7 What does data analysis look like?

The way you communicate your data analysis will depend on what question you’re trying to answer and who your audience is. Here are some of my favorite data analysis reports:

References

Baumer, Benjamin S, Daniel T Kaplan, and Nicholas J Horton. 2017. Modern Data Science with R. CRC Press.

Caffo, Brian, Roger D Peng, and Jeffrey T Leek. 2016. Executive Data Science: A Guide to Training and Managing the Best Data Scientists. Leanpub.

Grolemund, Garrett, and Hadley Wickham. 2018. R for Data Science. O’Reilly.

Hadavand, Aboozar, Ira Gooding, and Jeffrey T Leek. 2018. “Can MOOC Programs Improve Student Employment Prospects?” Available at SSRN 3260695.

Kross, Sean, Roger D Peng, Brian S Caffo, Ira Gooding, and Jeffrey T Leek. 2020. “The Democratization of Data Science Education.” The American Statistician 74 (1): 1–7.

Robinson, Emily, and Jacqueline Nolis. 2020. Build a Career in Data Science. Manning.