A simple introduction to Data Science
When you hear the term data scientist, what do you think of? If you’re like most people, you might think of something incredibly complex, full of statistical terms and programming languages that are beyond comprehension. You might think that only people with a Ph.D. in computer science can become data scientists.
But if you peel back the layers, you’ll find that this isn’t the case. Data science, a term popularized in part by DJ Patil, who is now the Chief Data Scientist at the White House, is just a 21st-century spin on the mathematics that people have been doing for centuries. Big data, data science, and analytics are just fancy terms for using the information available to gain insight and improve a business. Whether it’s a small Excel spreadsheet or 100 million records in a database, the goal is always the same: to find value.
You too can start down the path of data science, and learn a lot along the way. Let’s demonstrate with a simple example.
Step 1: Have a question or something you’re curious about.
In this spring’s NBA Playoffs, Steph Curry and the Golden State Warriors were down 2 games to 1 against the Memphis Grizzlies, and Curry’s three-point shooting had dipped in the previous two games, both of which the Warriors lost. Commentators were speculating: had the Grizzlies figured out Steph Curry? Could he bounce back and guide the Warriors to victory?
Step 2: Gather data that exists for your area of interest.
We can use easily available data from basketball-reference.com in this situation. I simply took Steph Curry’s game log for the 2014-15 regular season and created a .CSV file (uploaded here if you want to download it). Here’s what the data looked like:
Step 3: Analyze your data, using whichever software and method you prefer.
Data science can range from making simple bar graphs in Excel to running multivariable logistic regression on Hadoop. In this case, I’ll do some straightforward analysis on the data in R, which is free to download here.
For this analysis, I looked at his three-point percentage for all 82 regular-season games and identified the games in which he shot 20 percent or less. I then took the game that followed each of those low-shooting games and averaged his three-point percentage across those follow-up games.
Here’s the script I used to import the data, identify the relevant games, and do the calculation. This may seem difficult, but with resources that teach each concept (importing data, loops, arrays, etc.), you could learn to do this within a few days.
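The script itself is not reproduced in the text, so what follows is only a minimal sketch of the steps described above, written in Python rather than R. The column name `3P%`, the function names, and the sample rows are all assumptions made for illustration; the sample rows are made up and are not Curry’s actual game log.

```python
import csv
import io

LOW_THRESHOLD = 0.20  # "20 percent or less", as described in the text


def load_three_point_pcts(csv_file, column="3P%"):
    """Read per-game three-point percentages (as fractions) in game order.

    Rows with a blank value (for example, a game with no attempts) are
    skipped, so "the following game" means the next game that has a value.
    The column name is an assumption about how the CSV is laid out.
    """
    pcts = []
    for row in csv.DictReader(csv_file):
        value = (row.get(column) or "").strip()
        if value:
            pcts.append(float(value))
    return pcts


def average_after_low_games(pcts, threshold=LOW_THRESHOLD):
    """Average the percentage in each game that directly follows a low game."""
    follow_ups = [
        current
        for previous, current in zip(pcts, pcts[1:])
        if previous <= threshold
    ]
    if not follow_ups:
        return None  # no low-shooting game was followed by another game
    return sum(follow_ups) / len(follow_ups)


# Made-up sample rows for illustration only; not real game data.
SAMPLE_CSV = """G,3P,3PA,3P%
1,4,8,.500
2,1,6,.167
3,3,6,.500
4,2,10,.200
5,3,10,.300
6,5,10,.500
"""

sample_pcts = load_three_point_pcts(io.StringIO(SAMPLE_CSV))
sample_average = average_after_low_games(sample_pcts)
print(f"Average 3P% after low-shooting games: {sample_average:.1%}")
```

To try it on a real file, you would pass an opened CSV to `load_three_point_pcts` in place of the `io.StringIO` sample, adjusting the column name to match your data.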
Step 4: Look at your analysis, interpret, and apply what you learned.
Based on the analysis, in games following his low-shooting games, Curry shot an average of 42.4 percent on three-pointers. From this, you could predict that Curry would regress toward his mean and return to his All-Star form. In other words, history suggested he would bounce back and find his usually superb shooting stroke, with each additional three-pointer boosting the Warriors’ scoring and helping lead them to victory.
We all know what happened next.
To conclude, even if R or this type of analysis isn’t your cup of tea, I encourage you to gather some data and see what you learn. You can start with something very simple in Excel, then work your way up to more complex tools. Go forth, and what you find may surprise you.