Data Science

A kid’s guide to data science – clustering

Bedtimes are the Worst

Having a bedtime is tough when you’re a kid. I get it. When the YouTube video you’re watching gets cut short because you have to get ready for bed, it’s the worst.

But like most kids, occasionally you get to stay up past your bedtime. Have you ever wondered why adults let kids stay up past their bedtimes? Think about it. Every so often, you get to stay up late.

Why? What’s different about those days? Maybe you should keep track of your bedtimes to see if you can discover the magic formula to staying up late!

Tracking Your Bedtimes

Tracking your bedtimes is easy. Let’s make a list of things that might help you solve this puzzle.

  • The date. It’s good to know which days you go to bed on time versus days you stay up late.
  • Your age. As you get older, you’re allowed to stay up later. That’s a good thing!
  • Is it a school night? You almost never stay up late on a school night.
  • Are you home? 
  • Is it a sleepover? Sleepovers are the best!
  • Are you sick?
  • Your bedtime. (Boo, Hiss!)
  • The actual time you went to bed that day. 

Look at this in action.

  • January 14, 2016 (The date)
  • 12 (Your age)
  • Yes (Is it a school night?)
  • Yes (Are you home?)
  • No (Is it a sleepover?)
  • No (Are you sick?)
  • 8:30pm (Your bedtime.)
  • 8:30pm (Actual time you went to bed.)

Great! You’ve got your first bedtime tracked! You’re well on your way.
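
If you want to keep your tracking on a computer, here is a small Python sketch of that same day stored as one record. The class name `BedtimeRecord` and all of its field names are made up for this example; use whatever names you like.

```python
from dataclasses import dataclass
from datetime import date, time


@dataclass
class BedtimeRecord:
    """One tracked day. Every name here is made up for illustration."""
    day: date
    age: int
    school_night: bool
    at_home: bool
    sleepover: bool
    sick: bool
    bedtime: time          # when you were supposed to go to bed
    actual_bedtime: time   # when you really went to bed

    def stayed_up_late(self) -> bool:
        # Late means the actual time is after the official bedtime.
        # This simple check assumes you are in bed before midnight.
        return self.actual_bedtime > self.bedtime


# The example day from the list above.
first_night = BedtimeRecord(
    day=date(2016, 1, 14), age=12,
    school_night=True, at_home=True, sleepover=False, sick=False,
    bedtime=time(20, 30), actual_bedtime=time(20, 30),
)
print(first_night.stayed_up_late())
```

On January 14 you went to bed right on time, so `stayed_up_late()` says no.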

Super Bedtime Tracking

Now let’s change the way we’re tracking your bedtimes to make it easier to track lots of days.

Let’s take a look at what 3 days looks like.

[Image: table showing three days of tracked bedtimes]


Now imagine tracking bedtimes for a whole year—all 365 days. That’s a lot of data!

Clustering the Data

If you had a year’s worth of bedtimes tracked, you could begin to look for patterns. Are certain days better than others for staying up late? Clustering, aka grouping, the data allows you to observe meaningful patterns. Do you see any interesting patterns in the table below? It looks different from the table above since we consolidated the data into 4 groups.

[Image: table showing the bedtime data consolidated into 4 groups]

Group 4 is a winner! Nothing exciting here though—you already know that you get to stay up late during weekend sleepovers.


Groups 1 and 3 are pretty boring. No staying up late on school nights. Boo!

Group 2 is interesting. Friday is not a school night, but being sick means going to bed early.
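
Here is one way the grouping idea could look in Python. This is only a sketch: the nights in the list are made-up sample data, not the real table, and the names `nights`, `groups`, and `average_late` are invented for the example. It simply groups nights that share the same answers. Real clustering tools can find groups on their own, but the goal is the same: put similar days together and compare them.

```python
from collections import defaultdict

# Made-up sample nights:
# (school_night, sleepover, sick, minutes_past_bedtime)
nights = [
    (True,  False, False, 0),
    (True,  False, False, 0),
    (False, False, True,  0),
    (False, True,  False, 90),
    (False, True,  False, 120),
]

# Put nights with the same yes/no answers into the same group.
groups = defaultdict(list)
for school_night, sleepover, sick, minutes_late in nights:
    groups[(school_night, sleepover, sick)].append(minutes_late)

# For each group, work out the average minutes past bedtime.
average_late = {
    answers: sum(minutes) / len(minutes)
    for answers, minutes in groups.items()
}

for answers, average in average_late.items():
    print(answers, average)
```

In this sample, the sleepover group has by far the biggest average, which matches what the groups above show.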

What did we learn?

If you want to stay up late more often…

  1. Have more sleepovers!
  2. Don’t get sick! The easiest way to avoid colds is to wash your hands!

Big Data

How Big is Your Data?

Big Data architecture can add layers of complexity to an IT environment.  Not sure if you’re dealing with Big Data?  Use Gartner’s 3Vs as a litmus test: Volume, Velocity, and Variety.

The 3Vs are a reasonable test of whether you should add Big Data to your architecture.

  1. Volume. The amount of data.  How much data are you processing? Some smartwatch manufacturers store every user interaction. For these companies, this might be hundreds of terabytes or petabytes of data.
  2. Velocity. The speed of data. How fast is your data moving in and out of the system? Some data is created and captured in real time at the point of the transaction.
  3. Variety. The assortment of data. Do you have different types of data? Data can consist of text, images, audio, video and the supporting metadata.

Having only 2 of the 3 Vs may mean that you can avoid adding Big Data to your architecture and skip the added complexity.
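
As a rough illustration, the litmus test can be written as a tiny Python function. The function name `looks_like_big_data` and its parameters are hypothetical, and deciding whether each V counts as "high" for your environment is still a judgment call that the code does not make for you.

```python
def looks_like_big_data(high_volume: bool, high_velocity: bool,
                        high_variety: bool) -> bool:
    """Return True only when all three Vs apply.

    Hypothetical helper: each argument is your own yes/no judgment
    about Volume, Velocity, and Variety.
    """
    return high_volume and high_velocity and high_variety


# Two of the three Vs: you may be able to skip the added complexity.
print(looks_like_big_data(True, True, False))
# All three Vs: Big Data is worth considering.
print(looks_like_big_data(True, True, True))
```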

Tag Clouds

Twitter Tag Clouds – Visualizing Popular Hashtags

A tag cloud is a visual representation of text data and is typically made up of single-word tags. The frequency of each tag is usually represented by size or color.

I created the following tag clouds using Twitter’s API and two KNIME workflows. Twitter’s API returned 1379 tweets by searching the #browns hashtag. The Browns are currently in the news for firing both their head coach and GM, so I thought the hashtag would make a good candidate for tag clouds.


First Tag Cloud

Tag cloud number 1 is based on common keyword tags found in all 1379 tweets. I stripped usernames and URLs from the tweets before processing them. I then used KNIME’s POS Tagger node to assign parts of speech to each term. The resulting tag cloud highlights nouns in brown, verbs in orange, and adjectives in black. Larger words appear more often in the tweets that were analyzed.
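
The clouds themselves were built in KNIME, but the cleanup and counting steps can be sketched in plain Python. This is an illustrative stand-in, not the KNIME workflow: the function name `tag_frequencies`, the regular expressions, and the sample tweets are all assumptions, and it skips part-of-speech tagging entirely.

```python
import re
from collections import Counter


def tag_frequencies(tweets):
    """Count how often each word appears after stripping URLs and usernames."""
    counts = Counter()
    for tweet in tweets:
        text = re.sub(r"https?://\S+", " ", tweet)  # strip URLs
        text = re.sub(r"@\w+", " ", text)           # strip usernames
        # Keep lowercase words; the '#' of a hashtag is dropped.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(words)
    return counts


# Invented sample tweets, not real data from the #browns search.
sample_tweets = [
    "@fan1 The coach is out #browns http://example.com/a",
    "New coach soon? #browns",
]
freq = tag_frequencies(sample_tweets)
print(freq.most_common(2))
```

A tag cloud renderer would then scale each word's font size by its count in `freq`.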


Second Tag Cloud

Tag cloud number 2 is based on the same tweets and keyword tags. For this tag cloud, I used KNIME’s Named Entity Tagger node to tag terms as either organizations, locations, or people. The resulting tag cloud highlights people in brown, organizations in orange, and locations in black. Terms in green could not be identified by the tagger. As with the cloud above, the larger the font, the higher the tag frequency.


Interested in creating your own tag clouds? I’ll have instructions posted soon. Until then, feel free to leave a comment with your Twitter username and the hashtag you’d like analyzed. I’ll tag you with the results.

Big Data, Data Science

Using Prescriptive Analytics to Make Better Decisions

Business Analytics is broken down into three distinct phases.

  1. Descriptive – What happened? This phase involves traditional BI tools to help organizations process and report on historical data. Trends are analyzed and decisions are made. The majority of management reporting uses this approach.
  2. Predictive – What will happen? This phase uses machine learning algorithms to build models from historical data and then uses those same models to predict a future outcome or its likelihood.
  3. Prescriptive – What action should be taken? This phase prescribes actions to achieve the best possible outcome based on the predictions made. Actions that lead to the highest chance of success are prescribed.

Prescriptive Analytics predicts and compares the likely outcomes of any number of actions, and then chooses the very best action to help advance an organization’s objectives.
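
At its simplest, the prescriptive step is a comparison of predicted outcomes. The Python sketch below assumes a predictive model has already produced a chance of success for each candidate action; the function name `prescribe`, the action names, and the probabilities are all hypothetical.

```python
def prescribe(predicted_success):
    """Return the action with the highest predicted chance of success."""
    return max(predicted_success, key=predicted_success.get)


# Hypothetical predictions from an earlier predictive phase.
predicted_success = {
    "send reminder": 0.40,
    "offer discount": 0.65,
    "do nothing": 0.10,
}
best = prescribe(predicted_success)
print(best)
```

Real prescriptive systems usually weigh costs and constraints as well, so treat this as the core idea rather than a complete approach.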

Consider the implications for the healthcare industry. Healthcare predictions are most useful when each predicted outcome comes with a prescribed clinical action.

Similar insights can help organizations improve decision making and have more control of business outcomes. Prescriptive analytics is an important next step on the path to insight-based actions and recommendations.

Data Science

6 steps to data mining awesomeness

Have a data mining project on the horizon?  These 6 steps make up the Cross Industry Standard Process for Data Mining (CRISP-DM) and will help make it awesome!

  1. Gain an understanding of the business problem you are trying to solve. Are the business requirements well defined?
  2. Get to know the data. What data is available? Is it complete? What data is needed?  Now is also a good time to identify any data quality problems. 
  3. Prepare the data. Data is rarely clean or in the right format for your modeling tools. This step can be time consuming.   
  4. Create your model(s). Pick your modeling tool and build your model, for example with linear regression, classification, or clustering. Several techniques can be used to solve the same data mining problem. Now might also be a good time to revisit Step 3 if the data isn’t quite right.
  5. Evaluate your results.  Are the results meaningful? Do they solve the problem you identified in Step 1?  Ultimately, a decision on the use of the results should be made.
  6. Deploy your model!  How should the model be deployed? What steps should be taken to maximize the benefit of the model and results?

That’s it!

Do you use a different process? I’d love to hear about it. Please leave a comment.