Thursday 26 December 2013

Tools for Data Scraping and Visualization

Over the last few weeks I co-taught a short course on data scraping and data presentation.  It was a pleasure to get a chance to teach with Ethan Zuckerman (my boss) and interact with the creative group of students! You can peruse the syllabus outline if you like.

In my Data Therapy work I don’t usually introduce tools, because there are loads of YouTube tutorials and written tutorials.  However, while co-teaching a short-course for incoming students in the Comparative Media Studies program here at MIT, I led two short “lab” sessions on tools for data scraping, interrogation, and visualization.

There are a myriad of tools that support these efforts, so I was forced to pick just a handful to introduce to these students.  I want to share the short lists of tools I chose.

Data Scraping:

As much as possible, avoid writing code!  Many of these tools can help you avoid writing software to do the scraping.  There are constantly new tools being built, but I recommend these:

   1.Copy/Paste: Never forget the awesome power of copy/paste! There are many times when an hour of copying and pasting will be faster than learning any sort of new tool!
 
   2.Import.io: Still nascent, but this is a radical re-thinking of how you scrape.  Point and click to train their scraper.  It’s very early, and buggy, but on many simple webpages it works well!
 
   3.Regular Expressions: Install a text editor like Sublime Text and you get the power of regular expressions (which I call “Super Find and Replace”).  They let you define a pattern and find every match in any large document.  Sure, the pattern definition is cryptic, but learning it is totally worth it (here’s an online playground).
 
   4.jQuery in the browser: Install the bookmarklet, and you can add the jQuery JavaScript library to any webpage you are viewing.  From there you can use a basic understanding of JavaScript and the JavaScript console (in most browsers) to pull parts of a webpage into an array.
 
   5.ScraperWiki: There are a few things this makes really easy – getting recent tweets, getting Twitter followers, and a few others.  Beyond those, it is a good environment for writing your own scraper code.
 
   6.Software Development: If you are a coder, and the website you need to scrape has JavaScript and logins and such, then you might need to go this route (ugh).  If so, here’s a functioning example of a scraper built in Python (with Beautiful Soup and Mechanize).  I would use Watir if you want to do this in Ruby.
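
To make the “Super Find and Replace” idea from item 3 concrete, here is a minimal sketch in Python rather than in a text editor.  The sample text, the two patterns, and the `find_all` helper are all made up for illustration, and the email pattern is deliberately simple, so it will miss some valid addresses.

```python
import re

# Hypothetical sample text; in practice this would be pasted from a web page.
text = """
Contact Ana at ana@example.org or call 617-555-0101.
Bo's address is bo@example.com; phone 617-555-0199.
"""

# Illustrative patterns only (assumptions, not from the original post).
# Email: some name characters, an @, then a dotted domain.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Phone: three digits, three digits, four digits, separated by hyphens.
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def find_all(pattern, document):
    """Return every non-overlapping match of pattern in document."""
    return pattern.findall(document)


emails = find_all(EMAIL_PATTERN, text)
phones = find_all(PHONE_PATTERN, text)
print(emails)
print(phones)
```

The same patterns can be typed into the find box of an editor that supports regular expressions, which is usually the faster route for a one-off job.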

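If you do end up writing code (item 6), the core of a scraper is usually walking the HTML and collecting the pieces you want into a list.  The sketch below is not the example scraper mentioned in item 6; it swaps Beautiful Soup and Mechanize for Python’s built-in `html.parser` so it runs without installing anything, and it parses a made-up HTML snippet rather than fetching a real page.  `TableScraper`, `scrape_table`, and `SAMPLE_HTML` are hypothetical names.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up page fragment standing in for a downloaded web page.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Springfield</td><td>30,000</td></tr>
  <tr><td>Shelbyville</td><td>25,500</td></tr>
</table>
"""


def scrape_table(html):
    """Parse an HTML string and return its table rows as lists of strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = scrape_table(SAMPLE_HTML)
for row in rows:
    print(row)
```

Pages that need logins or run JavaScript take more machinery than this, which is exactly why the point-and-click tools above are worth trying first.
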
Data Interrogation and Visualization:

There are even more tools that help you here.  I picked a handful of single-purpose tools, and some generic ones to share.

   1.Tabula: There are a few PDF-cleaning tools, but this one has worked particularly well for me.  If your data is in a PDF, and selectable, then I recommend this! (disclosure: the Knight Foundation funds much of my paycheck, and contributed to Tabula’s development as well)

   2.OpenRefine: This data cleaning tool lets you do things like cluster rows in your data that are spelled similarly, look for correlations at a high level, and more!  The School of Data has written well about this – read their OpenRefine handbook.

   3.Wordle: As maligned as word clouds have been, I still believe in their role as a proxy for deep text analysis.  They give a nice visual representation of how frequently words appear in quotes, writing, etc.

   4.Quartz ChartBuilder: If you need to make clean and simple charts, this is the tool for you. Much nicer than the output of Excel.

   5.TimelineJS: Need an online timeline?  This is an awesome tool. Disclosure: another Knight-funded project.

   6.Google Fusion Tables: This tool has empowered loads of folks to create maps online.  I’m not a big user, but lots of folks recommend it to me.

   7.TileMill: Google Maps isn’t the only way to make a map.  TileMill lets you create beautiful interactive maps that fit your needs. Disclosure: another Knight-funded project.

   8.Tableau Public: Tableau is a much nicer way to explore your data than Excel pivot tables.  You can drag and drop columns onto a grid and it suggests visualizations that might be revealing in your attempts to find stories.
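
The counting behind a word cloud (item 3) is simple enough to sketch in a few lines of Python.  This is not Wordle’s actual code; the stop-word list, the sample sentence, and the `word_frequencies` helper are assumptions made up for illustration.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real tools use much longer ones.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def word_frequencies(text, top_n=5):
    """Return the top_n most frequent non-stop words as (word, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return counts.most_common(top_n)


sample = "The data tells a story, and the story is in the data. Find the story."
top_words = word_frequencies(sample, top_n=2)
print(top_words)
```

A word cloud then draws each word at a size based on its count, so the numbers here are the raw material for the picture.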

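To give a feel for how clustering similarly spelled rows (item 2) can work, here is a simplified sketch loosely modeled on the “fingerprint” key-collision idea: normalize each value, then group the values that normalize to the same key.  It is not OpenRefine’s implementation, and the `fingerprint` and `cluster` helpers and the sample names are hypothetical.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Normalize a value: trim, lowercase, drop punctuation, sort unique words."""
    no_punctuation = str.maketrans("", "", string.punctuation)
    cleaned = value.strip().lower().translate(no_punctuation)
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values sharing a fingerprint; keep groups with differing spellings."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Made-up column of messy names.
names = [
    "Knight Foundation",
    "knight foundation ",
    "Foundation, Knight",
    "MIT Media Lab",
    "MIT",
]
clusters = cluster(names)
print(clusters)
```

This only catches differences in case, spacing, punctuation, and word order.  Catching true misspellings takes fuzzier matching, which is where a dedicated tool earns its keep.
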
I hope those are helpful in your data scraping and story-finding adventures!

Source: http://datatherapy.wordpress.com/2013/10/24/tools-for-data-scraping-and-presentation/
