One Man, One World (ஒரு மனிதன், ஒரு உலகம் )

Coimbatore Weather and Questioning Amma!

A week ago, Amma was telling me that the weather was getting hot in Coimbatore. I told her it was going to get worse in the next two months. She shot back that March is the hottest month, while April and May are less hot in Coimbatore. Growing up in India, you are taught that your mother knows best and that she is (almost) always right. Well, I could not resist the thought of putting the thesis to the test, and the internet came to my help. So here goes the Perl and R code to get the temperature data and explore it. The metric of choice will (arbitrarily) be the average temperature. In the R box plots in this post, the big black dot marks the middle (median) reading of each box, which I use as a quick visual proxy for the average. The month with the highest average temperature will be adjudged the hottest month. There are a lot of metrics one could use, but this is the simplest and the most intuitive.
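
The metric is easy to state in R. Here is a minimal sketch, assuming a data frame with `month` and `TemperatureC` columns like the one built in step (3). The readings below are invented numbers for illustration only, not wunderground data.

```r
# Toy readings (invented values, for illustration only)
toy <- data.frame(
  month        = c("Mar", "Mar", "Apr", "Apr", "May", "May"),
  TemperatureC = c(30.0, 32.0, 33.0, 31.0, 29.0, 30.0)
)

# Average temperature for each month
monthly_mean <- tapply(toy$TemperatureC, toy$month, mean)

# The month with the highest average is adjudged the hottest
hottest <- names(which.max(monthly_mean))

print(monthly_mean)
print(hottest)
```

The same two lines should work on the real data once `toy` is swapped for the data frame read from cbe.csv.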

(1) Perl code to scrape the temperature data from the web. The wunderground website serves the data in csv format. I did not add checks like limiting the days in June to 30; we will fix that in the cleanup script, which builds a single csv file with data from the years 2005 to 2008. You need the CPAN package LWP. You should be able to google how to install it.

Save this file as [dir]/src/get_temperature_files.pl
To run
% cd [dir]/src
% perl get_temperature_files.pl
You will have the data in the directory [dir]/data
The data is for four years from 2005 to 2008.

#------------------------------------------------------------------
use warnings;
use strict;
#------------------------------------------------------------------
use LWP::UserAgent;
#------------------------------------------------------------------
@ARGV == 0 or die "Sorry. The correct usage is:\
       perl get_temperature_files.pl\n";
#------------------------------------------------------------------

my $datadir = "../data/";
mkdir($datadir, 0755) unless -d $datadir;
# VOCB - Coimbatore
my $base_url =  "http://www.wunderground.com/history/airport/VOCB/";
my $suffix = "/DailyHistory.html?format=1";

for (my $year = 2005; $year < 2009; ++$year) {
  for (my $month = 1; $month <= 12; ++$month) {
    for (my $day = 1; $day <= 31; ++$day) {
      my $webfile = $year."/".$month."/".$day;
      print "Getting: $webfile\n";

      my $url = $base_url.$webfile.$suffix;
      my $webPage = getWebPage($url);

      my $outfile = $year."_".$month."_".$day.".csv";
      $outfile = $datadir.$outfile;
      open(my $out_fh, ">", $outfile)
        or die "Cannot write $outfile: $!";
      print $out_fh $webPage;
      close($out_fh);

# let us be patient and decent
      sleep(1);
    }
  }
}

#------------------------------------------------------------------
# subroutines
#------------------------------------------------------------------
sub getWebPage {
  my ($url) = @_;
  my $browser = LWP::UserAgent->new();
  my $response = $browser->get($url);

# error checks: stop if the request itself failed
  die "Could not fetch $url -- ", $response->status_line()
    unless $response->is_success();

  my $webPage = $response->content();
  return($webPage);
}
#------------------------------------------------------------------
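
The script above requests 31 days for every month and leaves the invalid dates to the cleanup step. As an aside, here is a hedged sketch in R of how the list of URLs could be built from real calendar dates only. It reuses the base URL and suffix from the Perl script; the variable names are my own, and it only builds the URLs without fetching anything.

```r
# Every calendar day from 2005 to 2008; seq() never produces an invalid date
dates <- seq(as.Date("2005-01-01"), as.Date("2008-12-31"), by = "day")

# Same URL pieces as in the Perl script
base_url <- "http://www.wunderground.com/history/airport/VOCB/"
suffix   <- "/DailyHistory.html?format=1"

# year/month/day without zero padding, the way the Perl script builds them
ymd <- paste(as.integer(format(dates, "%Y")),
             as.integer(format(dates, "%m")),
             as.integer(format(dates, "%d")),
             sep = "/")
urls <- paste0(base_url, ymd, suffix)

print(length(urls))   # number of days to fetch
print(head(urls, 2))
```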

(2) Clean up the data and construct the data as a single file.
Save this file as [dir]/src/build_csv.pl
To run
% cd [dir]/src
% perl build_csv.pl ../data/ > cbe.csv

#------------------------------------------------------------------
use warnings;
use strict;
#------------------------------------------------------------------
@ARGV == 1 or die "Sorry. The correct usage is:\
       perl build_csv.pl dir_containing_csv_files\
       Example:\
       perl build_csv.pl ../data/\n";
#------------------------------------------------------------------
my $datadir = $ARGV[0];
# make sure exactly one / is present after $datadir
$datadir =~ s/[\/]+$//;
$datadir .= "/";

# days in a month
my @days_in_month  = (31,28,31,30,31,30,31,31,30,31,30,31);
my @month_names = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec");

opendir(DIR, $datadir) or die "Cannot open directory $datadir: $!";
my @files = grep { /\.csv$/ } readdir(DIR);
closedir(DIR);

my $header_flag = 0;
foreach (@files) {
  my $file = $_;

# get the time information
# which we need to add to the csv file
  my ($time, $suffix) = split(/\./, $file);
  my ($year, $month, $day) = split(/\_/, $time);

# handle leap year (full Gregorian rule)
  if ((0 == $year % 4 && 0 != $year % 100) || 0 == $year % 400) {
    $days_in_month[1] = 29;
  } else {
    $days_in_month[1] = 28;
  }

  if ($day <= $days_in_month[$month-1]) {
# read the raw csv file and clean it up
    my $csvfile = $datadir.$file;
    open(TIMEFILE, "<", $csvfile) or die "Cannot read $csvfile: $!";

    while (<TIMEFILE>) {
# remove everything between and including < >
      s/\<.*\>//;
# skip blank lines
      next if /^(\s)*$/;

# print the header once, the first time we see it,
# and skip it in every later file
      if(!$header_flag) {
        print "year, month, day, $_";
        $header_flag = 1;
        next;
      } else {
        next if /^[a-z]+.*$/i;
      }

      chomp;
      print "$year, $month_names[$month-1], $day, $_\n";
    }

    close(TIMEFILE);
  }
}
#------------------------------------------------------------------
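
The day filter in the cleanup script depends on the number of days in each month, including the leap-year rule. Here is a minimal R sketch of the same check, written with the full Gregorian rule. The function names `is_leap_year`, `days_in_month` and `is_valid_day` are hypothetical, chosen just for this illustration.

```r
# TRUE for leap years under the Gregorian rule
is_leap_year <- function(year) {
  (year %% 4 == 0 & year %% 100 != 0) | (year %% 400 == 0)
}

# Number of days in a given month (1 to 12) of a given year
days_in_month <- function(year, month) {
  days <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
  ifelse(month == 2 & is_leap_year(year), 29, days[month])
}

# TRUE when year/month/day names a real calendar day
is_valid_day <- function(year, month, day) {
  day >= 1 & day <= days_in_month(year, month)
}

print(is_valid_day(2008, 2, 29))  # a leap day
print(is_valid_day(2006, 6, 31))  # June has only 30 days
```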

(3) Explore the data using R

library(lattice)
# read the raw data
filename <- "cbe.csv";
x <- read.csv(file = filename, header = TRUE, as.is = TRUE);

# factor hack to get the plots in xyplot() in correct order
# (levels must be unique, otherwise factor() stops with an error)
x$month <- factor(trimws(x$month), levels = month.abb)
x$year <- factor(x$year, levels = sort(unique(x$year)))
x$TimeIST <- factor(x$TimeIST, levels = unique(x$TimeIST))
x$TemperatureC <- (x$TemperatureF - 32)*(5.0/9.0);

Now to answer the question at the top of the post.

hmonth <- bwplot(TemperatureC ~ month, data=x, ylab = "Temperature (C)");
plot(hmonth);

temp-month

It looks as if Amma’s thesis is probably rejected! Four years’ worth of data shows that the highest average temperature is in April. Let us split by year and see if there is a year in which March’s average temperature was the highest. It also looks like we need to do a little cleanup of the data: there is a zero in October. Definitely not possible in Coimbatore!

hyear <- bwplot(TemperatureC ~ month | year, data=x, ylab = "Temperature (C)");
plot(hyear);

temp-month-year-split

Well, it looks like March’s average temperature is the highest in at least 2005 and 2007, although April matches March in both those years. I am trying to find something to salvage for my Amma!
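
Reading the hottest month off a box plot is a judgment call, so the per-year comparison can also be done numerically. Here is a sketch, assuming `year`, `month` and `TemperatureC` columns as in the data frame `x`. The values below are invented for illustration and do not reproduce the Coimbatore data.

```r
# Toy readings (invented values, for illustration only)
toy <- data.frame(
  year         = c(2005, 2005, 2005, 2005, 2006, 2006, 2006, 2006),
  month        = c("Mar", "Mar", "Apr", "Apr", "Mar", "Mar", "Apr", "Apr"),
  TemperatureC = c(31, 33, 30, 32, 29, 31, 32, 34)
)

# Mean temperature for each (year, month) pair:
# rows are years, columns are months
means <- tapply(toy$TemperatureC, list(toy$year, toy$month), mean)

# For each year, the month whose mean temperature is the highest
hottest_by_year <- apply(means, 1, function(row) names(which.max(row)))

print(means)
print(hottest_by_year)
```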

Here is another plot, which splits by the hour of the day.

hhour <- bwplot(TemperatureC ~ TimeIST, data=x, ylab = "Temperature (C)");
plot(hhour);

temp-hour
The one surprising thing was that the lowest temperatures occur between 2:30 AM and 5:30 AM, not around midnight. The highest temperatures are around 2:30 PM, not noon. The zero is definitely an error, since it shows up at 11:30 AM.
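
The bad zero can be dropped before the hourly averages are compared. Here is a minimal sketch, assuming `TimeIST` and `TemperatureC` columns as in `x`. The cutoff `min_plausible_c` is a hypothetical threshold I picked for illustration, and the readings are invented.

```r
# Toy readings (invented values); the 0 at 11:30 AM mimics the bad reading
toy <- data.frame(
  TimeIST      = c("2:30 AM", "5:30 AM", "11:30 AM", "11:30 AM", "2:30 PM", "2:30 PM"),
  TemperatureC = c(22, 21, 30, 0, 34, 33)
)

# Hypothetical plausibility cutoff: anything at or below 5 C is treated as an error
min_plausible_c <- 5
clean <- toy[toy$TemperatureC > min_plausible_c, ]

# Average temperature for each time of day, then the coolest and warmest slots
hourly_mean <- tapply(clean$TemperatureC, clean$TimeIST, mean)
coolest <- names(which.min(hourly_mean))
warmest <- names(which.max(hourly_mean))

print(hourly_mean)
print(c(coolest = coolest, warmest = warmest))
```

A fixed cutoff is the crudest possible filter; it is enough here because only an obviously impossible value needs to go.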

Update 1 (March 16 2009):

You can download the big csv file [~700 KB, rename it as cbe.csv], which contains the weather data for Coimbatore for the years 2005 to 2008. You can then skip to Step (3) and use R to analyze the data.

Written by anandram

March 8, 2009 at 19:08
