Tuesday, June 24, 2014

Downloading ESPN Fantasy Football Projections


I'm starting another series of posts that will look at determining the best fantasy football draft strategy using optimization and game theory. First, though, we need some data to play with. Starting from FantasyFootballAnalytics code, I adapted it to download projections from ESPN's projections page.

First up, let's load some useful packages.

require("XML")
require("stringr")
require("ggplot2")
require("plyr")


Next, I want to make sure all strings are read in as plain character vectors rather than being converted to factors.
options(stringsAsFactors = FALSE)
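As a quick check that the option is doing its job (a throwaway example, not part of the scraper):

#with the option set, data.frame keeps strings as character vectors
class(data.frame(x = c("DEN", "GB"))$x)
#returns "character" rather than "factor"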

Here's the hard part. We need to go out to the website, download the data and clean it up.
clean <- function(pos) {
    link<-"http://games.espn.go.com/ffl/tools/projections?&seasonTotals=true&seasonId=2014&slotCategoryId="
    x <- readHTMLTable(paste(link,pos,sep=""))$playertable_0
    #update the column names
    names(x) <- c("position.rank","name.team.pos", "passing.completions.per.attempt", "passing.yards", "passing.tds", "passing.interceptions", "rushing.rushes", "rushing.yards", "rushing.tds", "receiving.catches", "receiving.yards", "receiving.tds", "total.points")

    #remove header row
    x <- x[-1,]

    #separate out the completions and attempts
    x$passing.completions <- str_sub(x$passing.completions.per.attempt, end=str_locate(string=x$passing.completions.per.attempt, '/')[,1]-1)
    x$passing.attempts <- str_sub(x$passing.completions.per.attempt, start=str_locate(string=x$passing.completions.per.attempt, '/')[,1]+1)
    x[,"passing.completions.per.attempt"] <- NULL

  if (pos!="16") {
    x$name <- str_sub(x$name.team.pos, end=str_locate(string=x$name.team.pos, ',')[,1]-1)
    x$team <- str_sub(x$name.team.pos, 
                      start=str_locate(string=x$name.team.pos, ',')[,1]+2, 
                      end = str_locate(string=x$name.team.pos, ',')[,1]+4)
    x$team <- str_trim(x$team, side="right")
    x$team <- toupper(x$team) 
  }  else {
    x$name <- str_sub(x$name.team.pos, end=str_locate(string=x$name.team.pos, ' ')[,1]-1)
    team.lookup<-as.data.frame(matrix(c("DEN","Broncos","GB","Packers","NO","Saints","CAR","Panthers",
                                        "WSH","Redskins","DET","Lions","IND","Colts","PHI","Eagles","OAK","Raiders",
                                        "SEA","Seahawks","SF","49ers","DAL","Cowboys","ATL","Falcons","NE","Patriots",
                                        "SD","Chargers","MIN","Vikings","CHI","Bears","KC","Chiefs","CIN","Bengals",
                                        "PIT","Steelers","NYG","Giants","ARI","Cardinals","MIA","Dolphins",
                                        "BAL","Ravens","TB","Bucaneers","CLE","Browns","HOU","Texans","STL","Rams",
                                        "BUF","Bills","NYJ","Jets","TEN","Titans","JAC","Jaguars"),ncol=2,byrow=TRUE))
    names(team.lookup)<-c("team","name")
    x<-merge(x=x,y=team.lookup,by="name")
  } 
    x[,"name.team.pos"] <- NULL

    #change from character to numeric
    for (c in names(x)) {
        if (!(c %in% c("pos","name","team"))) x[,c] <- as.numeric(x[,c])
        #replace NAs with 0
        x[is.na(x[,c]),c] <- 0
    }
    return(x)
}


Now that the function is ready, we just need to append the appropriate query string for each position to the URL. Note that since we'll need more than 40 RBs and 40 WRs, we'll grab the top 120 players at those two positions by paging through the results with startIndex.

qb <- clean("0")
rb <- rbind(clean("2"),clean("2&startIndex=40"),clean("2&startIndex=80"))
wr <- rbind(clean("4"),clean("4&startIndex=40"),clean("4&startIndex=80"))
te <- clean("6")
def <- clean("16")
k <- clean("17")
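As an aside, if we end up needing more pages later, the paging logic could be wrapped in a small helper. Here's a minimal sketch (clean_pages is my own hypothetical name, assuming ESPN serves 40 players per page):

#hypothetical helper: pull 'pages' pages of 40 players for a position
clean_pages <- function(pos, pages=1) {
    do.call(rbind, lapply(seq_len(pages)-1, function(p) {
        suffix <- if (p == 0) pos else paste(pos, "&startIndex=", 40*p, sep="")
        clean(suffix)
    }))
}
#e.g. rb <- clean_pages("2", pages=3) matches the rbind call above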


Now let's add in a variable for each position.

qb$pos <- as.factor("QB")
rb$pos <- as.factor("RB")
wr$pos <- as.factor("WR")
te$pos <- as.factor("TE")
def$pos <- as.factor("DEF")
k$pos <- as.factor("K")


Finally, let's combine all of the position specific datasets.

d <- rbind(qb,rb,wr,te,def,k)
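As a quick sanity check, each position should show the expected number of rows (120 for RB and WR, 40 for the others, give or take what ESPN actually returns):

table(d$pos)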

We can now rank each player by the point totals.

d$rank <- rank(-d$total.points, ties.method="min")
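The ties.method="min" argument means tied players share the lowest rank in their group. A tiny example:

rank(-c(300, 300, 250), ties.method="min")
#returns 1 1 3: the two 300-point players tie for 1st, the next player is 3rd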

Finally, we can order the dataset by that rank.

d <- d[order(d$rank),]
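To eyeball the top of the draft board:

head(d[, c("rank", "name", "team", "pos", "total.points")])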

Well, this is a great start. Note that we have total points for Defense and Kicker but no accompanying statistics; I'm going to explore the ESPN site to see what else I can find, but this will do in the meantime.
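Since the rest of this series will reuse the data, it's worth saving a copy so we aren't re-scraping ESPN every time (the file name is just my choice):

write.csv(d, "espn_projections_2014.csv", row.names=FALSE)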

Saturday, March 29, 2014

Measuring Accuracy of a Naive Bayes Classifier in Python

In the last post, I built a Naive Bayes classifier that uses a training dataset to classify tweets into a particular sentiment. One thing I didn't mention is how to measure the accuracy of that classifier. In order to determine how large a training dataset needs to be, we first need a metric that can be used to measure the classifier.

In the previous program, I had read the tweets into a list of tuples. The first step is to change this into a list of lists, which can be done with a simple list comprehension:

tweets = [list(t) for t in taggedtweets]
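For context, taggedtweets is the list of (text, sentiment) tuples from the last post; something shaped like this (the examples are made up):

taggedtweets = [("I love this phone", "positive"),
                ("worst service ever", "negative"),
                ("just landed in Chicago", "neutral")]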

Now that we have a list of tweets, we can simply iterate through each tweet and add the result of classifying each tweet.

for t in tweets:
    t.append(classifier.classify(feature_extractor(t[0])))

Each list now holds the tweet text, the hand-tagged classification, and the classifier's predicted classification. We can then count how many of the two classifications match and divide by the number of tweets. Since I like to view the results in the shell, I added a print statement that formats the accuracy as a percent.

accuracy = sum(t[1]==t[2] for t in tweets)/float(len(tweets))
print "Classifier accuracy is {:.0%}".format(accuracy)

Using the training dataset I provided, this comes out with an accuracy of 46%. Not exactly the most accurate classifier. Ideally I would measure accuracy against a separate test dataset rather than the training data, but I don't have one available at the moment. Looking at the results raises another question: is this the best metric to use? The classify method simply chooses the classification with the highest probability. Is it worth taking into account the full probability distribution calculated by the classifier? For example, a tweet could have this probability distribution:

P(sentiment==positive)=0.000000
P(sentiment==negative)=0.620941
P(sentiment==neutral)=0.379059
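A distribution like this can be printed straight from the classifier. Here's a sketch, assuming the NLTK-style classifier from the last post (its prob_classify returns a probability distribution exposing samples() and prob()); tweet_text stands in for whichever tweet you're inspecting:

pc = classifier.prob_classify(feature_extractor(tweet_text))
for label in pc.samples():
    #print the probability the model assigns to each sentiment
    print "P(sentiment==%s)=%f" % (label, pc.prob(label))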

In this example, the tweet is actually tagged as neutral, so using the metric above it gets marked as inaccurate. However, should the substantial probability the model assigned to neutral count in some way? I think it deserves partial credit. Instead of counting it as a miss (score of 0), I use the probability the classifier assigned to the tagged classification (score of 0.379059 here). While this increases the credit given for misses, it also decreases the credit given for matches, since a correct prediction now scores its assigned probability rather than a full 1. This changes the code slightly:

tweets = [list(t) for t in taggedtweets]

for t in tweets:
    features = feature_extractor(t[0])
    #predicted label
    t.append(classifier.classify(features))
    #probability the classifier assigned to the hand-tagged label
    pc = classifier.prob_classify(features)
    t.append(pc.prob(t[1]))

accuracy = sum(t[1]==t[2] for t in tweets)/float(len(tweets))
weighted_accuracy = sum(t[3] for t in tweets)/float(len(tweets))
print "Classifier accuracy is {:.0%}".format(accuracy)
print "Classifier weighted accuracy is {:.0%}".format(weighted_accuracy)

The results aren't quite what I expected: the weighted accuracy also comes out to roughly 46%, about the same as the simple accuracy. I still think the weighted accuracy will be the better measurement. Using this metric, we can now work on improving the model and determining how large a training set it needs.