Instructions

  1. Download our sample data: Dat_Trans_and_Items.csv. (Be careful: if you open this file in MS Excel and save it, Excel will delete some data.)
  2. Create a .R file with your solution (include any plots or documents you feel appropriate).
  3. Email your deliverable to leepslab@gmail.com (or reply to a personal email you may have received).
  4. Submissions will be judged individually and need not be complete. This is meant to be an indicator of your ability to work independently, learn new packages, and adapt to different tasks. Questions via email are absolutely acceptable: a mostly complete task that required email assistance is better than nothing at all. This isn't homework, it's real work. What matters most is the ability to get things done, even if that means asking for help or using code found online. Just be sure to cite any sources of outside assistance.

1. Summary of Data.

Dat_Trans_and_Items.csv - This dataset simulates transaction data.

Dat_Trans_and_Items header names
"ItemID" A multi-digit integer reflecting an individual good, i.e. each unique ItemID number reflects one "type" of good.
"Trader_1_ID" A multi-digit integer reflecting an individual trader on one side of this transaction.
"Trader_2_ID" A multi-digit integer reflecting an individual trader on the other side of this transaction.
"Transaction_ID" A 19-digit integer reflecting an individual transaction. (Be careful: if you open this file in MS Excel and save it, Excel will delete 4 digits from this value.)
"Trade_time" Time (in seconds) when this trade was executed.
"Game" An unimportant column, reflecting on which market this trade occurred.
"Originator" A value of 1 implies that this good (indicated by ItemID) originated with Trader_1 and went to Trader_2. A value of 0 implies that the good associated with this row (indicated by ItemID) originated with Trader_2 and went to Trader_1.

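A 19-digit integer does not fit in R's 32-bit integer type and cannot be represented exactly as a double, so reading "Transaction_ID" as a number may silently merge distinct IDs. The sketch below is one way to guard against that, assuming the column names listed above; the inline CSV text and its values are made up for illustration and stand in for the real file.

```r
# Hypothetical three-row sample with the same header as Dat_Trans_and_Items.csv.
csv_text <- 'ItemID,Trader_1_ID,Trader_2_ID,Transaction_ID,Trade_time,Game,Originator
101,11,22,1234567890123456789,100,1,1
102,11,22,1234567890123456789,100,1,0
103,33,44,1234567890123456788,200,1,1'

# Default parsing converts the 19-digit IDs to doubles, which hold only
# about 15-16 significant digits, so neighbouring IDs may become equal.
dat_num <- read.csv(text = csv_text)

# Reading the ID column as character keeps every digit.
dat_chr <- read.csv(text = csv_text,
                    colClasses = c(Transaction_ID = "character"))

n_ids_numeric   <- length(unique(dat_num$Transaction_ID))
n_ids_character <- length(unique(dat_chr$Transaction_ID))
print(c(numeric = n_ids_numeric, character = n_ids_character))
```

For the real file, the same colClasses argument can be passed to read.csv together with the file path instead of text.
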
2. Initial Questions, Summary Stats

  1. Load the data.
  2. How many unique trades are there?
  3. How many unique traders participated in this market?
  4. How many types of goods were traded?
  5. What was the average number of goods traded in each transaction?
  6. Look at "Trade_time": for how many sample days are there data? (Again, this is time in seconds, i.e. the second when this transaction was executed. To answer this you'll need to look closely at--and think carefully about--the distribution of Trade_times.)
  7. How many days separate each of these sample days?
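One possible way to approach questions 6 and 7 is to sort the trade times and start a new sample day wherever the gap to the previous trade is large. This is only a sketch: the helper assign_day, the 12-hour threshold, and the example times are assumptions for illustration, and the real distribution of Trade_time should be inspected before choosing a threshold.

```r
# Made-up trade times (in seconds) forming three clusters.
trade_time <- c(100, 4000, 9000, 864100, 870000, 1728500)

# Hypothetical helper: label each time with a sample-day index, starting a
# new day whenever the gap to the previous (sorted) time exceeds gap_seconds.
assign_day <- function(x, gap_seconds = 12 * 3600) {
  ord <- order(x)
  new_cluster <- c(TRUE, diff(x[ord]) > gap_seconds)
  day <- integer(length(x))
  day[ord] <- cumsum(new_cluster)   # map labels back to the original order
  day
}

day <- assign_day(trade_time)
n_days <- length(unique(day))

# Days between consecutive sample days, measured from each day's first trade.
day_start <- as.numeric(tapply(trade_time, day, min))
days_between <- diff(day_start) / 86400

print(day)
print(n_days)
print(round(days_between, 2))
```

A quick plot of the sorted times, or a histogram of diff(sort(trade_time)), is a reasonable check that the chosen threshold separates the clusters cleanly.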

3. Summary Stats by Sample Day

  1. For each sample day, how many trades occurred?
  2. For each sample day, how many traders participated in the market?
  3. For each sample day, how many unique goods were traded?
    For each of these questions, feel free to create a table and/or show this information graphically. (Although doable without it, check out the ggplot2 package.)
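The per-day summaries can be sketched in base R with tapply, assuming a "Day" column has already been derived from "Trade_time". The data frame below is a made-up stand-in for the real data, and n_unique is a hypothetical helper name. Note that traders appear in two columns, so both are stacked before counting.

```r
# Made-up item-level rows: one row per item within a transaction.
dat <- data.frame(
  Transaction_ID = c("t1", "t1", "t2", "t3", "t3", "t4"),
  Trader_1_ID    = c(1, 1, 2, 1, 1, 3),
  Trader_2_ID    = c(2, 2, 3, 4, 4, 4),
  ItemID         = c(10, 11, 10, 12, 13, 10),
  Day            = c(1, 1, 1, 2, 2, 2),
  stringsAsFactors = FALSE
)

n_unique <- function(x) length(unique(x))

trades_per_day <- tapply(dat$Transaction_ID, dat$Day, n_unique)
goods_per_day  <- tapply(dat$ItemID, dat$Day, n_unique)

# A trader can be on either side of a trade, so stack both ID columns.
traders_long <- data.frame(
  Day    = rep(dat$Day, 2),
  Trader = c(dat$Trader_1_ID, dat$Trader_2_ID)
)
traders_per_day <- tapply(traders_long$Trader, traders_long$Day, n_unique)

by_day <- data.frame(
  Day     = as.integer(names(trades_per_day)),
  trades  = as.vector(trades_per_day),
  traders = as.vector(traders_per_day),
  goods   = as.vector(goods_per_day)
)
print(by_day)
```

The resulting by_day table can be printed as is or passed to a plotting function such as barplot.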

4. Summary Stats by Trader

  1. What is the average number of trades each trader participated in?
  2. Show the number of trades each trader participated in by an informative histogram.
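A minimal sketch of the per-trader counts, using base R graphics rather than ggplot2. It assumes that each Transaction_ID involves exactly one pair of distinct traders; the data are made up. Item-level rows are first reduced to one row per transaction so that multi-item trades are not counted more than once.

```r
# Made-up item-level rows: one row per item within a transaction.
dat <- data.frame(
  Transaction_ID = c("t1", "t1", "t2", "t3", "t3", "t4"),
  Trader_1_ID    = c(1, 1, 2, 1, 1, 1),
  Trader_2_ID    = c(2, 2, 3, 4, 4, 4),
  stringsAsFactors = FALSE
)

# One row per transaction.
trades <- unique(dat[, c("Transaction_ID", "Trader_1_ID", "Trader_2_ID")])

# Each trade contributes one participation to each of its two traders.
participants <- c(trades$Trader_1_ID, trades$Trader_2_ID)
trades_per_trader <- table(participants)
counts <- as.vector(trades_per_trader)

mean_trades <- mean(counts)
print(trades_per_trader)
print(mean_trades)

# Integer-centred bins so each bar corresponds to one trade count.
hist(counts,
     breaks = seq(0.5, max(counts) + 0.5, by = 1),
     main = "Trades per trader",
     xlab = "Number of trades",
     ylab = "Number of traders")
```

Under the stated assumption, the mean equals twice the number of trades divided by the number of traders, which gives a quick sanity check on the result.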

5. Reshaping Data

For each of these questions, also think about optimizing your code for speed.

  1. Create a new data frame called Dat.Trade, where each row is a unique transaction ID. (You'll want to remove the "ItemID" and "Originator" columns, as these are only relevant to the previous data frame.)
  2. Add a column to Dat.Trade called "itemsFrom_1_to_2" that contains the ItemIDs of all the goods in this trade (indicated by this row's Transaction_ID) that went from Trader_1 to Trader_2. For example, if for this Transaction_ID Trader_1 gave no items to Trader_2, then this field would be empty. If one item was passed to Trader_2, that one item would be listed. If two items were passed to Trader_2, then the two items would be listed here, etc. (You'll want to note the "Originator" column in the initial data frame.)
  3. As in step 2 above, add a column to Dat.Trade called "itemsFrom_2_to_1" that contains the ItemIDs of all the items in this trade that went from Trader_2 to Trader_1.
  4. Add a column to Dat.Trade called "Day", which indicates the sample day of this transaction.
  5. For each of the above steps, if the dataset "Dat_Trans_and_Items.csv" were a random sample of 2000 Transaction IDs drawn from a dataset of 2 billion Transaction IDs, how long would your code take to run over the full data?

(Although doable without it, check out the plyr and reshape2 packages.)
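
Steps 1 to 3 and the scaling question can also be sketched in base R alone. This is one possible approach rather than the expected solution: the data are made up, collapse_items is a hypothetical helper, items are stored as comma-separated strings (a list column would work too), and the runtime estimate assumes purely linear scaling, which is optimistic because it ignores memory limits.

```r
# Made-up item-level rows with the same columns as the sample file.
dat <- data.frame(
  ItemID         = c(10, 11, 12, 13, 14),
  Trader_1_ID    = c(1, 1, 1, 2, 3),
  Trader_2_ID    = c(2, 2, 2, 3, 4),
  Transaction_ID = c("t1", "t1", "t1", "t2", "t3"),
  Trade_time     = c(100, 100, 100, 200, 300),
  Game           = 1,
  Originator     = c(1, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Step 1: one row per transaction, without ItemID and Originator.
Dat.Trade <- unique(dat[, c("Trader_1_ID", "Trader_2_ID",
                            "Transaction_ID", "Trade_time", "Game")])

# Hypothetical helper: for the selected rows of dat, paste the ItemIDs of
# each transaction together and align the result with the rows of Dat.Trade.
collapse_items <- function(rows) {
  items <- tapply(dat$ItemID[rows], dat$Transaction_ID[rows],
                  paste, collapse = ",")
  out <- as.vector(items)[match(Dat.Trade$Transaction_ID, names(items))]
  out[is.na(out)] <- ""   # transactions with no items in this direction
  out
}

# Steps 2 and 3: Originator 1 means Trader_1 gave the item, 0 means Trader_2.
Dat.Trade$itemsFrom_1_to_2 <- collapse_items(dat$Originator == 1)
Dat.Trade$itemsFrom_2_to_1 <- collapse_items(dat$Originator == 0)
print(Dat.Trade)

# Step 5: naive linear extrapolation from the sample to the full data.
n_runs <- 100
elapsed <- system.time(
  for (i in seq_len(n_runs)) collapse_items(dat$Originator == 1)
)[["elapsed"]]
scale_factor <- 2e9 / 2000
est_full_seconds <- (elapsed / n_runs) * scale_factor
```

Grouping once with tapply avoids looping over transactions in R code, which is usually the main cost at larger sizes.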