Web Intelligence and Big Data Open Course, Homework 3: Map-Reduce Programming

2013-10-06

Web Intelligence and Big Data
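by Dr. Gautam Shroff

This course is about big data processing. This week brings the first programming assignment, which asks for statistics over text data using Map-Reduce. The tool used here is the lightweight mincemeat. One thing to note: use a regular expression to match words. Once the counts are done, sort the results by author name and then by frequency (a two-key sort) before writing them to a file. The homework questions come with a two-minute time limit, so the answers have to be looked up on the spot. The assignment reads as follows:

Download data files bundled as a .zip file from hw3data.zip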
Each file in this archive contains entries that look like:
journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2.
that represent bibliographic information about publications, formatted as follows:
paper-id:::author1::author2::…. ::authorN:::title
Your task is to compute how many times every term occurs across titles, for each author.
For example, for the author Alberto Pettorossi the following terms occur in titles with the indicated cumulative frequencies (across all his papers): program:3, transformation:2, transforming:2, using:2, programs:2, and logic:2.
Remember that an author might have written multiple papers, which might be listed in multiple files. Further notice that ‘terms’ must exclude common stop-words, such as prepositions etc. For the purpose of this assignment, the stop-words that need to be omitted are listed in the script stopwords.py. In addition, single letter words, such as "a" can be ignored; also hyphens can be ignored (i.e. deleted). Lastly, periods, commas, etc. need to be ignored; in other words, only alphabets and numbers can be part of a title term: Thus, “program” and “program.” should both be counted as the term ‘program’, and "map-reduce" should be taken as 'map reduce'. Note: You do not need to do stemming, i.e. "algorithm" and "algorithms" can be treated as separate terms.
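The term rules above can be sketched as a small regular-expression tokenizer. This is only a sketch under stated assumptions: `STOPWORDS` is a hypothetical stand-in for the list in stopwords.py (which is not reproduced here), `title_terms` is an illustrative name, terms are lowercased (the assignment does not say so explicitly, but the examples are all lowercase), and a hyphen is turned into a space so that "map-reduce" becomes "map reduce".

```python
import re

# Hypothetical stand-in for the list shipped in stopwords.py.
STOPWORDS = frozenset(["a", "an", "and", "for", "in", "of", "the", "to"])

def title_terms(title, stopwords=STOPWORDS):
    """Return the countable terms of one title."""
    # Assumption: lowercase everything; treat a hyphen as a word break.
    text = title.lower().replace("-", " ")
    # Keep only letters, digits and whitespace (drops periods, commas, etc.).
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # Skip single-character words and stop-words.
    return [w for w in text.split() if len(w) > 1 and w not in stopwords]

print(title_terms("Programmer-Defined Control Abstractions in Modula-2."))
```

No stemming is applied, so "program" and "programs" stay separate terms, as the assignment allows.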
The assignment is to write a parallel map-reduce program for the above task using either octo.py, or mincemeat.py, each of which is a lightweight map-reduce implementation written in Python.
These are available from http://code.google.com/p/octopy/ and mincemeat.py-zipfile respectively.
I strongly recommend mincemeat.py, which is much faster than octo.py even though the latter was covered first in the lecture video as an example. Both are very similar.
Once you have computed the output, i.e. the terms-frequencies per author, go attempt Homework 3 where you will be asked questions that can be simply answered using your computed output, such as the top terms that occur for some particular author.
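To make the task concrete, here is a minimal sketch of a map function and a reduce function for it, together with a tiny in-process driver. It is not the course's reference solution. The names `mapfn`, `reducefn` and `run_local`, and the short stop-word set, are assumptions made for illustration. With mincemeat.py the two functions would be handed to the framework rather than run through `run_local`; that wiring is left out here. The import and the stop-word set sit inside `mapfn` as a precaution, on the assumption that the framework may ship each function to workers without the surrounding module.

```python
from collections import defaultdict

def mapfn(key, value):
    """key: file name, value: file contents. Emits one (author, term) pair per term."""
    import re  # kept local so the function is self-contained
    stopwords = frozenset(["a", "an", "and", "in", "of", "the"])  # hypothetical subset
    for line in value.splitlines():
        parts = line.strip().split(":::")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _paper_id, authors, title = parts
        text = re.sub(r"[^a-z0-9\s]", "", title.lower().replace("-", " "))
        terms = [w for w in text.split() if len(w) > 1 and w not in stopwords]
        for author in authors.split("::"):
            for term in terms:
                yield author, term

def reducefn(key, values):
    """key: author, values: every term emitted for that author. Returns {term: count}."""
    counts = {}
    for term in values:
        counts[term] = counts.get(term, 0) + 1
    return counts

def run_local(datasource):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)
    for name, contents in datasource.items():
        for out_key, out_value in mapfn(name, contents):
            grouped[out_key].append(out_value)
    return {k: reducefn(k, vs) for k, vs in grouped.items()}

sample = {
    "file1": "journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo"
             ":::Programmer-Defined Control Abstractions in Modula-2.\n",
}
print(run_local(sample)["Libero Nigro"])
```

Keying the intermediate pairs by author means that papers spread over several files end up in the same reduce call, which is what the "multiple files" remark in the assignment requires.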

A sample of the output looks like this:

Stephen L. Bloom *** scalar *** 1
Stephen L. Bloom *** concatenation *** 1
Stephen L. Bloom *** point *** 1
Stephen L. Bloom *** varieties *** 1
Stephen L. Bloom *** observation *** 1
Stephen L. Bloom *** equivalence *** 1
Stephen L. Bloom *** axioms *** 1
Stephen L. Bloom *** languages *** 1
Stephen L. Bloom *** logical *** 1
Stephen L. Bloom *** algebras *** 1
Stephen L. Bloom *** equations *** 1
Stephen L. Bloom *** number *** 1
Stephen L. Bloom *** vector *** 1
Stephen L. Bloom *** polynomial *** 1
Stephen L. Bloom *** solving *** 1
Stephen L. Bloom *** equational *** 1
Stephen L. Bloom *** axiomatizing *** 1
Stephen L. Bloom *** characterization *** 1
Stephen L. Bloom *** regular *** 2
Stephen L. Bloom *** sets *** 2
Stephen L. Bloom *** iteration *** 3
Stephen L. Lieman *** unacceptable *** 1
Stephen L. Lieman *** correcting *** 1
Stephen L. Lieman *** never *** 1
Stephen L. Lieman *** powerful *** 1
Stephen L. Lieman *** accept *** 1
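The two-key sort behind a listing like this one can be sketched as follows. The sketch assumes the per-author counts are available as a dict of dicts. `format_results` and `write_results` are illustrative names. Ordering equal counts by term is an addition made here for reproducible output; the sample above does not order ties.

```python
def format_results(results):
    """results: {author: {term: count}} -> lines 'author *** term *** count'."""
    rows = [(author, count, term)
            for author, counts in results.items()
            for term, count in counts.items()]
    # Tuples compare element by element: author first, then count, then term.
    rows.sort()
    return ["%s *** %s *** %d" % (author, term, count)
            for author, count, term in rows]

def write_results(results, path):
    """Write one formatted line per (author, term) pair to a text file."""
    with open(path, "w") as out:
        for line in format_results(results):
            out.write(line + "\n")

demo = {
    "Stephen L. Lieman": {"never": 1},
    "Stephen L. Bloom": {"iteration": 3, "sets": 2, "regular": 2, "scalar": 1},
}
for line in format_results(demo):
    print(line)
```

Because each author's lines are contiguous and sorted by ascending count, that author's most frequent terms sit at the end of the block, which makes the file quick to search under the time limit.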
