Project Plan: Big Data Analytics

by Professor Sathish Kumar

Synopsis.  Climate change poses one of the greatest socioeconomic challenges [Dai, 2012], but it also presents an opportunity for data science and big data research, since climate science involves novel data, methods, and evaluation challenges [Langley, 2011]. Despite this potential and the abundance of climate data, data science and big data analytics have had little impact on our understanding of climate change and climate science. This gap stems from the complex nature of climate data and of the scientific questions that climate science raises [Donges et al., 2009]. I plan to use the computer and cluster to: a) explore theory-guided data science and machine learning methods that combine big data analytics with climate science principles, and b) experiment with interpretation processes that extract accurate insight from large climate data sets specific to coastal areas, and propose methods to mitigate natural coastal disasters.

Genetic markers, genome sequencing, and bioinformatics play a major role in our understanding of heritable traits. However, while great advances have been made in characterizing genomes and genetic markers, the ability to link these genomic differences with heritable phenotypic traits remains a bottleneck. Molecular genetic markers [Duran et al., 2009], with the help of bioinformatics, are gradually bridging the divide between traits and increasingly available genome sequence information [Edwards and Batley, 2004, 2008]. I plan to explore next-generation sequencing technologies to study crop genomes and to perform large-scale diversity studies at the whole-genome level. This genome-level study will help apply molecular markers to a broader range of traits in a greater diversity of species than is currently possible, accelerating crop breeding and improvement to meet the challenge of global food insecurity related to climate change.

Software Description:  Through MAUI, I plan to use the Hadoop MapReduce suite of software tools for the big data analytics research activities described above. Hadoop MapReduce is a software framework for writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. In addition, I plan to use tools such as R and Caffe for deep learning.
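To illustrate the MapReduce programming model described above, the following is a minimal single-process sketch in Python. It is not the Hadoop API: it only mimics the map, shuffle, and reduce stages that the framework would distribute across cluster nodes. The record format ("station,temperature"), the station names, and all function names are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def map_phase(records):
    """Map: parse each 'station,temperature' line into a (key, value) pair."""
    for line in records:
        station, temp = line.split(",")
        yield station.strip(), float(temp)


def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single mean."""
    return {key: sum(values) / len(values) for key, values in groups.items()}


def mean_temperature_by_station(records):
    """Run the three stages in sequence on an in-memory list of records."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    # Hypothetical sample data: two stations, two readings each.
    sample = ["A,20.0", "B,10.0", "A,22.0", "B,14.0"]
    print(mean_temperature_by_station(sample))
```

On a real cluster, the map and reduce functions would run in parallel on separate data partitions, and the framework itself would handle the shuffle and recovery from node failures.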


1. Dai A, 2012: “Increasing drought under global warming in observations and models”, Nat Clim Change, 3:52–58

2. Langley P, 2011: “The changing science of machine learning”, Mach Learn, 82:275–279

3. Donges JF, Zou Y, Marwan N, Kurths J, 2009: “The backbone of the climate network”, Europhys Lett, 87:48007

4. Duran C, Appleby N, Edwards D, Batley J, 2009: “Molecular genetic markers: discovery, applications, data storage and visualization”, Current Bioinformatics, 4:16–27

5. Edwards D, Batley J, 2008: “Bioinformatics: fundamentals and applications in plant genetics, mapping and breeding”. In: Kole C, Abbott AG (eds.) Principles and practices of plant genomics. Science Publishers, Enfield, NH, pp 269–302