1. Truth Discovery for Passive and Active Crowdsourcing
Crowd-contributed data has become a vast treasure trove of information in recent years. It covers almost every aspect of our lives, has grown at an astonishing rate, and has consequently changed fundamentally how we learn about the world. Crowdsourcing-related topics have been intensively studied from different perspectives. To give a full picture of crowdsourcing, this tutorial is dedicated to a unified study of crowdsourcing techniques from various perspectives. To systematically introduce the successes of and extensive studies in crowdsourcing, we approach the topic from the data collection perspective: “passively” crowdsourced data and “actively” crowdsourced data. The crowdsourcing research tasks covered include information aggregation, budget allocation, worker incentive mechanisms, etc. This tutorial offers a thorough exploration and will greatly benefit both academia and industry communities, motivate researchers to develop effective and efficient crowdsourcing systems, and enable practitioners to apply and adapt crowdsourcing approaches to real applications.
Jing Gao is currently an assistant professor in the Department of Computer Science at the University at Buffalo (UB), State University of New York. She received her Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign in 2011, and subsequently joined UB in 2012. She is broadly interested in data and information analysis with a focus on truth discovery, crowdsourcing, multi-source data analysis, anomaly detection, information network analysis, transfer learning, data stream mining, and ensemble learning. She is also interested in various data mining applications in health care, bioinformatics, transportation, cyber security and education. She has published more than 90 papers in refereed journals and conferences and her work has received over 2700 citations. She is the recipient of an NSF CAREER Award and an IBM Faculty Award.
Qi Li received the BS degree in Mathematics from Xidian University and the MS degree in Statistics from the University of Illinois at Urbana-Champaign, in 2010 and 2012 respectively. She is currently working toward the Ph.D. degree in the Department of Computer Science and Engineering, University at Buffalo. Her research interests include truth discovery, data aggregation, and crowdsourcing. She has published papers on these topics in SIGMOD, VLDB, KDD, and WSDM.
Wei Fan is currently the senior director and deputy head of Baidu Big Data Lab in Sunnyvale, California. He received his PhD in Computer Science from Columbia University in 2001. His main research interests and experiences are in various areas of data mining and database systems, such as deep learning, stream computing, high performance computing, extremely skewed distribution, cost-sensitive learning, risk analysis, ensemble methods, easy-to-use nonparametric methods, graph mining, predictive feature discovery, feature selection, sample selection bias, transfer learning, time series analysis, bioinformatics, social network analysis, novel applications and commercial data mining systems. His co-authored papers received ICDM’06/KDD11/KDD12/KDD13/KDD97 Best Paper & Best Paper Runner-up Awards. He led the team that used his Random Decision Tree (www.dice.com) method to win the 2008 ICDM Data Mining Cup Championship. He received the 2010 IBM Outstanding Technical Achievement Award for his contribution to IBM Infosphere Streams. He is an associate editor of ACM Transactions on Knowledge Discovery from Data (TKDD). During his time as Associate Director of Huawei Noah’s Ark Lab in Hong Kong from August 2012 to December 2014, he led his colleagues in developing Huawei StreamSMART, a streaming platform for online and real-time processing, querying and mining of very fast streaming data. StreamSMART is 3 to 5 times faster than STORM and 10 times faster than SparkStreaming, and was used at Beijing Telecom, Saudi Arabia STC, Norway Telenor and a few other mobile carriers in Asia. Since joining Baidu Big Data Lab, Wei has been working on medical and healthcare research and applications, such as deep learning-based disease diagnosis from NLP input as well as a medical dialogue robot.
2. Similarity Search on Time Series Data: Past, Present, and Future
Similarity search on time series data has been studied for decades, and we continue to see new solutions being developed to address challenges arising from novel applications, natural constraints, and processing platforms. In this tutorial, we briefly describe the last three decades of research on similarity search in time series data, discuss recent developments, and project future challenges towards similarity-based temporal mining systems.
The tutorial will cover a large set of similarity measures, domain dependent invariances, bounding techniques for indexing and scalability, and representational techniques for efficiency. We will discuss relative strengths and weaknesses of various methods and make recommendations for researchers and practitioners. We will present high-level mining problems that directly benefit from similarity search and are still open.
The tutorial will discuss a rich set of application areas where similarity search has led to successful knowledge discovery, such as physiological monitoring, data center monitoring, activity recognition and animal tracking.
Dr. Mueen is an Assistant Professor in Computer Science at the University of New Mexico. He was runner-up in the SIGKDD doctoral dissertation contest in 2012 and won the best paper award at the same conference. His research work has been published in top conferences including SIGMOD, KDD, WWW and ICDM. He has recently given well-received tutorials at the SDM and ICDM conferences on repeated pattern mining from time series data. He has co-presented another tutorial at KDD 2016 on speeding up DTW-based time series mining. His tutorials are well received by young researchers and practitioners in industry and academia, and are supported by NSF.
3. Big Data Science in Drug Discovery and Development
Increasingly, effective drug discovery and development involve searching and data mining of large amounts of heterogeneous information from many sources covering the domains of chemistry, biology and pharmacology, amongst others. In this tutorial, we provide a comprehensive review of the state-of-the-art big data approaches in drug discovery and development, and illustrate the future directions of big data research in this area. In addition, we review publicly available large-scale databases that are relevant to big data research in drug discovery. The big data approaches that this tutorial will focus on include matrix computation and factorization methods (such as non-negative matrix factorization, joint matrix factorization, tensor factorization), active learning, multi-task learning, and the most recent deep learning methods. We survey various related articles from big data science venues as well as from bioinformatics and cheminformatics venues to share with the audience the key problems, big data solution approaches and trends in drug discovery research, with different applications such as drug repositioning, predictive drug safety, and precision medicine.
Dr. Ping Zhang is a Research Staff Member at the Center for Computational Health, IBM T. J. Watson Research Center. He is leading translational informatics research at IBM. His research focuses on Machine Learning, Data Mining, and their applications to Drug Discovery and Health Informatics. He has published more than 25 articles in refereed journals and conferences, including the AMIA, BIBM, ECML/PKDD, KDD, and WWW conferences, and the journals BMC Bioinformatics, JAMIA, Nucleic Acids Research, Proteome Science, and Scientific Reports. Dr. Zhang has served on the program committees of leading international conferences including KDD, IJCAI, UAI, SDM, BIBM, and ICHI. He also serves on the editorial board of CPT: Pharmacometrics & Systems Pharmacology, an official journal of the American Society for Clinical Pharmacology.
Dr. Xia Ning is an Assistant Professor at the Department of Computer and Information Science, Indiana University – Purdue University Indianapolis (IUPUI). She is also a member of the Center of Computational Biology and Bioinformatics (CCBB), Indiana University School of Medicine (IUSM). Her research is on Data Mining, Machine Learning and their applications in Drug Discovery, Recommender Systems and Cyber-Physical Systems. Her research has been disseminated at top-tier conferences (e.g., KDD, ICDM, SDM, AISTATS, AAAI, IJCAI) and in journals (e.g., Journal of Chemical Information and Modeling), through open-source software distributions, and via multiple (pending) patents (e.g., with NEC Labs America and Qualcomm, Inc.).
Dr. David Wild is an Associate Professor of Informatics at the Indiana University School of Informatics and Computing. He completed a Ph.D. in Computational Drug Discovery at Sheffield University in 1994. He worked for several years in scientific computing in the pharmaceutical industry, before moving into academia in 2004 to form new academic research and educational programs at Indiana University School of Informatics and Computing. He is Director of Data Science Academic Programs, which comprise online and residential degrees currently containing over 300 graduate students. He also leads the Integrative Data Science Laboratory, developing advanced data science technologies for healthcare and drug discovery. He co-founded and is CEO of Data2Discovery Inc., a data science company with a mission to transform healthcare through revolutionary data intelligence. David has around 100 research publications, mainly in the fields of cheminformatics and network chemical biology, and is co-editor of the Journal of Cheminformatics. He has been PI or CoPI on over $2m in research funding.
4. Graph Exploration: Taking the User into the Loop
The increasing interest in social networks, knowledge graphs, protein-interaction networks, and many other types of networks has raised the question of how users can explore such large and complex graph structures easily. Current tools focus on graph management, graph mining, or graph visualization but lack user-driven methods for graph exploration. In many cases graph methods try to scale to the size and complexity of a real network. However, these methods overlook user requirements such as exploratory graph query processing, intuitive graph explanation, and interactivity in graph exploration. While there is consensus in the database and data mining communities on the definition of data exploration practices for relational and semi-structured data, graph exploration practices are still indeterminate.
In this tutorial, we will discuss a set of techniques, which have been developed in the last few years for independent purposes, within a unified graph exploration taxonomy. The tutorial will provide a generalized definition of graph exploration in which the user interacts directly with the system by providing either feedback or a partial query. We will discuss common, diverse, and missing properties of graph exploration techniques based on this definition, our taxonomy, and multiple applications for graph exploration. Concluding this discussion, we will highlight interesting and relevant challenges for data scientists in graph exploration.
Davide Mottin is a postdoctoral researcher at Hasso Plattner Institute. His research interests include graph mining, novel query paradigms, and interactive methods. He received his PhD in 2015 from the University of Trento. He is a PhD on the Move programme recipient.
Anja Jentzsch is a PhD student in the Information Systems Group at Hasso Plattner Institute Potsdam. She is a Linked Data enthusiast, being involved in several Linked Data projects like Wikidata and DBpedia since 2007. Currently, she is working on exploring frequent and common graph structures in heterogeneous graph databases.
Emmanuel Müller is professor and head of the Knowledge Discovery and Data Mining group at Hasso Plattner Institute. His research interests include graph mining, stream mining, clustering and outlier mining on graphs, streams, and traditional databases. He received his PhD in 2010 from RWTH Aachen University, and was an independent group leader at Karlsruhe Institute of Technology (2010 – 2015) and a postdoctoral fellow at the University of Antwerp (2012 – 2015).
5. IoT Big Data Stream Mining
The challenge of deriving insights from the Internet of Things (IoT) has been recognized as one of the most exciting and key opportunities for both academia and industry. Advanced analysis of big data streams from sensors and devices is bound to become a key area of data mining research as the number of applications requiring such processing increases.
Dealing with the evolution over time of such data streams, i.e., with concepts that drift or change completely, is one of the core issues in IoT stream mining. This tutorial is a gentle introduction to mining IoT big data streams. The first part introduces data stream learners for classification, regression, clustering, and frequent pattern mining. The second part deals with scalability issues inherent in IoT applications, and discusses how to mine data streams on distributed engines such as Spark, Flink, Storm, and Samza.
Gianmarco De Francisci Morales is a Scientist at QCRI. Previously he worked as a Visiting Scientist at Aalto University in Helsinki, as a Research Scientist at Yahoo Labs in Barcelona, and as a Research Associate at ISTI-CNR in Pisa. He received his Ph.D. in Computer Science and Engineering from the IMT Institute for Advanced Studies of Lucca in 2012. His research focuses on scalable data mining, with an emphasis on Web mining and data-intensive scalable computing systems. He is an active member of the open source community of the Apache Software Foundation, working on the Hadoop ecosystem, and a committer for the Apache Pig project. He is one of the lead developers of Apache SAMOA, an open-source platform for mining big data streams. He regularly serves on the PC of several major conferences in the area of data mining, including WSDM, KDD, CIKM, and WWW. He co-organizes the workshop series on Social News on the Web (SNOW), co-located with the WWW conference.
Albert Bifet is Associate Professor at Telecom ParisTech and Honorary Research Associate at the WEKA Machine Learning Group at the University of Waikato. Previously he worked at Huawei Noah’s Ark Lab in Hong Kong, Yahoo Labs in Barcelona, the University of Waikato and UPC BarcelonaTech. He is the author of the book Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams. He is one of the leaders of the MOA and Apache SAMOA software environments for implementing algorithms and running experiments for online learning from evolving data streams. He is serving as Co-Chair of the Industrial track of IEEE MDM 2016, ECML PKDD 2015, and as Co-Chair of BigMine (2015, 2014, 2013, 2012), and ACM SAC Data Streams Track (2016, 2015, 2014, 2013, 2012).
Joao Gama received, in 2000, his Ph.D. degree in Computer Science from the Faculty of Sciences of the University of Porto, Portugal. He joined the Faculty of Economy, where he holds the position of Associate Professor. He is also a senior researcher and vice-director of LIAAD, a group belonging to INESC TEC. He has worked in several National and European projects on Incremental and Adaptive learning systems, Ubiquitous Knowledge Discovery, Learning from Massive and Structured Data, etc. He served as Co-Program chair of ECML’2005, DS’2009, ADMA’2009, IDA’2011, and ECML-PKDD’2015. He served as track chair on Data Streams with ACM SAC from 2007 till 2016. He organized a series of Workshops on Knowledge Discovery from Data Streams with ECML-PKDD conferences and Knowledge Discovery from Sensor Data with ACM SIGKDD. He is the author of several books on Data Mining (in Portuguese) and authored a monograph on Knowledge Discovery from Data Streams. He has authored more than 250 peer-reviewed papers in areas related to machine learning, data mining, and data streams. He is a member of the editorial board of the international journals ML, DMKD, TKDE, IDA, NGC, and KAIS.
Wei Fan is the Deputy Head at Baidu Research Big Data Lab. He received his PhD in Computer Science from Columbia University in 2001. His main research interests and experiences are in various areas of data mining and database systems, such as stream computing, high performance computing, extremely skewed distribution, cost-sensitive learning, risk analysis, ensemble methods, easy-to-use nonparametric methods, graph mining, predictive feature discovery, feature selection, sample selection bias, transfer learning, time series analysis, bioinformatics, social network analysis, novel applications and commercial data mining systems. His co-authored paper received the ICDM 2006 Best Application Paper Award. He led the team that used his Random Decision Tree method to win the 2008 ICDM Data Mining Cup Championship. He received the 2010 IBM Outstanding Technical Achievement Award for his contribution to IBM Infosphere Streams. He is an associate editor of ACM Transactions on Knowledge Discovery from Data (TKDD). At Huawei, he led his colleagues in developing Huawei StreamSMART, a streaming platform for online and real-time processing, querying and mining of very fast streaming data. He also led his colleagues in developing a real-time processing and analysis platform for Mobile Broad Band (MBB) data.
6. Large Scale Distributed Data Science using Apache Spark 2.0
Apache Spark is an open-source cluster computing framework. It has emerged as the next generation big data processing engine, overtaking Hadoop MapReduce, which helped ignite the big data revolution. Spark maintains MapReduce’s linear scalability and fault tolerance, but extends it in a few important ways: it is much faster (100 times faster for certain applications) and much easier to program, thanks to its rich APIs in Python, Java, Scala, SQL and R (MapReduce has 2 core calls) and its core data abstraction, the distributed data frame. In addition, it goes far beyond batch applications to support a variety of compute-intensive tasks, including interactive queries, streaming, machine learning, and graph processing.
This tutorial will provide an accessible introduction to large-scale distributed machine learning and data mining, and to Spark and its potential to revolutionize academic and commercial data science practices. It is divided into two parts: the first part will cover fundamental Spark concepts, including Spark Core, functional programming à la MapReduce, data frames, the Spark Shell, Spark Streaming, Spark SQL, MLlib, and more; the second part will focus on hands-on algorithmic design and development with Spark (developing algorithms from scratch such as decision tree learning, association rule mining (Apriori), graph processing algorithms such as PageRank/shortest path, gradient descent algorithms such as support vector machines and matrix factorization, and deep learning). Industrial applications and deployments of Spark will also be presented. Example code will be made available in Python (PySpark) Jupyter notebooks.
As this tutorial will be hands-on in nature, please try to install the following on your laptops prior to attending: Spark 2.0 (or later), Python 2.7.12, and Jupyter notebooks. Webpage for Tutorial: http://CIKM2016-sparktutorial.droppages.com/. The slides (with pointers to notebooks) for the tutorial are located at:
- Introduction to Apache Spark and programming in Spark
- Machine learning with Spark
Dr. James G. Shanahan: Dr. James G. Shanahan has spent the past 25 years developing and researching cutting-edge artificial intelligence systems. He has (co-)founded several companies including: Church and Duncan Group Inc. (2007), a boutique consultancy in large scale AI which he runs in San Francisco; RTBFast (2012), a real-time bidding engine infrastructure play for digital advertising systems; and Document Souls (1999), a document-centric anticipatory information system. In 2012 he went in-house as the SVP of Data Science and Chief Scientist at NativeX, a mobile ad network that was acquired by MobVista in early 2016. In addition, he has held appointments at AT&T (Executive Director of Research), Turn Inc. (founding chief scientist), Xerox Research, Mitsubishi Research, and at Clairvoyance Corp (a spinoff research lab from CMU).
Dr. Shanahan has been affiliated with the University of California at Berkeley (and Santa Cruz) since 2008, where he teaches graduate courses on big data analytics, large-scale machine learning, and stochastic optimization. He also advises several high-tech startups (including Quixey, Aylien, VoxEdu, and others) and is executive VP of science and technology at Irish Innovation Center (IIC). He has published six books, more than 50 research publications, and over 20 patents in the areas of machine learning and information processing. Dr. Shanahan received his PhD in engineering mathematics from the University of Bristol, U. K., and holds a Bachelor of Science degree from the University of Limerick, Ireland. He is an EU Marie Curie fellow. In 2011 he was selected as a member of the Silicon Valley 50 (Top 50 Irish Americans in Technology).
Liang Dai: Liang Dai is a data scientist at NativeX, a leading ad technology company for mobile games. He works on large scale data mining projects on distributed platforms, e.g., AWS, Hadoop, Spark, etc. He focuses on the end-to-end data modeling pipeline, including preprocessing raw data, designing behavior and non-behavior features, selecting features based on experimental results, building predictive models and deploying models in production to handle large volumes of ad placement requests. Liang is also pursuing a Ph.D. in the Technology Information and Management department, UC Santa Cruz. There he does research in data mining on digital marketing, including campaign evaluation, online experiment design, customer value improvement, etc. Liang received the B.S. and the M.S. from the Information Science and Electronic Engineering department, Zhejiang University.
7. Data-Driven Behavioral Analytics: Observations, Representations and Models
Social networks, recommender systems and e-commerce websites have enabled large collections of behavioral data of unprecedented size and complexity. Compared with human experience-driven analytics, Data-Driven Behavioral Analytics aims at uncovering the underlying patterns in the behavioral data and generating great scientific value for behavioral science and social science as well as marketing value for accurate behavior prediction and suspicious behavior detection. What are the major concepts and principles in Data-Driven Behavioral Analytics? How can we observe the patterns from the large behavioral data? How can we represent human behaviors in different applications? What are the effective and efficient computational models for behavioral analysis? In this tutorial, we answer the above questions by introducing the novel methodology of data-driven approaches (from observations and representations to models). More about this tutorial is available here.
Dr. Meng Jiang is a Postdoctoral Research Fellow at the University of Illinois at Urbana-Champaign, USA. His research areas include behavior modeling, behavior prediction and suspicious behavior detection. He obtained his Ph.D. in 2015 from Tsinghua University, China. He visited Carnegie Mellon University in 2013 and the University of Maryland, College Park in 2016. He has published over 15 papers as the first author in top-tier conferences and journals including SIGKDD, CIKM, ICDM, AAAI and TKDE. He gave an ICDM 3-hour tutorial, Modeling Behaviors in Social Media, in 2015. He was a SIGKDD 2014 Best Paper Finalist.
Dr. Peng Cui is an Assistant Professor at Tsinghua University, China. His research areas include social network analysis and social multimedia computing. He obtained his Ph.D. in 2009 from Tsinghua University, China and joined the Department of Computer Science and Technology in 2012. He was honored with the ACM China Rising Star Award in 2015. He has received three best paper awards including the ICDM 2015 Best Student Paper Award.
Dr. Jiawei Han is Abel Bliss Professor in the Department of Computer Science, University of Illinois at Urbana-Champaign, USA. He is an ACM Fellow and an IEEE Fellow. He received the ACM SIGKDD Innovation Award (2004), the IEEE Computer Society Technical Achievement Award (2005), the IEEE Computer Society W. Wallace McDowell Award (2009), and the Daniel C. Drucker Eminent Faculty Award at UIUC (2011). He served in 2009-2016 as the Director of the Information Network Academic Research Center (INARC) supported by the Network Science-Collaborative Technology Alliance (NS-CTA) program of the U.S. Army Research Lab, and is currently serving as the co-Director of an NIH-funded BD2K Center, KnowEnG. His co-authored textbook “Data Mining: Concepts and Techniques” (Morgan Kaufmann) has been adopted worldwide.
8. Learning, Prediction and Optimisation in RTB Display Advertising
In display and mobile advertising, the most significant development in recent years is Real-Time Bidding (RTB), which allows selling and buying one ad impression at a time in real time. The ability to make impression-level bid decisions and target individual users in real time has fundamentally changed the landscape of digital media. The further demand for automation, integration and optimisation in RTB brings new research opportunities in the IR fields, including information matching with economic constraints, CTR prediction, user behaviour targeting and profiling, personalised advertising, and attribution and evaluation methodologies. In this tutorial, with presenters from both industry and academia, we aim to bring insightful knowledge from real-world systems, and to provide an overview of the fundamental mechanisms and algorithms with a focus on the IR context. We will also introduce to CIKM researchers a few datasets recently made available so that they can get hands-on quickly and pursue this research.
Weinan Zhang recently received his Ph.D. from University College London and is now an assistant professor at Shanghai Jiao Tong University. His research interests include machine learning, dynamic optimisation and their applications in RTB-based display advertising and recommender systems. In particular, he focuses on research into optimal DSP bidding strategies for RTB display advertising. He is also interested in deep learning models and has developed several domain-specific DNNs for predicting users’ online commercial behaviours. Weinan Zhang has published more than 20 papers in top international conferences including SIGKDD, CIKM, SIGIR, AAAI, RecSys and WI. He has also published in well-recognised journals including ACM TIST, IPM, and JMLR. He and Dr. Shuai Yuan won the final session of the iPinyou Global Bidding Algorithm Competition in 2013.
Jian Xu is currently Principal Data Scientist at TouchPal Inc., Mountain View, CA, where he is in charge of the overall monetization solutions. Before joining TouchPal, he served as Senior Data Scientist and Senior Research Engineer at Yahoo Inc., responsible for various advertising-related technologies such as response prediction, budget pacing, and bid optimization. His research interests center around Data Mining, Machine Learning, and Computational Advertising. His recent research includes developing high performance advertising systems and monetization from massive data. He has published or filed more than 10 US patents and published research papers in top academic conferences and journals such as SIGKDD, AAAI, ICDCS, and SIGKDD Explorations, which have received more than 500 citations. He has also served as a reviewer for well-recognized journals including TKDE, TIST, WWW, KAIS, and Big Data Research, and is on the Editorial Advisory Board of Information Systems (Elsevier). All the required information can be found on the following website: http://www.optimalrtb.com/cikm16/