This site (HighlyScalableSystems.com) is dedicated to highly scalable systems. Highly Scalable Systems publishes articles, tutorials and news on systems technologies, especially for cloud computing. You are welcome to share and publish technical articles. Please check here for contribution information.

Recent posts

  • Posted on Sunday September 14, 2014

    Hadoop 2, or YARN, is the new version of Hadoop. It adds the YARN resource manager in addition to the HDFS and MapReduce components. Hadoop MapReduce is a programming model and software framework for writing applications; it is an open-source variant of the MapReduce initially designed and implemented by Google for processing and generating large data […]
  • Posted on Tuesday March 18, 2014

    Benchmarks are important for understanding performance and for comparing different systems quantitatively and qualitatively. Many analytic frameworks, such as Hive, Impala and Shark, have been designed and implemented in recent years and have become fundamental software for processing big data. How to benchmark these big data analytic systems is an interesting problem. The Big Data Benchmark The […]
  • Posted on Tuesday February 04, 2014

    Public cloud storage services such as Amazon S3, Google Cloud Storage and Windows Azure Storage replicate data to ensure high availability. On the other hand, because data is replicated, the storage services exhibit certain data consistency models. Different cloud service providers employ different data consistency models. In this post, we survey the data […]
  • Posted on Thursday September 05, 2013

    John Ousterhout is a professor in the Department of Computer Science at Stanford University. One recent project he is working on is RAMCloud, a “new class of storage, based entirely in DRAM, that is 2-3 orders of magnitude faster than existing storage systems”. He posts his “Favorite Sayings” on his homepage. These sayings are precious […]
  • Posted on Saturday August 31, 2013

    When dealing with environments where memory is a constraint, it is important to design memory usage intelligently. Be it embedded systems or supercomputers, memory is always expensive. With each boolean value using a byte, a lot of memory is wasted. If not for addressability in languages like C and C++, booleans could have […]
  • Posted on Friday July 19, 2013

    Software Engineering Advice from Building Large-Scale Distributed Systems by Jeff Dean. Slides download: Software Engineering Advice from Building Large-Scale Distributed Systems by Jeff Dean. Numbers Everyone Should Know: L1 cache reference, 0.5 ns; Branch mispredict, 5 ns; L2 cache reference, 7 ns; Mutex lock/unlock, 100 ns; Main memory reference, 100 ns; Compress 1K bytes with […]
  • Posted on Wednesday July 17, 2013

    Here is a list of tutorials for learning how to write MapReduce programs on Hadoop, the open-source MapReduce implementation with HDFS. MapReduce Tutorials: The official tutorial on Hadoop MapReduce framework: http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html. Yahoo! Hadoop Tutorial: A comprehensive tutorial on Hadoop from Yahoo! Developer Network: http://developer.yahoo.com/hadoop/tutorial/. More about MapReduce: To better understand the design behind MapReduce, it […]
  • Posted on Wednesday April 10, 2013

    I compiled a list of good systems conferences and deadlines for my own reference. I share it here in the hope that it helps others who need such a list. This list is kept up to date. A PDF version: Systems Conference and Deadlines. Links to conference websites: Systems Conferences.
  • Posted on Tuesday January 22, 2013

    Storage Architecture and Challenges, presented at the Faculty Summit, July 29, 2010, by Andrew Fikes, Principal Engineer. Download PDF. These slides introduce some of Google’s storage systems, with insights and discussion of problems.
  • Posted on Tuesday January 22, 2013

    Designs, Lessons and Advice from Building Large Distributed Systems by Jeff Dean. Everyone who is interested in large distributed systems should read: PDF for Designs, Lessons and Advice from Building Large Distributed Systems by Jeff Dean.
  • Posted on Friday December 21, 2012

    Update on Mar. 7, 2014: it seems that Faraz has graduated from Purdue and the webpages are no longer available. Please check the comment for the latest links. MapReduce is a well-known programming model designed for generating and processing large data. There are various MapReduce implementations. The most widely known and used one is probably Hadoop. […]
  • Posted on Tuesday December 18, 2012

    TeraSort is one of Hadoop’s widely used benchmarks. Hadoop’s distribution contains both the input generator and the sorting implementation: TeraGen generates the input and TeraSort performs the sorting. Here, we provide a short tutorial for using the Hadoop TeraSort benchmark. TeraGen generates random data that can be used as input data for a subsequent running […]
  • Posted on Tuesday December 11, 2012

    Research on cloud computing has made great progress, and many excellent large-scale systems have been designed in recent years. I compiled a list of some large-scale data storage and processing systems in datacenters as follows. Storage systems Google File System (GFS): http://research.google.com/archive/gfs.html HDFS implementation: http://hadoop.apache.org/docs/stable/hdfs_design.html Colossus (GFS2): http://www.fclose.com/b/cloud-computing/3202/colossus-successor-to-google-file-system-gfs/ BigTable: http://research.google.com/archive/bigtable.html Megastore: http://research.google.com/pubs/pub36971.html Spanner: http://research.google.com/archive/spanner.html Dynamo: […]
  • Posted on Tuesday December 11, 2012

    Cosmos is “Microsoft’s internal data storage/query system for analyzing enormous amounts (as in petabytes) of data”. No paper or technical report about Cosmos has been published yet. I compiled a list of information about Cosmos on the Web as follows. What is Microsoft’s Cosmos service? by Yaron Y. Goland. Microsoft Cosmos: Petabytes perfectly processed perfunctorily by Seth […]
  • Posted on Friday November 30, 2012

    Colossus is the successor to the Google File System (GFS), as mentioned in the recent paper on Spanner at OSDI 2012. Colossus is also used by Spanner to store its tablets. Information about Colossus is scarce compared with GFS, which was published in a paper at SOSP 2003. There is still some information about […]
  • Posted on Thursday October 25, 2012

    I tried to find on the Internet a ranking of the top conferences by average number of citations over the last 5 years, but failed to find one. However, there are many rankings of overall citations and numbers of publications. Hence, it is not hard to calculate the average number of citations […]
  • Posted on Tuesday October 09, 2012

    Update: If you are new to Hadoop and trying to install it, please check the newer version: Hadoop Installation Tutorial (Hadoop 2.x). Hadoop mainly consists of two parts: Hadoop MapReduce and HDFS. Hadoop MapReduce is a programming model and software framework for writing applications; it is an open-source variant of MapReduce that was initially designed […]
  • Posted on Saturday September 15, 2012

    Understanding the literature is usually the first step in doing research, and the same holds for system research on cloud computing. A reading list can be a great help to those just starting out in cloud computing research. Prof. Lin Gu, my PhD supervisor, compiled a reading list for system research on cloud computing. The reading […]
  • Posted on Saturday September 01, 2012

    This post lists important conferences related to cloud computing in 2013. SOSP 2013 SOSP’13: The 24th ACM Symposium on Operating Systems Principles. November 3-6, 2013, Nemacolin Woodlands Resort, Pennsylvania. The biennial ACM Symposium on Operating Systems Principles is the world’s premier forum for researchers, developers, programmers, and teachers of computer systems technology. Academic and […]
  • Posted on Sunday January 15, 2012

    Hadoop’s namenode and datanodes expose a number of TCP ports used by Hadoop’s daemons to communicate with each other or to listen directly for users’ requests. This port information is needed by both Hadoop users and cluster administrators to write programs or configure firewalls/gateways accordingly. A post written by Philip Zeyliger on Cloudera’s blog summarizes the […]
