Bigtable: A Distributed Storage System for Structured Data

To improve applicability, scalability, performance, and availability when storing very large data sets, the authors implemented and deployed a distributed storage system called Bigtable; this is the main motivation of the paper. To manage large data, the system gives clients a simple data model with dynamic control over data layout and format, as described in the following paragraph.

As for contributions, the authors spent roughly seven person-years on design and implementation. They introduce an interesting data model: a sorted map indexed by row key, column key, and timestamp, in which column keys are grouped into column families that form the basic unit of access control. The refinements and the performance evaluation described in the paper also show improvements. Three real applications or products have succeeded by using the Bigtable implementation and its concepts.
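To make the data model more concrete, here is a minimal sketch in Java, assuming a single in-memory sorted map. It is only an illustration of the (row, column, timestamp) to value idea, not how Bigtable is actually implemented; the names BigtableModelSketch, CellKey, put, getLatest, and familyOf are my own and do not come from the paper.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: an in-memory sorted map, not Bigtable's implementation.
public class BigtableModelSketch {
    // Hypothetical key type: one cell version is addressed by (row, column, timestamp).
    static final class CellKey {
        final String row;
        final String column;   // written as "family:qualifier"
        final long timestamp;

        CellKey(String row, String column, long timestamp) {
            this.row = row;
            this.column = column;
            this.timestamp = timestamp;
        }
    }

    // Order by row, then column, then newest timestamp first.
    static final Comparator<CellKey> ORDER = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c != 0) return c;
        c = a.column.compareTo(b.column);
        if (c != 0) return c;
        return Long.compare(b.timestamp, a.timestamp);
    };

    private final TreeMap<CellKey, String> cells = new TreeMap<>(ORDER);

    public void put(String row, String column, long timestamp, String value) {
        cells.put(new CellKey(row, column, timestamp), value);
    }

    // Returns the most recent value stored for (row, column), or null if there is none.
    public String getLatest(String row, String column) {
        // Long.MAX_VALUE sorts before every real version of this (row, column).
        Map.Entry<CellKey, String> e =
            cells.ceilingEntry(new CellKey(row, column, Long.MAX_VALUE));
        if (e == null) return null;
        CellKey k = e.getKey();
        return (k.row.equals(row) && k.column.equals(column)) ? e.getValue() : null;
    }

    // The column family is the part of the column key before the first ':'.
    public static String familyOf(String column) {
        int i = column.indexOf(':');
        return i < 0 ? column : column.substring(0, i);
    }

    public static void main(String[] args) {
        BigtableModelSketch table = new BigtableModelSketch();
        table.put("com.cnn.www", "contents:", 3L, "<html>version 3");
        table.put("com.cnn.www", "contents:", 5L, "<html>version 5");
        table.put("com.cnn.www", "anchor:cnnsi.com", 9L, "CNN");
        System.out.println(table.getLatest("com.cnn.www", "contents:"));
        System.out.println(familyOf("anchor:cnnsi.com"));
    }
}
```

Keeping the newest timestamp first means a single lookup finds the latest version of a cell, and grouping columns by the prefix before the colon is what lets a family act as one unit for access control.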

The paper’s most noticeable deficiencies are already described by the authors themselves. For example, the system does not consider the possibility of multiple copies of the same data, and it lets the user say which data belongs in memory and which should stay on disk rather than trying to determine this dynamically. Lastly, there are no complex queries to execute or optimize. Bigtable seems to take data manipulation to a whole new level; however, my remaining question concerns networking, because it seems to me that latency plays an important role in retrieving or displaying query results. In my personal opinion, there is still a bottleneck, because the system is a set of distributed servers that requires a high-performance network infrastructure to achieve its best performance.

I would rate the significance of the paper 5/5 (breakthrough), because the Bigtable model is amazing: it can adapt to handle very large data, and it has been used in many popular applications that we use today, for example Google products such as Google Earth and Google Analytics. The concept of adding new machines when database operations need more performance is spectacular. I believe that Bigtable will be very useful in the future, and we will most likely see upcoming products from such companies adopt this model to improve their use of databases.

Reference:
Bigtable: A Distributed Storage System for Structured Data, F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber, Proc. of the 7th USENIX Symposium on Operating Systems Design and Implementation, November 2006, pp. 205-218.

Serverless Network File Systems

Serverless Network File Systems, T. Anderson, M. Dahlin, J. Neefe, D. Patterson, D. Roselli, and R. Wang, Proc. of the 15th ACM Symposium on Operating Systems Principles, December 1995, pp. 109-126.

The authors believe that the traditional central-server network file system still has a bottleneck, because all read misses and writes go through the central server. It is also expensive, because it requires people to operate the server and balance server load. Therefore, they introduce a serverless file system, which distributes file system server responsibilities across large numbers of cooperating machines. Concretely, the authors implemented a prototype serverless network file system called xFS to provide better performance and scalability than traditional file systems.

There are three factors that motivate their work on serverless network file systems: first, the opportunity provided by fast switched LANs; second, the expanding demands of users; and last, the fundamental limitations of central server systems.

Turning to contributions, the authors make two sets of contributions. First, xFS synthesizes a number of recent innovations that together provide a basis for serverless file system design. Second, they transform DASH’s scalable cache consistency approach into a more general, distributed control system that is also fault tolerant. Moreover, they improve on Zebra to eliminate bottlenecks.

The paper’s single most noticeable deficiency is the limitation of the measurements: the workloads are not real workloads but microbenchmarks, which offer more parallelism, and therefore better performance, than real workloads would. Another limitation of the measurements is that they compare against NFS, a baseline whose own scalability is limited.

This paper seems very solid and interesting to me, and I like many of its ideas, for example using cooperative caching to serve data from client memory. However, I still have a question about the future work and its limitations: which real workloads would the authors most likely measure, and what results would they expect to see on such workloads?

I would rate this paper 5/5 (breakthrough) for its challenging idea and for the authors’ implementation and measurements. It improves on the old-fashioned central server in terms of performance, scalability, and availability. It could also help reduce hardware cost.

The Multics virtual memory: concepts and design

The Multics virtual memory: concepts and design, A. Bensoussan, C. T. Clingen and R. C. Daley, Communications of the ACM, Vol. 15, No. 5, May 1972, pp. 308-318.

As we know, the use of on-line operating systems has been growing, along with the need to share information among system users. In Multics, that sharing is done through segmentation. To take advantage of the direct addressability of large amounts of information made possible by large virtual memories, the authors developed Multics (Multiplexed Information and Computing Service) to provide a generalized basis for directly accessing and sharing on-line information. There are two goals: first, it must be possible for all on-line information stored in the system to be addressed directly by a processor; second, it must be possible to control access.

Regarding the authors’ contributions, they introduce an idealized memory built on the segmentation and paging features of the 645, assisted by software features. To take advantage of existing mechanisms, they also introduce Multics processes and the Multics supervisor. The symbolic addressing conventions make the system easier to use: a user can reference a segment by part of its pathname, and the rest of the pathname is supplied according to system conventions. Moreover, making a segment known to a process and improving the segment fault handler give Multics a good deal of performance.

The paper’s single most noticeable deficiency is that it makes too many assumptions, which leaves readers confused about how to use the features of Multics. The conclusion should summarize what the authors contributed and how it could be improved in future work, instead of presenting the user and supervisor viewpoints. It would also help if the authors explained in more detail how the selection algorithm works. My question about the paper is how much it improves on the older approach.

Lastly, I would rate the significance of the paper 3/5 (modest), because it was published more than 30 years ago. It lacks experiments and a compare-and-contrast with the use of segmentation.

Why Thread?

Here is a quick introduction to the concept of a thread. A thread is a lightweight process: it takes less time to create, context switch, or destroy a thread than a process. We want simultaneous activities for better interaction with the user, or to take advantage of multiple processors to achieve maximum utilization of system resources.

For instance, in a word processor, one thread is responsible for I/O while another runs a grammar check. Now, let’s take a look at a simple Java program to see how a thread is created and run, and how work is assigned to threads in Java.


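// A thread is defined by extending Thread and overriding run().
public class SimpleThread extends Thread {
    // Body of the child thread; runs on the new thread after start().
    @Override
    public void run() {
        for (int i = 0; i < 10; i++) {
            System.out.println("Child thread " + i);
        }
    }

    public static void main(String[] args) {
        SimpleThread t = new SimpleThread();
        t.start(); // creates the child thread, which then executes run()

        // Meanwhile, the parent (main) thread keeps running this loop.
        for (int i = 0; i < 10; i++) {
            System.out.println("Parent thread " + i);
        }
    } // end main
} // end class SimpleThread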

In the code above there are two threads: the parent (main) thread and the child thread that it creates. When the new thread is started, it executes the run() method, so the output of the program shows the two threads executing concurrently.

Program output (one possible interleaving; the order can differ between runs):

Parent thread 0
Parent thread 1
Child thread 0
Parent thread 2
Child thread 1
Parent thread 3
Child thread 2
Parent thread 4
Child thread 3
Parent thread 5
Child thread 4
Parent thread 6
Child thread 5
Parent thread 7
Child thread 6
Parent thread 8
Child thread 7
Parent thread 9
Child thread 8
Child thread 9

A Dynamic Data Race Detector for Multithreaded Programs

Eraser: A Dynamic Data Race Detector for Multithreaded Programs, by Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro, and Thomas Anderson, ACM Transactions on Computer Systems, Vol. 15, No. 4, November 1997, pp. 391-411.

According to the paper, Eraser: A Dynamic Data Race Detector for Multithreaded Programs, data races are hard to detect, so programmers suffer when programming with threads. There is already work on the data race problem based on Lamport’s happens-before relation; however, it is costly, so the authors introduce a new method. These are the authors’ main motivations for the paper.

The authors contribute a dynamic race detection tool called “Eraser”. The tool monitors the program’s reads and writes as it executes, and they state that it is more effective and less sensitive to the particular thread interleaving than manual debugging. Another main contribution of the paper is the Lockset algorithm, which is used to detect data races in multithreaded programs.
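The basic idea of Lockset can be sketched in a few lines of Java. This is a simplified illustration under my own assumptions, not Eraser itself: threads, locks, and variables are just strings, the events are fed in by hand instead of being captured by instrumenting a binary, and the refinements for initialization and read-shared data are left out. The class and method names (LocksetSketch, acquire, release, access) are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the basic lockset refinement; not the Eraser tool.
public class LocksetSketch {
    // Locks currently held by each (simulated) thread.
    private final Map<String, Set<String>> held = new HashMap<>();
    // Candidate lock set C(v) per variable; absent means "not accessed yet".
    private final Map<String, Set<String>> candidates = new HashMap<>();

    public void acquire(String thread, String lock) {
        held.computeIfAbsent(thread, t -> new HashSet<>()).add(lock);
    }

    public void release(String thread, String lock) {
        Set<String> locks = held.get(thread);
        if (locks != null) {
            locks.remove(lock);
        }
    }

    // Records an access to a variable and returns true if no single lock
    // has protected every access so far (a possible data race).
    public boolean access(String thread, String variable) {
        Set<String> locks = held.getOrDefault(thread, Collections.<String>emptySet());
        Set<String> c = candidates.get(variable);
        if (c == null) {
            // First access: "all locks" intersected with the held locks.
            c = new HashSet<>(locks);
            candidates.put(variable, c);
        } else {
            // Later accesses: keep only locks that were held every time.
            c.retainAll(locks);
        }
        return c.isEmpty();
    }

    public static void main(String[] args) {
        LocksetSketch detector = new LocksetSketch();
        detector.acquire("T1", "mu1");
        System.out.println("warning after T1: " + detector.access("T1", "v"));
        detector.release("T1", "mu1");

        detector.acquire("T2", "mu2");
        System.out.println("warning after T2: " + detector.access("T2", "v"));
        detector.release("T2", "mu2");
    }
}
```

In the demo, the variable v is protected by mu1 on the first access and by mu2 on the second, so the candidate set becomes empty and the second access is flagged, even though the two accesses never actually overlapped in time.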

Moreover, Eraser can detect race conditions in an operating system kernel. For their experiments, the authors test Eraser on real programs and applications: the HTTP server and indexing engine from AltaVista, the Vesta cache server, the Petal distributed disk system, and various programs from students’ programming assignments. The authors do not focus on performance, since the overhead is high, but they believe the tool is fast enough to debug most programs, and they concentrate instead on the false alarms it raises when reporting data races.

The main deficiency of this paper is that Eraser cannot prove that the tested program is free of data races. Also, checking for data races dynamically is impractical. The experiments should cover most of the operating systems in use today, and a variety of programming languages should be tested instead of only C++. Moreover, the use of Eraser should be described for the audience, so that readers know how the tool behaves on each test program. Graphs and performance numbers should be provided instead of descriptions of what happened in the programs they ran.

I would rate the significance of this paper 4/5 (modest) due to the challenging topic and idea.

Scalable Threads for Internet Services

Capriccio: Scalable Threads for Internet Services, R. V. Behren, J. Condit, F. Zhou, G. C. Necula, E. Brewer, Proc. of the Nineteenth Symposium on Operating System Principles (SOSP-19), Lake George, New York. October 2003, pp. 268-281.

Thread-based versus event-based programming has been a popular topic recently. In this paper, the authors show strong motivation and a strong contribution: a scalable thread package for high-concurrency servers, called Capriccio.

The authors note many disadvantages of event-based programming. For instance, “stack ripping”, where programmers have to save and restore live state by hand, is too complicated to use. The authors believe that thread-based programming makes life easier and can achieve high concurrency just as event-based programming does.
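To show what stack ripping looks like in practice, here is a small Java sketch that handles the same request in both styles. It is only an illustration under my own assumptions: parse and fetch are made-up stand-ins for real work, the "event loop" is a plain queue of callbacks, and none of the names come from Capriccio.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

// Illustrative sketch: the same request handled thread-style and event-style.
public class StackRippingSketch {
    // Hypothetical stand-ins for real request parsing and blocking I/O.
    static String parse(String request) { return request.trim(); }
    static String fetch(String header) { return "body-of-" + header; }

    // Thread style: live state (header, body) simply stays in local variables.
    public static String handleThreaded(String request) {
        String header = parse(request);
        String body = fetch(header); // a real server would block here
        return header + ":" + body;
    }

    // Event style: the state must be moved into an explicit object so that it
    // survives between callbacks. This manual save/restore is the "ripping".
    static final class RequestState {
        String header;
        String body;
    }

    static void handleEvented(String request, Queue<Runnable> loop, Consumer<String> done) {
        RequestState state = new RequestState();
        state.header = parse(request);
        loop.add(() -> {
            // Continuation 1: runs once the "I/O" is ready.
            state.body = fetch(state.header);
            // Continuation 2: delivers the result.
            loop.add(() -> done.accept(state.header + ":" + state.body));
        });
    }

    // Drives a tiny single-threaded event loop until no callbacks remain.
    public static String runEvented(String request) {
        Queue<Runnable> loop = new ArrayDeque<>();
        String[] result = new String[1];
        handleEvented(request, loop, r -> result[0] = r);
        while (!loop.isEmpty()) {
            loop.poll().run();
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(handleThreaded(" get /index "));
        System.out.println(runEvented(" get /index "));
    }
}
```

Both versions compute the same result, but the event version needs an extra state class and two callbacks for what is three straight lines in the thread version, and that gap grows with every additional blocking step.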

To make the thread-based model better than the event-based model, they built the thread package as user-level threads, because user-level threads have advantages over kernel threads in terms of performance and flexibility. The implementation of Capriccio is impressive in that we don’t have to modify our applications to use the features of the thread package. Capriccio takes advantage of new mechanisms in the latest Linux for its synchronization, I/O, and scheduling. This is why the benchmark results shown in the paper are surprisingly good for thread creation, context switching, and so on; Capriccio is faster than the original Linux threads and the other packages it is compared against.

The ideas of linked stack management, resource-aware scheduling, and the blocking graph, along with modifications to some algorithms, surely improve system utilization. The performance results from their evaluation, which compares against existing web servers such as Apache and Haboob, look realistic, because the benchmarks they use are real-world applications, and Capriccio performs very well in both scalability and scheduling.

However, we already know that there must be some disadvantages to the thread-based model. One that concerns me is the case of multiple processors, for both homogeneous and heterogeneous chip types. The authors mention a drawback of user-level threading: it can make it more difficult to take advantage of multiple processors. As we know, SMP (symmetric multiprocessing) and CMP (chip multiprocessor) systems, such as Intel’s dual-core chips, have become increasingly common in the computer market. I wonder whether the thread-based model will take better advantage of multiple processors than the event-based model. What if we mixed user-level and kernel-level threads instead of employing only user-level threads? The future work section in the paper doesn’t give much detail on this issue.

Lastly, I would rate the significance of this paper 5/5 (breakthrough), because the authors used and modified many mechanisms and created a new thread package to show us that thread-based programming is the better choice for high-concurrency Internet servers. Their dedication and ideas are impressive.

Why Events Are A Bad Idea

Why Events Are A Bad Idea (for high-concurrency servers), R. Behren, J. Condit and E. Brewer, Proceedings HotOS IX, Kauai, Hawaii, May 2003, pp. 19-24.

As we know, threads versus message passing (event-based programming) has been debated lately in terms of which performs best, and many people believe that event-based programming is better than thread programming in many ways. In the paper, the authors’ main motivation is to show that thread programming is better than event-based programming for highly concurrent applications. They show that threads can perform about as well as events in many of the criticized cases, and could do even better with improvements to the compiler. They also conclude, based on their analysis of the simulation they built, that threads will outperform event-based programming. In this review, I will explain the authors’ main contributions and their deficiencies. Lastly, I will rate the significance of the paper based on my personal opinion.

According to the paper, events and threads differ in their mechanics: events use event handlers and send and wait for messages, while threads use function calls, forks, and so on. The authors also describe the problems with threads that have been criticized by those who think events do better, such as performance, control flow, synchronization, state management, and scheduling. They argue that these problems are caused by how threads have been implemented rather than by threads themselves.

To convince us that threads can perform better than events, they point out two important properties. In modern servers, the requests from clients are largely independent, and the code that handles each request is sequential. So they came up with an experiment that modifies the compiler and integrates the compiler with the runtime system. Moreover, they ran the simulation and analyzed the results, finding that the event-based approach requires too many context switches and uses too much heap because its execution is so dynamic. Therefore, they conclude that threads avoid this kind of problem and can give better execution time.

In my opinion, the deficiency is that they haven’t done enough experiments on other cases: they could test on other operating systems, or use other benchmark suites with various inputs, before concluding that simple thread programming performs better than event-based programming. Threads versus message passing is an interesting topic, but in real-world practice it would cost a great deal of time and effort to modify or integrate the compiler and runtime as the paper describes. Finally, what if their future results show a big advantage for threads and a huge performance difference between the two, but in reality many programmers still don’t quite understand how threads really work? Would we then actually achieve good utilization of the computer resources we have?

I would rate the significance of this paper 3/5 because of the lack of evidence from real applications and the lack of references to other research that supports the authors’ arguments.