ADS SlidesRicardo
Course Outline

Errors in Experimental Measurements
• Accuracy, precision, and resolution
• Error and the Gaussian distribution
• Quantifying errors: confidence intervals, the t-distribution, the binomial distribution

Comparing Alternatives
• Hypothesis tests
• Comparing two alternatives: before-and-after and non-corresponding measurements
• Comparing proportions
• Comparing more than two alternatives: ANOVA (analysis of variance), F-test, contrasts

Graphs
• Histogram and frequency graph
• Boxplot
• Stem-and-leaf diagram
• Scatterplot (dispersion diagram)

Bibliography
• David J. Lilja, Measuring Computer Performance: A Practitioner's Guide, Cambridge University Press, 2000, http://labq.com/perf-book.shtml
• R. Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, Wiley-Interscience, New York, NY, April 1991, ISBN 0471503361

Slides and schedule: http://www.cin.ufpe.br/~rmfl/ADS_MaterialDidatico

Common Mistakes in Performance Evaluation
Prof. Ricardo Massa F. Lima, CIn - UFPE

No Goals

No Goals (continued)

Many performance efforts are started without clear goals. A performance analyst is routinely hired along with the design team and may immediately start modeling or simulating the design. What about the goals? The implicit assumptions are that the model will help answer any design question that may arise, and that it will be flexible enough to be easily adapted to different problems.

... before writing the first line of simulation code, the first equation of an analytical model, or before setting up a measurement experiment, it is important to understand the system and identify the problem to be solved ...

Once the problem is clear and the goals have been written down, finding the solution is often easier. There is no such thing as a general-purpose model for performance analysis: each model must be developed with a particular goal in mind, and the metrics, workloads, and methodology all depend upon that goal. The part of the system design that needs to be studied in the model varies from problem to problem.

Setting goals is not a trivial exercise. Most performance problems are vague when first presented. For example, "find the timeout algorithm for retransmissions on a network" immediately raises further questions, such as how the load on the network should be adjusted under packet loss.

Biased Goals

Biased Goals (continued)

"Let's show that OUR system is better than THEIRS." The problem then becomes one of finding the metrics and workloads under which OUR system turns out better, rather than finding the right metrics and workloads for comparing the two systems. One rule of professional etiquette for performance analysts is to be unbiased.

Unsystematic Approach

Analysts often adopt an unsystematic approach, selecting system parameters, factors, metrics, and workloads arbitrarily. This leads to inaccurate conclusions. Examples:
• My computer is faster at floating-point operations, so use a floating-point benchmark to compare the computers' performance.
• Allocate only senior employees to the project (ignoring the existence of trainees) and then calculate the performance of my team.

A Systematic Approach
1. State goals and define the system
2. List services and outcomes
3. Select metrics
4. List parameters
5. Select factors to study
6. Select evaluation technique
7. Select workload
8. Design experiments
9. Analyze and interpret data
10. Present results

Analysis without Understanding the Problem

... analysts who are trained in the modeling aspects of performance evaluation but not in problem definition or result presentation often find their models being ignored by the decision makers, who are looking for guidance and not a model ...

Inexperienced analysts feel that nothing has really been achieved until a model has been constructed and some numerical results have been obtained. Experienced analysts know that a large share of the analysis effort goes into defining the problem; this share often takes up to 40% of the total effort. Development of the model itself is a small part of the problem-solving process.

Incorrect Performance Metrics

... analysts often choose metrics that can be easily computed or measured rather than the ones that are relevant ...

Examples of incorrect performance metrics:
• Comparing the performance of two different CPU architectures (RISC and CISC) using MIPS, even though instructions on the two computers represent unequal amounts of work.
• Evaluating the viability of a new product using only the minimum price at which the company is able to develop it, without asking whether clients are interested in the product or how much they are willing to pay for it.

Unrepresentative Workload

... the workload should be representative of the actual usage of the system ...
... the workload has a significant impact on the results of a performance study ...
... the wrong workload will lead to inaccurate conclusions ...

Wrong Evaluation Technique

There are three evaluation techniques: measurement, simulation, and analytical modeling.

... analysts often have a preference for one evaluation technique that they use for every performance evaluation problem ... Examples:
1. Those proficient in queueing theory tend to turn every performance problem into a queueing problem, even if the system is too complex to model and is readily available for measurement.
2. Those proficient in programming tend to solve every problem by simulation.

... this marriage to a single technique leads to a model that they can best solve rather than to a model that can best solve the problem ...
... an analyst should have a basic knowledge of all three techniques ...

Overlooking Important Parameters

... it is a good idea to make a complete list of system and workload characteristics (parameters) that affect the performance of the system ...
Workload parameters may include the number of users, request arrival patterns, priority, and so on.
... the final outcome of the study depends heavily upon the selected parameters ...
... overlooking one or more important parameters may render the results useless ...

Ignoring Significant Factors

... a factor is a parameter that is varied in the study; other parameters may be fixed at their typical values ...
... parameters that have a significant impact on performance should be used as factors ...
Example: if the packet arrival rate rather than the packet size affects the response time of a network gateway, it is better to use several different arrival rates when studying its performance.
... do not waste time comparing alternatives that the end user cannot adopt ...
... the choice of factors should be based on their relevance and not on the analyst's knowledge of the factors ...
Example: an analyst may know the time distribution of page references in a computer system but have no idea of the time distribution of disk references. Mistake: use the page-reference distribution as a factor but ignore the disk-reference distribution, even though the disk may be the bottleneck.

Ignoring Significant Factors (continued)

Example: a software manager may have all the information about the performance of software projects but no control over software quality. Mistake: increase the number of programmers while keeping the number of testing engineers constant, even though software maintenance due to program errors consumes most of the project's resources.

Inappropriate Experimental Design

Experimental design concerns the number of measurements or simulations to be conducted and the parameter values used in each experiment. Improper selection of these values can result in a waste of the analyst's time and resources.

... in a naive experimental design, each factor is changed one at a time ...
... this "simple design" may lead to wrong conclusions if the effect of one parameter depends upon the values of other parameters ...
... better alternatives are full factorial and fractional factorial experimental designs ...

Inappropriate Level of Detail

... avoid formulations that are either too narrow or too broad ...
... the goals of a study have a significant impact on what is modeled and how it is analyzed ...
... use a simple high-level model for comparing alternatives that are very different (several alternatives can be analyzed rapidly and inexpensively) ...
... use a detailed model for comparing alternatives that are slight variations of a common approach ...
Common pitfalls: models that are too detailed, too superficial, or that model the wrong parameters.

No Analysis

Performance analysts who are good in measurement techniques but lack data-analysis expertise collect enormous amounts of data but do not know how to analyze or interpret it. The result is a set of disks full of data without any summary. At best, the analyst may produce a thick report full of raw data and graphs without any explanation of how to use the results.

Erroneous Analysis

Example: using the arithmetic mean to summarize execution rates.

No Sensitivity Analysis

... sensitivity analysis determines the importance of different parameters ...
... it is a technique for systematically changing parameters in a model to determine the effects of such changes ...
Example (business): as part of capital-budgeting decisions, determine the importance of different parameters on the demand for a given product:
• Competition
• Investment
• Environmental impact
• Political trend
• Economic situation
• Social aspects

Ignoring Errors in Input

Improper Treatment of Outliers

... if an outlier is not caused by a real system phenomenon, it should be ignored ...
... if the outlier is caused by a real system phenomenon, it should be appropriately included in the model ...
... deciding which outliers should be ignored and which should be included is part of the art of performance evaluation and requires a careful understanding of the system ...

Assuming No Change in the Future

... the future workload and system behavior are assumed to be the same as those already measured ...
... the analyst should discuss this assumption and limit how far into the future predictions are made ...

Ignoring Variability

Ignoring Variability (continued)

... if the variability is high, the mean alone may be misleading to the decision makers ...
... decisions based on daily averages may not be useful if the load has large hourly peaks, which impact performance ...

Too Complex Analysis

... performance analysts should convey final conclusions in as simple a manner as possible ...
... given two analyses leading to the same conclusion, the one that is simpler and easier to explain is obviously preferable ...
... in the industrial world, decision makers are rarely interested in the modeling technique or its innovativeness ...
... a majority of day-to-day performance problems are solved by simple models ...
... even if time were not restricted, complex models are not easily understood by decision makers, and the model results may be disbelieved ...
... there is a significant difference between the types of models published in the literature and those used in the real world ...
... models published in the literature are generally too complex ...
... the ability to develop and solve a complex model is valued more highly in academic circles than the ability to draw conclusions from a simple model ...

Improper Presentation of Results

Improper Presentation of Results (continued)

... an analysis that does not produce any useful results is a failure ...
... selling the results of the analysis to the decision makers is the responsibility of the analyst ...
... this requires the prudent use of words, pictures, and graphs to explain the results and the analysis ...
... the right metric to measure the performance of an analyst is not the number of analyses performed but the number of analyses that helped the decision makers ...

Ignoring Social Aspects

... writing and speaking are social skills, while modeling and data analysis are substantive skills ...
... decision makers are under time pressure and would like to get to the final results as soon as possible ...
... analysts spend a lot of time on the analysis and are more interested in talking about the innovativeness of the modeling approach ...
... this disparity in viewpoint may lead to a report that is too long and fails to make an impact ...
... only analysts with good social skills are successful in selling their results to the decision makers ...
... acceptance of the analysis results requires developing trust between the decision makers and the analysts ...
... the presentation to the decision makers should have minimal analysis jargon and emphasize the final results ...
... the presentation to other analysts should include all the details of the analysis techniques ...
... combining these two presentations into one could make it meaningless for both audiences ...

Omitting Assumptions and Limitations

Omitting Assumptions and Limitations (continued)

... analysts list the assumptions and limitations at the beginning of the report, forget them by the end, and draw conclusions about environments to which the analysis does not apply ...

(break)

A Systematic Approach to Performance Evaluation
Prof. Ricardo Massa F. Lima, CIn - UFPE

A Systematic Approach
1. State goals and define boundaries
2. Select evaluation techniques
3. Select performance metrics
4. List system and workload parameters
5. Design experiments
6. Select factors and values
7. Select workload
8. Analyze and interpret the data
9. Present the results. Repeat.

State Goals and Define Boundaries

Just "measuring performance" or "seeing how it works" is too broad. Example of a usable goal: decide which ISP (Internet Service Provider) provides better throughput.

The definition of the system (its boundaries) may depend upon the goals:
• if measuring CPU instruction speed, the system may include CPU + cache;
• if measuring response time, the system may include CPU + memory + ... + OS + user workload.

Project

Set up the groups (maximum: 4 students) and choose an application field: computer networks, computer architecture, computer programs, software projects, web services, cloud computing, manufacturing systems, supply chain, business processes, business administration, etc.

Project example:
• State goals: evaluate the level of CO2 emission
• Define boundaries: auto industry
• Define the system: fuel injector system

Different typical goals of performance analysis:
• Compare alternatives: provide quantitative information about which configurations are best under specific conditions.
• Determine the impact of a feature (before-and-after comparison): evaluate the impact of adding or removing a well-defined component of the system.
• System tuning: find the set of parameter values that produces the best overall performance.
• Identify relative performance: quantify the change in performance relative to previous generations of the system, a customer's expectations, or a competitor's systems.
• Performance debugging: apply tools and analysis techniques to determine why the program is not meeting performance expectations.
• Set expectations (capacity planning): set the appropriate expectations for what a new system generation will actually be capable of doing.

A Systematic Approach (see the nine steps above) - step 2: Select evaluation techniques

Comparing Techniques for Performance Evaluation
Prof. Ricardo Massa F. Lima, CIn - UFPE

Select Evaluation Technique

There are three fundamental techniques for performance analysis: measurement, simulation, and analytical modeling. The choice depends upon the time and resources available and the desired level of accuracy. The effort invested in the performance-analysis task should be proportional to the cost of making the wrong decision:
• comparing different manufacturers' systems for a large purchasing decision: substantial cost of making the wrong decision;
• purchasing for personal use: the cost of choosing the wrong one is minimal.

Roughly: analytic modeling is quick but less accurate; simulation takes medium effort for medium accuracy; measurement takes the most effort and gives the most accurate results.

Measurement
• Measurement of existing systems provides the best results: no simplifying assumptions need to be made.
• Not flexible: it provides information only about the specific system being measured.
• Measuring performance can be time-consuming and difficult.

Simulation
• A program that models the important features of the system being analyzed.
• Flexibility: easy to modify to study the impact of changes.
• Costs include the effort to write and debug the simulator and the time to execute the necessary simulations; these costs can be relatively low compared with the cost of purchasing a real machine on which to perform the corresponding experiments.
• Limitation: it is impossible to model every small detail of the system, so simplifying assumptions are required in order to write the simulation program and to allow it to execute in a reasonable amount of time. These simplifications limit the accuracy of the results.

Analytical Modeling
• A mathematical description of the system.
• Results tend to be much less believable and much less accurate than those obtained from simulation or measurement.
• A simple model can provide quick insight into the behavior of the system; this insight can then be used to focus a more detailed measurement or simulation experiment.
• It can also provide a coarse level of validation of a simulation or measurement.

Example: what is the impact of memory access on the overall execution time?

Measurement (example)
Measuring this time on a real machine can be quite difficult: the operation of a complex memory hierarchy is not observable from a user's application program. A possible approach is to write simple programs that exercise specific portions of the memory hierarchy:
• a program that repeatedly references the same variable can be used to estimate the time to access the first-level cache;
• a program that always forces a cache miss can be used to measure the main-memory access time.
It is hard to write these programs and to determine their precise memory-referencing characteristics.

Simulation (example)
Memory parameters can easily be changed to study their impact on performance: cache associativity, relative cache and memory delays, sizes of the cache and memory, and so forth. It is a challenge to accurately model the overlap of memory delays with the execution of instructions in contemporary processors (out-of-order instruction issue, branch prediction). Even with simplifying assumptions, simulation can provide useful insights into the effect of the memory system on performance.

Analytical Modeling (example)
Model variables:
• t_c : time delay for a cache reference
• t_m : time delay for a main-memory reference
• h   : cache hit ratio (so 1 - h is the cache miss ratio)
Average time contributed by cache hits: t_c × h
Average time contributed by cache misses: t_m × (1 - h)
A simple model of the overall average memory-access time: T_avg = h × t_c + (1 - h) × t_m
To apply this simple model we need to know these values, often found in the manufacturer's specifications. It gives only a coarse estimate of the average memory-access time, but it can provide insight into the relative effects of increasing the hit ratio or changing the memory-timing parameters.
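A minimal sketch of this analytical model; the latency values below are illustrative assumptions, not numbers from the slides:

```c
/* Sketch: average memory-access time model T_avg = h*t_c + (1-h)*t_m.
 * The latency values are assumed for illustration only. */
#include <stdio.h>

int main(void) {
    const double t_cache  = 2.0;   /* assumed cache latency, ns */
    const double t_memory = 60.0;  /* assumed main-memory latency, ns */

    /* Evaluate the model for a few hit ratios to see the relative effect. */
    for (double h = 0.80; h <= 1.0001; h += 0.05) {
        double t_avg = h * t_cache + (1.0 - h) * t_memory;
        printf("hit ratio %.2f -> average access time %5.2f ns\n", h, t_avg);
    }
    return 0;
}
```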

Summary: a comparison of the performance-analysis solution techniques

Characteristic   Analytical modeling   Simulation   Measurement
Flexibility      High                  High         Low
Cost             Low                   Medium       High
Believability    Low                   Medium       High
Accuracy         Low                   Medium       High

Exercises
1) Compare the three main performance-analysis solution techniques across several criteria. What additional criteria could be used to compare these techniques?
2) Identify the most appropriate solution technique for each of the following situations:
• Estimating the performance benefit of a new feature that an engineer is considering adding to a computer system currently being designed.
• Determining when it is time for a large insurance company to upgrade to a new system.
• Deciding the best vendor from which to purchase new computers for an expansion of an academic computer lab.
• Determining the minimum performance necessary for a computer system to be used on a deep-space probe with very limited available electrical power.

Exercises (answers)
1) Additional criteria:
• Impact on the environment (favors simulation or analytical modeling)
• Information security (favors simulation or analytical modeling)
• Lack of information about the system (favors measurement)
2)
• New feature under consideration: the decision can have a great impact on system performance (measurement).
• Insurance company upgrade: it is impossible to measure the future, and it is an important decision for the company's business (simulation).
• Academic lab expansion: this is not critical at all (analytical modeling).
• Deep-space probe: extremely critical mission (measurement).

Project
In our project we are not going to create a real model. Instead, we assume that the model exists, create a database, and assume it was produced through simulation, measurement, or calculation with such a model. But feel free to create your own model!

(30-minute break)

A Systematic Approach (see the nine steps above) - step 3: Select performance metrics

Metrics of Performance
Prof. Ricardo Massa F. Lima, CIn - UFPE

The first step toward understanding a system's performance is determining what is interesting and useful to measure - that is, determining the metrics.

Select Metrics

Metrics are the criteria used to compare performance. In general, they relate to the speed, accuracy, and/or availability of system services. Examples:
• network performance - speed: throughput and delay; accuracy: error rate; availability: whether the data packets sent actually arrive;
• processor performance - speed: time to execute instructions.

What is a performance metric? A value used to describe the performance of the system. We typically need to measure:
• a count of how many times an event occurs,
• the duration of some time interval,
• the size of some parameter.

Metrics are derived from values measured in the system, for example: how many times a processor initiates an input/output request, how long each of these requests takes, or how many bits are transmitted and stored. From these actual values we compute the performance metric.

A typical metric is the rate metric, or throughput:

Throughput = N_e / t

where N_e is the number of events that occur in a given interval and t is the time interval over which the events occur.

Choosing the performance metric depends on the goals of the specific situation and on the cost of gathering the information.

Characteristics of a good performance metric

Linearity
If the value of the metric changes by a certain ratio, the actual performance of the machine should change by the same ratio. For example, if you double the system's resources, you expect the new system to run in half the time taken by the old one. Linearity makes the metric intuitively appealing to most people. Not all metrics satisfy this proportionality:
• the dB scale used to describe the intensity of sound is a logarithmic metric;
• cash flow does not necessarily increase linearly with a company's profit.

Reliability
A metric is reliable if, whenever it says system A outperforms system B, the actual behavior is that system A outperforms system B. While this requirement seems obvious, commonly used performance metrics do not satisfy it:
• the MIPS metric is notoriously unreliable: a processor with a lower MIPS rating may execute a specific program in less time than a processor with a higher rating;
• "level of satisfaction" may depend on social aspects, age, gender, region, etc.

Repeatability
The same value of the metric is measured each time the same experiment is performed. This implies that a good metric is deterministic. Example of a non-repeatable metric: the number of tasks executed per day may vary if there is no standard for task size.

Easiness of measurement
The metric should be easy to measure. A metric that is difficult to measure or derive is unlikely to be used, and it is error-prone: the only thing worse than a bad metric is a metric whose value is measured incorrectly.

Consistency
The units of the metric and its definition are the same across different systems and different configurations of the same system. This is important for comparing the performance of different systems. While this requirement seems obvious, commonly used performance metrics do not satisfy it:
• MIPS and MFLOPS (different processors can do substantially different amounts of computation with a single instruction);
• tasks/day (tasks with different granularity).

Independence
Commonly used performance metrics guide the decisions of many purchasers of computer systems, so manufacturers design their machines to optimize the value obtained for a particular metric and try to influence the composition of the metric to their benefit. To prevent this corruption of its meaning, a good metric should be independent of such outside influences.

(ten-minute break)

Processor and system performance metrics
A wide variety of performance metrics has been proposed and used in the computer field. Unfortunately, many of these metrics are not good, or they are used and interpreted incorrectly.

The clock rate
Claim: a 250 MHz system must always be faster at solving the user's problem than a 200 MHz system. This ignores:
• how much computation is actually accomplished in each clock cycle,
• complex interactions with the memory and I/O subsystems,
• the fact that the processor may not be the performance bottleneck.

Is the metric good?
Characteristic            yes/no   Reason
Linearity                 no       systems do not scale with the clock
Reliability               no       a higher clock rate does not guarantee a faster solution to the user's problem
Repeatability             yes      constant
Easiness of measurement   yes      provided by the manufacturer
Consistency               yes      the value of MHz is precisely defined across all systems
Independence              yes      free from corruption

MIPS - Millions of Instructions executed per Second

MIPS = n / (t_e × 10^6)

where t_e is the time required to execute n total instructions.
Problem: different processors can do different amounts of computation with a single instruction (CISC vs RISC).

Is the metric good?
Characteristic            yes/no   Reason
Linearity                 no       doubling the MIPS rating does not mean performance doubles
Reliability               no       does not correlate well with performance
Repeatability             yes      constant
Easiness of measurement   yes      count the number of instructions executed
Consistency               no       the amount of computation per instruction differs across systems
Independence              yes      free from corruption

MFLOPS - Millions of FLOating-point Operations per Second
Problem 1: the MFLOPS rating of a system executing a program that performs no floating-point calculations is exactly zero.
Problem 2: some processors can calculate functions such as sin, cos, and log in a single instruction, while others require several multiplications, additions, and table look-ups. Should these be counted as a single floating-point operation or as multiple floating-point operations?

Is the metric good?
Characteristic            yes/no   Reason
Linearity                 yes      better than MIPS (a more specific unit of work)
Reliability               no       problems 1 and 2
Repeatability             yes      constant
Easiness of measurement   yes      count the number of floating-point operations executed
Consistency               no       problem 2
Independence              no       problem 2

SPEC - System Performance Evaluation Cooperative
A set of integer and floating-point benchmark programs intended to reflect the way most computer systems are used, with a standardized methodology for measuring and reporting their performance:
1. Measure: measure the time required to execute each program on the system being tested.
2. Normalize: divide the measured time by the time required to execute each program on a standard basis machine.
3. Geometric mean: calculate the geometric mean of all normalized values to produce the performance metric.

Geometric mean: M_G = (T_1 × T_2 × ... × T_n)^(1/n) = (∏_{i=1..n} T_i)^(1/n)

Some performance analysts advocate that this is the correct mean to use when summarizing normalized numbers, or when summarizing measurements with a wide range of values (a single value has less influence than it would have on the arithmetic mean). It maintains consistent relationships when comparing normalized values regardless of the basis system used to normalize the measurements.

Geometric mean with times
Program          System 1   System 2   System 3
1                417        244        134
2                83         70         70
3                66         153        135
4                39,449     33,527     66,000
5                772        368        369
Geometric mean   587        503        499
Rank             3          2          1
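A small sketch of the geometric-mean calculation applied to the execution times in the table above:

```c
/* Sketch: geometric mean of benchmark execution times (the "Geometric mean
 * with times" table). Values are the ones shown on the slide. */
#include <stdio.h>
#include <math.h>

static double geometric_mean(const double *x, int n) {
    double log_sum = 0.0;            /* use logs to avoid overflowing the product */
    for (int i = 0; i < n; i++)
        log_sum += log(x[i]);
    return exp(log_sum / n);
}

int main(void) {
    const double sys1[] = {417, 83, 66, 39449, 772};
    const double sys2[] = {244, 70, 153, 33527, 368};
    const double sys3[] = {134, 70, 135, 66000, 369};

    printf("System 1: %.0f\n", geometric_mean(sys1, 5));  /* ~587 */
    printf("System 2: %.0f\n", geometric_mean(sys2, 5));  /* ~503 */
    printf("System 3: %.0f\n", geometric_mean(sys3, 5));  /* ~499 */
    return 0;
}
```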

Geometric mean normalized to System 1
Program          System 1   System 2   System 3
1                1.0        0.59       0.32
2                1.0        0.84       0.85
3                1.0        2.32       2.05
4                1.0        0.85       1.67
5                1.0        0.48       0.45
Geometric mean   1.0        0.86       0.84
Rank             3          2          1

Geometric mean normalized to System 2
Program          System 1   System 2   System 3
1                1.71       1.0        0.55
2                1.19       1.0        1.00
3                0.43       1.0        0.88
4                1.18       1.0        1.97
5                2.10       1.0        1.00
Geometric mean   1.17       1.0        0.99
Rank             3          2          1

What's going on here?!

Total execution times
Program           System 1   System 2   System 3
1                 417        244        134
2                 83         70         70
3                 66         153        135
4                 39,449     33,527     66,000
5                 772        368        369
Total             40,787     34,362     66,798
Arithmetic mean   8,157      6,872      13,342
Rank              2          1          3

Side by side:
                            System 1   System 2   System 3
Geometric mean wrt sys 1    1.0        0.86       0.84     (rank 3, 2, 1)
Geometric mean wrt sys 2    1.17       1.0        0.99     (rank 3, 2, 1)
Arithmetic mean of times    8,157      6,872      13,342   (rank 2, 1, 3)

The geometric mean is consistent regardless of the normalization basis - but it is consistently wrong.

Geometric mean for times:
M_G = (∏_{i=1..n} T_i)^(1/n)
is not directly proportional to the sum of the times, so it is not appropriate for summarizing times.

Geometric mean for rates:
M_G = (∏_{i=1..n} M_i)^(1/n) = (∏_{i=1..n} F/T_i)^(1/n)
is not inversely proportional to the sum of the times, so it is not appropriate for summarizing rates either.

Averaging normalized values
Program           sys 1    norm. sys 1   sys 2    norm. sys 2   sys 3    norm. sys 3
1                 417      1             244      0.59          134      0.32
2                 83       1             70       0.84          70       0.84
3                 66       1             153      2.32          135      2.05
4                 39,449   1             33,527   0.85          66,000   1.67
5                 772      1             368      0.48          369      0.48
Total             40,787   1             34,362   5.07          66,798   5.36
Arithmetic mean   8,157    1             6,872    1.01          13,342   1.07
Rank (raw / normalized)    2 / 1                  1 / 2                  3 / 3
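A sketch of the inconsistency above: averaging raw times ranks System 2 first, while averaging values normalized to System 1 makes System 1 look best.

```c
/* Sketch: why averaging normalized values is misleading. Compare
 * (a) the arithmetic mean of raw times with (b) the arithmetic mean of
 * times normalized to System 1, using the slide's data. */
#include <stdio.h>

#define N_PROGRAMS 5

int main(void) {
    const double sys1[N_PROGRAMS] = {417, 83, 66, 39449, 772};
    const double sys2[N_PROGRAMS] = {244, 70, 153, 33527, 368};
    const double *sys[] = {sys1, sys2};

    for (int s = 0; s < 2; s++) {
        double raw_mean = 0.0, norm_mean = 0.0;
        for (int i = 0; i < N_PROGRAMS; i++) {
            raw_mean  += sys[s][i] / N_PROGRAMS;
            norm_mean += (sys[s][i] / sys1[i]) / N_PROGRAMS; /* normalize first */
        }
        printf("System %d: mean of raw times = %8.0f, mean of normalized times = %.2f\n",
               s + 1, raw_mean, norm_mean);
    }
    /* System 2 wins on raw times (6872 vs 8157), yet System 1 appears to win
     * on averaged normalized values (1.00 vs ~1.01). */
    return 0;
}
```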

Averaging normalized values

Averaging normalized values does not make sense mathematically: it gives a number, but the number has no physical meaning. First compute the mean, then normalize.

Program                   sys 1    sys 2    sys 3
1                         417      244      134
2                         83       70       70
3                         66       153      135
4                         39,449   33,527   66,000
5                         772      368      369
Total                     40,787   34,362   66,798
Arithmetic mean           8,157    6,872    13,342
Normalized mean (sys 1)   1        0.84     1.64
Rank                      2        1        3

This ordering is consistent with the total execution times.

SPEC - System Performance Evaluation Cooperative (as defined above)
1. Measure: measure the time required to execute each program on the system being tested.
2. Normalize: divide the measured time by the time required on a standard basis machine.
3. Geometric mean: calculate the geometric mean of all normalized values to produce the metric.

Problems:
• Not intuitive: the geometric mean produces a metric that is not linearly related to a program's execution time.
• Unreliable: a program may execute faster on a system with a lower SPEC rating than it does on a system with a higher rating.
• Dependent on outside influences: many compiler developers tune their optimizations to the characteristics of this collection of applications, and the set of programs in the SPEC suite is determined by a committee of representatives from the manufacturers.

A better procedure: 1. Measure; 2. Compute the mean; 3. Normalize.

QUIPS - QUality Improvements Per Second
QUIPS fixes neither the time nor the problem size. It focuses on the quality of the solution (Q_s) for a given mathematical problem, and therefore on floating-point and memory-system performance. Q_s is defined on the basis of mathematical characteristics of the problem being solved; we then measure the time t required to achieve that level of quality:

QUIPS = Q_s / t

QUIPS - QUality Improvements Per Second

Good aspects of the metric:
• mathematically precise definition of "quality";
• self-consistent when ported to different machines (consistency);
• easily repeatable (repeatability);
• the resulting measure of quality is linearly related to the time required to obtain the solution (linearity);
• the benchmark was developed to be easy to measure (easiness of measurement).

Problem: narrow focus on floating-point and memory-system performance.
• Good for predicting the performance of numerical programs.
• Does not exercise some aspects of other types of applications: the input/output subsystem, the instruction cache, the operating system's ability to multiprogram.
• It is difficult to change the quality definition to focus on other aspects of a system's performance.
Reliable only for QUIPS-like applications.

Execution Time

Quite simply: the system that produces the smallest total execution time for a given application program has the highest performance. Without a precise and accurate measure of time, it is impossible to analyze performance characteristics. How do we measure the execution time of a program, or of a portion of a program, and what are the limitations of the measuring tool?

The basic technique counts the number of clock ticks that occur between the start and the end of the event. The elapsed time is

E_t = C_t × t

where C_t is the difference between the two count values and t is the period of the clock ticks.

Example (the slide's sketch, completed so that it compiles; here clock() from <time.h> plays the role of the slide's tick counter):

    #include <stdio.h>
    #include <time.h>

    int main(void) {
        float a = 1.0f;

        /* Read the starting tick count. */
        clock_t start_count = clock();

        /* Stuff to be measured. */
        for (long i = 0; i < 10000000L; i++) {
            a = i * a / 10;
        }

        /* Read the ending tick count and convert ticks to seconds. */
        clock_t end_count = clock();
        double elapsed_time = (double)(end_count - start_count) / CLOCKS_PER_SEC;

        printf("elapsed time: %f s (a = %f)\n", elapsed_time, a);
        return 0;
    }

Execution Time (continued)

The measurement includes input/output operations, memory paging, and other system operations, so the measured execution time can vary significantly: background operating-system tasks, different virtual-to-physical page mappings, cache behavior, variable system load on a time-shared system, and so forth.

Problem 1: if the system being measured is time-shared, the elapsed execution time includes the time spent waiting while other users' applications execute.
Solution 1: report the total CPU time, i.e., the time the processor actually spends executing the program.
Problem 2: CPU time ignores the waiting time that is inherent to the application (waiting for input/output operations, memory paging, and other system operations).
Solution 2: report both the CPU time and the total execution time, and let the reader determine the interference of the other factors.

Measure the execution time several times and report at least the mean and the variance. Execution time satisfies all of the characteristics of a good performance metric; it is one of the best processor-centric performance metrics.

Summary
Characteristic    Clock   MIPS   MFLOPS   SPEC   QUIPS   TIME
Linear                                           ≈✓      ✓
Reliable                                         ≈✓      ✓
Repeatable        ✓       ✓      ✓        ✓      ✓       ✓
Easy to measure   ✓       ✓      ✓        ≈✓     ✓       ✓
Consistent        ✓                       ✓      ✓       ✓
Independent       ✓       ✓                      ✓       ✓

Other types of performance metrics

In addition to the processor-centric metrics, there are many other performance metrics:
• Response time: the time interval between a user request and the arrival of the system response. Example: the performance of online transaction-processing systems.
• Throughput: the number of jobs completed per unit time. Examples: real-time video processing (number of video frames processed per second); the bandwidth of a communication network (number of bits transmitted per second).

Speedup and relative change

Useful metrics for comparing systems; they normalize performance to a common basis. Although defined in terms of throughput or speed metrics, they are often calculated directly from execution times.

Speedup of system 2 with respect to system 1:

S_{2,1} = R_2 / R_1

where R_1 and R_2 are the 'speed' metrics being compared. A speed metric is a rate metric (i.e., a throughput):

R_1 = D_1 / T_1,   R_2 = D_2 / T_2

where D_n is analogous to the 'distance traveled' in time T_n by the program when executing on system n. Assuming that the 'distance traveled' by each system is the same (D_1 = D_2 = D):

S_{2,1} = R_2 / R_1 = (D / T_2) / (D / T_1) = T_1 / T_2

Relative Change

Relative change of system 2 with respect to system 1:

Δ_{2,1} = (R_2 - R_1) / R_1

Assuming that the 'distance traveled' by each system is the same (D_1 = D_2 = D):

Δ_{2,1} = (R_2 - R_1) / R_1 = (D/T_2 - D/T_1) / (D/T_1) = (T_1 - T_2) / T_2 = S_{2,1} - 1

Typically Δ_{2,1} is multiplied by 100 to express the metric as a percentage with respect to a given basis system.

System x   Execution time T_x   Speedup S_{x,1}   Relative change Δ_{x,1} (%)
1          480                  1                 0
2          360                  1.33              +33
3          540                  0.89              -11
4          210                  2.29              +129

Means-based metrics vs ends-based metrics

Reliability is one of the most important characteristics of a performance metric. What makes a metric unreliable? It measures what was done, whether or not the work was useful (nop instructions, multiplications by zero, ...). What makes a metric reliable? It accurately and consistently measures progress towards a goal.

Ends-based: execution time, QUIPS, SPEC.  Means-based: MIPS, MFLOPS, clock rate.

Example

The following loop performs 2N floating-point operations (FLOP): N additions and N multiplications.

    s = 0;
    for (i = 1; i <= N; i++) {
        s = s + x[i] * y[i];
    }

Let t_* be the number of cycles required to execute one multiplication, t_+ the number of cycles required for one addition, and t_1 the total number of cycles required to execute this program:

t_1 = N × (t_* + t_+) cycles

The resulting execution rate is

R_1 = 2N / (N × (t_* + t_+)) = 2 / (t_* + t_+) FLOP/cycle

It may be possible to reduce the total execution time if elements of the two vectors are zero:

    s = 0;
    for (i = 1; i <= N; i++) {
        if (x[i] != 0 && y[i] != 0)
            s = s + x[i] * y[i];
    }

Let t_if be the number of cycles required to execute the if statement and f the fraction of N for which both x[i] and y[i] are nonzero. The total number of cycles required to execute this program is

t_2 = N × [t_if + f × (t_* + t_+)] cycles

and the resulting execution rate is

R_2 = 2Nf / (N × [t_if + f × (t_* + t_+)]) = 2f / (t_if + f × (t_* + t_+)) FLOP/cycle

Assuming t_if = 4 cycles, t_+ = 5 cycles, t_* = 10 cycles, f = 10%, and clock = 250 MHz (cycle = 4 ns):

t_1 = N × (t_* + t_+) = 15N cycles = 60N ns
t_2 = N × [t_if + f × (t_* + t_+)] = N × [4 + 0.1 × 15] = 5.5N cycles = 22N ns

Speedup of program 2 relative to program 1:

S_{2,1} = 60N / 22N = 2.73

Calculating the execution rates realized by each program:

R_1 = 2 / (t_* + t_+) = 2 / ((10 + 5) × 4 ns) = 33 MFLOPS
R_2 = 2f / (t_if + f × (t_* + t_+)) = 2 × 0.1 / ([4 + 0.1 × (10 + 5)] × 4 ns) = 9 MFLOPS
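A small sketch of the speedup and relative-change definitions, reproducing the table above with system 1 as the basis:

```c
/* Sketch: speedup S_{x,1} = T_1/T_x and relative change Delta_{x,1} = S_{x,1}-1,
 * computed from the execution times tabulated above. */
#include <stdio.h>

int main(void) {
    const double t[] = {480, 360, 540, 210};   /* execution times T_x */
    const double t_basis = t[0];
    const int n = sizeof(t) / sizeof(t[0]);

    for (int x = 0; x < n; x++) {
        double speedup = t_basis / t[x];
        double rel_change = (speedup - 1.0) * 100.0;   /* in percent */
        printf("system %d: T = %3.0f  S = %.2f  Delta = %+.0f%%\n",
               x + 1, t[x], speedup, rel_change);
    }
    return 0;
}
```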

Conclusions

Even though we have reduced the total execution time (S_{2,1} = 60N / 22N = 2.73, so program 2 is 172% faster than program 1), the means-based metric (MFLOPS) shows the contrary: R_1 = 33 MFLOPS and R_2 = 9 MFLOPS, so by this metric program 2 appears 72% slower than program 1. MFLOPS unfairly gives program 1 credit for all of the useless operations of multiplying and adding zeros. This example highlights the danger of using the wrong metric to reach a conclusion about system performance.

Lista 1 (exercise list 1)

Project
Define the metrics for your performance-analysis project. Evaluate the quality of the metrics you have chosen. Are these metrics aligned with your goals?
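A sketch of the means-based vs. ends-based comparison, using the cycle counts and nonzero fraction assumed on the slides:

```c
/* Sketch: program 1 always multiplies and adds; program 2 skips zero pairs.
 * Cycle counts, f, and the clock rate are the slide's assumed values. */
#include <stdio.h>

int main(void) {
    const double t_mul = 10, t_add = 5, t_if = 4;   /* cycles per operation */
    const double f = 0.10;                           /* fraction of nonzero pairs */
    const double cycle_ns = 4.0;                     /* 250 MHz clock */

    /* Time per loop iteration, in nanoseconds. */
    double t1 = (t_mul + t_add) * cycle_ns;                /* 60 ns */
    double t2 = (t_if + f * (t_mul + t_add)) * cycle_ns;   /* 22 ns */

    /* Ends-based view: program 2 really is faster. */
    printf("speedup of program 2 over program 1: %.2f\n", t1 / t2);   /* ~2.73 */

    /* Means-based view: MFLOPS = FLOP per nanosecond * 1000. */
    double r1 = 2.0 / t1 * 1000.0;        /* counts the useless FLOPs too: ~33 */
    double r2 = 2.0 * f / t2 * 1000.0;    /* only useful FLOPs: ~9 */
    printf("MFLOPS: program 1 = %.0f, program 2 = %.0f\n", r1, r2);
    return 0;
}
```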

(See you tomorrow)

A Systematic Approach (see the nine steps above) - step 4: List system and workload parameters

Project
List all parameters that affect performance:
• system parameters (hardware and software): CPU type, OS type, ...
• workload parameters: number of users, type of requests, ...
The list may not be complete initially, so keep a working list and let it grow as the work progresses.

A Systematic Approach - step 5: Design experiments

Design of Experiments - an introduction
Prof. Ricardo Massa F. Lima, CIn - UFPE

We want to maximize results with minimal effort:
• Phase 1: many factors, few levels - see which factors matter.
• Phase 2: few factors, more levels - see what the range of impact of those factors is.

Introduction

The goal is to obtain maximum information with a minimum number of experiments. Proper analysis will help separate out the factors, and statistical techniques will help determine whether differences are caused by experimental error or not.

The key assumption is non-zero cost: it takes time and effort to gather data, and time and effort to analyze it and draw conclusions, so we want to minimize the number of experiments run.

Good experimental design allows you to:
• isolate the effects of each input variable;
• determine the effects due to interactions of input variables;
• determine the magnitude of the experimental error;
• obtain maximum information with minimum effort.

Consider a study of PC performance:
• CPU choice: Core Duo, Core 2, Xeon
• Memory size: 2 GB, 4 GB, 8 GB
• Disk drives: 1-4
• Workload: study, scientific, play
• Users: high school, college, graduate

Varying one input while holding the others constant is simple, but it ignores possible interactions between input variables. Testing all possible combinations of input variables can determine interaction effects, but the number of experiments can be very large: for example, 5 factors with 4 levels each gives 4^5 = 1024 experiments, and repeating each 3 times to capture variation due to measurement error gives 1024 × 3 = 3072. There are, of course, in-between choices.

Terminology
• Response variable: the outcome, or the measured performance. Examples: throughput in tasks/sec, or the response time for a task in seconds.
• Factors: the variables that affect the response. Example: CPU, memory, disks, workload, user.
• Levels: the different values a factor can take. Example: CPU (3), memory (3), disks (4), workload (3), users (3).
• Primary factors: those of most interest. Example: perhaps CPU and memory matter the most.
• Secondary factors: of less importance. Example: perhaps the user type is not as important.
• Replication: repetition of all or some experiments. Example: if each experiment is run three times, there are three replications.
• Design: the specification of the replications, factors, and levels. Example: all factors at the levels above, with 5 replications, gives 3 × 3 × 4 × 3 × 3 = 324 experiments, times 5 replications, for 1620 runs in total.

Terminology (continued)
• Interaction: two factors A and B interact if the effect of one depends upon the level of the other.

Example of non-interacting factors (A always increases the response by 2):
        A1   A2
  B1    3    5
  B2    6    8

Example of interacting factors (the effect of A depends upon B):
        A1   A2
  B1    3    5
  B2    6    9

Common Mistakes in Experiments
• Variation due to experimental error is ignored: measured values have randomness due to measurement error, so do not assume all variation is due to the factors.
• Important parameters are not controlled: all parameters should be listed and accounted for, even if not all are varied; otherwise the results may not be meaningful.
• Effects of different factors are not isolated: if several factors are varied simultaneously, it may not be possible to attribute a change to any one of them. The use of simple designs (next topic) may help, but they have problems of their own.
• Interactions are ignored: often the effect of one factor depends upon another (for example, the effect of a cache may depend upon the size of the program), so we need to move beyond one-factor-at-a-time designs.
• Too many experiments are conducted: rather than running all factors at all levels in all combinations, break the study into steps - in the first step use few factors and few levels (two levels per factor, details later) to determine which factors are significant, and add more levels in a later design as appropriate.

Simple Designs
1. Start with a typical configuration. Example: a PC with a Core Duo, 4 GB RAM, 1 disk, and a managerial workload run by a college student.
2. Vary one factor at a time: vary the CPU, keeping everything else constant, and compare; vary the disk drives, keeping everything else constant, and compare; and so on.
Given k factors, with the i-th having n_i levels:
Total number of experiments = 1 + Σ_{i=1..k} (n_i - 1)
Example, in the workstation study (processor, memory size, number of disks, workload, users):
1 + (3-1) + (3-1) + (4-1) + (3-1) + (3-1) = 12
But this design may ignore interactions!

Example of Interaction of Factors
Consider response time vs. memory size and degree of multiprogramming:
Degree   2 GB    4 GB    8 GB
1        0.25    0.21    0.15
2        0.52    0.45    0.36
3        0.81    0.66    0.50
4        1.50    1.45    0.70
Fixing degree 3 and memory 4 GB and varying one factor at a time may miss the interaction (for example, at degree 4 the response time is non-linear in the memory size).

Full Factorial Designs
Every possible combination of all levels of all factors. Given k factors, with the i-th having n_i levels:
Total number of experiments = ∏_{i=1..k} n_i
Example, in the workstation study: (3 CPUs)(3 memories)(4 disks)(3 workloads)(3 users) = 324 experiments.
Advantage: every interaction component can be found.
Disadvantage: expensive (time and money). Costs can be reduced by reducing the number of levels, reducing the number of factors, or running only a fraction of the full factorial design.
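A quick sketch of the experiment counts for the simple and full factorial designs, using the workstation-study levels above:

```c
/* Sketch: number of experiment runs for a simple (one-factor-at-a-time)
 * design versus a full factorial design. */
#include <stdio.h>

int main(void) {
    /* processor, memory size, number of disks, workload, users */
    const int levels[] = {3, 3, 4, 3, 3};
    const int k = sizeof(levels) / sizeof(levels[0]);

    int simple = 1;   /* 1 + sum(n_i - 1) */
    long full = 1;    /* product of n_i   */
    for (int i = 0; i < k; i++) {
        simple += levels[i] - 1;
        full *= levels[i];
    }

    printf("simple design        : %d experiments\n", simple);   /* 12   */
    printf("full factorial       : %ld experiments\n", full);    /* 324  */
    printf("full, 5 replications : %ld runs\n", full * 5);        /* 1620 */
    return 0;
}
```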

Fractional Factorial Designs
Consider only a subset of the factors (or of the combinations). Given k factors, with the i-th having n_i levels, if x factors are removed from the study:
Total number of experiments = ∏ n_i over the k - x remaining factors
Example, in the workstation study, ignoring the number of disks: (3 CPUs)(3 memories)(3 workloads)(3 users) = 81 experiments.

Project
Look at the list of parameters you have defined and select the factors - the parameters that can affect the result of your performance analysis. Specify the number of levels for each factor; you must quantify the value of each level.

A Systematic Approach (see the nine steps above) - step 6: Select factors and values

Project
In the previous task, you selected some factors from the list of parameters... but are these factors actually relevant for your experiment? Do they have a significant influence on your performance results? In this class, we are going to learn how to answer these questions.

Select Factors to Study
Divide the parameters into those that are to be studied and those that are not. For example, you may vary the CPU type but fix the OS type, or fix the packet size but vary the number of connections. Select appropriate levels for each factor: choose ones with potentially high impact, and start with a short list of factors and a small number of levels (e.g., small and large).

(30-minute break)

2^k Factorial Designs

Very often there are many possible levels for each factor. Example: to study the effect of network latency on user response time, there are lots of latency values to test. In a 2^k factorial design we choose only 2 alternatives (levels) for each of the k factors. We can then determine which of the factors impacts performance the most and study those factors further.

2^2 Factorial Design

A special case with only 2 factors, easily analyzed with a regression model. Example: MIPS as a function of memory size (4 or 16 MB) and cache size (1 or 2 KB):

Cache size   mem 4 MB   mem 16 MB
1 KB         15         45
2 KB         25         75

Define two variables x_a and x_b as follows:

x_a = -1 if 4 MB memory, +1 if 16 MB memory
x_b = -1 if 1 KB cache,  +1 if 2 KB cache

Performance y in MIPS can be regressed on x_a and x_b using the nonlinear regression model:

y = q_0 + q_a x_a + q_b x_b + q_ab x_a x_b

Substituting the four measurements gives 4 equations in 4 unknowns:

15 = q_0 - q_a - q_b + q_ab
45 = q_0 + q_a - q_b - q_ab
25 = q_0 - q_a + q_b - q_ab
75 = q_0 + q_a + q_b + q_ab

Solving:

y = 40 + 20 x_a + 10 x_b + 5 x_a x_b

Interpretation: q_0 = mean performance (40 MIPS), q_a = memory effect (20 MIPS), q_b = cache effect (10 MIPS), q_ab = interaction effect (5 MIPS).

In general, with observations y_1 .. y_4:

y_1 = q_0 - q_a - q_b + q_ab
y_2 = q_0 + q_a - q_b - q_ab
y_3 = q_0 - q_a + q_b - q_ab
y_4 = q_0 + q_a + q_b + q_ab

Solving, we get:

q_0  = ¼( y_1 + y_2 + y_3 + y_4)
q_a  = ¼(-y_1 + y_2 - y_3 + y_4)
q_b  = ¼(-y_1 - y_2 + y_3 + y_4)
q_ab = ¼( y_1 - y_2 - y_3 + y_4)

Notice that q_a can be obtained by multiplying the "a" column by the "y" column and adding; the same is true for q_b and q_ab. Sign table:

Exp     i     a     b     ab    y
1       1    -1    -1     1     15
2       1     1    -1    -1     45
3       1    -1     1    -1     25
4       1     1     1     1     75
Total   160   80    40    20
Ttl/4   40    20    10    5

Column "i" has all 1s; columns "a" and "b" have all combinations of 1 and -1; column "ab" is the product of columns "a" and "b". Multiply each column's entries by y_i, sum, and divide each total by 4 to obtain the weights in the regression model. Result: y = 40 + 20 x_a + 10 x_b + 5 x_a x_b.
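A minimal sketch of the sign-table method for this 2^2 design, using the four MIPS observations above:

```c
/* Sketch: sign-table method for a 2^2 factorial design
 * (memory/cache MIPS example, y1..y4 = 15, 45, 25, 75). */
#include <stdio.h>

int main(void) {
    const double y[4] = {15, 45, 25, 75};   /* observations, in standard order */
    const int xa[4]   = {-1, +1, -1, +1};   /* memory: 4 MB = -1, 16 MB = +1   */
    const int xb[4]   = {-1, -1, +1, +1};   /* cache:  1 KB = -1,  2 KB = +1   */

    double q0 = 0, qa = 0, qb = 0, qab = 0;
    for (int i = 0; i < 4; i++) {
        q0  += y[i];                  /* column i  */
        qa  += xa[i] * y[i];          /* column a  */
        qb  += xb[i] * y[i];          /* column b  */
        qab += xa[i] * xb[i] * y[i];  /* column ab */
    }
    q0 /= 4; qa /= 4; qb /= 4; qab /= 4;

    /* Expected: y = 40 + 20*xa + 10*xb + 5*xa*xb */
    printf("q0 = %.1f, qa = %.1f, qb = %.1f, qab = %.1f\n", q0, qa, qb, qab);
    return 0;
}
```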

Allocation of Variation

The importance of a factor is measured by the proportion of the total variation in the response that is explained by that factor. For example, if two factors explain 90% and 5% of the variation in the response, the second may be ignored.

Sample variance of y:

s_y^2 = Σ_{i=1..2^2} (y_i - ȳ)^2 / (2^2 - 1)

where ȳ is the mean response. The numerator, Σ (y_i - ȳ)^2, is the sum of squares total (SST). For a 2^2 design the variation can be divided into 3 parts:

SST = 2^2 q_a^2 + 2^2 q_b^2 + 2^2 q_ab^2 = SSA + SSB + SSAB

where the portion of the total variation due to a is 2^2 q_a^2, due to b is 2^2 q_b^2, and due to the interaction ab is 2^2 q_ab^2. The fraction of the variation explained by a is SSA/SST.

In the memory-cache study:

ȳ = (15 + 45 + 25 + 75) / 4 = 40
SST = Σ (y_i - ȳ)^2 = 25^2 + 5^2 + 15^2 + 35^2 = 2100
SST = 2^2 q_a^2 + 2^2 q_b^2 + 2^2 q_ab^2 = 4 × 20^2 + 4 × 10^2 + 4 × 5^2 = 2100

1600 of 2100 (76%) is attributed to memory, 400 (19%) to the cache, and 100 (5%) to the interaction. This suggests exploring memory further and not spending much time on the cache (or on the interaction).

Exercise
Determine the importance of each factor in the following 2^2 factorial experiment, which studies the influence of the project-management methodology and the type of application on the time to complete a software project:

Proj. management method   App 1      App 2
Scrum                     16 days    22 days
RUP                       18 days    24 days

General 2^k Factorial Designs

The same methodology extends to k factors, each with 2 levels; 2^k experiments are needed. There are k main effects, (k choose 2) two-factor effects, (k choose 3) three-factor effects, and so on. The sign-table method can still be used.

Example: designing a new machine (cache, memory, and processor):

Factor           Level -1    Level 1
Memory (a)       4 Mbytes    16 Mbytes
Cache (b)        1 Kbytes    2 Kbytes
Processors (c)   1           2

The 2^3 design and the MIPS performance results are:

                 4 MB memory            16 MB memory
Cache (b)        1 proc    2 procs      1 proc    2 procs
1 KB             14        46           22        58
2 KB             10        50           34        86
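A small sketch of the allocation of variation for the 2^2 memory/cache example, reusing the effects computed above:

```c
/* Sketch: allocation of variation for the 2^2 memory/cache example
 * (q_a = 20, q_b = 10, q_ab = 5). */
#include <stdio.h>

int main(void) {
    const double qa = 20, qb = 10, qab = 5;

    double ssa  = 4 * qa * qa;    /* 2^2 * q_a^2  = 1600 */
    double ssb  = 4 * qb * qb;    /* 2^2 * q_b^2  =  400 */
    double ssab = 4 * qab * qab;  /* 2^2 * q_ab^2 =  100 */
    double sst  = ssa + ssb + ssab;

    printf("SST = %.0f\n", sst);                                 /* 2100 */
    printf("memory      explains %.0f%%\n", 100 * ssa / sst);    /* ~76  */
    printf("cache       explains %.0f%%\n", 100 * ssb / sst);    /* ~19  */
    printf("interaction explains %.0f%%\n", 100 * ssab / sst);   /* ~5   */
    return 0;
}
```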

General 2^k Factorial Designs (continued)

Prepare the sign table for the 2^3 example:

i     a     b     c     ab    ac    bc    abc    y
1    -1    -1    -1     1     1     1    -1      14
1     1    -1    -1    -1    -1     1     1      22
1    -1     1    -1    -1     1    -1     1      10
1     1     1    -1     1    -1    -1    -1      34
1    -1    -1     1     1    -1    -1     1      46
1     1    -1     1    -1     1    -1    -1      58
1    -1     1     1    -1    -1     1    -1      50
1     1     1     1     1     1     1     1      86
320   80    40    160   40    16    24    8      (totals)
40    10    5     20    5     2     3     1      (totals / 8)

q_a = 10, q_b = 5, q_c = 20, q_ab = 5, q_ac = 2, q_bc = 3, q_abc = 1

SST = 2^3 (q_a^2 + q_b^2 + q_c^2 + q_ab^2 + q_ac^2 + q_bc^2 + q_abc^2)
    = 8 (10^2 + 5^2 + 20^2 + 5^2 + 2^2 + 3^2 + 1^2)
    = 800 + 200 + 3200 + 200 + 32 + 72 + 8 = 4512

The portions explained by the 7 effects are:
memory = 800/4512 (18%)            cache = 200/4512 (4%)
processors = 3200/4512 (71%)       memory-cache = 200/4512 (4%)
memory-proc = 32/4512 (1%)         cache-proc = 72/4512 (2%)
memory-cache-proc = 8/4512 (0%)

2^k r Factorial Designs

With a 2^k factorial design it is not possible to estimate the experimental error, since each configuration is run only once. So each experiment is repeated r times, giving 2^k r observations. As before, we start with the 2^2 r model and expand. With two factors at two levels, repeating the 4 configurations r times lets us isolate the experimental error. The model now includes an error term:

y = q_0 + q_a x_a + q_b x_b + q_ab x_a x_b + e

We want to quantify e. For each observation y_ij (configuration i, repetition j) the model estimate is

ŷ_i = q_0 + q_a x_ai + q_b x_bi + q_ab x_ai x_bi

and the residual (error) is

e_ij = y_ij - ŷ_i = y_ij - q_0 - q_a x_ai - q_b x_bi - q_ab x_ai x_bi

The sum of squared errors (SSE) is used to compute the variance and confidence intervals:

SSE = Σ_{i=1..4} Σ_{j=1..r} e_ij^2

2^2 r example: the previous cache experiment with r = 3

i    a     b     ab    mean ȳ_i   y_i1  y_i2  y_i3   e_i1  e_i2  e_i3
1   -1    -1     1     15         15    18    12      0     3    -3
1    1    -1    -1     48         45    48    51     -3     0     3
1   -1     1    -1     24         25    28    19      1     4    -5
1    1     1     1     77         75    75    81     -2    -2     4
Totals: 164 (i), 86 (a), 38 (b), 20 (ab)
Totals/4: q_0 = 41, q_a = 21.5, q_b = 9.5, q_ab = 5

Example: SSE = 0^2 + 3^2 + (-3)^2 + (-3)^2 + 0^2 + 3^2 + 1^2 + 4^2 + (-5)^2 + (-2)^2 + (-2)^2 + 4^2 = 102

2^2 r Allocation of Variation

Total variation: SST = Σ (y_ij - ȳ..)^2, which can be divided into 4 parts:

SST = 2^2 r q_a^2 + 2^2 r q_b^2 + 2^2 r q_ab^2 + [SSY - 2^2 r (q_0^2 + q_a^2 + q_b^2 + q_ab^2)]
    = SSA + SSB + SSAB + SSE

where SSY = Σ y_ij^2. Memory-cache example:

SSY = 15^2 + 18^2 + 12^2 + ... + 75^2 + 81^2 = 27204
SSA = 2^2 × 3 × q_a^2 = 12 × 21.5^2 = 5547
SSB = 2^2 × 3 × q_b^2 = 12 × 9.5^2 = 1083
SSAB = 2^2 × 3 × q_ab^2 = 12 × 5^2 = 300
SSE = 27204 - 2^2 × 3 × (41^2 + 21.5^2 + 9.5^2 + 5^2) = 102
SST = 5547 + 1083 + 300 + 102 = 7032

Thus the total variation of 7032 is divided into 4 parts:
a explains 5547/7032 (78.88%)
b explains 1083/7032 (15.40%)
ab explains 300/7032 (4.27%)
the remaining 1.45% (102/7032) is unexplained and attributed to experimental error.

Exercise
Imagine you extended the previous exercise to estimate the error: you repeated the experiment three times and obtained the following results.

Replication 1:
Proj. management method   App 1      App 2
Scrum                     16 days    22 days
RUP                       18 days    24 days

Replication 2:
Proj. management method   App 1      App 2
Scrum                     18 days    18 days
RUP                       20 days    22 days

Replication 3:
Proj. management method   App 1      App 2
Scrum                     16 days    18 days
RUP                       19 days    23 days
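A sketch of the 2^2 r computation for the cache example above (r = 3): effects from the cell means, SSE from the residuals.

```c
/* Sketch: 2^2 r factorial design with r = 3 replications, using the
 * cache/memory observations above. */
#include <stdio.h>

#define CONFIGS 4
#define REPS    3

int main(void) {
    /* Rows follow the sign-table order: (a,b) = (-,-), (+,-), (-,+), (+,+). */
    const double y[CONFIGS][REPS] = {
        {15, 18, 12}, {45, 48, 51}, {25, 28, 19}, {75, 75, 81}};
    const int xa[CONFIGS] = {-1, +1, -1, +1};
    const int xb[CONFIGS] = {-1, -1, +1, +1};

    double mean[CONFIGS], q0 = 0, qa = 0, qb = 0, qab = 0, sse = 0;

    for (int i = 0; i < CONFIGS; i++) {
        mean[i] = (y[i][0] + y[i][1] + y[i][2]) / REPS;
        q0  += mean[i] / CONFIGS;
        qa  += xa[i] * mean[i] / CONFIGS;
        qb  += xb[i] * mean[i] / CONFIGS;
        qab += xa[i] * xb[i] * mean[i] / CONFIGS;
    }
    for (int i = 0; i < CONFIGS; i++)
        for (int j = 0; j < REPS; j++) {
            double e = y[i][j] - mean[i];   /* residual against the cell mean */
            sse += e * e;
        }

    /* Expected: q0 = 41, qa = 21.5, qb = 9.5, qab = 5, SSE = 102 */
    printf("q0 = %.1f, qa = %.1f, qb = %.1f, qab = %.1f, SSE = %.0f\n",
           q0, qa, qb, qab, sse);
    return 0;
}
```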

Project

For the experiment that follows, you are supposed to assess the results from each experiment (using simulation, measurement, or analytical modeling)… but, if that is not the case, just invent the numbers.
• Do the factors have a significant influence on your performance results?
• Can we trust the results of our experiments?
• Is the influence of these factors and interferences statistically relevant?
• Is the influence of the error relevant?

break

Average Performance and Variability
Prof. Ricardo Massa F. Lima
CIn - UFPE

To answer these questions, we need some statistics!

Trying to summarize the performance of a system over all classes of applications using a single mean value can result in very misleading conclusions. So, why are mean values used in performance analysis?

There is a strong demand to reduce the performance of a computer system to a single number — it is human nature! People continue to want a simple way to compare different computer systems. BUT… this is an impossible goal.

Mean values can, however, be useful for performing coarse comparisons, and the performance analyst will see others use mean values to justify some result. Consequently, it is important to:
1) correctly calculate a mean value
2) recognize when a mean has been calculated incorrectly
3) recognize when a mean is being used inappropriately

Probability, Parameters and Statistics

The Hypothesis of Random Sampling?

An important statistical idea is that a set of observations may be regarded as a random sample.
• Actual population vs. hypothetical population: the hypothetical population conceptually contains all values that can occur from a given operation; any set of observations we may collect is a kind of sample from that population.

This hypothesis of random sampling, however, will often not apply to actual data.
• Consider daily temperature data: warm days tend to follow one another. Such data are said to be autocorrelated and are thus not directly representable by random drawings.
• Consider public opinion polls: a poll conducted on election night at the headquarters of one political party might give an entirely false picture of the standing of its candidate in the voting population.

The Hypothesis of Random Sampling?

It is unfortunate that the hypothesis of random sampling is treated in much statistical writing as if it were a natural phenomenon. For real data it is a property that can never be relied upon, although suitable precautions in the design of an experiment can make the assumption relevant.

The dot diagram is a valuable device for displaying the distribution of a small body of data. It shows:
1. the location of the observations (near 67)
2. the spread of the observations (≅ 5 units)
3. points more extreme than the rest — outliers

Example in Minitab (sample of 9 observations):
66.7  64.3  67.1  66.1  69.1  67.2  68.1  65.7  66.4

Sample measures (medidas amostrais)

Central tendency or location:
– mean (média)
– median (mediana)
– mode (moda)
– trimmed mean (média aparada)

Relative location:
– minimum (mínimo)
– maximum (máximo)
– quantile (quantil)
– quartile (quartil)
– percentile (percentil)

Dispersion:
– range (amplitude)
– inter-quartile range (distância inter-quartil)
– variance (variância)
– standard deviation (desvio padrão)
– coefficient of variation (coeficiente de variação)

Sample Measures: Central Tendency or Location
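As a concrete illustration of these sample measures, the following Python sketch (not in the original slides) computes them for the nine Minitab observations listed above, using only the standard library.

```python
# Summary of the nine observations from the Minitab example above.
import statistics as st

data = [66.7, 64.3, 67.1, 66.1, 69.1, 67.2, 68.1, 65.7, 66.4]

print("mean      :", round(st.mean(data), 2))     # ~66.74
print("median    :", st.median(data))             # middle of the sorted values
print("minimum   :", min(data))
print("maximum   :", max(data))
print("range     :", round(max(data) - min(data), 2))
print("variance  :", round(st.variance(data), 3)) # sample variance (n - 1 in the denominator)
print("std. dev. :", round(st.stdev(data), 3))
print("quartiles :", st.quantiles(data, n=4))     # Q1, Q2, Q3
```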

Indices of central tendency

Measures of a program's execution time are subject to a variety of nondeterministic effects.
• How do we summarize these measurements into a single number that specifies the center of the distribution of the values?
• How do we summarize a system's performance into a single value that represents the execution times of different benchmarks?
• There are four different indices of central tendency used to summarize multiple measurements: the mean, the median, the mode, and the trimmed mean.

Mean

Sample: a small group of n observations actually available.
Population: a large set of N observations from which the sample is extracted.

The sample arithmetic mean (also called the average, a statistic) of n observations is
ȳ = (1/n) Σ_{i=1}^{n} y_i
It is a measure of location for the sample (the most commonly used measure of central tendency). For the sample of 9 observations in the dot diagram, ȳ = 66.74.

For a population containing a very large number N of observations, the measure of location, called the population mean (a parameter), is denoted by the Greek letter η (eta):
η = (1/N) Σ_{i=1}^{N} y_i

If the measured values are thought of as realizations of a discrete random variable Y, the expected value of Y is defined to be
E(y) = Σ_{i=1}^{N} y_i · p_i
where p_i is the probability that the value of Y is y_i. E(y) is also referred to as the first moment of the random variable Y.

The population mean is also called the expected value of y, or the mathematical expectation of y, and is denoted E(y). Thus η = E(y).
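As a small illustration (not from the slides) of the formula E(y) = Σ y_i · p_i, the sketch below treats a hypothetical execution time as a discrete random variable with assumed probabilities and computes its expectation.

```python
# Expected value E(Y) = sum(y_i * p_i) for a discrete random variable.
# Hypothetical example: an operation takes 10, 20 or 50 ms with the given probabilities.
values = [10.0, 20.0, 50.0]          # possible values y_i (ms)
probs  = [0.6, 0.3, 0.1]             # probabilities p_i (must sum to 1)

assert abs(sum(probs) - 1.0) < 1e-9

expectation = sum(y * p for y, p in zip(values, probs))
print("E(Y) =", expectation, "ms")   # 10*0.6 + 20*0.3 + 50*0.1 = 17.0 ms
```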

Potential Problem with Means

Recall that the mean value for the sample of 9 observations in the dot diagram is ȳ = 66.74. The sample mean gives equal weight to all measurements, so one value significantly different from the others (an outlier) can have a large influence on the computed mean.
Ex.: if we add a 10th measurement with the value 500 to the 9 observations in the dot diagram, the new value of the mean is 110.07. This value is substantially higher and does not capture our 'sense' of the central tendency.

The Median

The median may be obtained by listing the n data values in order of magnitude. The median is the middle value if n is odd, and the average of the two middle values if n is even.

Look at these numbers: 3, 13, 7, 5, 21, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29
If we put those numbers in order we have: 3, 5, 7, 12, 13, 14, 21, 23, 23, 23, 23, 29, 39, 40, 56
There are fifteen numbers, so the middle number is the eighth one: the median is 23.

Now look at these numbers: 3, 13, 7, 5, 21, 23, 23, 40, 23, 14, 12, 56, 23, 29
In order: 3, 5, 7, 12, 13, 14, 21, 23, 23, 23, 23, 29, 40, 56
There are now fourteen numbers and the median is (21 + 23)/2 = 22.

The median reduces the skewing effect of outliers. Adding the 10th value (500) to the dot-diagram sample increases the mean from 66.74 to 110.07, while the median barely moves (from 66.7 to 66.9 for the sorted data 64.3, 65.7, 66.1, 66.4, 66.7, 67.1, 67.2, 68.1, 69.1, 500.0). Thus the median appears to better capture a sense of the central tendency than does the mean.

Trimmed mean (média aparada)

A trimmed mean is nothing more than a "blend" of the concepts of mean and median: it combines the qualities of both. A trimmed mean is a mean computed after excluding a certain proportion of the observations at each extreme of the sorted sample (the outliers).
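The outlier effect described above is easy to check numerically. The sketch below (not from the slides) compares the mean, the median and a simple hand-rolled 10% trimmed mean before and after the 500 outlier is added.

```python
# Mean vs. median vs. trimmed mean, before and after adding an outlier.
import statistics as st

def trimmed_mean(values, proportion=0.1):
    """Mean after dropping `proportion` of the observations at each end of the sorted data."""
    data = sorted(values)
    k = int(len(data) * proportion)          # observations to drop per end
    trimmed = data[k:len(data) - k] if k else data
    return st.mean(trimmed)

sample = [66.7, 64.3, 67.1, 66.1, 69.1, 67.2, 68.1, 65.7, 66.4]

for label, data in [("original", sample), ("with 500 outlier", sample + [500.0])]:
    print(f"{label:>17}: mean={st.mean(data):7.2f}  "
          f"median={st.median(data):6.2f}  "
          f"trimmed={trimmed_mean(data):6.2f}")
```

The mean jumps from about 66.74 to 110.07, while the median and the trimmed mean stay close to 67 — the robustness property the slides argue for.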

The sample mode

The value that occurs most frequently.
• It need not always exist for a given set of sample data.
• It need not be unique.

Selecting among the mean, median, and mode

Use the mode for categorical data — data that can be grouped into distinct types or categories.
• Ex.: the number of computers in an organization manufactured by different companies. The mode summarizes the most common type of computer the organization owns; the mean and the median really do not make sense in this context.

Use the mean when the sum of all measurements is a meaningful and interesting value.
• Ex.: in a set of tested programs, the total time required to execute all programs is an interesting and meaningful value.
• The sum of MFLOPS ratings, on the other hand, is not a meaningful value.

Use the median when the sample data contain a few values not clustered with the others; the median may then give a more meaningful indication of the central tendency.
• Ex.: how much memory is installed in the labs' workstations?
  – 25 machines contain 1 Gbyte
  – 38 machines contain 2 Gbytes
  – 04 machines contain 4 Gbytes
  – 01 machine contains 128 Gbytes
• The sum of these values is meaningful (245 Gbytes), so the mean (3.6 Gbytes) can be computed — but it is somewhat misleading.
• The median (2 Gbytes) is more indicative of the 'typical' machine: 63 of the 68 machines have 2 Gbytes of memory or less.

break

Yet even more means!!!
• So far we have discussed the arithmetic mean.
• There are two other means: the harmonic mean and the geometric mean.
• Unfortunately, these means are sometimes used incorrectly, which can lead to erroneous conclusions.
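As a quick check of the workstation-memory example (not in the original slides), the sketch below expands the four machine groups into one sample and computes the three indices: the median and the mode point to the 'typical' 2-Gbyte machine, while the mean is pulled up by the single 128-Gbyte machine.

```python
# Mean, median and mode for the lab-workstation memory example.
import statistics as st

# (memory in GBytes, number of machines)
groups = [(1, 25), (2, 38), (4, 4), (128, 1)]
memory = [size for size, count in groups for _ in range(count)]

print("machines:", len(memory))                  # 68
print("total   :", sum(memory), "GBytes")        # 245
print("mean    :", round(st.mean(memory), 1))    # ~3.6
print("median  :", st.median(memory))            # 2
print("mode    :", st.mode(memory))              # 2
```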

Which mean should be used in each situation?

Time-based mean (e.g. seconds):
– should be directly proportional to the total (weighted) time
– if the time doubles, the mean value should double!

Rate-based mean (e.g. operations/sec):
– should be inversely proportional to the total (weighted) time
– if the time doubles, the mean value should reduce by half!

Assumptions

Throughout the following discussion we assume that we have measured the execution times T_i, i = 1, 2, …, n, of n benchmark programs, and that each benchmark performs the same total amount of work: F floating-point operations. The execution rate of benchmark i is then M_i = F / T_i.

Arithmetic mean of the execution times:
M_A = (1/n) Σ_{i=1}^{n} T_i
This mean value is directly proportional to the sum of the execution times → it is the correct mean to summarize execution times.

Consider now the arithmetic mean to summarize the execution rates:
M = (1/n) Σ_{i=1}^{n} M_i = (1/n) Σ_{i=1}^{n} (F / T_i)
This is not inversely proportional to the sum of the execution times → it is inappropriate for summarizing execution rates.

We need a value that is inversely proportional to the sum of the execution times: the total number of floating-point operations divided by the total execution time. This is the harmonic mean:
M_H = n / Σ_{i=1}^{n} (1 / M_i) = n / Σ_{i=1}^{n} (T_i / F) = n·F / Σ_{i=1}^{n} T_i
→ the harmonic mean is appropriate to summarize rates.

Example (F = 100,000 floating-point operations per benchmark):

Measurement (i) | T_i
1               | 300
2               | 400
3               | 300
4               | 600
5               | 400

M_H = 5 × 100,000 / (300 + 400 + 300 + 600 + 400) = 500,000 / 2,000 = 250 operations per unit time

Applied to execution times, however, the harmonic mean n / Σ_{i=1}^{n} (1/T_i) is not directly proportional to the sum of the times → it is not appropriate for summarizing times.
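A minimal Python sketch (not from the slides) of the example above: it contrasts the arithmetic and harmonic means of the rates F/T_i and checks that the harmonic mean equals total work divided by total time.

```python
# Arithmetic vs. harmonic mean of execution rates (F = 100,000 FLOPs per run).
F = 100_000
times = [300, 400, 300, 600, 400]          # execution times T_i
rates = [F / t for t in times]             # M_i = F / T_i

n = len(times)
arith_rate = sum(rates) / n                          # misleading for rates
harm_rate  = n / sum(1 / m for m in rates)           # equals n*F / sum(T_i)

print("arithmetic mean of rates:", round(arith_rate, 2))   # ~266.67
print("harmonic  mean of rates :", round(harm_rate, 2))    # 250.0
print("total work / total time :", n * F / sum(times))     # 250.0 -> matches the harmonic mean
```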

Geometric Mean

M_G = (T_1 × T_2 × … × T_n)^{1/n} = (Π_{i=1}^{n} T_i)^{1/n}

Some performance analysts advocate that this is the correct mean to use when:
– summarizing normalized numbers
– summarizing measurements with a wide range of values (a single value has less influence than it would on the arithmetic mean)
It maintains consistent relationships when comparing normalized values, regardless of the basis system used to normalize the measurements.

Geometric mean with times (execution times of 5 programs on 3 systems):

Program        | System 1 | System 2 | System 3
1              | 417      | 244      | 134
2              | 83       | 70       | 70
3              | 66       | 153      | 135
4              | 39,449   | 33,527   | 66,000
5              | 772      | 368      | 369
Geometric mean | 587      | 503      | 499
Rank           | 3        | 2        | 1

Geometric mean normalized to System 1:

Program        | System 1 | System 2 | System 3
1              | 1.0      | 0.59     | 0.32
2              | 1.0      | 0.84     | 0.85
3              | 1.0      | 2.32     | 2.05
4              | 1.0      | 0.85     | 1.67
5              | 1.0      | 0.48     | 0.45
Geometric mean | 1.0      | 0.86     | 0.84
Rank           | 3        | 2        | 1

Geometric mean normalized to System 2:

Program        | System 1 | System 2 | System 3
1              | 1.71     | 1.0      | 0.55
2              | 1.19     | 1.0      | 1.0
3              | 0.43     | 1.0      | 0.88
4              | 1.18     | 1.0      | 1.97
5              | 2.10     | 1.0      | 1.0
Geometric mean | 1.17     | 1.0      | 0.99
Rank           | 3        | 2        | 1

What's going on here?!

                         | System 1 | System 2 | System 3
Geometric mean wrt Sys 1 | 1.0      | 0.86     | 0.84
Rank                     | 3        | 2        | 1
Geometric mean wrt Sys 2 | 1.17     | 1.0      | 0.99
Rank                     | 3        | 2        | 1
Arithmetic mean          | 8,157    | 6,872    | 13,342
Rank                     | 2        | 1        | 3

Total execution times:

Program         | System 1 | System 2 | System 3
1               | 417      | 244      | 134
2               | 83       | 70       | 70
3               | 66       | 153      | 135
4               | 39,449   | 33,527   | 66,000
5               | 772      | 368      | 369
Total           | 40,787   | 34,362   | 66,708
Arithmetic mean | 8,157    | 6,872    | 13,342
Rank            | 2        | 1        | 3

The geometric mean is consistent regardless of the normalization basis — but it is consistently wrong: the total (and arithmetic mean) execution times rank the systems 2, 1, 3.
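To see the consistency (and the problem) numerically, here is a small Python sketch — not part of the slides — that recomputes the geometric means of the table above, normalized to each system in turn, and compares the resulting ranking with the ranking by total execution time.

```python
# Geometric-mean ranking vs. total-execution-time ranking for three systems.
from math import prod

times = {                     # execution times of the 5 programs on each system
    "Sys1": [417, 83, 66, 39449, 772],
    "Sys2": [244, 70, 153, 33527, 368],
    "Sys3": [134, 70, 135, 66000, 369],
}

def geo_mean(values):
    return prod(values) ** (1 / len(values))

for basis in times:           # normalize to each system in turn
    gm = {s: geo_mean([t / b for t, b in zip(ts, times[basis])])
          for s, ts in times.items()}
    ranking = sorted(gm, key=gm.get)          # smallest (best) first
    print(f"normalized to {basis}: geometric-mean ranking = {ranking}")

totals = {s: sum(ts) for s, ts in times.items()}
print("ranking by total time      =", sorted(totals, key=totals.get))
```

The geometric-mean ranking is the same for every normalization basis, yet it disagrees with the ranking by total execution time — exactly the point of the slides above.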

Geometric mean for times and rates

M_G = (Π_{i=1}^{n} T_i)^{1/n} is not directly proportional to the sum of the times → not appropriate for summarizing times.
M_G = (Π_{i=1}^{n} M_i)^{1/n} = (Π_{i=1}^{n} F/T_i)^{1/n} is not inversely proportional to the sum of the times → not appropriate for summarizing rates.

Averaging normalized values

Program         | Sys 1  | norm. to Sys 1 | Sys 2  | norm. to Sys 1 | Sys 3  | norm. to Sys 1
1               | 417    | 1              | 244    | 0.59           | 134    | 0.32
2               | 83     | 1              | 70     | 0.84           | 70     | 0.84
3               | 66     | 1              | 153    | 2.32           | 135    | 2.05
4               | 39,449 | 1              | 33,527 | 0.85           | 66,000 | 1.67
5               | 772    | 1              | 368    | 0.48           | 369    | 0.48
Total           | 40,787 | 5              | 34,362 | 5.07           | 66,708 | 5.36
Arithmetic mean | 8,157  | 1              | 6,872  | 1.01           | 13,342 | 1.07
Rank            | 2      | 1              | 1      | 2              | 3      | 3

Averaging normalized values doesn't make sense mathematically… it gives a number, but the number has no physical meaning. First compute the mean, then normalize:

Program                 | Sys 1  | Sys 2  | Sys 3
1                       | 417    | 244    | 134
2                       | 83     | 70     | 70
3                       | 66     | 153    | 135
4                       | 39,449 | 33,527 | 66,000
5                       | 772    | 368    | 369
Total                   | 40,787 | 34,362 | 66,708
Arithmetic mean         | 8,157  | 6,872  | 13,342
Normalized mean (Sys 1) | 1      | 0.84   | 1.64
Rank                    | 2      | 1      | 3

Normalizing the means (rather than averaging the normalized values) is consistent with the ranking by total execution time.

Weighted Mean

So far we have assumed that each individual measurement is equally important when calculating the mean. In many situations this assumption is not true. Ex.:
• program 1 is used half of the time → w_1 = 0.5
• the other 4 programs are used equally in the remaining half → w_2 = w_3 = w_4 = w_5 = 0.125

Weighted arithmetic mean: M_A = Σ_{i=1}^{n} w_i · x_i
Weighted harmonic mean:   M_H = 1 / Σ_{i=1}^{n} (w_i / x_i)
(We ignore the geometric mean in this discussion, since it is not an appropriate mean for summarizing either execution times or rates.)

Summary of Means

• Avoid means if possible — a mean loses information.
• Arithmetic mean: use when the sum of the raw values has a physical meaning; use for summarizing times (not rates).
• Harmonic mean: use for summarizing rates (not times).
• Geometric mean: not useful when time is the best measure of performance.
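The pitfall above — averaging normalized values instead of normalizing the mean — can be demonstrated in a few lines of Python (again, not from the slides):

```python
# Mean of normalized values vs. normalized mean, using the 5-program data above.
systems = {
    "Sys1": [417, 83, 66, 39449, 772],
    "Sys2": [244, 70, 153, 33527, 368],
    "Sys3": [134, 70, 135, 66000, 369],
}
basis = systems["Sys1"]

for name, ts in systems.items():
    mean_of_norm = sum(t / b for t, b in zip(ts, basis)) / len(ts)   # wrong order of operations
    norm_of_mean = (sum(ts) / len(ts)) / (sum(basis) / len(basis))   # mean first, then normalize
    print(f"{name}: mean of normalized = {mean_of_norm:.2f}, "
          f"normalized mean = {norm_of_mean:.2f}")
```

The first column reproduces the misleading 1.00 / 1.01 / 1.07 values; the second reproduces the consistent 1.00 / 0.84 / 1.64 values obtained by normalizing the means.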

Geometric mean

• It does provide consistent rankings, independent of the basis used for normalization.
• But it can be consistently wrong!
• A value can always be computed, but the number has no physical meaning.

Weighted means

The standard definition of the mean assumes all measurements are equally important. Instead, choose weights w_i, with Σ_{i=1}^{n} w_i = 1, to represent the relative importance of measurement i:

M_A = Σ_{i=1}^{n} w_i · x_i         (weighted arithmetic mean)
M_H = 1 / Σ_{i=1}^{n} (w_i / x_i)   (weighted harmonic mean)

Sample measures: relative location

– minimum (mínimo)
– maximum (máximo)
– quartile (quartil)
– quantile (quantil)
– percentile (percentil)

Minimum and maximum
• Minimum — the smallest value in the sample.
• Maximum — the largest value in the sample.

Quartiles
• Quartiles are the values (Q1, Q2 and Q3) that divide the sorted sample into four equal parts (or as equal as possible).
• Q2 coincides with the median.
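A short sketch (not in the slides) applying the weighted formulas with the usage pattern mentioned earlier — program 1 used half of the time, the other four programs sharing the rest equally. The execution-rate numbers are made up purely for illustration.

```python
# Weighted arithmetic and harmonic means (weights must sum to 1).
weights = [0.5, 0.125, 0.125, 0.125, 0.125]   # program 1 runs half of the time
rates   = [120.0, 80.0, 95.0, 60.0, 150.0]    # hypothetical execution rates (ops/s)

assert abs(sum(weights) - 1.0) < 1e-9

weighted_arith = sum(w * x for w, x in zip(weights, rates))
weighted_harm  = 1.0 / sum(w / x for w, x in zip(weights, rates))

print("weighted arithmetic mean:", round(weighted_arith, 2))
print("weighted harmonic  mean :", round(weighted_harm, 2))   # the appropriate choice for rates
```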
