diff --git a/abramhindle/energymining.md b/abramhindle/energymining.md index 8f9474e..4e2ae11 100644 --- a/abramhindle/energymining.md +++ b/abramhindle/energymining.md @@ -112,9 +112,9 @@ measured in watts (W) where 1 watt is equal to 1 joule per second. When you measure a task that a system executes, ask your system, does this task have a clear beginning or end? Does this task continuously -run? Tasks that do not to run continuously, such as sharpening an +run? Tasks that do not run continuously, such as sharpening an image or compressing a video file, can be characterized by the energy -consumed. Where as a task that runs continuously, such a sharpening +consumed. Whereas a task that runs continuously, such as sharpening video images of a surveillance web-camera or streaming video compression, is better characterized by its workload, its power, the rate of energy consumption. @@ -132,7 +132,7 @@ we're measuring physical phenomena: energy consumption. Our measurement equipment, our testbeds, or energy measurement devices all have error in them; error is inherent in physical measurement, thus we need to take multiple measurements so we can rely upon statistics to -give us a more clear picture of what we measured. +give us a clearer picture of what we measured. ### Granularity @@ -201,7 +201,7 @@ to update, sometimes the network goes down, sometimes a remote site is unavailable. Often the software under test is just inherently buggy and only half of the test runs will complete. When developing tests, one should instrument the tests with auditing capabilities such as -screenshots to enable postmortem investigations. Furthermore +screenshots to enable postmortem investigations. Furthermore outliers should be investigated and potentially re-run. You wouldn't want to attribute a difference in energy consumption to an erroneous test. @@ -216,7 +216,7 @@ repeated measurement and statistical analysis. Thus remember and use the scenarios: environment, N-versions, energy or power, repeated measurement, granularity, idle measurement, statistical analysis, and exceptions. - + ## Footnotes British/Canadian spelling. diff --git a/brusso/NeedsDataAnalysisPattern.md b/brusso/NeedsDataAnalysisPattern.md index 20932e9..fa39e91 100644 --- a/brusso/NeedsDataAnalysisPattern.md +++ b/brusso/NeedsDataAnalysisPattern.md @@ -5,36 +5,36 @@ When you call a doctor, you would expect her to come with a handy set of remedies for your disease. You would not be pleased to see her digging into a huge amount of clinical data while she makes a diagnosis and searches for a solution to your problem, nor would you expect her to propose a cure based on your case alone. The remedies she proposes are solutions to recurring problems that medical researchers identify by analysing data of patients with similar symptoms and medical history. Remedies are coded in a language that a doctor understands (e.g., they tell when and how to treat a patient) and lead to meaningful conclusions for patients with the same disease (e.g., they tell the probability the disease will be defeated and eventually with what consequences). Once found, such solutions can be applied over and over again. With the repeated use of a solution, medical researchers can indeed gain knowledge on successes and failures of a remedy and provide meaningful conclusions to future patients thereafter. -The remedy metaphor portrays what are data analysis patterns in empirical sciences. First, a pattern is a coded solution of a recurring problem.
When a problem occurs several times, we accumulate knowledge on the problem and its solutions. With this knowledge, we are able to code a solution in some sort of modelling language that increases its expressivity and capability of re-use. Second, a pattern is equipped with a sort of measure of success of the solution it represents. The solution and the measure result from the analysis of historical data and provide actionable insight for future cases. - +The remedy metaphor portrays what are data analysis patterns in empirical sciences. First, a pattern is a coded solution of a recurring problem. When a problem occurs several times, we accumulate knowledge on the problem and its solutions. With this knowledge, we are able to code a solution in some sort of modelling language that increases its expressivity and capability of re-use. Second, a pattern is equipped with a sort of measure of success of the solution it represents. The solution and the measure result from the analysis of historical data and provide actionable insight for future cases. + Does it make sense to speak about patterns in modern software engineering? The answer can only be yes. Patterns are a form of re-use and re-use is one of the key principles in modern software engineering. Why is it? Re-use is an instrument to increase the economy of development and prevent human errors in software development processes. In their milestone book, Erich Gamma, Richard Helm, Ralph Johnson and John Vlissides [Gamma et al., 1995] introduced (design) patterns as a way ''to reuse the experience instead of rediscovering it.'' Thus, patterns as a form of re-use help build software engineering knowledge from experience. Does it make sense to speak about patterns of data analysis in modern software engineering? Definitely yes. Data analysis patterns are ''remedies'' for recurring data analysis problems arisen during the conception, the development, the use of the software technology. They are codified solutions that lead to meaningful conclusions for software engineering stakeholders and can be reused for comparable data. In other words, a data analysis pattern is a sort of model expressed in a language that logically describes a solution to a recurring data analysis problem in software engineering. They can possibly be automated. As such, data analysis patterns help "rise from the drudgery of random action into the sphere of intentional design," [Pontus et al., 2012]. -Why aren't they already diffusely used? The majority of us has the ingrained belief that methods and results from individual software analysis pertain to the empirical context from which data has been collected. Thus, in almost every new study, we re-invent the data-analysis wheel. It is like we devised a new medical protocol for any new patient. Why is this? One of the reasons is related to software engineering data and the role it has gained over the years. +Why aren't they already diffusely used? The majority of us has the ingrained belief that methods and results from individual software analysis pertain to the empirical context from which data has been collected. Thus, in almost every new study, we re-invent the data-analysis wheel. It is like we devised a new medical protocol for any new patient. Why is this? One of the reasons is related to software engineering data and the role it has gained over the years. ##Software engineering data -A large part of modern software engineering research builds new knowledge by analysing data of different types. 
To study distributed development processes, we analyse textual interactions among developers of open source communities and use social network, complex systems, or graph theories. If we instead want to predict if a new technology will take off in the IT market, we collect economic data and use the Roger's theory of diffusion of innovation. To understand the quality of modern software products, we mine code data and their evolution over versions. Sometimes, we also need to combine data of different nature collected from different sources and analysed with various statistical methods. +A large part of modern software engineering research builds new knowledge by analysing data of different types. To study distributed development processes, we analyse textual interactions among developers of open source communities and use social network, complex systems, or graph theories. If we instead want to predict if a new technology will take off in the IT market, we collect economic data and use Rogers' theory of diffusion of innovations. To understand the quality of modern software products, we mine code data and their evolution over versions. Sometimes, we also need to combine data of different natures collected from different sources and analysed with various statistical methods. Thus, data types can be very different. To give just a few examples, data can be structured or unstructured (e.g., lines of code or free text in review comments and segments of videos), discrete or continuous (e.g., the number of bugs in software or the response time of web services), qualitative or quantitative (e.g., the complexity of a software task and cyclomatic code complexity), and subjective or objective (e.g., the ease of use of a technology or the number of back links to web sites). In addition, with the OSS phenomenon, cloud computing, and the Big Data era, data has become more distributed, big, and accessible, but also noisy, redundant, and incomplete. As such, researchers must have a good command of analysis instruments and a feel for the kinds of problems and data they apply to. ##Needs of data analysis patterns -The needs for instruments like data analysis patterns become more apparent when we want to introduce novices to the research field. In these circumstances, we encounter the following issues. +The need for instruments like data analysis patterns becomes more apparent when we want to introduce novices to the research field. In these circumstances, we encounter the following issues. *Studies do not report entirely or sufficiently on their data analysis protocols.* This implies that analyses are biased or not verifiable. Consequently, secondary studies like mapping studies and systematic literature reviews that are mandated to synthesise published research lose their power. Data analysis patterns provide software engineers with a verifiable protocol to compare, unify, and extract knowledge from existing studies. -*Methods and data are not commonly shared.* It is a general custom to develop ad-hoc scripts and keep them private or use tools as black-box statistical machines. In either case, we cannot access to the statistical algorithm, verify, and re-use it. +*Methods and data are not commonly shared.* It is a general custom to develop ad-hoc scripts and keep them private or use tools as black-box statistical machines. In either case, we cannot access the statistical algorithm, verify it, and re-use it. Data analysis patterns are packaged to be easily inspected, automated, and shared.
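To make "packaged" concrete, here is a minimal sketch, added only for illustration and not taken from the chapter, of what such a codified solution could look like in Python with SciPy. The function name and the choice of a Mann-Whitney U test paired with Cliff's delta are assumptions of this example, not a prescription.

```python
# Illustrative sketch (not from the chapter): a data analysis "pattern"
# packaged as a small, documented, reusable function rather than a private
# ad-hoc script. Assumes Python with SciPy installed.
from scipy import stats

def compare_metric(group_a, group_b):
    """Compare a metric (e.g. defects per file) between two groups of files.

    Returns the two-sided Mann-Whitney U p-value and Cliff's delta, so that
    every reuse of the pattern reports the same measures.
    """
    _, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
    greater = sum(a > b for a in group_a for b in group_b)
    smaller = sum(a < b for a in group_a for b in group_b)
    delta = (greater - smaller) / float(len(group_a) * len(group_b))
    return p_value, delta

# Example reuse on two made-up samples of post-release defects per file.
print(compare_metric([3, 0, 2, 5, 1, 4], [1, 0, 0, 2, 1, 0]))
```

Because the analysis is an inspectable function rather than a black box, it can be verified, automated, and re-used on comparable data.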
-*Tool-driven research has some known risks.* Anyone can easily download statistical tools from the Internet and perform sophisticated statical analyses. The Turin award Butler Lampson [Lampson, 1967] warns though: ''For one unfamiliar with the niceties of statistical analysis it is difficult to view with any feeling other than awe the elaborate edifice which the authors have erected to protect their data from the cutting winds of statistical insignificance.'' A catalogue of data analysis patterns helps guide researchers in the selection of appropriate analysis instruments. +*Tool-driven research has some known risks.* Anyone can easily download statistical tools from the Internet and perform sophisticated statistical analyses. The Turing award winner Butler Lampson [Lampson, 1967] warns, though: ''For one unfamiliar with the niceties of statistical analysis it is difficult to view with any feeling other than awe the elaborate edifice which the authors have erected to protect their data from the cutting winds of statistical insignificance.'' A catalogue of data analysis patterns helps guide researchers in the selection of appropriate analysis instruments. *Analysis can be easily biased by the human factor.* Reviewing papers on machine learning for defect prediction, Martin Shepperd, David Bowes, and Tracy Hall [Shepperd et al., 2014] analysed more than 600 samples from the highest quality studies on defect prediction to determine what factors influence predictive performance and found that ''it matters more who does the work than what is done.'' -This incredible result urges the use of data analysis patterns to make a solution independent from the researchers who conceived it. +This incredible result urges the use of data analysis patterns to make a solution independent of the researchers who conceived it. ##Building remedies for data analysis in software engineering research -As in any research field, needs trigger opportunities and challenge researchers. Today, we are called to synthesise our methods of analysis [Johnson et al. 2012] and a couple of examples of design patterns are already available [Russo, 2013]. We need more though. The scikit-learn initiative [http://scikit-learn.org/stable/index.html] can help software engineers in the case they need to solve problems with data mining, i.e. computational process of discovering patterns in data sets. -The project provides on-line access to a wide range of state-of-the-art tools for data analysis as codified solutions. Each solution comes with a short rationale of use, a handful set of algorithms implementing it, and a set of application examples. Fig. 1 illustrates how we can find the right estimator for a machine learning problem. +As in any research field, needs trigger opportunities and challenge researchers. Today, we are called to synthesise our methods of analysis [Johnson et al. 2012] and a couple of examples of design patterns are already available [Russo, 2013]. We need more, though. The scikit-learn initiative [http://scikit-learn.org/stable/index.html] can help software engineers when they need to solve problems with data mining, i.e., the computational process of discovering patterns in data sets. +The project provides on-line access to a wide range of state-of-the-art tools for data analysis as codified solutions. Each solution comes with a short rationale of use, a handful of algorithms implementing it, and a set of application examples.
Fig. 1 illustrates how we can find the right estimator for a machine learning problem. ![](ml_map.png)
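As a concrete illustration of such a codified solution, here is a minimal sketch; it assumes a recent scikit-learn release, and the data is synthetic, standing in only for real software engineering measurements such as per-file metrics labelled as defective or not.

```python
# Illustrative sketch (not from the chapter): applying one of scikit-learn's
# codified solutions and pairing it with a measure of success.
# Assumes a recent scikit-learn release is installed.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for "metrics vs. defectiveness" data: 200 files, 6 metrics.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# The estimator is the reusable, codified solution; cross-validation supplies
# the measure of success that a data analysis pattern should carry with it.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print("Mean accuracy over 5 folds: %.2f" % scores.mean())
```

The cross-validated score plays the role of the pattern's measure of success: every reuse of the estimator on comparable data reports a comparable figure.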
@@ -42,18 +42,18 @@ The project provides on-line access to a wide range of state-of-the-art tools fo How can we import these or similar tools in the software engineering context? We need first to identify the requirements for a data analysis pattern in software engineering. -In our opinion, a data analysis pattern shall be: -- A solution to a recurrent a software engineering problem +In our opinion, a data analysis pattern shall be: +- A solution to a recurrent software engineering problem - Re-usable in different software engineering contexts -- Automatable (e.g., by coding algorithms of data analysis in some programming language ) +- Automatable (e.g., by coding algorithms of data analysis in some programming language ) - Actionable (e.g., the scikit-learn tools) -- Successful at a certain degree (e.g., by representing state-of-the-art of data analysis in software engineering) +- Successful at a certain degree (e.g., by representing state-of-the-art of data analysis in software engineering) Then the key steps to construct such pattern will include but be not restricted to: - Mining literature to extract candidate solutions - Identifying a common language to express a solution in a form that software engineers can easily understand and re-use. For instance, we can think of annotated UML or algorithm notation expressing the logic of the analysis -- Defining a measure of success for a solution -- Validating the candidate solutions by replications and community surveys to achieve consensus in the research community. +- Defining a measure of success for a solution +- Validating the candidate solutions by replications and community surveys to achieve consensus in the research community. Reflecting on the current situation, we also see the need to codify anti-patterns, i.e. what not to do in data analysis. Given the amount of evidence in our field, this must be a much easier task! @@ -61,12 +61,12 @@ Reflecting on the current situation, we also see the need to codify anti-pattern Christian Bird, Tim Menzies, and Thomas Zimmermann: 1st international workshop on data analysis patterns in software engineering (DAPSE 2013), in Proceedings of the 2013 International Conference on Software Engineering (ICSE '13). IEEE Press, Piscataway, NJ, USA, 1517-1518, 2013 [DAPSE2013] (https://www.conference-publishing.com/list.php?Event=ICSEWS13DAPSE) -Lampson Butler: A Critique of An Exploratory Investigation of Programmer Performance Under On-Line and Off-Line Conditions, IEEE Trans. Human Factors in Electronics HFE-8, 1, 48-51, 1967 +Lampson Butler: A Critique of An Exploratory Investigation of Programmer Performance Under On-Line and Off-Line Conditions, IEEE Trans. 
Human Factors in Electronics HFE-8, 1, 48-51, 1967 -Erich Gamma, Richard Helm, Ralph Johnson and John Vlissides: Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1995 +Erich Gamma, Richard Helm, Ralph Johnson and John Vlissides: Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1995 Pontus Johnson, Mathias Ekstedt, Ivar Jacobson: Where's the Theory for Software Engineering?, Software, IEEE , vol.29, no.5, 94-95, 2012 Barbara Russo: Parametric Classification over Multiple Samples, in Proceedings of 2013 1st International Workshop on Data Analysis Patterns in Software Engineering (DAPSE), May 21, 2013, San Francisco, CA, USA , IEEE, 23-25, 2013 [article] (http://www.inf.unibz.it/~russo/Publications/icsews13dapse-id6-p-16144-preprint.pdf) -Martin Shepperd, David Bowes, Tracy Hall, Researcher Bias: The Use of Machine Learning in Software Defect Prediction, IEEE Trans. on Software Engineering, vol. 40, no. 6, 603-616, 2014 +Martin Shepperd, David Bowes, Tracy Hall, Researcher Bias: The Use of Machine Learning in Software Defect Prediction, IEEE Trans. on Software Engineering, vol. 40, no. 6, 603-616, 2014 diff --git a/cabird/interviews.md b/cabird/interviews.md index 2a5ada5..f85ad1c 100644 --- a/cabird/interviews.md +++ b/cabird/interviews.md @@ -30,7 +30,7 @@ Once you have determined who you want to recruit for interviews, you need to con * Introduce yourself and explain your job or role. * Tell them what the goal of your research is and how conducting an interview with them will help you accomplish that goal. - * Describe how what you are doing can potentially benefit them. + * Describe how they can potentially benefit from what you are doing. * Explain how long you estimate that the interview will take. * Let them know if you'd like them to do anything to prepare for the interview. As an example, I once asked developers to open up a recent code review they had taken part in and look over it before I arrived. * Tell them how they were selected. Did you select them because they fit some criteria or were they selected at random? @@ -96,7 +96,7 @@ Finally, resist the temptation to apply quantitative methods to interview result Those who haven't conducted interviews before are often hesitant to try. You may feel more comfortable looking at raw data in the comfort of your own lab. Numbers can't put you on the spot or make you feel awkward. I can honestly say that I've learned more about how software engineering takes place by conducting interviews than I have through all of the other research methods I've used combined and I started with *much* less information about how to do it than is in this chapter. I hope this chapter has provided a few things to help you include interviews in your own research. ## References -* A. Hindle, C. Bird, T. Zimmermann, and N. Nagappan, "Relating requirements to implementation via topic analysis: do topics extracted from requirements make sense to managers and developers?," in Proceedings of the 28th IEEE international conference on software maintenance, 2012. +* A. Hindle, C. Bird, T. Zimmermann, and N. Nagappan, "Relating requirements to implementation via topic analysis: do topics extracted from requirements make sense to managers and developers?," in Proceedings of the 28th IEEE international conference on software maintenance, 2012. * Emerson R. 
Murphy-Hill, Thomas Zimmermann, Nachiappan Nagappan, "Cowboys, ankle sprains, and keepers of quality: how is video game development different from software development?" ICSE 2014 * Anja Guzzi, Alberto Bacchelli, Yann Riche, Arie van Deursen,"Supporting Developers' Coordination in The IDE" In Proceedings of CSCW 2015 (18th ACM conference on Computer-Supported Cooperative Work and Social Computing), pp. 518-532. 2015 * Singer, Leif, Fernando Figueira Filho, and Margaret-Anne Storey. "Software engineering at the speed of light: how developers stay current using twitter." Proceedings of the 36th International Conference on Software Engineering. ACM, 2014. diff --git a/just_herzig/bug_report_gotchas.md b/just_herzig/bug_report_gotchas.md index 2bc6c38..ab78965 100644 --- a/just_herzig/bug_report_gotchas.md +++ b/just_herzig/bug_report_gotchas.md @@ -1,44 +1,44 @@ # Gotchas from Mining Bug Reports - -Over the years, it has become common practice in empirical software engineering to mine data from version archives and bug databases to learn where bugs have been fixed in the past, or to build prediction models to find error prone code in the future. + +Over the years, it has become common practice in empirical software engineering to mine data from version archives and bug databases to learn where bugs have been fixed in the past, or to build prediction models to find error prone code in the future. Simplistically, one counts the number of fixes per code entity by mapping closed reports about bugs to their corresponding *fixing* code changes--typically one scans for commit messages that mention a bug report identifier. This however relies on three fundamental assumptions: - -* **The location of the defect is the part of the code where the fix is applied.** - However, this is not always true and bug fixes can have very different nature. Consider a method _M_ that parses a string containing email addresses and returns the email alias without the domain. However, _M_ crashes on strings containing no @ character at all. A simple fix is to check for @ characters before calling _M_. Although this resolves the issue, the method that was changed to apply this fix is not the one that is defective and might also be located in a different source file. +* **The location of the defect is the part of the code where the fix is applied.** + + However, this is not always true and bug fixes can have very different nature. Consider a method _M_ that parses a string containing email addresses and returns the email alias without the domain. However, _M_ crashes on strings containing no @ character at all. A simple fix is to check for @ characters before calling _M_. Although this resolves the issue, the method that was changed to apply this fix is not the one that is defective and might also be located in a different source file. -* **Issue reports that are flagged as BUGs describe real code defects.** +* **Issue reports that are flagged as BUGs describe real code defects.** - We rely on the fact that work items marked as BUGs are really referring to bugs. If this is not the case, the work item and its associated code changes are considered bug fixes although they might implement a new feature or performing a refactoring. + We rely on the fact that work items marked as BUGs are really referring to bugs. If this is not the case, the work item and its associated code changes are considered bug fixes although they might implement a new feature or performing a refactoring. 
-* **The change to the source code is atomic, meaning it's the minimal patch that fixes the bug without performing any other tasks.** +* **The change to the source code is atomic, meaning it's the minimal patch that fixes the bug without performing any other tasks.** Similar to the assumption above, we treat code changes associated to work items as a unity. Thus, we assume that all code changes applied in the same commit serve the same purpose defined by the associated work item. But what if a developer applies code changes serving multiple purposes together? Even if she stated the fact in the commit message, e.g. "Fixing bug #123, implementing improvement #445 and refactoring module X for better maintainability", we do not know which code change belongs to which work item nor which code change implements the bug fix. -Violating the first assumption yields models that predict bug fixes rather than code defects--but there is very little we can do as bugs themselves are not marked in the process and as bug fixes can be assumed at least close to the actual defect. Therefore, we will focus on the remaining two assumptions. +Violating the first assumption yields models that predict bug fixes rather than code defects--but there is very little we can do as bugs themselves are not marked in the process and as bug fixes can be assumed at least close to the actual defect. Therefore, we will focus on the remaining two assumptions. Violating the two latter assumptions however would lead to noise and bias in our datasets. In fact, if we are unable to separate bug fixes from code changes and if code changes are frequently tangled with other non corrective code changes, then counting such tangled bug fixes effectively means counting changes or churn, rather than code defects or bug fixes. A serious problem if we want to predict quality rather than churn. Thus, violating the latter two assumption imposes serious risks and might lead to imprecise code quality models. - -## Do bug reports describe code defects? - +## Do bug reports describe code defects? + + -There is no general answer to this question. Fundamentally, the answer depends on the individual bug reports filed against a system and on the more fundamental question: what is a bug? If we would ask this question to five developers, we would probably get six different answers or most of them would answer: "this depends". Asking 10 people on the street, including people not being software development engineers, would add even more diversity to the set of answers. +There is no general answer to this question. Fundamentally, the answer depends on the individual bug reports filed against a system and on the more fundamental question: what is a bug? If we would ask this question to five developers, we would probably get six different answers or most of them would answer: "this depends". Asking 10 people on the street, including people not being software development engineers, would add even more diversity to the set of answers. -#### It's the user that defines the work item type +#### It's the user that defines the work item type @@ -46,11 +46,11 @@ Although the question of a definition of a bug seems unrelated, we should bare i -Thus, the threat of violating our second assumption (bug reports describe real code defects) is high, depending on the data source and who is involved in creating bug reports. 
+Thus, the threat of violating our second assumption (bug reports describe real code defects) is high, depending on the data source and who is involved in creating bug reports. -#### An example +#### An example @@ -68,7 +68,7 @@ Studies have shown that fields of issue reports change very rarely, once an init As a consequence, data analysts should not blindly rely on user input data, especially if the data might stem from non-experts or data sources that reflect different points of view. It is important keep in mind that the data to be analyzed is most likely created for purposes other than mining and analyzing the data. Going back to our assumption on bug reports and the types of work items, we should be careful about simply using the data without checking if our second assumption holds. If it does not hold, it is good practice to estimate the extend of the data noise and whether it will significantly impact our analyses. - + #### How big is the problem of "false bugs"? @@ -76,13 +76,13 @@ As a consequence, data analysts should not blindly rely on user input data, espe In a study Herzig et al. [1] investigated 7,000 bug reports from five open source projects (HTTPClient, Jackrabbit, Lucene-Java, Rhino and Tomcat5) and manually classified their report categories to find out how much noise exists in these frequently used bug report data sets and what impact this noise has on quality models. - -The authors found that **issue report types are unreliable**. In the five bug databases investigated, more than 40% of issue reports were categorized inaccurately. Similarly, the study showed that **every third bug report does not refer to code defects**. In consequence, the validity of studies regarding the distribution and prediction of bugs in code is threatened. Assigning the noisy original bug data to source files to count the number of defects per file, Herzig et al. [1] found that 39% of **files were wrongly marked as fixed** although *never* containing a single code defect. + +The authors found that **issue report types are unreliable**. In the five bug databases investigated, more than 40% of issue reports were categorized inaccurately. Similarly, the study showed that **every third bug report does not refer to code defects**. In consequence, the validity of studies regarding the distribution and prediction of bugs in code is threatened. Assigning the noisy original bug data to source files to count the number of defects per file, Herzig et al. [1] found that 39% of **files were wrongly marked as fixed** although *never* containing a single code defect. -## Do developers apply atomic changes? +## Do developers apply atomic changes? @@ -94,15 +94,15 @@ The third assumption commonly made when building quality models based on softwar -Version control commits are snapshots in time. Their patches summarize code changes that have been applied since the previous commit. However, this perspective disregards the purpose of these changes and their fine-grained dependencies, e.g. in which order they were performed. +Version control commits are snapshots in time. Their patches summarize code changes that have been applied since the previous commit. However, this perspective disregards the purpose of these changes and their fine-grained dependencies, e.g. in which order they were performed. 
-A developer fixing a code defect by overwriting an entire method and replacing its content with a faster algorithm that also fixes the defect cannot be separated into a set of code changes fixing the defect and applying an improvement. And even if manual separation is possible, e.g. fist fixing a code defect before renaming variables, there is little to no motivation and benefit for engineers to work this way, e.g. to create local branches to fix simple bugs. This is very similar to the reasons of noisy bug report types. An engineer's focus lies on completing a task and to get work done---that is what she is paid for. And for the developer, there is no (imminent) benefit working in such a way that data analysts and their models gain higher accuracy. +A developer fixing a code defect by overwriting an entire method and replacing its content with a faster algorithm that also fixes the defect cannot be separated into a set of code changes fixing the defect and applying an improvement. And even if manual separation is possible, e.g. first fixing a code defect before renaming variables, there is little to no motivation and benefit for engineers to work this way, e.g. to create local branches to fix simple bugs. This is very similar to the reasons of noisy bug report types. An engineer's focus lies on completing a task and to get work done---that is what she is paid for. And for the developer, there is no (imminent) benefit working in such a way that data analysts and their models gain higher accuracy. -Thus, even if developers do work sequentially on work items and development tasks, we rely on them to group the applied changes in provide the appropriate snapshots (commits). At the same time, we provide little to no motivation for developers to perform this extra (and potentially time consuming) work. +Thus, even if developers do work sequentially on work items and development tasks, we rely on them to group the applied changes in provide the appropriate snapshots (commits). At the same time, we provide little to no motivation for developers to perform this extra (and potentially time consuming) work. @@ -110,7 +110,7 @@ Thus, even if developers do work sequentially on work items and development task -Each *tangled change* threats the validity of models relying on the assumption that code changes are atomic entities, e.g. assigning the number of bugs by identifying bug references in commit messages and assigning the fixed bug to all files touched in the assumed atomic commit. +Each *tangled change* threats the validity of models relying on the assumption that code changes are atomic entities, e.g. assigning the number of bugs by identifying bug references in commit messages and assigning the fixed bug to all files touched in the assumed atomic commit. @@ -118,7 +118,7 @@ Among other, Herzig and Zeller [3] showed that the bias caused by tangled change more than 7,000 individual change sets and checked whether they address multiple (tangled) issue reports. Their results show that up to 12% of commits are tangled and cause false associations between bug reports and source files. Further, Herzig [4] showed that tangled changes usually combine two or three development tasks at a time. The same study showed that between 6% and 50% of the most defect prone files do not belong in this category, because they were falsely associated with bug reports. 
Up to 38% of source files had a false number of bugs associated with them and up to 7% of files originally associated with bugs never were part of any bug fix. - + ## In summary @@ -132,7 +132,7 @@ Mining bug reports and associating bug fixes to files to assess the quality of t -[1] K. Herzig, S. Just, and A. Zeller, “It’s not a Bug, It’s a Feature: How Misclassification Impacts Bug Prediction,” in Proceedings of the 2013 international conference on software engineering, Piscataway, NJ, USA, 2013, pp. 392-401. +[1] K. Herzig, S. Just, and A. Zeller, “It’s not a Bug, It’s a Feature: How Misclassification Impacts Bug Prediction,” in Proceedings of the 2013 international conference on software engineering, Piscataway, NJ, USA, 2013, pp. 392-401. @@ -140,13 +140,12 @@ Mining bug reports and associating bug fixes to files to assess the quality of t -[3] K. Herzig and A. Zeller, “The Impact of Tangled Code Changes,” in Proceedings of the 10th working conference on mining software repositories, Piscataway, NJ, USA, 2013, pp. 121-130. +[3] K. Herzig and A. Zeller, “The Impact of Tangled Code Changes,” in Proceedings of the 10th working conference on mining software repositories, Piscataway, NJ, USA, 2013, pp. 121-130. -[4] K. Herzig, “Mining and Untangling Change Genealogies,” PhD Thesis, 2012. +[4] K. Herzig, “Mining and Untangling Change Genealogies,” PhD Thesis, 2012. [5] Antoniol et al., "Is it a bug or an enhancement? A text-based approach to classify change requests" In Proceedings of the 2008 conference of the center for advanced studies on collaborative research: meeting of minds (New York, NY, USA, 2008), CASCON’08, ACM, pp. 23:304–23:318. - diff --git a/sback/summarizing-unstructured-data.md b/sback/summarizing-unstructured-data.md index bce811d..1d0907f 100644 --- a/sback/summarizing-unstructured-data.md +++ b/sback/summarizing-unstructured-data.md @@ -124,7 +124,7 @@ like to have a categorization that looks like the following: Is it possible to automatically tag each line with the language it is written in? Yes! Researchers developed a number of methods to classify text in -different categories. A classical example is the case of classifying a whole +different categories. A classic example is the case of classifying a whole email into "spam" or legitimate. In the case of development emails, we developed simple methods to recognize lines of code from other text [2] and more complex ones to recognize more complex languages, such as those used in @@ -201,9 +201,3 @@ Extracting Source Code from E-Mails. ICPC 2010: 24-33 [3] Alberto Bacchelli, Tommaso Dal Sasso, Marco D'Ambros, Michele Lanza: Content classification of development emails. ICSE 2012: 375-385 - - - - - - diff --git a/trevorcarnahan/Reducing Time to Insight.md b/trevorcarnahan/Reducing Time to Insight.md index 3110be7..a773c80 100644 --- a/trevorcarnahan/Reducing Time to Insight.md +++ b/trevorcarnahan/Reducing Time to Insight.md @@ -4,7 +4,7 @@ Insight Production, preferably with data (I jest), is a big part of my role at Microsoft. And if you're reading this chapter, I'll assert you are also an Insight Producer, or well on your way to becoming one. -Since insight is the product, let's start with a simple definition. The classical definition of insight is new understanding or human comprehension. That works well for the result of a science experiment and answering a research question. Industry, with its primordial goal of growing results, focuses my definition to new _actionable_ comprehension for _improved results_. 
This constrains the questions of ~~research~~ insight search to subjects and behaviors known to contribute business value. +Since insight is the product, let's start with a simple definition. The classic definition of insight is new understanding or human comprehension. That works well for the result of a science experiment and answering a research question. Industry, with its primordial goal of growing results, focuses my definition to new _actionable_ comprehension for _improved results_. This constrains the questions of ~~research~~ insight search to subjects and behaviors known to contribute business value. Let's look at a more concrete example. As a purveyor and student of software engineering insights, I know large software organizations value code velocity, sometimes called speed or productivity and quality as a control. Organizations also value engineer satisfaction if you're lucky. So I seek to understand existing behaviors and their contribution to those areas. I am continuously hunting for new or different behaviors that can dramatically improve those results. diff --git a/zhang-xie/SoftwareAnalyticsInPractice.md b/zhang-xie/SoftwareAnalyticsInPractice.md index 1671ace..876e604 100644 --- a/zhang-xie/SoftwareAnalyticsInPractice.md +++ b/zhang-xie/SoftwareAnalyticsInPractice.md @@ -2,7 +2,7 @@ ## Dongmei Zhang and Tao Xie -Various types of data naturally exist in the software development process, such as source code, bug reports, check-in histories, and test cases. As software services and mobile applications are widely available in the Internet era, a huge amount of program runtime data, e.g., traces, system events, and performance counters, as well as users’ usage data, e.g., usage logs, user surveys, online forum posts, blogs, and tweets, can be readily collected. +Various types of data naturally exist in the software development process, such as source code, bug reports, check-in histories, and test cases. As software services and mobile applications are widely available in the Internet era, a huge amount of program runtime data (e.g., traces, system events, and performance counters), as well as users’ usage data (e.g., usage logs, user surveys, online forum posts, blogs, and tweets) can be readily collected. Considering the increasing abundance and importance of data in the software domain, **_software analytics_** [4,5] is to utilize data-driven approaches to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for completing various tasks around software systems, software users, and software development process. @@ -12,13 +12,13 @@ In this chapter, we discuss software analytics from six perspectives. We also sh ### Six perspectives of software analytics -The six perspectives of software analytics include research topics, target audience, input, output, technology pillars, and connection to practice. While the first four perspectives are easily accessible from the definition of software analytics, the last two need some elaboration. +The six perspectives of software analytics include research topics, target audience, input, output, technology pillars, and connection to practice. While the first four perspectives are easily accessible from the definition of software analytics, the last two need some elaboration. As stated in the definition, software analytics focuses on software systems, software users, and software development process. 
From the research point of view, these focuses constitute three *research topics* – software quality, user experience, and development productivity. As illustrated in the aforementioned examples, the variety of data *input* to software analytics is huge. Regarding the insightful and actionable *output*, it often requires well-designed and complex analytics techniques to create such output. It should be noted that the *target audience* of software analytics spans across a broad range of software practitioners, including developers, testers, program managers, product managers, operation engineers, usability engineers, UX designers, customer-support engineers, and management personnel, etc. **_Technology Pillars_**. In general, primary technologies employed by software analytics include large-scale computing (to handle large-scale datasets), analysis algorithms in machine learning, data mining and pattern recognition, etc. (to analyze data), and information visualization (to help with analyzing data and presenting insights). While the software domain is called the vertical area in which software analytics focuses upon, these three technology areas are called the horizontal research areas. Quite often, in the vertical area, there are challenges that cannot be readily addressed using the existing technologies in one or more of the horizontal areas. Such challenges can open up new research opportunities in the corresponding horizontal areas. -**_Connection to Practice_**. Software analytics is naturally tied with practice, with four *real* elements. +**_Connection to Practice_**. Software analytics is naturally tied with practice, with four *real* elements. **Real data**. The data sources under study in software analytics come from real-world settings including both industrial proprietary settings and open source settings. For example, open-source communities provide a huge data vault of source code, bug history, and check-in information, etc.; and better yet, the vault is active and evolving, which makes the data sources fresh and live. @@ -32,15 +32,15 @@ As stated in the definition, software analytics focuses on software systems, sof The connection-to-practice nature opens up great opportunities for software analytics to make practice impact with focus on the *real* settings. Furthermore, there is huge potential for the impact to be broad and deep because software analytics spreads across the areas of system quality, user experience, and development productivity, etc. -Despite these opportunities, there are still significant challenges when putting software analytics technologies into real use. For example, how to ensure the output of the analysis results to be insightful and actionable? How do we know whether practitioners are concerned about the questions answered with the data? How do we evaluate our analysis techniques in real-world settings? Next we share some of our learnings from working on various software analytics projects [1, 2, 3, 5]. +Despite these opportunities, there are still significant challenges when putting software analytics technologies into real use. For example, how to ensure the output of the analysis results to be insightful and actionable? How do we know whether practitioners are concerned about the questions answered with the data? How do we evaluate our analysis techniques in real-world settings? Next we share some of our learnings from working on various software analytics projects [1, 2, 3, 5]. **Identifying essential problems**. 
Various types of data are incredibly rich in the software domain, and the scale of data is significantly large. It is often not difficult to grab some datasets, apply certain data analysis techniques, and obtain some observations. However, these observations, even with good evaluation results from the data-analysis perspective, may not be useful for accomplishing the target task of practitioners. It is important to first identify essential problems for accomplishing the target task in practice, and then obtain the right data sets suitable to help solve the problems. These essential problems are those solving which can substantially improve the overall effectiveness of tackling the task, such as improving software quality, user experience, and practitioner productivity. **Usable system built early to collect feedback**. It is an iterative process to create software analytics solutions to solve essential problems in practice. Therefore, it is much more effective to build a usable system early on in order to start the feedback loop with the software practitioners. The feedback is often valuable for formulating research problems and researching appropriate analysis algorithms. In addition, software analytics projects can benefit from early feedback in terms of building trust between researchers and practitioners, as well as enabling the evaluation of the results in real-world settings. -**Using domain semantics for proper data preparation**. Software artifacts often carry semantics specific to the software domain; therefore, they cannot be simply treated as generic data such as text and sequences. For example, callstacks are sequences with program execution logic, and bug reports contain relational data and free text describing software defects, etc. Understanding the semantics of software artifacts is a prerequisite for analyzing the data later on. In the case of StackMine [2], there was a deep learning curve for us to understand the performance traces before we could conduct any analysis. +**Using domain semantics for proper data preparation**. Software artifacts often carry semantics specific to the software domain; therefore, they cannot be simply treated as generic data such as text and sequences. For example, callstacks are sequences with program execution logic, and bug reports contain relational data and free text describing software defects, etc. Understanding the semantics of software artifacts is a prerequisite for analyzing the data later on. In the case of StackMine [2], there was a deep learning curve for us to understand the performance traces before we could conduct any analysis. -In practice, understanding data is three-fold: data interpretation, data selection, and data filtering. To conduct data interpretation, researchers need to understand basic definitions of domain-specific terminologies and concepts. To conduct data selection, researchers need to understand the connections between the data and the problem being solved. To conduct data filtering, researchers need to understand defects and limitations of existing data to avoid incorrect inference. +In practice, understanding data is three-fold: data interpretation, data selection, and data filtering. To conduct data interpretation, researchers need to understand basic definitions of domain-specific terminologies and concepts. To conduct data selection, researchers need to understand the connections between the data and the problem being solved. 
To conduct data filtering, researchers need to understand defects and limitations of existing data to avoid incorrect inference. **Scalable and customizable solutions**. Due to the scale of data in the real-world settings, scalable analytic solutions are often required to solve essential problems in practice. In fact, scalability may directly impact the underlying analysis algorithms for problem solving. Customization is another common requirement to incorporate domain knowledge due to the variations of software and services. The effectiveness of solution customization in analytics tasks can be summarized as (1) filtering noisy and irrelevant data, (2) specifying between data points their intrinsic relationships that cannot be derived from the data itself, (3) providing empirical and heuristic guidance to make the algorithms robust against biased data. The procedure of solution customization can be typically conducted in an iterative fashion via close collaboration between software analytics researchers and practitioners. @@ -59,9 +59,3 @@ In practice, understanding data is three-fold: data interpretation, data selecti [5] D. Zhang, S. Han, Y. Dang, J.-G. Lou, H. Zhang, and T. Xie. Software Analytics in Practice. IEEE Software, Special Issue on the Many Faces of Software Analytics, 30(5), pages 30-37, September/October 2013. [6] K. Glerum, K. Kinshumann, S. Greenberg, G. Aul, V. Orgovan, G. Nichols, D. Grant, G. Loihle, and G. Hunt. Debugging in the (very) large: ten years of implementation and experience. In Proc. SOSP 2009, pages 103-116, 2009. - - - - - -