Predictive modeling tasks handled by Clinical Analytics fall into one of two categories: classification and regression. Classification answers the question, “What group of patients does this individual belong to?” Its outcome is a categorical - quite often, binary - variable. Regression answers the question, “How much or how many?” Its outcome is a numerical variable. While the outcomes of these two types of mathematical models are different, the underlying methodologies are very similar and are considered in Using logistic regression for classification problems and Regression. A combination approach may be appropriate for problems that require quantifying events of interest: first, identify (or classify) the potential outcomes, then evaluate the impact of each outcome separately. In this case, a classification algorithm should be followed by a regression; more often than not, only the positive outcome is of interest to us.
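The two-stage approach described above can be sketched in a few lines of R. This is a minimal illustration on simulated data, not a production recipe; the variable names (`event`, `cost`) and the simulated relationship are assumptions made for the example.

```r
# Two-stage "classify, then quantify" sketch on simulated data.
set.seed(42)
n     <- 500
x     <- rnorm(n)                       # a single predictor
event <- rbinom(n, 1, plogis(x))        # binary outcome: did the event occur?
cost  <- ifelse(event == 1,             # impact observed only for positives
                100 + 25 * x + rnorm(n, sd = 5), 0)

# Stage 1: classification -- estimate the probability of the positive outcome.
clf <- glm(event ~ x, family = binomial)

# Stage 2: regression -- quantify the impact, fitted on positive outcomes only.
reg <- lm(cost ~ x, data = data.frame(cost, x)[event == 1, ])

# The expected impact for a new observation combines both models:
# P(event) * E[cost | event].
newdata       <- data.frame(x = 0.5)
expected_cost <- predict(clf, newdata, type = "response") * predict(reg, newdata)
```

Fitting the regression only on the positive outcomes reflects the common situation in which the impact of the negative outcome is zero by definition.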
An important - though sometimes overlooked - step in streamlining the research and development (R&D) methodology is agreeing on standardized terminology for the research process. A well-developed glossary of terms (see Preferred terminology) can ensure that identical tasks or processes are described in identical terms, a concept similar to “data integrity” as defined by the principles of database design.
Anecdotal evidence suggests that a data scientist (whatever this term currently entails) spends \(90\%\) of her time scrubbing the data and only \(10\%\) of it doing what she learned in her school’s Advanced Scientific Fortunetelling program. Like an experienced cook who appreciates the role of quality ingredients in meal preparation, a sensible data scientist may be able to achieve good results by simply ensuring that the data ingested into her algorithm is clean. Agreed-upon procedures (AUP) for data cleansing and storage are covered in Data scrubbing and Input data storage.
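As a taste of what such agreed-upon procedures might cover, the following sketch shows a few routine scrubbing steps in base R. The data frame and its columns are invented for illustration; the actual AUP are detailed in the chapters referenced above.

```r
# A minimal data-scrubbing sketch; column names and values are illustrative.
raw <- data.frame(
  patient_id = c(" 001", "002", "002", "003"),
  charge     = c("100.5", "75", "75", "NA"),
  stringsAsFactors = FALSE
)

clean <- raw
clean$patient_id <- trimws(clean$patient_id)           # strip stray whitespace
clean$charge     <- suppressWarnings(                  # coerce types; the
  as.numeric(clean$charge))                            #   string "NA" -> NA
clean <- clean[!duplicated(clean), ]                   # drop verbatim duplicates
clean <- clean[!is.na(clean$charge), ]                 # drop records missing the outcome
```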
Consistent, scalable development of reliable and reusable software is an important part of introducing the developed methodologies into production. A collection of good coding practices relevant to predictive analytic development is presented in Coding practices. A good foundation for developing robust code and assuring business continuity includes
- proper revision control practices (Revision control),
- accessible and consistently named code repositories and development sandboxes (Code storage), and
- readable and transparent code modules (Naming conventions, Writing quality code) in R (R).
Once an algorithm has been prototyped and implemented to the developers’ satisfaction, the responsibility for putting it into everyday use shifts to the production team. The process of testing, validation and verification can be drawn out and contentious unless the rules of the game are well defined in advance. Efficient practices for lightening the burden on both the original developers and the QA team are described in Testing and QA.
Finally, once the results have been validated, consistent and easy-to-understand presentation can facilitate their acceptance by the intended audience. The appropriate standards are covered in Model validation.
Examples contained in this manual are based on real data; however, to protect potentially sensitive information, the numbers have been modified and the names of the entities involved obscured where deemed necessary.