is defined as
, where
is the number of times outcome
occurred within a series of length
. Thus, the values of zero (0) and one (1) denote the probabilities of impossible and certain events, respectively. Conditional probabilities, denoted
represent a similar concept with the addition that the probability of
is conditioned upon outcome
occurring.
The concept of likelihood is related to that of conditional probability. However, while a probability distribution is considered to be a function of the first argument, e.g.,
above, a likelihood function is considered to be a function of its second argument, e.g.,
above. Further, any positive proportionality constant may relate a likelihood function to its corresponding conditional probability. Thus, likelihoods are restricted to the semi-open interval
.
Both probabilities and likelihoods recur throughout statistical models. Consequently, those quantities must be represented by the type system in computational statistical models. Commonly, this is done by using a suitable native floating point type (e.g., double). One specific drawback with this approach is that the compiler cannot enforce the correct usage of types and confusion may arise between probabilities/likelihoods and other floating point types.
More insidious, however, is the confusion likely to arise when the computations involve both these quantities and their logarithms. Indeed, the solution to many statistical models involves logarithms of probabilities and likelihoods more than the natural quantities. However, models that require both are not uncommon.
If a native floating point type is used for probabilities and likelihoods, the programmer must not only distinguish those from other floating point quantities (e.g., parameters), but also must distinguish them from their logarithms. Subtle mistakes easily result from forgetting which domain (linear or logarithm) a quantity refers to.
Consider the following scientific question: what is the long-term trend in global climate? Clearly, an accurate answer to this question has huge implications in a variety of economic, political, scientific, and humanitarian domains. The general approach for determining the answer involves postulating a model structure, specifying an hypothesis concerning the relevant parameters governing the climate dynamics, and assessing the support given to those by some relevant data set. The most general means of accomplishing this assessment is by calculating the likelihood of the model/hypothesis given the data (Edwards 1992), a method used throughout statistical analysis and scientific modeling in all fields.
For the example posed here, one dataset of interest is the Global Historical Climatology Network (GHCN) data base. This contains historical temperature, precipitation, and pressure data for thousands of land stations worldwide, with several thousand extending back to 1950. It is used operationally by the National Climate Data Center to monitor long-term trends in temperature and precipitation. It has also been employed in several international climate assessments, including the Intergovernmental Panel on Climate Change 4th Assessment Report, the Arctic Climate Impact Assessment, and the "State of the Climate" report published annually by the Bulletin of the American Meteorological Society.
The computational challenge embedded within this example includes more than the difficulty, which is considerable, of developing reliable models of the underlying climate and sampling processes. It also includes the problems of evaluating the support across very large data sets. For example, a recent (2007-08-15) subset of the GHCN data base includes 590,543 mean temperature records. Consequently, the likelihood of a model under investigation, which often involves a product of probablities or likelihoods across the set of records, may be on the order of
or less. Clearly, this is beyond the range of floating point representations typically available, and some specialized computation is required to solve the problem.
The following code blocks, taken from example0a.cpp, illustrate a traditional solution to the problem based on accumulating the support as a sum of logarithms of the likelihoods instead of as a product of likelihoods. First, define a few types to simplify the subsequent calculation. Note that the implementations of these is not given here as they do not help motivate this example.
#include <cmath> #include <iostream> #include <vector> #include <assert.h> class GHCN_record { /* ... */ }; typedef std::vector<GHCN_record> GHCN_data_set; class GlobalClimateModel { public: virtual ~GlobalClimateModel () { } // note: return probablity (not log(likelihood)) of data virtual double probability (const GHCN_record&) const = 0; }; class GlobalClimateModelFactory { public: GlobalClimateModel* operator () (const char*) const; }; // read GHCN data set std::istream& operator >> (std::istream&, GHCN_data_set&);
Given these type definitions, the following code illustrates the traditional method of calculating the support for a specific climate model. Note that there are a number of locations in the code where external knowledge of the semantics of the data types must be kept in mind but that there is no mechanism whereby that can be encoded directly within the program.
int main () { GlobalClimateModelFactory climate_model_factory; GlobalClimateModel* climate_model (climate_model_factory("GCM")); GHCN_data_set data; std::cin >> data; assert (data.size() == 590543); // XXX - illustrate large size of dataset // Note: lnL represents // log(likelihood). The initial // condition is required to accumulate // the sum of log(likelihood) (rather // than the product of likelihood) to // avoid underflow with such a large // dataset double lnL (0); for (GHCN_data_set::const_iterator i = data.begin(); i != data.end(); ++i) { GHCN_record record = *i; // Note: p represents the probability // (not log(probability)) of the // record. double p (climate_model->probability(record)); // Note: remember the following: // - likelihood is proportional to probability // - product of L <=> sum of log(L) // - initial condition for lnL above is // appropriate for adding log(L) here lnL += log(p); } // Note: remember to convert back to // the likelihood scale (what happens // with underflow? overflow?) std::cout << "likelihood: " << exp(lnL) << std::endl; return 0; }
While it may appear to be a minor problem to keep track of the fact that the several double variables have different semantics (i.e., representing probability, likelihood, and log(likelihood) values), that is largely an artifact of this simple example in which those statements are juxtaposed. In a complete solution to the example problem, these statements could easily become spread about a much larger code base and even be maintained by different people. In such cases it becomes very difficult to enforce the semantics because they cannot be represented by the type system.
Using the capabilities of Boost.Probability, the previous block of code can be greatly simplified by directly encoding the intent (see example0b.cpp). Notice that the semantics of each value, including whether it is part of the logarithm domain or not, is represented by its type. Further, operations on the types (e.g., multiplication across likelihoods) can be expressed directly regardless of the actual domain of the types involved. The library guarrantees that appropriate conversions occur where necessary.
int main () { typedef boost::probabilities::linear_domain linear_domain; GlobalClimateModelFactory climate_model_factory; GlobalClimateModel* climate_model = climate_model_factory("GCM"); GHCN_data_set data; std::cin >> data; assert (data.size() == 590543); // XXX - illustrate large size of dataset boost::log_likelihood lnL; for (GHCN_data_set::const_iterator i = data.begin(); i != data.end(); ++i) { GHCN_record record = *i; boost::probability p (climate_model->probability(record)); lnL *= p; } std::cout << "likelihood: " << boost::probabilities::domain_cast<linear_domain>(lnL) << std::endl; return 0; }
It should be clear that in real-world instances of this, where the code is much more complex, distributed across many compilation units, and maintained by different people, the latter implementation offers significant advantages over the former in terms of clarity, safety, and maintainability.
double would require many tricks and would likely be difficult to maintain and verify.In contrast, the following example illustrates how straightforward it can be to construct a generic algorithm for solving a large class of probability and likelihood models. The properties of Boost.Probability are used to full advantage here. Importantly, despite its simplicity, the algorithm itself is widely useful in its own right for solving problems arising from practical situations.
A generic algorithm for calculating the most likely paths through a graph would greatly simplify application development. However, to be truly useful it must exhibit several important properties.
double, is difficult and convoluted at best, and impossible at worst. For example, satisfying the first criterion may require representing probabilities and likelihoods in the log domain, whereas this may not be compatible with the third criterion. Further, confusion may arise with respect to the semantics of the edge weights and the vertex distances; are they logarithms or not? Such uncertainty breeds errors, especially when, as is very likely, those developing the generic algorithm are not the same as those developing the graph representation.
The following generic algorithm, most_likely_paths(), illustrates a solution to these problems based on the capabilities of Boost.Probability. Unlike similar algorithms written using native types, e.g., double, to represent the probability and likelihood quantities, this is safe and correct for the full range of meaningful template arguments, because the Boost.Probability types ensure type safety. Thus, this algorithm may be maintained independently of application code that uses it.
First, include the relevant definitions from Boost.Probability and for the graph algorithm used to implement most_likely_paths().
#include <boost/probability.hpp> #include <boost/graph/dag_shortest_paths.hpp>
Define the operation used to combine vertex distances with edge weights. Because probabilities (and likelihoods) along a path combine mutiplicatively, we define a multiply operation. Note that this will work correctly regardless of whether the actual types being combined are quantities in the linear or logarithm domain, because operator*() has consistent semantics in either case, as well as for mixtures of the two. Were the types from Boost.Probability not being used, it would be impossible to define a single operation that would work under all of these conditions.
template <typename T, typename U> struct mpy { T operator () (const T& a, const U& b) const { return a * b; } };
The following is the actual generic algorithm for finding the most likely paths through a graph. It is implemented in terms of the shortest paths algorithm, dag_shortest_paths(), from the Boost Graph Library. Since we seek the most likely paths, the comparison operation must be greater than (>), not less than (<). Additionally, the identity and infinity values must suit the nature of the combine operation defined above and the directionality of the comparison. Note that these values are constructed from probabilities within the linear domain to ensure consistent semantics, irrespective of the actual domain of the types used to represent edge weights and vertex distances. Without the properties of the Boost.Probability types, specifying suitable initial conditions would be much more convoluted.
template <typename VertexListGraph, typename WeightMap, typename DistanceMap, typename PredecessorMap, typename ColorMap, typename IndexMap> inline void most_likely_paths (const VertexListGraph& g, typename graph_traits<VertexListGraph>::vertex_descriptor s, WeightMap weight, DistanceMap distance, PredecessorMap pred, ColorMap color, IndexMap id) { // XXX - assert that the graph is directed... typedef typename boost::property_traits<DistanceMap>::value_type distance_type; typedef typename boost::property_traits<WeightMap>::value_type weight_type; typedef std::greater<distance_type> compare_type; typedef detail::mpy<distance_type,weight_type> combine_type; distance_type identity (probability<linear_domain,typename distance_type::value_type>(1)); distance_type infinity (probability<linear_domain,typename distance_type::value_type>(0)); boost::dag_shortest_paths (g, s, weight_map(weight).distance_map(distance) .predecessor_map(pred).color_map(color) .distance_compare(compare_type()).distance_combine(combine_type()) .distance_zero(identity).distance_inf(infinity).vertex_index_map(id) ); }
It should be clear that the algorithm above is entirely generic; it applies to any graph with probability- and/or likelihood-based property maps representing the vertex distances and edge weights. It should also be evident that this algorithm is easy to develop, maintain, and verify. Even more importantly, it can be used by application code without worry that some subtlety of the semantics will not be enforced, thereby making the resultant code incorrect.
To simplify the interface for most_likely_paths(), another function of the same name is provided that bundles the four map arguments into a single set of named parameters.
First, include the relevant definitions.
#include <iomanip> #include <iostream> #include <string> #include <boost/graph/adjacency_list.hpp> #include <boost/likelihood.hpp> #include <boost/most_likely_paths.hpp>
Next, define the basic types used for the vertex distances and edge weights. The choices used here illustrate that they need not be the same type and need not even be within the same domain. Indeed, for very large graphs it is likely that the accumulation of path lengths, represented by the vertex distance type, should occur within the log domain, even if it is most natural to represent the edge weights in the linear domain. This does not matter in this simple case, however, and it is possible to redefine these types in any meaningful combination, in either domain, and still obtain the same results. This illustrates the true flexibility of a generic algorithm based upon Boost.Probability: the semantics are unchanged by choosing types that best reflect the nature of the application domain.
typedef boost::log_likelihood distance_type; typedef boost::probability weight_type;
Define the type used to represent a vertex in the graph. Note that the vertices along the most likely paths are captured through the predecessor vertices for each vertex. The adjacency_list_traits<>::vertex_descriptor is used to acquire the type of the vertex descriptor in advance of defining the actual graph type.
struct vertex_type { typedef boost::adjacency_list_traits<boost::vecS, boost::vecS, boost::directedS>::vertex_descriptor vertex_descriptor; std::string name; distance_type d; vertex_descriptor predecessor; };
Define the type used to represent an edge in the graph. Note that here again the argument to the constructor is interpreted as a probability in the linear domain, regardless of the actual type used to represent it internally. This ensures consistent semantics for the code constructing the graph.
struct edge_type { typedef boost::probabilities::linear_domain linear_domain; edge_type () : weight() { } template <typename T> edge_type (T w) : weight(boost::probabilities::probability<linear_domain,T>(w)) { } weight_type weight; };
Define the type of graph used in this application.
typedef boost::adjacency_list<boost::vecS, boost::vecS, boost::directedS,
vertex_type, edge_type> graph_type;
Declare functions to simplify the main code. Although present within example0c.cpp, the implementation of graph_factory() is not given here as it adds little to the understanding of the example. It is simply responsible for constructing the particular graph used in this illustration.
static graph_type graph_factory (); static std::ostream& operator << (std::ostream&, const distance_type&);
Given the preceding code, the main() function is quite straightforward, again illustrating the power of using Boost.Probability and the generic algorithms it enables. The graph is constructed, the source vertex is identified, and the most likely paths emanating from that source vertex are discovered by the generic algorithm. Finally, those paths are reported through the sequence of predecessor vertices. Note the use of named arguments for most_likely_paths().
int main () { graph_type g (graph_factory()); graph_type::vertex_descriptor source (1); boost::probabilities::most_likely_paths (g, source, weight_map(get(&edge_type::weight,g)).distance_map(get(&vertex_type::d,g)) .predecessor_map(get(&vertex_type::predecessor,g)) ); const int w (12); std::cout << "source vertex: " << g[source].name << std::endl; std::cout << std::setw(w) << "destination" << std::setw(w) << "predecessor" << std::endl; std::cout << std::setw(w) << "vertex" << std::setw(w) << "vertex" << std::setw(w) << "likelihood" << std::endl; boost::graph_traits<graph_type>::vertex_iterator vi , vi_end; for (tie(vi, vi_end) = vertices(g); vi != vi_end; ++vi) std::cout << std::setw(w) << g[*vi].name << std::setw(w) << g[g[*vi].predecessor].name << std::setw(w) << g[*vi].d << std::endl; return 0; }
The left shift operator (<<) is used to report the likelihood of the paths. Recall, however, that the vertex distance type in this case is within the logarithm domain. Thus, it is cast into the linear domain prior to reporting it. Consequently, the value will be reported correctly regardless of whether the likelihood is accumulated in the linear or log domain.
std::ostream& operator << (std::ostream& s, const distance_type& d) { typedef boost::probabilities::linear_domain linear_domain; return s << boost::probabilities::value_cast<linear_domain>(d); }
This simple example illustrates the clarity with which both libraries (e.g., generic algorithms) and application code can be written when using Boost.Probability. The semantics are clearly expressed directly in the code, while the underlying types ensure safety and correctness. Development, maintenance, and verification may be divided among distinct groups without loss of quality due to confusing overloads of native types, e.g., double, to represent a diversity of semantically distinct entities.
This example also illustrates how easy the construction of generic algorithms for probabilities and likelihoods can be. The one presented, although very short and clear, solves a broad class of problems encounted in the domain of scientific and statistical modeling. Creating a larger set of graph-based generic algorithms based upon Boost.Probability would be both straightforward and valuable, as it would encapsulate solutions to a great diversity of practical problems. This represents a future direction of growth for Boost.Probability.
and likelihood values within the semi-open domain
, both as native quantities and as logarithms.double is sufficient, however, the simple default types make the task of instantiating these quantities easier.double to represent the value type.Greater control is provided by the underlying set of types within the boost::probabilities namespace. Here are a set of four analogous types.
double) that should model a real number with suitable arithmetic operations and the logarithm and exponential functions. Using these templates, one can construct probability and logarithm types based on any appropriate value type. All of the type safety associated with managing quantities in the linear or log domain are available.
As with the simple types itemized above, each of the template types is available as the native quantity (boost::probabilities::linear_domain) and as the logarithm of the native quantity (boost::probabilities::log_domain). By default, in order to maximize performance, no validation of the values within these domains occurs. However, alternative validators may be used. The boost::probabilities::range_validator restricts probabilty and likelihood quantities to lie within their appropriate domains of
and
, respectively. The boost::probabilities::truncating_validator enables the selection of a non-zero tolerance around the natural bounds to accommodate potential fixed precision round-off errors near the boundaries.
All the valid arithmetic expressions involving probabilities and likelihoods are available in the library. Because domain transformations, which require evaluation of exponential or logarithm functions, are expensive, arithmetic operations between quantities within the same domain are preferred. However, as suggested by the motivational example above, there are situations in which mixed domain operations should be allowed. As a result, implicit domain transformations will occur with the available mixed-domain operations. Whenever possible, implicit transformation from the linear to the log domain is preferred, because a wider array of values may be represented in the log domain and an oft-encountered issue in statistical computations is underflow resulting from the product of many small quantities.
All logical comparisons involving probabilities and likelihoods are available. In the case of mixed-domain comparisons, the arguments are converted into the log domain prior to comparison.
Of the many mathematical functions available, only the power function is sensible for probabilities and likelihoods, and that only for exponents that are natural numbers. For other exponents and other functions, the return value no longer retains the properties of probabilities and likelihoods. Consequently, this is the only mathematical function provided by the library.
Finally, a set of casts are available for interconverting among the quantities and for converting them to real numbers. The boost::probabilities::domain_cast free function and corresponding member functions boost::probabilities::likelihood::domain_cast and boost::probabilities::probability::domain_cast yield a new quantity transformed into the appropriate domain. In contrast, the boost::probabilities::value_cast free function and corresponding member functions boost::probabilities::likelihood::value_cast and boost::probabilities::probability::value_cast yield the real number value of the quantity transformed into the appropriate domain. The latter is useful, for example, to extract a real number representation for other manipulations that should not be performed within the probability or likelihood domains. It is also useful, for example, to implement output functions.
The first illustrates use of the default types to accumulate the likelihood of some independent observations. In what follows, only the most relevant portions of the code are described; the full source is available in example1.cpp.
First, the relevant header files are included.
#include <boost/likelihood.hpp> #include <boost/probability.hpp>
Some types are defined to simplify the subsequent code.
typedef boost::probabilities::linear_domain linear_domain; typedef boost::probabilities::log_domain log_domain; typedef boost::likelihood likelihood; typedef boost::log_likelihood log_likelihood; typedef boost::probability probability;
A useful probability model based upon the Poisson distribution is declared. Note that the implementation of the model used here is for illustration only and does not represent a robust version for production use.
static probability poisson (int i); // probability model
After constructing a series of observations, the overall likelihood is calculated. Because the type used to accumulate the intermediate values is within the log domain, the probabilities of each observation are automatically converted to the log domain before being multiplied (i.e., logarithms added) into the accumulator.
log_likelihood l; // product of likelihoods across a series of independent observations for (observations::const_iterator i = obs.begin(); i != obs.end(); ++i) l *= poisson(*i);
Note that even though the probability model (poisson() in this case) returns probabilities as native quantities (i.e., in the linear domain), the accumulation of them across independent observations in the form of likelihoods is performed in the log domain. Nevertheless, the natural multiplication notation (i.e., using operator*=() in this case) is available to make the intent clearer and easier to maintain. This represents a typical use case in which log likelihoods are used for convenience or to avoid underflow issues, but the fundamental aspect of a product across likelihoods is still expressed clearly.
Finally, output the accumulated likelihood in both domains. Notice the use of boost::probabilities::domain_cast to convert the value of a log_likelihood into the linear domain.
std::cout << boost::probabilities::domain_cast<linear_domain>(l)
<< std::endl;
std::cout << l << std::endl;
The following is an example output function for likelihoods in the linear domain. Note the use of boost::probabilities::value_cast to obtain access to the underlying value.
template <typename Value, typename Validator> std::ostream& operator << (std::ostream& s, const boost::probabilities:: likelihood<linear_domain,Value,Validator>& l) { return s << "l=" << boost::probabilities::value_cast<linear_domain>(l); }
The second example illustrates the use of the more flexible types to select the value type used internally to represent the quantities. Again, only the most relevant portions of the code are described; the full source is available in example2.cpp.
First, the relevant header files are included.
#include <boost/likelihoods.hpp> #include <boost/probabilities.hpp>
Some types are defined to simplify the subsequent code. In this case, the value type of each probability and likelihood quantity is explicitly selected via the second template argument. While the example illustrates the simple use of float as the value type, any type that corresponds to a real number value may be used. The type must support the standard arithmetic operators as well as the exponential functions exp(), log(), and log1p().
typedef float value_type; typedef boost::probabilities::linear_domain linear_domain; typedef boost::probabilities::log_domain log_domain; typedef boost::probabilities::likelihood<linear_domain,value_type> likelihood; typedef boost::probabilities::likelihood<log_domain,value_type> log_likelihood; typedef boost::probabilities::probability<linear_domain,value_type> probability;
With these new types, the remainder of the example code is identical to that described above for the example using the default types.
The third example illustrates the use of the more flexible types to select the validator used to assure the validity of the internal representation of the quantities. In this example, the boost::probabilities::range_validator is used, which enforces values to lie within an appropriate range. Again, only the most relevant portions of the code are described; the full source is available in example3.cpp.
First, the relevant header files are included.
#include <boost/likelihoods.hpp> #include <boost/probabilities.hpp>
Some types are defined to simplify the subsequent code. Note the definition of the validator types and their use as the third template argument in the final typedefs.
typedef float value_type; typedef boost::probabilities::linear_domain linear_domain; typedef boost::probabilities::log_domain log_domain; typedef boost::probabilities::likelihood_tag likelihood_tag; typedef boost::probabilities::probability_tag probability_tag; typedef boost::probabilities::range_validator<likelihood_tag,value_type> likelihood_validator; typedef boost::probabilities::range_validator<probability_tag,value_type> probability_validator; typedef boost::probabilities::likelihood<linear_domain,value_type, likelihood_validator> likelihood; typedef boost::probabilities::likelihood<log_domain,value_type, likelihood_validator> log_likelihood; typedef boost::probabilities::probability<linear_domain,value_type, probability_validator> probability;
With these new types, the remainder of the example code is identical to that described above for the example using the default types. In this case, however, run-time validation occurs for the likelihood and probability quantities.
The final example illustrates the use of generic algorithms with the likelihood and probability types. Specifically, it takes advantage of the fact that the semantics underlying the operators is identical regardless of the domain of operation. As with the preceding examples, only the most relevant portions of the code are described; the full source is available in example4.cpp.
First, the relevant header files are included.
#include <boost/likelihood.hpp> #include <boost/probability.hpp>
Some types are defined to simplify the subsequent code. As in the first example, the default types are used.
typedef boost::probabilities::linear_domain linear_domain; typedef boost::probabilities::log_domain log_domain; typedef boost::likelihood likelihood; typedef boost::log_likelihood log_likelihood; typedef boost::probability probability;
Define a generic algorithm for the Poisson distribution. Since the calculations are done in the linear domain but the Probability template argument may correspond to any domain, the domain specifier must be given as an extra argument to the constructor. Note that this implementation is purely for illustration and does not represent a robust version for production use.
// generic, but simple, Poisson distribution template <typename Probability, typename IntegralType> Probability poisson (IntegralType i) { typedef typename Probability::value_type value_type; const value_type lambda (2); return Probability(exp(-lambda) * pow(lambda, i) / factorial(i), linear_domain()); }
Define a generic algorithm that accumulates the likelihood across a series of independent observations. The observations are in a type, Observations, that models a sequential container. The likelihood is accumulated within the domain corresponding to the Likelihood template type. Note that in this case, the Poisson distribution returns values in the linear domain, as determined by the probability template argument.
// calculate the likelihood of a set of observations template <typename Likelihood, typename Observations> Likelihood analyze (const Observations& obs) { Likelihood l; // product of likelihoods across a series of independent observations for (typename Observations::const_iterator i = obs.begin(); i != obs.end(); ++i) l *= poisson<probability>(*i); return l; }
Define a generic algorithm for outputing a likelihood quantity, preceded by a specific label.
// output a label and a likelihood template <typename Likelihood> void output (const std::string& label, const Likelihood& l) { std::cout << label << " " << l << std::endl; }
Instantiate and use the generic algorithms with different Likelihood types. In this case the accumulation of likelihoods in the linear domain suffers from underflow because of the large number of small quantities that are multiplied. In contrast, the same accumulation in the log domain yields the correct result. In both cases, however, the same generic algorithm shown above is used. This illustrates the value of maintaining a consistent set of semantics (e.g., consistently defining operator*=() to denote accumulation as a product of likelihoods) regardless of the domain of operation.
output("underflow: ",analyze<likelihood,observations>(obs)); output("no underflow:",analyze<log_likelihood,observations>(obs));
Memory utilization. With respect to the use of memory, the layout of the likelihood and probability types is identical to that used for the underlying value type. This is true for single quantities, structures of the quantities, and 1- and 2-dimensional arrays of the quantities. This property is tested in the test cases corresponding to the two primary types.
|
|
Figure 1. Performance evaluation based on the expression LHS *= RHS for unvalidated data types. |
Runtime performance. With respect to runtime, performance.cpp implements a simple benchmark. The test times a series of LHS *= RHS expressions similar to those outlined in the examples above. Alternative versions of the test are generated for different combinations of the LHS and RHS types using a template to maximize consistency of the code among test cases. Figure 1 illustrates the results of a test based on gcc version 3.3.3 (NetBSD nb3 20040520) with an optimization level of -O2. As expected, the time required for a large numer of operations increases linearly with the number of operations. There is also a large amount of variation evident in these figures; however, note that the vertical axis is truncated. In fact, the residual coefficient of variation for these data is less than 0.5%.
The lines in the figures summarize analyses of covariance with the LHS or RHS data type being the covariate in each case. The data are well-approximated by parallel lines. The differences in intercept, which are summarized in Table 1, measure the effect of the data type on performance.
| Covariate | Data Type | Unvalidated | Validated |
|---|---|---|---|
| Left hand side | Likelihood | 0.12% | 2.60% |
| Left hand side | Probability | -0.03% | 2.64% |
| Right hand side | Probability | 0.36% | 4.25% |
Table 1. Effect of the covariates on the intercept relative to the expression double *= double for unvalidated and validated data types. |
Clearly, the performance effect of using the unvalidated types provided by Boost.Probability is measurable. However, it is small enough to be largely insignificant. It is important to note that these tests are based on embedding the probability and likelihood calculations within a tight loop. Consequently, the bulk of the operations involve the probability and likelihood quantities. For use cases that mix a greater proportion of other calculations among those for probability and likelihood quantities, the effect of these types on performance will be even less.
Note that the results illustrated in Figure 1 are obtained using the default boost::probabilities::null_validator, which performs no runtime checks on the values. Alternative validators, such as boost::probabilities::range_validator, which do perform checks, necessarily incur a greater runtime penalty. Results of a similar analysis of covariance of data obtained from the range validated versions of the same types are also summarized in Table 1. Both validators are provided so that end-users may select between greater runtime performance and greater compliance with the strict bounds imposed by the definition of probabilities and likelihoods.
Conclusion These performance results suggest that the benefits of type safety, consistent semantics, and support for generic algorithms obtained by use of Boost.Probability are associated with a small and measurable, but likely insignificant, performance difference. Users may choose validators based on their need for higher performance or strict range validation.
Installation of either version of the library is straightforward. Once the library is downloaded and untarred, simply copy the files in the boost directory into the normal location for header files, e.g., /usr/local/include, and use them as you would other header files. Because this library is entirely implemented as template classes within the header files, there is no need to compile any libraries prior to use.
The Boost.Build framework may be used to test the library. Run bjam in the libs/probability/test directory as you would for any other Boost library. Be sure that the environment variable BOOST_ROOT points to the location of the Boost source code.
Alternatively, the standalone version of the library may be tested by compiling and running the programs under either the test or example directories. This may be accomplished by running sh configure, make, and make check in the top level directory.
The existing make system provides all the standard targets (e.g., all, clean, depend, etc.) for building software. Additional targets include check, which runs the test and example programs. Note that the provided Makefiles assume the standard BSD make program and its associated system-provided Makefile fragments, which are traditionally located in /usr/share/mk. Both are available through pkgsrc, a platform-independent packaging system that includes, among many others, several Boost packages among those for development utilities. The two relevant packages are:
-I compiler option), your existing compilation system should work fine.
Copyright (c) 2007 Brook Milligan Boost Software License - Version 1.0 - August 17th, 2003 Permission is hereby granted, free of charge, to any person or organization obtaining a copy of the software and accompanying documentation covered by this license (the "Software") to use, reproduce, display, distribute, execute, and transmit the Software, and to prepare derivative works of the Software, and to permit third-parties to whom the Software is furnished to do so, all subject to the following: The copyright notices in the Software and this entire statement, including the above license grant, this restriction and the following disclaimer, must be included in all copies of the Software, in whole or in part, and all derivative works of the Software, unless such copies or derivative works are solely in the form of machine-executable object code generated by a source language processor. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
1.4.5