Saturday, February 21, 2009

NUMBER vs NUMBER(p, s) (Oracle 11g)

Or how to choose one of them for the numeric fields of your tables...
 
As I am sure you know, Oracle (11g RDBMS) offers the NUMBER datatype as the main choice to store numerical values in your tables (equivalent to decimal/numeric type of SQL-Server).

Although this is not the main purpose of this article, I assume that you know the difference between decimal precision datatypes (such as NUMBER) and binary precision datatypes, such as BINARY_FLOAT and BINARY_DOUBLE, also present in Oracle, and that you have decided that the real numbers of your application domain need a decimal precision storage -as you can read on Oracle's Reference, binary precision enables faster arithmetic calculations and "usually" (this is one of the key tip of this article) reduces storage requirements. But BINARY_FLOAT and BINARY_DOUBLE are approximate numeric datatypes and they store approximate representations of decimal values, rather than exact representations. For example none of them can exactly represent the value 0.1 ( don't believe it? Try it ,please) and perhaps this cannot be acceptable for the banking or tax application you are developing.

As a decimal type, NUMBER allows you to indicate the precision (total number of digits) and scale (number of digits to the right of the decimal point) when defining a field of this type (NUMBER(p, s)). 

Let's view some examples:

NUMBER(9, 2): Nine significant digits in total (precision) of which 2 (scale) may be used for the decimal part of the value (digits to the right of the decimal point). 
NUMBER(9): Nine significants digits in total, none of them for the decimal part. Yes, that's the way to restrict your fields for integer values storage.
NUMBER: "I will save whatever you give me" with an accuracy of up to 38 significant digits. 
NUMBER(*,2): You set no limit to the precision but reducing (rounding) the decimal part to two digits. 
NUMBER(9,-2): Nine digit for the integer part which will be "rounded" at the last two digits (interesting), i.e.: 987,654,321 -> 987,654,300.

Most of people would stay happy with the lazy definition of NUMBER (without p or s). But this is not our case, and when defining the accuracy of a NUMBER, we will be considering at least two objectives:

a) To restrict the entry of data: If we specify precision and scale, we are adding a restriction that allows us to establish a greater shielding on the data (the more "downstream" the better, and the shield will apply to any application developed over this database).

Problem: It is vital to know precisely in advance the needs of the field, which  is not sometimes easy. For a field which is, for instance, intended to hold the surface of a construction in a cadastral application, precision and scale could be set without further problems (usually a two digit scale for area values in square meters). 

But what precision and scale should be assigned to a coefficient K that can be fixed arbitrarily by a per year shifting taxation law? Perhaps what today is a ratio of two decimal digits, tomorrow will have six, causing to have to redefine the structure of the table every year with the usual associated  impact in a productive environment.

b) The saving of disk space: It is common thinking that if you reduce precision the needs of storage cost will be reduced in the same meassure, and therefore you will save disk space. Is this true?
 
According to Oracle's Reference:
Oracle Database stores numeric data in variable-length format. Each value is stored in scientific notation, with 1 byte used to store the exponent and up to 20 bytes to store the mantissa. The resulting value is limited to 38 digits of precision. Oracle Database does not store leading and trailing zeros. For example, the number 412 is stored in a format similar to 4.12 x 10², with 1 byte used to store the exponent(2) and 2 bytes used to store the three significant digits of the mantissa(4,1,2). Negative numbers include the sign in their length.
 
Taking this into account, the column size in bytes for a particular numeric data value NUMBER(p), where p is the precision of a given value, can be calculated using the following formula:
 
ROUND((length(p)+s)/2))+1

where s equals zero if the number is positive, and s equals 1 if the number is negative.
Zero and positive and negative infinity (only generated on import from Oracle Database, Version 5) are stored using unique representations. Zero and negative infinity each require 1 byte; positive infinity requires 2 bytes.

That is, the size in bytes (see above in red) is variable and depends on of value stored in each case!!!

Let's try it:
SQL> create table tbl1 (
as_number number(12)
);

Table created.

SQL> insert into tbl1 values(20000000);
SQL> insert into tbl1 values (12345678 );

SQL> select as_number, vsize(as_number) from tbl1;
....

AS_NUMBER VSIZE(AS_NUMBER)
--------------  ----------------------
20000000      2
12345678      5
 
It is interesting to see that storing a value like 20000000 only requires 2 bytes, one for storing the mantissa (2) and another for the exponent of 10 (in this case 7). As said, the number of bytes used is dependent on the stored value.

Therefore, the first conclusion to be obtained is that the consumed amount of bytes is a function of the stored values. And as a result of the above you'd probably think that if you don't specify precision and/or scale, it will not have negative effects regarding to disk occupation  (and neither on the performance, especially if the field has an associated index), uhm... really?

Be careful, this can be accepted as valid when stored values are integer, but it is not always valid when stored values are real numbers, because, for instance, the result of a division made in an UPDATE sentence between two fields holding real numbers, which must be stored in another NUMBER field, could lead to a full occupation (38 digits) of its available size in case that the result of that division produced further decimal digits (unlikely to be needed). Of course, this can be avoided using a Rounding function, but we don't want to rely on every current or future programmers who will evolve the system.

You can read about this in this article, in which it is explained in a detailed manner. 

In this reading you will see that specifying the scale is highly recommended, because it allows to refine the decimal part, the main source of potential space "leaks".

Even when the type of the values to store is integer, it is a good practice to define such fields as NUMBER(*, 0), if we do not want put a limit on the precision or as NUMBER(p, 0) if we want to limit the number of input digits, as we ensure that so defined fields will never allow anything but integers (and that they will spend only the needed disk space to store the specific values that are inserted into them).

So, in order to avoid space leaks you should specify scale (although as said, if  you can guarantee that your values will be integer, the problem won't exist)... provided, of course, that you have all the information needed to decide how to specify it...
 
General summary: 
Whenever it's needed to shield what can be admited in a field in regard to maximum values and/or decimal digits, I recommend doing so in the definition of the field, indicating precision and scale NUMBER(p, s).

Regarding disk storage saving, it is important to set the scale specially for the results of real number calculations, because otherwise you are in risk of occupying all the capacity of NUMBER fields. However, if the data to be introduced is known to be always of integer type (a primary key based on a sequence (+1), for instance) we will not suffer such negative impact.

Note: This article is a refactored English version of a previous post on Tracasa's wiki by the same poster.

No comments: