Understanding Collation

The physical storage of character strings in the supported versions of Microsoft SQL Server and Microsoft SQL Server Express databases is controlled by collations. A collation specifies the bit patterns that represent each character and the rules by which characters are sorted and compared.

Microsoft SQL Server supports objects that have different collations being stored in a single database. Separate Microsoft SQL Server collations can be specified down to the level of columns. Each column in a table can be assigned different collations.

In a computer, characters are represented by different patterns of bits being either ON or OFF. A program that uses one byte (eight bits) to store each character can represent up to 256 different characters. A program that uses two bytes (16 bits) can represent up to 65,536 characters.

Single-byte code pages are definitions of the characters mapped to each of the 256 bit patterns possible in a byte. Code pages define bit patterns for uppercase and lowercase characters, digits, symbols, and special characters such as !, @, #, or %. Each European language, such as German or Spanish, has its own single-byte code page. Although the bit patterns used to represent the Latin alphabet characters A through Z are the same for all the code pages, the bit patterns used to represent accented characters (such as é and á) vary from one code page to the next. If data is exchanged between computers running different code pages, all character data must be converted from the code page of the sending computer to the code page of the receiving computer. If the source data has extended characters that are not defined in the code page of the receiving computer, data is lost. When a database serves clients from many different countries, it is difficult to pick a code page for the database that contains all the extended characters required by all the client computers. Also, a lot of processing time is spent doing the constant conversions from one code page to another.

Single-byte character sets are also inadequate to store all the characters used by many languages. For example, some Asian languages have thousands of characters, so they must use two bytes per character. Double-byte character sets have been defined for these languages. Still, each of these languages have their own code page, and there are difficulties in transferring data from a computer running one double-byte code page to a computer running another.

For information about synchronizing collation settings with another Windows locale, see the following Microsoft site: http://msdn2.microsoft.com/en-us/library/aa176553.aspx.