About 18 months ago, Google published a graph showing that Unicode had just overtaken all other text encodings on the web. The growth since then has been even more dramatic.

Web pages can use a variety of character encodings, such as ASCII, Latin-1, Windows-1252, or Unicode. Most encodings can represent only a few languages, but Unicode can represent thousands, from Arabic to Chinese to Zulu. Google has long used Unicode as the internal format for all the text it searches: any other encoding is first converted to Unicode for processing.
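To make the "convert everything to Unicode first" step concrete, here is a minimal Python sketch. It is only an illustration, not Google's actual pipeline: the helper name to_unicode is hypothetical, and it assumes the page's encoding is already known (real pages often need encoding detection first).

```python
def to_unicode(raw: bytes, declared_encoding: str) -> str:
    """Hypothetical helper: decode raw page bytes into a Unicode string.

    Assumes the encoding is already known; undecodable bytes are
    replaced with U+FFFD rather than raising an error.
    """
    return raw.decode(declared_encoding, errors="replace")


# The same word stored under two different encodings.
latin1_bytes = "café".encode("latin-1")  # 4 bytes: é is the single byte 0xE9
utf8_bytes = "café".encode("utf-8")      # 5 bytes: é is the two bytes 0xC3 0xA9

# Windows-1252 has curly quotes at 0x93 and 0x94.
cp1252_bytes = b"\x93hi\x94"

# After decoding, everything is ordinary Unicode text and compares equal.
same_text = to_unicode(latin1_bytes, "latin-1") == to_unicode(utf8_bytes, "utf-8")
quoted = to_unicode(cp1252_bytes, "cp1252")
```

Once everything is decoded, the rest of the processing can work on one representation, regardless of how the page was originally stored.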

This graph is from Google internal data, based on our indexing of web pages, and thus may differ somewhat from what other search engines find. However, the trends are pretty clear, and the continued rise in the use of Unicode makes it even easier to do the processing for the many languages that we cover.

Searching for "ﬁnancials"?

Unicode is growing both in usage and in character coverage. Google recently upgraded to the newest version of Unicode, version 5.2 (via ICU and CLDR). This adds over 6,600 new characters: some of mostly academic interest, such as Egyptian Hieroglyphs, but many others for living languages.

Google is constantly improving its handling of existing characters. For example, the characters "fi" can be represented either as two characters ("f" and "i") or as a single special display form, "ﬁ" (the ligature U+FB01). A Google search for [financials] or [office] used to not see these as equivalent; to the software, the ligature versions would just look like *nancials and of*ce. There are thousands of characters like this, and they occur in surprisingly many pages on the web, particularly generated PDF documents.

But no longer. After extensive testing, Google has just turned on support for these and thousands of other characters; your searches will now also find these documents. This is one more step in Google’s mission to organize the world's information and make it universally accessible and useful.
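For readers curious how this kind of equivalence can be achieved, one common approach is Unicode compatibility normalization (NFKC), which maps display forms such as the ﬁ ligature back to their plain characters. The sketch below uses Python's standard unicodedata module; the function name fold is hypothetical, and this is an assumption about one way to do it, not a description of what Google search actually runs.

```python
import unicodedata


def fold(text: str) -> str:
    """Hypothetical search-key function: NFKC-normalize, then case-fold."""
    return unicodedata.normalize("NFKC", text).casefold()


ligature_word = "\ufb01nancials"  # begins with U+FB01 LATIN SMALL LIGATURE FI
plain_word = "financials"         # begins with the two characters "f" and "i"

# Compared directly, the two strings are different.
raw_match = ligature_word == plain_word

# Compared by their folded keys, they are the same word.
folded_match = fold(ligature_word) == fold(plain_word)

# The same folding handles a ligature in the middle of a word.
office_key = fold("of\ufb01ce")
```

Here raw_match is False while folded_match is True, which is the difference between missing and finding a document that spells the word with a ligature.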
