Byte: Difference between revisions
[Image: The Hexer hex editor displaying the Linux kernel version 2.6.20.6; this image illustrates the values of the bytes composing a program as they appear in hexadecimal format]
Latest revision as of 11:00, 22 July 2024
In computer science, a byte is a unit of data consisting of eight binary digits, each of which is called a bit. The 8-bit byte is the smallest addressable unit of information in the instruction set architecture (ISA) of most electronic computers today. In the history of computing, various computers have used other byte sizes, such as 9-bit bytes, and some machines have not had byte addressing at all, only addressing at the word level. However, today byte addressing and 8-bit bytes are standard.
When grouped together, bytes can contain the information to form a document, such as a photograph or a book. All information stored on a computer is composed of bytes, from e-mails and pictures to programs and data stored on a hard drive. Although the concept may at first appear simple, a precise definition involves several subtleties, such as the size of a byte, the meaning assigned to it, and the order in which bytes are stored.
A byte is a binary number, but the semantics, or meaning assigned to a given byte, is a matter defined within the instruction set architecture (ISA) of each type of computer. Many different encodings have been tried for text, for integers, and for various non-integer numbers. A discussion of the merits of the various representations is complex and falls under the field of computer architecture. The topic of text encoding is particularly tricky. Various systems for encoding printable characters have been used, including ASCII, EBCDIC, and various flavors of Unicode Character Encoding.
Definition of byte
In electronics, information is represented by switching between two states, usually referred to as 'on' and 'off'. To represent these states, computer scientists use the values 0 (off) and 1 (on); a single such value is called a bit. The term "bit" came into use through mathematician Claude Shannon’s research in the 1940s.[1]
Half of a byte (four bits) is referred to as a nybble. A word is the standard number of bytes that a processor handles as a single unit, and its size depends on the architecture. On machines that address memory only at the word level, memory can be addressed only in multiples of the word size. For example, a 16-bit processor has words consisting of two bytes (2 x 8 = 16), a 32-bit processor has words consisting of four bytes (4 x 8 = 32), and so on.
The eight bits making up each byte can represent any number from 0 to 255. We obtain this number of possible values, which is 256 when including the 0, by raising the number of possible values of a bit (two) to the power of the length of a byte (eight); thus, 2^8 = 256 possible values in a byte.
While there are many different ways to express the value of a byte, the three main notations are hexadecimal, binary, and decimal. Hexadecimal is probably the most common notation for explicit editing of a binary file, largely because it can express all 256 values of a byte in only two characters. For example, ff is the largest two-digit hexadecimal number, with a value of 255, while 00 is the lowest, with a value of 0.
In the screenshot of the Hexer binary editor at the beginning of this article, the first eight bytes are {66,ea,08,00,00,00,c0,07}. This sequence of numbers could be expressed in the decimal system as {102,234,8,0,0,0,192,7}.
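The conversion between these notations can be checked mechanically. The following is a minimal Python sketch, assuming only the eight byte values quoted above; the variable names are illustrative and not part of any tool mentioned in this article.

```python
# First eight bytes quoted from the Hexer screenshot, written as hex digit pairs.
hex_pairs = ["66", "ea", "08", "00", "00", "00", "c0", "07"]

# int(text, 16) parses a string of hexadecimal digits into an integer.
decimal_values = [int(pair, 16) for pair in hex_pairs]

# format(value, "08b") renders the same value as eight binary digits.
binary_values = [format(value, "08b") for value in decimal_values]

# An n-bit unit can hold 2**n distinct values, so a byte holds 2**8 of them.
values_per_byte = 2 ** 8

print(decimal_values)     # the decimal form of the same eight bytes
print(binary_values[0])   # the first byte written as eight bits
print(values_per_byte)
```

Each byte keeps the same value throughout; only the way it is written changes.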
Hexadecimal numbers are often written with 0x preceding the number, such as 0xf8, to denote that the number is in hexadecimal format. This notation is most often used in calculators and compilers, where the computer needs to be told what format it is supposed to be reading. If the computer is not told this, it may interpret the number as a variable name or a decimal number, resulting in a build-time error or a bug in the software.
Bytes can be used to represent many data types, from characters in a string of text to the assembled and linked machine code of a binary executable file, which is the language that programs use to tell the computer how to act. Every file, sector of system memory, and network stream is composed of bytes.
In computers, plain text came to mean a string, file, or byte array that is printable, consisting only of standard alphanumeric bytes, punctuation, and a few control bytes such as tab, carriage return, or line feed. Plain text was not supposed to include any bytes that a printer would not know how to handle. The actual value of each character has varied in years past. Today, however, we have the American Standard Code for Information Interchange (ASCII), which allows data to remain readable when transmitted between different media, such as from one operating system to another. For instance, a user who typed a plain text document in Linux would usually be able to view or print the same file on a Macintosh computer. One example of ASCII is the capital letters of the English language, which range from 65 for "A" to 90 for "Z" in decimal.
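As a rough illustration of why the prefix matters, the Python sketch below parses the same digits with and without a stated base. Python is used here only as an example language; other languages and calculators have their own rules.

```python
# A literal with the 0x prefix is read as hexadecimal by the interpreter.
literal_value = 0xf8

# int() with base 0 infers the base from the prefix in the text.
inferred_value = int("0xf8", 0)

# Without a prefix, the base has to be supplied explicitly.
explicit_value = int("f8", 16)

# Read as decimal, the digits "f8" are not a valid number at all.
try:
    int("f8")
    decimal_parse_failed = False
except ValueError:
    decimal_parse_failed = True

# The digits "10" are valid in both bases but denote different values.
ten_as_decimal = int("10")
ten_as_hex = int("0x10", 0)

print(literal_value, inferred_value, explicit_value)
print(decimal_parse_failed, ten_as_decimal, ten_as_hex)
```

The last pair is the dangerous case: a number such as 10 is accepted either way, so a missing prefix produces a wrong value rather than an error.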
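The ASCII codes of individual characters can be inspected directly. The sketch below is illustrative only: the helper `is_plain_text` is a hypothetical name, and its definition of "printable" (codes 32 to 126 plus tab, line feed, and carriage return) is an assumption made for this example rather than a formal standard.

```python
# ord() returns the numeric code of a character; for ASCII characters
# this is the value of the single byte that stores it.
code_a = ord("A")
code_z = ord("Z")

# Encoding a string as ASCII yields one byte per character.
encoded = "Hi\n".encode("ascii")

# Hypothetical helper: treat a byte string as plain text when every byte is
# a printable ASCII character or one of a few common control bytes.
ALLOWED_CONTROL_BYTES = {9, 10, 13}  # tab, line feed, carriage return

def is_plain_text(data: bytes) -> bool:
    return all(32 <= b <= 126 or b in ALLOWED_CONTROL_BYTES for b in data)

print(code_a, code_z)
print(list(encoded))
print(is_plain_text(encoded), is_plain_text(bytes([0x66, 0xEA, 0x08])))
```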
Endianness
When multiple contiguous bytes represent a single number, there are two possible opposite "orderings" of the bytes; the particular ordering used is called endianness. Just as some natural languages are written from left to right, such as English, while others are written from right to left, such as Hebrew, bytes can be arranged "big end first", or Big Endian (with the most significant byte at the lowest memory address), or "little end first", or Little Endian (with the least significant byte at the lowest memory address). The names are derived from the book Gulliver's Travels, in which the Lilliputians' foremost political concern was whether eggs should be opened from the little end or the big end.[2] A similar ordering decision exists for the bits within a byte; it tends to be called simply bit ordering.
Differences in endianness arose among various computer architectures. For example, the Intel x86 architecture used for IBM compatible PCs is little endian, whereas SPARC architectures (commonly run with Solaris) are big endian. Even programming languages that run on virtual machines, such as Java or C sharp, have endianness. Java, which was first developed on Unix machines that tended to be big endian, uses a big endian virtual machine (a.k.a. runtime), whereas C sharp, developed by Microsoft for Intel x86 computers, uses a little endian runtime.
Differences in endianness can be a hazard when transferring information between two computers using different architectures, because errors arise when bytes or bits are translated in the wrong order. Since the late 1990s, endianness has receded as a problem due to the widespread adoption of eXtensible Markup Language (XML) as a kind of lingua franca for transferring information between computers. XML is a standardized way of representing numbers, and indeed any kind of information at all, as strings of plain text.
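The two orderings can be seen by writing one number out as bytes both ways. This is a small Python sketch; the value 0x12345678 is an arbitrary example chosen so that each byte is easy to recognise.

```python
import sys

value = 0x12345678  # arbitrary four-byte example value

# Big endian: the most significant byte (0x12) comes first.
big = value.to_bytes(4, byteorder="big")

# Little endian: the least significant byte (0x78) comes first.
little = value.to_bytes(4, byteorder="little")

# Reading bytes back with the wrong ordering silently yields another number.
misread = int.from_bytes(little, byteorder="big")

print(big.hex(), little.hex())
print(hex(misread))
print(sys.byteorder)  # ordering used by the machine running this code
```

Note that the misread value is a perfectly valid number, which is why ordering mistakes tend to show up as wrong data rather than as an error message.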
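The idea of sending a number as plain text can be sketched with Python's standard `xml.etree.ElementTree` module. The element name `value` is made up for this example and does not come from any particular XML schema.

```python
import xml.etree.ElementTree as ET

number = 305419896  # the same quantity as 0x12345678

# Sender: store the number as decimal digits inside an XML element.
element = ET.Element("value")  # "value" is a hypothetical element name
element.text = str(number)
payload = ET.tostring(element)

# Receiver: parse the text and convert the digits back into a number.
received = int(ET.fromstring(payload).text)

print(payload)
print(received == number)
```

Because the digits travel as characters in a fixed reading order, the receiver does not need to know how the sender stores multi-byte integers in memory.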
Word origin and ambiguity
Although the origin of the word 'byte' is not known with certainty, it is believed to have been coined by Dr. Werner Buchholz of IBM in 1956. It is a play on the word 'bit', and originally referred to the number of bits used to represent a character.[3] This number is usually eight, but in some cases (especially in times past), it can be any number ranging from as few as 2 to as many as 128 bits. Thus, the word 'byte' is actually an ambiguous term. For this reason, an eight-bit byte is sometimes referred to as an 'octet'.[4]
Sub-units
Because files are normally many thousands or even billions of times larger than a byte, other terms designating larger byte quantities are used to increase readability. Metric prefixes are added to the word byte, such as kilo for one thousand bytes (kilobyte), mega for one million (megabyte), giga for one billion (gigabyte), and even tera, which is one trillion (terabyte). One thousand gigabytes compose a terabyte, and at the time of writing even the largest consumer hard drives held only three-fourths of a terabyte (750 'gigs' or gigabytes). The rapid pace of technological advancement may make the terabyte commonplace in the future, however.
Conflicting definitions
Traditionally, the computer world has often used a value of 1024 instead of 1000 when referring to a kilobyte. This was done because programmers needed a number compatible with base 2, and 1024 is equal to 2 to the 10th power. Typically, storage space is measured with a base of 2, whereas data rates generally use a base of 10. Thus, engineers in different fields of computer science may use the same term when referring to different units of measurement (numbers of bytes).
Due to the large confusion between these two meanings, an effort has been made by the International Electrotechnical Commission (IEC) to remedy this problem. They have standardized a new system called the 'binary prefix', which replaces the word 'kilobyte' with 'kibibyte', abbreviated as KiB, to mean 1024 bytes. This solution has since been approved by the IEEE on a trial-use basis, and may prove to one day become a true standard.[5]
While the difference between 1000 and 1024 may seem trivial, the discrepancy grows with each larger prefix. The difference between 1 TB and 1 TiB, for instance, is approximately 10%. As hard drives become larger, the need for a distinction between these two sets of prefixes will grow. This has been a problem for hard disk drive manufacturers in particular. For example, one well-known disk manufacturer, Western Digital, has been taken to court for its use of base 10 when labeling the capacity of its drives. This is a problem because a capacity labeled in base 10 appears larger than it is to a consumer who assumes the label uses base 2.[6]
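To make the labeling issue concrete, the sketch below converts a capacity labeled with a decimal prefix into the corresponding binary unit. The function name `decimal_gb_to_gib` is hypothetical, and the 750 GB figure is simply the drive size mentioned earlier in this article.

```python
def decimal_gb_to_gib(gigabytes: float) -> float:
    """Convert gigabytes (10**9 bytes each) to gibibytes (2**30 bytes each)."""
    return gigabytes * 10**9 / 2**30

labeled_capacity_gb = 750
capacity_gib = decimal_gb_to_gib(labeled_capacity_gb)

# Capacity that "goes missing" when a base-10 label is read as base 2.
shortfall = labeled_capacity_gb - capacity_gib

print(f"{labeled_capacity_gb} GB is about {capacity_gib:.1f} GiB")
print(f"apparent shortfall: about {shortfall:.1f}")
```

A system that reports sizes in binary units would therefore show such a drive as holding noticeably less than the number printed on the box, even though no bytes are actually missing.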
Table of prefixes
Metric (abbr.) | Value | Binary (abbr.) | Value | Difference* | Difference in bytes |
---|---|---|---|---|---|
byte (B) | 10^0 = 1000^0 | byte (B) | 2^0 = 1024^0 | 0.0% | 0 |
kilobyte (KB) | 10^3 = 1000^1 | kibibyte (KiB) | 2^10 = 1024^1 | 2.4% | 24 |
megabyte (MB) | 10^6 = 1000^2 | mebibyte (MiB) | 2^20 = 1024^2 | 4.9% | 48,576 |
gigabyte (GB) | 10^9 = 1000^3 | gibibyte (GiB) | 2^30 = 1024^3 | 7.4% | 73,741,824 |
terabyte (TB) | 10^12 = 1000^4 | tebibyte (TiB) | 2^40 = 1024^4 | 10.0% | 99,511,627,776 |
petabyte (PB) | 10^15 = 1000^5 | pebibyte (PiB) | 2^50 = 1024^5 | 12.6% | 125,899,906,842,624 |
exabyte (EB) | 10^18 = 1000^6 | exbibyte (EiB) | 2^60 = 1024^6 | 15.3% | 152,921,504,606,846,976 |
zettabyte (ZB) | 10^21 = 1000^7 | zebibyte (ZiB) | 2^70 = 1024^7 | 18.1% | 180,591,620,717,411,303,424 |
yottabyte (YB) | 10^24 = 1000^8 | yobibyte (YiB) | 2^80 = 1024^8 | 20.9% | 208,925,819,614,629,174,706,176 |
*Increase, rounded to the nearest tenth of a percent
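The difference figures in the table follow directly from the definitions of the two prefix families. A minimal Python sketch that recomputes them is given here; the list of prefix names and the tuple layout are choices made for this example.

```python
# Metric prefix names in order of increasing power; index 0 is the plain byte.
PREFIXES = ["", "kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]

rows = []
for power, prefix in enumerate(PREFIXES):
    metric_value = 1000 ** power   # value of the metric unit in bytes
    binary_value = 1024 ** power   # value of the matching binary unit in bytes
    difference = binary_value - metric_value
    # Increase of the binary unit over the metric one, as a percentage.
    percent = round(100 * difference / metric_value, 1)
    rows.append((prefix + "byte", difference, percent))

for name, difference, percent in rows:
    print(f"{name:>10}  {percent:5.1f}%  {difference:,}")
```

Python integers have arbitrary precision, so the largest differences are computed exactly rather than approximated.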
Related topics
- American Standard Code for Information Interchange (ASCII)
- Extended Binary Coded Decimal Interchange Code (EBCDIC)
- Unicode Character Encoding
References
1. "Claude Elwood Shannon" (biography). The History of Computing Project (2005). Retrieved on 2007-05-12.
2. What is big-endian? - A Word Definition From the Webopedia Computer Dictionary. Retrieved on 2007-04-15.
3. Dave Wilton (2006-04-08). Wordorigins.org; bit/byte.
4. Bob Bemer. Origins of the Term "BYTE". Retrieved on 2007-04-12.
5. IEEE Trial-Use Standard for Prefixes for Binary Multiples. Retrieved on 2007-04-14.
6. Nate Mook (2006-06-28). Western Digital Settles Capacity Suit.