Archive Builders: (310) 937-7000 SteveGilheany@ArchiveBuilders.com
This pamphlet provides managers with document size estimates for use in discussing, planning, designing, and implementing a document management system. The average computer file sizes for many types of scanned (digitized) documents are listed and explained.
All of the estimates are chosen to be representative of scanned document images and to produce easily usable, round figures when multiplied. For example, 50 thousand bytes (50 Kilobytes) is close to the average size of a scanned page and also yields an estimate of exactly one million bytes (a MegaByte) for 20 pages, exactly one billion bytes (a GigaByte) for 20 thousand pages, and exactly one trillion bytes (a TeraByte) for 20 million pages.
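The round-number arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not part of any standard tool: the function name storage_bytes and the constant name are hypothetical, and the 50 thousand byte average is simply the estimate given above.

```python
# Average compressed page size from the text: 50 thousand bytes per scanned page.
BYTES_PER_PAGE = 50_000

def storage_bytes(pages: int, bytes_per_page: int = BYTES_PER_PAGE) -> int:
    """Return the estimated storage, in bytes, for a number of scanned pages."""
    return pages * bytes_per_page

# 20 pages, 20 thousand pages, and 20 million pages, as in the example above.
for pages in (20, 20_000, 20_000_000):
    print(f"{pages:>10,} pages -> {storage_bytes(pages):,} bytes")
```

Because every figure is a round multiple, the same function can be reused for any of the page counts listed later in this pamphlet.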
These estimates have been chosen to tend toward over-estimation rather than under-estimation of storage requirements. The estimates therefore provide a small safety margin. It is best to make certain that all estimates and estimating procedures tend toward conservative estimates. Then, when all assumptions are aggregated, the overall estimate is conservative (safe).
The United States became a metric country in 1866 (Act of July 28, 1866; 14 Stat. 339) when the US customary measures were defined in metric terms. For example, an inch is defined to be exactly 25.4 millimeters. While an inch, as a measure, is thus hard metric, it is not a round metric number. When physical items are hard metric, the quantity and size of the items are in round metric numbers. In this paper, US customary measures are converted to metric sizes and quantities by applying the above methods for constructing round numbers and ensuring a slight over-estimation. These methods are applied to the quantities and sizes of documents listed once for the US customary units, and a second, independent time for metric units. The resulting metric measures are then chosen to be appropriate to the corresponding US customary measures. This means that the conversion process produces similar conceptual values, but almost never produces numerically or physically equal values for the quantity or size of a given item. (Numerically equal values across two systems of units are rare; one example is -40 degrees Celsius = -40 degrees Fahrenheit.)
If everyone uses the same estimates, it is easier to discuss and compare document imaging systems. Managers can also benefit from reports and articles describing previously implemented systems.
Because the estimates are industry standard, less time can be spent evaluating estimating methods, and more time can be spent understanding how the system will be used and whether the system design will accommodate the planned use.
De facto means 'by common usage', in the same way a dictionary records the way in which a language is actually used, rather than being the rule by which the language is written and spoken. These estimates are de facto estimates, estimates commonly used in the document management industry.
De jure means 'by some agreement or authority'. An example is the ISO (International Organization for Standardization) standards process, where there are policies and procedures for accrediting standards setting bodies and for the functioning of those bodies.
Assuming that your documents are similar to the industry average documents usually produces very small variances. Because the cost of storage is very low as a percentage of overall system cost, and is dropping rapidly, an error of a few percent in an estimate has very little effect on the overall system cost. If round estimates speed up the understanding and discussion process, the benefit of rounding far outweighs the cost of the slight variances.
After one percent of the documents have been scanned into a system, an actual average page image size can be calculated. This actual average page image size will provide the small correction necessary to adjust previous estimates. This is the system sizing method used in almost all system implementations.
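The correction step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name corrected_estimate is hypothetical, and the four sample sizes are made-up values standing in for the measured sizes of the first one percent of scanned pages.

```python
def corrected_estimate(total_pages: int, sample_sizes_bytes: list[int]) -> int:
    """Re-estimate total storage from the measured sizes of a scanned sample."""
    if not sample_sizes_bytes:
        raise ValueError("at least one measured page size is required")
    # Actual average page image size, measured from the sample.
    average = sum(sample_sizes_bytes) / len(sample_sizes_bytes)
    return round(total_pages * average)

# Hypothetical sample: measured sizes of four scanned pages, in bytes.
sample = [42_000, 55_000, 48_000, 51_000]
print(f"{corrected_estimate(1_000_000, sample):,} bytes for 1,000,000 pages")
```

With this sample the measured average is 49 thousand bytes, so the original 50 thousand byte estimate needs only a small downward correction.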
Managers need ballpark figures to size a document management system, to identify which system elements (a system or a system component) are bigger (or smaller) than a breadbox. Managers need a rough order of magnitude (ROM) estimate. An order of magnitude is a power of 10, so a rough order of magnitude (ROM) estimate is one in which the largest reasonable estimate is about ten times the size of the smallest reasonable estimate.
The accompanying list of document sizes is intended to assist in creating ROM estimates of storage requirements. Possible system decision outcomes of making these estimates are: "There will be no problem (the system is a little too large)", "We will never be able to afford the system", and "We budgeted the right amount".
All of the figures given for scanned images are for compressed file sizes. All imaging systems compress their image files for storage and transmission. Compressing removes the redundancy from the files, making the files smaller. These compressed page files have an average size of approximately 50 thousand bytes per page. Rarely is a given compressed page file exactly 50 thousand bytes.
Image files created by scanning at 200, 300, and 400 dpi (dots per inch) all have the same information content as the original image. The higher resolutions merely increase the redundancy in the image file. It is this redundancy that compression removes. In general, higher resolution scans of an image are slightly larger than lower resolution scans of the same image because higher resolution scans pick up more noise (pseudo-information). (Noise is something like digital dirt on the image.)
This variation between the compressed image sizes of different resolutions is within the variation range of document image sizes in general. In almost all cases, measuring the actual sizes of the first one percent of scanned images will easily adjust for this variation without requiring significant system changes.
Legal size page images are not much larger than letter size page images when scanned and compressed. The pathological case is the eight and one-half by twenty-five inch contract set in six point type, created to be true to the statement that the entire contract is on just one page (almost always two sided). Even in these cases, a good system design can be arrived at using letter size page image estimates. Measuring the size of actual scanned pages after the first one percent of the legal size documents have been scanned will produce the same high level of accuracy produced by measuring the size of the first one percent of any type of scanned document images. With this foundation of industry de facto standards, adjustments can be made for even the worst pathological cases, and any desired degree of estimating accuracy can be achieved.
A standard records storage carton (box) is about 12 inches wide by 15 inches long by 9 1/2 inches deep. It is designed to store letter size documents in manila folders against the 12 inch side and legal documents in legal folders against the 15 inch side. The standard fan-folded, greenbar, tractor fed, 11 by 14 inch, computer paper can be placed flat on the bottom of the standard records storage carton. In all three cases, the total computer storage required to store the scanned and compressed images of the documents in the box is the storage required to store 2,500 letter size page images.
By counting a standard record container that contains about 2,000 legal pages, as having 2,500 sheets of letter size documents, the effect of the slight difference between legal and letter pages can be further reduced.
Because the mainframe style programs that produce fan-folded greenbar output use very simple typography, the pages are simple and compress well, to about the same size as an average letter size page. Fan-folded documents are on good paper and must be handled carefully or they quickly become unmanageable. For these reasons, the nine or so inches that will fit in a standard records storage carton are fairly dense and constitute about 2,500 sheets.
Unfortunately, the document imaging industry has blurred, and then finally eliminated, the difference in meaning between the words 'page' and 'document'. This happened because there is a desire to make every system seem as large as possible, so every stored page is said to be a document.
To recover the use of the word 'document', always separately list the number of documents, the number of pages, and the average number of pages per document. This will allow a discussion of pages and documents to continue without losing track of whether or not pages are the same as documents.
Simplex means one sided pages and duplex means two sided pages. Simplex is always assumed. In the same way that pages and documents can be confused, so can the count of two page images on one duplex page. To avoid this, always separately list the number of pages and the number of page images.
Computer storage is given in bytes and transmission speeds are given in bits. To avoid confusion, always spell out the word bit and byte in all planning documents. It is very common for bits to turn into bytes and for bytes to turn into bits during discussions and phone conversations. This results in plans that are either eight times too large or eight times too small (or too slow, or too expensive).
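The factor of eight described above can be made concrete with a small sketch. The function name transfer_seconds is hypothetical, and the 1 million bit per second line rate is just an example value.

```python
BITS_PER_BYTE = 8

def transfer_seconds(size_bytes: int, line_rate_bits_per_second: int) -> float:
    """Time to send a file: storage is in bytes, line rates are in bits."""
    return size_bytes * BITS_PER_BYTE / line_rate_bits_per_second

page_bytes = 50_000          # one average scanned page
line_rate = 1_000_000        # example line rate: 1 million bits per second

correct = transfer_seconds(page_bytes, line_rate)
# Common mistake: treating the line rate as bytes per second.
mistaken = page_bytes / line_rate
print(f"correct: {correct} s, mistaken: {mistaken} s, ratio: {correct / mistaken:.0f}x")
```

Keeping the unit in the parameter name, as done here, is one way to apply the advice to always spell out bit and byte.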
The following is a narrative of the information in the table of document type storage requirements that is at the end of the article.
Using an estimate of 2,500 pages per file drawer and four file drawers per file cabinet, one can estimate that scanning one four drawer file cabinet full of documents (ten thousand single sided pages) will fill one CD ROM disc. Similarly, the scanned contents of two file cabinets will fill one GigaByte of magnetic disk storage.
With these figures, a simple count of the file cabinets in an organization will produce an estimate of the amount of storage required. At an even coarser level, if two file cabinets are assumed for each employee, the number of GigaBytes required is equal to the number of employees.
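A minimal sketch of the file cabinet count method, using the figures given above (2,500 pages per drawer, four drawers per cabinet, 50 thousand bytes per page). The function names are hypothetical.

```python
PAGES_PER_DRAWER = 2_500
DRAWERS_PER_CABINET = 4
BYTES_PER_PAGE = 50_000
GIGABYTE = 1_000_000_000

def cabinet_storage_bytes(cabinets: int) -> int:
    """Estimated storage for the scanned contents of a number of file cabinets."""
    return cabinets * DRAWERS_PER_CABINET * PAGES_PER_DRAWER * BYTES_PER_PAGE

def staff_storage_bytes(employees: int, cabinets_per_employee: int = 2) -> int:
    """Coarser estimate: assume two file cabinets for each employee."""
    return cabinet_storage_bytes(employees * cabinets_per_employee)

print(cabinet_storage_bytes(1) / GIGABYTE, "GigaBytes for one cabinet")
print(staff_storage_bytes(100) / GIGABYTE, "GigaBytes for 100 employees")
```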
Files in file cabinets and on linear feet of open shelving are assumed to have some open space for file growth and for ease of access. The number of pages estimated for these two storage methods takes this into account.
Standard record storage cartons (boxes) are assumed to be more tightly packed because documents in boxes are placed there for storage. Access to documents in boxes is assumed to be less frequent than to documents stored in active file cabinets and on open shelves. The estimates also take these assumptions into account.
If your storage facilities are tightly packed or are otherwise different from these estimates, you can apply a correction factor to adjust for your facility's differences. For example, a tightly packed facility might have an adjustment factor of 1.1 because there might be ten percent more pages per linear foot or drawer than in the industry standard density.
The adjustment factor is not required for ROM (ballpark) estimates. Also, the adjustment factor is easy to apply at any stage in the process. If after weeks of work, a storage estimate of 100 GigaBytes is arrived at, the decision to use an adjustment factor can be made, and the storage estimate can simply be adjusted to 1.1 times 100 GigaBytes yielding an estimate of 110 GigaBytes.
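Applying the adjustment factor at the end of the process is a single multiplication, as this sketch shows. The function name is hypothetical, and the factor is expressed as a whole percentage (10 for a factor of 1.1) only to keep the arithmetic exact.

```python
def adjusted_estimate(estimate_gigabytes: int, extra_percent: int) -> float:
    """Scale a finished estimate by an adjustment factor given as extra percent."""
    return estimate_gigabytes * (100 + extra_percent) / 100

# A 100 GigaByte estimate with a tightly packed facility (factor 1.1).
print(adjusted_estimate(100, 10), "GigaBytes")
```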
If you have double length boxes, such as 12 x 30 inches instead of the more standard 12 x 15 inch boxes, simply multiply the number of pages per box by 2.
If you have 200 foot rolls of microfilm instead of the listed 100 foot rolls, simply multiply the number of pages by 2.
A 100 ft. roll of 16 mm microfilm of 24X images actually has closer to 2,400 images than the listed 2,500 images. This slightly overstates the digital storage requirements, making the estimate more conservative. This estimate must be adjusted slightly if used conversely, because it overstates the capacity of the microfilm, making it slightly over-optimistic for estimating the number of rolls required for a project.
- - - Description continued after - - -
- - - 4 page pullout section. - - -
1 scanned page (8 1/2 by 11 inches, A4) = 50 KiloBytes (KByte) (on average, black & white, CCITT G4 compressed)
1 file cabinet (4 drawer) (10,000 pages on average) = 500 MegaBytes (MByte) = 1 CD (ROM or WORM)
2 file cabinets = 1,000 MBytes = 1 GigaByte (GByte); 10 file cabinets = 1 DVD (WORM) (see below)
2,000 file cabinets = 1,000 GigaBytes = 1 TeraByte (TByte); 2,000 file cabinets = 200 DVDs
1 box (in inches: 12 wide x 15 long x 9.5 deep) (300 x 375 x 250 mm) (2,500 pages) = 1 file drawer = 2 linear feet of files = 125 MBytes
8 boxes = 16 linear feet = 2 file cabinets = 1 GByte; 8,000 boxes = 16,000 linear feet = 1,000 GBytes = 1 TByte
1 E size drawing (48 inches by 36 inches) = 16 letter size pages (8 1/2 by 11 inches, A4) = 800 KBytes
D size = 8 letter size pages; C size = 4 pages; B size = 2 pages; A size = 1 page; new E size = 44 in. x 34 in. (Scanners have to accommodate the old E size of 48 in. x 36 in.), (A0 size is the ISO metric size equivalent nomenclature for E size), D size (metric A1 size) = 34 in x 22 in (old D size = 36 in x 24 in), C size (A2) = 22 x 17 (24 x 18), B size (A3) = 11 x 17 (12 x 18), A size (A4) = 8 1/2 x 11 (9 x 12) [105 mm microfiche is the metric A6 size]; F size = 28 x 40, Roll sizes: G size = 11 x 22 1/2 to 11 x 90, H size = 28 x 44 to 28 x 143, J size = 34 x 56 to 34 x 176, K size = 40 x 56 to 40 x 143; Newspapers: A double truck (center fold) full broadsheet is 24 in x 36 in, equivalent to an old D size drawing.
1 roll of 16 mm microfilm (100 ft, ~30 meters) (24X reduction) = 2,500 letter size images = 1 box = 1 file cabinet drawer = 125 MegaBytes
1 roll of 35 mm microfilm (100 ft) (12X reduction, open spacing, normal scan) = 1,000 letter size images = 50 MegaBytes
1 microfiche (105 mm film) (24X reduction) = 100 letter size images = 5 MegaBytes (average); 200 microfiche = 20,000 images = 1 GigaByte
In many record series, each microfiche contains only a few images because each fiche represents a single record in the series (e.g. one fiche per person in a personnel record series). In this case filming breaks on records, rather than being continuous. To a lesser extent this is also true for roll film. In these cases, the amount of storage required depends on the number of images on the film, not the number of microfiche or the number of rolls of film. A full, standard 24X microfiche has 7 rows of 14 letter size (8 1/2 x 11 or A4) images for a total of 98 images.
Scanned aperture card images require the same storage as the document or drawing in the aperture. This is true for any microform.
1 check (2 sided) (remittance) = 50 KBytes per item, 25 KBytes (1 sided), less if no patterns are present
1 credit card receipt (long: 3 1/4 x 7 7/16 inches, 2 sided) (remittance) = 35 KiloBytes, (short: 3 1/4 x 5 in., 2 sided) = 25 KBytes
[The long size credit card receipt is the same size as an 80 column punch card, which was based on the older 90 column, round hole, punched card, which in 1890 was based on the size of the old US dollar bill (before 1929). US dollar bills are now 6.14 x 2.61 inches (~156 x ~66 mm); before 1929, US dollar bills were 7.4218 x 3.125 inches (~189 x ~79 mm).]
1 library book (average, scanned in black and white) = 10 MBytes; 50 books = 500 MBytes = 1 CD; 100 books = 1 GByte
Modem = 56 Kbit (Kilobits) per second = 3 pages per minute (~ US$ 30.00 per month for a standard phone line)
ISDN (2 voice channels) = 128 Kbit per second = 10 pages per minute (~ US$ 100.00 per month) (ISDN charge)
Cable (TV) modem =~ 500 Kbits per second = 1 page per second (~ US$ 50.00 per month)
T1 (24 voice channels) = 1.544 Mbit (Megabit) per second = 3 pages per second (~ US$ 1,000.00 per month)
Ethernet (CSMA/CD) = 1 Mbit per second (effective) or 10 Mbit per second (nominal) = 2 pages per second
OC3 ATM (Optical Carrier, Asynchronous Transfer Mode) = 155 Mbit per second = 300 pages per second
OC192 (SONET: Synchronous Optical NETwork fiber) = 10 Gbit per second = 20,000 pages (2 file cabinets) per second
Dense Wavelength Division Multiplexing (DWDM) with OC192 = 320 Gigabits per second = 64 file cabinets per second
Optical carrier frequency (1,300 nm) = 230 THz (TeraHertz) (about 20,000 cycles are used for every OC192 bit transmitted)
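The page rates in the transmission list above can be compared against a simple upper bound. The sketch below divides the nominal line rate by the size of a 50 thousand byte page; it ignores protocol overhead, so it gives a ceiling rather than the more conservative working figures listed above. The function name and the dictionary of line rates are illustrative.

```python
BITS_PER_BYTE = 8

def pages_per_second(line_rate_bits_per_second: int, bytes_per_page: int = 50_000) -> float:
    """Upper bound on pages per second, ignoring protocol overhead."""
    return line_rate_bits_per_second / (bytes_per_page * BITS_PER_BYTE)

# Nominal line rates in bits per second, taken from the list above.
LINE_RATES = {
    "Modem": 56_000,
    "ISDN": 128_000,
    "T1": 1_544_000,
    "OC3 ATM": 155_000_000,
    "OC192": 10_000_000_000,
}

for name, rate in LINE_RATES.items():
    print(f"{name}: at most {pages_per_second(rate):,.2f} pages per second")
```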
1 scanned page (8 1/2 by 11 inches, A4) = 100 KiloBytes (KByte) (on average, office color, including grayscale, compressed)
1 file cabinet (4 drawer) (10,000 pages on average) = 1 GigaByte (GByte) = 2 CDs (ROM or WORM)
5 file cabinets = 1 DVD (WORM) (see below)
1,000 file cabinets = 1,000 GigaBytes = 1 TeraByte (TByte); 1,000 file cabinets = 200 DVDs
1 box (in inches: 12 wide x 15 long x 9.5 deep) (2,500 pages) = 1 file drawer = 2 linear feet of files = 250 MegaBytes
4 boxes = 8 linear feet = 1 file cabinet = 1 GigaByte; 4,000 boxes = 8,000 linear feet = 1,000 GigaBytes = 1 TeraByte
In general, when compressed, the digital files for document images scanned in office quality color are about twice the size of document images scanned in a bi-tonal, black and white format, and then G4 compressed. In office quality color scanning, the scanned color differences aid in reading a document and in increasing the quality of OCR (Optical Character Recognition). Office quality color scanning is generally at a lower resolution than black and white scanning. Office quality color includes (has subsumed) the process of grayscale scanning which can increase OCR accuracy when using low resolution scanning.
1 E size drawing (48 inches by 36 inches) = 16 letter size pages (8 1/2 by 11 inches, A4) = 1,600 KiloBytes
D size = 8 letter size pages; C size = 4 pages; B size = 2 pages; A size = 1 page
1 check (2 sided) (remittance) = 100 KiloBytes per item, 50 KiloBytes (1 sided), less if no patterns are present.
1 credit card receipt (long: 3 1/4 x 7 7/16 inches, 2 sided) (remittance) = 75 KiloBytes, (short: 3 1/4 x 5 in., 2 sided) = 50 KiloBytes
1 library book (average, scanned in office color) = 20 MegaBytes; 50 books = 1 GigaByte = 2 CDs
1 roll of 16 mm microfilm (100 ft) = 2,500 letter size images (24X reduction) = 1 box = 1 file cabinet drawer = 250 MegaBytes
1 roll of 35 mm microfilm (100 ft) = 1,000 letter size images (12X reduction, open spacing, normal scan) = 100 MegaBytes
1 microfiche (105 mm film) = 100 letter size images = 10 MegaBytes (average); 100 microfiche = 10,000 images = 1 GigaByte
1 hour of compressed color video = 2 GigaBytes (DVD, MPEG 2) (image quality dependent)
1 hour of audio = 10 MBytes (dictation, answering machine) to 500 MBytes (a CD holds 74 minutes of music)
1 color picture = 10 KBytes (thumbnail) to 5 MBytes (for each of 100 photos on a 500 MByte photo CD)
The size of a compressed file depends on the resolution (DPI: Dots Per Inch) and the detail (information) in the photograph. The detail in a photograph is dependent on the size of the negative and the quality of the film and the camera and lens. (It is not related to the print size unless the print is smaller than the negative.) The resolution of the scan should be chosen to match the detail of the photograph. For most cameras, films, and formats 35mm and smaller, the 5 MByte Photo CD format (3072 by 2048 pixels) captures all the information in the image. Note that this is in dots per image rather than dots per inch. Displays are also given in dots per image (H x V: 1024x768).
1 Chest X-ray (14 x 17 inches) = 1 MegaByte: 150 dpi (Dots Per Inch), 12 bits (compressed)
(Wavelet compression, lossless mode, has FDA 510(k) approval.) (12 bits per pixel provide 4,096 shades of gray.) // (150 dpi, 12 bit images are recommended by the American College of Radiology for primary reads.) // 14 x 17 Chest X-ray = 200 KiloBytes (For secondary reads: wavelet compression, lossy mode, has FDA 510(k) approval.) // X-rays that are originally recorded digitally rather than on film provide a resolution (image depth) of 16 bits per pixel which records 65,536 shades of gray per pixel. More shades of gray allow doctors to see very fine variations in the health of tissues, increasing the early detection of disease.
1 DVD (commonly Digital Video Disc) (same physical size as a CD ROM) = 7.9 GByte (WORM) (10 file cabinets)
DVD WORM: (Write Once, Read Many) (2 sided, 1 layer per side) 7.9 GByte (3.95 GBytes per side); DVD RW: (overwrite, ReWrite) (2 sided, 1 layer per side) 5.2 GByte; DVD ROM (Read Only Memory) (2 sided, 2 layers / side) 17 GBytes. Multimedia: 5 channel (theater quality surround sound) (5.1, Dolby AC-3) / 96 KHz audio / 24 bit audio, 8 language tracks, 32 subtitle tracks, and about 135 minutes (long enough to accommodate 94% of all movies) of high quality video (720 horizontal lines) on each of 4 layers. DVDs support runtime editing so that all ratings of a movie are on the same DVD; 'R' rated scenes can be skipped as the DVD is played. The file format is ISO 13346 UDF (Universal Disc Format) which harmonizes all CD recording standards including ISO 9660. A future technology, 3rd generation blue lasers [sort of a blue light special, as blue light has a wavelength about half that of red light], should yield a 40 GigaByte DVD ROM for HDTV.
Most document imaging resolution measures are in pixels (PICture ELement) per inch (or per mm - millimeter), and are commonly referred to as dpi (Dots per Inch) or dpmm (Dots per mm). Most motion-picture and still-photographic resolution measures are in pixels per image. This is most commonly seen in the resolution of television images: 525 lines for NTSC (National Television System Committee) and 625 lines for PAL (Phase Alternating Line) and SECAM (Sequential Couleur Avec Memoire or Sequential Colour with Memory). No matter how physically large or small an NTSC television image is displayed, there are only 525 lines of vertical resolution (480 viewable). The computer equivalent of this is 640 by 480 pixels in a standard computer image.
In pixels per image the horizontal resolution is given first. If the horizontal dimension is larger than the vertical dimension in pixels, the image is said to be landscape, if the horizontal is smaller, the image is said to be portrait.
Computer screen resolutions are chosen to have an aspect ratio (the ratio of width to height) of 4 to 3 (the traditional television ratio; the ‘golden ratio’ of the art world is wider, at about 1.6 to 1) and to have the number of pixels be an integer multiple of a power of 2. (Powers of 2 are given here as 2**N for the Nth power of 2.) When a prefix is added to the word pixel it is shortened to pel (Picture ELement). A 1 million pixel display is then a 1 MegaPel display.
IBM: CGA (Color Graphics Adapter) 320 x 200, EGA (Enhanced Graphics Adapter) 640 x 350, VGA (Video Graphics Array) 640 x 480, XGA (Extended Graphics Array) 1024 x 768.
IBM PC compatible: VGA 640 x 480, SVGA (Super VGA) 800 x 600, XVGA (Extended VGA) and UVGA (Ultra VGA) 1024 x 768; although SVGA, XVGA, and UVGA can mean anything that is more than the VGA’s 640 x 480.
Common display resolutions: (A resolution written as [a x b][c x d] is equal to [a * c x b * d]; the bracketed pairs are multiplied element by element, widths with widths and heights with heights.)
640 x 480 pixel resolution = [2**5 x 2**5][4 x 3][5 x 5] (VGA standard computer screen resolution)
800 x 600 = [2**3 x 2**3][4 x 3][25 x 25] (usually SVGA)
1024 x 768 = [2**8 x 2**8][4 x 3] (often XVGA or UVGA)
1280 x 1024 = [2**8 x 2**8][5 x 4] (sometimes XVGA or UVGA; note the 5 to 4 aspect ratio)
1152 x 900 (Sun Microsystems) 1152 x 870 (Mac) (1152 = 2**4 x 72 typeset points per inch). Some Sun Microsystems and Apple / Mac screen resolutions were chosen so that the actual screen resolutions were 72 dpi to match the 72 points per inch used in typesetting.
1600 x 1200 high resolution document imaging workstation [2**4 x 2**4][4 x 3][25 x 25]
1800 x 1440 high resolution document imaging workstation [72 x 72][25 x 20]
2048 x 1536 high resolution grayscale document imaging workstation [2**9 x 2**9][4 x 3]
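Factorizations like those in the list above can be checked mechanically. The sketch below is one way to do it, not the method used to build the list: it divides both dimensions by their greatest common divisor to obtain the aspect ratio and the common factor. The function name aspect_ratio is hypothetical.

```python
from math import gcd

def aspect_ratio(width: int, height: int) -> tuple[int, int, int]:
    """Return (ratio_w, ratio_h, common) with width = ratio_w * common
    and height = ratio_h * common."""
    common = gcd(width, height)
    return width // common, height // common, common

RESOLUTIONS = [(640, 480), (800, 600), (1024, 768), (1280, 1024),
               (1600, 1200), (1800, 1440), (2048, 1536)]

for width, height in RESOLUTIONS:
    ratio_w, ratio_h, common = aspect_ratio(width, height)
    print(f"{width} x {height} = [{ratio_w} x {ratio_h}] times {common}")
```

The common factor can then be split further into a power of 2 and any remaining factor, as in the bracket notation used above.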
The DVD NTSC resolution is 720 x 480 and the DVD PAL/SECAM resolution is 720 x 576.
The computer version of HDTV (High Definition TV) resolution is 1920 x 1200 (Sun Microsystems), a 16 to 10 aspect ratio that is slightly taller than the HDTV 16 to 9 aspect ratio. The 1920 x 1200 resolution is designed to accommodate the NTSC derived HDTV video resolutions of 1920 x 1080 and 1920 x 1035 and the PAL and SECAM derived HDTV video resolution of 1920 x 1152 (1152 = 2 x 576).
Kodak PhotoCD family of resolutions: (Based on a 2 x 3 aspect ratio and an integer power of 2. The multiple of the base gives the number of pixels per image relative to the base image size in pixels.) A Kodak PhotoCD contains five resolutions of each image: 1/16 Base through 16 Base. (The average compressed file size containing all five resolutions is about 5 MegaBytes per image.) A Kodak Pro PhotoCD contains the five resolutions for each image found on a PhotoCD plus a sixth 64 Base resolution.
1/16 base = 128 x 192 [2 x 3][2**6 x 2**6] (thumbnail, index print on CD cover)
1/4 base = 256 x 384 [2 x 3][2**7 x 2**7] (largest Kodak size that is smaller than 480 x 640 for display on TV)
1 base = 512 x 768 [2 x 3][2**8 x 2**8]
4 base = 1024 x 1536 [2 x 3][2**9 x 2**9] (largest Kodak size that is smaller than 1920 x 1152 for HDTV)
16 base = 2048 x 3072 [2 x 3][2**10 x 2**10] (captures all the resolution on most 35 mm film images)
64 base = 4096 x 6144 [2 x 3][2**11 x 2**11] (for most film formats larger than 35 mm)
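The PhotoCD family above follows a simple rule: each step doubles both dimensions, which multiplies the pixel count by 4. A minimal sketch that regenerates the six sizes from that rule follows; the function name and the use of Fraction for the base multiple are choices made for this illustration.

```python
from fractions import Fraction

def photocd_resolutions() -> dict[Fraction, tuple[int, int]]:
    """Map each base multiple (1/16 through 64) to its size in pixels."""
    sizes = {}
    for step in range(-2, 4):              # 1/16, 1/4, 1, 4, 16, 64 base
        scale = 2 ** (8 + step)            # 1 base uses 2**8
        sizes[Fraction(4) ** step] = (2 * scale, 3 * scale)
    return sizes

for multiple, (short_side, long_side) in photocd_resolutions().items():
    print(f"{multiple} base = {short_side} x {long_side}")
```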
DVDs can be used to record audio only, with no video. DVD audio includes various still images. DVD audio is different than the audio that is used as part of DVD video.
The DVD audio standard is for up to 6 channels, a sampling rate of 48, 96, or 192 KHz, and a sample size of 16, 20, or 24 bits. With 24 bit samples taken at a 192 KHz rate, this provides a 96 KHz frequency response and a 144 dB dynamic range. DVD audio can also provide for a lossless audio compression of about 2 to 1 which would have a playing time of 120 to 140 minutes for two-channel 192 KHz / 24 bit recordings for a single layer. Each DVD disc can have up to 4 layers, 2 layers per side.
DVD audio includes various still image modes for synchronized lyrics, navigation, etc. DVD audio allows up to 16 still graphics per track and a set of limited transitions.
The audio used in DVD video can also be used without the video. This produces a stereo, DVD quality, play time of over 55 hours at 192 Kilobits per second (compressed) for a single layer and over 200 hours for a 4 layer DVD disc. Lower quality sound can be recorded as computer files on a DVD for much longer play times. At a compressed audio rate of 16 Kilobits per second (in the low range of telephony quality), this is 9 million seconds, 150 thousand minutes, 2,500 hours, 100 days, 15 weeks, or 3 months of audio on a 4 layer DVD disc.
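Play times like those above follow from dividing disc capacity by the audio bit rate. The sketch below assumes a capacity of 4.7 billion bytes for a single layer (an assumption not stated in this pamphlet) and 17 billion bytes for a 4 layer disc (the DVD ROM figure given earlier). The results come out close to, and slightly below, the rounded figures quoted above.

```python
SINGLE_LAYER_BYTES = 4_700_000_000   # assumed single layer capacity
FOUR_LAYER_BYTES = 17_000_000_000    # 2 sided, 2 layers per side

def play_seconds(capacity_bytes: int, bits_per_second: int) -> int:
    """Whole seconds of audio that fit at a given compressed bit rate."""
    return capacity_bytes * 8 // bits_per_second

def describe(seconds: int) -> str:
    hours = seconds / 3600
    return f"{seconds:,} seconds = {hours:,.0f} hours = {hours / 24:,.0f} days"

print("192 Kilobits/s, single layer:", describe(play_seconds(SINGLE_LAYER_BYTES, 192_000)))
print("192 Kilobits/s, 4 layers:    ", describe(play_seconds(FOUR_LAYER_BYTES, 192_000)))
print(" 16 Kilobits/s, 4 layers:    ", describe(play_seconds(FOUR_LAYER_BYTES, 16_000)))
```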
Aerial photography uses photographs taken from the air, recording the visible electromagnetic spectrum (light), as maps of geographic areas. Remote sensing includes photographs taken from the air and from beyond the atmosphere of areas on the earth and other celestial bodies, using many segments of the electromagnetic spectrum including visible light, ultraviolet, infrared, and radar illumination. Digital orthophotography digitally rectifies the pixels of digitized aerial photographs into a continuous map, usually registered to a layer of a GIS (Geographic Information System).
For cities, 2 inch to 6 inch pixels are popular for digital orthophotography. A digital orthophotograph of a 500 square mile city using 6 inch pixels would have 4 pixels per square foot, 100 million pixels per square mile (there are approximately 25 million square feet per square mile), for a total of 50 GigaPels (50 billion pixels). Using 24 bit color and estimating a lossless three-to-one compression, this digital orthophotographic image would require 50 GigaBytes to store. If 2 inch pixels were used, a 500 square mile city would have 9 times as many pixels or 450 GigaPels requiring 450 GigaBytes to store using the same assumptions. Using 2 inch pixels a 50 square mile city would have 45 GigaPels requiring 45 GigaBytes to store using the same compression assumptions. The metric equivalents are 50 millimeter (mm) and 100 mm pixels which are respectively 400 and 100 to the square meter. For a 1 thousand square kilometer city this would be 100 GigaPels using 100 mm pixels and would require 100 GigaBytes to store. Using 50 mm pixels for a 1 thousand square kilometer city, this would require 400 GigaPels requiring 400 GigaBytes to store. A 100 square kilometer city, using 50 mm pixels would be imaged in 40 GigaPels which would require 40 GigaBytes to store.
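The US customary orthophotography figures above can be reproduced with a short calculation. This sketch uses the rounded 25 million square feet per square mile, 24 bit color (3 bytes per pixel), and the estimated three-to-one compression from the text; the function name and parameter names are hypothetical.

```python
SQ_FEET_PER_SQ_MILE = 25_000_000   # rounded figure used in the text

def orthophoto_bytes(area_sq_miles: int, pixel_inches: int,
                     bytes_per_pixel: int = 3, compression_ratio: int = 3) -> int:
    """Estimated compressed storage for an orthophotograph of a city."""
    if 12 % pixel_inches != 0:
        raise ValueError("this sketch expects a pixel size that divides 12 inches")
    pixels_per_foot = 12 // pixel_inches
    pixels = area_sq_miles * SQ_FEET_PER_SQ_MILE * pixels_per_foot ** 2
    return pixels * bytes_per_pixel // compression_ratio

for area, pixel in [(500, 6), (500, 2), (50, 2)]:
    gigabytes = orthophoto_bytes(area, pixel) / 1_000_000_000
    print(f"{area} square miles, {pixel} inch pixels: {gigabytes:,.0f} GigaBytes")
```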
In digital orthophotography, in addition to color, each pixel has an associated z-axis value, the height of the pixel above sea level. When added to the x and y Cartesian coordinates of the pixel, the z values construct a digital terrain model over which the image can be mapped as a surface. This is similar to the way that images are created in virtual reality. By adding a t value, a fourth dimension that represents a specific point in time, animations can be done telling a geologic story or the development history of a city.
In remote sensing (satellite imagery such as weather photographs or images for crop quality assessment or storm damage / flooding), a 24 bit color image of an area 1 thousand kilometers by 1 thousand kilometers, using 100 meter pixels (pixels that are 100 meters by 100 meters), would contain 100 million pixels. Estimating a lossless three-to-one compression this would require 100 MegaBytes to store. The pixels used can be of any size. In astronomy, a single pixel can include an entire earth type planet (10 thousand kilometer pixels = 10 Mm, 10 Megameter pixel), a sun type star (1 million Kilometer pixels = 1 Gm, 1 Gigameter pixel), or a galaxy (100 thousand light year pixels =~ 1 Zm, 1 Zettameter pixel). The largest practical pixel is a 400 Ym, 400 Yottameter pixel, the diameter of the observable universe.
Semiconductors are made using digital photographic techniques (pixels). Recently, microprocessor production processes were improved from .25 micron (.25 um, micrometer) (250 nm, nanometer) design rules to .18 um (180 nm) design rules. This means that the pixel size for semiconductor devices is now slightly less than 1/5 micron (200 nm). A micron is 1 / 1 millionth of a meter. Using 200 nanometer (nm) pixels and assuming 1/25th of the area was used for active transistors, a 1 millimeter (mm) square area (about the size of the head of a pin) could hold 25 MegaPels (25 million pixels) and 1 million transistors.
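The pin head example above is a two step calculation, sketched here with integer arithmetic in nanometers. The function name is hypothetical; the 1/25th active area share is the assumption stated in the text.

```python
NM_PER_MM = 1_000_000

def pixels_in_square(side_nm: int, pixel_nm: int) -> int:
    """Number of square pixels that tile a square area."""
    per_side = side_nm // pixel_nm
    return per_side * per_side

pixels = pixels_in_square(1 * NM_PER_MM, 200)   # 1 mm square, 200 nm pixels
transistors = pixels // 25                      # 1/25th of the area is active
print(f"{pixels:,} pixels, {transistors:,} transistors")
```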
The smallest practical pixel would be a pixel used as part of a halftone dot that represented the edge of the path of a sub-atomic particle, such as a neutrino. To create a smooth path in a specific color, a printed resolution of 2540 dpi (100 dpmm) would be used. Assuming a 1 ym (yoctometer) wide path, rendered as a 10 mm wide path, the width represented by each pixel would be 1/1 thousand ym. For a superstring (2 x 10**-35 m wide), the pixel width would be 2/1 trillion ym.
1 Byte (B) is defined as the set of bits used to represent 1 character. Commonly: 1 Byte = 8 bits (b). (Byte & bit are best spelled out.) 8 bits can represent 256 different characters. 1 Unicode Byte = 16 bits = 1 character. 16 bits can represent 65,536 different characters to include most of the world's languages in the same, consistent character set.
1,000 Bytes = 1 KiloByte (exactly 1 Thousand in common and legal usage) (exactly 1,024 Bytes = 2**10 = 2 to the 10th power in computer terms); 1,000 KBytes = 1 MegaByte (exactly 1 Million in common and legal usage) (exactly 1,024 KBytes = 1,048,576 Bytes = 2**20 = 2 to the 20th power in computer terms); (Due to lawsuits in recent years only the legal terms can be used commercially.) 1,000 MBytes = 1 GigaByte (Billion); 1,000 GBytes = 1 TeraByte (Trillion); 1,000 TBytes = 1 PetaByte (Quadrillion); 1,000 PBytes = 1 ExaByte (Quintillion); 1,000 EBytes = 1 ZettaByte (Sextillion); 1,000 ZBytes = 1 YottaByte (YByte) (Septillion).
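The gap between the common (decimal) and computer (binary) meanings of each prefix grows with the size of the unit. The sketch below tabulates both; the function names and the prefix list are written out for this illustration.

```python
PREFIXES = ["Kilo", "Mega", "Giga", "Tera", "Peta", "Exa", "Zetta", "Yotta"]

def decimal_size(prefix: str) -> int:
    """Bytes in one unit using powers of 1,000 (common and legal usage)."""
    return 1000 ** (PREFIXES.index(prefix) + 1)

def binary_size(prefix: str) -> int:
    """Bytes in one unit using powers of 1,024 (computer terms)."""
    return 1024 ** (PREFIXES.index(prefix) + 1)

for prefix in PREFIXES:
    excess = binary_size(prefix) / decimal_size(prefix) - 1
    print(f"1 {prefix}Byte = {decimal_size(prefix):,} bytes; binary is {excess:.1%} larger")
```

For planning purposes the decimal values keep the arithmetic round, which is consistent with the estimating approach used throughout this pamphlet.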
1 millisecond (ms) = 1/1,000 second; 1 microsecond (us) (u is substituted for the Greek letter mu) = 1/1,000 ms; 1 nanosecond (ns) = 1/1,000 us; 1 picosecond (ps) = 1/1,000 ns; 1 femtosecond (fs) = 1/1,000 ps; 1 attosecond (as) = 1/1,000 fs; 1 zeptosecond (zs) = 1/1,000 as; 1 yoctosecond (ys) = 1/1,000 zs.
1 Hertz = 1 cycle per second (e.g. 1 clock cycle in a computer, which corresponds roughly to 1 instruction execution). A 1,000 cycle per second signal or action is called a 1 KiloHertz signal or action (a 1 KHz signal); each cycle of such a signal is a millisecond long. 1,000 KHz = 1 MegaHertz. Each frequency unit pairs with the unit for the length of one cycle, at plus and minus the listed power of ten: (KHz:ms:3) (MHz:us:6) (GHz:ns:9) (THz:ps:12) (PHz:fs:15) (EHz:as:18) (ZHz:zs:21) (YHz:ys:24). Because light travels about 300 MegaMeters (Mm) in 1 second and has a wavelength of about 400 nm for blue light (about 700 nm for red light), the frequency of light is about 750 THz for blue light (about 430 THz for red light). This is because speed (e.g. c, the speed of light, which is a constant) = wavelength x frequency.
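As a rough check of the frequency figures, the relationship speed = wavelength x frequency can be sketched as follows. The function names are illustrative, and the speed of light is the rounded 300 MegaMeters per second used in the text, not the exact constant.

```python
# Sketch: frequency from wavelength, and cycle length from frequency.
# Uses the rounded speed of light from the text (300 million meters/second).
SPEED_OF_LIGHT_M_PER_S = 300e6

def frequency_thz(wavelength_nm: float) -> float:
    """Frequency in TeraHertz of light with the given wavelength in nm."""
    wavelength_m = wavelength_nm * 1e-9
    return SPEED_OF_LIGHT_M_PER_S / wavelength_m / 1e12

def cycle_seconds(frequency_hz: float) -> float:
    """Length of one cycle, in seconds, for a signal of the given frequency."""
    return 1.0 / frequency_hz

print(round(frequency_thz(400)))  # blue light, about 750 THz
print(round(frequency_thz(700)))  # red light, about 430 THz
print(cycle_seconds(1_000))       # a 1 KHz signal has a 1 ms cycle
```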
- - - End of 4 page pullout section. - - - - - - Continuation of description - - -
1 pulp tree (loblolly pine) = 1/10th cord of wood (.2 cubic meters of wood) = 10,000 pages = 1 File Cabinet = 4 boxes = 1/2 GigaByte = 1 CD
1 lumber tree (20 inch (500 mm) diameter, 110 ft (35 m) tall, 50 years old) = 1 cord = 10 pulp trees (8 in. (200 mm) diameter, 50 ft (15 meters) tall, 20 yrs old)
1 cord = 4 ft x 4 ft x 8 ft = 128 cubic feet (3.5 cubic meters) as stacked for storage (75 cubic feet of wood, 2 cubic meters of wood) = 100,000 pages = 5 GigaBytes
1 word processor or OCR’ed (Optical Character Recognition) page = 5 KBytes (all pages listed above are scanned pages)
1 compressed page of COLD (Computer Output to Laser Disc) or COOL (Computer Output On-Line) (including index) = 2 KBytes for letter size statements, 4 KBytes for 11 x 14 inch fanfolded greenbar computer sheets, 10 KBytes for All Points Addressable (APA) pages such as IBM AFP (Advanced Function Printing) and Xerox Metacode.
Minimum commercial scanning cost for backfile conversion (more than 1 million pages) = about 5 US cents per page
When folded, blueprints fit in a file drawer and have a thickness equal to the number of letter size documents that would cover the blueprints. For an E size drawing, this is 16 letter size documents because a folded E size drawing has 16 layers of paper. Using this relationship, the 50 thousand byte estimate (for a scanned page) can be used to estimate that an E size drawing would require 800 thousand bytes (16 times 50 thousand bytes) of storage. This is what is shown in the table.
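The folded-drawing rule can be sketched as an area ratio against a letter size page. The sheet dimensions below are the standard US drawing sizes (an assumption, they are not listed in this section), and the function names are illustrative.

```python
# Sketch: storage estimate for a drawing from its letter-size equivalents.
# Sheet dimensions (inches) are standard US drawing sizes, assumed here.
PAGE_BYTES = 50_000            # estimate for one scanned letter size page
LETTER_IN = (8.5, 11.0)
DRAWING_SIZES_IN = {
    "A": (8.5, 11.0),
    "B": (11.0, 17.0),
    "C": (17.0, 22.0),
    "D": (22.0, 34.0),
    "E": (34.0, 44.0),
}

def letter_equivalents(size: str) -> int:
    """Number of letter size pages that would cover the drawing."""
    width, height = DRAWING_SIZES_IN[size]
    return round((width * height) / (LETTER_IN[0] * LETTER_IN[1]))

def drawing_bytes(size: str, page_bytes: int = PAGE_BYTES) -> int:
    """Estimated storage for one scanned drawing of the given size."""
    return letter_equivalents(size) * page_bytes

print(letter_equivalents("E"))  # 16
print(drawing_bytes("E"))       # 800000
```

The same ratio gives 8 letter size equivalents (400 thousand bytes) for a D size drawing under these assumed dimensions.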
Aperture cards are the microform most commonly used to store blueprints. Aperture cards are punch cards that have a hole (or aperture) cut into them that holds one 35 mm slide which reproduces one blueprint sheet in most cases. An image scanned from a blueprint's image in an aperture card requires the same amount of storage needed for an image scanned from the original full size blueprint.
Multimedia documents exist in compressed digital form, and the listing shows average sizes for these documents as well. The DVD (commonly Digital Video Disc) multimedia format standards will provide a stable foundation for working with these types of documents.
Current DVD developments are posted at http://www.VideoDiscovery.com/vdyweb/DVD/DVDfaq.html by Jim Taylor, who wrote the book: DVD Demystified: The Guidebook for DVD-Video and DVD-ROM.
Why is a computer MegaByte not exactly one million bytes? Because computers use binary arithmetic and the closest round number computers have that is near one million is two to the twentieth power. Two to the twentieth power is equal to 1,048,576. This is why many computer program displays show an exact number of bytes beside a seemingly smaller number of megabytes (for example, the disk or file properties window in Windows 95). Similarly, a computer KiloByte is not exactly one thousand bytes, but is two to the tenth power or 1,024 bytes.
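A minimal sketch of the two meanings, using illustrative function names, shows why the same byte count displays as a smaller number of computer MegaBytes than metric MegaBytes.

```python
# Sketch: the same byte count expressed in metric and computer MegaBytes.
METRIC_MEGABYTE = 1_000_000    # one million bytes
COMPUTER_MEGABYTE = 2 ** 20    # 1,048,576 bytes

def metric_megabytes(byte_count: int) -> float:
    return byte_count / METRIC_MEGABYTE

def computer_megabytes(byte_count: int) -> float:
    return byte_count / COMPUTER_MEGABYTE

size = 1_048_576
print(metric_megabytes(size))    # 1.048576
print(computer_megabytes(size))  # 1.0
```

The gap is under five percent at the MegaByte level, which is why it rarely matters for planning estimates.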
These differences are so small that they do not belong in a management discussion, but lawsuits have been brought over them. Being kind and understanding to those who bring them up is the fastest way to move on to a productive discussion.
Because of the lawsuits over the meaning of KiloByte, MegaByte, etc. only the metric meanings (based on units of 1 thousand) can be used in commercial discussions of storage capacities. The computer based terminology, based on units of 1,024 = 2 to the 10th power, will continue to be used in discussing computer configurations because computers actually use equipment based on units of 1,024. The computer units will always be converted into commercial metric (1 thousand based) units before commercial discussions take place. Document imaging and document management discussions fall into the category of commercial discussions.
There are several common communications line types available. The speed of each line type in bits per second and page images per second is given along with a rough estimate of the monthly cost of a local connection (two or three miles). This will help in confirming that the speed of access desired is possible over the communications lines proposed.
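As a rough illustration of converting line speed to page images per second, the sketch below divides bits per second by the bits in one 50 thousand byte page image. The line speeds shown are nominal rates chosen for illustration (assumptions, not figures from the table), and protocol overhead is ignored, so real throughput would be lower.

```python
# Sketch: page images per second over a communications line.
# Line speeds are nominal, assumed values; protocol overhead is ignored.
PAGE_BYTES = 50_000
BITS_PER_BYTE = 8

NOMINAL_LINE_SPEEDS_BPS = {
    "56 Kilobit modem": 56_000,
    "T1": 1_544_000,
    "10 Megabit ethernet": 10_000_000,
    "100 Megabit ethernet": 100_000_000,
}

def pages_per_second(bits_per_second: int, page_bytes: int = PAGE_BYTES) -> float:
    """Page images per second the line could carry at its nominal speed."""
    return bits_per_second / (page_bytes * BITS_PER_BYTE)

def seconds_per_page(bits_per_second: int, page_bytes: int = PAGE_BYTES) -> float:
    """Seconds needed to move one page image at the nominal line speed."""
    return (page_bytes * BITS_PER_BYTE) / bits_per_second

for name, bps in NOMINAL_LINE_SPEEDS_BPS.items():
    print(f"{name}: {pages_per_second(bps):.2f} pages per second")
```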
The CD (Compact Disc) development was funded by the music industry. A CD can hold about 650 MegaBytes of storage. In this brochure, a CD is conservatively estimated to hold 500 MegaBytes in actual practice. This is consistent with the goal of erring on the conservative side and slightly overestimating the storage required at every step. With these estimates, one CD can exactly hold the scanned contents of one standard file cabinet, meeting the goal of producing simple, round number, estimates.
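The round-number relationship between pages, file cabinets, and CDs can be sketched as follows. The names are illustrative; the figures (50 thousand bytes per page, 10,000 pages per file cabinet, 500 MegaBytes per CD in practice) are the estimates used in this brochure.

```python
# Sketch: how many scanned pages and file cabinets fit on one disc.
PAGE_BYTES = 50_000            # estimate for one scanned page
PAGES_PER_CABINET = 10_000     # one standard file cabinet
CD_PRACTICAL_BYTES = 500_000_000   # conservative figure used in the brochure
CD_NOMINAL_BYTES = 650_000_000     # approximate rated capacity

def pages_per_disc(capacity_bytes: int, page_bytes: int = PAGE_BYTES) -> int:
    """Whole scanned pages that fit in the given capacity."""
    return capacity_bytes // page_bytes

def cabinets_per_disc(capacity_bytes: int) -> float:
    """File cabinets of scanned pages that fit in the given capacity."""
    return pages_per_disc(capacity_bytes) / PAGES_PER_CABINET

print(pages_per_disc(CD_PRACTICAL_BYTES))     # 10000
print(cabinets_per_disc(CD_PRACTICAL_BYTES))  # 1.0
print(pages_per_disc(CD_NOMINAL_BYTES))       # 13000
```

The difference between the 13,000 page nominal figure and the 10,000 page working figure is the safety margin.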
The DVD (commonly Digital Video Disc) has been funded by the video industry, the movie industry, the music industry, and the computer industry. The DVD is about to unify the PC, TV, telephony, and document management. The DVD has several versions and options. The table entry shows the capacity of each version. Also shown is the DVD format specification for recording video and audio.
The estimate of 10 file cabinets per DVD is very conservative. It assumes considerable overhead for storing indexes, and gives a large amount of weight to the goal of creating round number estimates, using a figure of 10 rather than 12 file cabinets per DVD.
Pixel size is based on the image or object scanned, for example 1 / 300 inch (square) pixels scanned from a letter size page scanned at 300 dpi. If the image has been microfilmed at a reduction of 12X, then the pixels scanned on the microfilm are scanned at 3600 dpi (1 / 3600 inch square) to have an effective resolution of 300 dpi relative to the original letter size page. If the pixels are displayed on a monitor that has a resolution of 100 dpi, then the pixels are 100 dpi (1 / 100 inch square), and have been enlarged, along with the document, by a factor of 300 percent. If the pixels are combined, 9 pixels to 1 pixel (3 pixels to 1 pixel in both dimensions of the two dimensional image), to display the letter size document at a 1 to 1, normal size, on a 100 dpi monitor, then the pixel size is increased and the image resolution is decreased.
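The resolution relationships in the paragraph above can be sketched with three small functions. The function names are illustrative; the 300 dpi, 12X, and 100 dpi values are the ones used in the example above.

```python
# Sketch: scan resolution on microfilm, and display scaling on a monitor.
def film_scan_dpi(effective_dpi: int, reduction: int) -> int:
    """dpi needed on the film to reach effective_dpi on the original page."""
    return effective_dpi * reduction

def display_enlargement(image_dpi: int, monitor_dpi: int) -> float:
    """Enlargement factor when each image pixel maps to one monitor pixel."""
    return image_dpi / monitor_dpi

def pixels_combined_for_normal_size(image_dpi: int, monitor_dpi: int) -> int:
    """Image pixels merged into one monitor pixel for a 1 to 1 display."""
    per_dimension = image_dpi // monitor_dpi
    return per_dimension * per_dimension

print(film_scan_dpi(300, 12))                     # 3600
print(display_enlargement(300, 100))              # 3.0, i.e. 300 percent
print(pixels_combined_for_normal_size(300, 100))  # 9
```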
When speaking of pixels per image, the number of pixels, and the amount of information in the pixels stays fixed, no matter what size the image is reproduced. This relationship is most clearly seen in a store selling many sizes of television sets. If the same demonstration video clip is being shown on every television set, then every set has the same picture, with all the same picture information, no matter what size the television set is.
Pixels do not have a size in the computer or when they are stored digitally. A raster of digitally stored pixels is just an array of numbers. This array of numbers carries all the image information.
Meta-data is the information carried with the raster (array) of sizeless pixels (that make up an image) that includes the information on the size of the pixels in the scanned (or original) object, the size of the pixels in a scanned micro- or macro-form of the object, and the size(s) of the pixels in an intended reproduction or reproductions of the object or document.
For magnetic disks, use disk with a k. Disk follows the metal disks in a harrow. For optical discs such as DVD and CD use disc with a c. Disc follows the spelling of music record discs. When referring to disks and discs collectively, disk is used in this article.
Just a note on the environmental aspects of these figures: Each full file cabinet represents one pulp tree. Pulp trees are grown fast for paper or are trees culled from among the trees that will be allowed to grow to a larger size for lumber.
When documents are imported directly in the form they were created in, they require much less storage space. Because COLD (Computer Output to Laser Disc) pages come from mainframe computers, their formatting is very simple and requires even less storage space than word processor pages.
Storage requirement estimates can be done more quickly by ignoring the storage requirements for OCR text and indices. This shortcut will not greatly affect the accuracy of storage estimates because of the small size of text and indices relative to the size of compressed scanned images. If some of the images will be on optical media, but all of the indices and OCR text will be on magnetic media, then the additional accuracy provided by the following section may be useful.
OCR (Optical Character Recognition) for full text storage produces the largest indices. At 5 thousand bytes per 50 thousand byte raster scanned page image, the OCR text takes up ten percent of the storage in a system. An additional 2 to 10 thousand bytes are required for the actual index to the OCR text, the full text index, depending on the indexing software used.
Key word and database indices rarely have more than one hundred characters per document. For one page documents, 500 bytes is one percent of the document page image size, so the database entry is less than one-fifth of one percent of the page image size, and can be ignored for most systems.
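The relative sizes of text and indices can be sketched as fractions of the page image size. The names are illustrative, and the byte counts are the estimates given in the paragraphs above.

```python
# Sketch: text and index storage as a fraction of the scanned page image.
PAGE_IMAGE_BYTES = 50_000
OCR_TEXT_BYTES = 5_000
FULL_TEXT_INDEX_BYTES = (2_000, 10_000)   # low and high estimates
DATABASE_ENTRY_BYTES = 100                # about one hundred characters

def fraction_of_image(extra_bytes: int, image_bytes: int = PAGE_IMAGE_BYTES) -> float:
    """Extra storage expressed as a fraction of one page image."""
    return extra_bytes / image_bytes

print(f"OCR text: {fraction_of_image(OCR_TEXT_BYTES):.1%}")
low, high = FULL_TEXT_INDEX_BYTES
print(f"Full text index: {fraction_of_image(low):.1%} to {fraction_of_image(high):.1%}")
print(f"Database entry: {fraction_of_image(DATABASE_ENTRY_BYTES):.1%}")
```

Under these estimates only the full text index, at its high end, adds a noticeable share to the total.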
As with scanned images, the size of computer generated text files can be very accurately estimated by measuring the first one percent of the documents processed when the system goes into operation.
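One simple way to apply this measurement, offered as a sketch rather than a prescribed procedure, is to average the sizes of the sampled files and scale by the total document count. The function name and the sample values are hypothetical.

```python
# Sketch: project total storage from the sizes of a measured sample.
# The sample sizes below are made-up values for illustration only.
def estimate_total_bytes(sample_sizes: list[int], total_documents: int) -> int:
    """Scale the average sampled file size up to the full collection."""
    if not sample_sizes:
        raise ValueError("at least one measured file size is required")
    average = sum(sample_sizes) / len(sample_sizes)
    return round(average * total_documents)

sample = [4_000, 5_000, 6_000]             # hypothetical measured files
print(estimate_total_bytes(sample, 300))   # 1500000
```

This assumes the first one percent is representative of the rest of the collection; if document types change later in the conversion, the sample should be repeated.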
Storage costs are very often less than ten percent of the cost of document conversion. This estimate will provide a basis for assigning the best relative weights and values to the cost of storage and scanning in evaluating system feasibility.
These estimates are intended to assist managers in creating a dialog with all parties involved in a document imaging system. An 'I agree with that.' sometimes means 'I did not read it.' The best dialogs start with 'That's wrong, here is what it should say, and this is why it should say it.' During implementation, there is nothing like 'My estimate says X and your report says Y, why are they different?'
From the records manager who says 'You missed two of the records storage rooms.', to the systems staff member who says 'We did not know you needed 100 Megabit ethernet.', this brochure is intended to help everyone involved in a document management project contribute to its success. [Article 003v27]
When using the information in this article, please check the website http://www.ArchiveBuilders.com for updates. The version number for this article is located at the end of the article and in the Note to Editors section below. The website also has articles that provide more details on some of the terms and concepts in this article.
Please let us know how you like this paper, or if you have any questions. What would you like to see in the future? For more, and the most recent version of this article, please visit our web site at http://www.ArchiveBuilders.com. We also have the articles in Microsoft Word format, which prints on far fewer pages than the HTML version. Also, please let us know where you saw this article.
Please send your comments via email to:
Tel: +1 (310) 937-7000. Fax: +1 (310) 937-7001.
Reprinted from Archive Planning, Volume 3, number 4, 1999, Archive Builders' analysis newsletter for document management.
All trademarks are the property of their respective holders.
We will continue to update these articles as we get comments. Please contact us for the most current version before you publish and please request permission to publish the article. Permission will be given freely for most purposes. Also, please send us a copy of the publication when you publish the article. The articles are also available in a Microsoft Word format that can be printed on many fewer pages than the HTML format.
1147 Manhattan Avenue, Suite 322
Manhattan Beach, CA 90266
Tel: +1 (310) 937-7000 Fax: +1 (310) 937-7001
Steve Gilheany, BA in Computer Science, MBA, MLS Specialization in Information Science, CDIA (Certified Document Imaging System Architect), AIIM Master, and AIIM Laureate, of Information Technologies, CRM (Certified Records Manager, ARMA) has eighteen years experience in document imaging and is a Sr. Systems Engineer at Archive Builders.
Steve Gilheany is a Sr. Systems Engineer at Archive Builders. He has worked in digital document management and document imaging for seventeen years.
His experience in the application of document management and document imaging in industry includes: aerospace, banking, manufacturing, natural resources, petroleum refining, transportation, energy, federal, state, and local government, civil engineering, utilities, entertainment, commercial records centers, archives, non-profit development, education, and administrative, engineering, production, legal, and medical records management. At the same time, he has worked in product management for hypertext, for windows based user interface systems, for computer displays, for engineering drawing, letter size, microform, and color scanning, and for xerographic, photographic, newspaper, engineering drawing, and color printing.
In addition, he has nine years of experience in data center operations and database and computer communications systems design, programming, testing, and software configuration management. He has an MLS Specialization in Information Science and an MBA with a concentration in Computer and Information Systems from UCLA, a California Adult Education teaching credential, and a BA in Computer Science from the University of Wisconsin at Madison. His industry certifications include: the CDIA (Certified Document Imaging System Architect) and the AIIM Master, and AIIM Laureate, of Information Technologies (from AIIM International, the Association of Information and Image Management, http://www.AIIM.org), and the CRM (Certified Records Manager) (from the ICRM, the Institute of Certified Records Managers, an affiliate of ARMA International, the Association of Records Managers and Administrators, http://www.ARMA.org).
For more information, courses, and papers: