
How big is your image? A study of photo sizes when formatted for HDTV projection.


I've recently been working on a project (using a Raspberry Pi) that projects my favorite photos onto a TV screen. This is a 1080p HDTV, so I start by resizing the files to fit a 1920x1080-pixel frame and converting them to JPEG format at 90% quality (these are the Adobe Lightroom settings I use). One interesting thought that crossed my mind was: how big will the file sizes be on average? And, perhaps more interestingly, how are they distributed? Are most files about the same size, or are they 'uniformly' distributed? This is more than a simple academic curiosity, as the disk space on my Raspberry Pi is limited (note to self: buy a larger SD card next time), and as of this writing I have already uploaded close to 5000 'favorite' files to the device. Having a good handle on the distribution of file sizes allows me to make some educated guesses about how much space I need to store these images. Hopefully, this will also be of interest to others working on a similar problem.
I suspect that the average size and the distribution of file sizes depend somewhat on the type of photography you do. My work is varied and includes a fair amount of black and white photography. I often produce landscapes, but also ordinary family snapshots, street photography and portraits. Figure 1 gives you an idea of a typical mix as shown in Lightroom's 'Library' module. So, while I expect that the results here may be useful for other people, I wouldn't be surprised if other photographers' statistics are somewhat different owing to a different photographic style and habits. With that caveat in mind, let's see how I analyzed this data.
Figure 1  My Library
The first task was to extract a list of all the file sizes in my Raspberry Pi 'favorites' folder. Python is a very nice language for this type of task by virtue of its 'os' module. Listing 1 shows a small program that extracts the sizes of all the JPEG files in a folder and writes them to a .csv file:
    import os
    import sys

    # Using r => Raw string so '\' are taken as literal characters and not escape characters
    mypath = r"C:\Users\Paulo\Pictures\2015\Raspberry Pi Frame"
    results = r"C:\Users\Paulo\Documents\Python\filesizes.csv"
    print('Analyzing files in:', mypath)

    # Select only jpegs
    photos = [f for f in os.listdir(mypath) if f.endswith('.jpg')]

    filesizes = []
    for filename in photos:
        filepath = os.path.join(mypath, filename)
        # Get size in KB (floor division, so rounded down):
        size = os.path.getsize(filepath) // 1024
        filesizes.append(size)

    # Write the results to a csv file, one size per line
    try:
        fhand = open(results, 'w')
    except OSError:
        print('File cannot be opened:', results)
        sys.exit(1)
    for size in filesizes:
        fhand.write("%d\n" % size)
    fhand.close()
    print("DONE")
So now that we have a nice csv file with all the file sizes (in KB) we can easily analyze it using either Excel or a statistical package like R. I chose the latter since it has very powerful plotting and statistics features. Here's a simple script in R that can import the csv file and plot the relevant histogram and statistics:
    # Plots the probability density function of the file sizes
    # that were extracted using the extract_file_sizes_rev1.py script
    library(ggplot2)

    df = read.csv(file = file.choose(), header = FALSE)
    df = as.data.frame(df)
    names(df) <- "x"

    # Plot combined histogram and density
    g <- ggplot(data = df, aes(x=x)) +
      geom_histogram(aes(y = ..density..), color="black", fill=NA) +
      geom_density(color="blue", weight = 4, fill = 'blue', alpha = 0.5)+
      ggtitle("File Size Distribution")+
      labs(x = "File Size [KB]", y = "Frequency")
    print(g)

    # Statistics
    summary(df)
    sd(df$x)
    sd(df$x)*sqrt(length(df$x))
So here is the summary of the data provided by R:
Min. : 120.0
1st Qu.: 468.0
Median : 647.5
Mean : 692.6
3rd Qu.: 864.0
Max. :2397.0
So the average (mean) size for one of my photos when formatted for HDTV is about 693 KB. As the summary results show, there's a bit of a spread in sizes. R can also compute the standard deviation:
> sd(df$x)
[1] 308.7243
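If you would rather stay in Python for this step, the same summary can be reproduced with the standard library. This is only a minimal sketch: the function names `load_sizes` and `summarize` are my own for illustration, and the demo list is made-up data, not my real file sizes. I am assuming the "inclusive" quantile method is the one that lines up with R's default quartiles, and `statistics.stdev` is the sample standard deviation, like R's `sd`.

```python
import statistics


def load_sizes(csv_path):
    """Read one integer size (KB) per line, as written by the extraction script."""
    with open(csv_path) as fh:
        return [int(line) for line in fh if line.strip()]


def summarize(sizes_kb):
    """Return min, quartiles, mean, max and sample SD for a list of sizes."""
    # method="inclusive" is assumed here to match R's default quartiles
    q1, median, q3 = statistics.quantiles(sizes_kb, n=4, method="inclusive")
    return {
        "min": min(sizes_kb),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(sizes_kb),
        "q3": q3,
        "max": max(sizes_kb),
        "sd": statistics.stdev(sizes_kb),  # sample SD (divides by n - 1)
    }


# Made-up sizes in KB, only to show the call; not my real data.
demo = [120, 450, 600, 700, 900, 2400]
for name, value in summarize(demo).items():
    print(f"{name:>6}: {value:.1f}")
```

To run it on real data, pass the result of `load_sizes` on the .csv file from Listing 1 to `summarize`.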
So we can see that the standard deviation is a bit under half of the mean (309/693 is about 0.45), which is a substantial spread. As we will see, this is not actually that bad when trying to estimate the size on disk for a large number of photos. More on this soon. For now let's have a look at the nice histogram and density plots ggplot produced:
Figure 2  File Size Distribution
It's interesting to me to observe that, while not normal (this distribution is definitely skewed), the distribution is also far from being uniform. Most photos fall in the 400 KB to 900 KB range (roughly).
At first glance, it may seem that estimating the disk space required to store a number of files (let's say N) is going to be very inaccurate since the file sizes vary so much. However, things actually get much better when you add the sizes of many photos together. If we treat the individual photo sizes as independent 'random variables', their means add up and so do their variances. On top of that, a beautiful result in statistics known as the "Central Limit Theorem" tells us that the sum of a large number of such variables is approximately normal (Gaussian). We can therefore estimate the mean and standard deviation of the total as follows:
TOTAL SIZE = N * Mean
TOTAL_SD = SD * Sqrt(N)
The thing that helps us here is that the 'uncertainty' term 'TOTAL_SD' increases only with the square root of the number of photos, while the total itself increases with N. Therefore, when we have a lot of photos, we can make a pretty good estimate of the required size on disk, since the error term shrinks as 1/Sqrt(N) relative to the total. Because the sum is approximately normal, an approximate 95% interval for the total space is plus or minus two standard deviations:
N * Mean +/- 2 * SD * Sqrt(N)
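The formula above is easy to wrap in a small Python helper. This is just a sketch: the function name `estimate_total_kb` is my own, and the 693 KB mean and 309 KB standard deviation are the values measured on my photos, so substitute your own if your library looks different.

```python
import math


def estimate_total_kb(n_photos, mean_kb, sd_kb):
    """Return (total, margin) in KB, where margin is 2 * SD * sqrt(N)."""
    total = n_photos * mean_kb
    margin = 2 * sd_kb * math.sqrt(n_photos)
    return total, margin


# Show how the margin shrinks relative to the total as N grows.
for n in (100, 1000, 5000):
    total, margin = estimate_total_kb(n, 693, 309)
    print(f"N={n:>5}: {total / 1024:7.0f} MB +/- {margin / 1024:5.1f} MB "
          f"({100 * margin / total:.1f}% of the total)")
```

The percentage in the last column is the part to watch: it falls by a factor of ten every time N grows by a factor of one hundred.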
So let's look at an example to make things clearer:
Say you have 10000 photos (formatted as described) and need to estimate the required space to store them on disk. Per the formula above we get:
TOTAL SIZE = 10000 * 693 KB +/- 2 * 309 KB * sqrt(10000)
TOTAL SIZE = 6 930 000 +/- 61 800 KB
or in Megabytes:
TOTAL SIZE = 6768 +/- 60.4 MB
This is about 6 930 000/1024/1024 = 6.6 GB. As you can see, the interval is actually pretty narrow (under 1% of the total) owing to the large number of photos that are being combined. In practice, you can simply estimate the size by multiplying the number of files by the 693 KB average size per file. Now on to Amazon to get me a larger SD card :)
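For the skeptical, here is a quick simulation to check that the two-standard-deviation interval really does capture the total about 95% of the time. It is a rough sketch under a loud assumption: I do not use my real data here, but draw photo sizes from a gamma distribution chosen only because it is skewed and can be given the same mean (693 KB) and standard deviation (309 KB). The function name `simulate_coverage` is mine.

```python
import math
import random


def simulate_coverage(n_photos, mean_kb, sd_kb, trials=400, seed=1):
    """Fraction of simulated totals that land inside N*Mean +/- 2*SD*sqrt(N)."""
    rng = random.Random(seed)
    # ASSUMPTION: sizes follow a gamma distribution with the given mean and SD.
    shape = (mean_kb / sd_kb) ** 2   # shape * scale    = mean
    scale = sd_kb ** 2 / mean_kb     # shape * scale^2  = variance
    total = n_photos * mean_kb
    margin = 2 * sd_kb * math.sqrt(n_photos)
    hits = 0
    for _ in range(trials):
        simulated = sum(rng.gammavariate(shape, scale) for _ in range(n_photos))
        if abs(simulated - total) <= margin:
            hits += 1
    return hits / trials


print("Coverage:", simulate_coverage(500, 693, 309))
```

If the normal approximation holds, the printed fraction should land somewhere near 0.95, even though the individual sizes are far from normally distributed.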
Comments, questions, suggestions? You can reach me at: contact (at sign) paulorenato (dot) com