Appreciating Art through the Lens of Data Science
The three of us on Team Splatoon have partnered with Budget Collector to analyze the relationships between the dominant and secondary colors in artwork and its region of origin, art period, and style. We will accomplish this by applying various color quantization techniques that extract key color information from each piece of art. By the end of the semester, our goal is to have thoroughly studied the nuances between artwork and color and to illustrate their evolution through history with an interactive time series visualization.
Introduction to Color
The three of us on Team Splatoon were tasked with studying any existing relationships between dominant/secondary colors in artwork and their respective origins, time periods, and/or styles. We began by contemplating the actual definitions of “dominant” and “secondary” colors and found that there was no straightforward answer. Our first instinct was to simply classify the most frequently occurring color as dominant and the second most frequent as secondary. Unfortunately, this approach did not always capture how humans perceive color in artwork. For example, by that definition, black would be the most dominant color in Caravaggio’s “Saint Jerome Writing” below.
This not only fails to capture the essence of the painting, but it also contradicts many expert opinions that reds and browns are the most dominant colors in this piece.
We concluded that we needed to investigate other approaches that better capture the nuances of color and art composition. One possibility is to identify the background and subject matter first and penalize the background color. Of course, this would be less useful for abstract artwork that lacks a distinct main subject (see example below).
Other options we will explore over the next 14 weeks include evaluating cluster variances and sizes, outlier detection, and possibly edge detection to mimic how humans group colors together.
Before we could start studying the relationships between colors and other factors, we first had to extract the color information from every piece of artwork. This is where the concept of “color quantization” comes in.
Color quantization is an image processing method in which the number of distinct colors in an image is reduced while the result remains visually close to the original. An image on a computer can be represented as a three-dimensional array: two dimensions for the pixel grid (height and width) and a third for the color channels.
The three color channels are red, green, and blue, also known as the RGB color model. Each pixel is composed of a varying degree of intensity of each RGB color, and together they form the picture we see on the screen.
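As a quick illustration, here is a minimal sketch (using NumPy, with a made-up 2×2 image) of what that three-dimensional array looks like in code:

```python
import numpy as np

# A tiny 2x2 "image": dimensions are height x width x 3 (R, G, B),
# with one unsigned byte (0-255) of intensity per channel.
image = np.array([
    [[255, 0, 0], [0, 255, 0]],     # top row: a red pixel, a green pixel
    [[0, 0, 255], [255, 255, 255]], # bottom row: a blue pixel, a white pixel
], dtype=np.uint8)

height, width, channels = image.shape   # (2, 2, 3)
red_channel = image[:, :, 0]            # the red intensities on their own
```

Slicing out one channel like this is exactly what quantization algorithms do when they compare the ranges of red, green, and blue.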
A common color quantization method is the median cut algorithm, invented by Paul Heckbert in 1979. The pixels of the image are first separated into the three color channels, and the channel with the greatest range is identified. The pixels are then sorted along that channel and split into two equal halves at the median. This process repeats on each half until the desired number of buckets is reached, and the pixels in each bucket are averaged to produce the bucket’s representative color.
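The steps above can be sketched in a few lines of NumPy. This is a simplified illustration run on synthetic pixel data, not the exact implementation we used:

```python
import numpy as np

def median_cut(pixels, depth):
    """Recursively split an (N, 3) array of RGB pixels into 2**depth buckets
    and return one averaged color per bucket."""
    if depth == 0:
        return [pixels.mean(axis=0).astype(int)]
    # Find the channel (R, G, or B) with the greatest range.
    ranges = pixels.max(axis=0) - pixels.min(axis=0)
    widest = int(np.argmax(ranges))
    # Sort the pixels along that channel and split at the median.
    pixels = pixels[pixels[:, widest].argsort()]
    mid = len(pixels) // 2
    return (median_cut(pixels[:mid], depth - 1) +
            median_cut(pixels[mid:], depth - 1))

rng = np.random.default_rng(0)
fake_pixels = rng.integers(0, 256, size=(1024, 3))  # stand-in for a real image
palette = median_cut(fake_pixels, depth=3)          # 2**3 = 8 representative colors
```

For a real image, `fake_pixels` would be the image array reshaped to `(-1, 3)` so every pixel becomes one row.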
Another widely used method is the K-means clustering algorithm. K-means works by placing centroids in the dataset and assigning each data point to the centroid closest to it. Each centroid is then recomputed as the mean of its assigned points, and the two steps repeat until the sum of squared distances between the data points and their assigned centroids stops decreasing. This metric, known as the within-cluster sum of squares, is an effective way to measure the variability of the data points within each cluster. The result is a set of centroids that represent the major colors of the image. The disadvantage of K-means is that it can be computationally expensive, and its convergence speed depends on the quality of the centroid initialization.
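To make the assign-and-recompute loop concrete, here is a minimal pure-NumPy sketch of the algorithm on synthetic pixel data (in practice a library implementation such as scikit-learn’s `KMeans` is the usual choice):

```python
import numpy as np

def kmeans_palette(pixels, k, iters=50, seed=0):
    """Lloyd's algorithm: assign each pixel to its nearest centroid,
    then recompute each centroid as the mean of its assigned pixels."""
    rng = np.random.default_rng(seed)
    centroids = pixels[rng.choice(len(pixels), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Squared distance from every pixel to every centroid.
        dists = ((pixels[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        updated = centroids.copy()
        for j in range(k):
            members = pixels[labels == j]
            if len(members):              # keep the old centroid if a cluster empties
                updated[j] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break                         # converged: centroids stopped moving
        centroids = updated
    return centroids, labels

rng = np.random.default_rng(1)
pixels = rng.integers(0, 256, size=(2000, 3)).astype(float)
palette, labels = kmeans_palette(pixels, k=5)
# Cluster sizes give each color's "weight" in the image.
weights = np.bincount(labels, minlength=5) / len(pixels)
```

The `weights` vector is what makes K-means attractive for our problem: it directly ranks how much of the image each palette color covers.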
Since most color quantization methods are computationally heavy, we decided to limit our initial attempts to the Budget Collector dataset, which contains only 400 images. While this limits the sample size, it allows us to experiment more quickly with potential relationships between color and artwork metadata. We intend to expand our analysis to the larger datasets available once we have a path forward. Furthermore, the Budget Collector dataset already contains valuable details on each piece of art that we can use in our analysis.
To download the data from AirTable, we exported the table as a .csv; the images themselves were stored as URLs in a column labeled “Images”. Our code extracted the URLs from the table and used Python’s requests library to download each image, using its UID as the file name for easy reference.
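A condensed sketch of that pipeline is below. Only the “Images” column name comes from the actual export; the “UID” column name and the sample rows are assumptions for illustration, and the download helper is defined but not invoked here:

```python
import csv
import io
import requests

def extract_image_urls(csv_text):
    """Map each artwork's UID to the URL stored in the 'Images' column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["UID"]: row["Images"] for row in reader}

def download_image(url, path):
    """Fetch one image over HTTP and save it under a UID-based file name."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # surface bad URLs instead of saving junk
    with open(path, "wb") as f:
        f.write(response.content)

# Hypothetical two-row export mimicking the AirTable structure.
sample_csv = (
    "UID,Images\n"
    "art001,https://example.com/a.jpg\n"
    "art002,https://example.com/b.jpg\n"
)
urls = extract_image_urls(sample_csv)
# Each image would then be saved as e.g. "art001.jpg" via download_image(...).
```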
The raw data table had several typographical errors among the key characteristics, so we standardized the spelling and naming conventions throughout the dataset. We also decided to exclude sculptures from our analysis, as they are typically monochromatic and would not add meaningful insight.
Initial Exploratory Data Analysis/Relationships
We started our analysis with a broad overview of the Budget Collector dataset.
Figure 5 indicates our dataset is skewed toward 20th-century artworks and toward Europe and North America. These imbalances may need to be corrected in subsequent analysis to produce usable results for visualization.
We then compared the performance of the K-means and median cut algorithms. Figure 6 shows the results of K-means and the importance of selecting the right number of clusters: increasing from k = 2 to k = 5 yields a noticeably more representative color palette for the original image. Furthermore, we can easily compare the weights of the colors in the image by measuring the size of each cluster.
The median cut algorithm yielded similar results in Figure 7, and the code ran significantly faster. However, this method has a significant disadvantage: it cannot inherently rank color dominance. Each cut splits the pixels into equally sized buckets, so every resulting color is weighted the same. By contrast, a K-means cluster’s size is proportional to the number of pixels assigned to that color.
So far, K-means has proven useful because it lets us “weigh” the colors of an image by cluster size. We suspect it may perform better for abstract paintings but worse for portraits or pieces with a focused subject, where dominant color is more subjective.
Our goal for the next couple of weeks is to explore this theory and experiment with algorithms that either re-weight some of these colors or identify pops of color via outlier detection.
Stay tuned for part 2 of our blog, where we will continue trying to replicate the human experience of art and explore some storytelling time series visualizations.
To be continued… Meet the team!
My name is Cindy Tran and I’m from Houston, Texas. My background is in chemical engineering, and I am currently an optimization analyst for a petrochemical company. I chose to pursue a Master’s in Analytics because I think the knowledge could be leveraged well in the petrochemical industry. I have been loving the program so far and how applicable it is to my current role. In my free time I love watching movies and working out.
My name is Michelle and I’m based in San Diego. I’m a software developer at a fintech company with a background in Accounting and Economics. I started pursuing my Data Analytics degree back in 2020, and so far, it has been one of the best career decisions I have made! In my free time I enjoy being outdoors and surrounded by nature.
I’m Matthew, living a couple of hours south of Atlanta, Georgia. I am an aerospace engineer working in the defense industry. I have enjoyed the OMSA program so far, and I am looking forward to leveraging the knowledge against real-world problems to gain actionable predictions. I enjoy time with my family, working outdoors, and playing ultimate frisbee.