Appreciating Art through the Lens of Data Science
In part 1 of this blog, Team Splatoon introduced the Georgia Tech Spring 2023 Practicum Project, where we partner with Budget Collector to determine relationships between colors and their regions, time periods, or other data of interest. We experimented with color quantization via k-means and median cut, and found that while these methods identified the most dominant colors, they did not accurately reflect how humans experience the art. In part 2, we explore how we rectified the issues from our initial attempt at color extraction and discuss some potential data visualization options.
Mimicking the Human Art Perspective
We found our initial attempts at color extraction fell short because our definition of “dominant” did not match how people truly experience art. For example, in blog #1, we illustrated how our k-means algorithm chose black as the most dominant color for Caravaggio’s “Saint Jerome Writing” despite many art experts agreeing that the most prominent colors should be reds and browns.
We considered several methods to improve the odds that our algorithm identifies dominant colors the way a human observer would. One option was to evaluate the within-cluster variance from the k-means algorithm to help identify the background color of an image. The idea is that cluster groups that are more uniform are more likely to be background colors.
Our idea is showcased in Figure 1 below, where a k-means algorithm was applied to Robert Falk’s Portret Van Elizabeth Sergejevna Potehinoj. The bar plot presents the within-cluster variance for each of the eight cluster centers chosen by the algorithm. The background of the image consists of green hues, which match the clusters with the lowest within-cluster variance. Unfortunately, this technique falters on more complex artwork or artwork that lacks a main subject.
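The within-cluster-variance idea described above can be sketched in a few lines. The snippet below uses scikit-learn and synthetic pixel data (three clusters rather than the eight in Figure 1, and made-up RGB values), so the numbers are illustrative only.

```python
# Sketch of the within-cluster-variance idea: clusters of pixel colors that
# are nearly uniform (low variance) are candidate background colors.
# The pixel data here is synthetic, not from a real painting.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Synthetic pixels: a tight green "background" cluster and two looser
# "subject" clusters (reds and browns).
background = rng.normal(loc=[60, 140, 70], scale=2.0, size=(3000, 3))
subject_a = rng.normal(loc=[170, 40, 35], scale=25.0, size=(1000, 3))
subject_b = rng.normal(loc=[120, 80, 50], scale=25.0, size=(1000, 3))
pixels = np.clip(np.vstack([background, subject_a, subject_b]), 0, 255)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(pixels)

# Within-cluster variance: mean squared distance of each member pixel to
# its cluster center, computed per cluster.
variances = []
for k in range(3):
    members = pixels[kmeans.labels_ == k]
    variances.append(((members - kmeans.cluster_centers_[k]) ** 2).sum(axis=1).mean())

# The most uniform cluster is the background candidate.
background_cluster = int(np.argmin(variances))
print("within-cluster variances:", np.round(variances, 1))
print("likely background center:", kmeans.cluster_centers_[background_cluster].round(0))
```

Running this recovers the tight green cluster as the lowest-variance (background) candidate, mirroring the bar plot in Figure 1.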
These shortcomings led us to three requirements for our color extraction. It should:
- Exclude the background if an art piece has a main subject (e.g., Caravaggio’s Saint Jerome Writing), but include it if the piece is more abstract or contains multiple subjects
- Simplify the color palette, because the human eye cannot discern color differences as finely as a computer
- Identify colors that “pop”
While researching how best to tweak our algorithm, we stumbled across a couple of existing Python packages that satisfied our requirements. Most notable is Colorific, a package from 99designs.com.
In their blog, the authors discuss challenges in extracting colors from brand logos, eventually encountering similar issues to ours. Namely, logo color palettes would be misrepresented due to overwhelming background colors, the resulting palette would not match human perception of the colors, and colors meant to be visually striking were being washed out algorithmically by other, less interesting colors.
Colorific solves these issues by:
- Identifying the background color from the corner pixel values. If a color in the image matches the corner pixel colors, it is treated as a background color and excluded.
- Using the Pillow package’s Image.convert() function to reduce the image to a smaller overall palette, then aggregating similar colors with a distance metric. This more accurately mimics human perception, since the human eye is less sensitive to small color differences than a computer.
- Applying a saturation threshold to help identify striking colors meant to draw the viewer’s attention.
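The three steps above can be re-created in miniature with Pillow alone. This is only a sketch of the ideas, not Colorific’s actual implementation, and the test image (a white canvas with a red square “subject”) is hypothetical.

```python
# Minimal re-creation of the three Colorific ideas using Pillow:
# corner-pixel background detection, palette reduction via Image.convert(),
# and a saturation threshold. The image below is a made-up example.
import colorsys
from PIL import Image

# Hypothetical test image: white background with a red square "subject".
img = Image.new("RGB", (100, 100), (255, 255, 255))
for x in range(30, 70):
    for y in range(30, 70):
        img.putpixel((x, y), (200, 30, 30))

# 1. Background colors: sample the four corner pixels.
w, h = img.size
corners = {img.getpixel(p) for p in [(0, 0), (w - 1, 0), (0, h - 1), (w - 1, h - 1)]}

# 2. Reduce to a smaller palette with Image.convert() (adaptive palette),
#    then read back the surviving colors with their pixel counts.
reduced = img.convert("P", palette=Image.ADAPTIVE, colors=16).convert("RGB")
counts = sorted(reduced.getcolors(w * h), reverse=True)  # (count, rgb) pairs

# 3. Drop background colors and keep only sufficiently saturated colors.
MIN_SATURATION = 0.05
palette = []
for count, rgb in counts:
    if rgb in corners:
        continue  # background color, excluded
    _, s, _ = colorsys.rgb_to_hsv(*(c / 255 for c in rgb))
    if s >= MIN_SATURATION:
        palette.append(rgb)

print("background colors:", corners)
print("palette:", palette)
```

On this toy image, the white background is detected from the corners and excluded, leaving the saturated red as the palette.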
For more information on Colorific, please refer to their GitHub.
Our experiments suggest that Colorific addresses most of our concerns. Typical results are displayed in Figure 3 below.
We decided to proceed with Colorific, but we still need to tune the algorithm for our data set since we are working with a variety of different art styles.
Colorific has six tuning parameters:
- N_QUANTIZED – Reduced palette size
- MIN_DISTANCE – The minimum distance to consider two colors different
- MIN_PROMINENCE – The minimum share of the image a color must occupy to be included in the palette
- MIN_SATURATION – Saturation threshold
- MAX_COLORS – Maximum number of colors to include in palette
- BACKGROUND_PROMINENCE – Level of prominence indicating a background color
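To build intuition for MIN_DISTANCE in particular, here is a simplified stand-in for the “aggregate similar colors” step: greedily drop any color that sits within MIN_DISTANCE of an already-kept color. Colorific measures distance in a perceptual color space; plain RGB Euclidean distance and the sample colors below are simplifications for illustration.

```python
# Simplified palette merging controlled by a min_distance parameter.
# Real perceptual color distance is more involved; RGB Euclidean
# distance stands in here.
import math

def merge_palette(colors, min_distance):
    """Keep colors in prominence order, dropping any color that falls
    within min_distance of a color already kept."""
    kept = []
    for c in colors:
        if all(math.dist(c, k) >= min_distance for k in kept):
            kept.append(c)
    return kept

# Colors ordered by prominence: two near-identical blues and one red.
colors = [(30, 40, 200), (32, 41, 198), (210, 30, 25)]

# A larger threshold merges the two blues into one palette entry;
# a tiny threshold keeps every slight variation as a separate color.
print(merge_palette(colors, min_distance=10))
print(merge_palette(colors, min_distance=2))
```

This shows why the parameter matters: the threshold directly decides how many near-duplicate shades survive into the final palette.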
In our next blog post, we will discuss how to approach tuning Colorific. This step is critical, as minor parameter changes can drastically alter the resulting palettes. For example, Figure 4 below demonstrates how changing the MIN_DISTANCE parameter from 10 to 2 worsens the palette from an arguably fair representation to a nearly monochromatic array of blacks and grays.
Visualizing the data
The second requirement for our practicum is to develop an interactive time-series visualization of our data set. Data visualization is useful in every stage of analytics – from data exploration, where it lets us detect correlations and patterns, to explaining and communicating the results of machine learning algorithms so the models can be better tuned. It uncovers patterns, trends, and outliers in the data and helps draw meaningful conclusions graphically.
So, what is time-series data, and how is it visualized? Time-series data is data that changes over time. There are two main types – measurements gathered at regular time intervals and measurements gathered at irregular time intervals. Examples of regularly sampled data include the temperature of a region, chemical levels in water, etc. Because this data is collected at fixed intervals, a line graph gives a clear picture of trends and patterns, and enables forecasting and anomaly detection. When there are multiple series over the same timeline, multiple lines can be added to the same graph for easy comparison.
On the other hand, time-series data collected at irregular time intervals is usually not normally distributed, meaning the values do not follow a symmetrical bell-shaped distribution; the data may be skewed to one side or have multiple peaks. With non-normally distributed data, it is common practice to transform the data during analysis so that standard statistical tests can be performed. To visualize this type of time-series data, scatter plots and bar charts are better options, since the datapoints are less related to each other. Scatter plots are also useful for identifying outliers and clusters.
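The transformation step mentioned above can be illustrated quickly: right-skewed data is often log-transformed so it looks closer to normal. The snippet below uses synthetic lognormal samples and computes sample skewness directly with NumPy.

```python
# Illustration of transforming skewed data before standard statistical
# tests: a log-transform pulls in the long right tail of lognormal data.
import numpy as np

def skewness(x):
    """Sample skewness: third central moment over the cubed std dev."""
    x = np.asarray(x, dtype=float)
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

rng = np.random.default_rng(7)
data = rng.lognormal(mean=0.0, sigma=1.0, size=5000)  # strongly right-skewed

before = skewness(data)
after = skewness(np.log(data))  # log of lognormal data is normal

print(f"skewness before: {before:.2f}, after: {after:.2f}")
```

The skewness drops from a large positive value to roughly zero after the transform, which is why the transformed data is safe to feed into tests that assume normality.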
For our project, we are utilizing the dataset provided by Budget Collector and identifying the top two colors of each artwork. As described above, this is time-series data collected at irregular intervals, since the number of artworks in the dataset is not consistent across the timeline. A line graph would not be appropriate, as it would not highlight how colors change over time.
To start exploring potential visualization options, we decided to use Dash Plotly as our frontend framework, as it integrates nicely with Python, our backend coding language. We went with a scatter plot using the time period of each picture as the x-axis. Each picture is plotted as dots whose colors represent its dominant and secondary colors. To distinguish between the two, we vary the size of the dots: dominant colors are represented by larger dots, and secondary colors by smaller ones. With all these components together, we can visually follow the trend of color changes over time.
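The data preparation behind such a scatter plot might look like the sketch below: one marker per (artwork, color) pair, with dominant colors given a larger marker size. The artwork records and marker sizes here are dummy placeholders, not values from the actual dataset.

```python
# Assembling scatter-plot markers from per-artwork color data:
# each artwork contributes two dots at its year, a large one for the
# dominant color and a small one for the secondary color.
# All records below are dummy data for illustration.
artworks = [
    {"title": "A", "year": 1602, "dominant": "#5a3a22", "secondary": "#8c1c13"},
    {"title": "B", "year": 1889, "dominant": "#1d3f8f", "secondary": "#e8c547"},
    {"title": "C", "year": 1910, "dominant": "#2e6b34", "secondary": "#d9d9d9"},
]

DOMINANT_SIZE, SECONDARY_SIZE = 14, 7  # hypothetical marker sizes

markers = []
for art in artworks:
    markers.append({"x": art["year"], "color": art["dominant"],
                    "size": DOMINANT_SIZE, "title": art["title"]})
    markers.append({"x": art["year"], "color": art["secondary"],
                    "size": SECONDARY_SIZE, "title": art["title"]})

# These lists plug straight into a Plotly scatter trace, e.g.:
#   go.Scatter(x=[m["x"] for m in markers], mode="markers",
#              marker=dict(color=[m["color"] for m in markers],
#                          size=[m["size"] for m in markers]))
print(markers[:2])
```

The per-marker `title` field is what a hover tooltip could surface alongside the artwork image.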
Below is a rough example of what we have in mind, mocked up using dummy data as a proof of concept, using Dash Plotly as frontend.
Users will be able to interact with the visualization via dropdown menus that filter the data set by features such as region, year, or style. Users will also be able to hover over each data point to showcase the respective image and any relevant information.
As mentioned above, we need to tune the Colorific package for our application. This will require us to have a set of training data with color labels designating dominant and secondary.
We will also discuss how we will evaluate the accuracy of our machine learning models’ output, along with our testing plans.
After tuning is complete, we will finalize our color extraction and begin our systematic analysis to try to derive meaning between colors in paintings and their respective origins and styles.
For the visualization, we are also actively brainstorming more ways for users to explore the colors in the collection, such as viewing color palettes and general clusterings of images.
So, stay tuned for the next blog where we will share our next version of the visualization!