Blend Images is a commercial stock photography collection of approximately 100,000 images produced by over 200 photographers. This project explores how Python can be used to create sub-collections by searching string attributes, generating separate metadata for those collections, and moving or copying JPEG files from the original directory to new folders.

Starting from one directory containing all the images and one CSV of metadata, I wrote Python scripts that identify specific subjects within the collection by searching the captions and keywords for specific strings. The matching rows of metadata are written to new CSVs, one per subject. A separate script then reads those CSVs and copies the listed JPEG files into separate folders.

Each image has geographic data stored as strings. These strings were converted to latitude-longitude coordinates using the Google Geocoding API, and the geographic locations of each sub-collection were visualized in Tableau. Mapping the images allows us to further explore and understand the range of assets for these subjects in the Blend collection.

Common practical uses for these scripts include separating the images from a specific photographer, credit name(s), or shoot location, and moving assets using only a list of filenames.

Project steps (using Python):

  • Begin with one directory of jpeg image files and one CSV of all metadata associated with the images
  • Search captions and keywords for specific strings tied to themes with geographic implications, then count the assets in each potential collection to make sure it contains enough photos to form a meaningful sub-collection
  • The 6 sub-collections are: beaches, cities, forests, landmarks, mountains, parks
  • Create separate CSVs of the metadata for each sub-collection
  • Physically move or copy the images into separate directories to see what the collections look like and how well the images match their location keywords
  • Any images not found during the copying process are identified and listed by filename
  • After collections have been separated and moved to different directories, combine CSVs into one CSV for the 6 collections
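
The steps above could be sketched in Python roughly as follows. This is a minimal illustration and not the project's actual scripts: the column names (`filename`, `caption`, `keywords`), the search terms for each sub-collection, and the output file names are all assumptions made for the example, and the real metadata CSV may be laid out differently.

```python
import csv
import shutil
from pathlib import Path

# Hypothetical column names; the real metadata CSV may use different headers.
FILENAME_COL = "filename"
TEXT_COLS = ("caption", "keywords")

# Hypothetical search terms for the six sub-collections.
COLLECTION_TERMS = {
    "beaches": ["beach"],
    "cities": ["city", "skyline"],
    "forests": ["forest"],
    "landmarks": ["landmark", "monument"],
    "mountains": ["mountain"],
    "parks": ["park"],
}


def read_rows(csv_path):
    """Load every metadata row as a dict keyed by column header."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def matches(row, terms):
    """Case-insensitive substring search across caption and keywords."""
    text = " ".join(row.get(col) or "" for col in TEXT_COLS).lower()
    return any(term.lower() in text for term in terms)


def write_rows(rows, csv_path, fieldnames):
    """Write the given rows to a new CSV with a header line."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def copy_images(rows, source_dir, dest_dir):
    """Copy each listed file; return the filenames that were not found."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    missing = []
    for row in rows:
        src = Path(source_dir) / row[FILENAME_COL]
        if src.is_file():
            shutil.copy2(src, dest_dir / src.name)
        else:
            missing.append(row[FILENAME_COL])
    return missing


def build_collections(metadata_csv, source_dir, out_dir):
    """Filter, count, write per-collection CSVs, copy files, combine CSVs."""
    rows = read_rows(metadata_csv)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    fieldnames = list(rows[0].keys()) if rows else [FILENAME_COL, *TEXT_COLS]
    combined, missing = [], {}
    for name, terms in COLLECTION_TERMS.items():
        subset = [row for row in rows if matches(row, terms)]
        print(f"{name}: {len(subset)} matching assets")
        write_rows(subset, out_dir / f"{name}.csv", fieldnames)
        missing[name] = copy_images(subset, source_dir, out_dir / name)
        combined.extend({**row, "collection": name} for row in subset)
    write_rows(combined, out_dir / "combined.csv", fieldnames + ["collection"])
    return missing
```

Plain substring matching is simple but loose: a term such as "park" also matches "parking", so the printed counts are worth checking before treating a match set as a sub-collection. To move files instead of copying them, `shutil.move` could be swapped in for `shutil.copy2`.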

Data Cleaning: OpenRefine was used to clean up the location data, which was stored in separate columns for city, state, and country. These values were originally supplied by the submitting photographers and contained a wide range of spelling errors and language discrepancies, as well as data entered in the wrong field.

Project Visualization (using Tableau):

  • Use the Google Geocoding API to obtain lat-long data for images from their string location names
  • Global map visualization of the image locations for each sub-collection, with an option to filter by collection to understand the breadth of each sub-collection as well as the breadth of all the geographic collections together. Points on the map are sized according to the number of filenames at the location and shaded according to collection subject.
  • On hover, the viewer can see the location (city, state, country) and the number of images represented at that location.
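
The geocoding step and the per-location counts that drive the point sizes could be prepared in Python along these lines before loading the data into Tableau. This is a sketch under stated assumptions: it calls the Geocoding API's JSON endpoint using only the standard library, the `city`, `state`, `country`, and `collection` column names are hypothetical, and `api_key` stands in for a key you would supply yourself.

```python
import json
import urllib.parse
import urllib.request
from collections import Counter

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(city, state, country, api_key):
    """Join the non-empty location parts into one address query."""
    address = ", ".join(part for part in (city, state, country) if part)
    query = urllib.parse.urlencode({"address": address, "key": api_key})
    return f"{GEOCODE_URL}?{query}"


def parse_lat_lng(payload):
    """Return (lat, lng) from a geocoding response, or None if no match."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


def geocode(city, state, country, api_key):
    """Look up one location over the network (requires a valid API key)."""
    url = build_geocode_url(city, state, country, api_key)
    with urllib.request.urlopen(url, timeout=10) as response:
        return parse_lat_lng(json.load(response))


def count_by_location(rows):
    """Count images per (collection, city, state, country) combination."""
    return Counter(
        (row["collection"], row["city"], row["state"], row["country"])
        for row in rows
    )
```

Because many images share a location, geocoding each distinct (city, state, country) combination once and joining the coordinates back onto the rows would keep the number of API requests small.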

Please Note: The images for this project are protected by copyright and the metadata is proprietary. The visualization may be used to explore this particular collection and the scripts may be used as a template for analyzing other image collections or similar projects.