What attributes does RubyGems’ Marshal file really contain?

RubyGems started with a single marshaled file called Marshal.4.8.Z containing an array of Gem::Specification objects, one for every gem that has been uploaded to the RubyGems directory. Since then RubyGems has gained new indexes to speed things up, but good old Marshal.4.8.Z is still around, carrying important information about gems from RubyGems.org. Are you interested in knowing what is and what is not in there, and in finally understanding why gem specification rails -r does not give you the licensing information even though it is part of the rails gemspec file?

When I started building gem-compare for comparing .gem files, it would always download the .gem files and compare against the downloaded copies. But since a lot can be accomplished by downloading just the specifications, I wanted gem-compare to download the full .gem files only when it compares something that is not included in the available gemspec. In other words, gem-compare needed to know which Gem::Specification attributes are available when querying RubyGems.

Here is a little Ruby script that collects all Gem::Specification attributes from Marshal.4.8.Z that have a non-empty value for at least one gem, which means they are present in the index:

# Check what kind of spec data is saved in http://rubygems.org/Marshal.4.8.Z
require 'rubygems'
require 'pp'

# wget http://rubygems.org/Marshal.4.8.Z if needed
# Read in binary mode: the file is compressed, marshaled data, not text.
gems = Marshal.load(Gem.inflate(File.binread("./Marshal.4.8.Z")))

SPEC_PARAMS = %w[ author authors name platform require_paths rubygems_version summary
                  license licenses bindir cert_chain description email executables
                  extensions homepage metadata post_install_message rdoc_options
                  required_ruby_version required_rubygems_version requirements
                  signing_key has_rdoc date version ].sort
SPEC_FILES_PARAMS = %w[ files test_files extra_rdoc_files ]
DEPENDENCY_PARAMS = %w[ dependencies ]
# Every attribute we want to check
PARAMS = SPEC_PARAMS + SPEC_FILES_PARAMS + DEPENDENCY_PARAMS

included = []

PARAMS.each do |param|
  # Each entry of the index is a [name, spec] pair
  gems.each do |gem|
    spec = gem[1]
    next unless spec.respond_to?(param)

    value = spec.send(param)
    next unless value
    next if value.respond_to?(:empty?) && value.empty?

    # One non-empty value is enough to count the attribute as present
    included << param
    break
  end
end

puts 'Included:'
pp included
puts 'Not included:'
pp(PARAMS - included)

And the results it gives:

Not included:

That means querying for the specification attributes that are not included won’t give you the right results. The file lists, for example, come back empty:

$ gem specification rails -r | grep files
extra_rdoc_files: []
files: []
test_files: []
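To see why an attribute such as files ends up as "not included", the presence test from the script can be tried on a couple of hand-built specifications instead of the downloaded index. This is only a minimal sketch: the method name present_attributes and the two sample gems (alpha and beta) are made up for illustration.

```ruby
require 'rubygems'

# Return the attributes that have a non-empty value in at least one spec.
# `index` is an array of [full_name, spec] pairs, like the entries the
# script above reads from the marshaled index.
def present_attributes(index, params)
  params.select do |param|
    index.any? do |_full_name, spec|
      next false unless spec.respond_to?(param)

      value = spec.public_send(param)
      next false unless value

      # Empty strings and empty arrays count as "no value"
      !(value.respond_to?(:empty?) && value.empty?)
    end
  end
end

# Two illustrative specs; neither sets files or a license.
alpha = Gem::Specification.new do |s|
  s.name    = 'alpha'
  s.version = '1.0.0'
  s.summary = 'First sample gem'
end

beta = Gem::Specification.new do |s|
  s.name     = 'beta'
  s.version  = '2.0.0'
  s.summary  = 'Second sample gem'
  s.homepage = 'https://example.com'
end

index   = [['alpha-1.0.0', alpha], ['beta-2.0.0', beta]]
present = present_attributes(index, %w[name summary homepage files license])

p present
```

Here homepage counts as present because one of the two specs sets it, while files and license are reported as missing because no spec carries a value for them.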

Of course, you may notice that Gem::SpecFetcher, which is used in this example, now actually queries the /specs.4.8.gz index and then downloads separate gemspec files from /quick/Marshal.4.8/GEM-VERSION.gemspec.rz. If you look carefully, though, you will find that these gemspec files contain the same information as the old big Marshal.4.8, which is easier to go through at once.
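As a rough illustration of the per-gem files mentioned above, the following sketch builds the path for one gem version and decodes the body of such a file. It assumes the .gemspec.rz body is zlib-deflated marshaled data, the same encoding the script above handles with Gem.inflate. The helper names quick_gemspec_path and load_quick_gemspec are invented for this example and are not part of RubyGems, and actually downloading the bytes is left out.

```ruby
require 'rubygems'
require 'zlib'

# Build the index path for one gem version, following the
# /quick/Marshal.4.8/GEM-VERSION.gemspec.rz pattern.
# Gem.marshal_version supplies the "4.8" part.
def quick_gemspec_path(name, version)
  "/quick/Marshal.#{Gem.marshal_version}/#{name}-#{version}.gemspec.rz"
end

# Decode the body of a .gemspec.rz file.
# Assumption: the body is zlib-deflated, marshaled data.
# Only unmarshal data from a source you trust.
def load_quick_gemspec(compressed)
  Marshal.load(Zlib::Inflate.inflate(compressed))
end

puts quick_gemspec_path('rails', '4.0.0')
```

With the bytes of a downloaded file in hand, load_quick_gemspec(bytes) should return the Gem::Specification, which can then be inspected with the same attribute checks as the big index.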

That is why gem compare rails 3.0.0 4.0.0 --runtime won’t download the full .gem files, but when you run gem compare rails 3.0.0 4.0.0 -p 'license', gem-compare needs to download the .gem files to give you accurate information.
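The decision described here can be sketched roughly as follows. This is not gem-compare's actual code: the constant NOT_IN_INDEX and the method needs_full_gem? are hypothetical, and the list holds only the attributes this post shows to be missing from the index, so it is not exhaustive.

```ruby
# Attributes this post shows to be unavailable when querying RubyGems.
# Illustrative and incomplete; a real tool would use the full
# "Not included" list produced by the script.
NOT_IN_INDEX = %w[files test_files extra_rdoc_files license].freeze

# Download the full .gem files only when at least one requested
# attribute cannot be answered from the remote specification.
def needs_full_gem?(requested_params)
  requested_params.any? { |param| NOT_IN_INDEX.include?(param) }
end

puts needs_full_gem?(%w[license])       # the .gem files are required
puts needs_full_gem?(%w[name version])  # the remote specification is enough
```

Keeping the list in one place means the download decision stays a simple lookup, and the list can be regenerated whenever the index contents change.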

This helped me understand why querying RubyGems didn’t show me the information I expected, and I hope it cleared some things up for you too.

