Updated 09 May 2019 by Max
If you’ve got that many URLs, indexation alone won’t give you a clear view of what Google has actually seen on your site. Assuming you’ve got canonicalisation set up (please tell me you’ve got canonicalisation!), there may be URLs that are correctly absent from the index because, after all, you’re telling Google not to index them. You still want Google to visit and process those pages in case they’re passing on any link equity.
I’m assuming you’re trying to get a map of the URLs Googlebot isn’t hitting? The steps I’d go for are:
- Get a server log, and a copy of all the URLs on the site (at that scale, probably from a database export. If there’s a neatly available table with all the content in it, that’d be ideal, though this depends on how your site was built).
- Use grep to pull the GET requests from Googlebot out of the server log. This gives you a map of the URLs Google has hit. Portent did a good post on this: https://www.portent.com/blog/seo/get-geeky-grep-seo-tool.htm
- Compare the two lists. At that scale you probably won’t get away with a spreadsheet lookup; more likely you’ll need to load them into two database tables and use a bit of SQL to find the URLs that don’t match.
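A minimal end-to-end sketch of those steps, with a tiny made-up log and URL list inline so the shape is visible. In practice `access.log` would be your real server log and `all-urls.txt` your database export (the filenames here are just placeholders), and I'm using SQLite as a stand-in for whatever database you'd actually load the tables into. One caveat: anyone can put "Googlebot" in a user-agent string, so a serious audit would also verify the requesting IPs with a reverse DNS lookup before trusting them.

```shell
# Made-up inputs so the pipeline has something to chew on.
# In practice: access.log is your real server log, and
# all-urls.txt comes from your database export.
cat > access.log <<'EOF'
66.249.66.1 - - [09/May/2019:10:00:00 +0000] "GET /page-1 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.66.1 - - [09/May/2019:10:00:01 +0000] "POST /contact HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
203.0.113.7 - - [09/May/2019:10:00:02 +0000] "GET /page-2 HTTP/1.1" 200 512 "-" "Mozilla/5.0"
EOF
printf '/page-1\n/page-2\n/page-3\n' > all-urls.txt

# Step 1: Googlebot GET requests -> de-duplicated list of hit URLs.
# Assumes a combined-format (Apache/Nginx) log, where the request
# path is the second space-separated token inside the first quoted
# section of each line.
grep 'Googlebot' access.log \
  | grep '"GET ' \
  | awk -F'"' '{split($2, req, " "); print req[2]}' \
  | sort -u > googlebot-urls.txt

# Step 2: load both lists into database tables and pull out the site
# URLs Googlebot never requested (assumes one URL per line, with no
# "|" characters, since that's SQLite's default import separator).
rm -f crawl.db
sqlite3 crawl.db <<'EOF'
CREATE TABLE site_urls(url TEXT PRIMARY KEY);
CREATE TABLE crawled_urls(url TEXT PRIMARY KEY);
.import all-urls.txt site_urls
.import googlebot-urls.txt crawled_urls
.output uncrawled-urls.txt
SELECT url FROM site_urls
WHERE url NOT IN (SELECT url FROM crawled_urls);
EOF
```

With the toy data above, `uncrawled-urls.txt` ends up holding `/page-2` and `/page-3`. If both lists happen to fit comfortably on one machine, `comm -23 <(sort all-urls.txt) <(sort googlebot-urls.txt)` does the same comparison without a database at all.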