Job Spidering

Burning Glass’s system for aggregating and reporting on online job postings is designed to populate a comprehensive database of real-time job opportunity information in a manner that provides as accurate a representation as possible of the full scope of advertised labor demand.

Our proprietary data collection program has been developed and refined over nearly a decade. In fact, the U.S. Patent and Trademark Office recognized the innovativeness of the approach by granting a patent on “determining whether a web site contains employment data” and then “formatting, parsing and storing the employment data and corresponding URL into a database”. It is the most extensively tested system, identifies jobs from the greatest number of websites with the highest level of frequency, and, consequently, generates a larger database of current job opportunities than any other in the industry.

The two critical elements of online job aggregation are data collection (intelligent “spidering” programs that search the Internet for job listings) and deduplication (ensuring the integrity and consistency of the data set according to client-configured parameters).

Data Collection/Spidering

Burning Glass uses spider technology to identify viable websites with employment-related content and to search those sites for job postings on a regular schedule. We maintain two kinds of spiders: 1) those that continually monitor, or scout, websites to identify those that include employment opportunities; and 2) those that continually spider a master list of websites and extract employment-related information from them.
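As a rough illustration of the two-stage idea only, and not Burning Glass's proprietary implementation, the split between scouting and extraction can be sketched in Python. Everything named here (JOB_HINTS, looks_like_job_site, MasterList, the min_hits cutoff and the fetch callback) is a hypothetical stand-in, and a production scout would rely on a far richer classifier than keyword matching.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

# Hypothetical keyword hints; chosen only to make the sketch runnable.
JOB_HINTS = ("careers", "job openings", "apply now", "we are hiring")


def looks_like_job_site(page_text: str, min_hits: int = 2) -> bool:
    """Crude scout test: does the page mention enough job-related phrases?"""
    text = page_text.lower()
    hits = sum(1 for hint in JOB_HINTS if hint in text)
    return hits >= min_hits


@dataclass
class MasterList:
    sites: Set[str] = field(default_factory=set)

    def scout(self, candidates: Dict[str, str]) -> List[str]:
        """Stage 1: add candidate sites whose text looks job-related."""
        added = []
        for url, text in candidates.items():
            if url not in self.sites and looks_like_job_site(text):
                self.sites.add(url)
                added.append(url)
        return added

    def extract(self, fetch: Callable[[str], str]) -> Dict[str, str]:
        """Stage 2: visit every site on the master list and collect its content."""
        return {url: fetch(url) for url in sorted(self.sites)}
```

Keeping the two stages separate means the master list can grow from scouting without changing how extraction runs over it.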

The use of “scout” spiders is an important distinction between Burning Glass and other job data collectors. Other solutions rely on limited, manually collected lists of job boards. As a result, they search fewer sites and update their master lists only occasionally, as third-party data is released. By contrast, Burning Glass recognizes new sites almost as soon as they are launched, and we add to our master list more often. We also add new spiders whenever a customer notifies us of a new website or our dedicated team of researchers finds one.

This sophisticated, two-step process enables Burning Glass to retrieve job listings from a much broader range of sources, including job boards, government agencies, educational institutions, and thousands of employers of all sizes, locations, and industries. We currently collect data from more than 17,000 sites.

It is especially significant that our spiders visit private and public employer websites directly. This enables Burning Glass to aggregate the most representative jobs database in the industry, because it includes every size of employer, from small to large. Other solutions’ exclusive reliance on job boards means that their datasets are biased against jobs posted by small- and mid-sized businesses (the primary source of economic and job growth), because the cost of job board advertisements can prove prohibitive to many employers. The same bias affects sources that aggregate jobs predominantly from large corporations. While retrieving content from a wider variety of sources does increase the burden on deduplication routines (see below), Burning Glass believes that eliminating entire categories of sources wholesale, as others do by relying primarily on data from a handful of secondary sources instead of visiting primary sources themselves, is not a statistically valid method for assuring data accuracy.

In order to ensure that our database represents the most up-to-date view of the labor market, Burning Glass’s spiders check each site at least once per week. Sites that add new postings most frequently are checked daily.
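The revisit rule described above (every site at least weekly, the busiest sites daily) can be sketched as a simple scheduling function. This is a minimal illustration under stated assumptions: the text does not say what counts as adding postings "most frequently", so DAILY_THRESHOLD is an invented cutoff, and the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

WEEKLY = timedelta(days=7)
DAILY = timedelta(days=1)
# Assumption: sites averaging at least this many new postings per day are "busy".
DAILY_THRESHOLD = 5.0


def crawl_interval(avg_new_postings_per_day: float) -> timedelta:
    """Busy sites are revisited daily; every other site weekly."""
    return DAILY if avg_new_postings_per_day >= DAILY_THRESHOLD else WEEKLY


def next_visit(last_visit: datetime, avg_new_postings_per_day: float) -> datetime:
    return last_visit + crawl_interval(avg_new_postings_per_day)


def due_sites(sites: Dict[str, Tuple[datetime, float]], now: datetime) -> List[str]:
    """sites maps URL -> (last visit, average new postings per day)."""
    return sorted(
        url for url, (last, rate) in sites.items() if next_visit(last, rate) <= now
    )
```

Because the weekly interval is the fallback, no site can go longer than seven days between visits, whatever its posting rate.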

Deduplication

Because Burning Glass’s database is a full reflection of job listings posted across the Internet, robust processes are required to identify and remove duplicate listings.

Rooting out duplicates is a highly sensitive task because there can be substantial ambiguity as to what constitutes a duplicate record. For example, if an employer posts a vacancy on a job board, fills it, and then advertises an identical vacancy the following week, is this a second opening or a duplicate?

Burning Glass applies a unique approach to deduplication; more than half of all the job postings we collect are identified as duplicates and removed. This is possible because our advanced parsing engine extracts and normalizes an unparalleled number of data elements from each job listing, such as job title, job ID, source, posting date, employer name, location, and job description text. Each element can function as an individual duplicate screen or work in concert with other variables.
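To make the idea of normalized fields acting as duplicate screens concrete, here is a small Python sketch. It is an assumption-laden illustration rather than the actual Burning Glass algorithm: the Posting fields, the norm helper and the two screens (employer plus title plus location, and source plus job ID) are all hypothetical choices.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Posting:
    title: str
    employer: str
    location: str
    source: str
    job_id: Optional[str] = None


def norm(value: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", value.lower())
    return " ".join(cleaned.split())


def screens(p: Posting) -> List[tuple]:
    """Each key is one duplicate screen; matching any single key flags a duplicate."""
    keys = [("content", norm(p.employer), norm(p.title), norm(p.location))]
    if p.job_id:
        keys.append(("id", norm(p.source), norm(p.job_id)))
    return keys


def deduplicate(postings: List[Posting]) -> List[Posting]:
    """Keep the first posting seen for each screen key, preserving input order."""
    seen = set()
    kept = []
    for p in postings:
        keys = screens(p)
        if any(k in seen for k in keys):
            continue
        seen.update(keys)
        kept.append(p)
    return kept
```

Normalizing before comparing is what lets "Acme Health, Inc." on an employer site match "ACME Health Inc" on a job board. Other fields, such as posting date, could be added as further screens.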

Moreover, our unique deduplication algorithm further leverages our parsing and coding capabilities by considering the actual job functions and skills described by the employer rather than just the raw text: we focus on the content of the posting, not simply the words or basic fields. As a result, the data we deliver to our clients is not only the most comprehensive representation of online hiring but also the most reliable.
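One way to picture comparing postings by content rather than wording is to reduce each description to a set of skills and compare the sets. The sketch below uses Jaccard similarity for that comparison; this is a swapped-in, generic technique chosen for illustration, not a description of Burning Glass's algorithm. SKILL_VOCAB, the substring-based extract_skills and the 0.8 threshold are all hypothetical.

```python
from typing import FrozenSet

# Hypothetical skill vocabulary; a real system would use a large coded taxonomy.
SKILL_VOCAB = {"python", "sql", "excel", "patient care", "phlebotomy"}


def extract_skills(description: str) -> FrozenSet[str]:
    """Naive substring lookup; it would also match 'excel' inside 'excellent'."""
    text = description.lower()
    return frozenset(skill for skill in SKILL_VOCAB if skill in text)


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Shared skills divided by all skills; 0.0 when neither set has any."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def same_job(desc_a: str, desc_b: str, threshold: float = 0.8) -> bool:
    """Treat two postings as the same job when their skill sets largely overlap."""
    return jaccard(extract_skills(desc_a), extract_skills(desc_b)) >= threshold
```

Two differently worded advertisements that ask for the same skills score 1.0 here, while postings with no recognized skills score 0.0 so that they are never merged on this signal alone.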

 
