Data Extraction Framework

The Web has become a huge virtual database of sorts, and we increasingly find ourselves wanting to scrape data from Web sites (although some sites do not allow scraping, and it may not be legal in every case).

However, most data commonly available on the Web is in HTML format, which cannot be easily processed by software systems that do not understand HTML presentation. Most systems that extract and process data from the Web (including shopping agents, personal news bots, etc.) fall into this category.

Once only the required data has been extracted from an HTML stream, software systems designed to process that data can consume it easily, without regard to its formatting. With simple HTML streams (such as the one listed above) extraction is not very difficult: any regular expression scanner will do nicely. With more complex files, though, it quickly becomes quite difficult. The following issues are the biggest stumbling blocks.

• An entire HTML file cannot be scanned with a single expression. To get meaningful results, several expressions need to be written and need to work in tandem—with precise control over which expression gets evaluated when. In lex this would typically be done using states.

• HTML data available on the Web is volatile. A scanner built using lex is difficult to maintain because any change to the HTML can break it. Each time the HTML data changes, the scanner code has to be regenerated using lex, and a binary update has to be made to the software system.
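To make the first issue concrete, the following is a minimal sketch of several expressions working in tandem under the control of states, in the spirit of lex start conditions. It is not Objective Toolkit code: it uses the C++ standard library's std::regex, and the names State, Rule, price_table_rules and scan, as well as the sample markup, are assumptions made for illustration only.

```cpp
#include <regex>
#include <string>
#include <vector>

// Scanner states, playing the role of lex start conditions.
enum State { OUTSIDE, IN_TABLE };

// One rule of the (hypothetical) scanner.
struct Rule {
    State state;         // rule is tried only while the scanner is in this state
    std::regex pattern;  // expression that must match at the current position
    State next;          // state entered after a successful match
    bool emit;           // if true, capture group 1 is reported
};

// Illustrative rule set: report cells only inside <table id="prices">.
std::vector<Rule> price_table_rules() {
    return {
        {OUTSIDE,  std::regex(R"(<table[^>]*id="prices"[^>]*>)"), IN_TABLE, false},
        {IN_TABLE, std::regex(R"(<td[^>]*>([^<]*)</td>)"),        IN_TABLE, true},
        {IN_TABLE, std::regex(R"(</table>)"),                     OUTSIDE,  false},
    };
}

// Walks the input once, trying only the rules that belong to the current state.
std::vector<std::string> scan(const std::string& input,
                              const std::vector<Rule>& rules,
                              State start) {
    std::vector<std::string> out;
    State state = start;
    auto pos = input.cbegin();
    while (pos != input.cend()) {
        bool matched = false;
        for (const Rule& r : rules) {
            if (r.state != state) continue;
            std::smatch m;
            // match_continuous anchors the match at pos instead of searching ahead.
            if (std::regex_search(pos, input.cend(), m, r.pattern,
                                  std::regex_constants::match_continuous) &&
                m.length(0) > 0) {
                if (r.emit) out.push_back(m[1].str());
                state = r.next;
                pos += m.length(0);
                matched = true;
                break;
            }
        }
        if (!matched) ++pos;  // no rule applies here: skip one character
    }
    return out;
}
```

Calling scan(page, price_table_rules(), OUTSIDE) on a page returns the text of the cells that appear inside the prices table and ignores cells elsewhere, which a single expression applied to the whole file could not distinguish. Nested markup inside a cell is not handled by this sketch.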

Stingray developers considered these two issues carefully when designing the Objective Toolkit library. On the first issue, the system that we have in place does not provide much of an advantage over lex (or other such systems): we offer the same functionality in terms of states. However, we believe that we have a more object-oriented, easily extensible solution for the following reasons:

• The underlying library (Regex++ by Dr. John Maddock) is very well written and maintainable.

NOTE >> On the issue of HTML data volatility, our approach has an important advantage. With lex, the scanner must be regenerated and a new binary built each time the Web information changes. With our approach this is not strictly needed.
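One way a rebuild can be avoided, sketched here under the assumption that the expressions are kept as data outside the compiled program, is to load them at run time. The code below is not the Objective Toolkit API; it uses std::regex from the C++ standard library, and the "name=expression" text format and the functions load_patterns and first_capture are hypothetical.

```cpp
#include <istream>
#include <map>
#include <regex>
#include <sstream>
#include <string>

// Reads "name=expression" lines (hypothetical format) and compiles each
// expression at run time. Blank lines, lines starting with '#', and lines
// without '=' are skipped.
std::map<std::string, std::regex> load_patterns(std::istream& in) {
    std::map<std::string, std::regex> patterns;
    std::string line;
    while (std::getline(in, line)) {
        const std::string::size_type eq = line.find('=');
        if (line.empty() || line[0] == '#' || eq == std::string::npos) continue;
        // Everything after the first '=' is the expression, so the
        // expression itself may contain further '=' characters.
        patterns.emplace(line.substr(0, eq), std::regex(line.substr(eq + 1)));
    }
    return patterns;
}

// Returns capture group 1 of the first match, or an empty string if the
// expression does not match or has no capture group.
std::string first_capture(const std::string& html, const std::regex& re) {
    std::smatch m;
    if (std::regex_search(html, m, re) && m.size() > 1) return m[1].str();
    return std::string();
}
```

With this arrangement, if a site changes its markup from <b>...</b> to <span class="price">...</span>, only the text that holds the "price" expression has to be edited; the program that calls load_patterns and first_capture stays as built.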