Scrapely, however, treats a HTML page as a stream of tokens. By building a representation of the page this way, Portia is able to handle any type of HTML no matter how badly formatted.
Scrapely reads the streams of tokens from the unannotated pages, and looks for regions that are similar to the sample’s annotations. To decide what should be extracted from new pages, it notes the tags that occur before and after the annotated regions, referred to as the prefix and suffix respectively.
Another feature we’ve began working on will let you automatically extract data from commonly found pages such as product, contact, and article pages. It works by using samples created by Portia users (and our team) to build models so that frequently extracted information will be automatically annotated.