Readability can fetch paginated pages and combine them into a single page
stores the parsed pages; keys are URLs with the trailing slash stripped
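A minimal sketch of such a page store, assuming a plain dict keyed by URL. The names `parsed_pages`, `page_key`, `remember_page` and `already_parsed` are hypothetical and only illustrate the trailing-slash normalization.

```python
# Hypothetical page store: maps a normalized URL to its parsed content.
parsed_pages = {}


def page_key(url):
    """Return the store key for a URL: the URL without trailing slashes."""
    # Assumption: all trailing slashes are dropped, not just the last one.
    return url.rstrip("/")


def remember_page(url, content):
    """Record a parsed page under its normalized key."""
    parsed_pages[page_key(url)] = content


def already_parsed(url):
    """True if the page was stored before, with or without a trailing slash."""
    return page_key(url) in parsed_pages
```

With this normalization, `http://example.com/a/` and `http://example.com/a` refer to the same stored page.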
normalizing the title
replace any break (<br/>) by
the amount of text that is inside a link divided by the total text in the node: anchored text length / all text length
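The ratio above could be computed roughly as follows. This is a sketch that assumes the node is an `xml.etree.ElementTree` element parsed from well-formed markup; the helper names `text_length` and `link_density` are illustrative, not taken from the original code.

```python
import xml.etree.ElementTree as ET


def text_length(node):
    """Length of all text contained in the node, including descendants."""
    return len("".join(node.itertext()))


def link_density(node):
    """Anchored text length divided by all text length (0.0 for empty nodes)."""
    total = text_length(node)
    if total == 0:
        return 0.0
    # Assumption: anchors are not nested, so their text is counted once.
    anchored = sum(text_length(a) for a in node.iter("a"))
    return anchored / total


node = ET.fromstring('<div>read <a href="/x">more</a> here</div>')
density = link_density(node)  # 4 anchored characters out of 14
```

A node made up mostly of links, such as a navigation bar, gets a density close to 1, while an article paragraph usually stays close to 0.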
main logic for readability: using a variety of metrics (content score, class name, element types), find the content that is most likely to be what the user wants to read, then return it wrapped in a div.
compute content score
Add the score to the parent. The grandparent gets half.
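The propagation rule can be sketched as below. The `Node` class is a hypothetical stand-in for a DOM element with a parent pointer; a real implementation would attach the score to whatever element type its HTML parser provides.

```python
class Node:
    """Hypothetical tree node carrying a running content score."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.score = 0.0


def propagate_score(node, score):
    """Add the full score to the parent and half of it to the grandparent."""
    parent = node.parent
    if parent is None:
        return
    parent.score += score
    grandparent = parent.parent
    if grandparent is not None:
        grandparent.score += score / 2


article = Node("div")
section = Node("div", parent=article)
paragraph = Node("p", parent=section)
propagate_score(paragraph, 4)
```

After the call, `section.score` is 4 and `article.score` is 2, so containers that hold several good paragraphs accumulate the highest totals.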
Get the number of times a string s appears in the node e.
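A short sketch of this counter, assuming the node is an `xml.etree.ElementTree` element. The function name `get_char_count` and the default of counting commas are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def get_char_count(e, s=","):
    """Count non-overlapping occurrences of the string s in the text of node e."""
    # Assumption: s defaults to a comma when the caller does not pass one.
    text = "".join(e.itertext())
    return text.count(s)


node = ET.fromstring("<p>one, two, <b>three</b>, four</p>")
commas = get_char_count(node)
```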
replace the characters < > & " ' with their escaped, safe equivalents
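In Python this escaping does not need to be written by hand: the standard library's `html.escape` covers all five characters when `quote=True`. The wrapper name `escape_html` is hypothetical.

```python
import html


def escape_html(s):
    """Replace &, <, >, double quote and single quote with HTML-safe entities."""
    # quote=True (the default) also escapes " and '.
    return html.escape(s, quote=True)


safe = escape_html('<a href="x">Tom & Jerry</a>')
```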
check readability flags
trim and squeeze spaces and return the textContent of a node
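One way to write this helper, assuming an `xml.etree.ElementTree` element as the node. The name `get_inner_text` and the `normalize_spaces` switch are illustrative choices, not confirmed by the source.

```python
import re
import xml.etree.ElementTree as ET


def get_inner_text(node, normalize_spaces=True):
    """Return the node's text content, trimmed, with whitespace runs squeezed."""
    text = "".join(node.itertext()).strip()
    if normalize_spaces:
        # Collapse any run of whitespace (spaces, tabs, newlines) to one space.
        text = re.sub(r"\s+", " ", text)
    return text


node = ET.fromstring("<p>  Hello   <b>big</b>\n\n world  </p>")
text = get_inner_text(node)
```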
1. clean style attribute recursively
1. Some content ends up looking ugly if the image is too large to be floated. If the image is wider than a threshold (currently 55%), no longer float it, center it instead.
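The decision itself is a single comparison. The sketch below assumes the 55% threshold is measured against the width of the containing column, which the comment does not state explicitly; `image_layout` and its return values are hypothetical names.

```python
# Threshold from the comment above; what it is relative to is an assumption.
FLOAT_WIDTH_THRESHOLD = 0.55


def image_layout(image_width, container_width, threshold=FLOAT_WIDTH_THRESHOLD):
    """Return "center" for images too wide to float, otherwise "float"."""
    if image_width > threshold * container_width:
        return "center"
    return "float"
```

For a 1000px column, a 600px image would be centered while a 300px image would keep floating.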
post processing: add footnotes, fix floating images
1. get document.title or h1
2. normalize the title
1. readability tracking script
add links found in the page as footnotes
nothing xhr xmlhttprequest successfulrequest ajax
find the article's base URL: normalize it and remove the pagination part. Only the path is kept, no query string.
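A deliberately simplified sketch of the idea, not the full heuristic: it drops the query string and fragment, then strips trailing path segments that look like page markers. The `PAGE_SEGMENT` pattern and the name `find_base_url` are assumptions, and the pattern will also strip short numeric article IDs.

```python
import re
from urllib.parse import urlsplit

# Assumed shape of a pagination segment: "2", "p2", "page2", "page-2", "page_2".
PAGE_SEGMENT = re.compile(r"^(?:page|p)?[-_]?\d{1,2}$", re.IGNORECASE)


def find_base_url(url):
    """Return scheme, host and path of the article without pagination parts."""
    parts = urlsplit(url)  # the query string and fragment are simply not reused
    segments = [s for s in parts.path.split("/") if s]
    # Remove trailing segments that look like page numbers.
    while segments and PAGE_SEGMENT.match(segments[-1]):
        segments.pop()
    base = f"{parts.scheme}://{parts.netloc}"
    path = "/".join(segments)
    return f"{base}/{path}" if path else base
```

For example, `http://example.com/news/story/page2?ref=x` and `http://example.com/news/story/2/` both reduce to `http://example.com/news/story`.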
1. find all links
2. if the link has already been seen, or it points to the page itself, skip it
3. if it is on a different domain, ignore it
4. if its text matches the EXTRANEOUS regex or is too long, remove it
5. if, after removing the base URL, what is left contains no number, remove it
OK, the logic is very good; it just needs to be translated to Python.
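The filtering steps above can be sketched as follows. Links are passed in as `(href, text)` pairs so the example stays independent of any HTML parser. The contents of `EXTRANEOUS`, the 25-character text limit, and the function name `candidate_next_links` are assumptions for illustration.

```python
import re
from urllib.parse import urlsplit

# Assumed pattern for link text that is unlikely to mean "next page".
EXTRANEOUS = re.compile(
    r"print|archive|comment|discuss|e-?mail|share|reply|all|login|sign|single",
    re.IGNORECASE,
)
MAX_LINK_TEXT = 25  # assumed cutoff for "long" link text


def candidate_next_links(links, page_url, base_url, seen=None):
    """Filter (href, text) pairs down to plausible next-page links."""
    seen = set(seen or ())  # local copy, the caller's collection is untouched
    page_key = page_url.rstrip("/")
    page_host = urlsplit(page_url).netloc
    candidates = []
    for href, text in links:
        key = href.rstrip("/")
        # Step 2: skip links already seen and links to the page itself.
        if key in seen or key == page_key:
            continue
        seen.add(key)
        # Step 3: ignore links to other domains.
        if urlsplit(href).netloc != page_host:
            continue
        # Step 4: drop extraneous or overly long link text.
        if EXTRANEOUS.search(text) or len(text) > MAX_LINK_TEXT:
            continue
        # Step 5: what remains after the base URL must contain a digit.
        if not re.search(r"\d", key.replace(base_url, "")):
            continue
        candidates.append(href)
    return candidates
```

The survivors of this filter are only candidates; choosing the single most likely next-page link among them would still need a separate scoring step.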