$ ls ~yifei/notes/

Reading the readability.js source code

Posted on:2017-08-16 13:40

Last modified:2020-05-16 03:50

Readability can fetch the pages of a paginated article and combine them into one page.

init

start the whole process
remove event listeners
remove scripts
find next page
prep the document
get readability article components
get the document direction
add readability dom to the dom
post process
scroll to top
append next page to the dom
add some smooth scrolling function

parsedPages

stores the pages that have already been parsed; the keys are page URLs with the trailing slash stripped
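
A minimal Python sketch of that bookkeeping. The names `parsed_pages`, `page_key`, `mark_parsed` and `already_parsed` are my own, not from readability.js, and I assume only a single trailing slash is stripped.

```python
import re

# Hypothetical stand-in for the parsedPages map: key -> True.
parsed_pages = {}


def page_key(url: str) -> str:
    """Strip one trailing slash so both spellings of a URL share a key."""
    return re.sub(r"/$", "", url)


def mark_parsed(url: str) -> None:
    parsed_pages[page_key(url)] = True


def already_parsed(url: str) -> bool:
    return page_key(url) in parsed_pages
```

With this, `http://example.com/story/` and `http://example.com/story` count as the same page.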

prepDocument

create document.body if there is none
find the biggest frame (width + height)
remove all css
remove all style elements
replace   to

prepArticle

prepare the article node for display. Clean out any inline styles, iframes, forms, strip extraneous <p> tags, etc.
clean styles
clean unwanted tags
if there is only one h2, it must be the title; we already have the title, so remove it
remove empty <p>s

getArticleTitle

get document.title or h1
normalizing the title

killBreaks

replace any run of breaks (<br> tags, possibly separated by whitespace or &nbsp;) by a single <br />

cleanTags

clean child tags of given element
cleanConditionally

getLinkDensity

the amount of text that is inside a link divided by the total text in the node: anchor text length / all text length
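
A rough Python sketch of the same ratio, working on an HTML string with the standard library parser instead of a DOM node. `link_density` and `_TextCounter` are names I made up, and whitespace is not normalized here, which the original may well do.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts all text characters and those nested inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.total_chars = 0
        self.anchor_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        self.total_chars += len(data)
        if self.anchor_depth > 0:
            self.anchor_chars += len(data)


def link_density(html_fragment: str) -> float:
    """Anchor text length divided by all text length (0.0 if no text)."""
    counter = _TextCounter()
    counter.feed(html_fragment)
    counter.close()
    if counter.total_chars == 0:
        return 0.0
    return counter.anchor_chars / counter.total_chars
```

A navigation block made almost entirely of links scores close to 1, while a normal paragraph with one or two links stays close to 0.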

grabArticle

main logic for readability, using a variety of metrics (content score, classname, element types), find the content that is most likely to be the stuff a user wants to read. Then return it wrapped up in a div.

get nodes to score:

get all nodes
remove unlikely candidates by matching specific patterns against the class name and id
add p, td, pre to nodesToScore
turn all divs that have no block-level child elements into p's, and add them to nodesToScore
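
The "unlikely candidate" check could look roughly like this in Python. The pattern below is a shortened, illustrative word list of my own, not the regex from readability.js, and `is_unlikely_candidate` is a hypothetical name.

```python
import re

# Illustrative pattern only: a few names that usually mark non-content blocks.
UNLIKELY = re.compile(
    r"comment|foot|header|menu|sidebar|sponsor|pagination|popup", re.IGNORECASE
)


def is_unlikely_candidate(class_name: str, element_id: str) -> bool:
    """True if the class name or id looks like page chrome, not content."""
    return UNLIKELY.search(f"{class_name} {element_id}") is not None
```

Nodes for which this returns True would be dropped before any scoring happens.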

get candidates

loop through all paragraphs and assign each a score based on how content-ish it looks. Then add its score to its parent nodes.
if the text is too short (fewer than 25 characters), ignore the paragraph
initialize the parent and grandparent nodes (?)
compute the content score
1. base score of 1
2. add one point per comma; note: only English (ASCII) commas are counted
3. for every 100 characters in the paragraph, add another point, up to 3 points
4. add the score to the parent; the grandparent gets half
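
The scoring steps above, sketched in Python as I understand them from these notes. `paragraph_score` and `add_to_ancestors` are hypothetical names, nodes are represented by plain dictionary keys, and the exact comma counting in readability.js may differ slightly from this.

```python
from typing import Dict, Hashable, Optional


def paragraph_score(text: str) -> Optional[int]:
    """Score one paragraph; None means it is too short to count."""
    text = " ".join(text.split())       # collapse whitespace first
    if len(text) < 25:
        return None
    score = 1                           # 1. base score
    score += text.count(",")            # 2. one point per ASCII comma
    score += min(len(text) // 100, 3)   # 3. one point per 100 chars, max 3
    return score


def add_to_ancestors(
    scores: Dict[Hashable, float],
    parent: Hashable,
    grandparent: Optional[Hashable],
    score: float,
) -> None:
    """4. The parent gets the full score, the grandparent gets half."""
    scores[parent] = scores.get(parent, 0) + score
    if grandparent is not None:
        scores[grandparent] = scores.get(grandparent, 0) + score / 2
```

Because only `,` is counted, Chinese text using `，` gets no comma points, which is what the note in step 2 is about.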

getCharCount

Get the number of times a string s appears in the node e.

htmlspecialchars

replace <, >, &, " and ' with safe strings (HTML entities)
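
In Python the standard library already covers this, so a translation probably needs no hand-written table. A small sketch, assuming the exact entity spellings do not matter (Python writes the single quote as `&#x27;`, which may not be the string readability.js uses):

```python
import html


def htmlspecialchars(s: str) -> str:
    """Escape <, >, &, double quotes and single quotes as HTML entities."""
    return html.escape(s, quote=True)
```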

flagIsActive/addFlag/removeFlag

check/add/remove readability flags
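
One way to translate these three helpers, assuming the flags are stored as bits in a single integer. The flag names and values below are written from memory of the source and should be treated as assumptions; the `Flags` class is my own wrapper.

```python
# Assumed flag values, one bit each.
FLAG_STRIP_UNLIKELYS = 0x1
FLAG_WEIGHT_CLASSES = 0x2
FLAG_CLEAN_CONDITIONALLY = 0x4


class Flags:
    """Tiny bitmask holder with check/add/remove operations."""

    def __init__(self, value: int = 0):
        self.value = value

    def is_active(self, flag: int) -> bool:
        return (self.value & flag) > 0

    def add(self, flag: int) -> None:
        self.value |= flag

    def remove(self, flag: int) -> None:
        self.value &= ~flag
```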

removeScripts

remove all JavaScript found on the page

getInnerText

trim the textContent of a node, collapse runs of whitespace into a single space, and return the result
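
A Python sketch of the same clean-up, taking the text content as a string. `inner_text` and the `normalize_spaces` switch are my naming, and I assume "squeeze" means replacing two or more whitespace characters with one space.

```python
import re


def inner_text(text_content: str, normalize_spaces: bool = True) -> str:
    """Trim the text and optionally collapse runs of whitespace."""
    text = text_content.strip()
    if normalize_spaces:
        text = re.sub(r"\s{2,}", " ", text)
    return text
```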

cleanStyles

clean style attribute recursively

fixImageFloats

Some content ends up looking ugly if the image is too large to be floated. If the image is wider than a threshold (currently 55%), no longer float it, center it instead.

postProcessContent

post processing: add footnotes, fix floating images

getArticleTools

build the toolbar element (reload, print and email links)

getSuggestedDirection

getArticleFooter

readability tracking script

addFootNotes

add the links found in the page as footnotes

useRdbTypekit

nothing

findBaseUrl

find the article's base URL: normalize it and remove the pagination part; only the path is used, not the query string

findNextPageLink

find all links
skip a link that has already been seen or that points to the current page
skip a link on a different domain
skip a link whose text matches the EXTRANEOUS regex or is too long
skip a link whose href, with the base URL removed, contains no digit

OK, the logic is very good; it just needs to be translated to Python.
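
A first pass at that translation, covering only the filtering steps listed above and none of the later scoring. Everything here is a sketch under assumptions: links arrive as `(href, text)` pairs with absolute URLs, `EXTRANEOUS` is a shortened illustrative pattern, `MAX_LINK_TEXT = 25` is my guess at what "too long" means, and the function name is made up.

```python
import re
from typing import Iterable, List, Set, Tuple
from urllib.parse import urlsplit

# Illustrative pattern and threshold, not copied from readability.js.
EXTRANEOUS = re.compile(r"print|archive|comment|discuss|e-?mail|share|reply|login",
                        re.IGNORECASE)
MAX_LINK_TEXT = 25


def candidate_next_links(
    links: Iterable[Tuple[str, str]],
    page_url: str,
    base_url: str,
    seen: Set[str],
) -> List[str]:
    """Return hrefs that survive the filters, in document order."""
    page_host = urlsplit(page_url).netloc
    current = page_url.rstrip("/")
    candidates = []
    for href, text in links:
        href = href.rstrip("/")
        # Already seen, or the page itself.
        if not href or href == current or href in seen:
            continue
        # Different domain.
        if urlsplit(href).netloc != page_host:
            continue
        # Link text looks like a tool link or is too long.
        if EXTRANEOUS.search(text) or len(text) > MAX_LINK_TEXT:
            continue
        # Nothing number-like is left once the base URL is removed.
        leftover = href.replace(base_url, "", 1)
        if not re.search(r"\d", leftover):
            continue
        candidates.append(href)
    return candidates
```

For an article at `http://example.com/story/`, a link to `http://example.com/story/2` survives, while links to another host, a "Print" link, or `/story/about` are dropped.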

appendNextPage

append next page to current page

All contents are under the CC-BY-NC-SA license, if not otherwise specified.

Opinions expressed here are solely my own and do not express the views or opinions of my employer.
