
Reading the readability.js source code


Readability can fetch the pages of a paginated article and combine them into a single page.  


  1. start the whole process
  2. remove event listeners
  3. remove scripts
  4. find the next page
  5. prepare the document
  6. get the readability article components
  7. get the document direction
  8. add the readability DOM to the page DOM
  9. post-process
  10. scroll to top
  11. append the next page to the DOM
  12. add a smooth scrolling function


Stores the parsed pages; the keys are page URLs with the trailing slash stripped.
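
The keying rule above can be sketched as follows. This is a minimal illustration, assuming the keys are plain URL strings; the names `parsed_pages`, `page_key`, `mark_parsed`, and `already_parsed` are hypothetical and do not come from readability.js.

```python
# Hypothetical registry of pages that were already parsed.
parsed_pages = {}


def page_key(url: str) -> str:
    """Strip one trailing slash so both URL forms map to the same key."""
    return url[:-1] if url.endswith("/") else url


def mark_parsed(url: str) -> None:
    """Record a page as parsed under its normalized key."""
    parsed_pages[page_key(url)] = True


def already_parsed(url: str) -> bool:
    """Check whether a page was parsed, ignoring a trailing slash."""
    return page_key(url) in parsed_pages
```

With this normalization, `http://example.com/a/` and `http://example.com/a` count as the same page, which prevents the same page from being fetched twice.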


  1. create document.body if there is none
  2. find the biggest frame (by width + height)
  3. remove all CSS
  4. remove all style elements
  5. replace consecutive <br> tags (e.g. <br/><br>) with </p><p>
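
The last step above could look roughly like this in Python. This is a sketch that works on an HTML string; the regular expression is an assumption about what counts as "consecutive" breaks (two or more `<br>` tags separated only by whitespace), and `replace_double_brs` is a hypothetical name.

```python
import re

# Assumed pattern: two or more <br> tags separated only by whitespace.
REPLACE_BRS = re.compile(r"(<br[^>]*>[ \n\r\t]*){2,}", re.IGNORECASE)


def replace_double_brs(html: str) -> str:
    """Turn each run of two or more <br> tags into a paragraph boundary."""
    return REPLACE_BRS.sub("</p><p>", html)
```

A single `<br>` is left alone, since it is more likely a line break inside a paragraph than a paragraph separator.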




  1. prepare the article node for display: clean out any inline styles, iframes, and forms, strip extraneous <p> tags, etc.
  2. clean styles
  3. clean unwanted tags
  4. if there is only one h2, it must be the title; since we already have the title, remove it
  5. remove empty <p> elements




  1. get document.title or the h1
  2. normalize the title



  Collapse any run of breaks (<br/> plus trailing whitespace) into a single <br />.  



  1. clean child tags of given element
  2. cleanConditionally



  The amount of text inside links divided by the total text in the node: anchor text length / total text length.  
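
As a rough illustration of the ratio above, here is a sketch using only the Python standard library. The class `_LinkTextCounter` and the function `link_density` are hypothetical names, and returning 0.0 for a node with no text is an assumption made to avoid dividing by zero.

```python
from html.parser import HTMLParser


class _LinkTextCounter(HTMLParser):
    """Count characters of text overall and characters inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.depth_in_link = 0
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth_in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth_in_link > 0:
            self.depth_in_link -= 1

    def handle_data(self, data):
        self.total_chars += len(data)
        if self.depth_in_link > 0:
            self.link_chars += len(data)


def link_density(html_fragment: str) -> float:
    """Anchor text length divided by total text length."""
    counter = _LinkTextCounter()
    counter.feed(html_fragment)
    counter.close()
    if counter.total_chars == 0:
        return 0.0  # assumption: treat an empty node as having no link text
    return counter.link_chars / counter.total_chars
```

A navigation block made up almost entirely of links gets a density close to 1, while an article paragraph with one or two inline links stays close to 0.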


  The main logic of readability: using a variety of metrics (content score, class name, element types), find the content that is most likely to be what the user wants to read, then return it wrapped in a div.

Get the nodes to score:

  1. get all nodes
  2. remove unlikely candidates by matching specific patterns against the class name and id
  3. add p, td, and pre elements to nodesToScore
  4. turn every div that has no block-level child elements into a p, and add it to nodesToScore
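
Steps 2 and 3 above might be sketched like this. The pattern below is a short illustrative stand-in, not the actual list used by readability.js, and leaving `<body>` untouched is an assumption; `is_unlikely_candidate` and `should_score` are hypothetical names.

```python
import re

# Illustrative pattern only; the real list of unlikely names is an assumption here.
UNLIKELY = re.compile(r"comment|footer|sidebar|sponsor|pagination", re.IGNORECASE)

# Tags that go straight into the list of nodes to score (step 3).
SCORABLE_TAGS = {"p", "td", "pre"}


def is_unlikely_candidate(class_name: str, element_id: str, tag: str) -> bool:
    """Match the unlikely-name pattern against class name and id together."""
    if tag.lower() == "body":
        return False  # assumption: the body itself is never dropped
    return bool(UNLIKELY.search(class_name + element_id))


def should_score(tag: str) -> bool:
    """True for the tags that are added to the nodes to score."""
    return tag.lower() in SCORABLE_TAGS
```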

Get the candidates:

  1. loop through all paragraphs and assign each a score based on how content-like it looks, then add the scores to the parent nodes
  2. if the text is too short (fewer than 25 characters), ignore the paragraph
  3. initialize the parent and grandparent nodes?
  4. compute the content score
    1. base score of 1
    2. add points for the number of commas (note: only English commas are counted)
    3. for every 100 characters in the paragraph, add another point, up to 3 points
    4. add the score to the parent; the grandparent gets half
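
The scoring steps above can be sketched as follows. This follows the note literally (one point per comma), which may differ in detail from the original JavaScript; nodes are represented by plain string labels, and `paragraph_score` and `propagate` are hypothetical names.

```python
from typing import Dict, Optional

MIN_PARAGRAPH_LENGTH = 25  # paragraphs shorter than this are ignored


def paragraph_score(text: str) -> int:
    """Score one paragraph from its commas and its length."""
    score = 1                          # base score
    score += text.count(",")           # only the ASCII comma is counted
    score += min(len(text) // 100, 3)  # one point per 100 characters, at most 3
    return score


def propagate(scores: Dict[str, float], text: str,
              parent: str, grandparent: Optional[str]) -> None:
    """Add the paragraph score to the parent and half of it to the grandparent."""
    if len(text) < MIN_PARAGRAPH_LENGTH:
        return
    score = paragraph_score(text)
    scores[parent] = scores.get(parent, 0) + score
    if grandparent is not None:
        scores[grandparent] = scores.get(grandparent, 0) + score / 2
```

Because only the ASCII comma is counted, text punctuated with full-width commas (for example Chinese text) gets no comma bonus, which is worth keeping in mind when porting.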



  Get the number of times a string s appears in the node e.  


  Replace the characters <, >, &, " and ' with HTML-safe strings.  
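
In Python the same escaping is available in the standard library, so a port would probably not need a hand-written version. A minimal sketch, where `html_escape` is a hypothetical wrapper name:

```python
import html


def html_escape(text: str) -> str:
    """Escape <, >, &, and both quote characters for safe use in HTML."""
    return html.escape(text, quote=True)
```

Note that `html.escape` writes the single quote as `&#x27;`, which may differ from the entity the JavaScript version emits.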


  check/ readability flags


  Remove all scripts found on the page.  


  trim and squeeze spaces and return the textContent of a node  
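
A sketch of that text normalization, assuming the node's text content has already been extracted as a string. The function name `inner_text` and the `normalize_spaces` switch are hypothetical.

```python
import re


def inner_text(text_content: str, normalize_spaces: bool = True) -> str:
    """Trim the text and optionally squeeze whitespace runs into one space."""
    text = text_content.strip()
    if normalize_spaces:
        text = re.sub(r"\s{2,}", " ", text)
    return text
```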



  1. clean style attribute recursively



  1. Some content ends up looking ugly if the image is too large to be floated. If the image is wider than a threshold (currently 55%), no longer float it, center it instead.
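
The threshold check can be sketched as below. It assumes the 55% is measured against the width of the article container, which the note does not state explicitly; `image_layout` is a hypothetical name.

```python
FLOAT_WIDTH_RATIO = 0.55  # the 55% threshold mentioned in the note


def image_layout(image_width: float, container_width: float) -> str:
    """Return "center" for images too wide to float, otherwise "float"."""
    if image_width > container_width * FLOAT_WIDTH_RATIO:
        return "center"
    return "float"
```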



  post processing: add footnotes, fix floating images  









  1. readability tracking script


  Add the links found in the page as footnotes.  




  Find the article's base URL: normalize it and remove the pagination part, keeping only the path, with no query string.  
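
A much simplified sketch of this idea in Python follows. The rule for what counts as a pagination segment (`2`, `p2`, `page-3`, and similar) is an assumption made for illustration, and the original function is likely more thorough; `PAGE_SEGMENT` and `find_base_url` are hypothetical names.

```python
import re
from urllib.parse import urlsplit

# Assumed shape of a pagination segment: "2", "p2", "page-3", "page_12", ...
PAGE_SEGMENT = re.compile(r"^(?:page|p)?[-_]?\d{1,2}$", re.IGNORECASE)


def find_base_url(url: str) -> str:
    """Keep scheme, host, and path; drop the query and a trailing page segment."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if segments and PAGE_SEGMENT.match(segments[-1]):
        segments.pop()
    return f"{parts.scheme}://{parts.netloc}/" + "/".join(segments)
```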



  1. find all links
  2. skip a link that has already been seen or that points to the current page
  3. ignore a link on a different domain
  4. remove a link that matches the EXTRANEOUS regex or has long text
  5. remove a link that contains no number once the base URL is stripped from it

  OK, the logic is very good; just translate it to Python.  
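
A first pass at that translation, covering only the filtering steps listed above, might look like this. The `EXTRANEOUS` pattern and the `MAX_LINK_TEXT` cutoff are illustrative assumptions, links are given as `(href, text)` pairs rather than DOM nodes, and `candidate_links` is a hypothetical name.

```python
import re
from urllib.parse import urlsplit

# Illustrative stand-in for the EXTRANEOUS regex; the real pattern is an assumption.
EXTRANEOUS = re.compile(r"print|archive|comment|share|email", re.IGNORECASE)
MAX_LINK_TEXT = 25  # assumed cutoff for "long text"


def _strip_slash(url: str) -> str:
    """Remove one trailing slash so both URL forms compare equal."""
    return url[:-1] if url.endswith("/") else url


def candidate_links(links, page_url, base_url, seen):
    """Filter (href, text) pairs down to plausible next-page links."""
    page_host = urlsplit(page_url).netloc
    result = []
    for href, text in links:
        key = _strip_slash(href)
        if key in seen or key == _strip_slash(page_url):
            continue  # already seen, or the page itself
        if urlsplit(href).netloc != page_host:
            continue  # different domain
        if EXTRANEOUS.search(text) or len(text) > MAX_LINK_TEXT:
            continue  # extraneous or overly long link text
        if not re.search(r"\d", href.replace(base_url, "")):
            continue  # nothing numeric left after removing the base URL
        result.append(href)
    return result
```

The links that survive this filter would still need to be ranked before one is chosen as the next page; that part is not covered here.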


Append the next page to the current page.


© 2016-2022 Yifei Kong. Powered by ynotes

All contents are under the CC-BY-NC-SA license, if not otherwise specified.

Opinions expressed here are solely my own and do not express the views or opinions of my employer.
