
A key problem with extracting article context is that there are so many distinct sources.

That said, power laws and Zipf distributions apply, and a large fraction of HN front-page articles come from a relatively small set of domains. Further aggregation is possible when the underlying publishing engine can be identified, e.g., WordPress, the CMSes used by a large number of news organisations, Medium, Substack, GitHub, GitLab, Fediverse servers, and a number of static site generators (Hugo, Jekyll, Pelican, Gatsby, etc.).
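For the engine-identification part, a crude but serviceable approach is sniffing the page markup itself; many engines announce themselves in a `<meta name="generator">` tag or via characteristic asset hosts. A minimal sketch (the marker list is illustrative, not a vetted fingerprint database):

```python
import re
from typing import Optional

# Illustrative markers only; a real list would be longer and maintained.
ENGINE_MARKERS = [
    ("wordpress", r'<meta[^>]+name="generator"[^>]+content="WordPress'),
    ("hugo",      r'<meta[^>]+name="generator"[^>]+content="Hugo'),
    ("jekyll",    r'<meta[^>]+name="generator"[^>]+content="Jekyll'),
    ("substack",  r'substackcdn\.com'),
]

def guess_engine(html: str) -> Optional[str]:
    """Return the first matching engine name, or None if nothing matches."""
    for engine, pattern in ENGINE_MARKERS:
        if re.search(pattern, html, re.IGNORECASE):
            return engine
    return None
```

Once a page is bucketed by engine, one set of extraction rules covers every site on that engine, which is where the aggregation win comes from.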

I suspect you're aware of most of this.

I have a set of front-page sites from an earlier scraping project:

(For the life of me I cannot remember what the 3rd column represents, though it appears to be a cumulative percentage of all submissions. The "category" field was manually supplied by me; every site with > 17 appearances has one, as do several below that threshold which could be identified by other means, e.g., regexes matching blogging engines, GitHub pages, etc.)

  Rank  Count    ???  Site :::: Category
  ------------------------------------------------------------- 
     1  7294   5.175  n/a :::: n/a
     2  3803   7.873  nytimes.com :::: general news
     3  3495  10.352  techcrunch.com :::: tech news
     4  1580  11.473  arstechnica.com :::: tech news
     5  1344  12.426  bloomberg.com :::: business news
     6  1288  13.340  wired.com :::: tech news
     7  1171  14.171  wsj.com :::: business news
     8  1099  14.951  youtube.com :::: video
     9  1026  15.678  wikipedia.org :::: general info (wiki)
    10   921  16.332  bbc.com :::: general news
    11   911  16.978  bbc.co.uk :::: general news
    12   893  17.612  theguardian.com :::: general news
    13   866  18.226  washingtonpost.com :::: general news
    14   846  18.826  reuters.com :::: general news
    15   829  19.414  economist.com :::: business news
    16   781  19.968  theatlantic.com :::: general interest
    17   631  20.416  arxiv.org :::: academic / science
    18   628  20.862  npr.org :::: general news
    19   622  21.303  nature.com :::: academic / science
    20   614  21.738  newyorker.com :::: general interest
    21   505  22.097  eff.org :::: law
    22   475  22.434  stanford.edu :::: academic / science
    23   471  22.768  ieee.org :::: technology
    24   456  23.091  reddit.com :::: general discussion
    25   448  23.409  amazon.com :::: corporate comm.
    26   445  23.725  microsoft.com :::: technology
    27   416  24.020  theverge.com :::: tech news
    28   410  24.311  venturebeat.com :::: business news
    29   408  24.600  quantamagazine.org :::: academic / science
    30   407  24.889  cnn.com :::: general news
17,782 sites in total, if I'm reading my past notes correctly.
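For what it's worth, the mystery column checks out as a running cumulative percentage, at least over the rows shown: working backwards from row 1, the implied total is 7294 / 0.05175 ≈ 141k submissions, and every subsequent row agrees to within rounding. A quick sanity check (the total is an inference from the table, not from my notes):

```python
# First five (count, ???-column) pairs from the table above.
counts = [7294, 3803, 3495, 1580, 1344]
pcts   = [5.175, 7.873, 10.352, 11.473, 12.426]

# If column 3 is a cumulative percentage, row 1 implies the total.
total = counts[0] / (pcts[0] / 100)   # ~140,946 submissions

running = 0
for count, pct in zip(counts, pcts):
    running += count
    # Each recomputed cumulative share matches the published column
    # to well within rounding error.
    assert abs(100 * running / total - pct) < 0.05
```

So "miscalculated" may be unfair to my past self; it just lacked a label.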

More on that project in an HN search: <https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...>

(Individual comments/posts seem presently unreachable due to an HN site bug.)



Further thoughts on article extraction: one idea that comes to mind is including extraction rules in the source selection metadata.

I'm using something along these lines right now to process sections within a given source: I define the section-distinguishing element from a headline URL, along with the plaintext, position (within my generated page), lines of context, and maximum age (in days) I'm interested in.

That could be extended or paired with a per-source rule that identifies the htmlq specifiers which pull out title, dateline, and byline elements from the source.
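To make that concrete, here's a sketch of what per-source rules might look like, with a toy extractor standing in for htmlq. Everything here is hypothetical (the `RULES` table, the `extract` function, the "tag.class" selector subset); a real implementation would shell out to htmlq with full CSS selectors:

```python
from html.parser import HTMLParser

# Hypothetical per-source metadata: field -> "tag.class" selector.
RULES = {
    "example.com": {
        "title":    "h1.headline",
        "dateline": "time.published",
        "byline":   "span.author",
    },
}

class _Grabber(HTMLParser):
    """Capture the first text inside each (tag, class) we care about."""
    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted          # {(tag, class): field_name}
        self.out = {}
        self._field = None

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "") or ""
        self._field = self.wanted.get((tag, cls))

    def handle_endtag(self, tag):
        self._field = None

    def handle_data(self, data):
        text = data.strip()
        if self._field and text and self._field not in self.out:
            self.out[self._field] = text
            self._field = None

def extract(domain: str, html: str) -> dict:
    rules = RULES.get(domain, {})
    wanted = {tuple(sel.split(".", 1)): field for field, sel in rules.items()}
    parser = _Grabber(wanted)
    parser.feed(html)
    return parser.out
```

The useful property is that the selectors live in data rather than code, so fixing a source after a redesign means editing one metadata record. And a rule that suddenly returns nothing is itself a decent signal that the publisher's markup has changed, which bears on the drift problem below.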

A further challenge is that such specifiers tend to change whenever the publisher's back-end CMS changes, and finding ways to detect that drift is ... difficult.

But grist for the mill, at any rate.



