Locale: en-US
Page: Media Sources

Page Type:

Alias Page To:

Page Border:

Table of Contents:

Title:

Author:

Meta Robots:

Meta Description:

Meta Properties (such as Open Graph)

One line per property in format: name|content

Header Page Name:

Footer Page Name:

'''Media Sources''' are used to specify how Yioop should handle news feeds, podcasts, and trending value sites. The '''Add Media Source''' form lets you add new media sources. What this form looks like depends on the choice made in the '''Type''' dropdown. Below we describe the form for each of the possible types:

An '''RSS media source''' can be used to add an RSS or Atom feed (Yioop auto-detects which kind) to the list of feeds which are downloaded hourly when Yioop's Media Updater is turned on. Besides the name, you need to specify the URL of the feed in question. The Category field should usually be left at news. If you want to specify additional categories such as weather or sports, you typically want to create a mix that searches the default index with the keyword media:your_category injected, and then make a new subsearch with that mix.
This will allow your new category to show up on the Tools/More/Other Searches page.

An '''HTML media source''' is a web page that has feed articles like an RSS page that you want the Media Updater to scrape on an hourly basis. To specify where in the HTML page the news items appear you specify different XPath information. For example,
<pre>
 Name: Cape Breton Post
 URL: http://www.capebretonpost.com/News/Local-1968
 Language: English
 Category: news
 Channel: //div[contains(@class, "channel")]
 Item: //article
 Title: //a
 Description: //div[contains(@class, "dek")]
 Link: //a
</pre>
The Channel field is used to specify the tag that encloses all the news items. Relative to this as the root tag, //article gives the path to an individual news item. Then, relative to an individual news item, //a gets the title, and so on. Link extracts the href attribute of that same //a.
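The way each XPath is evaluated relative to the previous match can be sketched in Python. This is only an illustration, not Yioop's implementation: the sample page is invented, and because the standard library's ElementTree does not support contains(), the hypothetical helper first_with_class stands in for the contains(@class, ...) tests.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample page; a real page would need an HTML parser.
PAGE = """
<html><body>
  <div class="main channel">
    <article>
      <a href="/news/one">First headline</a>
      <div class="dek">Summary of the first story.</div>
    </article>
    <article>
      <a href="/news/two">Second headline</a>
      <div class="dek">Summary of the second story.</div>
    </article>
  </div>
</body></html>
"""


def first_with_class(root, tag, fragment):
    """Hypothetical stand-in for //tag[contains(@class, "fragment")]."""
    for node in root.iter(tag):
        if fragment in node.get("class", ""):
            return node
    return None


def scrape_items(page):
    root = ET.fromstring(page.strip())
    # Channel: the tag that encloses all the news items.
    channel = first_with_class(root, "div", "channel")
    items = []
    # Item: //article, evaluated relative to the channel.
    for article in channel.iter("article"):
        # Title and Link both use //a, relative to the item.
        anchor = article.find(".//a")
        # Description: the div whose class contains "dek".
        dek = first_with_class(article, "div", "dek")
        items.append({
            "title": anchor.text,
            "link": anchor.get("href"),
            "description": dek.text,
        })
    return items


ITEMS = scrape_items(PAGE)
```

Here ITEMS holds one dictionary per article, with the title taken from the text of the anchor and the link taken from its href attribute.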

A '''JSON media source''' is used to scrape feed articles from JSON data, as may be provided by a website's API. To handle a JSON media source you provide the same information as with an HTML media source. Internally, Yioop converts all JSON sources to XML before processing. The root object maps to /html/body.
A property ''foo'' of the root object gets mapped to a tag <foo>. Array elements are mapped to a sequence of elements enclosed in <item> tags. The process is applied recursively until the JSON object is completely converted to an XML page. Once this is done, the XPaths that a user provides are used to extract the feed items in the same way that HTML feeds are extracted. As an example, Yioop search results and discussion groups can be output as JSON. To take Yioop's news feed and use it as a JSON media source in your search engine, you could use the settings:
<pre>
 Name: Yioop News
 URL: https://www.yioop.com/s/news?f=json
 Language: English
 Category: news
 Channel: //channel
 Item: //item
 Title: //title
 Description: //description
 Link: //link
</pre>
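The JSON to XML mapping described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than Yioop's own code: the function names and the sample feed are invented, and property names are assumed to be valid XML tag names.

```python
import json
import xml.etree.ElementTree as ET


def fill(parent, value):
    """Recursively copy a decoded JSON value into the XML element parent."""
    if isinstance(value, dict):
        # A property foo becomes a child tag <foo>.
        for name, child in value.items():
            fill(ET.SubElement(parent, name), child)
    elif isinstance(value, list):
        # Each array element is enclosed in its own <item> tag.
        for element in value:
            fill(ET.SubElement(parent, "item"), element)
    elif value is not None:
        parent.text = str(value)


def json_to_xml(text):
    """Map the root JSON object to /html/body and convert everything below it."""
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    fill(body, json.loads(text))
    return html


# Invented sample data, not the real output of any feed.
FEED = json.dumps({"channel": {"title": "Sample feed", "stories": [
    {"title": "First", "link": "https://example.com/1"},
    {"title": "Second", "link": "https://example.com/2"},
]}})

TREE = json_to_xml(FEED)
CHANNEL = TREE.find(".//channel")
TITLES = [item.find("title").text for item in CHANNEL.iter("item")]
```

After conversion, a Channel of //channel and an Item of //item select the two stories, and //title relative to each item gives its title.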

A '''Regex media source''' is a source of feed articles presented in some kind of non-tag based text format.
For example, the US National Weather Service has a text-based page for weather forecasts of major US cities
at
<pre>
 http://forecast.weather.gov/product.php?site=NWS&
 issuedby=04&product=SCS&format=txt&
 version=1&glossary=0
</pre>
Changing the 04 above to 03, 02, or 01 varies the group of cities. Most of the data on this page appears in a pre tag as text. ''Channel'' in this case would be a regex whose first capture group corresponds to the contents of this pre tag. We might want to get one item per line from the pre tag, as each line corresponds to the weather for one city. The ''Item Separator'' is a regex used to split the results of the Channel operation into items. Finally, ''Title'', ''Description'', and ''Link'' are regexes, each with one capture group, used to get these respective feed item components out of an item produced by the splitting process above. Hence, a reasonable choice of values for the weather service page might be:
<pre>
 Name: National Weather Service 04
 URL: http://forecast.weather.gov/product.php?
 site=NWS&issuedby=04&product=SCS&format=txt&
 version=1&glossary=0
 Language: English
 Category: weather
 Channel: /<pre(?:.+?)>([^<]+)/m
 Item: /\n/
 Title: /^(.+?)\s\s\s+/
 Description: /\s\s\s+(.+?)$/
 Link: http://www.weather.gov/
</pre>
Notice in the above that the Link element is http://www.weather.gov/. If a feed
doesn't provide links for individual items, you can always provide a link to some
fixed site by directly entering a URL in the Link field.
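The channel, split, and extract sequence can be sketched in Python. The sample text below is invented and only imitates the layout of the weather page, the function names are hypothetical, and the PHP-style /.../ delimiters and m flag are translated into Python's re module by hand.

```python
import re

# Invented sample in the spirit of the weather page; not real forecast data.
PAGE = (
    '<html><body><pre class="sample">\n'
    "ALBANY NY      SUNNY 75/50\n"
    "BOSTON MA      CLOUDY 70/55\n"
    "</pre></body></html>"
)

CHANNEL = r"<pre(?:.+?)>([^<]+)"
ITEM_SEPARATOR = r"\n"
TITLE = r"^(.+?)\s\s\s+"
DESCRIPTION = r"\s\s\s+(.+?)$"
LINK = "http://www.weather.gov/"  # fixed URL instead of a regex


def first_group(pattern, text):
    """Return the first capture group of pattern in text, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def scrape(page):
    # Channel: the first capture group is the text inside the pre tag.
    channel = re.search(CHANNEL, page, re.MULTILINE)
    if channel is None:
        return []
    items = []
    # Item Separator: split the channel text into one chunk per line.
    for chunk in re.split(ITEM_SEPARATOR, channel.group(1)):
        title = first_group(TITLE, chunk)
        description = first_group(DESCRIPTION, chunk)
        if title and description:
            items.append((title, description, LINK))
    return items


ITEMS = scrape(PAGE)
```

Each tuple in ITEMS pairs a city with its forecast text. Chunks that do not contain a run of three or more whitespace characters, such as the empty first and last lines, are skipped in this sketch.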

Not all feeds use the same tag to specify the image associated with a news item. The Image XPath allows you to specify, relative to a news item (either RSS or HTML), where an image thumbnail exists. If a site does not use such thumbnails, one can prefix the path with ^ to give a path, relative to the root of the whole file, to where a thumbnail for the news source exists. Yioop automatically removes escaping from RSS containing escaped HTML when computing this. For example, the following works for the feed:
<pre>
 https://feeds.wired.com/wired/index
 //description/div[contains(@class, "rss_thumbnail")]/img/@src
</pre>

A '''Feed Podcast source''' is an RSS or Atom source where each item contains a link to a podcast or video podcast. For example,
 http://feed.cnet.com/feed/podcast/all/hd.xml
The '''Alternative Link Tag''' field is used to say the XPath within the feed item to the link for the audio or video file. For the CNet example, this is:
 enclosure
If it is blank the default link tag is used. The media updater job when run checks if any items in the feed are new. If so, it downloads them to the wiki resource folder of the wiki page provided in the '''Wiki Destination''' field. This page is given in the format GroupName@PageName. If you give just PageName, the Public group is assumed. The '''Expires''' field controls how long a feed item is kept before it is deleted.
For example, if we wanted to download the popular Ted talk podcasts into the Ted subfolder of the resource folder of the News and Podcasts wiki page of the Library group, and have podcasts expire after one month, we could do:
<pre>
 Name: Ted
 URL: https://pa.tedcdn.com/feeds/talks.rss
 Language: English
 Expires: One Month
 Alternative Link Tag: enclosure
 Wiki Destination: Library@News and Podcasts/Ted/%Y-%m-%d %F
</pre>
Notice the string has "%Y-%m-%d %F" in it. This portion of the destination gives the format of the filename to use when storing a downloaded podcast file. It says to name the file as the current year, hyphen, month, hyphen, day, a space, and then the filename as given in the URL. %F stands for the filename; the other % modifiers can be standard date formatting instructions.
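How such a destination string might be taken apart and expanded can be sketched in Python. This is an assumption-laden illustration: the function names are hypothetical, the split of the text after @ into a page name and a filename format at the first slash is a guess at the convention, and Python's strftime stands in for whatever date formatting Yioop uses internally.

```python
from datetime import datetime
from posixpath import basename
from urllib.parse import urlparse


def split_destination(destination):
    """Split GroupName@PageName/format into its parts (Public if no group)."""
    group, separator, rest = destination.partition("@")
    if not separator:
        group, rest = "Public", destination
    # Assumption: the page name runs up to the first slash.
    page, _, name_format = rest.partition("/")
    return group, page, name_format


def expand_name(name_format, media_url, when):
    """Fill in the date modifiers and replace %F with the URL's filename."""
    filename = basename(urlparse(media_url).path)
    # Handle %F separately so strftime never sees it.
    parts = [when.strftime(part) if part else ""
             for part in name_format.split("%F")]
    return filename.join(parts)


GROUP, PAGE, NAME_FORMAT = split_destination(
    "Library@News and Podcasts/Ted/%Y-%m-%d %F")
STORED_AS = expand_name(
    NAME_FORMAT, "https://example.com/media/talk42.mp4", datetime(2024, 3, 5))
```

With the invented URL and date above, STORED_AS is Ted/2024-03-05 talk42.mp4 inside the resource folder of the News and Podcasts page of the Library group.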

Yioop supports the downloading of single video or audio file sources, as well as more complicated stream sources such as m3u8 streams.

A '''Scrape podcast source''' is like a '''Feed Podcast source''', but where one has an HTML or XML page which has a periodically updated link to a video or audio source. For example, it might be an evening news web site.
The '''URL''' field should be the page with the periodically updated link. The '''Aux Url XPaths''' field, if not blank, should be a sequence of XPaths or Regexes one per line. The first line will be applied to the page to obtain a next url to download. The next line's XPath or Regex is applied to this file and so on. The final url generated should be to the HTML or XML page that contains the media source for that day. Finally, on the page for the given day, '''Download XPath''' should be the XPath of the url of the video or audio file to download.
If a regex is used rather than an XPath, then the first capture group of the regex should give the url. A regex can be followed by json| to indicate that the first capture group should be converted to a JSON object. The |-separated names after json| then give a path through the sub-objects of this object to a url. As an example, consider the following, which at some point could download the Daily News Scrape Podcast to a wiki group:

<pre>
 Type: Scrape Podcast
 Name: Daily News Podcast
 URL: https://www.somenetwork.com/daily-news
 Language: English
 Aux Url XPaths:
 /(https\:\/\/cdn.somenetwork.com\/daily-news\/video\/daily-[^\"]+)\"/
 /window\.\_\_data\s*\=\s*([^\n]+\}\;)/json|video|current|0|publicUrl
 Download XPath: //video[contains(@height,'540')]
 Wiki Destination: My Private Group@Podcasts/%Y-%m-%d.mp4
</pre>

The initial page to be downloaded will be https://www.somenetwork.com/daily-news. On this page, we use the first Aux Url XPath to find a string that matches /(https\:\/\/cdn.somenetwork.com\/daily-news\/video\/daily-[^\"]+)\"/. The text matched between the parentheses is the first capture group and will be the next url to download. So, for example, one might get a url:
 https://cdn.somenetwork.com/daily-news/video/daily-safghdsjfg
This url is then downloaded and a string matching the pattern /window\.\_\_data\s*\=\s*([^\n]+\}\;)/ is found. The capture group portion of this string, which is what matches ([^\n]+\}\;), is then converted to a JSON object because of the json| in the Aux Url XPath. From this JSON object, we look at the video field, then its current subfield, then that subfield's 0 entry, and finally the publicUrl field. This is the url we download next. Lastly, the Download XPath is used to get the final video link from this downloaded page.
Once this video is downloaded, it is stored in the resource folder of the Podcasts page of the My Private Group wiki group in a file with a name in the format %Y-%m-%d.mp4.
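The chain of regex steps can be sketched in Python. This is not Yioop's implementation: the function names are hypothetical, the two sample pages are invented and served from a dictionary instead of the network, XPath steps are left out, and stripping the trailing semicolon before decoding the JSON is an assumption of this sketch.

```python
import json
import re


def apply_regex_step(step, content):
    """Apply one /regex/ or /regex/json|a|b step and return the url found."""
    fields = []
    if "/json|" in step:
        # Text after json| is a |-separated path into the decoded object.
        step, _, tail = step.rpartition("/json|")
        step += "/"
        fields = tail.split("|")
    match = re.search(step[1:-1], content)  # drop the / delimiters
    if match is None:
        return None
    value = match.group(1)
    if fields:
        # Assumption: a trailing semicolon is removed before decoding.
        value = json.loads(value.rstrip(";"))
        for name in fields:
            value = value[int(name)] if isinstance(value, list) else value[name]
    return value


def follow(start_url, steps, fetch):
    """Download start_url, then apply each step to the page the last one found."""
    url = start_url
    for step in steps:
        url = apply_regex_step(step, fetch(url))
        if url is None:
            return None
    return url


# Invented page contents keyed by URL.
PAGES = {
    "https://www.somenetwork.com/daily-news":
        '<a href="https://cdn.somenetwork.com/daily-news/video/daily-abc123">'
        "Watch</a>",
    "https://cdn.somenetwork.com/daily-news/video/daily-abc123":
        'window.__data = {"video": {"current": '
        '[{"publicUrl": "https://www.somenetwork.com/player/abc123"}]}};\n',
}

STEPS = [
    r'/(https\:\/\/cdn.somenetwork.com\/daily-news\/video\/daily-[^\"]+)\"/',
    r'/window\.\_\_data\s*\=\s*([^\n]+\}\;)/json|video|current|0|publicUrl',
]

FINAL_URL = follow("https://www.somenetwork.com/daily-news", STEPS, PAGES.get)
```

FINAL_URL is the page on which the Download XPath would then be evaluated to find the actual video file.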

A '''Trending value source''' is a value on a web page that one would like to track using Yioop's trending search mechanism. The Name field is the name to use for the trending value. The URL field should be the page with the periodically updated value. '''Category''' should be the trends category (a collection of trending values) one would like to track this value with. '''Group Within Category''' is the default name of the key that will be associated with the value found on this page. '''Trend Value Regex''' is a regular expression to match against the page downloaded from the URL. If it matches and the expression has one capture group, then that capture group will be used as the value for a particular download time. If it has two or more capture groups, the first two capture groups are used to give a key name, value pair for a particular download time. As an example,

<pre>
 Name: Yioop Ticker
 URL: https://my-great-stock-quotes/yioop
 Language: English
 Category: stocks
 Group Within Category: Yioop Price
 Trend Value Regex: /Yioop\:\s+(\d+\.\d+)/
</pre>

Here there is only one capture group, (\d+\.\d+). Searching on trending:stocks, one would see the hourly, weekly, etc. values for the trending values in that category. One such row would be Yioop Price, whose values would be computed from the numbers extracted by this regex's (\d+\.\d+) capture group.
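The one-group versus two-group rule can be sketched in Python. The function name and sample texts are invented for illustration, and the regex is written without the surrounding / delimiters used in the form.

```python
import re


def extract_trend(page_text, pattern, default_key):
    """Return a (key, value) pair for one download, or None if nothing matches."""
    match = re.search(pattern, page_text)
    if match is None:
        return None
    groups = match.groups()
    if len(groups) == 1:
        # One capture group: it is the value, stored under the default key.
        return default_key, groups[0]
    if len(groups) >= 2:
        # Two or more groups: the first two give the key name and the value.
        return groups[0], groups[1]
    return None


ONE_GROUP = extract_trend(
    "Yioop: 12.34 USD", r"Yioop\:\s+(\d+\.\d+)", "Yioop Price")
TWO_GROUPS = extract_trend(
    "Acme: 7.50 USD", r"(\w+)\:\s+(\d+\.\d+)", "Yioop Price")
```

ONE_GROUP uses the Group Within Category name as its key, while TWO_GROUPS takes its key from the page itself.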

A '''Description Source''' is used to update the description of wiki page resources based on the resource's name. The '''Name''' field gives a name to this description source. The '''URL''' field provides the url of a web page, along with any required query parameters, used to look up a resource by its name. The '''Language''' field specifies the locale to be used at the search site, provided the site supports it. The '''Path Terms''' field specifies a comma separated list of terms to check against the resource. If any of the path terms are contained in the wiki page name, resource path, or resource item's mimetype (both major and whole mimetype), the description source will be used. The '''Info XPaths''' field specifies the HTML tags containing the information to be collected as the description of the resource. The '''Item XPath''' field specifies a tag name, and optionally an attribute with a value, that uniquely identifies the HTML elements that completely contain all the details of a single search result; mostly this will be a <tr> tag. The '''Title XPath''' field specifies, in the same format as '''Item XPath''', the HTML tag within the '''Item XPath''' that contains the text of the title of the search result. The '''Url XPath''' field specifies the HTML tag within '''Item XPath''' that contains the URL of the details page for the search result. The '''Test Values''' field provides test values to be used in the test mode of the description source. Below is an example of a description source for the IMDB site:
<pre>
 Name: IMDB
 URL: https://www.imdb.com/find?q=
 Language: English
 Path Terms: TV Shows, Movies, Video
 Info Xpaths:
 Year/Rating | //ul[contains(@data-testid,'hero-title-block__metadata')]/li/a
 Plot | //span[contains(@data-testid,'plot-l')]
 Genres | //a[contains(@class,'ipc-chip')]
 Item Xpath: //li[contains(@class,'find-result-item')]
 Title Xpath: //a[contains(@class,'pc-metadata-list-summary-item__t')]
 Url Xpath: //a[contains(@class,'ipc-metadata-list-summary-item__t')]/@href
 Test Values:
 Brahmastra Part One.mp4
 House of the Dragon.mp4
</pre>
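The Path Terms check described above can be sketched in Python. The function name and sample values are invented, and comparing case-insensitively is an assumption of this sketch rather than documented behavior.

```python
def uses_description_source(path_terms, page_name, resource_path, mimetype):
    """Decide whether any path term occurs in the page name, path, or mimetype."""
    major = mimetype.split("/", 1)[0]
    haystacks = [page_name, resource_path, major, mimetype]
    terms = [term.strip().lower()
             for term in path_terms.split(",") if term.strip()]
    # Assumption: matching ignores letter case.
    return any(term in hay.lower() for term in terms for hay in haystacks)


MOVIE = uses_description_source(
    "TV Shows, Movies, Video", "Media",
    "Movies/Brahmastra Part One.mp4", "video/mp4")
NOTES = uses_description_source(
    "TV Shows, Movies, Video", "Notes", "docs/readme.txt", "text/plain")
```

MOVIE is true because both the resource path and the mimetype contain a path term, while NOTES is false because none of the terms occur anywhere.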