Media Sources
'''Media Sources''' are used to specify how Yioop should handle news feeds, podcasts, and trending value sites. The '''Add Media Source''' form lets you add new media sources. What this form looks like depends on the value chosen in the '''Type''' dropdown. Below we describe the form for each of the possible choices of type:
<br />
An '''RSS media source''' can be used to add an RSS or Atom feed (Yioop auto-detects which kind) to the list of feeds that are downloaded hourly when Yioop's Media Updater is turned on. Besides the name, you need to specify the URL of the feed in question. The Category field should usually be left at news. If you want to specify additional categories such as weather or sports, you typically want to create a mix that searches the default index with the keyword media:your_category injected, and then make a new subsearch with that mix. This will allow your new category to show up on the Tools/More/Other Searches page.
<br />
An '''HTML media source''' is a web page that has feed articles like an RSS page and that you want the Media Updater to scrape on an hourly basis. To specify where in the HTML page the news items appear, you provide XPath information. For example,
<pre>
Name: Cape Breton Post
URL: http://www.capebretonpost.com/News/Local-1968
Language: English
Category: news
Channel: //div[contains(@class, "channel")]
Item: //article
Title: //a
Description: //div[contains(@class, "dek")]
Link: //a
</pre>
The Channel field is used to specify the tag that encloses all the news items. Relative to this as the root tag, //article gives the path to an individual news item. Then, relative to an individual news item, //a gets the title, etc. Link extracts the href attribute of that same //a.
<br />
A '''JSON media source''' is used to scrape feed articles from JSON data such as might be provided by a website's API. To handle a JSON media source, you provide the same information as with an HTML media source.
Internally, Yioop converts all JSON sources to XML before processing. The root object maps to /html/body. A property ''foo'' of the root object is mapped to a tag <foo>. Array elements are mapped to a sequence of elements enclosed in <item> tags. The process is applied recursively until the JSON object is completely converted to an XML page. Once this is done, the XPaths that a user provides are used to extract the feed items in the same way as for HTML feeds. As an example, Yioop search results and discussion groups can be output as JSON. To take Yioop's news feed and use it as a JSON media source in your search engine, you could use the settings:
<pre>
Name: Yioop News
URL: https://www.yioop.com/s/news?f=json
Language: English
Category: news
Channel: //channel
Item: //item
Title: //title
Description: //description
Link: //link
</pre>
<br />
A '''Regex media source''' is a source of feed articles presented in some kind of non-tag-based text format. For example, the US National Weather Service has a text-based page for weather forecasts of major US cities at
<pre>
http://forecast.weather.gov/product.php?site=NWS&issuedby=04&product=SCS&format=txt&version=1&glossary=0
</pre>
Changing the 04 above to 03, 02, or 01 varies the group of cities. Most of the data on this page appears in a pre tag as text. ''Channel'' in this case would be a regex whose first capture group corresponds to the contents of this pre tag. We might want to get one item per line from the pre tag, as each line corresponds to the weather for one city. The ''Item Separator'' is a regex used to split the results of the Channel operation into items. Finally, ''Title'', ''Description'', and ''Link'' are regexes, each with one capture group, used to get these respective feed item components out of an item produced by the splitting process above.
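As a rough illustration of the Channel, Item Separator, Title, and Description steps just described, here is a minimal Python sketch. It is not Yioop's implementation (Yioop itself is written in PHP). The sample page text, the choice of a newline as item separator, and the helper names first_group and extract_items are all assumptions made for this example.

```python
import re

# Hypothetical page text; the layout of the real forecast page may differ.
SAMPLE_PAGE = (
    '<html><body><pre class="product">'
    "ALBANY NY        SUNNY 45/62\n"
    "BOSTON MA        RAIN 50/58\n"
    "</pre></body></html>"
)

CHANNEL = r"<pre(?:.+?)>([^<]+)"   # first capture group = contents of the pre tag
ITEM_SEPARATOR = r"\n"             # assumption: one feed item per line
TITLE = r"^(.+?)\s\s\s+"           # text before the first run of 3+ whitespace
DESCRIPTION = r"\s\s\s+(.+?)$"     # text after that run, up to the end
LINK = "http://www.weather.gov/"   # fixed URL used in place of a regex


def first_group(pattern, text):
    """Return the first capture group of pattern in text, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def extract_items(page):
    """Apply Channel, then split into items, then pull out each field."""
    channel = first_group(CHANNEL, page)
    if channel is None:
        return []
    items = []
    for chunk in re.split(ITEM_SEPARATOR, channel):
        title = first_group(TITLE, chunk)
        description = first_group(DESCRIPTION, chunk)
        if title and description:  # skip blank or unparsable lines
            items.append(
                {"title": title, "description": description, "link": LINK}
            )
    return items


if __name__ == "__main__":
    for item in extract_items(SAMPLE_PAGE):
        print(item)
```

Each stage only sees the output of the previous one, which is why Title and Description can be written as simple line-level patterns.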
Hence, a reasonable choice of values for the weather service page might be:
<pre>
Name: National Weather Service 04
URL: http://forecast.weather.gov/product.php?site=NWS&issuedby=04&product=SCS&format=txt&version=1&glossary=0
Language: English
Category: weather
Channel: /<pre(?:.+?)>([^<]+)/m
Item: /\n/
Title: /^(.+?)\s\s\s+/
Description: /\s\s\s+(.+?)$/
Link: http://www.weather.gov/
</pre>
Notice in the above that the Link element is http://www.weather.gov/. If a feed doesn't provide links for individual items, you can always provide a link to some fixed site by directly entering a URL in the Link field.
<br />
Not all feeds use the same tag to specify the image associated with a news item. The Image XPath allows you to specify, relative to a news item (either RSS or HTML), where an image thumbnail exists. If a site does not use such thumbnails, one can prefix the path with ^ to give the path, relative to the root of the whole file, to where a thumbnail for the news source exists. Yioop automatically removes escaping from RSS containing escaped HTML when computing this. For example, the following works for the feed:
<pre>
https://feeds.wired.com/wired/index
//description/div[contains(@class, "rss_thumbnail")]/img/@src
</pre>
<br />
A '''Feed Podcast source''' is an RSS or Atom source where each item contains a link to a podcast or video podcast. For example, http://feed.cnet.com/feed/podcast/all/hd.xml. The '''Alternative Link Tag''' field gives the XPath within the feed item to the link for the audio or video file. For the CNet example, this is enclosure. If it is blank, the default link tag is used. When the media updater job runs, it checks if any items in the feed are new. If so, it downloads them to the wiki resource folder of the wiki page provided in the '''Wiki Destination''' field. This page is given in the format GroupName@PageName. If you give just PageName, the Public group is assumed.
The '''Expires''' field controls how long a feed item is kept before it is deleted. For example, if we wanted to download the popular Ted talk podcasts into the Ted subfolder of the resource folder of the Example Podcast wiki page of the Public group, where we have podcasts expire after 1 month, we could do:
<pre>
Name: Ted
URL: https://pa.tedcdn.com/feeds/talks.rss
Language: English
Expires: One Month
Alternative Link Tag: enclosure
Wiki Destination: Library@News and Podcasts/Ted/%Y-%m-%d %F
</pre>
Notice the string has "%Y-%m-%d %F" in it. This portion of the destination gives the format of the filename to use when storing a downloaded podcast file. It says to name the file as the current year, a hyphen, the month, a hyphen, the day, a space, and then the filename as given in the URL. %F stands for the filename; the other % modifiers can be standard date formatting instructions.
<br />
Yioop supports the downloading of single video or audio file sources, as well as more complicated stream sources such as m3u8 streams.
<br />
A '''Scrape podcast source''' is like a '''Feed Podcast source''', but where one has an HTML or XML page which has a periodically updated link to a video or audio source. For example, it might be an evening news web site. The '''URL''' field should be the page with the periodically updated link. The '''Aux Url XPaths''' field, if not blank, should be a sequence of XPaths or regexes, one per line. The first line will be applied to the page to obtain the next url to download. The next line's XPath or regex is applied to this file, and so on. The final url generated should be to the HTML or XML page that contains the media source for that day. Finally, on the page for the given day, '''Download XPath''' should be the XPath of the url of the video or audio file to download. If a regex is used rather than an XPath, then the first capture group of the regex should give the url. A regex can be followed by json| to indicate the first capture group should be converted to a json object.
Further |-separated names after json| then give a path through sub-objects of this object to a url. As an example, consider the following settings, which, at some point, could download the Daily News Podcast to a wiki group:
<pre>
Type: Scrape Podcast
Name: Daily News Podcast
URL: https://www.somenetwork.com/daily-news
Language: English
Aux Url XPaths:
/(https\:\/\/cdn.somenetwork.com\/daily-news\/video\/daily-[^\"]+)\"/
/window\.\_\_data\s*\=\s*([^\n]+\}\;)/json|video|current|0|publicUrl
Download XPath: //video[contains(@height,'540')]
Wiki Destination: My Private Group@Podcasts/%Y-%m-%d.mp4
</pre>
The initial page to be downloaded will be https://www.somenetwork.com/daily-news. On this page, we use the first Aux Url XPath to find a string in the page that matches /(https\:\/\/cdn.somenetwork.com\/daily-news\/video\/daily-[^\"]+)\"/. The text matched between the parentheses is the first capture group and will be the next url to download. So, for example, one might get the url https://cdn.somenetwork.com/daily-news/video/daily-safghdsjfg. This url is then downloaded and a string matching the pattern /window\.\_\_data\s*\=\s*([^\n]+\}\;)/ is found. The capture group portion of this string, that is, what matches ([^\n]+\}\;), is then converted to a JSON object because of the json| in the Aux Url XPath. From this JSON object, we look at the video field, then its current subfield, then that subfield's 0 subfield, and finally the publicUrl field. This is the url we download next. Lastly, the Download XPath is used to get the final video link from this downloaded page. Once this video is downloaded, it is stored in the Podcasts page's resource folder of the My Private Group wiki group in a file with a name in the format %Y-%m-%d.mp4.
<br />
A '''Trending value source''' is a value on a web page that one would like to track using Yioop's trending search mechanism. The Name field is the name to use for the trending value. The URL field should be the page with the periodically updated value.
'''Category''' should be the trends category (a collection of trending values) with which one would like to track this value. '''Group Within Category''' is the default name of the key that will be associated with the value found on this page. '''Trend Value Regex''' is a regular expression to match against the contents downloaded from the URL. If it matches and the expression has one capture group, then that capture group will be used as the value for a particular download time. If it has two or more capture groups, the first two capture groups are used to give a key name, value pair for a particular download time. As an example,
<pre>
Name: Yioop Ticker
URL: https://my-great-stock-quotes/yioop
Language: English
Category: stocks
Group Within Category: Yioop Price
Trend Value Regex: /Yioop\:\s+(\d+\.\d+)/
</pre>
Here there is only one capture group, (\d+\.\d+). Searching on trending:stocks, one would see the hourly, weekly, etc. values for the trending values in that category. One such row would be Yioop Price, whose values would be computed from the numbers extracted by this regex's (\d+\.\d+) capture group.
<br />
A '''Description Source''' is used to update the description of wiki page resources based on the resource's name. The '''Name''' field is used to give a name to this search source. The '''URL''' field is used to provide the URL of the web page, along with any required query parameters, used to look up a resource by its name. The '''Language''' field is used to specify the locale to be used at the search site, provided the site supports it. The '''Path Terms''' field is used to specify a comma-separated list of terms to check against the resource. If any of the path terms are contained in the wiki page name, resource path, or resource item's mimetype (both major and whole mimetype), the description source will be used. The '''Info XPaths''' field is used to specify the HTML tags containing the information to be collected as the description of the resources.
The '''Item XPath''' field is used to specify the tag name, and optionally an attribute with a value, that uniquely identifies the HTML elements that completely contain all the details of a single search result; most often this will be a <tr> tag. The '''Title XPath''' field is used to specify, in the same format as '''Item XPath''', the HTML tag within the '''Item XPath''' that contains the text of the title of the search result. The '''Url XPath''' field is used to specify the HTML tag within '''Item XPath''' that contains the URL of the details page for the search result. The '''Test Values''' field is used to provide test values to be used in the test mode of the search source. Below is an example of a search source for the IMDB site:
<pre>
Name: IMDB
URL: https://www.imdb.com/find?q=
Language: English
Path Terms: TV Shows, Movies, Video
Info Xpaths:
Year/Rating | //ul[contains(@data-testid,'hero-title-block__metadata')]/li/a
Plot | //span[contains(@data-testid,'plot-l')]
Genres | //a[contains(@class,'ipc-chip')]
Item Xpath: //li[contains(@class,'find-result-item')]
Title Xpath: //a[contains(@class,'pc-metadata-list-summary-item__t')]
Url Xpath: //a[contains(@class,'ipc-metadata-list-summary-item__t')]/@href
Test Values:
Brahmastra Part One.mp4
House of the Dragon.mp4
</pre>
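To make the '''Path Terms''' rule for a Description Source more concrete, here is a minimal Python sketch of one way such a check could work. It is not Yioop's code. The function name description_source_applies, the case-insensitive substring comparison, and the sample arguments are assumptions made for this illustration; Yioop's actual matching rules may differ.

```python
def description_source_applies(path_terms, page_name, resource_path, mime_type):
    """Return True if any comma-separated path term occurs in the wiki page
    name, the resource path, or the resource's mimetype.

    Assumption: matching is a case-insensitive substring test.
    """
    terms = [t.strip().lower() for t in path_terms.split(",") if t.strip()]
    # The major type is the part before the slash, e.g. "video" in "video/mp4".
    # It is listed separately to mirror the text; with substring matching it
    # is already covered by the whole mimetype.
    major_type = mime_type.split("/", 1)[0]
    candidates = [page_name, resource_path, mime_type, major_type]
    return any(
        term in candidate.lower() for term in terms for candidate in candidates
    )


if __name__ == "__main__":
    terms = "TV Shows, Movies, Video"
    # Matches because the page name contains "Movies".
    print(description_source_applies(terms, "Movies", "Dune.mp4", "video/mp4"))
    # No term appears in the name, path, or mimetype.
    print(description_source_applies(terms, "Notes", "todo.txt", "text/plain"))
```

Under these assumptions, a resource with a video mimetype is matched by the Video term even when neither the page name nor the path mentions it.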