Human-readable URLs (SEF), routing, and a single entry point in PHP

Loading Url

Loading a url is very similar to the way you would load the html from a file.


// Assuming you installed from Composer:
require "vendor/autoload.php";
use PHPHtmlParser\Dom;

$dom = new Dom;
$dom->loadFromUrl('http://google.com');
$html = $dom->outerHtml;


loadFromUrl will, by default, use an implementation of the PSR-18 \Psr\Http\Client\ClientInterface to perform the HTTP request and a default implementation of the PSR-7 \Psr\Http\Message\RequestInterface to create the body of the request. You can easily implement your own version of either the client or the request to use a custom HTTP connection when using loadFromUrl.

// Assuming you installed from Composer:
require "vendor/autoload.php";
use PHPHtmlParser\Dom;
use App\Services\MyClient;

$dom = new Dom;
$dom->loadFromUrl('http://google.com', null, new MyClient());
$html = $dom->outerHtml;

As long as the client object properly implements the client interface, it will be used to get the content of the url.
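For illustration, here is a minimal sketch of such a client. It assumes the PSR-18 and PSR-7 interfaces mentioned above and the guzzlehttp/psr7 package for building the response; the class name MyClient and the use of file_get_contents() are purely illustrative, not part of the library.

// Hypothetical client used in the example above (a sketch, not part of the library).
// In the earlier example this class would live in the App\Services namespace.
use Psr\Http\Client\ClientInterface;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;
use GuzzleHttp\Psr7\Response;

class MyClient implements ClientInterface
{
    public function sendRequest(RequestInterface $request): ResponseInterface
    {
        // Fetch the URL with a plain stream call instead of cURL.
        $body = file_get_contents((string) $request->getUri());

        // Wrap the result in a PSR-7 response for the parser to consume.
        return new Response(200, [], $body === false ? '' : $body);
    }
}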

Options

You can also set parsing options that affect the behavior of the parsing engine. You can set a global option set by passing an Options object to the setOptions() method of the Dom object, or an instance-specific option by passing an Options object to the load method as an extra (optional) parameter.

// Assuming you installed from Composer:
require "vendor/autoload.php";
use PHPHtmlParser\Dom;
use PHPHtmlParser\Options;

$dom = new Dom;
$dom->setOptions(
    // this is set as the global option level.
    (new Options())
        ->setStrict(true)
);

$dom->loadFromUrl('http://google.com', 
    (new Options())->setWhitespaceTextNode(false) // only applies to this load.
);

$dom->loadFromUrl('http://gmail.com'); // will not have whitespaceTextNode set to false.

At the moment we support 12 options.

Strict

Strict, by default false, will make the parser throw an exception if it finds that the html is not strictly compliant (all tags must have a closing tag, no attribute without a value, etc.).

whitespaceTextNode

The whitespaceTextNode option, by default true, tells the parser to save text nodes even if the content of the node is empty (only whitespace). Setting it to false will ignore all whitespace-only text nodes found in the document.

enforceEncoding

The enforceEncoding option, by default null, will enforce a character set to be used for reading the content and returning the content in that encoding. Setting it to null will trigger an attempt to figure out the encoding from within the content of the string given instead.

cleanupInput

Set this to false to skip the entire clean-up phase of the parser. If this is set to false, the next three options will be ignored. Defaults to true.

removeScripts

Set this to false to skip removing the script tags from the document body. This might have adverse effects. Defaults to true.

removeStyles

Set this to false to skip removing the style tags from the document body. This might have adverse effects. Defaults to true.

preserveLineBreaks

Preserves line breaks if set to true. If set to false, line breaks are cleaned up as part of the input clean-up process. Defaults to false.

removeDoubleSpace

Set this to false if you want to preserve whitespace inside of text nodes. It is set to true by default.

removeSmartyScripts

Set this to false if you want to preserve smarty scripts found in the html content. It is set to true by default.

htmlSpecialCharsDecode

By default this is set to false. Setting this to true will apply the PHP function htmlspecialchars_decode() to all attribute values and text nodes.

selfClosing

This option contains an array of all self-closing tags. These tags must be self-closing, and the parser will force them to be so if you have strict turned on. You can update this list with any additional tags that should be treated as self-closing when using strict. You can also remove tags from this array or clear it out completely.

noSlash

This option contains an array of all tags that cannot be self-closing. The list starts off empty, but you can add elements as you wish.
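To tie the options together, here is an illustrative snippet. setStrict() and setWhitespaceTextNode() appear in the examples above; the other set<OptionName>() setters below are assumed to follow the same naming pattern and should be checked against the library's Options class.

// Illustrative only: setter names other than setStrict()/setWhitespaceTextNode() are assumed.
use PHPHtmlParser\Dom;
use PHPHtmlParser\Options;

$options = (new Options())
    ->setStrict(false)                 // don't throw on non-compliant html
    ->setWhitespaceTextNode(false)     // drop whitespace-only text nodes
    ->setRemoveScripts(true)           // strip <script> tags during the cleanup phase
    ->setHtmlSpecialCharsDecode(true); // decode entities in attribute values and text nodes

$dom = new Dom;
$dom->setOptions($options); // applied globally, as in the Options example above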

Urls Explained

What’s a URI?

Uniform Resource Identifiers (URIs) are used to identify ‘names’ or ‘resources’. They come in 2 varieties: URNs and URLs. In fact, a URI can be both a name and a locator!

What’s a URL?

Uniform Resource Locators (URLs) provide a way to locate a resource using a specific scheme, most often but not limited to HTTP. Just think of a URL as an address to a resource, and the scheme as a specification of how to get there.

What’s the syntax of a URI?

scheme:scheme-specific-part?query#fragment
  • Examples from the RFC:
  • ftp://ftp.is.co.za/rfc/rfc1808.txt
  • http://www.ietf.org/rfc/rfc2396.txt
  • ldap:///c=GB?objectClass?one
  • news:comp.infosystems.www.servers.unix
  • tel:+1-816-555-1212
  • telnet://192.0.2.16:80/
  • urn:oasis:names:specification:docbook:dtd:xml:4.1.2

What’s the syntax of a URL?

scheme://username:password@subdomain.domain.tld:port/path/file-name.suffix?query-string#hash
  • Examples:
  • http://www.google.com
  • http://foo:bar@w1.superman.com/very/long/path.html?p1=v1&p2=v2#more-details
  • https://secured.com:443
  • ftp://ftp.bogus.com/~some/path/to/a/file.txt
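To see these pieces programmatically, PHP's built-in parse_url() function splits a URL string into its components. A quick illustration using the second example above:

// Split a URL into its components with PHP's built-in parse_url():
$parts = parse_url('http://foo:bar@w1.superman.com/very/long/path.html?p1=v1&p2=v2#more-details');

print_r($parts);
// Array
// (
//     [scheme] => http
//     [host] => w1.superman.com
//     [user] => foo
//     [pass] => bar
//     [path] => /very/long/path.html
//     [query] => p1=v1&p2=v2
//     [fragment] => more-details
// )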

What’s the syntax of a URN?

urn:namespace-identifier:namespace-specific-string
  • Examples from Wikipedia:
  • urn:isbn:0451450523
  • urn:ietf:rfc:2648
  • urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66

What’s the ‘userinfo’ in a URL?

The userinfo part of a URL is made of the username and/or the password. They are optional and used for authentication purposes. The userinfo has the format username:password and is followed by the @ character and the host. The password is optional; omitting it often results in the user interface prompting for a password.

  • Examples:
  • ftp://username:password@host.com/
  • ftp://username@host.com/

What’s the ‘authority’ in a URL?

The authority of a URL is made of the userinfo, the hostname and the port. The userinfo and port are optional. When the port is not present, a default port for the specific scheme is assumed. For example port 80 for http or 443 for https.

  • Examples:
  • username:password@host.com/
  • subdomain.domain.com
  • www.superaddress.com:8080

What’s the ‘fragment’ in a URL?

Also known as a hash, the fragment is a pointer to a secondary resource within the first resource. It follows the # character.

http://www.foo.bar/?listings.html#section-2

What’s the ‘path’ in a URL?

The path of a URL is made of segments that represent a structured hierarchy. Each segment is separated by the / character. You can think of a path as a directory structure.

  • Example:
  • http://www.foo.bar/segment1/segment2/some-resource.html
  • http://www.foo.bar/image-2.html?w=100&h=50
  • ftp://ftp.foo.bar/~john/doe?w=100&h=50

What’s the ‘query string’ in a URL?

The query contains extra information, usually in key=value pair format. Each pair is usually separated by an ampersand (&) character. The query follows the ? character.

  • Examples:
  • http://www.foo.bar/image.jpg?height=150&width=100
  • https://www.secured.com:443/resource.html?id=6e8bc430-9c3a-11d9-9669-0800200c9a66#some-header
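In PHP, a query string like the one in the first example above can be decomposed into key/value pairs with the built-in parse_str() function:

// Decompose a query string into key/value pairs with PHP's parse_str():
parse_str('height=150&width=100', $params);

print_r($params);
// Array
// (
//     [height] => 150
//     [width] => 100
// )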

Parsing questions and answers from search results

Questions and answers can also be pulled out of search results by hand. But why bother, when there is a template by Hannah Rampton?

This is one of the templates we use when looking for content ideas and writing briefs for copywriters. Analyzing the questions related to the main query lets you dig deeper into the topic and create intent-oriented content (see our article on the Neural Matching algorithm for details).

To export the questions and answers:

  • make a copy of the Google Q&A Extraction_v2 template;
  • install the free Scraper extension for Chrome (it scrapes data from web pages using XPath);
  • switch the search engine language from Russian to English in the settings (this is needed for the formulas in the template to work correctly).

Now let's parse the questions and answers:

in the window that opens, choose «XPath» in the «Selector» block and enter the query for parsing the expandable question/answer blocks into the field: //g-accordion-expander (make sure the Columns block is filled in the same way as in the screenshot); click «Scrape»;

  • after parsing, click «Copy to clipboard»;
  • open the template, go to the «Google Questions and Answers» sheet, put the cursor in cell A10 and press Ctrl+Shift+V.

If everything was done correctly, the fields with questions, answers and URLs will be filled in automatically.

The «Clean Data» sheet presents the same information in a more usable text format (duplicates are removed there as well).

On the «Search by Keyword» sheet you can look up questions by a given keyword (or part of it).

You can also filter questions by domain: to do this, enter the full URL or part of it on the «Search by Domain» sheet.

This way you can quickly find relevant questions on your topic, free of charge.

📋 Example

// Dependencies
const parseUrl = require("parse-url")

console.log(parseUrl("http://ionicabizau.net/blog"))
// { protocols: [ 'http' ],
//   protocol: 'http',
//   port: null,
//   resource: 'ionicabizau.net',
//   user: '',
//   pathname: '/blog',
//   hash: '',
//   search: '',
//   href: 'http://ionicabizau.net/blog' }

console.log(parseUrl("http://domain.com/path/name?foo=bar&bar=42#some-hash"))
// { protocols: [ 'http' ],
//   protocol: 'http',
//   port: null,
//   resource: 'domain.com',
//   user: '',
//   pathname: '/path/name',
//   hash: 'some-hash',
//   search: 'foo=bar&bar=42',
//   href: 'http://domain.com/path/name?foo=bar&bar=42#some-hash' }

// If you want to parse fancy Git urls, turn off the automatic url normalization
console.log(parseUrl("git+ssh://git@host.xz/path/name.git", false))
// { protocols: [ 'git', 'ssh' ],
//   protocol: 'git',
//   port: null,
//   resource: 'host.xz',
//   user: 'git',
//   pathname: '/path/name.git',
//   hash: '',
//   search: '',
//   href: 'git+ssh://git@host.xz/path/name.git' }

console.log(parseUrl("git@github.com:IonicaBizau/git-stats.git", false))
// { protocols: [],
//   protocol: 'ssh',
//   port: null,
//   resource: 'github.com',
//   user: 'git',
//   pathname: '/IonicaBizau/git-stats.git',
//   hash: '',
//   search: '',
//   href: 'git@github.com:IonicaBizau/git-stats.git' }

Unknown URLs.

Overview of unknown URL parsing.

Unknown scheme

The parser will attempt to parse any type of URL it encounters based on its scheme. However, not all URLs are parsable, for example «spotify:track:<trackid>». In this case, the following URL object is returned:

{
    scheme: "spotify:",
    url: "track:<trackid>"
}

The unknown URL object will always contain the scheme (if present), for filtering purposes, and also contains a toString() method, which will convert the URL object back to the original URL string.

mailto

«mailto:» URLs are parsable in the same manner as a regular HTTP URL. For example, the following URL object is returned for a URL with a «mailto:» scheme:

{
    scheme: "mailto:"
    user: "username",
    password: "",
    host: "www.example.com",
    port: "",
    path: "",
    query: "?subject=subject&body=body",
    fragment: ""
}

Therefore, «mailto:» URLs can be fully parsed using this parser, but note that it is not possible to set the password, port or fragment strings on a «mailto:» URL.

javascript

«javascript:» URLs are parsable in the same manner as a regular HTTP URL. For example, the following URL object is returned for a URL with a «javascript:» scheme:

{
    scheme: "javascript:"
    user: "",
    password: "",
    host: "www.example.com",
    port: "",
    path: "/",
    query: "",
    fragment: "",
    javascript: "alert('!');"
}

Therefore, «javascript:» URLs can be fully parsed using this parser, but note that the current «document.location.href» will always be parsed/returned as the main URL object.

Alternatives

If you don’t like jurlp, you may like one of these:

  • URI.js (thanks for the alternatives list!)
  • jQuery-URL-Parser
  • URL.js
  • furl (Python)
  • URI.js by Gary Court

Usage

You can find many examples of how to use the dom parser and any of its parts (which you will most likely never touch) in the tests directory. The tests are done using PHPUnit and are very small, a few lines each, and are a great place to start. Given that, I’ll still be showing a few examples of how the package should be used. The following example is a very simplistic usage of the package.

// Assuming you installed from Composer:
require "vendor/autoload.php";
use PHPHtmlParser\Dom;

$dom = new Dom;
$dom->loadStr('<div class="all"><p>Hey bro, <a href="google.com">click here</a><br /> :)</p></div>');
$a = $dom->find('a')[0];
echo $a->text; // "click here"

The above will output «click here». Simple, no? There are many other ways to get the same result from the dom, all of which can be found in the tests or in the code itself.

URL naming scheme

A quick guide to URL nomenclature in this plugin.

Throughout this plugin, URLs are segmented and referred to in the following manner:

  • Scheme — Contains the protocol identifier (i.e. «https://», «ftp://»).
  • User — Contains the username to use when connecting to the host server. This segment may be empty.
  • Password — Contains the password to use in conjunction with the username when connecting to the remote server. This segment may be empty (and cannot be set without a user name).
  • Port — Contains the listening port number for the host server (i.e. «80», or «8080»). Note that an empty port value implies the default port (80).
  • Path — Contains the file path (i.e. «/index.html», or «/»).
  • Query — Contains any parameters passed in the query (i.e. «?param1=value1&param2=value2»). This segment may be empty.
  • Fragment — Contains any anchors/hash tags (i.e. «#elementname»). This segment may be empty.

Manual shortcuts

If your URL can’t be matched with a page name, a manual page is searched for your query. This is the case for the https://www.php.net/preg_match URL. The following pages are searched for in the manual:

  • Chapter pages (e.g. https://www.php.net/installation)
  • Reference pages (e.g. https://www.php.net/imap)
  • Function pages (e.g. https://www.php.net/join)
  • Class pages (e.g. https://www.php.net/dir)
  • Feature pages (e.g. https://www.php.net/safe_mode)
  • Control structure pages (e.g. https://www.php.net/while)
  • Other language pages (e.g. https://www.php.net/oop)

Since there are several manual pages that could potentially match the query (extension, class, function name, ...), you are encouraged to use their prefix/suffix:

  • Extension TOC: https://www.php.net/book.extname (e.g. https://www.php.net/book.dom).
  • Extension intro pages: https://www.php.net/intro.extname (e.g. https://www.php.net/intro.array).
  • Extension setup TOC: https://www.php.net/extname.setup (e.g. https://www.php.net/intl.setup).
  • Extension install chapter: https://www.php.net/extname.installation (e.g. https://www.php.net/apc.installation).
  • Extension configuration: https://www.php.net/extname.configuration (e.g. https://www.php.net/session.configuration).
  • Extension resources: https://www.php.net/extname.resources (e.g. https://www.php.net/mysql.resources).
  • Extension constants: https://www.php.net/extname.constants (e.g. https://www.php.net/image.constants).
  • Class synopsis: https://www.php.net/class.classname (e.g. https://www.php.net/class.xmlreader).
  • Class method: https://www.php.net/classname.methodname (e.g. https://www.php.net/pdo.query).
  • Functions: https://www.php.net/function.functionname (e.g. https://www.php.net/function.strpos).

This kind of URL will bring up the manual page in your preferred language. You can always override this setting by explicitly providing the language you want to get to. You can embed the language in the URL before the manual search term. https://www.php.net/hu/sort will bring up the Hungarian manual page for sort() for example.

Uncovering "hidden" keyword-level data

Google Analytics can import data from Search Console. But you won't see anything new there: the same pages, CTR, positions and impressions. It would be far more interesting to see the bounce rate for visits from particular keywords and, even better, how many goals they brought in.

The template by Sarah Lively, described in an article for MOZ, helps here.

To get started, install these add-ons for Google Sheets:

  • Google Analytics Spreadsheet Add-on;
  • Search Analytics for Sheets (if you used the first two templates, you already have it).

Step 1. Set up the data export from Google Analytics

Create a new spreadsheet, open the «Add-ons» / «Google Analytics» menu and choose «Create new report».

Fill in the report parameters:

  • Name: «Organic Landing Pages Last Year»;
  • Account: pick your account;
  • Property: pick your property;
  • View: pick your view.

Click «Create report». A «Report Configuration» sheet appears. At first it looks like this:

But we need it to look like this (enter the export parameters manually):

Simply copy and paste the report parameters (and remove the value 1000 from the Limit field):

Report Name | Organic Landing Pages Last Year                | Organic Landing Pages This Year
View ID     | // your GA view ID goes here                   | // your GA view ID goes here
Start Date  | 395daysAgo                                     | 30daysAgo
End Date    | 365daysAgo                                     | yesterday
Metrics     | ga:sessions, ga:bounces, ga:goalCompletionsAll | ga:sessions, ga:bounces, ga:goalCompletionsAll
Dimensions  | ga:landingPagePath                             | ga:landingPagePath
Order       | -ga:sessions                                   | -ga:sessions
Filters     |                                                |
Segments    | sessions::condition::ga:medium==organic        | sessions::condition::ga:medium==organic

After that, open the «Add-ons» / «Google Analytics» menu and click «Run reports». If everything is fine, you will see a message like this:

Two new sheets named after the reports will also appear.

Step 2. Export the data from Search Console

Work in the same file. Go to a new sheet and launch the Search Analytics for Sheets add-on.

Export parameters:

  • Verified Site: pick your site;
  • Date Range: set the same period as in the «Organic Landing Pages This Year» report (in our case, the last month);
  • Group By: «Query», «Page»;
  • Aggregation Type: «By Page»;
  • Results Sheet: pick the current «Sheet 1».

Export the data and rename «Sheet 1» to «Search Console Data». You get a table like this:

To make the data comparable with Google Analytics, convert the URLs to relative ones by removing the domain name (use find-and-replace to replace the domain with an empty string).

After the change, the URLs should look like this:

Step 3. Combine the Google Analytics and Search Console data

Make a copy of the Keyword Level Data template. Open it and copy the «Keyword Data» sheet into our working file. In the «Page URL #1» and «Page URL #2» columns, paste the relative URLs of the pages whose statistics you want to compare.

For each page, the Google Analytics statistics are pulled in, along with the 6 most popular keywords that drove traffic to it. Of course, this is not detailed statistics for every single keyword, but it is still better than nothing.

If needed, you can adapt the template: change the metrics, the number of exported keywords, and so on. How to do that is described in detail in the original article.

As you can see, you don't have to reach for your wallet right away to work with keywords. There are plenty of simple solutions. Follow our publications: we will be sharing more useful tips.

Loading Files

You may also seamlessly load a file into the dom instead of a string, which is much more convenient and is how I expect most developers will be loading the html. The following example is taken from our tests and uses the «big.html» file found there.

// Assuming you installed from Composer:
require "vendor/autoload.php";
use PHPHtmlParser\Dom;

$dom = new Dom;
$dom->loadFromFile('tests/data/big.html');
$contents = $dom->find('.content-border');
echo count($contents); // 10

foreach ($contents as $content)
{
	// get the class attr
	$class = $content->getAttribute('class');
	
	// do something with the html
	$html = $content->innerHtml;

	// or refine the find some more
	$child   = $content->firstChild();
	$sibling = $child->nextSibling();
}

This example loads the html from big.html, a real page found online, and gets all the content-border classes to process. It also shows a few things you can do with a node but it is not an exhaustive list of methods that a node has available.

URL

The URL object is the primary object for working with a URL.

The URL constructor

The constructor throws a \Rowbot\URL\Exception\TypeError if it fails to parse the given input into a valid URL.

use Rowbot\URL\URL;

// Construct a new URL object.
$url = new URL('https://example.com/');

// Construct a new URL object using a relative URL, by also providing the constructor with the base URL.
$url = new URL('path/to/file.php?query=string', 'http://example.com');
echo $url->href; // Outputs: "http://example.com/path/to/file.php?query=string"

// You can also pass an existing URL object to either the $url or $base arguments.
$url = new URL('https://example.org:123');
$url1 = new URL('foo/bar/', $url);
echo $url1->href; // Outputs: "https://example.org:123/foo/bar/"

// Catch the error when URL parsing fails.
try {
    $url = new URL('http://2001::1]');
} catch (\Rowbot\URL\Exception\TypeError $e) {
    echo 'Invalid URL';
}

URL Members

Note: As a convenience, the magic getter and setter will throw an exception if you try to get or set an invalid property.

The href getter returns the serialization of the URL. The setter will parse the entire string, updating all of the components of the URL with the new values. Providing an invalid URL will cause the setter to throw a TypeError.

The origin member is read-only. Its output is in the form scheme://host:port. If a URL does not have a port, the port is excluded from the output.

The protocol getter, also known as the scheme, returns the protocol of the URL, such as http, ftp, or ssh. The setter is used to change the URL's protocol.

The username getter returns the username portion of the URL, or an empty string if the URL does not contain a username. The setter changes the URL's username.

The password getter returns the password portion of the URL, or an empty string if the URL does not contain a password. The setter changes the URL's password.

The host getter returns the combination of hostname and port, so the output looks like hostname:port. If the URL does not have a port, then the port is not present in the output. The setter allows you to change both the hostname and the port at the same time.

The hostname getter returns the hostname of the URL. For example, the hostname of https://example.com:8080/ would be example.com. The setter will change the hostname portion of the URL.

The port getter returns an integer as a string representing the URL's port. If the URL does not have a port, the empty string will be returned instead. The setter updates the URL's port.

The pathname getter returns the URL's path. The setter updates the URL's path.

The search getter returns the URL's query string. The setter updates the URL's URLSearchParams list.

searchParams returns the URLSearchParams object associated with this URL, allowing you to modify the query parameters without having to clobber the entire query string. This will always return the same object.

The hash getter, also known as the URL's fragment, returns the portion of the URL that follows the «#» character. The setter updates the portion of the URL that follows the «#».

toJSON() returns a JSON-encoded string of the URL. Note that this method escapes forward slashes, which is not the default for PHP's json_encode(), but matches the default behavior of JavaScript's JSON.stringify(). If you wish to control the serialization, pass the URL object to the json_encode() function yourself.

The URL object implements the JsonSerializable interface, allowing you to pass the object as a whole to the json_encode() function.

toString() returns the serialization of the URL.
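To make the members above concrete, here is a short sketch. It assumes the package follows the WHATWG URL API described above; the URLSearchParams::append() call is the standard method name and should be verified against the library.

use Rowbot\URL\URL;

$url = new URL('https://example.com:8080/path/to/file.php?a=1#top');

echo $url->protocol; // "https:"
echo $url->host;     // "example.com:8080"
echo $url->pathname; // "/path/to/file.php"
echo $url->hash;     // "#top"

// Modify the query without clobbering the whole query string.
$url->searchParams->append('b', '2');
echo $url->search;   // "?a=1&b=2"

echo $url->href;     // the full serialization of the URL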

Regular Expressions

Last and least recommended, you can extract data from HTML with regular expressions. In general, using regular expressions on HTML is discouraged.

Most of the snippets you will find on the web to match markup are brittle. In most cases they only work for one very particular piece of HTML. Tiny markup changes, like adding whitespace somewhere, or adding or changing attributes in a tag, can make the RegEx fail when it's not properly written. You should know what you are doing before using RegEx on HTML.

HTML parsers already know the syntactical rules of HTML. Regular expressions have to be taught those rules for each new RegEx you write. RegEx are fine in some cases, but it really depends on your use-case.

You can get away with a quick regular expression for a small, one-off extraction, but writing a complete and reliable custom parser with regular expressions is a waste of time when the aforementioned libraries already exist and do a much better job of it.
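A small illustration of that brittleness: a pattern written against one exact markup shape silently stops matching as soon as an attribute is added.

// A regex tailored to one exact markup shape...
$pattern = '/<a href="([^"]+)">/';

preg_match($pattern, '<a href="http://example.com/">link</a>', $m);
var_dump($m[1]); // "http://example.com/"

// ...breaks as soon as the markup changes slightly.
preg_match($pattern, '<a class="external" href="http://example.com/">link</a>', $m);
var_dump($m);    // array(0) {} - no match, even though the link is still there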

Also see Parsing Html The Cthulhu Way

URL structure in the admin panel

Admin panel URLs are usually built according to one of a few typical schemes.

Let's look at a simple example right away, say a URL like /products/add.

So, the module here is products, and the action is, for example, add. What do we do with this now?

If you are familiar with OOP and MVC, then the module is, for you, the name of a class, and the action is the method of that class that should be run. If no action is specified, the convention is to run a method called index.

If none of that made sense, think of the module as the name of the file that needs to be included, and of the action as, well, the action that needs to be performed.

Let's rewrite the example we wrote for the single entry point to use the new URL scheme:
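The original snippet did not survive in this text; a minimal index.php sketch based on the description that follows (take the first URL segment, look for a matching file in the pages folder, otherwise fall back to a 404 page) might look like this:

<?php
// index.php - single entry point (illustrative sketch, not the author's original code)

// Strip the query string and split the path into segments.
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
$segments = explode('/', trim($path, '/'));

// The first segment is the module; fall back to a default page.
// A real implementation should also validate the segment (e.g. allow only [a-z0-9_-]).
$module = $segments[0] !== '' ? $segments[0] : 'index';

// If a matching file exists in the pages folder, run it; otherwise show the 404 page.
if (is_file(__DIR__ . '/pages/' . $module . '.php')) {
    require __DIR__ . '/pages/' . $module . '.php';
} else {
    http_response_code(404);
    require __DIR__ . '/pages/404.php';
}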

So, we take the first segment of the URL and check whether a file with that name exists in the pages folder.

That is, when the page /test/test2 is requested, the script will check whether the file /pages/test.php exists. If the file is there, PHP executes it; otherwise /pages/404.php is executed.

As you can see, with this approach we no longer need to maintain a mapping between URLs and PHP files. PHP itself will look for the right file in the pages folder using the first URL segment.

Now all that is left is to create the file pages/products.php. Let's make a small stub:
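A possible stub matching the description that follows (the handler bodies are placeholders, not the original listing):

<?php
// pages/products.php - sketch of the module stub (illustrative)

$action = $segments[1] ?? 'index';

if ($action === 'index') {
    // show the product list
} elseif ($action === 'add') {
    if ($_SERVER['REQUEST_METHOD'] === 'GET') {
        // show the "add product" form
    } else {
        // POST: add the product
    }
} elseif ($action === 'update') {
    // update a product
} elseif ($action === 'delete') {
    // delete a product
} else {
    require __DIR__ . '/404.php';
}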

This is what action handling looks like. We look at the second URL segment and look for a handler for that action. A separate elseif block has to be written for each action (add, update, delete).

Inside the add handler we check which method the request came with, GET or POST. If it is GET, we display the form; if it is POST, we add the product.

If you don't like the nested method check, you can do it differently. In index.php, save the method in a separate variable:
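For instance (a sketch; the variable name $method is illustrative):

// index.php - in addition to the routing above, expose the request method to the pages
$method = $_SERVER['REQUEST_METHOD']; // 'GET' or 'POST'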

products.php
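A version of pages/products.php that branches on that variable might look like this (again a sketch, not the original listing):

<?php
// pages/products.php - sketch using the $method variable set in index.php

$action = $segments[1] ?? 'index';

if ($action === 'add' && $method === 'GET') {
    // show the "add product" form
} elseif ($action === 'add' && $method === 'POST') {
    // add the product
} elseif ($action === 'delete') {
    // delete a product
}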

Done. And if you don't like that the same action appears twice in the code, just with different methods, you can use a slightly simplified URL scheme borrowed from the Laravel framework.

Adding the /admin/ prefix to the URL

Let's change the index.php code slightly:
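A sketch of the adjusted index.php, matching the behaviour described below (the admin_ file-name prefix comes from the next paragraph):

<?php
// index.php - sketch of admin prefix handling (illustrative)

$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
$segments = explode('/', trim($path, '/'));

// For /admin/products the file name becomes admin_products.php,
// built from the first two segments.
if ($segments[0] === 'admin') {
    $file = 'admin_' . ($segments[1] ?? 'index');
} else {
    $file = $segments[0] !== '' ? $segments[0] : 'index';
}

if (is_file(__DIR__ . '/pages/' . $file . '.php')) {
    require __DIR__ . '/pages/' . $file . '.php';
} else {
    http_response_code(404);
    require __DIR__ . '/pages/404.php';
}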

Now, when the page /admin/products is requested, PHP will look for a file named admin_products.php rather than products.php.

Rename the file, and don't forget to replace every $segments[1] in it with $segments[2], since $segments[1] now holds the module and $segments[2] holds the action.

Callbacks

During the http_parser_execute() call, the callbacks set in http_parser_settings will be executed. The parser maintains state and never looks behind, so buffering the data is not necessary. If you need to save certain data for later usage, you can do that from the callbacks.

There are two types of callbacks:

  • Notification callbacks: on_message_begin, on_headers_complete, on_message_complete.
  • Data callbacks: (requests only) on_url, (common) on_header_field, on_header_value, on_body.

Callbacks must return 0 on success. Returning a non-zero value indicates an error to the parser, making it exit immediately.

For cases where it is necessary to pass local information to/from a callback, the parser object's data field can be used. An example of such a case is when using threads to handle a socket connection, parse a request, and then give a response over that socket. By instantiating a thread-local struct containing relevant data (e.g. accepted socket, allocated memory for callbacks to write into, etc.), a parser's callbacks are able to communicate data between the scope of the thread and the scope of the callback in a threadsafe manner. This allows http_parser to be used in multi-threaded contexts.

Example:

 typedef struct {
  socket_t sock;
  void* buffer;
  int buf_len;
 } custom_data_t;


int my_url_callback(http_parser* parser, const char *at, size_t length) {
  /* Access to the thread-local custom_data_t struct.
  Use this access to save parsed data for later use into the thread-local
  buffer, or to communicate over the socket.
  */
  parser->data;
  ...
  return 0;
}

...

void http_parser_thread(socket_t sock) {
 int nparsed = 0;
 /* allocate memory for user data */
 custom_data_t *my_data = malloc(sizeof(custom_data_t));

 /* some information for use by callbacks.
 * achieves thread -> callback information flow */
 my_data->sock = sock;

 /* instantiate a thread-local parser */
 http_parser *parser = malloc(sizeof(http_parser));
 http_parser_init(parser, HTTP_REQUEST); /* initialise parser */
 /* this custom data reference is accessible through the reference to the
 parser supplied to callback functions */
 parser->data = my_data;

 http_parser_settings settings; /* set up callbacks */
 settings.on_url = my_url_callback;

 /* execute parser */
 nparsed = http_parser_execute(parser, &settings, buf, recved);

 ...
 /* parsed information copied from callback.
 can now perform action on data copied into thread-local memory from callbacks.
 achieves callback -> thread information flow */
 my_data->buffer;
 ...
}

In case you parse an HTTP message in chunks (i.e. read the request line from the socket, parse, read half the headers, parse, etc.) your data callbacks may be called more than once. http_parser guarantees that the data pointer is only valid for the lifetime of the callback. You can also read the data directly into a heap-allocated buffer to avoid copying memory around if this fits your application.

Reading headers may be a tricky task if you read/parse headers partially. Basically, you need to remember whether the last header callback was a field or a value and apply the corresponding logic: when an on_header_field callback follows an on_header_value callback, a new header has started, so store the completed field/value pair and start a new field buffer; when a callback of the same kind repeats, append the data to the current buffer.

Parsing URL strings directly.

How to directly parse, modify or monitor an arbitrary URL string.

// Get an interface for parsing the document URL...
var url = $.jurlp();

// .. or get an interface for parsing your own URL.
url = $.jurlp("www.example.com");

// Parse the URL to an object.
url.url();

// Get the URL scheme.
url.scheme();

// Get the URL host.
url.host();

// Get the URL port.
url.port();

// Get the URL path.
url.path();

// Get the URL query.
url.query();

// Get a specific parameter value from the URL query.
url.query().parameter;

// Get the URL fragment.
url.fragment();

// Create a watch for new URLs that contain "example.com" in the host name
var watch = $.jurlp("example.com").watch(function(element, selector){
    console.log("Found example.com URL!", element, selector);
});

// We can even apply filters to the watch to be sure!
watch.jurlp("filter", "host", "*=", "example.com");

// Append a new URL, which will trigger the watch
$("body").append("<a href=\"www.example.com\"></a>");

// Stop watching for "example.com" URLs.
watch.jurlp("unwatch");

Usage

One http_parser object is used per TCP connection. Initialize the struct using http_parser_init() and set the callbacks. That might look something like this for a request parser:

http_parser_settings settings;
settings.on_url = my_url_callback;
settings.on_header_field = my_header_field_callback;
/* ... */

http_parser *parser = malloc(sizeof(http_parser));
http_parser_init(parser, HTTP_REQUEST);
parser->data = my_socket;

When data is received on the socket execute the parser and check for errors.

size_t len = 80*1024, nparsed;
char buf[len];
ssize_t recved;

recved = recv(fd, buf, len, 0);

if (recved < 0) {
  /* Handle error. */
}

/* Start up / continue the parser.
 * Note we pass recved==0 to signal that EOF has been received.
 */
nparsed = http_parser_execute(parser, &settings, buf, recved);

if (parser->upgrade) {
  /* handle new protocol */
} else if (nparsed != recved) {
  /* Handle error. Usually just close the connection. */
}

http_parser needs to know where the end of the stream is. For example, sometimes servers send responses without Content-Length and expect the client to consume input (for the body) until EOF. To tell http_parser about EOF, give 0 as the fourth parameter to http_parser_execute(). Callbacks and errors can still be encountered during an EOF, so one must still be prepared to receive them.

Scalar valued message information such as status_code, method, and the HTTP version are stored in the parser structure. This data is only temporarily stored in http_parser and gets reset on each new message. If this information is needed later, copy it out of the structure during the callback.

The parser decodes the transfer-encoding for both requests and responses transparently. That is, a chunked encoding is decoded before being sent to the on_body callback.

URL Objects

URL object definition.

{
    scheme: "http://"
    user: "username",
    password: "password",
    host: "www.example.com",
    port: "8080",
    path: "/path/file.name",
    query: "?query=string",
    fragment: "#anchor"
}

Therefore, wherever URLs are supplied as a parameter to the plugin via the url or proxy methods, either a string or an object representation of the URL may be supplied.

URL objects that have been returned via the parser interface can easily be converted to a string by calling the objects toString() method.

Example:

// Parse the document.location.href URL, and convert it back to a string again.
$(document).jurlp("url").toString();
