The latest posts in full-text for feed readers.
I wanted to attend the IndieWebCamp Nuremberg this month, just as I did last year. While browsing the page for information, under "Participating" I saw a link to the "Code of Conduct" that one has to adhere to when attending the event.
There has been much talk about CoCs in the last years, and I generally try to ignore such things as much as possible, just like CLAs and NDAs. But now I was supposed to be forced to follow one, so I asked some questions about it in IRC.
My understanding of rules in societies is that there are two levels:
1. If you break the law, police will come and arrest or fine you.
2. If you do not follow good sense, people will yell at and/or avoid you.
So, why do we need a third level? A "Code of Conduct", which also could be called "house rules"?
If you add a Code of Conduct, you think that level 1 (law) does not help and level 2 (good sense) is not available/adhered to.
In the IRC discussion, Rosemary Orchard gave a couple of reasons for a CoC:
Reason 1, "people feel safe", follows the same reasoning that states follow when flooding public spaces with video surveillance.
But just feeling safe does not actually make you safe. Video cameras do not make your life safer, and neither does a Code of Conduct.
I'd have put this under "good sense", but that's obviously not enough.
The premise is that banning someone based on some written text is easier than referring to some nebulous common sense.
I did realize that in the end, every Code of Conduct only exists to achieve one goal: Make it easy to ban people from some space, be it an online community or a conference.
This seems to be an easy argument: Because of diverse social backgrounds, members of an international community cannot assume that other members share the same common and good sense.
If you follow this reasoning, then the rules written down in a Code of Conduct have to be very clear, so that people with different backgrounds can understand them without ambiguities.
And this is where it all breaks: Instead of clear and unequivocal rules, the IndieWebCamp Code of Conduct (and probably all others, too) is full of soft words that can be bent in every direction:
Respectful behavior
- Be considerate, kind, constructive, and helpful.
- Avoid demeaning, discriminatory, harassing, hateful, or physically threatening behavior, speech, and imagery.
If the organizers determine that an event participant is behaving disrespectfully, the organizers may take any action they deem appropriate, up to and including expulsion and exclusion from the event without warning or refund.
So what actually is "demeaning"? It's a very soft word that has no singular definition, and will mean totally different things depending on your background.
The same applies to "discriminatory" and "harassing". Almost every joke discriminates against some group, be it guests in a restaurant (German: Ober-Witze), types of animals or groups of people that are on the losing side of a joke.
The Wikipedia definition of harassment refers to common sense, which we can't rely on because of reason #4:
It is commonly understood as behavior that [...] embarrasses a person
IndieWebCamps have hacking days where people code together. Now when I point out some stupid bug in someone else's code, this might embarrass the person who wrote it.
This already covers the Code of Conduct's definition of "disrespectfully", and bam, I'm kicked from the conference.
Together with reason #2 ("somebody will care") this will eventually lead to overreaction: When someone complains based on the CoC, the organizers will know that people expect them to do something, because they themselves put their conference under the Code of Conduct. Common sense will be less likely to be applied in such situations.
A Code of Conduct is a set of rules to ban people.
It is needed because people have such diverse backgrounds that no common sense exists.
People with different backgrounds understand the rules differently, because they are soft instead of explicit.
I will not attend the IndieWebCamp this year.
Other people can express the issues better than I:
Bad things happen because of CoCs:
Published on 2018-10-16 in bigsuck, conference, indieweb, politik
h-feed is a set of rules to add CSS classes to HTML tags so that normal HTML pages can be parsed automatically by feed readers. Indieweb proponents like Tantek Çelik prefer it over Atom feeds and have a list of criticisms:
As a duplicate of information already in the HTML of a web page, feed files are an example of the usual DRY violations.
This tells only half of the story. Most websites are split up into two parts: Index pages that list articles with their titles and a short summary, and article pages that contain the full article text.
In that case, the premise of "information already [available] in the HTML" is not correct, and the h-feed is more than 2 times larger than the full-text Atom feed.
Higher maintenance (requiring a separate URL, separate code path to generate, separate format understanding)
This is true for the initial setup/implementation.
However, when the site gets a new layout/redesign, the Atom feed can stay untouched and will not break, while extra care and testing is needed to keep an h-feed working.
Feed files become out of date with the visible HTML page (often because of broken separate code path), e.g.: [...]
When reading through the indieweb chat logs I saw the following and had a very good laugh:
aaronpk: Whoops tantek the name on your event on your home page is a mess, i'm guessing implied p-name? It's fine on your event permalink
Following all these indieweb feeds is making these markup issues super obvious now.
tantek: Even when the data is visible, consuming it and presenting it in a different way can reveal issues!
If you're still around I think I have a fix for the p-name problem you found.
Seems to work locally
Alright, deployed
!tell aaronpk try tantek.com h-feed again, p-name issue(s) should be fixed. e-content too.
Tantek added h-feed because he feared that the Atom "side file" could break silently since it is invisible.
Now his h-feed failed silently, and it needed a feed reader user to tell him - just as it would have been the case if his Atom feed had broken (except that an Atom feed can be validated automatically).
Published on 2018-03-12 in html, indieweb, web
Last year I wanted to back up a friend's Instagram account, and chose a local Known instance as the target.
Posting "normal" text and photo posts into a blog is standardized now: Simply use Micropub with a client of your choice. I wrote shpub in that time, which is a micropub client for the shell, neatly packaged up into a single .phar file.
With Known and shpub in place, I wrote a script that regularly checked the Instagram account site, extracted text, image and geo coordinates of new posts and pushed them into Known.
One thing was missing: Social reactions - comments and likes.
So I sat down and extended Known's Micropub plugin to support receiving likes, comments and RSVPs. By the time this patch got merged, my Instagram backup project had sunk into a deep sleep, and I never got around to making the script import the reactions as well.
Some weeks ago I wanted to write a blog post about the comments-via-Micropub functionality and saw that it did not work at all. My patch had a serious flaw (a ! at the wrong place, a debug leftover) and nobody had noticed it :/ So now that problem is patched in the latest git and will land in the version that follows Known 0.99.
Instead of sending an h=entry Micropub post, an h=annotation with a couple of extra parameters has to be sent:
$ curl -X POST \
    -H 'Content-Type: application/x-www-form-urlencoded' \
    -H 'Authorization: Bearer deadbeefcafe' \
    -d 'h=annotation' \
    -d 'url=http://example.org/some-blog-post' \
    -d 'type=reply' \
    -d 'username=barryf' \
    -d 'userurl=http://example.org/~barryf' \
    -d 'userphoto=http://example.org/~barryf/avatar.jpg' \
    -d "content=There is a typo in paragraph 1 - 'Fou' should be 'Foo'" \
    'http://example.org/micropub/endpoint'
Alternatively you can use shpub:
$ ./bin/shpub.php x annotation \
    -x url=http://example.org/some-blog-post \
    -x type=reply \
    -x username=barryf \
    -x userurl=http://example.org/~barryf \
    -x userphoto=http://example.org/~barryf/avatar.jpg \
    -x content="There is a typo in paragraph 1. 'Fou' should be 'Foo'"
$ curl -X POST \
    -H 'Content-Type: application/x-www-form-urlencoded' \
    -H 'Authorization: Bearer deadbeefcafe' \
    -d 'h=annotation' \
    -d 'url=http://example.org/some-blog-post' \
    -d 'type=like' \
    -d 'username=barryf' \
    -d 'userurl=http://example.org/~barryf' \
    -d 'userphoto=http://example.org/~barryf/avatar.jpg' \
    'http://example.org/micropub/endpoint'
$ curl -X POST \
    -H 'Content-Type: application/x-www-form-urlencoded' \
    -H 'Authorization: Bearer deadbeefcafe' \
    -d 'h=annotation' \
    -d 'url=http://example.org/some-blog-post' \
    -d 'type=like' \
    -d 'username=barryf' \
    -d 'userurl=http://example.org/~barryf' \
    -d 'userphoto=http://example.org/~barryf/avatar.jpg' \
    -d 'content=yes' \
    'http://example.org/micropub/endpoint'
Published on 2017-11-10 in indieweb, php, web
Yesterday evening I gave a talk about the W3C recommendations Webmention and Micropub at the monthly meeting of the PHP UserGroup Leipzig.
The Webmention slides are now online.
Sorry Aaron for the test replies and likes to your sleep posts :)
Published on 2017-09-01 in indieweb, php
It turns out that shpub - the Micropub client for the shell - is used by people to push content into their blogs, or to POSSE data such as Github likes to silos via silo.pub.
shpub 0.5.1 was released yesterday, and it fixes a couple of bugs:
Published on 2017-08-29 in indieweb, php
The indieweb IRC channels have a permanent occupant: Loqi, a chat bot.
It does a number of things, among them saying "Good morning", replying to "a or b" questions and taking commands to edit the wiki. And it prints the page title of URLs.
I noticed that it would show the full content of github comment URLs and tried to use that to inject commands:
It turned out that Loqi would not take commands from itself, except countdowns: "$n minutes until $text":
<cweiske> https://github.com/cweiske/test/issues/1#issuecomment-311087678
<Loqi> 2 minutes until boom
<Loqi> I added a countdown scheduled for 2017-06-26 8:09am PDT (#6035)
So it is possible for anyone to add countdowns to Loqi without ever talking to it directly :)
Published on 2017-06-28 in indieweb, network, web
Last week someone thought that it would be a good idea to invent a new standard for feeds: JSON feed.
So in addition to the four incompatible-with-each-other and underspecified RSS formats (RSS 0.90, RSS 0.91, RSS 1.0, and RSS 2.0), the correctly spec'ed Atom format and the HTML-based h-feed we have a seventh one that future feed readers will also have to support.
One of the reasons for inventing this new format is:
For most developers, JSON is far easier to read and write than XML.
One of the problems with the XML-based feed formats is that software spits out non-wellformed XML, which cannot be read with XML libraries.
The reason for this is that people think "that looks like HTML, let's write an HTML template for the XML feed" - which breaks at the first character that needs to be escaped. This could have been prevented if those people had simply used an XML library to generate the feed XML. And yes, every programming language has had an XML library for 15 years now.
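Just to illustrate: with an XML library, escaping simply stops being your problem. A minimal sketch using PHP's DOMDocument (the title text is made up; any XML library works the same way):

<?php
// Building an Atom entry with an XML library instead of a string
// template: text nodes are escaped automatically on serialization,
// so "&" and "<" in a title cannot produce non-wellformed XML.
$ns  = 'http://www.w3.org/2005/Atom';
$doc = new DOMDocument('1.0', 'utf-8');

$entry = $doc->appendChild($doc->createElementNS($ns, 'entry'));
$title = $entry->appendChild($doc->createElementNS($ns, 'title'));
$title->appendChild($doc->createTextNode('Tom & Jerry <3'));

echo $doc->saveXML();
// The title is serialized as <title>Tom &amp; Jerry &lt;3</title> -
// exactly the characters a naive HTML template gets wrong.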
So now the JSON feed people come, see this as a problem and say: Hey, JSON is so easy to generate with libraries - let's ditch XML and use JSON.
Now guess what happens? People use the HTML templating engine to generate JSON that breaks at the first character that needs to be escaped.
Dear Brent Simmons and Manton Reece: You tried to fight human nature with a new standard, and failed.
The JSON feed spec v1 states:
JSON Feed files must be served using the same MIME type - application/json - that's used whenever JSON is served.
Congratulations, my tools now cannot differentiate between normal JSON files, JF2 feeds and JSON feeds when trying to discover feeds on an HTML page.
A proper solution would have been to use the mime/type+format schema that's already used by Atom (which has application/atom+xml): application/jsonfeed+json.
[JSON feed] reflects the lessons learned from our years of work reading and publishing feeds.
HTTP responses, HTML pages and Atom feeds have the ability to link to other resources. This is all nicely specified in RFC 5988: Web Linking.
New technologies like the realtime change-notification system WebSub rely on the ability of feeds to link to their hub. And the JSON feed people did not even think to add support for links, because in the years of publishing feeds they never wanted to notify subscribers in realtime about updates.
Published on 2017-06-01 in indieweb, php, programming, web
I wrote a plugin for the Tiny Tiny RSS feed reader that allows you to reply to blog posts directly from within the application.
The second day of IndieWebCamp Nürnberg was hack-day. In the morning everyone said what they were going to work on, and in the afternoon was demo time - people showed what they achieved during that day.
Doing something demoable in ~6 hours is hard, so I chose something easy: Writing a Micropub client for the Tiny Tiny RSS feed reader: The goal was to be able to write replies to blog posts directly in TT-RSS and posting them to my website, without ever leaving the application.
Writing the plugin went relatively well; I had actually figured out the base stuff the night before in the hotel and knew how and where to hook into TT-RSS.
At the end of the day I had a smooth demo that demonstrated registering your homepage/identity in the preferences, and posting a comment to a blog post from within the feed reader itself.
During IWC I only worked on the basic stuff to have something working for the demo - but many things were missing: Identity management (adding, removing and defaulting identities), the URL of the comment was not shown anywhere (I had to extract that from Wireshark during the demo) and nice-to-have things like bookmarking + favoriting/liking.
In the last couple of days the train ride each morning and afternoon was micropub-hacking time for me, mostly fighting against the seriously underdocumented TT-RSS API and experiencing a ban from the TT-RSS forum.
But now the plugin is ready for wider audiences and can be downloaded from my own git server or the GitHub mirror. This nicely coincides with the fact that Micropub is a W3C Recommendation now.
Published on 2017-06-01 in indieweb, php
I wrote a search engine to be able to search my blog, website and all linked pages. It's running at search.cweiske.de using PHP, Elasticsearch and Gearman.
When looking for a way to add search functionality to my blog, I found a few hosted search providers and some existing software but none that matched my taste. I had used regain before, but found too many problems.
So I had to do it all myself, again.
My head already contained a list of must-have features:
I use PDO for SQL database access (subscriptions), Net_URL2 for URL parsing/resolving and HTTP_Request2 for doing HTTP requests. No frameworks, only libraries.
As of version 0.2.1, phinde consists of 1800 lines of PHP code and 400 lines of HTML/Twig.
Because I wanted to rank headlines higher than normal text, MySQL full text search could not be used. From phorkie's development I knew that Elasticsearch supports field boosting and settled on that.
I made a schema that contained individual fields for title, each of the headline types (h1-h6), the text and tags/keywords. The fields each got a different boost that determines their priority in search result ranking.
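Boosting can also be expressed at query time; a rough sketch of such a search request in PHP (field names, boost values and the index URL are illustrative, not phinde's actual schema):

<?php
// Illustrative only: a multi_match query where matches in the title
// and headline fields weigh more than matches in the body text.
$query = [
    'query' => [
        'multi_match' => [
            'query'  => 'micropub',
            'fields' => ['title^8', 'h1^6', 'h2^4', 'text'],
        ],
    ],
];

$ch = curl_init('http://localhost:9200/phinde/_search');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($query));
curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
echo curl_exec($ch);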
My blog+website index contains 3,600 documents and takes 34 MiB (mostly "normal" HTML pages). The indieweb chat search instance indexes 900,000 documents, with a size of 550 MiB (tiny documents, each a single chat log line).
Elasticsearch works well except when there are schema changes, which often happens during development. I found it easier to throw away all data after making changes to the schema, because migrating a schema is too much work. This might be different when you have a couple of million documents in ES - but for me it's easier to let half a dozen worker processes re-crawl everything, than to implement schema migration scripts.
Crawling the web is a prime example of a task that can be parallelized.
When a URL is fetched, the script extracts all linked URLs and determines if they should be followed. Each URL is then put in the job queue, together with information on whether it shall be crawled and/or indexed.
phinde uses Gearman as queue system. It allows me to spin up as many worker instances as I need, more instances meaning faster crawling + indexing.
The phinde-worker script is tiny; it only listens for incoming jobs and then starts a process script that does the actual work. This frees me from complicated error and exception handling, allows updating the processor without restarting the worker and makes development easy because I can run the processor from command line, just as the worker does.
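The pattern is roughly the following; function and file names are made up for illustration and are not phinde's actual ones:

<?php
// Producer side: queue a URL as a background job.
$client = new GearmanClient();
$client->addServer('127.0.0.1');
$client->doBackground('process_url', json_encode([
    'url'   => 'http://example.org/page.html',
    'crawl' => true,
    'index' => true,
]));

// Worker side: stays tiny - it only hands each job to a separate
// processor script, so the processor can be updated and debugged
// without restarting the worker.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1');
$worker->addFunction('process_url', function (GearmanJob $job) {
    passthru('php process.php ' . escapeshellarg($job->workload()));
});
while ($worker->work());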
At first I had two different job queues: One for crawling and one for indexing. Bugs in the crawler script would not influence the indexer and vice versa.
This allowed me to crawl many URLs quickly without the indexing overhead, but also meant I had to fetch each URL twice.
Splitting crawling and indexing also means that the code needs to handle crawled-but-not-indexed and indexed-but-not-crawled cases. I originally did not handle this, which broke data integrity a couple of times.
Now the process script handles both crawling and indexing. This means only one HTTP request, and less code because I don't have to handle different processing states when updating the Elasticsearch documents.
The indexer itself fetches the HTML and then throws away all navigation, header and footer areas. Only the content as indicated by the microformats 2 e-content class is used if it's available.
Then headings, page title, text, keywords/tags and author information are extracted and stored.
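The extraction idea, sketched in PHP (not phinde's actual code, just the approach):

<?php
// Prefer the microformats2 e-content element; fall back to the whole
// <body> if the page does not carry such markup.
$html = file_get_contents('http://example.org/blog-post.html');
$doc = new DOMDocument();
@$doc->loadHTML($html); // real-world markup triggers warnings
$xpath = new DOMXPath($doc);

$nodes = $xpath->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' e-content ')]"
);
$content = $nodes->length > 0
    ? $nodes->item(0)->textContent
    : $doc->getElementsByTagName('body')->item(0)->textContent;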
The Elasticsearch head plugin is very useful for inspecting the index.
I used Bootstrap for CSS because I'm bad at layout, and Twig as the templating engine because it has a nice syntax.
The home page only has a search slot and not much more; the search result page shows the result document's title, an excerpt of the content containing the search terms, as well as the author.
On the right side sort buttons and facet filters are shown, always depending on the actual result set. Elasticsearch's aggregation feature makes that easy.
Despite the size of the chat log corpus with 900k documents, querying Elasticsearch only takes milliseconds.
I took special care of the pager and will publish a blog post with the full details of the design considerations.
Whenever a blog post is published, the search engine needs to index it. At first I triggered indexing manually, then I had a cronjob that checked my blog's Atom feed every hour. Neither approach is ideal.
Luckily we have WebSub (formerly known as PubSubHubbub), which defines a protocol for notifications on the web. My blog already sends out notifications to interested parties via my hub at phubb.cweiske.de, which means that feed readers with WebSub support already get instantly notified about new posts.
I decided to build WebSub subscriptions into phinde, and today blog posts get indexed immediately when my blog sends out update notifications.
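A WebSub subscription is a single form-encoded POST to the hub; roughly like this (all URLs and the callback script are placeholders, not the actual phinde code):

<?php
// Ask the hub to notify our callback whenever the topic (the blog's
// feed) changes. The hub verifies the callback with a GET request
// carrying a hub.challenge before the subscription becomes active.
$ch = curl_init('https://hub.example.org/');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query([
    'hub.mode'     => 'subscribe',
    'hub.topic'    => 'https://example.org/blog/feed.xml',
    'hub.callback' => 'https://search.example.org/websub-callback.php',
]));
curl_exec($ch);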
Since 2016-02, every page on my blog has a small search box. It takes you to search.cweiske.de with the site-specific filter set to cweiske.de/tagebuch/. It provides a button to remove the site-specific filter, which then queries all indexed pages.
Since 2016-11 a second instance is running at indiechat.search.cweiske.de. It has a corpus of ~900,000 documents and lets you search every line ever posted in the indieweb IRC channels.
I'm very pleased with the results, given that the effort it took to implement it was small - thanks to the awesome libraries and Elasticsearch, which does all the hard storage + search work.
phinde is licensed under the AGPL v3 or later and is available at git.cweiske.de/phinde.git and mirrored at github.
Published on 2016-12-08 in indieweb, php, server, web
Over the last weeks I have been working on shpub, a Micropub client for the shell. It allows you to publish blog posts, replies/comments and likes from the shell or programmatically.
I wrote it because I needed a way to archive Instagram posts to a self-hosted blog. The internet-hater Instagram requires one to get approval for API clients, and only approves those that fall into a very narrow set of categories.
But thanks to the great new all-is-JavaScript world, Instagram data does not have to be scraped at all - simply adding ?__a=1 to one of its URLs gives you the data as JSON. Downloading all the data was not a problem anymore; I only had to find a way to put the photos and videos into a blog.
I first experimented with wp-cli, the WordPress command line interface. It kind of worked, but the theme didn't look nice, and there were some limitations I could not get around. Known on the other hand had all the features I needed, a nice-looking layout - and no API of its own, but a Micropub endpoint.
Micropub is a protocol for creating, updating and deleting all types of content on a server: Blog posts, replies/comments, likes, bookmarks, event reservations and more. It's backed by the W3C and currently in Draft status.
The goal is to have a standardized API to post content to your website, and you may use the client that's most suited for the job.
Currently we have generic clients like Quill, feed readers like Woodwind that have in-built commenting support, very specific ones like the Pushup-counter iOS app, an XMPP bot and more. See the Micropub client list for more information.
There are even services that act as Micropub clients. For example, OwnYourGram instantly posts your own Instagram images to your blog. OwnYourCheckin does the same for Foursquare checkins.
On the other hand, there are micropub endpoints that act as proxy for other websites: silo.pub allows you to use a Micropub client to write comments on Github, Facebook or Twitter.
I needed a way to send Micropub requests to the Known instance, and there was no tool for it. So I sat down and wrote shpub with the goal to make a command line interface I can use from my instagram2micropub script.
During the process I learned a lot, found many bugs in Known's Micropub endpoint and in the WordPress Micropub endpoint, and got to know Known's internals.
After some documentation work, the IndieWeb wiki now has a comparison of Micropub servers and Micropub clients.
shpub was the second Micropub client to support Media endpoints, and is - as far as I know - the only one that supports updates. It's almost feature-complete and works fine with instagram2micropub.
Get it from its homepage or github.
Published on 2016-09-23 in indieweb, php