Search results for: “google”

  • Laszlo Bock on Building Google’s Culture

    Much has been written about what makes Google work so well: their ridiculously profitable advertising business model, the technology behind their search engine and data centers, and the amazing pay and perks they offer.


    My experiences investing in and working with startups, however, have taught me that building a great company is usually less about a specific technical or business model innovation than about building a culture of continuous improvement and innovation. To try to get some insight into how Google does things, I picked up Google SVP of People Operations Laszlo Bock’s book Work Rules!

    Bock describes a Google culture rooted in principles that came from founders Larry Page and Sergey Brin when they started the company: get the best people to work for you, make them want to stay and contribute, and remove barriers to their creativity. What’s great (to those interested in company building) is that Bock goes on to detail the practices Google has put in place to try to live up to these principles even as their headcount has expanded.

    The core of Google’s culture boils down to four basic principles and much of the book is focused on how companies should act if they want to live up to them:

    1. Presume trust: Many of Google’s cultural norms stem from a view that people are well-intentioned and trustworthy. While that may not seem so radical, this manifested at Google as a level of transparency with employees and a bias to say yes to employee suggestions that most companies are uncomfortable with. It raises interesting questions about why companies that say their talent is the most important thing treat them in ways that suggest a lack of trust.
    2. Recruit the best: Many an exec pays lip service to this, but what Google has done is institute policies that run counter to standard recruiting practices to try to actually achieve this at scale: templatized interviews / forms (to make the review process more objective and standardized), hiring decisions made by cross-org committees (to ensure a consistently high bar is set), and heavy use of data to track the effectiveness of different interviewers and interview tactics. While there’s room to disagree about whether these are the best policies (I can imagine hating this as a hiring manager trying to staff up a team quickly), what I admired is that they set a goal (to hire the best at scale) and have actually thought through the recruiting practices they need to do so.
    3. Pay fairly [means pay unequally]: While many executives would agree with the notion that superstar employees can be 2-10x more productive, few companies actually compensate their superstars 2-10x more. While it’s unclear to me how effective Google is at rewarding superstars, the fact that they’ve tried to align their pay policies with their beliefs on how people perform is another great example of deviating from the norm (this time in terms of compensation) to follow through on their desire to pay fairly.
    4. Be data-driven: Another “in vogue” platitude amongst executives, but one that very few companies live up to, is around being data-driven. In reading Bock’s book, I was constantly drawing parallels between the experimentation, data collection, and analyses his People Operations team carried out and the types of experiments, data collection, and analyses you would expect a consumer internet/mobile company to do with their users. Case in point: Bock’s team experimented with different performance review approaches and even cafeteria food offerings in the same way you would expect Facebook to experiment with different news feed algorithms and notification strategies. It underscores the principle that, if you’re truly data-driven, you don’t just selectively apply it to how you conduct business, you apply it everywhere.

    Of course, not every company is Google, and not every company should have the same set of guiding principles or will come to the same conclusions. Some of the processes that Google practices are impractical elsewhere (e.g., experimentation is harder to set up and draw conclusions from at much smaller companies, and not all professions have such wide variations in output as to drive such wide variations in pay).

    What Bock’s book highlights, though, is that companies should be thoughtful about what sort of cultural principles they want to follow and what policies and actions that translates into if they truly believe them. I’d highly recommend the book!

  • Geothermal data centers

    The data centers that power AI and cloud services are limited by 3 things:

    • the server hardware (oftentimes limited by access to advanced semiconductors)
    • available space (their footprint is massive which makes it hard to put them close to where people live)
    • availability of cheap & reliable (and, generally, clean) power

    If you, as a data center operator, can tap a new source of cheap & reliable power, you will go very far as you alleviate one of the main constraints on the ability to add to your footprint.

    It’s no surprise, then, that Google is willing to explore partnerships with next-gen geothermal startups like Fervo in a meaningful, long-term fashion.


  • The IE6 YouTube conspiracy

    An oldie but a goodie — the story of how the YouTube team, post-Google acquisition, put up a “we won’t support Internet Explorer 6 in the future” message without any permission from anyone. (HT: Eric S)


    A Conspiracy to Kill IE6
    Chris Zacharias

  • NVIDIA to make custom AI chips? Tale as old as time

    Every standard products company (like NVIDIA) eventually gets lured by the prospect of gaining large volumes and high margins of a custom products business.

    And every custom products business wishes they could get into standard products to cut their dependency on a small handful of customers and pursue larger volumes.

    Given the above, the fact that NVIDIA effectively used to build custom products (e.g., for game consoles and for some of its dedicated autonomous vehicle and media streamer projects), and the efforts by cloud vendors like Amazon and Microsoft to build their own artificial intelligence silicon, it shouldn’t be a surprise to anyone that NVIDIA is pursuing this.

    Or that they may eventually leave this market behind as well.


  • Selfhosting FreshRSS

    It’s been a few months since I started down the selfhosting/home server journey. Thanks to Docker, it has been relatively smooth sailing. Today, I have a cheap mini-PC based server that:

    • blocks ads / online trackers on all devices
    • stores and streams media (even for when I’m out of the house)
    • acts as network storage (for our devices to store and share files)
    • serves as a personal RSS/newsreader

    The last one is new since my last post and, in the hopes that this helps others exploring what they can selfhost or who already have a home server and want to start deploying services, I wanted to share how I set up FreshRSS, a self-hosted RSS reader, on an OpenMediaVault v6 server.

    Why an RSS Reader?

    Like many who used it, I was a massive Google Reader fan. Until 2013 when it was unceremoniously shut down, it was probably the most important website I used after Gmail.

    I experimented with other RSS clients over the years, but found that I did not like most commercial web-based clients (which were focused on serving ads or promoting feeds I was uninterested in) or desktop clients (which were difficult to sync between devices). So, I switched to other alternatives (i.e. Twitter) for a number of years.

    FreshRSS

    Wanting to return to the simpler days when I could simply follow the content I was interested in, I stumbled on the idea of self-hosting an RSS reader. Browsing the awesome-selfhosted feed reader category, I compared the different options and chose to go with FreshRSS.

    Installation

    To install FreshRSS on OpenMediaVault:

    • If you haven’t already, make sure you have OMV Extras and Docker Compose installed (refer to the section Docker and OMV-Extras in my previous post, you’ll want to follow all 10 steps as I refer to different parts of the process throughout this post) and have a static local IP address assigned to your server.
    • Login to your OpenMediaVault web admin panel, and then go to [Services > Compose > Files] in the sidebar. Press the button in the main interface to add a new Docker compose file.

      Under Name put down FreshRSS and under File, adapt the following (making sure the indentation stays consistent)
      version: "2.1"
      services:
        freshrss:
          container_name: freshrss
          image: lscr.io/linuxserver/freshrss:latest
          ports:
            - <unused port number like 3777>:80
          environment:
            - TZ=America/Los_Angeles
            - PUID=<UID of Docker User>
            - PGID=<GID of Docker User>
          volumes:
            - '<absolute path to shared config folder>/FreshRSS:/config'
          restart: unless-stopped
      You’ll need to replace <UID of Docker User> and <GID of Docker User> with the UID and GID of the Docker user you created (which will be 1000 and 100 if you followed the steps I laid out, see Step 10 in the section “Docker and OMV-Extras” in my initial post)

      I live in the Bay Area so I set the timezone TZ to America/Los_Angeles. You can find yours here.

      Under ports:, make sure to add an unused port number (I went with 3777).

      Replace <absolute path to shared config folder> with the absolute path to the config folder where you want Docker-installed applications to store their configuration information (accessible by going to [Storage > Shared Folders] in the administrative panel).

      Once you’re done, hit Save and you should be returned to your list of Docker compose files for the next step. Notice that the new FreshRSS entry you created has a Down status, showing the container has yet to be initialized.
    • To start your FreshRSS container, click on the new FreshRSS entry and press the (up) button. This will create the container, download any files needed, and run it.

      And that’s it! To prove it worked, go to your-servers-static-ip-address:3777 from a browser that’s on the same network as your server (replacing 3777 if you picked a different port in the configuration above) and you should see the FreshRSS installation page (see below)
    • You can skip this step if you didn’t (as I laid out in my last post) set up Pihole and local DNS / Nginx proxy or if you don’t care about having a user-readable domain name for FreshRSS. But, assuming you do and you followed my instructions, open up WeTTy (which you can do by going to wetty.home in your browser if you followed my instructions or by going to [Services > WeTTY] from OpenMediaVault administrative panel and pressing Open UI button in the main panel) and login as the root user. Run:
      cd /etc/nginx/conf.d
      ls
      Pick out the file you created before for your domains and run
      nano <your file name>.conf
      This opens the file you just listed in the nano text editor. Use your cursor to go to the very bottom of the file and add the following lines (indent however you like, but make sure each directive ends with a semicolon)
      server {
          listen 80;
          server_name <rss.home or the domain you'd like to use>;

          location / {
              proxy_pass http://<your-server-static-ip>:<FreshRSS port number>;
          }
      }
      And then hit Ctrl+X to exit, Y to save, and Enter to overwrite the existing file. Then in the command line run the following to restart Nginx with your new configuration loaded.
      systemctl restart nginx
      Now, if your server sees a request for rss.home (or whichever domain you picked), it will direct them to FreshRSS.

      Login to your Pihole administrative console (you can just go to pi.hole in a browser) and click on [Local DNS > DNS Records] from the sidebar. Under the section called Add a new domain/IP combination, fill out under Domain: the domain you just added above (i.e. rss.home) and next to IP Address: you should add your server’s static IP address. Press the Add button and it will show up below.

      To make sure it all works, enter the domain you just added (rss.home if you went with my default) in a browser and you should see the FreshRSS installation page.
    • Completing installation is easy. Thanks to the use of Docker, the PHP environment and file permissions will already be configured correctly, so you should be able to proceed with the default options. Unless you’re planning to store millions of articles served to dozens of people, the default option of SQLite as database type should be sufficient in Step 3 (see below)


      This leaves the final task of configuring a username and password (and, again, unless you’re serving this to many users whom you’re worried will hack you, the default authentication method of Web form will work)


      Finally, press Complete installation and you will be taken to the login page:

    Advice

    Once you’ve logged in with the username and password you just set, the world is your oyster. If you’ve ever used an RSS reader, the interface is pretty straightforward, but the key is to use the Subscription management button in the interface to add RSS feeds and categories as you see fit. FreshRSS will, on a regular basis, look for new content from those feeds and put it in the main interface. You can then step through and stay up to date on the sites that matter to you. There are a lot more features you can learn about from the FreshRSS documentation.

    On my end, I’d recommend a few things:

    • How to find the RSS feed for a page — Many (but not all) blog/news pages have RSS feeds. The most reliable way to find one is to right click on the page you’re interested in from your browser and select View source (on Chrome you’d hit Ctrl+U). Hit Ctrl+F to trigger a search and look for rss. If there is an RSS feed, you’ll see something that says "application/rss+xml" and near it will usually be a URL that ends in /rss or /feed or something like that (my blog, for instance, hosted on benjamintseng.com, has a feed at benjamintseng.com/rss). There’s also a short script sketch after this list if you’d rather automate the lookup.
      • Once you open up the feed, copy its URL into the Subscription management interface in FreshRSS to start following it.
    • Learn the keyboard shortcuts — they’re largely the same as found on Gmail (and the old Google Reader) but they make using this much faster:
      • j to go to the next article
      • k to go to the previous article
      • r to toggle if something is read or not
      • v to open up the original page in a new tab
    • Use the normal view, sorted oldest first — (you do this by tapping the Settings gear in the upper-right of the interface and then selecting Reading under Configuration in the menu). Even though I’ve aggressively curated the feeds I subscribe to, there is a lot of material and the “normal view” allows me to quickly browse headlines to see which ones are more worth my time at a glance. I can also use my mouse to selectively mark some things as read so I can take a quick Inbox Zero style approach to my feeds. This allows me to think of the j shortcut as “move forward in time” and the k shortcut as “move backwards” and I can use the pulldown menu next to the Mark as read button to mark content older than one day / one week as read if I get overwhelmed.
    • Subscribe to good feeds — probably a given, but a thoughtfully curated set of feeds is what makes the reader worthwhile.
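
    And since I mentioned a script sketch above: if you’d rather not dig through page source by hand, the feed discovery step can be automated. The snippet below is just an illustrative sketch; it assumes the Python requests and beautifulsoup4 packages, which aren’t needed for anything else in this post.

      import requests
      from bs4 import BeautifulSoup

      def find_feed(page_url: str):
          # fetch the page and look for the <link> tag that advertises an RSS or Atom feed
          html = requests.get(page_url, timeout=10).text
          soup = BeautifulSoup(html, 'html.parser')
          link = (soup.find('link', type='application/rss+xml')
                  or soup.find('link', type='application/atom+xml'))
          return link.get('href') if link else None

      # prints the advertised feed URL (note: it may be relative to the site's domain)
      print(find_feed('https://benjamintseng.com'))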

    I hope this helps you get started!

    (If you’re interested in how to setup a home server on OpenMediaVault or how to self-host different services, check out all my posts on the subject)

  • Pixel’s Parade of AI

    I am a big Google Pixel fan, being an owner and user of multiple Google Pixel line products. As a result, I tuned in to the recent MadeByGoogle stream. While it was hard not to be impressed with the demonstrations of Google’s AI prowess, I couldn’t help but be a little baffled…

    What was the point of making everything AI-related?

    Given how low Pixel’s market share is in the smartphone market, you’d think the focus ought to be on explaining why “normies” should buy the phone or find the price tag compelling, but instead every feature had to tie back to AI in some way.

    Don’t get me wrong, AI is a compelling enabler of new technologies. Some of the call and photo functionalities are amazing, both as technological demonstrations but also in terms of pure utility for the user.

    But, every product person learns early that customers care less about how something gets done and more about whether the product does what they want it to. And, as someone who very much wants a meaningful rival to Apple and Samsung, I hope Google doesn’t forget that either.


  • Setting Up Pihole, Nginx Proxy, and Twingate with OpenMediaVault

    I recently shared how I set up a (OpenMediaVault) home server on a cheap mini-PC. After posting it, I received a number of suggestions that inspired me to make a few additional tweaks to improve the security and usability of my server.

    Read more if you’re interested in setting up (on an OpenMediaVault v6 server):

    • Pihole, a “DNS filter” that blocks ads / trackers
    • using Pihole as a local DNS server to have custom web addresses for software services running on your network and Nginx to handle port forwarding
    • Twingate (a better alternative to opening up a port and setting up Dynamic DNS to grant secure access to your network)

    Pihole

    Pihole is a lightweight local DNS server (it gets its name from the Raspberry Pi, a <$100 device popular with hobbyists, that it can run fully on).

    A DNS (or Domain Name Server) converts human readable addresses (like www.google.com) into IP addresses (like 142.250.191.46). As a result, every piece of internet-connected technology is routinely making DNS requests when using the internet. Internet service providers typically offer their own DNS servers for their customers. But, some technology vendors (like Google and CloudFlare) also offer their own DNS services with optimizations on speed, security, and privacy.

    A home-grown DNS server like Pihole can layer additional functionality on top:

    • DNS “filter” for ad / tracker blocking: Pihole can be configured to return dummy IP addresses for specific domains. This can be used to block online tracking or ads (by blocking the domains commonly associated with those activities). While not foolproof, one advantage this approach has over traditional ad blocking software is that, because this blocking happens at the network level, the blocking extends to all devices on the network (such as internet-connected gadgets, smart TVs, and smartphones) without needing to install any extra software.
    • DNS caching for performance improvements: In addition to the performance gains from blocking ads, Pihole also boosts performance by caching commonly requested domains, reducing the need to “go out to the internet” to find a particular IP address. While this won’t speed up a video stream or download, it will make content from frequently visited sites on your network load faster by skipping that internet lookup step.
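
    To make the mechanics concrete: a DNS lookup is just a question (“what IP address does this name map to?”) answered by whichever resolver your device is configured to use, and that resolver is exactly what Pihole replaces. A tiny illustrative Python sketch (not part of the setup itself; the blocked-domain behavior assumes Pihole’s default blocking mode):

      import socket

      # ask the system's configured resolver (your Pihole, once the network is set up)
      # to translate a human readable name into an IP address
      print(socket.gethostbyname('www.google.com'))   # e.g. 142.250.191.46

      # a domain on Pihole's blocklist would instead typically resolve to the null
      # address 0.0.0.0, so the ad/tracker request never leaves your network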

    To install Pihole using Docker on OpenMediaVault:

    • If you haven’t already, make sure you have OMV Extras and Docker Compose installed (refer to the section Docker and OMV-Extras in my previous post) and have a static local IP address assigned to the server.
    • Login to your OpenMediaVault web admin panel, go to [Services > Compose > Files], and press the add button. Under Name put down Pihole and under File, adapt the following (making sure the indentation stays consistent)
      version: "3"
      services:
        pihole:
          container_name: pihole
          image: pihole/pihole:latest
          ports:
            - "53:53/tcp"
            - "53:53/udp"
            - "8000:80/tcp"
          environment:
            TZ: 'America/Los_Angeles'
            WEBPASSWORD: '<Password for the web admin panel>'
            FTLCONF_LOCAL_IPV4: '<your server IP address>'
          volumes:
            - '<absolute path to shared config folder>/pihole:/etc/pihole'
            - '<absolute path to shared config folder>/dnsmasq.d:/etc/dnsmasq.d'
          restart: unless-stopped
      You’ll need to replace <Password for the web admin panel> with the password you’ll want to use to access the Pihole web configuration interface, <your server IP address> with the static local IP address for your server, and <absolute path to shared config folder> with the absolute path to the config folder where you want Docker-installed applications to store their configuration information (accessible by going to [Storage > Shared Folders] in the administrative panel).

      I live in the Bay Area so I set the timezone TZ to America/Los_Angeles. You can find yours here.

      Under Ports, I’ve kept the port 53 reservation (as this is the standard port for DNS requests) but I’ve chosen to map the Pihole administrative console to port 8000 (instead of the default port 80, to avoid a conflict with the OpenMediaVault admin panel default). Note: This will prevent you from using Pihole’s default pi.hole domain as a way to get to the Pihole administrative console out-of-the-box. Because standard web traffic goes to port 80 (and this configuration has Pihole listening at port 8000), pi.hole would likely just direct you to the OpenMediaVault panel. While you could let pi.hole take over port 80, you would need to move OpenMediaVault’s admin panel to a different port (which itself has complexity). I ultimately opted to keep OpenMediaVault at port 80, knowing that I could configure Pihole and the Nginx proxy (see below) to redirect pi.hole to the right port.

      You’ll notice this configures two volumes, one for dnsmasq.d, which is the DNS service, and one for pihole which provides an easy way to configure dnsmasq.d and download blocklists.

      Note: the above instructions assume your home network, like most, is IPv4 only. If you have an IPv6 network, you will need to add an IPv6: True line under environment: and replace the FTLCONF_LOCAL_IPV4:'<server IPv4 address>' with FTLCONF_LOCAL_IPV6:'<server IPv6 address>'. For more information, see the official Pihole Docker instructions.

      Once you’re done, hit Save and you should be returned to your list of Docker compose files for the next step. Notice that the new Pihole entry you created has a Down status, showing the container has yet to be initiated.
    • Disabling systemd-resolved: Most modern Linux operating systems include a built-in DNS resolver that listens on port 53 called systemd-resolved. Prior to initiating the Pihole container, you’ll need to disable this to prevent that port conflict. Use WeTTy (refer to the section Docker and OMV-Extras in my previous post) or SSH to login as the root user to your OpenMediaVault command line. Enter the following command:
      nano /etc/systemd/resolved.conf
      Look for the line that says #DNSStubListener=yes and replace it with DNSStubListener=no, making sure to remove the # at the start of the line. (Hit Ctrl+X to exit, Y to save, and Enter to overwrite the file). This configuration will tell systemd-resolved to stop listening to port 53.

      To complete the configuration change, you’ll need to update the /etc/resolv.conf symlink so it points at the full resolver configuration that systemd-resolved generates (instead of its stub listener) by running:
      sh -c 'rm /etc/resolv.conf && ln -s /run/systemd/resolve/resolv.conf /etc/resolv.conf'
      Now all that remains is to restart systemd-resolved:
      systemctl restart systemd-resolved
    • How to start / update / stop / remove your Pihole container: You can manage all of your Docker Compose files by going to [Services > Compose > Files] in the OpenMediaVault admin panel. Click on the Pihole entry (which should turn it yellow) and press the  (up) button. This will create the container, download any files needed, and, if you properly disabled systemd-resolved in the last step, initiate Pihole.

      And that’s it! To prove it worked, go to your-server-ip:8000 in a browser and you should see the login for the Pihole admin webpage (see below).

      From time to time, you’ll want to update the container. OMV makes this very easy. Every time you press the  (pull) button in the [Services > Compose > Files] interface, Docker will pull the latest version (maintained by the Pihole team).

    Now that you have Pihole running, it is time to enable and configure it for your network.

    • Test Pihole from a computer: Before you change your network settings, it’s a good idea to make sure everything works.
      • On your computer, manually set your DNS service to your Pihole by putting in your server IP address as the address for your computer’s primary DNS server (Mac OS instructions; Windows instructions; Linux instructions). Be sure to leave any alternate / secondary addresses blank (many computers will issue DNS requests to every server they have on their list and if an alternative exists you may not end up blocking anything).
      • (Temporarily) disable any ad blocking service you may have on your computer / browser you want to test with (so that this is a good test of Pihole as opposed to your ad blocking software). Then try to go to https://consumerproductsusa.com/ — this is a URL that is blocked by default by Pihole. If you see a very spammy website promising rewards, either your Pihole does not work or you did not configure your DNS correctly.
      • Finally login to the Pihole configuration panel (your-server-ip:8000) using the password you set up during installation. From the dashboard click on the Queries Blocked box at the top (your colors may vary but it’s the red box on my panel, see below).

        On the next screen, you should see the domain consumerproductsusa.com next to the IP address of your computer, confirming that the address was blocked.

        You can now turn your ad blocking software back on!
      • You should now set the DNS service on your computer back to “automatic” or “DHCP” so that it will inherit its DNS settings from the network/router (and especially if this is a laptop that you may use on another network).
    • Configure DNS on router: Once you’ve confirmed that the Pihole service works, you should configure the default DNS settings on your router to make Pihole the DNS service for your entire network. The instructions for this will vary by router manufacturer. If you use Google Wifi as I do, here are the instructions.

      Once this is completed, every device which inherits DNS settings from the router will now be using Pihole for their DNS requests.

      Note: one downside of this approach is that the Pihole becomes a single point of failure for the entire network. If the Pihole crashes or fails, for any reason, none of your network’s DNS requests will go through until the router’s settings are changed or the Pihole becomes functional again. Pihole generally has good reliability so this is unlikely to be an issue most of the time, but I am currently using Google’s DNS as a fallback on my Google Wifi (for the times when something goes awry with my server) and I would also encourage you to know how to change the DNS settings for your router in case things go bad so that your access to the internet is not taken out unnecessarily.
    • Configure Pihole: To get the most out of Pihole’s ad blocking functionality, I would suggest three things
      • Select Good Upstream DNS Servers: From the Pihole administrative panel, click on Settings. Then select the DNS tab. Here, Pihole allows you to configure which external DNS services the DNS requests on your network should go to if they aren’t going to be blocked and haven’t yet been cached. I would recommend selecting the checkboxes next to Google and Cloudflare given their reputations for providing fast, secure, and high quality DNS services (and selecting multiple will provide redundancy).
      • Update Gravity periodically: Gravity is the system by which Pihole updates its list of domains to block. From the Pihole administrative panel, click on [Tools > Update Gravity] and click the Update button. If there are any updates to the blocklists you are using, these will be downloaded and “turned on”.
      • Configure Domains to block/allow: Pihole allows administrators to granularly customize the domains to block (blacklist) or allow (whitelist). From the Pihole administrative panel, click on Domains. Here, an admin can add a domain (or a regular expression for a family of domains) to the blacklist (if it’s not currently blocked) or the whitelist (if it currently is) to change what happens when a user on the network accesses the DNS.

        I added whitelist entries for link.axios.com (to let me click through links from the Axios email newsletters I receive) and www.googleadservices.com (to let my wife click through Google-served ads). Pihole also makes it easy to block or allow a domain that a device on your network has already requested: tap on Total Queries from the Pihole dashboard, click on the IP address of the device making the request, and you’ll see every DNS request (including those which were blocked) with a link beside each to add it to the domain whitelist or blacklist.

        Pihole also allows admins to configure different rules for different sets of devices. This is done by calling out clients (click on Clients and pick their IP address / MAC address / hostnames), assigning them to groups (defined by clicking on Groups), and then configuring domain rules for those groups (in Domains). Unfortunately, because Google Wifi proxies DNS requests itself (so they all appear to Pihole as coming from the router) rather than handing the Pihole’s address out to each device, I can only do this for devices that are configured to point directly at the Pihole, but this could be an interesting way to impose parental internet controls.

    Now you have a Pihole network-level ad blocker and DNS cache!
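
    If you later want to re-check the blocking from a script (say, after updating Gravity) instead of repeating the manual browser test above, something along these lines works. It assumes the dnspython package (pip install dnspython), which is not part of the Pihole setup itself, and reuses the consumerproductsusa.com test domain from earlier:

      import dns.resolver

      resolver = dns.resolver.Resolver(configure=False)
      resolver.nameservers = ['<your server IP address>']  # query the Pihole directly

      for domain in ['www.google.com', 'consumerproductsusa.com']:
          try:
              ips = [rr.to_text() for rr in resolver.resolve(domain, 'A')]
              # Pihole's default blocking mode answers blocked domains with 0.0.0.0
              status = 'blocked' if ips == ['0.0.0.0'] else 'ok'
              print(f'{domain}: {ips} ({status})')
          except dns.resolver.NXDOMAIN:
              # some blocking modes return NXDOMAIN instead of a null address
              print(f'{domain}: NXDOMAIN (blocked)')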

    Local DNS and Nginx proxy

    As a local DNS server, Pihole can do more than just block ads. It also lets you create human readable addresses for services running on your network. In my case, I created one for the OpenMediaVault admin panel (omv.home), one for WeTTy (wetty.home), and one for Ubooquity (ubooquity.home).

    If your setup is like mine (all services use the same IP address but different ports), you will need to set up a proxy as DNS does not handle port forwarding. Luckily, OpenMediaVault has Nginx, a popular web server with a performant proxy, built-in. While many online tutorials suggest installing Nginx Proxy Manager, that felt like overkill, so I decided to configure Nginx directly.

    To get started:

    • Configure the A records for the domains you want in Pihole: Login to your Pihole administrative console (your-server-ip:8000) and click on [Local DNS > DNS Records] from the sidebar. Under the section called Add a new domain/IP combination, fill out the Domain: you want for a given service (like omv.home or wetty.home) and the IP Address: (if you’ve been following my guides, this will be your-server-ip). Press the Add button and it will show up below. Repeat for all the domains you want. If you have a setup similar to mine, you will see many domains pointed at the same IP address (because the different services are simply different ports on my server).

      To test if these work, enter any of the domains you just put in to a browser and it should take you to the login page for the OpenMediaVault admin panel (as currently they are just pointing at your server IP address).

      Note 1: while you can generally use whatever domains you want, it is suggested that you don’t use a TLD that could conflict with an actual website (i.e. .com) or that are commonly used by networking systems (i.e. .local or .lan). This is why I used .home for all of my domains (the IETF has a list they recommend, although it includes .lan which I would advise against as some routers such as Google Wifi use this)

      Note 2: Pihole itself automatically tries to forward pi.hole to its web admin panel, so you don’t need to configure that domain. The next step (configuring proxy port forwarding) will allow pi.hole to work.
    • Edit the Nginx proxy configuration: Pihole’s Local DNS server will send users looking for one of the domains you set up (i.e. wetty.home) to the IP address you configured. Now you need your server to forward that request to the appropriate port to get to the right service.

      You can do this by taking advantage of the fact that Nginx, by default, will load any .conf file in the /etc/nginx/conf.d/ directory as a proxy configuration. Pick any file name you want (I went with dothome.conf because all of my service domains end with .home) and after using WeTTy or SSH to login as root, run:
      nano /etc/nginx/conf.d/<your file name>.conf
      The first time you run this, it will open up a blank file. Nginx looks at the information in this file for how to redirect incoming requests. What we’ll want to do is tell Nginx that when a request comes in for a particular domain (i.e. ubooquity.home or pi.hole) that request should be sent to a particular IP address and port.

      Manually writing these configuration files can be a little daunting and, truth be told, the text file I share below is the result of a lot of trial and error, but in general there are 2 types of proxy commands that are relevant for making your domain setup work.

      One is a proxy_pass where Nginx will basically take any traffic to a given domain and just pass it along (sometimes with additional configuration headers). I use this below for wetty.home, pi.hole, ubooquityadmin.home, and ubooquity.home. It worked without the need to pass any additional headers for WeTTy and Ubooquity, but for pi.hole, I had to set several additional proxy headers (which I learned from this post on Reddit).

      The other is a 301 redirect where you tell the client to simply forward itself to another location. I use this for ubooquityadmin.home because the actual URL you need to reach is not / but /admin/ and the 301 makes it easy to setup an automatic forward. I then use the regex match ~ /(.*)$ to make sure every other URL is proxy_pass‘d to the appropriate domain and port.

      You’ll notice I did not include the domain I configured for my OpenMediaVault console (omv.home). That is because omv.home already goes to the right place without needing any proxy to port forward.
      server {
          listen 80;
          server_name pi.hole;

          location / {
              proxy_pass http://<your-server-ip>:8000;
              proxy_set_header Host $host;
              proxy_set_header X-Real-IP $remote_addr;
              proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
              proxy_hide_header X-Frame-Options;
              proxy_set_header X-Frame-Options "SAMEORIGIN";
              proxy_read_timeout 90;
          }
      }

      server {
          listen 80;
          server_name wetty.home;

          location / {
              proxy_pass http://<your-server-ip>:2222;
              proxy_set_header Host $host;
              proxy_set_header X-Real-IP $remote_addr;
              proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
          }
      }

      server {
          listen 80;
          server_name ubooquity.home;

          location / {
              proxy_pass http://<your-server-ip>:2202;
          }
      }

      server {
          listen 80;
          server_name ubooquityadmin.home;

          location = / {
              return 301 http://ubooquityadmin.home/admin;
          }

          location ~ /(.*)$ {
              proxy_pass http://<your-server-ip>:2203/$1;
          }
      }
      If you are using other domains, ports, or IP addresses, adjust accordingly. Be sure all your curly braces have their mates ({}) and that each directive ends with a semicolon (;) or Nginx will fail to start. I use tabs between statements (e.g., between listen and 80) to format them more nicely, but Nginx will accept any number or type of whitespace.

      To test if your new configuration worked, save your changes (hit Ctrl+X to exit, Y to save, and Enter to overwrite the file if you are editing a pre-edited one). In the command line, run the following command to restart Nginx with your new configuration loaded.
      systemctl restart nginx
      Try to login to your OpenMediaVault administrative panel in a browser. If that works, it means Nginx is up and running and you at least didn’t make any obvious syntax errors!

      Next try to access one of the domains you just configured (for instance pi.hole) to test if the proxy was configured correctly.

      If either of those steps failed, use WeTTy or SSH to log back in to the command line and use the command above to edit the file (you can delete everything if you want to start fresh) and rerun the restart command after you’ve made changes to see if that fixes it. It may take a little bit of doing if you have a tricky configuration but once you’re set, everyone on the network can now use your configured addresses to access the services on your network.
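
    One extra troubleshooting trick: because Nginx picks the server block purely off the Host header, you can test the proxy rules without touching Pihole’s DNS at all by sending requests straight to the server’s IP with the desired Host header. A rough sketch, assuming the Python requests package (adjust the domains to whatever you put in your .conf file):

      import requests

      SERVER_IP = '<your-server-ip>'  # the same address used in your proxy_pass lines

      # each request should be routed to a different backend based solely on the Host header
      for domain in ['pi.hole', 'wetty.home', 'ubooquity.home']:
          resp = requests.get(f'http://{SERVER_IP}/', headers={'Host': domain},
                              timeout=10, allow_redirects=False)
          print(f'{domain}: HTTP {resp.status_code}')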

    Twingate

    In my previous post, I set up Dynamic DNS and a Wireguard VPN to grant secure access to the network from external devices (i.e. a work computer, my smartphone when I’m out, etc.). While it worked, the approach had two flaws:

    1. The work required to set up each device for Wireguard is quite involved (you have to configure it on the VPN server and then pass credentials to the device via QR code or file)
    2. It requires me to open up a port on my router for external traffic (a security risk) and maintain a Dynamic DNS setup that is vulnerable to multiple points of failure and could make changing domain providers difficult.

    A friend of mine, after reading my post, suggested I look into Twingate instead. Twingate offers several advantages, including:

    • Simple graphical configuration of which resources should be made available to which devices
    • Easier to use client software with secure (but still easy to use) authentication
    • No need to configure Dynamic DNS or open a port
    • Support for local DNS rules (i.e. the domains I configured in Pihole)

    I was intrigued (it didn’t hurt that Twingate has a generous free Starter plan that should work for most home server setups). To set up Twingate to enable remote access:

    • Create a Twingate account and Network: Go to their signup page and create an account. You will then be asked to set up a unique Network name. The resulting address, <yournetworkname>.twingate.com, will be your Network configuration page from where you can configure remote access.
    • Add a Remote Network: Click the Add button on the right-hand-side of the screen. Select On Premise for Location and enter any name you choose (I went with Home network).
    • Add Resources: Select the Remote Network you just created (if you haven’t already) and use the Add Resource button to add an individual domain name or IP address and then grant access to a group of users (by default, it will go to everyone).

      With my configuration, I added 5 domains (pi.hole + the four .home domains I configured through Pihole) and 1 IP address (for the server, to handle the ubooquityadmin.home forwarding and in case there was ever a need to access an additional service on my server that I had not yet created a domain for).
    • Install Connector Docker Container: Making the selected network resources available through Twingate requires installing a Twingate Connector on something internet-connected inside the network.

      Press the Deploy Connector button on one of the connectors on the right-hand-side of the Remote Network page (mine is called flying-mongrel). Select Docker in Step 1 to get Docker instructions (see below). Then press the Generate Tokens button under Step 2 to create the tokens that you’ll need to link your Connector to your Twingate network and resources.

      With the Access Token and Refresh Token saved, you are ready to set up the Docker container. Login to the OpenMediaVault administrative panel, go to [Services > Compose > Files], and press the add button. Under Name put down Twingate Connector and under File, enter the following (making sure the indentation stays consistent)
      services:
        twingate_connector:
          container_name: twingate_connector
          restart: unless-stopped
          image: "twingate/connector:latest"
          environment:
            - SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt
            - TWINGATE_API_ENDPOINT=/connector.stock
            - TWINGATE_NETWORK=<your network name>
            - TWINGATE_ACCESS_TOKEN=<your connector access token>
            - TWINGATE_REFRESH_TOKEN=<your connector refresh token>
            - TWINGATE_LOG_LEVEL=7
      You’ll need to replace <your network name> with the name of the Twingate network you created, <your connector access token> and <your connector refresh token> with the access token and refresh token generated from the Twingate website. Do not add any single or double quotation marks around the network name or the tokens as they will result in a failed authentication with Twingate (as I was forced to learn through experience).

      Once you’re done, hit Save and you should be returned to your list of Docker compose files. Click on the entry for Twingate Connector you just created and then press the  (up) button to initialize the container.

      Go back to your Twingate network page and select the Remote Network your Connector is associated with. If you were successful, within a few moments, the Connector’s status will reflect this (see below for the before and after).

      If, after a few minutes there is still no change, you should check the container logs. This can be done by going to [Services > Compose > Services] in the OpenMediaVault administrative panel. Select the Twingate Connector container and press the (logs) button in the menubar. The TWINGATE_LOG_LEVEL=7 setting in the Docker configuration file sets the Twingate Connector to report all activities in great detail and should give you (or a helpful participant on the Twingate forum) a hint as to what went wrong.
    • Add Users and Install Clients: Once the configuration is done and the Connector is set up, all that remains is to add user accounts and install the Twingate client software on the devices that should be able to access the network resources.

      Users can be added (or removed) by going to your Twingate network page and clicking on the Team link in the menu bar. You can Add User (via email) or otherwise customize Group policies. Be mindful of the Twingate Starter plan limit to 5 users…

      As for the devices, the client software can be found at https://get.twingate.com/. Once installed, to access the network, the user will simply need to authenticate.
    • Remove my old VPN / Dynamic DNS setup. This is not strictly necessary, but if you followed my instructions from before, you can now undo those by:
      • Closing the port you opened from your Router configuration
      • Disabling Dynamic DNS setup from your domain provider
      • “Down”-ing and deleting the container and configuration file for DDClient (you can do this by going to [Services > Compose > Files] from OpenMediaVault admin panel)
      • Deleting the configured Wireguard clients and tunnels (you can do this by going to [Services > Wireguard] from the OpenMediaVault admin panel) and then disabling the Wireguard plugin (go to [System > Plugins])
      • Removing the Wireguard client from my devices

    And there you have it! A secure means of accessing your network while retaining your local DNS settings and avoiding the pitfalls of Dynamic DNS and opening a port.

    Resources

    A number of outside resources were very helpful in configuring the above.

    (If you’re interested in how to setup a home server on OpenMediaVault or how to self-host different services, check out all my posts on the subject)

  • Why Thread is Matter’s biggest problem right now

    Stop me if you’ve heard this one before… Adoption of a technology is being impeded by too many standards. The solution? A new standard, of course, and before you know it, we now have another new standard to deal with.

    The smart home industry needs to figure out how to properly embrace Thread (and Matter). It (or something like it) will be necessary for broader smart home / Internet of Things adoption.


    Why Thread is Matter’s biggest problem right now
    Jennifer Pattison Tuohy | The Verge

  • Building an AI-powered Metasearch Concept

    Summary

    I built a metasearch engine that:

    • Uses an LLM (OpenAI GPT3.5) to (1) interpret the search intent based on a user supplied topic and then (2) generate service-specific search queries to execute to get the best results
    • Shows results from Reddit, Wikipedia, Unsplash, and Podcast episodes (search powered by Taddy)
    • Surfaces relevant images from a set of crawled images using a vector database (Pinecone) populated with CLIP embeddings
    • Was implemented in a serverless fashion using Modal.com

    The result (using the basic/free tier of many of the connected services) is accessible here (Github repository). While functional, it became apparent during testing that this approach has major limitations, in particular the latency from chaining LLM responses and the dependence on search quality from the respective services. I conclude by discussing some potential future directions.

    Motivation

    Many articles have been written about the decline in Google’s search result quality and the popularity of attempts to fix this by pushing Google to give results from Reddit. This has even resulted in Google attempting to surface authoritative blog/forum-based results in its search results.

    Large language models (LLMs) like OpenAI’s GPT have demonstrated remarkable versatility in handling language and “reasoning” problems. While at Stir, I explored the potential to utilize a large language model as a starting point for metasearch — one that would employ an LLM’s ability to interpret a user query and convert it to queries which would fulfill the user’s intent while also working well with other services (i.e., Reddit, image vector databases, etc.).

    Serverless

    To simplify administration, I employed a serverless implementation powered by Modal.com. Unlike many other serverless technology providers, Modal’s implementation is deeply integrated into Python, making it much easier to:

    1. Define the Python environment needed for an application
    2. Pass data between routines
    3. Create web endpoints
    4. Deploy and test locally
    Define the Python environment needed for an application

    Modal makes it extremely easy to set up a bespoke Python environment / container image through its Image API. For my application’s main driver, which required the installation of openai, requests, and beautifulsoup4, this was achieved in two lines of code:

    image = Image.debian_slim(python_version='3.10') \
                .pip_install('openai', 'requests', 'beautifulsoup4')
    stub = Stub('chain-search', image=image)

    Afterwards, functions that are to be run serverless-ly are wrapped in Modal’s Python function decorators (stub.function). These decorators take arguments which allow the developer to configure runtime behavior and pass secrets (like API keys) as environment variables. For example, the first few lines of a function that engages Reddit’s search API:

    @stub.function(secret=Secret.from_name('reddit_secret'))
    def search_reddit(query: str):
        import requests
        import base64 
        import os 
    
        reddit_id = os.environ['REDDIT_USER']
        user_agent = os.environ['REDDIT_AGENT']
        reddit_secret = os.environ['REDDIT_KEY']
        ...
        return results

    Modal also provides a means to intelligently prefetch data for an AI/ML-serving function. The Python class decorator stub.cls can wrap any arbitrary class that defines an initiation step method (__enter__) as well as the actual function logic. In this way, a still-warm Modal container that is invoked an additional time need not re-initialize variables or re-fetch data, as it already did so during initiation.

    Take for instance the following class which (1) loads the SentenceTransformer model stored at cache_path and initializes a connection to a remote Pinecone vector database during initiation and (2) defines a query function which takes a text string, runs it through self.model, and passes it through self.pinecone_index.query:

    # use Modal's class entry trick to speed up initiation
    @stub.cls(secret=Secret.from_name('pinecone_secret'))
    class TextEmbeddingModel:
        def __enter__(self):
            import sentence_transformers
            model = sentence_transformers.SentenceTransformer(cache_path, 
                                                              device='cpu')
            self.model = model 
    
            import pinecone
            import os 
            pinecone.init(api_key=os.environ['PINECONE_API_KEY'], 
                          environment=os.environ['PINECONE_ENVIRONMENT'])
            self.pinecone_index = pinecone.Index(os.environ['PINECONE_INDEX'])
        
        @method()
        def query(self, query: str, num_matches = 10):
            # embed the query 
            vector = self.model.encode(query)
    
            # run the resulting vector through Pinecone
            pinecone_results = self.pinecone_index.query(vector=vector.tolist(), 
                                       top_k=num_matches, 
                                       include_metadata=True
                                       )
            ...
            return results
    
    Transparently Pass data between routines

    Once you’ve deployed a function to Modal, it becomes an invokable remote function via Modal’s Function API. For example, the code below lives in a completely different file from the TextEmbeddingModel class defined above.

    pinecone_query = Function.lookup('text-pinecone-query',
                                     'TextEmbeddingModel.query')

    The remote function can then be called using Function.remote

    return pinecone_query.remote(response[7:], 7)

    The elegance of this is that the parameters (response[7:] and 7) and the result are passed transparently in Python without any need for a special API, allowing your Python code to call upon remote resources at a moment’s notice.

    This ability to seamlessly work with remote functions also makes it possible to invoke the same function several times in parallel. The following code takes the queries generated by the LLM (responses — a list of 6-10 queries) and runs them in parallel (via parse_response) through Function.map, which aggregates the results at the end. In a single line of code, as many as 10 separate workers could be acting in parallel (in otherwise completely synchronous Python code)!

    responses = openai_chain_search.remote(query)
    results = parse_response.map(responses)
    Create Web Endpoints

    To make a function accessible through a web endpoint, simply add a web_endpoint Python function decorator. This turns the functions into FastAPI endpoints, removing the need to embed a web server like Flask. This makes it easy to create API endpoints (that return JSON) as well as full web pages & applications (that return HTML and the appropriate HTTP headers).

    from fastapi.responses import HTMLResponse
    
    @stub.function()
    @web_endpoint(label='metasearch')
    def web_search(query: str = None):
        html_string = "<html>"
        ...
        return HTMLResponse(html_string)
    Deploy and test locally

    Finally, Modal has a simple command line interface that makes it extremely easy to deploy and test code. modal deploy <Python file> deploys the serverless functions / web endpoints in the file to the cloud, modal run <Python file> runs a specific function locally (while treating the rest of the code as remote functions), and modal serve <Python file> deploys the web endpoints to a private URL which automatically redeploys every time the underlying Python file changes (to better test a web endpoint).

    Designating a particular function for local running via modal run simply involves applying the stub.local_entrypoint function decorator. This (and modal serve) makes it much easier to test code prior to deployment.

    # local entrypoint to test
    @stub.local_entrypoint()
    def main(query = 'Mountain sunset'):
        results = []
        seen_urls = []
        seen_thumbnails = []
    
        responses = openai_chain_search.remote(query)
        for response in responses:
            print(response)
        ...

    Applying Large Language Model

    Prompting Approach

    The flexibility of large language models makes determining an optimal path for invoking them and processing their output complex and open-ended. My initial efforts focused on engineering a single prompt which would parse a user query and return queries for other services. However, this ran into two limitations. First, the responses were structurally similar to one another even when the queries were wildly different: the LLM would supply queries for every API/platform service available, even when some of the services were irrelevant. Second, the queries themselves were relatively generic; the LLM’s responses did not adapt very well to the individual services (e.g., providing more detail for a podcast search that indexes episode descriptions vs. something more generic for an image database or Wikipedia).

    To boost the “dynamic range”, I turned to a chained approach where the initial LLM invocation would interpret whether or not a user query is best served with image-centric results or text-centric results. Depending on that answer, a follow-up prompt would be issued to the LLM requesting image-centric OR text-centric queries for the relevant services.

    To design the system prompt (below), I applied the Persona pattern and supplied the LLM with example rationales for user query categorization as a preamble for the initial prompt.

    *System Prompt*
    Act as my assistant who's job is to help me understand and derive inspiration around a topic I give you.  Your primary job is to help find the best images, content, and online resources for me. Assume I have entered a subject into a command line on a website and do not have the ability to provide you with follow-up context.
    
    Your first step is to determine what sort of content and resources would be most valuable. For topics such as "wedding dresses" and "beautiful homes" and "brutalist architecture", I am likely to want more visual image content as these topics are design oriented and people tend to want images to understand or derive inspiration. For topics, such as "home repair" and "history of Scotland" and "how to start a business", I am likely to want more text and link content as these topics are task-oriented and people tend to want authoritative information or answers to questions.
    
    *Initial Prompt*
    I am interested in the topic:
    {topic}
    
    Am I more interested in visual content or text and link based content? Select the best answer between the available options, even if it is ambiguous. Start by stating the answer to the question plainly. Do not provide the links or resources. That will be addressed in a subsequent question.

    Reasonably good results were achieved when asking for queries one service at a time (i.e. “what queries should I use for Reddit” then “what queries should I use for Wikipedia”, etc.), but this significantly increased the time to response and cost. I ultimately settled on combining the follow-up requests, creating one to generate text-based queries and another to generate image-based ones. I also applied the Template Pattern to create a response which could be more easily parsed.

    *Text-based query generation prompt*
    You have access to three search engines.
    
    The first will directly query Wikipedia. The second will surface interesting posts on Reddit based on keyword matching with the post title and text. The third will surface podcast episodes based on keyword matching.
    
    Queries to Wikipedia should be fairly direct so as to maximize the likelihood that something relevant will be returned. Queries to the Reddit and podcast search engines should be specific and go beyond what is obvious and overly broad to surface the most interesting posts and podcasts.
    
    What are 2 queries that will yield the most interesting Wikipedia posts, 3 queries that will yield the most valuable Reddit posts, and 3 queries surface that will yield the most insightful and valuable podcast episodes about:
    {topic}
    
    Provide the queries in a numbered list with quotations around the entire query and brackets around which search engine they're intended for (for example: 1. [Reddit] "Taylor Swift relationships". 2. [Podcast] "Impact of Taylor Swift on Music". 3. [Wikipedia] "Taylor Swift albums").
    
    *Image query generation prompt*
    You have access to two search engines.
    
    The first is a set of high quality images mapped to a vector database. There are only about 30,000 images in the dataset so it is unlikely that it can return highly specific image results, so it would be better to use more generic queries and explore a broader range of relevant images.
    
    The second will directly query the free stock photo site Unsplash. There will be a good breadth of photos but the key will be trying to find the highest quality images.
    
    What are 3 great queries to use that will provide good visual inspiration and be different enough from one another so as to provide a broad range of relevant images from the vector database and 3 great queries to use with Unsplash to get the highest quality images on the topic of:
    {topic}
    
    Provide the queries in a numbered list with quotations around the entire query and brackets around which search engine they're intended for (for example: 1. [Vector] "Mountain sunset". 2. [Unsplash] "High quality capture of mountain top at sunset".)

    Invoking OpenAI and Parsing the LLM’s Responses

    I chose OpenAI due to the maturity of their offering and their ability to handle the prompting patterns I used. To integrate OpenAI, I created a Modal function, passing my OpenAI credentials in as environment variables (via a Modal Secret). I then created a list of messages to capture the back-and-forth exchange with the LLM so it could be passed back to the GPT model as history.

    @stub.function(secret=Secret.from_name('openai_secret'))
    def openai_chain_search(query: str):
        import openai 
        import os 
        import re
        model = 'gpt-3.5-turbo' # using GPT 3.5 turbo model
    
        # Pull Open AI secrets
        openai.organization = os.environ['OPENAI_ORG_ID']
        openai.api_key = os.environ['OPENAI_API_KEY']
    
        # message templates with (some) prompt engineering
        system_message = """Act as my assistant whose job ... """
        initial_prompt_template = 'I am interested in the topic:\n{topic} ...'
        text_template = 'You have access to three search engines ...'
        image_template = 'You have access to two search engines ...'
    
        # create context to send to OpenAI
        messages = []
        messages.append({
            'role': 'user',
            'content': system_message
        })
        messages.append({
            'role': 'user',
            'content': initial_prompt_template.format(topic=query)
        })
    
        # get initial response
        response = openai.ChatCompletion.create(
            model=model,
            messages = messages,
            temperature = 1.0
        )
        messages.append({
            'role': 'assistant',
            'content': response['choices'][0]['message']['content']
        })

    To parse the initial prompt’s response, I did a simple text match with the string "text and link". While crude, this worked well in my tests. If the LLM concluded the user query would benefit more from a text-based set of responses, the follow-up text-centric prompt text_template was sent to the LLM. If the LLM concluded the user query would benefit more from images, the follow-up image-centric prompt image_template was sent to the LLM instead.

        if 'text and link' in response['choices'][0]['message']['content']:
            # get good wikipedia, reddit, and podcast queries
            messages.append({
                'role': 'user',
                'content': text_template.format(topic=query)
            })
        else:
            ...
    
            # get good image search queries
            messages.append({
                'role': 'user',
                'content': image_template.format(topic=query)
            })
    
        # make followup call to OpenAI
        response = openai.ChatCompletion.create(
            model=model,
            messages = messages,
            temperature = 1.0
        )

    The Template pattern in these prompts pushes the LLM to return numbered lists with relevant services in brackets. These results were parsed with a simple regular expression and then converted into a list of strings (responses, with each entry in the form "<name of service>: <query>") which would later be mapped to specific service functions. (Not shown in the code below: I also added a Wikipedia query to the image-centric results to improve the utility of the image results).

        responses = [] # aggregate list of actions to take
        ...
        # use regex to parse GPT's recommended queries
        for engine, query in re.findall(r'[0-9]+. \[(\w+)\] "(.*)"', 
                                    response['choices'][0]['message']['content']):
            responses.append(engine + ': ' + query)
    
        return responses

    Building and Querying Image Database

    Crawling

    To supplement the image results from Unsplash, I crawled a popular high-quality image sharing platform for images I could surface as image search results. To do this, I used the browser automation library Playwright. Built for website and web app test automation, it provides very simple APIs for using code to interact with DOM elements in the browser, which allowed me to log in to the image sharing service.

    While I initially used a combination of Playwright (to scroll and then wait for all the images to load) and the HTML/XML parsing Python library BeautifulSoup (to read the DOM) to gather the images from the service, this approach was slow and unreliable. Seeking greater performance, I looked at the calls the browser made to the service’s backend and discovered that the service passed all the data the browser needed to render a page of image results in a JSON blob.

    Leveraging Playwright’s Page.expect_request_finished, Request, and Response APIs, I was able to access that JSON directly and programmatically extract the data I needed (without needing to check what was being rendered in the browser window). This allowed me to quickly and reliably pull the images and their associated metadata from the service.
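
    In rough strokes, that interception pattern looks like the sketch below using Playwright’s sync API; the gallery URL and the '/api/search' filter are hypothetical stand-ins for the service’s actual pages and backend calls, and the real crawler also handles logging in as described above.

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # ... log in to the image service here, as described above ...

        # wait for the backend call that carries the page's image data as a JSON blob
        with page.expect_request_finished(lambda req: '/api/search' in req.url) as request_info:
            page.goto('https://image-service.example.com/gallery')  # hypothetical gallery URL

        response = request_info.value.response()
        data = response.json() if response else None  # image metadata, straight from the backend
        browser.close()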

    Approach

    To make it possible to search the images, I needed to find a way to capture the “meaning” of the images as well as the “meaning” of a text query so that the two could be easily mapped to each other. OpenAI published research on a neural network architecture called CLIP which made this relatively straightforward. Trained on images paired with image captions from the web, CLIP makes it simple to convert both images and text into vectors (a series of numbers) such that the better an image and a piece of text match, the closer their vectors “multiply” (dot product) to 1.0.

    Procedurally then, if you have the vectors for every image you want to search against, and are given a text-based search query, to find the images that best match you need only to:

    1. Convert the query text into a vector using CLIP
    2. Find the image vectors that get the closest to 1.0 when “multiplied/dot product-ed” with the search string vector
    3. Return the images corresponding to those vectors
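
    Concretely, the matching step looks something like the sketch below, which encodes one image and one text query with the same CLIP model used later in this post and scores them with cosine similarity (the dot product of the normalized vectors); mountain.jpg and the query string are just placeholders.

    from PIL import Image
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer('clip-ViT-B-32')               # the same CLIP model used below

    image_vector = model.encode(Image.open('mountain.jpg'))    # image -> vector
    text_vector = model.encode('a mountain peak at sunset')    # text  -> vector

    # cosine similarity (dot product of the normalized vectors); closer to 1.0 = better match
    print(util.cos_sim(image_vector, text_vector))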

    This type of search has been made much easier with vector databases like Pinecone which make it easy for applications to store vector information and query it with a simple API call. The diagram below shows how the architecture works: (1) the crawler running periodically and pushing new metadata & image vectors into the vector database and (2) whenever a user initiates a search, the search query is converted to a vector which is then passed to Pinecone to return the metadata and URLs corresponding to the best matching images.

    How crawler, image search, and Pinecone vector database work together
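
    The crawler side of step (1) boils down to something like the following sketch, reusing the same pinecone-client and SentenceTransformer setup that appears later; the upsert_image helper, its arguments, and the metadata fields here are illustrative stand-ins rather than the crawler’s exact ones.

    import os
    import pinecone
    from PIL import Image
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('clip-ViT-B-32')
    pinecone.init(api_key=os.environ['PINECONE_API_KEY'],
                  environment=os.environ['PINECONE_ENVIRONMENT'])
    index = pinecone.Index(os.environ['PINECONE_INDEX'])

    def upsert_image(image_id: str, image_path: str, page_url: str, thumbnail_url: str):
        vector = model.encode(Image.open(image_path))       # image -> CLIP vector
        index.upsert(vectors=[(
            image_id,                                       # unique ID for the image
            vector.tolist(),                                # the embedding values
            {'url': page_url, 'thumbnail': thumbnail_url}   # metadata returned at query time
        )])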

    Because of the size of the various CLIP models in use, in a serverless setup it’s a good idea to intelligently cache and prefetch the model so as to reduce search latency. To do this in Modal, I defined a helper function (download_models, see below) which instantiates and downloads the CLIP model from the SentenceTransformer package and then caches it to disk (with model.save). This helper function is then passed into the Modal container image creation flow with Image.run_function so that it runs when the container image is built.

    # define Image for embedding text queries and hitting Pinecone
    # use Modal initiation trick to preload model weights
    cache_path = '/pycache/clip-ViT-B-32' # cache for CLIP model 
    def download_models():
        import sentence_transformers 
    
        model_id = 'sentence-transformers/clip-ViT-B-32' # model ID for CLIP
    
        model = sentence_transformers.SentenceTransformer(
            model_id,
            device='cpu'
        )
        model.save(path=cache_path)
    
    image = (
        Image.debian_slim(python_version='3.10')
        .pip_install('sentence_transformers')
        .run_function(download_models)
        .pip_install('pinecone-client')
    )
    stub = Stub('text-pinecone-query', image=image)

    The stub.cls class decorator I shared before then loads the CLIP model from the cache (into self.model) when a container starts, so calls that hit the container again “while warm” (before it shuts down) can reuse the already-loaded model. It also initiates a connection to the appropriate Pinecone index in self.pinecone_index. Note: the code below was shared above in my discussion of intelligent prefetch in Modal.

    # use Modal's class entry trick to speed up initiation
    @stub.cls(secret=Secret.from_name('pinecone_secret'))
    class TextEmbeddingModel:
        def __enter__(self):
            import sentence_transformers
            model = sentence_transformers.SentenceTransformer(cache_path, 
                                                              device='cpu')
            self.model = model 
    
            import pinecone
            import os 
            pinecone.init(api_key=os.environ['PINECONE_API_KEY'], 
                          environment=os.environ['PINECONE_ENVIRONMENT'])
            self.pinecone_index = pinecone.Index(os.environ['PINECONE_INDEX'])

    Invoking the CLIP model is done via model.encode (which also works on image data), and querying Pinecone involves passing the vector and number of matches desired to pinecone_index.query. Note: the code below was shared above in my discussion of intelligent prefetch in Modal

    @stub.cls(secret=Secret.from_name('pinecone_secret'))
    class TextEmbeddingModel:
        def __enter__(self):
            ...
        
        @method()
        def query(self, query: str, num_matches = 10):
            # embed the query 
            vector = self.model.encode(query)
    
            # run the resulting vector through Pinecone
            pinecone_results = self.pinecone_index.query(vector=vector.tolist(), 
                                       top_k=num_matches, 
                                       include_metadata=True
                                       )
            ...
            return results
    

    Assembling the Results

    Diagram outlining how a search query is executed
    Initiating the Service Workers

    When the web endpoint receives a search query, it invokes openai_chain_search. This would, as mentioned previously, have the LLM (1) interpret the user query as image or text-centric and then (2) return parsed service-specific queries.

    responses = openai_chain_search.remote(query)

    The resulting responses are handled by parse_response, a function which identifies the service and triggers the appropriate service-specific function (to handle each service’s quirks). Those functions return Python dictionaries with a consistent set of keys used to construct the search results (e.g. query, source, subsource, subsource_url, url, title, thumbnail, and snippet).

    # function to map against response list
    @stub.function()
    def parse_response(response: str):
        pinecone_query = Function.lookup('text-pinecone-query', 'TextEmbeddingModel.query')
    
        if response[0:11] == 'Wikipedia: ':
            return search_wikipedia.remote(response[11:])
        elif response[0:8] == 'Reddit: ':
            return search_reddit.remote(response[8:])
        elif response[0:9] == 'Podcast: ':
            return search_podcasts.remote(response[9:])
        elif response[0:10] == 'Unsplash: ':
            return search_unsplash.remote(response[10:])
        elif response[0:8] == 'Vector: ':
            return pinecone_query.remote(response[8:], 7)  # strip the 8-character 'Vector: ' prefix

    Because Modal functions like parse_response can be invoked remotely, each call to parse_response was made in parallel rather than sequentially using Modal’s Function.map API, which speeds up the overall response time.

    results = parse_response.map(responses)

    The results are then collated and rendered into HTML to be returned to the requesting browser.
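
    A rough sketch of that collation step is below; the render_results helper and its bare-bones markup are illustrative (the actual markup uses the row/rowchild CSS classes described later).

    def render_results(results) -> str:
        cards = []
        for service_results in results:        # one list of dicts per parse_response call
            for item in service_results:
                if 'error' in item:
                    continue                   # skip services that failed
                cards.append(
                    "<div class='rowchild'>"
                    f"<a href='{item.get('url', '')}'>{item.get('title', '')}</a>"
                    f"<p>{item.get('snippet', '')}</p>"
                    "</div>"
                )
        return "<div class='row'>" + ''.join(cards) + "</div>"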

    Calling The Services

    As mentioned above, the nuances of each service were handled by a dedicated service function. These functions handle service authentication, execute the relevant search, and parse the response into Python dictionaries with the appropriate structure.

    Searching Wikipedia (search_wikipedia) involved url-encoding the query and appending it to a base URL (http://www.wikipedia.org/search-redirect.php?search=) that surfaces the matching Wikipedia page if the query has a good match and falls back to a keyword search on Wikipedia if not. (I use that URL as a keyword shortcut in my browsers for this purpose.)

    Because of the different types of page responses, the resulting page needed to be parsed to determine if the result was a full Wikipedia entry, a Search results page (in which case the top two results were taken), or a disambiguation page (a query that could point to multiple possible pages but which had a different template than the others) before extracting the information needed for presenting the results.

    # handle Wikipedia
    @stub.function()
    def search_wikipedia(query: str):
        import requests
        import urllib.parse
        from bs4 import BeautifulSoup
    
        # base_search_url works well if search string is spot-on, else does search 
        base_search_url = 'https://en.wikipedia.org/w/index.php?title=Special:Search&search={query}'
        base_url = 'https://en.wikipedia.org'
    
        results = []
        r = requests.get(base_search_url.format(query = urllib.parse.quote(query)))
        soup = BeautifulSoup(r.content, 'html.parser') # parse results
    
        if '/wiki/Special:Search' in r.url: # its a search, get top two results
            search_results = soup.find_all('li', class_='mw-search-result')
            for search_result in search_results[0:2]:
                ...
                results.append(result)
        else: # not a search page, check if disambiguation or legit page
            ...
    
        return results

    The Reddit API is only accessible via OAuth, which requires generating an authentication token first; that token then authenticates subsequent API requests. The endpoint of interest (/search) returns a large JSON object encapsulating all the returned results. Trial and error helped establish the schema and the rules of thumb needed for extracting the best preview images. Note: the Reddit access token was re-generated for each search because each token expires after only one hour.

    # handle Reddit
    @stub.function(secret=Secret.from_name('reddit_secret'))
    def search_reddit(query: str):
        import requests
        import base64 
        import os 
    
        reddit_id = os.environ['REDDIT_USER']
        user_agent = os.environ['REDDIT_AGENT']
        reddit_secret = os.environ['REDDIT_KEY']
    
        # set up for auth token request
        auth_string = reddit_id + ':' + reddit_secret
        encoded_auth_string = base64.b64encode(auth_string.encode('ascii')).decode('ascii')
        auth_headers = {
            'Authorization': 'Basic ' + encoded_auth_string,
            'User-agent': user_agent
        }
        auth_data = {
            'grant_type': 'client_credentials'
        }
    
        # get auth token
        r = requests.post('https://www.reddit.com/api/v1/access_token', headers = auth_headers, data = auth_data)
        if r.status_code == 200 and 'access_token' in r.json():
            reddit_access_token = r.json()['access_token']
        else:
            return [{'error':'auth token failure'}]
    
        results = []
    
        # set up headers for search requests
        headers = {
            'Authorization': 'Bearer ' + reddit_access_token,
            'User-agent': user_agent
        }
        
        # execute subreddit search
        params = {
            'sort': 'relevance',
            't': 'year',
            'limit': 4,
            'q': query[:512]
        }
        r = requests.get('https://oauth.reddit.com/search', params=params, headers=headers)
        if r.status_code == 200:
            body = r.json()
            ...
    
        return results

    Authenticating Unsplash searches was simpler and required passing a client ID as a request header. The results could then be obtained by simply passing the query parameters to https://api.unsplash.com/search/photos

    # handle Unsplash Search
    @stub.function(secret = Secret.from_name('unsplash_secret'))
    def search_unsplash(query: str, num_matches: int = 5):
        import os 
        import requests 
    
        # set up and make request
        unsplash_client = os.environ['UNSPLASH_ACCESS']
        unsplash_url = 'https://api.unsplash.com/search/photos'
    
        headers = {
            'Authorization': 'Client-ID ' + unsplash_client,
            'Accept-Version': 'v1'
        }
        params = {
            'page': 1,
            'per_page': num_matches,
            'query': query
        }
        r = requests.get(unsplash_url, params=params, headers=headers)
    
        # check if request is good 
        if r.status_code == 200:
            ...
            return results
        else:
            return [{'error':'auth failure'}]

    To search podcast episodes, I turned to Taddy’s API. Unlike Reddit and Unsplash, Taddy operates a GraphQL-based query engine: instead of requesting specific pieces of data from dedicated endpoints (e.g. GET api.com/podcastTitle/, GET api.com/podcastDescription/), you send a single query to one endpoint (e.g. api.com/podcastData/) specifying the fields of interest and get them all back at once.

    As I did not want the overhead of creating a schema and running a Python GraphQL library, I constructed the request manually.

    # handle Podcast Search via Taddy
    @stub.function(secret = Secret.from_name('taddy_secret'))
    def search_podcasts(query: str):
        import os 
        import requests 
    
        # prepare headers for querying taddy
        taddy_user_id = os.environ['TADDY_USER']
        taddy_secret = os.environ['TADDY_KEY']
        url = 'https://api.taddy.org'
        headers = {
            'Content-Type': 'application/json',
            'X-USER-ID': taddy_user_id,
            'X-API-KEY': taddy_secret
        }
        # query body for podcast search
        queryString = """{
      searchForTerm(
        term: """
        queryString += '"' + query + '"\n'
        queryString += """
        filterForTypes: PODCASTEPISODE
        searchResultsBoostType: BOOST_POPULARITY_A_LOT
        limitPerPage: 3
      ) {
        searchId
        podcastEpisodes {
          uuid
          name
          subtitle
          websiteUrl
          audioUrl
          imageUrl
          description
          podcastSeries {
            uuid
            name
            imageUrl
            websiteUrl
          }
        }
      }
    }
    """
        # make the graphQL request and parse the JSON body
        r = requests.post(url, headers=headers, json={'query': queryString})
        if r.status_code != 200:
            return []
        else:
            responseBody = r.json()
            if 'errors' in responseBody:
                return [{'error': 'authentication issue with Taddy'}]
            else:
                results = []
                ...
                return results
    Serving Results

    To keep the architecture and front-end work simple, the entire application is served out of a single endpoint. Queries are passed as simple URL parameters (?query=), which are easy to handle with FastAPI and readily generated using HTML <form> elements with method='get' as an attribute and name='query' on the text field.
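
    A minimal sketch of that single endpoint is below, assuming Modal’s web_endpoint decorator (which wraps a FastAPI endpoint); the handler name, the bare-bones form markup, and the render_results helper sketched earlier are illustrative rather than the actual implementation.

    from modal import web_endpoint
    from fastapi.responses import HTMLResponse

    @stub.function()
    @web_endpoint(method='GET')
    def search_page(query: str = ''):
        form = "<form method='get'><input type='text' name='query'><button>Search</button></form>"
        if not query:
            return HTMLResponse(form)                      # no query yet: just show the search bar
        responses = openai_chain_search.remote(query)      # LLM -> service-specific queries
        results = parse_response.map(responses)            # fan the queries out to the services
        return HTMLResponse(form + render_results(results))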

    The interface is fairly simple: a search bar at the top and search results (if any) beneath (see below)

    Search for “Ocean Sunset” (image-centric)
    Search for “how to plan a wedding” (text-centric)

    To get a more flexible layout, each search result (rowchild) is added to a flexbox container (row) configured to “fill up” each horizontal row before proceeding. The search results themselves are also given minimum widths and maximum widths to guarantee at least 2 items per row. To prevent individual results from becoming too tall, maximum heights are applied to image containers (imagecontainer) inside the search results.

    <style type='text/css'>
        .row {
            display: flex; 
            flex-flow: row wrap
        }
        .rowchild {
            border: 1px solid #555555; 
            border-radius: 10px; 
            padding: 10px; 
            max-width: 45%; 
            min-width: 300px; 
            margin: 10px;
        }
        ...
        .imagecontainer {max-width: 90%; max-height: 400px;}
        ....
    </style>

    Limitations & Future Directions

    Limitations

    While this resulted in a functioning metasearch engine, a number of limitations became obvious as I built and tested it.

    • Latency — Because LLMs have significant latency, start-to-finish search times take a meaningful hit, even excluding the time needed to query the additional services. Given the large body of research showing how search latency decreases usage and clickthrough, this poses a significant barrier to this approach working well. At a minimum, the search user interface needs to account for this latency.
    • Result Robustness [LLM] — Because large language models sample over a distribution, the same prompt will produce different responses over time. As a result, this approach yields large variance in results even for the same user query. While this can help with ideation and exploration for general queries, it is a limiting factor if the expectation is to produce the best results time and time again for specific queries.
    • Dependence on Third Party Search Quality — Even with perfectly tailored queries, metasearch ultimately relies on third party search engines to perform well. Unfortunately, as evidenced in multiple tests, all of the services used here regularly produce irrelevant results, likely due to the computational difficulty of going beyond simple text matching and categorization.
    Future Areas for Exploration

    To improve the existing metasearch engine, there are four areas of exploration that are likely to be the most promising:

    1. Caching search results: Given the latency inherent to LLMs and metasearch services, caching high quality and frequently searched-for results could significantly improve performance.
    2. Asynchronous searches & load states: While the current engine executes individual sub-queries in parallel, the results are delivered monolithically (in one go). There may be perception and performance gains to be had from serving results as they arrive, with loading-state animations providing visual feedback to the user as to what is happening. This will be especially necessary if the best results are to be a combination of cached and newly pulled results.
    3. Building a scoring engine for search results: One of the big weaknesses of the current engine is that it treats all results equally. Results should be ranked on relevance or value and pruned of irrelevant content. This can only be done if a scoring engine or model is applied.
    4. Training my own models for query optimization: Instead of relying on high-latency LLM requests, it may be possible for some services to generate high quality service-specific queries with smaller sequence-to-sequence models. This would mean lower latency and possibly better performance due to the focus on the particular task.

    Edit: 2023-Sep-29: Updated Wikipedia address due to recent changes; replaced .call with .remote due to Modal deprecation of .call

  • Setting Up an OpenMediaVault Home Server with Docker, Plex, Ubooquity, and WireGuard

    I spent a few days last week setting up a cheap home server which now serves my family as:

    • a media server — stores and streams media to phones, tablets, computers, and internet-connected TVs (even when I’m out of the house!)
    • network-attached storage (NAS) — lets computers connected to my home network store and share files
    • VPN — lets me connect to my storage and media server when I’m outside of my home

    Until about a week ago, I had run a Plex media server on my aging (8 years old!) NVIDIA SHIELD TV. While I loved the device, it was starting to show its age – it would sometimes overheat and not boot for several days. My home technology setup had also shifted. I bought the SHIELD all those years ago to put Android TV functionality onto my “dumb” TV.

    But, about a year ago, I upgraded to a newer Sony TV which had it built-in. Now, the SHIELD felt “extra” and the media server felt increasingly constrained by what it could not do (e.g., slow network access speeds, can only run services that are Android apps, etc.)

    I considered buying a high-end consumer NAS from Synology or QNAP (which would have been much simpler!), but decided to build my own, both to get better hardware for less money and as a fun project that would teach me more about servers and let me configure everything to my heart’s content.

    If you’re interested in doing something similar, let me walk you through my hardware choices and the steps I took to get to my current home server setup.

    Note: on the recommendation of a friend, I’ve since reconfigured how external access works to not rely on a VPN with an open port and Dynamic DNS and instead use Twingate. For more information, refer to my post on Setting Up Pihole, Nginx Proxy, and Twingate with OpenMediaVault

    Hardware

    I purchased a Beelink EQ12 Mini, a “mini PC” (fits in your hand, power-efficient, but still capable of handling a web browser, office applications, or a media server), during Amazon’s Prime Day sale for just under $200.

    Beelink EQ12 Mini (Image Source: Chigz Tech Review)

    While I’m very happy with the choice I made, for those of you contemplating something similar, the exact machine isn’t important. Many of the mini PC brands ultimately produce very similar hardware, and by the time you read this, there will probably be a newer and better product. But, I chose this particular model because:

    • It was from one of the more reputable Mini PC brands which gave me more confidence in its build quality (and my ability to return it if something went wrong). Other reputable vendors beyond Beelink include Geekom, Minisforum, Chuwi, etc.
    • It had a USB-C port which helps with futureproofing, and the option to convert this into something else useful if this server experiment doesn’t work out.
    • It had an Intel CPU. While AMD makes excellent CPUs, the benefit of going with Intel is support for Intel Quick Sync, which allows for hardware accelerated video transcode (converting video and audio streams to different formats and resolutions – so that other devices can play them – without overwhelming the system or needing a beefy graphics card). Many popular media servers support Intel Quick Sync-powered transcode.
    • It was not an i3/5/7/9 chip. Intel’s higher end chips have names that include “i3”, “i5”, “i7”, or “i9”. Those are generally overkill on performance, power consumption, and price for a simple file and media server. All I needed for my purposes was a lower-end Celeron-type device.
    • It was the most advanced Intel architecture I could find for ≤$200. While I didn’t need the best performance, there was no reason to avoid more advanced technology. Thankfully, the N100 chip in the EQ12 Mini uses Intel’s 12th Generation Core architecture (Alder Lake). Many of the other mini-PCs at this price range had older (10th and 11th generation) CPUs.
    • I went with the smallest RAM and onboard storage option. I wasn’t planning on putting much on the included storage (because you want to isolate the operating system for the server away from the data) nor did I expect to tax the computer memory for my use case.

    I also considered purchasing a Raspberry Pi, a <$100 low-power device popular with hobbyists, but the lack of hardware transcode and the non-x86 architecture (Raspberry Pis use ARM CPUs and won’t be compatible with all server software) pushed me towards an Intel-based mini PC.

    In addition to the mini-PC, I also needed:

    • Storage: a media server / NAS without storage is not very useful. I had a 4 TB USB hard drive (previously connected to my SHIELD TV) which I used here, and I also bought a 4 TB SATA SSD (for ~$150) to mount inside the mini-PC.
      • Note 1: if you decide to go with OpenMediaVault as I have, install the Linux distribution before you install the SATA drive. The installer (foolishly) tries to install itself to the first drive it finds, so don’t give it any bad options.
      • Note 2: most Mini PC manufacturers say their systems only support additional drives up to 2 TB. This appears to be mainly the manufacturers being overly conservative. My 4 TB SATA SSD works like a charm.
    • A USB stick: Most Linux distributions (especially those that power open source NAS solutions) are installed from a bootable USB stick. I used one that was lying around that had 2 GB on it.
    • Ethernet cables and a “dumb” switch: I use Google Wifi in my home and I wanted to connect both my TV and my new media server to the router in my living room. To do that, I bought a simple Ethernet switch (you don’t need anything fancy because it’s just bridging several devices) and 3 Ethernet cables to tie it all together (one to connect the router to the switch, one to connect the TV to the switch, and one to connect the server to the switch). Depending on your home configuration, you may want something different.
    • A Monitor & Keyboard: if you decide to go with OpenMediaVault as I have, you’ll only need this during the installation phase as the server itself is controllable through a web interface. So, I used an old keyboard and monitor (that I’ve since given away).

    OpenMediaVault

    There are a number of open source home server / NAS solutions you can use. But I chose to go with OpenMediaVault because it’s:

    To install OpenMediaVault on the mini PC, you just need to:

    1. Download the installation image ISO and burn it to a bootable USB stick (if you use Windows, you can use Rufus to do so)
    2. Plug the USB stick into the mini PC (and make sure to connect the monitor and keyboard) and then turn the machine on. If it goes to Windows (i.e. it doesn’t boot from your USB stick), you’ll need to restart and go into BIOS (you can usually do this by pressing Delete or F2 or F7 after turning on the machine) to configure the machine to boot from a USB drive.
    3. Follow the on-screen instructions.
      • You should pick a good root password and write it down (it gates administrative access to the machine, and you’ll need it to make some of the changes below).
      • You can pick pretty much any name you want for the hostname and domain name (it shouldn’t affect anything but it will be what your machine calls itself).
      • Make sure to select the right drive for installation
    4. And that should be it! After you complete the installation, you will be prompted to enter the root password you created to login.

    Unfortunately for me, OpenMediaVault did not recognize my mini PC’s ethernet ports or wireless card. If it detects your network adapter just fine, you can skip this next block of steps. But, if you run into this, select the “does not have network card” and “minimal setup” options during install. You should still be able to get to the end of the process. Then, once the OpenMediaVault operating system installs and reboots:

    1. Login by entering the root password you picked during the installation and make sure your system is plugged in to your router via ethernet. Note: Linux is known to have issues recognizing some wireless cards and it’s considered best practice to run a media server off of Ethernet rather than WiFi.
    2. In the command line, enter omv-firstaid. This is a gateway to a series of commonly used tools to fix an OpenMediaVault install. In this case, select the Configure Network Interface option and say yes to all the IPv4 DHCP options (you can decide if you want to set up IPv6).
    3. Step 2 should fix the issue where OpenMediaVault could not see your internet connection. To prove this, you should try two things:
      • Enter ping google.com -c 3 in the command line. You should see 3 lines with something like 64 bytes from random-url.blahurl.net showing that your system could reach Google (and thus the internet). If it doesn’t work, try again in a few minutes (sometimes it takes some time for your router to register a new system).
      • Enter ip addr in the command line. Somewhere on the screen, you should see something that probably looks like inet 192.168.xx.xx/xx. That is your local IP address and it’s a sign that the mini PC has connected to your router.
    4. Now you need to update the Linux operating system so that it knows where to look for updates to Debian. As of this writing, the latest version of OpenMediaVault (6) is based on Debian 11 (codenamed Bullseye), so you may need to replace bullseye with <name of Debian codename that your OpenMediaVault is based on> in the text below if your version is based on a different version of Debian (i.e. Bookworm, Trixie, etc.).

      In the command line, enter nano /etc/apt/sources.list. This will let you edit the file that contains all the information on where your Linux operating system will find valid software updates. Enter the text below underneath all the lines that start with # (replacing bullseye with the name of the Debian version that underlies your version of OpenMediaVault if needed).
      deb http://deb.debian.org/debian bullseye main 
      deb-src http://deb.debian.org/debian bullseye main
      deb http://deb.debian.org/debian-security/ bullseye-security main
      deb-src http://deb.debian.org/debian-security/ bullseye-security main
      deb http://deb.debian.org/debian bullseye-updates main
      deb-src http://deb.debian.org/debian bullseye-updates main
      Then press Ctrl+X to exit, press Y when asked if you want to save your changes, and finally Enter to confirm that you want to overwrite the existing file.
    5. To prove that this worked, in the command line enter apt-get update and you should see some text fly by that includes some of the URLs you entered into sources.list. Next enter apt-get upgrade -y, and this should install all the updates the system found.

    Congratulations, you’ve installed OpenMediaVault!

    Setting up the File Server

    You should now connect any storage (internal or USB) that you want to use for your server. You can turn off the machine if you need to by pulling the plug, or holding the physical power button down for a few seconds, or by entering shutdown now in the command line. After connecting the storage, turn the system back on.

    Once setup is complete, OpenMediaVault can generally be completely controlled and managed from the web. But to do this, you need your server’s local IP address. Log in (if you haven’t already) using the root password you set up during the installation process. Enter ip addr in the command line. Somewhere on the screen, you should see something that looks like inet 192.168.xx.xx/xx. That set of numbers connected by decimal points but before the slash (for example: 192.168.4.23) is your local IP address. Write that down.

    Now, go into any other computer connected to the same network (i.e. on WiFi or plugged into the router) as the media server and enter the local IP address you wrote down into the address bar of a browser. If you configured everything correctly, you should see something like this (you may have to change the language to English by clicking on the globe icon in the upper right):

    The OpenMediaVault administrative panel login

    Congratulations, you no longer need to connect a keyboard or mouse to your server, because you can manage it from any other computer on the network!

    Login using the default username admin and default password openmediavault. Below are the key things to do first. (Note: after hitting Save on a major change, as an annoying extra precaution, OpenMediaVault will ask you to confirm the change again with a bright yellow confirmation banner at the top. You can wait until you have several changes, but you need to make sure you hit the check mark at least once or your changes won’t be reflected):

    • Change your password: This panel controls the configuration for your system, so it’s best not to let it be the default. You can do this by clicking on the (user settings) icon in the upper-right and selecting Change Password
    • Some useful odds & ends:
      • Make auto logout (time before the panel logs you out automatically) longer. You can do this by going to [System > Workbench] in the menu and changing Auto logout to something like 60 minutes
      • Set the system timezone. You can do this by going to [System > Date & Time] and changing the Time zone field.
    • Update the software: On the left-hand side, select [System > Update Management > Updates]. Press the button to search for new updates. If any show up press the button to install everything on the list that it can. (see below, Image credit: OMV-extras Wiki)
    • Mount your storage:
      • From the menu, select [Storage > Disks]. The table that results (see below) shows everything OpenMediaVault sees connected to your server. If you’re missing anything, time to troubleshoot (check the connection and then make sure the storage works on another computer).
      • It’s a good idea (although not strictly necessary) to reformat any non-empty disks before using them with OpenMediaVault for performance. You can do this by selecting the disk entry (marking it yellow) and then pressing the (Wipe) button
      • Go to [Storage > File Systems]. This shows what drives (and what file systems) are accessible to OpenMediaVault. To properly mount your storage:
        • Press the button for every already-formatted drive you may want to mount to OpenMediaVault. This will add a disk with an existing file system to the purview of your file server.
        • Press the button in the upper-left (just to the right of the triangular button) to add a drive that’s just been wiped. Of the file system options that come up, I would choose EXT4 (it’s what modern Linux operating systems tend to use). This will result in your chosen file system being added to the drive before it’s ultimately mounted.
    • Set up your File Server: Ok, you’ve got storage! Now you want to make it available for the computers on your network. To do this, you need to do three things:
      • Enabling SMB/CIFS: Windows, Mac OS, and Linux systems tend to work pretty well with SMB/CIFS for network file shares. From the menu, select [Services > SMB/CIFS > Settings].

        Check the Enabled box. If your LAN workgroup is something other than the default WORKGROUP you should enter it. Now any device on your network that supports SMB/CIFS will be able to see the folders that OpenMediaVault shares. (see below, Image credit: OMV-extras Wiki)
      • Selecting folders to share: On the left-hand-side of the administrative panel, select [Storage > Shared Folders]. This will list all the folders that can be shared.

        To make a folder available to your network, select the button in the upper-left, and fill out the Name (what you want the folder to be called when others access it) and select the File System you’ve previously mounted that the folder will connect to. You can write out the name of the directory you want to share and/or use the directory folder icon to the right of the Relative Path field to help select the right folder. Under Permissions, for simplicity I would assign Everyone: read/write. (see below, Image credit: OMV-extras Wiki)


        Hit Save to return to the list of folder shares (see below for what a completed entry looks like, Image credit: OMV-extras Wiki). Repeat the process to add as many Shared Folders as you’d like.
      • Make the shared folders available to SMB/CIFS: To do this go to [Services > SMB/CIFS > Shares]. Hit the button and, in Shared Folder, select the Shared Folder you configured from the dropdown. Under Public, select Guests allowed – this will allow users on the network to access the folder without supplying a username or password. Check the Inherit Permissions, Extended attributes, and Store DOS attributes boxes as well and then hit Save. Repeat this for all the shared folders you want to make available. (Image credit: OMV-extras Wiki)
    • Set a static local IP: Home networks typically dynamically assign IP addresses to the devices on the network (something called DHCP). As a result, the IP address for your server may suddenly change. To give your server a consistent address to connect to, you should configure your router to assign a static IP to your server. The exact instructions will vary by router so you’ll need to consult your router’s documentation. In my household, we use Google Wifi and, if you do too, here are the instructions for doing so. (Make sure to write down the static IP you assign to the server as you will need it later. If you change the IP from what it already was, make sure to log into the OpenMediaVault panel from that new address before proceeding.)
    • Check that the shared folders show up on your network: Linux, Mac OS, and Windows all have separate ways of mounting a SMB/CIFS file share. The steps above hopefully simplify this by:
      • letting users connect as a Guest (no extra authentication needed)
      • providing a Static IP address for the file share

    Docker and OMV-Extras

    Once upon a time, setting up other software you might want to run on your home server required a lot of command line work. While efficient, it made worse the consequences of entering the wrong command or having two applications with conflicting dependencies. After all, a certain blogger accidentally deleted his entire blog because he didn’t understand what he was doing.

    Enter containers. Containers are “portable environments” for software, first popularized by the company Docker, that give software a predictable environment to run in. This makes it easier to run applications reliably, regardless of machine (because the application only sees what the container shows it). It also greatly reduces the risk of a misconfigured app affecting another, since each application “lives” in its own container.

    While this has tremendous implications for software in general, for our purposes, this just makes it a lot easier to install software … provided you have Docker installed. For OpenMediaVault, the best way to get Docker is to install OMV-extras.

    If you know how to use ssh, go ahead and use it to access your server’s IP address, login as the root user, and skip to Step 4. But, if you don’t, the easiest way to proceed is to set up WeTTY (Steps 1-3):

    1. Install WeTTY: Go to [System > Plugins] and search or scroll until you find the row for openmediavault-wetty. Click on it to mark it yellow and then press the button to install it. WeTTY is a web-based terminal which will let you access the server command line from a browser.
    2. Enable WeTTY: Once the install is complete, go to [Services > WeTTY], check the Enabled box, and hit Save. You’ll be prompted by OpenMediaVault to confirm the pending change.
    3. Press the Open UI button on the page to access WeTTY: It should open a new tab that takes you to your-ip-address:2222, where you’ll see a black screen which is basically the command line for your server! Enter root when prompted for your username and then the root password that you configured during installation.
    4. Enter this into the command line:
      wget -O - https://github.com/OpenMediaVault-Plugin-Developers/packages/raw/master/install | bash
      Installation will take a while but once it’s complete, you can verify it by going back to your administrative panel, refreshing the page, and seeing if there is a new menu item [System > omv-extras].
    5. Enable the Docker repo: From the administrative panel, go to [System > omv-extras] and check the Docker repo box. Press the apt clean button once you have.
    6. Install the Docker-compose plugin: Go to [System > Plugins] and search or scroll down until you find the entry for openmediavault-compose. Click on it to mark it yellow and then press the button on the upper-left to install it. To confirm that it’s been installed, you should see a new menu item [Services > Compose]
    7. Update the System: As before, select [System > Update Management > Updates]. Press the button to search for new updates. Press the button which will automatically install everything.
    8. Create three shared folders: compose, containers, and config: Just as with setting up the network folder shares, you can do this by going to [Storage > Shared Folders] and pressing the button in the upper left. You can generally pick any location you’d like, but make sure it’s on a file system with a decent amount of storage as media server applications can store quite a bit of configuration and temporary data (e.g. preview thumbnails).

      compose and containers will be used by Docker to store the information it needs to set up and operate the containers you’ll want.

      I would also recommend sharing config on the local network to make it easier to see and change the application configuration files (go to [Services > SMB/CIFS > Shares] and add it in the same way you did for the File Server step). Later below, I use this to add a custom theme to Ubooquity.
    9. Configure Docker Compose: Go to [Services > Compose > Settings]. Where it says Shared folder under Compose Files, select the compose folder you created in Step 8. Where it says Docker storage under Docker, copy in the absolute path (not the relative path) for the containers folder (which you can get from [Storage > Shared Folders]). Once that’s all set, press Reinstall Docker.
    10. Set up a User for Docker: You’ll need to create a separate user for Docker as it is dangerous to give any application full access to your root user. Go to [Users > Users] (yes, that is Users twice). Press the button to create a new user. You can give it whatever name (e.g. dockeruser) and password you want, but under Groups make sure to select both docker and users. Hit Save and, once you’re set, you should see your new user in the table. Make a note of the UID and GID (they’ll probably be 1000 and 100, respectively, if this is your first user other than root) as you’ll need them when you install applications.

    That was a lot! But you’ve now set up Docker Compose. Let’s use it to install some applications!

    Setting up Media Server(s)

    Before you set up the applications that access your data, you should make sure all of that data (i.e. photos you’ve taken, music you’ve downloaded, movies you’ve ripped / bought, PDFs you’d like to make available, etc.) are on your server and organized.

    My suggestion is to set up a shared folder accessible to the network (mine is called Media) and have subdirectories in that folder corresponding for the different types of files that you may want your media server(s) to handle (for example: Videos, Photos, Files, etc). Then, use the network to move the files over (you should get comparable, if not faster, speeds as a USB transfer on a local area network).

    The two media servers I’ve set up on my system are Plex (to serve videos, photos, and music) and Ubooquity (to serve files and especially ePUB/PDFs). There are other options out there, many of which can be similarly deployed using Docker compose, but I’m just going to cover my setup with Plex and Ubooquity below.

    Plex

    • Why I chose it:
      • I’ve been using Plex for many years now, having set up clients on virtually all of my devices (phones, tablets, computers, and smart TVs).
      • I bought a lifetime Plex Pass a few years back which gives me access to even more functionality (including Intel Quick Sync transcode).
      • It has a wealth of automatic features (i.e. automatic video detection and tagging, authenticated access through the web without needing to configure a VPN, etc.) that have worked reliably over the years.
      • With a for-profit company backing it, (I believe) there’s a better chance that the platform will grow (they built a surprisingly decent free & ad-sponsored Live TV offering a few years ago) and be supported over the long-term
    • How to set up Docker Compose: Go to [Services > Compose > Files] and press the button. Under Name put down Plex and under File, paste the following (making sure the indentation stays consistent)
      version: "2.1"
      services:
        plex:
          image: lscr.io/linuxserver/plex:latest
          container_name: plex
          network_mode: host
          environment:
            - PUID=<UID of Docker User>
            - PGID=<GID of Docker User>
            - TZ=America/Los_Angeles
            - VERSION=docker
          devices:
            - /dev/dri/:/dev/dri/
          volumes:
            - <absolute path to shared config folder>/plex:/config
            - <absolute path to Media folder>:/media
          restart: unless-stopped
      You need to replace <UID of Docker User> and <GID of Docker User> with the UID and GID of the Docker user you created when you set up Docker Compose (Step 10 above), which will likely be 1000 and 100 if you followed the steps I laid out.

      You can get the absolute paths to your config folder and the location of your media files by going to [Storage > Shared Folders] in the administrative panel. I added a /plex to the config folder path under volumes:. This way you can install as many apps through Docker as you want and consolidate all of their configuration files in one place, while still keeping them separate.

      If you have an Intel QuickSync CPU, the two lines that start with devices: and /dev/dri/ will allow Plex to use it (provided you also paid for a Plex Pass). If you don’t have a chip with Intel QuickSync, haven’t paid for Plex Pass, or don’t want it, leave out those two lines.

      I live in the Bay Area so I set timezone TZ to America/Los_Angeles. You can find yours here.

      Once you’re done, hit Save and you should be returned to your list of Docker compose files for the next step. Notice that the new Plex entry you created has a Down status, showing the container has yet to be initiated.
    • How to start / update / stop / remove your Plex container: You can manage all of your Docker Compose files by going to [Services > Compose > Files]. Click on the Plex entry (which should turn it yellow) and press the (up) button. This will create the container, download any files needed, and run it.

      And that’s it! To prove it worked, go to http://your-ip-address:32400/web in a browser and you should see a login screen (see image below)


      From time to time, you’ll want to update your software. Docker makes this very easy. Because of the image: lscr.io/linuxserver/plex:latest line, every time you press the (pull) button, Docker will pull the latest version from linuxserver.io (a group that maintains commonly used Linux containers) and, usually, you can get away with an update without needing to stop or restart your container.

      Similarly, to stop the Plex container, simply tap the (stop) button. And to delete the container, tap the (down) button.
    • Getting started with Plex: There are great guides that have been written on the subject but my main recommendations are:
      • Do the setup wizard. It has good default settings (automatic library scans, remote access, etc.) — and I haven’t had to make many tweaks.
      • Take advantage of remote access — You can access your Plex server even when you’re not at home just by going to plex.tv and logging in.
      • Install Plex clients everywhere — It’s available on pretty much everything (Web, iOS, Android) and, with remote access, becomes a pretty easy way to get access to all of your content
      • I hide most of Plex’s default content in the Plex clients I’ve setup. While their ad-sponsored offerings are actually pretty good, I’m rarely consuming those. You can do this by configuring which things are pinned, and I pretty much only leave the things on my media server up.

    Ubooquity

    • Why I chose it: Ubooquity has, sadly, not been updated in almost 5 years as of this writing. But, I still chose it for two reasons. First, unlike many alternatives, it does not require me to create a new file organization structure or manually tag my old files to work. It simply shows me my folder structure, lets me open the files one page at a time, maintains read location across devices, and lets me have multiple users.

      Second, it’s available as a container on linuxserver.io (like Plex) which makes it easy to install and means that the infrastructure (if not the application) will continue to be updated as new container software comes out.

      I may choose to switch (and the beauty of Docker is that it’s very easy to just install another content server to try it out) but for now Ubooquity made the most sense.
    • How to set up the Docker Compose configuration: Like with Plex, go to [Services > Compose > Files] and press the button. Under Name put down Ubooquity and under File, paste the following
      ---
      version: "2.1"
      services:
        ubooquity:
          image: lscr.io/linuxserver/ubooquity:latest
          container_name: ubooquity
          environment:
            - PUID=<UID of Docker User>
            - PGID=<GID of Docker User>
            - TZ=America/Los_Angeles
            - MAXMEM=512
          volumes:
            - <absolute path to shared config folder>/ubooquity:/config
            - <absolute path to shared Media folder>/Books:/books
            - <absolute path to shared Media folder>/Comics:/comics
            - <absolute path to shared Media folder>/Files:/files
          ports:
            - 2202:2202
            - 2203:2203
          restart: unless-stopped
      You need to replace <UID of Docker User> and <GID of Docker User> with the UID and GID of the Docker user you created when you set up Docker Compose (Step 10 above), which will likely be 1000 and 100 if you followed the steps I laid out.

      You can get the absolute paths to your config folder and the location of your media files by going to [Storage > Shared Folders] in the administrative panel. I added a /ubooquity to the config folder path under volumes:. This way you can install as many apps through Docker as you want and consolidate all of their configuration files in one place, while still keeping them separate.

      I live in the Bay Area so I set timezone TZ to America/Los_Angeles. You can find yours here.

      Once you’re done, hit Save and you should be returned to your list of Docker compose files for the next step. Notice that the Ubooquity entry you created has a Down status, showing it has yet to be initiated.
    • How to start / update / stop / remove your Ubooquity container: You can manage all of your Docker Compose files by going to [Services > Compose > Files]. Click on the Ubooquity entry (which should turn it yellow) and press the (up) button. This will create the container, download any files needed, and run the system.

      And that’s it! To prove it worked, go to your-ip-address:2202/ubooquity in a browser and you should see the user interface (image credit: Ubooquity)


      From time to time, you’ll want to update your software. Docker makes this very easy. Because of the image: lscr.io/linuxserver/ubooquity:latest line, every time you press the (pull) button, Docker will pull the latest version from linuxserver.io (a group that maintains commonly used Linux containers) and, usually, you can get away with an update without needing to stop or restart your container.

      Similarly, to stop the Ubooquity container, simply tap the (stop) button. And to remove the container, tap the (down) button.
    • Getting started with Ubooquity: While Ubooquity will more or less work out of the box, if you want to really configure your setup you’ll need to go to the admin panel at your-ip-address:2203/ubooquity/admin (you will be prompted to create a password the first time)
      • In the General tab, you can see how many files are tracked in the table at the top, configure how frequently Ubooquity scans your folders for new files under Automatic scan period, manually launch a scan if you just added files with Launch New Scan, and select a theme for the interface.
      • If you want to create User accounts to have separate read state management or to segment which users can access specific content, you can create these users in the Security tab of the administrative panel. If you do so, you’ll need to go into each content type tab (i.e. Comics, Books, Raw Files) and manually configure which users have access to which shared folders.
      • The base Ubooquity interface is pretty dated so I am using a Plex-inspired theme.

        The easiest way to do this is to download the ZIP file at the link I gave. Unzip it on your computer (in this case it will result in the creation of a directory called plextheme-reading). Then, assuming the config shared folder you set up previously is shared across the network, take the unzipped directory and put it into the /ubooquity/themes subdirectory of the config folder.

        Lastly, go back to the General tab in Ubooquity admin and, next to Current theme select plextheme-reading
      • Edit (10-Aug-2023): I’ve since switched to using a Local DNS service powered by Pihole to access Ubooquity using a human readable web address ubooquity.home that every device on my network can access. For information on how to do this, refer to my post on Setting Up Pihole, Nginx Proxy, and Twingate with OpenMediaVault
        Because entering a local IP address and remembering 2202 or 2203 and the paths after them is a pain, I created keyword shortcuts for these in Chrome. The instructions for doing this will vary by browser, but to do this in Chrome, go to chrome://settings/searchEngines. There is a section of the page called Site search; press the Add button next to it. Even though the dialog box says Add Search Engine, in practice you can use this to add a keyword for any URL: put a name for the shortcut in the Search Engine field, the shortcut you want to use in Shortcut (I used ubooquity for the core application and ubooquityadmin for the administrative console), and the URLs in URL with %s in place of query (i.e. http://your-ip-address:2202/ubooquity and http://your-ip-address:2203/ubooquity/admin).

        Now, to get to Ubooquity, I simply type ubooquity into the Chrome address bar rather than a hodgepodge of numbers and slashes that I’ll probably forget.

    External Access

    One of Plex’s best features is making it very easy to access your media server even when you’re not on your home network. Having experienced that, I wanted the same level of access to my network file share and applications like Ubooquity when I was out of the house.

    Edit (10-Aug-2023): I’ve since switched my method of granting external access to Twingate. This provides secure access to network resources without needing to configure Dynamic DNS, a VPN, or open up a port. For more information on how to do this, refer to my post on Setting Up Pihole, Nginx Proxy, and Twingate with OpenMediaVault

    There are a few ways to do this, but the most secure path is through a VPN (virtual private network). VPNs are secure connections between computers that mimic actually being directly networked together. In our case, it lets a device securely access local network resources (like your server) even when it’s not on the home network.

    OpenMediaVault makes it relatively easy to use Wireguard, a fast and popular VPN technology with support for many different types of devices. To set up Wireguard for your server for remote access, you’ll need to do six things:

    1. Get a domain name and enable Dynamic DNS on it. Most residential internet customers do not have a static IP. This means that the IP address for your home, as the rest of the world sees it, can change without warning. This makes your server difficult to access externally (in much the same way that DHCP makes it hard to access your home server internally).

      To address this, many domain providers offer Dynamic DNS, where a domain name (for example: myurl.com) can point to a different IP address depending on when you access it, so long as the domain provider is told what the IP address should be whenever it changes.

      The exact instructions for how to do this will vary based on who your domain provider is. I use Namecheap and took an existing domain I owned and followed their instructions for enabling Dynamic DNS on it. I personally configured mine to use my vpn. subdomain, but you should use the setup you’d like, so long as you make a note of it for step 3 below.

      If you don’t want to buy your own domain and are comfortable using someone else’s, you can also sign up for Duck DNS which is a free Dynamic DNS service tied to a Duck DNS subdomain.
    2. Set up DDClient. To update the IP address your domain provider maps the domain to, you’ll need to run a background service on your server that will regularly check its IP address. One common way to do this is a software package called DDClient.

      Thankfully, setting up DDClient is fairly easy thanks (again!) to a linuxserver.io container. Like with Plex & Ubooquity, go to [Services > Compose > Files] and press the (create) button. Under Name put down DDClient and under File, paste the following:
      ---
      version: "2.1"
      services:
        ddclient:
          image: lscr.io/linuxserver/ddclient:latest
          container_name: ddclient
          environment:
            - PUID=<UID of Docker User>
            - PGID=<GID of Docker User>
            - TZ=America/Los_Angeles
          volumes:
            - <absolute path to shared config folder>/ddclient:/config
          restart: unless-stopped
      You need to replace <UID of Docker User> and <GID of Docker User> with the UID and GID of the Docker user you created when you set up Docker Compose (Step 10 above), which will likely be 1000 and 100 if you followed the steps I laid out.

      You can get the absolute path to your config folder by going to [Storage > Shared Folders] in the administrative panel. I added a /ddclient subfolder to the config folder path. This way you can install as many apps through Docker as you want and consolidate all of their configuration files in one place, while still keeping them separate.

      I live in the Bay Area so I set timezone TZ to America/Los_Angeles. You can find yours here.

      Once you’re done, hit Save and you should be returned to your list of Docker compose files. Click on the DDClient entry (which should turn it yellow) and press the (up) button. This will create the container, download any files needed, and run DDClient. Now, it is ready for configuration.
    3. Configure DDClient to work with your domain provider. While the precise configuration of DDClient will vary by domain provider, the process will always involve editing a text file. To do this, login to your server using SSH or WeTTy (see the section above on Installing OMV-Extras) and enter into the command line:
      nano <absolute path to shared config folder>/ddclient/ddclient.conf
      Remember to substitute <absolute path to shared config folder> with the absolute path to the config folder you set up for your applications (which you can access by going to [Storage > Shared Folders] in the administrative panel).

      This will open the file in the nano text editor. Scroll to the very bottom and enter the configuration information that your domain provider requires for Dynamic DNS to work. As I use Namecheap, I followed these instructions. In general, you’ll need to supply some information about the protocol, the server, your login / password for the domain provider, and the subdomain you intend to map to your IP address.

      Then press Ctrl+X to exit, press Y when asked if you want to save, and finally Enter to confirm that you want to overwrite the old file.
    4. Set up Port Forwarding on your router. Dynamic DNS gives devices outside of your network a consistent “address” to get to your server, but it won’t do any good if your router doesn’t pass those external requests through. You’ll need to tell your router to let incoming UDP traffic on port 51820 (Wireguard’s default) through to your server.

      The exact instructions will vary by router so you’ll need to consult your router’s documentation. In my household, we use Google Wifi and, if you do too, here are the instructions for doing so.
    5. Enable Wireguard. If you installed OMV-Extras above as I suggested, you’ll have access to a Plugin that turns on Wireguard. Go to [System > Plugins] on the administrative panel and then search or scroll down until you find the entry for openmediavault-wireguard. Click on it to mark it yellow and then press the button to install it.

      Now go to [Services > Wireguard > Tunnels] and press the (create) button to set up a VPN tunnel. You can give it any Name you want (for example: omv-vpn). Select your server’s main network connection for Network adapter. But, most importantly, under Endpoint, add the domain you just configured for Dynamic DNS / DDClient (for example: vpn.myurl.com). Press Save.
    6. Set up Wireguard on your devices. With a Wireguard tunnel configured, your next step is to set up the devices (called clients or peers) to connect. This has two parts.

      First, install the Wireguard applications on the devices themselves. Go to wireguard.com/install and download or set up the Wireguard apps. There are apps for Windows, MacOS, Android, iOS, and many flavors of Linux

      Then, go back into your administrative panel, go to [Services > Wireguard > Clients], and press the (create) button to create a valid client for the VPN. Check the box next to Enable, select the tunnel you just created under Tunnel number, put a name for the device you’re going to connect under Name, and assign a unique client number in Client Number (if it isn’t unique, it will not work). Press Save and you’ll be brought back to the Client list. Make sure to approve the change and then press the (client config) button. What you should do next depends on what kind of client device you’re configuring.

      If the device you’re configuring is not a smartphone (i.e. a computer), copy the text that shows up in the Client Config popup and save it as a .conf file (for example: work_laptop_wireguard.conf). Send that file to the device in question, as it will be used by the Wireguard app on that device to configure and access the VPN. Hit Close when you’re done.

      If the device you’re configuring is a smartphone, hit the Close button on the Client Config popup, as you will then be presented with a QR code that your smartphone’s Wireguard app can scan to configure the VPN connection.

      Now go into the Wireguard app on the client device and use it to either take a picture of the QR code when prompted or load the .conf file. Your device is now configured to connect to your server securely no matter where you are. A good test of this is to disconnect a configured smartphone from your home WiFi and enable the VPN. Since you’re no longer on WiFi, you should not be on the same network as your server. If you can enter http://your-ip-address into a browser in this mode and still reach the administrative panel for OpenMediaVault, you’re home free!

      One additional note: by default, Wireguard also acts as a proxy, meaning all internet traffic you send from the device will be routed through the server. This can be valuable if you’re trying to access a blocked website or pretend to be from a different location, but it can also be unnecessarily slow (and bandwidth consuming). I have my Wireguard configured to only route traffic that is going to my server’s local IP address through Wireguard. You can do this by configuring your client device’s Allowed IPs to your-ip-address (for example: 192.168.99.99) from the Wireguard app.

    Congratulations, you have now configured a file server and media server that you can securely access from anywhere!

    Concluding Thoughts

    A few concluding thoughts:

    1. This was probably way too complicated for most people. Believe it or not, what was written above is a shortened version of what I went through. Even holding aside that use of the command line and Docker automatically makes this hard for many consumers, I still had to deal with missing drivers, Linux not recognizing my USB drive through the USB-C port (but recognizing it through the USB-A one?), puzzling over different external access configurations (VPN vs Let’s Encrypt SSL on my server vs a self-signed certificate), and minimal feedback when my initial attempts to use Wireguard failed. While I learned a great deal, for most people it makes more sense to go completely third party (i.e. use Google / Amazon / Apple for everything) or, if you have some pain tolerance, to go with a high-end NAS.
    2. Docker/containerization is extremely powerful. Prior to this, I had thought of Docker as just a “flavor” of virtual machine, a software technology underlying cloud computing which abstracts server software from server hardware. And, while there is some overlap, I completely misunderstood why containers were so powerful for software deployment. By using 3 fairly simple blocks of text, I was able to deploy 3 complicated applications which needed different levels of hardware and network access (Ubooquity, DDClient, Plex) in minutes without issue.
    3. I was pleasantly surprised by how helpful the blogs and forums were. While the amount of work needed to find the right advice can be daunting, every time I ran into an issue, I was able to find some guidance online (often in a forum or subreddit). While there were certainly … abrasive personalities, by and large the questions being asked were by non-experts and they were answered by experts showing patience and generosity of spirit. Part of the reason I wrote this is to pay this forward for the next set of people who want to experiment with setting up their own server.
    4. I am excited to try still more applications. Lists about what hobbyists are running on their home servers like this and this and this make me very intrigued by the possibilities. I’m currently considering a network-wide adblocker like Pi-Hole and backup tools like BorgBackup. There is a tremendous amount of creativity out there!

    For more help on setting any of this stuff up, here are a few additional resources that proved helpful to me:

    (If you’re interested in how to setup a home server on OpenMediaVault or how to self-host different services, check out all my posts on the subject)

  • It’s not just the GOP who misunderstands Section 230

    Source: NPR

    Section 230 of the Communications Decency Act has been rightfully called “the twenty-six words that created the Internet.” It is a valuable legal shield which gives internet hosts and platforms the ability to distribute user-generated content and practice moderation without unreasonable fear of being sued, something which forms the basis of all social media, user review, user forum, and internet hosting services.

    In recent months, as big tech companies have drawn greater scrutiny for the role they play in shaping our discussions, Section 230 has become a scapegoat for many of the ills of technology. Until 2021, much of that criticism came from the Republican Party, which argued incorrectly that Section 230 promotes bias on platforms, with President Trump even vetoing unrelated defense legislation because it did not repeal Section 230.

    So, it’s refreshing (and distressing) to see the Democrats now take their turn in misunderstanding what Section 230 does for the internet. This critique is based mainly on Senator Mark Warner’s proposed changes to Section 230 and the FAQ his office posted about the SAFE TECH Act he is proposing (alongside Senators Hirono and Klobuchar), but it applies to the many commentators from the Democratic Party and the press who seem to have misunderstood the practical implications and received the proposal positively.

    While I think it’s reasonable to modify Section 230 to obligate platforms to help victims of clearly heinous acts like cyberstalking, swatting, violent threats, and human rights violations, what the Democratic Senators are proposing goes far beyond that in several dangerous ways.

    First, Warner and his colleagues have proposed carving out from Section 230 all content which accompanies payment (see below). While I sympathize with what I believe was the intention (to put a different bar on advertisements), this is remarkably short-sighted, because Section 230 applies to far more than the companies whose ad / content moderation policies Democrats dislike, such as Facebook, Google, and Twitter.

    Source: Mark Warner’s “redlines” of Section 230; highlighting is mine

    It also encompasses email providers, web hosts, user generated review sites, and more. Any service that currently receives payment (for example: a paid blog hosting service, any eCommerce vendor who lets users post reviews, a premium forum, etc) could be made liable for any user posted content. This would make it legally and financially untenable to host any potentially controversial content.

    Secondly, these rules will disproportionately impact smaller companies and startups. This is because smaller companies lack the resources that larger companies have to deal with the new legal burdens and moderation challenges that such a change to Section 230 would create. It’s hard to know if Senator Warner’s glib answer in his FAQ that people don’t litigate against small companies (see below) is ignorance or a willful desire to mislead, but ask tech startups how they feel about patent trolls and whether or not being small protects them from frivolous lawsuits.

    Source: Mark Warner’s FAQ on SAFE TECH Act; highlighting mine

    Third, the use of the language “affirmative defense” and “injunctive relief” may have far-reaching consequences that go beyond minor changes in legalese (see below). By reducing Section 230 from an immunity to an affirmative defense, it means that companies hosting content will cease to be able to dismiss cases that clearly fall within Section 230 because they now have a “burden of [proof] by a preponderance of the evidence.”

    Source: Mark Warner’s “redlines” of Section 230; highlighting is mine

    Similarly, carving out “injunctive relief” from Section 230 protections (see below) means that Section 230 doesn’t apply if the party suing is only interested in taking something down (rather than seeking financial damages).

    Source: Mark Warner’s “redlines” of Section 230

    I suspect the intention of these clauses is to make it harder for large tech companies to dodge legitimate concerns, but what this practically means is that anyone who has the money to pursue legal action can simply tie up any internet company or platform hosting content that they don’t like.

    That may seem like hyperbole, but this is what happened in the UK until 2014, where weak protections in libel / slander law made it easy for wealthy individuals and corporations to sue anyone over negative press. Imagine Jeffrey Epstein being able to sue any platform for carrying posts or links to stories about his actions, or any individual for forwarding an unflattering email about him.

    There is no doubt that we need new tools and incentives (both positive and negative) to tamp down on online harms like cyberbullying and cyberstalking, and that we need to come up with new and fair standards for dealing with “fake news”. But, it is distressing that elected officials will react by proposing far-reaching changes that show a lack of thoughtfulness as it pertains to how the internet works and the positives of existing rules and regulations.

    It is my hope that this was only an early draft that will go through many rounds of revisions with people with real technology policy and technology industry expertise.

  • Mea Culpa

    Mea culpa.

    I’ve been a big fan of moving my personal page over to AWS Lightsail. But, if I had one complaint, it’s the dangerous combination of (1) their pre-packaged WordPress image being hard to upgrade software on and (2) the training-wheel-lacking full root access that Lightsail gives to its customers. That combination led me to make some regrettable mistakes yesterday which resulted in the complete loss of my old blog posts and pages.

    It’s the most painful when you know your problems are your own fault. Thankfully, with the very same AWS Lightsail, it’s easy enough to start up a new WordPress instance. With the help of site visit and search engine analytics, I’ve prioritized the most popular posts and pages to resurrect using Google’s cache.

    Unfortunately, that process led to my email subscribers receiving way too many emails from me as I recreated each post. For that, I’m sorry — mea culpa — it shouldn’t happen again.

    I’ve come to terms with the fact that I’ve lost the majority of the 10+ years of content I’ve created. But, I’ve now learned the value of systematically backing up things (especially my AWS Lightsail instance), and hopefully I’ll write some good content in the future to make up for what was lost.

  • Using AI to Predict if a Paper will be in a Top-Tier Journal

    (Image credit: Intel)

    I have been doing some work in recent months with Dr. Sophia Wang at Stanford applying deep learning/AI techniques to make predictions using notes written by doctors in electronic medical records (EMR). While the mathematics behind the methods can be very sophisticated, tools like Tensorflow and Keras make it possible for people without formal training to apply them more broadly. I wanted to share some of what I’ve learned to help out those who want to start with similar work but are having trouble with getting started.

    For this tutorial (all code available on Github), we’ll write a simple algorithm to predict, based solely on its abstract and title, whether a paper was published in one of the ophthalmology field’s top 15 journals (as gauged by h-index from 2014-2018). It will be trained and tested on a dataset comprising every English-language ophthalmology-related article with an abstract in Pubmed (the free search engine operated by the NIH covering practically all life sciences-related papers) published from 2010 to 2019. It will also incorporate some features I’ve used on multiple projects but which seem to be lacking good tutorials online, such as using multi-input models with tf.keras and tf.data, using the same embedding layer on multiple input sequences, and using padded batches with tf.data to handle sequence data. This piece also assumes you know the basics of Python 3 (and Numpy) and have installed Tensorflow 2 (with GPU support), BeautifulSoup, lxml, and Tensorflow Datasets.

    Ingesting Data from Pubmed

    Update: 21 Jul 2020 – It’s recently come to my attention that Pubmed turned on a new interface starting May 18 where the ability to download XML records in bulk for a query has been disabled. As a result the instructions below will no longer work to pull the data necessary for the analysis. I will need to do some research on how to use the E-utilities API to get the same data and will update once I’ve figured that out. The model code (provided you have the titles and abstracts and journal names extracted in text files) will still work, however.

    To train an algorithm like this one, we need a lot of data. Thankfully, Pubmed makes it easy to build a fairly complete dataset of life sciences-related language content. Using the National Library of Medicine’s MeSH Browser, which maintains a “knowledge tree” showing how different life sciences subjects are related, I found four MeSH terms which covered the Ophthalmology literature: “Eye Diseases”, “Ophthalmology”, “Ocular Physiological Phenomena”, and “Ophthalmologic Surgical Procedures”. These can be entered into the Pubmed advanced search interface, and the search filters on the left can be used to narrow down to the relevant criteria (English language, Human species, published from Jan 2010 to Dec 2019, have Abstract text, and a Journal Article or a Review paper):

    We can download the resulting 138,425 abstracts (and associated metadata) as a giant XML file using the “Send To” menu in the upper-right (see screenshot below). The resulting file is ~1.9 GB, so give it some time.

    A 1.9 GB XML file is unwieldy and not in an ideal state for use in a model, so we should first pre-process the data to create smaller text files that will only have the information we need to train and test the model. The Python script below reads each line of the XML file, one-by-one, and each time it sees the beginning of a new entry, it will parse the previous entry using BeautifulSoup (and the lxml parser that it relies on) and extract the abstract, title, and journal name as well as build up a list of words (vocabulary) for the model.

    The XML format that Pubmed uses is relatively basic (article metadata is wrapped in <PubmedArticle> tags, titles are wrapped in <ArticleTitle> tags, abbreviated journal names are in <MedlineTA> tags, etc.). However, the abstracts are not always stored in the exact same way (sometimes in <Abstract> tags, sometimes in <OtherAbstract> tags, and sometimes divided between multiple <AbstractText> tags), so much of the code is designed to handle those different cases:
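
    What follows is a minimal sketch of that kind of streaming parser rather than the exact script from the linked GitHub repo; the input and output file names are assumptions, while the 25-occurrence vocabulary cutoff and the shuffle-then-write step come from the description below.

    # Sketch: stream the Pubmed XML export, parse each <PubmedArticle> record with
    # BeautifulSoup/lxml, and collect titles, abstracts, journal names, and word
    # counts for the vocabulary. File names here are assumptions, not the repo's.
    import random
    from collections import Counter
    from bs4 import BeautifulSoup

    records = []             # (title, abstract, journal) tuples
    word_counts = Counter()  # used below to build the vocabulary

    def parse_record(xml_fragment):
        soup = BeautifulSoup(xml_fragment, "lxml")  # the lxml parser lowercases tag names
        title = soup.find("articletitle")
        journal = soup.find("medlineta")
        # Abstract text may sit in <Abstract>, <OtherAbstract>, or be split across
        # several <AbstractText> tags, so gather whatever is present
        abstract_parts = [t.get_text(" ", strip=True) for t in soup.find_all("abstracttext")]
        if title and journal and abstract_parts:
            title_text = title.get_text(" ", strip=True)
            abstract_text = " ".join(abstract_parts)
            records.append((title_text, abstract_text, journal.get_text(strip=True)))
            word_counts.update((title_text + " " + abstract_text).lower().split())

    buffer = []
    with open("pubmed_result.xml", encoding="utf-8") as f:
        for line in f:
            if "<PubmedArticle" in line and buffer:
                parse_record("".join(buffer))   # parse the previous entry
                buffer = []
            buffer.append(line)
        if buffer:
            parse_record("".join(buffer))       # don't forget the last entry

    # Keep words that appear at least 25 times (smallestfreq), then shuffle and
    # write one title / abstract / journal name per line to separate text files
    vocabulary = sorted(w for w, n in word_counts.items() if n >= 25)
    with open("vocab.txt", "w", encoding="utf-8") as v:
        v.write("\n".join(vocabulary))

    random.shuffle(records)
    with open("titles.txt", "w", encoding="utf-8") as t, \
         open("abstracts.txt", "w", encoding="utf-8") as a, \
         open("journals.txt", "w", encoding="utf-8") as j:
        for title, abstract, journal in records:
            t.write(title + "\n")
            a.write(abstract + "\n")
            j.write(journal + "\n")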

    After parsing the XML, every word in the corpus that shows up at least 25 times (smallestfreq) is stored in a vocabulary list (which will be used later to convert words into numbers that a machine learning algorithm can consume), and the stored abstracts, titles, and journal names are shuffled before being written (one per line) to different text files.

    Looking at the resulting files, it’s clear that the journals Google Scholar identified as top 15 journals in ophthalmology are some of the more prolific journals in the dataset (making up 8 of the 10 journals with the most entries). Of special note, ~21% of all articles in the dataset were published in one of the top 15 journals:

    Journal | Short Name | Top 15 Journal? | Articles as % of Dataset
    Investigative Ophthalmology & Visual Science | Invest Ophthalmol Vis Sci | Yes | 3.63%
    PLoS ONE | PLoS One | No | 2.84%
    Ophthalmology | Ophthalmology | Yes | 1.99%
    American Journal of Ophthalmology | Am J Ophthalmol | Yes | 1.99%
    British Journal of Ophthalmology | Br J Ophthalmol | Yes | 1.95%
    Cornea | Cornea | Yes | 1.90%
    Retina | Retina | Yes | 1.75%
    Journal of Cataract & Refractive Surgery | J Cataract Refract Surg | Yes | 1.64%
    Graefe’s Archive for Clinical and Experimental Ophthalmology | Graefes Arch Clin Exp Ophthalmol | No | 1.59%
    Acta Ophthalmologica | Acta Ophthalmol | Yes | 1.40%
    All Top 15 Journals | | | ~21.3%
    10 Most Prolific Journals in Pubmed query and Total % of Articles from Top 15 (by H-index) Ophthalmology Journals

    Building a Model Using Tensorflow and Keras

    Translating even a simple deep learning idea into code can be a nightmare with all the matrix calculus that needs to be done. Luckily, the AI community has developed tools like Tensorflow and Keras to make this much easier (doubly so now that Tensorflow has chosen to adopt Keras as its primary API in tf.keras). It’s now possible for programmers without formal training (let alone knowledge of what matrix calculus is or how it works) to apply deep learning methods to a wide array of problems. While the quality of documentation is not always great, the abundance of online courses and tutorials (I personally recommend DeepLearning.ai’s Coursera specialization) brings these methods within reach for a self-learner willing to “get their hands dirty.”

    As a result, while we could focus on all the mathematical details of how the model architecture we’ll be using works, the reality is that it’s unnecessary. What we need to focus on are the main components of the model and what they do:

    Model Architecture
    1. The model converts the words in the title into vectors of numbers called Word Embeddings using the Keras Embedding layer. This provides the numerical input that a mathematical model can understand.
    2. The sequence of word embeddings representing the title is then fed to a bidirectional recurrent neural network based on Gated Recurrent Units (enabled by using a combination of the Keras Bidirectional layer wrapper and the Keras GRU layer). This is a deep learning architecture that is known to be good at understanding sequences of things, something which is valuable in language tasks (where you need to differentiate between “the dog sat on the chair” and “the chair sat on the dog” even though they both use the same words).
    3. The Bidirectional GRU layer outputs a single vector of numbers (actually two which are concatenated but that is done behind the scenes) which is then fed through a Dropout layer to help reduce overfitting (where an algorithm learns to “memorize” the data it sees instead of a relationship).
    4. The result of step 3 is then fed through a single layer feedforward neural network (using the Keras Dense layer) which results in another vector which can be thought of as “the algorithm’s understanding of the title”
    5. Repeat steps #1-4 (using the same embedding layer from #1) with the words in the abstract
    6. Combine the outputs for the title (step 4) and abstract (step 5) by literally sticking the two vectors together to make one larger vector — and then run this combination through a 2-layer feedforward neural network (which will “combine” the “understanding” the algorithm has developed of both the title and the abstract) with an intervening Dropout layer (to guard against overfitting).
    7. The final layer should output a single number using a sigmoid activation because the model is trying to learn to predict a binary outcome (is this paper in a top-tier journal [1] or not [0])

    The above description skips a lot of the mathematical detail (i.e. how does a GRU work, how does Dropout prevent overfitting, what mathematically happens in a feedforward neural network) that other tutorials and papers cover at much greater length. It also skims over some key implementation details (what size word embedding, what sort of activation functions should we use in the feedforward neural network, what sort of random variable initialization, what amount of Dropout to implement).

    But with a high level framework like tf.keras, those details become configuration variables where you either accept the suggested defaults (as we’ve done with random variable initialization and use of biases) or experiment with values / follow convention to find something that works well (as we’ve done with embedding dimension, dropout amount, and activation function). This is illustrated by how relatively simple the code to outline the model is (just 22 lines if you leave out the comments and extra whitespace):
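
    The snippet below is a minimal sketch of a model along those lines rather than the exact code from the repo; the vocabulary size, layer widths, and dropout rate are illustrative assumptions.

    # Sketch of the two-input architecture described above; sizes are assumptions
    import tensorflow as tf

    VOCAB_SIZE = 20000  # assumption: roughly the size of the parsed vocabulary
    EMBED_DIM = 64      # assumption

    title_input = tf.keras.Input(shape=(None,), dtype=tf.int64, name="title")
    abstract_input = tf.keras.Input(shape=(None,), dtype=tf.int64, name="abstract")

    # One embedding layer shared by both the title and the abstract (steps 1 and 5);
    # id 0 is reserved for padding, hence mask_zero=True and the +2 below
    embedding = tf.keras.layers.Embedding(VOCAB_SIZE + 2, EMBED_DIM, mask_zero=True)

    def text_branch(tokens):
        # Embedding -> bidirectional GRU -> dropout -> dense "understanding" vector (steps 1-4)
        x = embedding(tokens)
        x = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(64))(x)
        x = tf.keras.layers.Dropout(0.5)(x)
        return tf.keras.layers.Dense(64, activation="relu")(x)

    # Step 6: concatenate the two vectors and run them through a 2-layer network
    combined = tf.keras.layers.concatenate([text_branch(title_input), text_branch(abstract_input)])
    x = tf.keras.layers.Dense(64, activation="relu")(combined)
    x = tf.keras.layers.Dropout(0.5)(x)
    # Step 7: single sigmoid output for the binary top-15-journal prediction
    output = tf.keras.layers.Dense(1, activation="sigmoid")(x)

    model = tf.keras.Model(inputs=[title_input, abstract_input], outputs=output)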

    Building a Data Pipeline Using tf.data

    Now that we have a model, how do we feed the data previously parsed from Pubmed into it? To handle some of the complexities AI researchers regularly run into with handling data, Tensorflow released tf.data, a framework for building data pipelines to handle datasets that are too large to fit in memory and/or that require significant pre-processing.

    We start by creating two functions

    1. encode_journal, which takes a journal name as an input and returns a 1 if it’s a top ophthalmology journal and a 0 if it isn’t (by doing a simple comparison against the right list of journal names)
    2. encode_text which takes as inputs the title and the abstract for a given journal article and converts them into a list of numbers which the algorithm can handle. It does this by using the TokenTextEncoder function that is provided by the Tensorflow Datasets library which takes as an input the vocabulary list which we created when we initially parsed the Pubmed XML

    These methods will be used in our data pipeline to convert the inputs into something usable by the model (as an aside: this is why you see the .numpy() in encode_text — it’s used to convert the tensor objects that tf.data will pass to them into something that can be used by the model).

    Because of the way that Tensorflow operates, encode_text and encode_journal need to be wrapped in tf.py_function calls (which let you run “normal” Python code on Tensorflow graphs) which is where encode_text_map_fn and encode_journal_map_fn (see code snippet below) come in.
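
    Here is a minimal sketch of what those functions and wrappers can look like; the vocabulary file name and the truncated top-15 journal list are placeholders, and on older tensorflow-datasets versions the encoder lives at tfds.features.text.TokenTextEncoder rather than tfds.deprecated.text.TokenTextEncoder.

    # Sketch of the encoders and their tf.py_function wrappers
    import tensorflow as tf
    import tensorflow_datasets as tfds

    # Assumption: the vocabulary list saved during the parsing step
    vocabulary = [line.strip() for line in open("vocab.txt", encoding="utf-8")]
    # Placeholder: the real set holds all 15 abbreviated journal names
    TOP15_JOURNALS = {"Ophthalmology", "Am J Ophthalmol"}

    encoder = tfds.deprecated.text.TokenTextEncoder(vocabulary)

    def encode_text(title, abstract):
        # .numpy() pulls the raw bytes out of the tensors tf.data hands us
        return (encoder.encode(title.numpy().decode("utf-8")),
                encoder.encode(abstract.numpy().decode("utf-8")))

    def encode_journal(journal):
        return 1 if journal.numpy().decode("utf-8") in TOP15_JOURNALS else 0

    def encode_text_map_fn(title, abstract):
        encoded_title, encoded_abstract = tf.py_function(
            encode_text, inp=[title, abstract], Tout=(tf.int64, tf.int64))
        encoded_title.set_shape([None])
        encoded_abstract.set_shape([None])
        return encoded_title, encoded_abstract

    def encode_journal_map_fn(journal):
        label = tf.py_function(encode_journal, inp=[journal], Tout=tf.int64)
        label.set_shape([])
        return label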

    Finally, to round out the data pipeline, we:

    1. Use tf.data.TextLineDataset to ingest the text files where the parsed titles, abstracts, and journal names reside
    2. Use the tf.data.Dataset.zip method to combine the title and abstract datasets into a single input dataset (input_dataset).
    3. Use input_dataset‘s map method to apply encode_text_map_fn so that the model will consume the inputs as lists of numbers
    4. Take the journal name dataset (journal_dataset) and use its map method to apply encode_journal_map_fn so that the model will consume the labels as 1’s or 0’s depending on whether the journal is one of the top 15
    5. Use the tf.data.Dataset.zip method to combine the input (input_dataset) with the output (journal_dataset) in a single dataset that our model can use
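
    Strung together, those five steps can look something like the sketch below (the text file names are assumptions, and the map functions come from the snippet above):

    # Sketch: build the zipped (inputs, label) dataset from the parsed text files
    import tensorflow as tf

    title_dataset = tf.data.TextLineDataset("titles.txt")        # one title per line
    abstract_dataset = tf.data.TextLineDataset("abstracts.txt")  # one abstract per line
    journal_dataset = tf.data.TextLineDataset("journals.txt")    # one journal name per line

    # Steps 2-3: pair titles with abstracts and convert both to integer sequences
    input_dataset = tf.data.Dataset.zip((title_dataset, abstract_dataset))
    input_dataset = input_dataset.map(encode_text_map_fn)

    # Step 4: convert journal names to 1 (top 15) / 0 labels
    journal_dataset = journal_dataset.map(encode_journal_map_fn)

    # Step 5: combine inputs and labels into one dataset the model can consume
    dataset = tf.data.Dataset.zip((input_dataset, journal_dataset))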

    To help the model generalize, it’s a good practice to split the data into four groups, so as to avoid biasing the training or evaluation with data the algorithm has already seen (this is why a good teacher makes the test related to, but not identical to, the homework), namely:

    1. The training set is the data our algorithm will learn from (and, as a result, should be the largest of the four).
    2. (I haven’t seen a consistent name for this, but I create) a stoptrain set to check on the training process after each epoch (a complete run through of the training set) so as to stop training if the resulting model starts to overfit the training set.
    3. The validation set is the data we’ll use to compare how different model architectures and configurations are doing once they’ve completed training.
    4. The hold-out set or test set is what we’ll use to gauge how our final algorithm performs. It is called a hold-out set because it’s not to be used until the very end, so as to make it a truly fair benchmark.

    tf.data makes this step very simple (using skip — to skip input entries — and take, which, as its name suggests, takes a given number of entries). Lastly, tf.data also provides batching methods like padded_batch to normalize the length of the inputs (since the different titles and abstracts have different lengths, we will truncate or pad each to be 72 and 200 words long, respectively) across the batches of 100 articles that the algorithm will train on successively:
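
    A sketch of that split-and-batch step, continuing from the dataset built above: the split sizes are placeholders, while the batch size of 100 and the 72 / 200 word lengths come from the description here. Note that padded_batch only pads, so a small truncation map is applied first.

    # Sketch: split into train / stoptrain / validation / test and create padded batches
    TEST_SIZE, VALIDATION_SIZE, STOPTRAIN_SIZE = 10_000, 10_000, 5_000  # placeholders
    BATCH_SIZE = 100
    PADDED_SHAPES = (([72], [200]), [])  # (title, abstract) inputs, scalar label

    def truncate(inputs, label):
        # Cap titles at 72 tokens and abstracts at 200 so padded_batch can pad up to a fixed size
        title_ids, abstract_ids = inputs
        return (title_ids[:72], abstract_ids[:200]), label

    dataset = dataset.map(truncate)

    test_dataset = dataset.take(TEST_SIZE)
    validation_dataset = dataset.skip(TEST_SIZE).take(VALIDATION_SIZE)
    stoptrain_dataset = dataset.skip(TEST_SIZE + VALIDATION_SIZE).take(STOPTRAIN_SIZE)
    train_dataset = dataset.skip(TEST_SIZE + VALIDATION_SIZE + STOPTRAIN_SIZE)

    train_dataset = train_dataset.padded_batch(BATCH_SIZE, padded_shapes=PADDED_SHAPES)
    stoptrain_dataset = stoptrain_dataset.padded_batch(BATCH_SIZE, padded_shapes=PADDED_SHAPES)
    validation_dataset = validation_dataset.padded_batch(BATCH_SIZE, padded_shapes=PADDED_SHAPES)
    test_dataset = test_dataset.padded_batch(BATCH_SIZE, padded_shapes=PADDED_SHAPES)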

    Training and Testing the Model

    Now that we have both a model and a data pipeline, training and evaluation is actually relatively straightforward and involves specifying which metrics to track as well as specifics on how the training should take place:

    • Because the goal is to gauge how good the model is, I’ve asked the model to report on four metrics (Sensitivity/Recall [what fraction of the articles from the top 15 journals did you correctly identify], Precision [what fraction of the articles you predicted would be in the top 15 actually were], Accuracy [how often did the model get the right answer], and AUROC [a measure of how well your model trades off between true positives and false positives]) as the algorithm trains
    • The training will use the Adam optimizer (a popular, efficient training approach) applied to a binary cross-entropy loss function (which makes sense as the algorithm is making a yes/no binary prediction).
    • The model is also set to stop training once the stoptrain set shows signs of overfitting (using tf.keras.callbacks.EarlyStopping in callbacks)
    • The training set (train_dataset), maximum number of epochs (complete iterations through the training set), and the stoptrain set (stoptrain_dataset) are passed to model.fit to start the training
    • After training, the model can be evaluated on the validation set (validation_dataset) using model.evaluate:
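
    A minimal sketch of that compile / train / evaluate flow, using the model and datasets from the sketches above; the epoch cap and early-stopping patience are assumptions.

    # Sketch: compile, train with early stopping on the stoptrain set, then evaluate
    import tensorflow as tf

    metrics = [
        tf.keras.metrics.Recall(name="sensitivity"),
        tf.keras.metrics.Precision(name="precision"),
        tf.keras.metrics.BinaryAccuracy(name="accuracy"),
        tf.keras.metrics.AUC(name="auroc"),
    ]

    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=metrics)

    early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=1,
                                                  restore_best_weights=True)

    model.fit(train_dataset,
              epochs=20,                          # upper bound; early stopping usually ends training sooner
              validation_data=stoptrain_dataset,  # the "stoptrain" set watches for overfitting
              callbacks=[early_stop])

    model.evaluate(validation_dataset)            # compare architectures / configurations here
    model.evaluate(test_dataset)                  # only once, at the very end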

    After several iterations on model architecture and parameters to find a good working model, the model can finally be evaluated on the holdout set (test_dataset) to see how it would perform:

    Results

    Your results will vary based on the exact shuffle of the data, which initial random variables your algorithm started training with, and a host of other factors, but when I ran it, training usually completed in 2-3 epochs, with performance resembling the data table below:

    Statistic | Value
    Accuracy | 87.2%
    Sensitivity/Recall | 64.9%
    Precision | 72.7%
    AUROC | 91.3%

    A different way of assessing the model is by showing its ROC (Receiver Operating Characteristic) curve, the basis for the AUROC (area under the ROC curve) number and a visual representation of the tradeoff the model can make between True Positives (on the y-axis) and False Positives (on the x-axis) as compared with a purely random guess (blue vs red). At >91% AUROC (max: 100%, min: 50%), the model outperforms many common risk scores, let alone a random guess:

    Interestingly, despite the model’s strong performance (both AUROC and the point estimates on sensitivity, precision, and accuracy), it wasn’t immediately apparent to me, or a small sample of human scientists who I polled, what the algorithm was looking for when I shared a set of very high scoring and very low scoring abstracts and titles. This is one of the quirks and downsides of many AI approaches — their “black box” nature — and one in which followup analysis may reveal more.

    Special thanks to Sophia Wang and Anthony Phan for reviewing abstracts and reading an early version of this!

    Link to GitHub page with all code

  • Calculating the Financial Returns to College

    Despite the recent spotlight on the staggering $1.5 trillion in student debt that 44 million Americans owe in 2019, there has been surprisingly little discussion on how to measure the value of a college education relative to its rapidly growing price tag (which is the reason so many take on debt to pay for it).

    Source: US News

    While it’s impossible to quantify all the intangibles of a college education, the tools of finance offer a practical, quantitative way to look at the tangible costs and benefits, which can shed light on (1) whether to go to college / which college to go to, (2) whether taking on debt to pay for college is a wise choice, and (3) how best to design policies around student debt.

    The below briefly walks through how finance would view the value of a college education and the soundness of taking on debt to pay for it. This lens can help guide students / families thinking about applying and paying for college and, surprisingly, suggests that there might actually be too little college debt; it also points to where policy should focus to address some of the issues around the burden of student debt.

    The Finance View: College as an Investment

    Through the lens of finance, the choice to go to college looks like an investment decision and can be evaluated in the same way that a company might evaluate investing in a new factory. Whereas a factory turns an upfront investment of construction and equipment into profits on production from the factory, the choice to go to college turns an upfront investment of cash tuition and missed salary while attending college into higher after-tax wages.

    Finance has come up with different ways to measure returns for an investment, but one that is well-suited here is the internal rate of return (IRR). The IRR boils down all the aspects of an investment (i.e., timing and amount of costs vs. profits) into a single percentage that can be compared with the rates of return on another investment or with the interest rate on a loan. If an investment’s IRR is higher than the interest rate on a loan, then it makes sense to use the loan to finance the investment (i.e., borrowing at 5% to make 8%), as it suggests that, even if the debt payments are relatively onerous in the beginning, the gains from the investment will more than compensate for it.

    To gauge what these returns look like, I put together a Google spreadsheet which generated the figures and charts below (this article in Investopedia explains the math in greater detail). I used publicly available data around wages (from the 2017 Current Population Survey, GoBankingRate’s starting salaries by school, and the National Association of Colleges and Employers’ starting salaries by major), tax brackets (using the 2018 income tax), and costs associated with college (from College Board’s statistics [PDF] and the Harvard admissions website). To simplify the comparisons, I assumed a retirement age of 65 and that nobody gets a degree more advanced than a Bachelor’s.

    To give an example: if Sally Student can get a starting salary after college in line with the average salary of an 18-24 year old Bachelor’s degree-only holder ($47,551), would have earned the average salary of an 18-24 year old high school diploma-only holder had she not gone to college ($30,696), and expects wage growth similar to what age-matched cohorts saw from 1997-2017, then, for a 4-year degree at a non-profit private school where Sally pays the average net (meaning after subtracting grants and tax credits) tuition, fees, room & board ($26,740/yr in 2017, or a 4-year cost of ~$106,960), the IRR of that investment in college would be 8.1%.
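
    For those who prefer code to spreadsheets, below is a deliberately simplified sketch of this IRR calculation in Python (using the numpy-financial package); it ignores taxes and assumes flat wage growth rates, so it will not exactly reproduce the 8.1% figure from the spreadsheet.

    # Simplified sketch of the college-as-investment IRR (ignores taxes; the flat
    # wage growth rates below are assumptions for illustration only)
    import numpy_financial as npf

    years_in_college = 4
    working_years = 65 - 18                  # study / work from age 18 until retirement at 65
    net_annual_cost = 26_740                 # average net tuition, fees, room & board
    hs_salary, ba_salary = 30_696, 47_551    # diploma-only vs. bachelor's-only starting salaries
    hs_growth, ba_growth = 0.035, 0.040      # assumed flat annual wage growth rates

    cash_flows = []
    for year in range(working_years):
        hs_wage = hs_salary * (1 + hs_growth) ** year
        if year < years_in_college:
            # While in college: pay costs and forgo the diploma-only wage
            cash_flows.append(-(net_annual_cost + hs_wage))
        else:
            # After college: the "return" is the wage premium over the diploma-only path
            ba_wage = ba_salary * (1 + ba_growth) ** (year - years_in_college)
            cash_flows.append(ba_wage - hs_wage)

    print(f"IRR of the college investment: {npf.irr(cash_flows):.1%}")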

    How to Benchmark Rates of Return

    Is that a good or a bad return? Well, in my opinion, 8.1% is pretty good. It’s much higher than what you’d expect from a typical savings account (~0.1%) or a CD or a Treasury Bond (as of this writing), and is also meaningfully higher than the 5.05% rate charged for federal subsidized loans for the 2018-2019 school year — this means borrowing to pay for college would be a sensible choice. That being said, it’s not higher than the stock market (the S&P 500’s 90-year total return is ~9.8%) or the 20% that you’d need to get into the top quartile of Venture Capital/Private Equity funds [PDF].

    What Drives Better / Worse Rates of Return

    Playing out different scenarios shows which factors are important in determining returns. An obvious factor is the cost of college:

    T&F: Tuition & Fees; TFR&B: Tuition, Fees, Room & Board
    List: Average List Price; Net: Average List Price Less Grants and Tax Benefits
    Blue: In-State Public; Green: Private Non-Profit; Red: Harvard

    As evident from the chart, there is a huge difference between the rate of return Sally would get if she landed the same job but instead attended an in-state public school, did not have to pay for room & board, and got a typical level of financial aid (a stock-market-beating IRR of 11.1%) versus the world where she had to pay full list price at Harvard (IRR of 5.3%). In one case, attending college is a fantastic investment and Sally borrowing money to pay for it makes great sense (investors everywhere would love to borrow at ~5% and get ~11%). In the other, the decision to attend college is less straightforward (financially), and it would be very risky for Sally to borrow money at anything near subsidized rates to pay for it.

    Some other trends jump out from the chart. Attending an in-state public university improves returns for the average college wage-earner by 1-2% compared with attending private universities (comparing the blue and green bars). Getting an average amount of financial aid (paying net vs list) also seems to improve returns by 0.7-1% for public schools and 2% for private.

    As with college costs, the returns also understandably vary by starting salary:

    There is a night and day difference between the returns Sally would see making $40K per year (~$10K more than an average high school diploma holder) versus if she made what the average Caltech graduate does post-graduation (4.6% vs 17.9%), let alone if she were to start with a six-figure salary (IRR of over 21%). If Sally is making six figures, she would be making better returns than the vast majority of venture capital firms, but if she were starting at $40K/yr, her rate of return would be lower than the interest rate on subsidized student loans, making borrowing for school financially unsound.

    Time spent in college also has a big impact on returns:

    Graduating sooner not only reduces the amount of foregone wages, it also means earning higher wages sooner and for more years. As a result, if Sally graduates in two years while still paying for four years’ worth of education costs, she would see a higher return (12.6%) than if she were to graduate in three years and save a year’s worth of costs (11.1%)! Similarly, if Sally were to finish school in five years instead of four, this would lower her returns (6.3% if still only paying for four years, 5.8% if adding an extra year’s worth of costs). The upshot is that each extra (or fewer) year spent in college is a roughly 2% hit (or boost) to returns!

    Finally, how quickly a college graduate’s wages grow relative to a high school diploma holder’s also has a significant impact on the returns to a college education:

    Census/BLS data suggests that, between 1997 and 2017, wages of bachelor’s degree holders grew faster on an annualized basis by ~0.7% per year than for those with only a high school diploma (6.7% vs 5.8% until age 35, 4.0% vs 3.3% for ages 35-55, both sets of wage growth appear to taper off after 55).

    The numbers show that if Sally’s future wages grew at the same rate as the wages of those with only a high school diploma, her rate of return drops to 5.3% (just barely above the subsidized loan rate). On the other hand, if Sally’s wages end up growing 1% faster until age 55 than they did for similar aged cohorts from 1997-2017, her rate of return jumps to a stock-market-beating 10.3%.

    Lessons for Students / Families

    What do all the charts and formulas tell a student / family considering college and the options for paying for it?

    First, college can be an amazing investment, well worth taking on student debt and the effort to earn grants and scholarships. While there is well-founded concern about the impact that debt load and debt payments can have on new graduates, in many cases, the financial decision to borrow is a good one. Below is a sensitivity table laying out the rates of return across a wide range of starting salaries (the rows in the table) and costs of college (the columns in the table) and color codes how the resulting rates of return compare with the cost of borrowing and with returns in the stock market (red: risky to borrow at subsidized rates; white: does make sense to borrow at subsidized rates but it’s sensible to be mindful of the amount of debt / rates; green: returns are better than the stock market).

    Except for graduates with well below average starting salaries (less than or equal to $40,000/yr), most of the cells are white or green. At the average starting salary, except for those without financial aid attending a private school, the returns are generally better than subsidized student loan rates. For those attending public schools with financial aid, the returns are better than what you’d expect from the stock market.

    Secondly, there are ways to push returns to a college education higher. They involve effort and sometimes painful tradeoffs but, financially, they are well worth considering. Students / families choosing where to apply or where to go should keep in mind costs, average starting salaries, quality of career services, and availability of financial aid / scholarships / grants, as all of these factors will have a sizable impact on returns. After enrollment, student choices / actions can also have a meaningful impact: graduating in fewer semesters/quarters, taking advantage of career resources to research and network into higher starting salary jobs, applying for scholarships and grants, and, where possible, going for a 4th/5th year masters degree can all help students earn higher returns to help pay off any debt they take on.

    Lastly, use the spreadsheet*! The figures and charts above are for a very specific set of scenarios and don’t factor in any particular individual’s circumstances or career trajectory, nor is it very intelligent about selecting what the most likely alternative to a college degree would be. These are all factors that are important to consider and may dramatically change the answer.

    *To use the Google Sheet, you must be logged into a Google account; use the “Make a Copy” command in the File menu to save a version to your Google Drive and edit the tan cells with red numbers in them to whatever best matches your situation and see the impact on the yellow highlighted cells for IRR and the age when investment pays off

    Implications for Policy on Student Debt

    Given the growing concerns around student debt and rising tuitions, I went into this exercise expecting to find that the rates of return across the board would be mediocre for all but the highest earners. I was (pleasantly) surprised to discover that a college graduate earning an average starting salary would be able to achieve a rate of return well above federal loan rates even at a private (non-profit) university.

    While the rate of return is not a perfect indicator of loan affordability (as it doesn’t account for how onerous the payments are compared to early salaries), the fact that the rates of return are so high is a sign that, contrary to popular opinion, there may actually be too little student debt rather than too much, and that the right policy goal may actually be to find ways to encourage the public and private sector to make more loans to more prospective students.

    As for concerns around affordability, while proposals to cancel all student debt play well to younger voters, the fact that many graduates are enjoying very high returns suggests that such a blanket policy is likely unnecessary, anti-progressive (after all, why should the government zero out the costs on high-return investments for the soon-to-be upper and upper-middle classes?), and fails to address the root cause of the issue (mainly that there shouldn’t be institutions granting degrees that fail to be good financial investments). Instead, a more effective approach might be to:

    • Require all institutions to publish basic statistics (i.e. on costs, availability of scholarships/grants, starting salaries by degree/major, time to graduation, etc.) to help students better understand their own financial equation
    • Hold educational institutions accountable when too many students graduate with unaffordable loan burdens/payments (i.e. as a fraction of salary they earn and/or fraction of students who default on loans) and require them to make improvements to continue to qualify for federally subsidized loans
    • Make it easier for students to discharge student debt in bankruptcy, and increase government oversight of collectors / borrower rights to prevent abuse
    • Provide government-supported loan modifications (deferrals, term changes, rate modifications, etc.) where short-term affordability is an issue (but the long-term returns story looks good), and loan cancellation in cases where the debt load is unsustainable in the long term (where long-term returns are not keeping up) or where the debt was used for an institution that is now being denied new loans due to unaffordability
    • Make the path to public service loan forgiveness (where graduates who spend 10 years working for non-profits and who have never missed an interest payment get their student loans forgiven) clearer, and address some of the issues which have led to 99% of applications to date being rejected

    Special thanks to Sophia Wang, Kathy Chen, and Dennis Coyle for reading an earlier version of this and sharing helpful comments!

    Thought this was interesting or helpful? Check out some of my other pieces on investing / finance.

  • Lyft vs Uber: A Tale of Two S-1’s

    You can learn a great deal from reading and comparing the financial filings of two close competitors. Tech-finance nerd that I am, you can imagine how excited I was to see Lyft’s and Uber’s respective S-1’s become public within mere weeks of each other.

    While the general financial press has covered a lot of the top-level figures on profitability (or lack thereof) and revenue growth, I was more interested in understanding the unit economics — what is the individual “unit” (i.e. a user, a sale, a machine, etc.) of the business and what does the history of associated costs and revenues say about how the business will (or will not) create durable value over time.

    For two-sided regional marketplaces like Lyft and Uber, an investor should understand the full economic picture for (1) the users/riders, (2) the drivers, and (3) the regional markets. Sadly, their S-1’s don’t make it easy to get much on (2) or (3) — probably because the companies consider the pertinent data to be highly sensitive information. They did, however, provide a fair amount of information on users/riders and rides and, after doing some simple calculations, a couple of interesting things emerged.

    Uber’s Users Spend More, Despite Cheaper Rides

    As someone who first knew of Uber as the UberCab “black-car” service, and who first heard of Lyft as the Zimride ridesharing platform, I was surprised to discover that Lyft’s average ride price is significantly more expensive than Uber’s and the gap is growing! In Q1 2017, Lyft’s average bookings per ride was $11.74 and Uber’s was $8.41, a difference of $3.33. But, in Q4 2018, Lyft’s average bookings per ride had gone up to $13.09 while Uber’s had declined to $7.69, increasing the gap to $5.40.

    Sources: Lyft S-1, Uber S-1

    This is especially striking considering the different definitions that Lyft and Uber have for “bookings” — Lyft excludes “pass-through amounts paid to drivers and regulatory agencies, including sales tax and other fees such as airport and city fees, as well as tips, tolls, cancellation, and additional fees” whereas Uber’s includes “applicable taxes, tolls, and fees”. This gap is likely also due to Uber’s heavier international presence (where they now generate 52% of their bookings). It would be interesting to see this data on a country-by-country basis (or, more importantly, a market-by-market one as well).

    Interestingly, an average Uber rider appears to also take ~2.3 more rides per month than an average Lyft rider, a gap which has persisted fairly stably over the past 3 years even as both platforms have boosted the number of rides an average rider takes. While it’s hard to say for sure, this suggests Uber is either having more luck in markets that favor frequent use (like dense cities), seeing more traction with its lower-priced Pool product vs Lyft’s Line product (where multiple users can share a ride), or finding that its general pricing is encouraging greater use.

    Sources: Lyft S-1, Uber S-1

    Note: you’ll see “~monthly” used throughout the charts in this post because the aggregate data — rides, bookings, revenue, etc. — given in the regulatory filings is quarterly, but the rider/user count provided is monthly. As a result, the figures here are approximations based on available data, i.e. computed by dividing quarterly data by 3.

    What does that translate to in terms of how much an average rider is spending on each platform? Perhaps not surprisingly, Lyft’s average rider spend has been growing and has almost caught up to Uber’s which is slightly down.

    Sources: Lyft S-1, Uber S-1

    However, Uber’s new businesses like UberEats are meaningfully growing its share of wallet with users (and, almost dollar for dollar, re-opening the gap on spend per user that Lyft had narrowed over the past few years). In Q4 2018, the gap between the yellow line (total bookings per user, including new businesses) and the red line (total bookings per user just for rides) is almost $10 / user / month! It’s no wonder that, in its filings, Lyft calls its users “riders” while Uber calls them “Active Platform Consumers”.

    Despite Pocketing More per Ride, Lyft Loses More per User

    Long-term unit profitability is about more than how much an average user is spending; it’s also about how much of that spend hits the company’s bottom line. Perhaps not surprisingly, because its rides are more expensive, a larger percentage of Lyft’s bookings ends up as gross profit (revenue less direct costs to serve it, like insurance) — ~13% in Q4 2018 compared with ~9% for Uber. While Uber’s has bounced up and down, Lyft’s has steadily increased (up nearly 2x from Q1 2017). I would hazard a guess that Uber’s has also increased in its more established markets but that its expansion efforts into new markets (here and abroad) and new service categories (UberEats, etc.) have kept the overall level lower.

    Sources: Lyft S-1, Uber S-1

    Note: the gross margin I’m using for Uber adds back a depreciation and amortization line that Uber breaks out separately, to keep the Lyft and Uber numbers more directly comparable. There may be other variations in definitions at work here, including the fact that Uber includes taxes, tolls, and fees in bookings while Lyft does not. In its filings, Lyft also calls out an analogous “Contribution Margin,” which is useful, but I chose this gross margin definition to make the numbers more directly comparable.

    The main driver of this seems to be a higher take rate (the % of bookings that a company keeps as revenue) — nearly 30% for Lyft in Q4 2018 but only about 20% for Uber (and under 10% for UberEats).

    Sources: Lyft S-1, Uber S-1

    Note: Uber uses a different definition of take rate in its filings, based on a separate cut of “Core Platform Revenue” which excludes certain items around referral fees and driver incentives. I’ve chosen to use total revenue to be more directly comparable.
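
    As a rough sketch of the two ratios being compared here (take rate as revenue over bookings, and gross margin as gross profit over bookings), here is some illustrative Python; the inputs are placeholders chosen to land near the Lyft Q4 2018 ratios cited above, not figures pulled from either filing.

        # take rate = revenue / bookings
        # gross margin (as % of bookings) = (revenue - cost of revenue) / bookings
        def take_rate(revenue, bookings):
            return revenue / bookings

        def gross_margin_of_bookings(revenue, cost_of_revenue, bookings):
            return (revenue - cost_of_revenue) / bookings

        bookings, revenue, cost_of_revenue = 2.0e9, 0.6e9, 0.34e9   # hypothetical quarter
        print(f"take rate: {take_rate(revenue, bookings):.0%}")                                     # 30%
        print(f"gross margin: {gross_margin_of_bookings(revenue, cost_of_revenue, bookings):.0%}")  # 13%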

    The higher take rate and higher bookings per user have translated into an impressive increase in gross profit per user. Whereas Lyft lagged Uber by almost 50% on gross profit per user at the beginning of 2017, it has now surpassed Uber, even after adding UberEats and other new business revenue to Uber’s numbers.

    Sources: Lyft S-1, Uber S-1

    All of this data raises the question: given Lyft’s growth and lead on gross profit per user, can it grow its way into greater profitability than Uber? Or, to put it more precisely, are Lyft’s other costs per user declining as it grows? Sadly, the data does not seem to pan out that way.

    Sources: Lyft S-1, Uber S-1

    While Uber had significantly higher OPEX (expenditures on sales & marketing, engineering, overhead, and operations) per user at the start of 2017, the two companies have since reversed positions: Uber made significant changes in 2018 that lowered its OPEX per user to under $9, whereas Lyft’s has been above $10 for the past two quarters. The result is that Uber has lost less money per user than Lyft since the end of 2017.

    Sources: Lyft S-1, Uber S-1

    The story is similar for profit per ride. Uber has consistently lost less per ride than Lyft since 2017, and it has only widened that lead, despite the fact that I’ve included the costs of Uber’s other businesses in its cost per ride.

    Sources: Lyft S-1, Uber S-1

    Does Lyft’s Growth Justify Its Higher Spend?

    One possible interpretation of Lyft’s higher OPEX spend per user is that Lyft is simply investing in operations, sales, and engineering to open up new markets and create new products for growth. To see if this strategy has paid off, I took a look at Lyft’s and Uber’s respective user growth over this period.

    Sources: Lyft S-1, Uber S-1

    The data shows that Lyft’s compounded quarterly growth rate (CQGR) from Q1 2016 to Q4 2018 of 16.4% is only barely higher than Uber’s 15.3%, which makes it hard to justify spending nearly $2 more per user on OPEX over the last two quarters.
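
    For reference, the CQGR figure is just a compounding calculation over the 11 quarters between Q1 2016 and Q4 2018. Here is a small Python sketch; the user counts are illustrative placeholders rather than figures from the filings.

        # compounded quarterly growth rate between two quarters
        def cqgr(start_users, end_users, quarters_elapsed):
            return (end_users / start_users) ** (1 / quarters_elapsed) - 1

        # hypothetical rider counts for Q1 2016 and Q4 2018 (11 quarters apart)
        print(f"CQGR: {cqgr(3.5e6, 18.6e6, 11):.1%}")   # ~16.4%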

    Interestingly, despite all the press and commentary about #deleteUber, it doesn’t seem to have really made a difference in Uber’s overall user growth (it’s actually pretty hard to tell from the chart above that the whole thing happened around mid-Q1 2017).

    How are Drivers Doing?

    While there is much less data available on driver economics in the filings, this is a vital piece of the unit economics story for a two-sided marketplace. Luckily, Uber and Lyft both provide some information in their S-1’s on the number of drivers on each platform in Q4 2018, and it is illuminating.

    Sources: Lyft S-1, Uber S-1

    The average Uber driver on the platform in Q4 2018 took home nearly double what the average Lyft driver did! Uber drivers were also better “utilized”: they handled 136% more rides than the average Lyft driver and, despite Uber’s lower price per ride, generated more total bookings.

    It should be said that this is only a point-in-time comparison (and it’s hard to know whether Q4 2018 was an odd quarter or if there is odd seasonality here), and it papers over many other important factors (which taxes / fees / tolls are reflected, the fact that none of these numbers reflect tips, whether some drivers are doing shorter shifts, what this looks like specifically in US/Canada vs elsewhere, whether Uber drivers benefit from doing both UberEats and Uber rideshare, etc.). But the comparison is striking and should be alarming for Lyft.
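
    One back-of-the-envelope way to approximate a figure like this from the filings is to treat the share of bookings that the company does not keep as revenue as the driver payout pool and divide it by the number of drivers (though it is not clear this is exactly how the chart above was built). The Python sketch below does that; it ignores tips, tolls, taxes, and driver incentives, and the inputs are placeholders rather than numbers from either S-1.

        # rough quarterly take-home per driver ~ (bookings - revenue) / drivers
        def approx_take_home_per_driver(bookings, revenue, drivers):
            return (bookings - revenue) / drivers

        # hypothetical quarter: $10B bookings, $2.5B revenue, 2M drivers
        print(f"${approx_take_home_per_driver(10e9, 2.5e9, 2e6):,.0f} per driver")   # $3,750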

    Closing Thoughts

    I’d encourage investors thinking about investing in either company to do their own deeper research (especially as the competitive dynamic plays out not in one large market but across many regional ones, each with their own attributes). That being said, there are some interesting takeaways from this initial analysis:

    • Lyft has made impressive progress at increasing the value of rides on its platform and increasing the share of transactions it keeps. One would guess that Uber has made similar progress within its established US markets.
    • Despite the fact that Uber is rapidly expanding overseas into markets that face more price constraints than in the US, it continues to generate significantly better user economics and driver economics (if Q4 2018 is any indication) than Lyft.
    • Something happened at Uber at the end of 2017 / start of 2018 (which coincides nicely with Dara Khosrowshahi taking over as CEO) that led to better spending discipline and, as a result, better unit economics despite falling gross profit per user.
    • Uber’s new businesses (in particular UberEats) have had a significant impact on Uber’s share of wallet.
    • Lyft will need to find more cost-effective ways of growing its business and servicing its existing users & drivers if it wishes to achieve long-term sustainability as its current spend is hard to justify relative to its user growth.

    Special thanks to Eric Suh for reading and editing an earlier version!

    Thought this was interesting or helpful? Check out some of my other pieces on investing / finance.

  • How to Regulate Big Tech

    There’s been a fair amount of talk lately about proactively regulating — and maybe even breaking up — the “Big Tech” companies.

    Full disclosure: this post discusses regulating large tech companies. I own shares in several of these both directly (in the case of Facebook and Microsoft) and indirectly (through ETFs that own stakes in large companies).

    Source: MIT Sloan

    Like many, I have become increasingly uneasy over the fact that a small handful of companies, with few credible competitors, have amassed so much power over our personal data and what information we see. As a startup investor and former product executive at a social media startup, I can especially sympathize with concerns that these large tech companies have created an unfair playing field for smaller companies.

    At the same time, though, I’m mindful of all the benefits that the tech industry — including the “tech giants” — has brought: amazing products and services, broader and cheaper access to markets and information, and a tremendous wave of job and wealth creation vital to many local economies. For that reason, despite my concerns about “Big Tech”’s growing power, I am wary of reaching for “quick fixes” that might put those benefits at risk.

    As a result, I’ve been disappointed that much of the discussion has centered on knee-jerk proposals like imposing blanket stringent privacy regulations and forcefully breaking up large tech companies. These are policies that I fear are not only self-defeating but would also jeopardize the benefits of having a flourishing tech industry.

    The Challenges with Regulating Tech

    Technology is hard to regulate. The ability of software developers to collaborate and build on each other’s innovations means the tech industry moves far faster than standard regulatory and legislative cycles. As a result, many of the key laws on the books today that apply to tech date back decades — before Facebook or the iPhone even existed. It’s important to remember that even well-intentioned laws and regulations governing tech can cement in place rules which don’t keep up when the companies and the social & technological forces involved change.

    Another factor which complicates tech policy is that the traditional “big is bad” mentality ignores the benefits of having large platforms. While Amazon’s growth has hurt many brick & mortar retailers and eCommerce competitors, its extensive reach and infrastructure enabled businesses like Anker and Instant Pot to get to market in a way which would’ve been virtually impossible before. While the dominance of Google’s Android platform in smartphones raised concerns from European regulators, it’s hard to deny that the companies which built millions of mobile apps and tens of thousands of different types of devices running on Android would have found it much more difficult to build their businesses without such a unified software platform. Policy aimed at “Big Tech” should be wary of dismantling the platforms that so many current and future businesses rely on.

    It’s also important to remember that poorly crafted regulation in tech can be self-defeating. The most effective way to deal with the excesses of “Big Tech”, historically, has been creating opportunities for new market entrants. After all, many tech companies previously thought to be dominant (like Nokia, IBM, and Microsoft) lost their positions not because of regulation or antitrust, but because new technology paradigms (e.g., smartphones, cloud), business models (e.g., subscription software, ad-sponsored), and market entrants (e.g., Google, Amazon) had the opportunity to flourish. Because rules aimed at big tech companies (e.g., Article 13 / GDPR) generally fall hardest on small companies (who are least able to afford the infrastructure and people to manage compliance), it’s important to keep in mind how solutions for “Big Tech” problems affect smaller companies and new concepts as well.

    Framework for Regulating “Big Tech”

    If only it were so easy… Source: XKCD

    To be 100% clear, I’m not saying that the tech industry and big platforms should be given a pass on rules and regulation. If anything, I believe that laws and regulation play a vital role in creating flourishing markets.

    But, instead of treating “Big Tech” as just a problem to kill, I think we’d be better served by laws / regulations that recognize the limits of regulation on tech and, instead, focus on making sure emerging companies / technologies can compete with the tech giants on a level playing field. To that end, I hope to see more ideas that embrace the following four pillars:

    I. Tiering regulation based on size of the company

    Regulations on tech companies should be tiered based on size, with the most stringent rules falling on the largest companies. Size should include traditional metrics like revenue but also, in this age of marketplace platforms and freemium/ad-sponsored business models, account for the number of users (e.g., monthly active users) and third-party partners.

    In this way, the companies with the greatest potential for harm and the greatest ability to bear the costs face the brunt of regulation, leaving smaller companies & startups with greater flexibility to innovate and iterate.

    II. Championing data portability

    One of the reasons it’s so difficult for competitors to challenge the tech giants is the user lock-in that comes from their massive data advantage. After all, how does a rival social network compete when a user’s photos and contacts are locked away inside Facebook?

    While Facebook (and, to their credit, some of the other tech giants) does offer ways to export user data and to delete user data from their systems, these tend to be unwieldy, manual processes that make it difficult for a user to bring their data to a competing service. Requiring the largest tech platforms to make this functionality easier to use (e.g., letting others import your contact list and photos with the same ease with which you can log in to many apps today using Facebook) would give users the ability to hold tech companies accountable for bad behavior or failing to innovate (by being able to walk away) and would foster competition by letting new companies compete not on data lock-in but on features and business model.

    III. Preventing platforms from playing unfairly

    3rd party platform participants (e.g., websites listed on Google, Android/iOS apps like Spotify, sellers on Amazon) are understandably nervous when the platform owners compete with their own offerings (e.g., Google Places, Apple Music, Amazon first-party sales). As a result, some have even called for banning platform owners from offering their own products and services.

    I believe that is an overreaction. Platform owners offering attractive products and services (i.e., Google offering turn-by-turn navigation on Android phones) can be a great thing for users (after all, most prominent platforms started by providing compelling first-party offerings) and for 3rd party participants if these offerings improve the attractiveness of the platform overall.

    What is hard to justify is when platform owners stack the deck in their favor using anti-competitive moves such as banning or reducing the visibility of competitors, crippling third-party offerings, making excessive demands on 3rd parties, etc. It’s these sorts of actions by the largest tech platforms that pose a risk to consumer choice and competition and that should face regulatory scrutiny, not the mere fact that a large platform exists or that the platform owner chooses to participate in it.

    IV. Modernizing how anti-trust thinks about defensive acquisitions

    The rise of the tech giants has led to many calls to unwind some of the pivotal mergers and acquisitions in the space. As much as I believe that anti-trust regulators made the wrong calls on some of these transactions, I am not convinced, beyond just wanting to punish “Big Tech” for being big, that the Pandora’s Box of legal and financial issues (for the participants, employees, users, and for the tech industry more broadly) that would be opened would be worthwhile relative to pursuing other paths to regulate bad behavior directly.

    That being said, it’s become clear that anti-trust needs to move beyond narrow revenue-share and pricing-based definitions of anti-competitiveness (which do not always apply to freemium/ad-sponsored business models). Anti-trust prosecutors and regulators need to become much more thoughtful and assertive about how some acquisitions are done simply to avoid competition (Google’s acquisition of Waze and Facebook’s acquisition of WhatsApp are two examples of landmark deals that probably should have been evaluated more closely).

    Wrap-Up

    Source: OECD Forum Network

    This is hardly a complete set of rules and policies needed to address growing concerns about “Big Tech”. Even within this framework, there are many details (e.g., who the specific regulators are, what auditing powers they have, the details of their mandate, the specific thresholds and number of tiers, whether pre-installing an app counts as unfair, etc.) that need to be defined and that could make or break the effort. But I believe this is a good set of principles that balances the need to foster a tech industry that will continue to grow and drive innovation with the need to respond to growing concerns about “Big Tech”.

    Special thanks to Derek Yang and Anthony Phan for reading earlier versions and giving me helpful feedback!

  • Migrating WordPress to AWS Lightsail and Going with Let’s Encrypt!

    (Update Jan 2021: Bitnami has made available a new tool bncert which makes it even easier to enable HTTPS with a Let’s Encrypt certificate; the instructions below using Let’s Encrypt’s certbot still work but I would recommend people looking to enable HTTPS to use Bitnami’s new bncert process)

    I recently made two big changes to the backend of this website to keep up with the times as internet technology continues to evolve.

    First, I migrated from my previous web hosting arrangements at WebFaction to Amazon Web Services’s new Lightsail offering. I have greatly enjoyed WebFaction’s super simple interface and fantastic documentation, which seemed tailored to amateur coders like myself (having enough coding and customization chops to do some cool projects but not a lot of confidence or experience in dealing with the innards of a server). But the value for money that AWS Lightsail offers ($3.50/month for a Linux VPS including a static IP vs. the $10/month I would need to pay to eventually renew my current setup) ultimately proved too compelling to ignore (and for a simple personal site, I didn’t need the extra storage or memory). That, coupled with the deterioration in service quality I had been experiencing with WebFaction (many more downtime email alerts from WordPress’s Jetpack plugin and general lagginess in the WordPress administrative panel) and the chance to learn more about the world’s pre-eminent cloud services provider, made this an easy decision.

    Given how Google Chrome now (correctly) marks all websites which don’t use HTTPS/SSL as insecure and Let’s Encrypt has been offering SSL certificates for free for several years, the second big change I made was to embrace HTTPS to partially modernize my website and make it at least not completely insecure. Along the way, I also tweaked my URLs so that all my respective subdomains and domain variants would ultimately point to https://benjamintseng.com/.

    For anyone who is also interested in migrating an existing WordPress deployment on another host to AWS Lightsail and turning on HTTPS/SSL, here are the steps I followed (gleaned from some online research and a bit of trial & error). It’s not as straightforward as some other setups, but it’s very doable if you are willing to do a little bit of work in the AWS console:

    • Follow the (fairly straightforward) instructions in the AWS Lightsail tutorial around setting up a clean WordPress deployment. I would skip sub-step 3 of step 6 (directing your DNS records to point to the Lightsail nameservers) until later (when you’re sure the transfer has worked, so your domain continues to point to a functioning WordPress deployment).
    • Unless you are currently not hosting any custom content (no images, no videos, no Javascript files, etc.) on your WordPress deployment, I would ignore the WordPress migration tutorial on the AWS Lightsail website (which won’t show you how to transfer this custom content over) in favor of this Bitnami how-to guide (Bitnami provides the WordPress server image that Lightsail uses for its WordPress instance). The Bitnami guide takes advantage of the fact that the Bitnami WordPress image includes the All-in-One WP Migration plugin, which can do single-file backups of your WordPress site up to 512 MB for free (larger sites will need to pay for the premium version of the plugin).
      • If, like me, you have other content statically hosted on your site outside of WordPress, I’d recommend storing it in WordPress as part of the Media Library, which has gotten a lot more sophisticated over the past few years. It’s where I now store the files associated with my Projects.
      • Note: if, like me, you are using Jetpack’s site accelerator to cache your images/static file assets, don’t worry if upon visiting your site some of the images appear broken. Jetpack relies on the URL of the asset to load correctly. This should get resolved once you point your DNS records accordingly (literally the next step) and any other issues should go away after you mop up any remaining references to the wrong URLs in your database (see the bullet below where I reference the Better Search Replace plugin).
    • If you followed my advice above, now would be the time to change your DNS records to point to the Lightsail nameservers (sub-step 3 of step 6 of the AWS Lightsail WordPress tutorial) — wait a few hours to make sure the DNS settings have propagated, then test your domain and make sure it points to a page with the Bitnami banner in the lower right (a sign that you’re using the Bitnami server image; see below)
    The Bitnami banner in the lower-right corner of the page you should see if your DNS propagated correctly and your Lightsail instance is up and running
    • To remove that ugly banner, follow the instructions in this tutorial (use the AWS Lightsail panel to get to the SSH server console for your instance and, assuming you followed the above instructions, follow the instructions for Apache)
    • Assuming your webpage and domain all work (preferably without any weird uptime or downtime issues), you can proceed with this tutorial to provision a Let’s Encrypt SSL certificate for your instance. It can be a bit tricky as it entails spending a lot of time in the SSH server console (which you can get to from the AWS Lightsail panel) and tweaking settings in the AWS Lightsail DNS Zone manager, but the tutorial does a good job of walking you through all of it. (Update Jan 2021: Bitnami has made available a new tool bncert which makes it even easier to enable HTTPS. While the link above using Let’s Encrypt’s certbot still works, I would recommend people use Bitnami’s new bncert process going forward)
      • I would strongly encourage you to wait until all the DNS settings have propagated and your instance is not having any strange downtime (as mine did when I first tried this) before provisioning the certificate; otherwise, if you have trouble connecting to your page, it won’t be immediately clear what is to blame or how to fix it.
    • I used the plugin Better Search Replace to replace all references to intermediate domains (e.g., the IP addresses for your Lightsail instance that may have stuck around after Step 1) or the non-HTTPS domains (e.g., http://yourdomain.com or http://www.yourdomain.com) with your new HTTPS domain in the MySQL databases that power your WordPress deployment (if in doubt, just select the wp_posts table). You can also take this opportunity to direct all your yourdomain.com traffic to www.yourdomain.com (or vice versa). You can also do this directly in MySQL, but the plugin allows you to do this across multiple tables very easily and allows you to do a “dry run” first, where it finds and counts all the times it will make a change before you actually execute it.
    • If you want to redirect all the traffic to www.yourdomain.com to yourdomain.com, you have two options. If your domain registrar is forward thinking and does simple redirects for you like Namecheap does, that is probably the easiest path. That is sadly not the path I took because I transferred my domain over to AWS’s Route 53 which is not so enlightened. If you also did the same thing / have a domain registrar that is not so forward thinking, you can tweak the Apache server settings to achieve the same effect. To do this, go into the SSH server console for your Lightsail instance and:
      • Run cd ~/apps/wordpress/conf
      • To make a backup which you can restore later (if you screw things up), run cp httpd-app.conf httpd-app.conf.old (using cp rather than mv keeps the original file in place so you can edit it in the next step)
      • I’m going to use the Nano editor because it’s the easiest for a beginner (but feel free to use vi or emacs if you prefer); run nano httpd-app.conf
      • Use your cursor and find the line that says RewriteEngine On that is just above the line that says #RewriteBase /wordpress/
      • Enter the following lines
        • # begin www to non-www
        • RewriteCond %{HTTP_HOST} ^www\.(.*)$ [NC]
        • RewriteRule ^(.*)$ https://%1/$1 [R=permanent,L]
        • # end www to non-www
        • The first and last lines are just comments so that you can go back and remind yourself of what you did and where. The middle two lines match incoming requests whose host starts with www and redirect them to the bare domain.
        • With any luck, your file will look like the image below — hit ctrl+X to exit, and hit ‘Y’ when prompted (“to save modified buffer”) to save your work
      • Run sudo /opt/bitnami/ctlscript.sh restart to restart your server and test out the domain in a browser to make sure everything works
        • If things go bad, run mv httpd-app.conf.old httpd-app.conf and then restart everything by running sudo /opt/bitnami/ctlscript.sh restart
    What httpd-app.conf should look like in your Lightsail instance SSH console after the edits

    I’ve only been using AWS Lightsail for a few days, but my server already feels much more responsive. It’s also nice to go to my website and not see “not secure” in my browser address bar (it’s also apparently an SEO bump for most search engines). It’s also great to know that Lightsail is integrated deeply into AWS, which means the additional features and capabilities that have made AWS the industry leader (e.g., load balancers, CloudFront as a CDN, scaling up instance resources, using S3 as a datastore, or even ultimately upgrading to full-fledged EC2 instances) are readily available.

  • Snap Inc by the Numbers

    A look at what Snap’s S-1 reveals about their growth story and unit economics

    If you follow the tech industry at all, you will have heard that consumer app darling Snap Inc. (makers of the app Snapchat) has filed to go public. The ensuing Form S-1 that has recently been made available has left tech-finance nerds like yours truly drooling over the until-recently-super-secretive numbers behind their business.

    Oddly apt banner; Source: Business Insider

    Much of the commentary in the press to date has been about how unprofitable the company is (having lost over $500M in 2016 alone). I have been unimpressed with that line of thinking — the bottom line in a given year is hardly the right measure for assessing a young, high-growth company.

    While full-time Wall Street analysts will pore over the figures and comparables in much greater detail than I can, I decided to take a quick peek at the numbers to gauge for myself how the business is doing as a growth investment, looking at:

    • What does the growth story look like for the business?
    • Do the unit economics allow for a path to profitability?

    What does the growth story look like for the business?

    As I’ve noted before, consumer media businesses like Snap have two options available to grow: (1) increase the number of users / amount of time spent and/or (2) better monetize users over time.

    A quick peek at Snap’s DAU (Daily Active Users) counts reveals that path (1) is troubled for them. Using Facebook as a comparable (and using the midpoint of Facebook’s quarter-end DAU counts to line up with Snap’s average DAU over a quarter) reveals not only that Snap’s DAU numbers aren’t growing much, but also that their growth outside of North America (where they should have more room to grow) isn’t doing that great either (which is especially alarming as the S-1 admits Q4 is usually seasonally high for them).

    Last 3 Quarters of DAU growth, by region

    A quick look at the data also reveals why Facebook prioritizes Android development and low-bandwidth-friendly experiences: international remains an area of rapid growth, which is especially astonishing considering that over 1 billion Facebook users are from outside of North America. This contrasts with Snap which, in addition to needing a huge amount of bandwidth (as a photo- and video-intensive platform), also (as they admitted in their S-1) de-emphasizes Android development. Couple that with Snap’s core demographic (read: older people can’t figure out how to use the app), and it’s hard to see where quick short-term user growth will come from.

    As a result, Snap’s growth in the near term will have to be driven more by path (2). Here, there is a lot more good news. Snap’s quarterly revenue per user more than doubled over the last 3 quarters to $1.029/DAU. While that’s a long way off from Facebook’s whopping $7.323/DAU (and over $25 if you’re just looking at North American users), it suggests that there is plenty of opportunity for Snap to increase monetization, especially overseas, where it currently monetizes only about 1/10 as effectively as it does in North America (compared to Facebook, which monetizes international users at 1/5 to 1/6 of its North American rate, depending on the quarter).

    2016 and 2015 Q2-Q4 Quarterly Revenue per DAU, by region
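
    As a quick sketch of how these per-DAU figures work (quarterly revenue divided by average DAU, plus the international-to-North-America monetization ratio), here is some illustrative Python; the inputs are placeholders roughly consistent with the ~1/10 ratio described above, not figures from the S-1.

        # quarterly revenue per daily active user
        def revenue_per_dau(quarterly_revenue, avg_dau):
            return quarterly_revenue / avg_dau

        na = revenue_per_dau(150e6, 68e6)    # hypothetical North America quarter
        intl = revenue_per_dau(20e6, 90e6)   # hypothetical international quarter
        print(f"NA: ${na:.3f}/DAU, Intl: ${intl:.3f}/DAU, ratio: {intl / na:.2f}")   # ratio ~0.10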

    Considering that Snap has just started building its advertising business, has already convinced major advertisers to build custom content that isn’t readily reusable on other platforms, and has low revenue per user compared even to Facebook’s overseas numbers, I think it’s a relatively safe bet that there is a lot of potential for the number to go up.

    Do the unit economics allow for a path to profitability?

    While most folks have been (rightfully) stunned by the (staggering) amount of money Snap lost in 2016, to me the more pertinent question (considering the over $1 billion Snap still has in its coffers to weather losses) is whether or not there is a path to sustainable unit economics. Or, put more simply, can Snap grow its way out of unprofitability?

    Because neither Facebook nor Snap provide regional breakdowns of their cost structure, I’ve focused on global unit economics, summarized below:

    2016 and 2015 Q2-Q4 Quarterly Financials per DAU

    What’s astonishing here is that neither Snap nor Facebook seems to be gaining much from scale. Not only are their costs of sales per user (the cost of hosting and advertising infrastructure) increasing each quarter, but their operating expenses per user (what they spend on R&D, sales & marketing, and overhead — so not directly tied to any particular user or dollar of revenue) don’t seem to be shrinking either. In fact, Facebook’s are over twice as large as Snap’s — suggesting that it’s not just a simple question of Snap growing a bit further to begin to experience returns to scale here.

    What makes the Facebook economic machine go, though, is that despite the increase in costs per user, their revenue per user grows even faster. The result is that profit per user is growing quarter over quarter! In fact, on a per-user basis, Q4 2016 operating profit exceeded Q2 2015 gross profit (revenue less cost of sales, so not counting operating expenses)! No wonder Facebook’s stock price has been on a tear!

    While Snap has also been growing its revenue per user faster than its cost of sales (turning a gross profit per user in Q4 2016 for the first time), the overall trendlines aren’t great, as illustrated by the fact that its operating profit per user has gotten steadily worse over the last 3 quarters. The rapid growth in Snap’s costs per user and the fact that Facebook’s costs are larger and still growing suggests that there are no simple scale-based reasons that Snap will achieve profitability on a per user basis. As a result, the only path for Snap to achieve sustainability on unit economics will be to pursue huge growth in user monetization.

    Tying it Together

    The case for Snap as a good investment really boils down to how quickly and to what extent one believes the company can increase its monetization per user. While the potential is certainly there (as the rapid growth in revenue per user shows), what’s less clear is whether the company has the technology or the talent (none of the key executives named in the S-1 has the kind of background in building advertising infrastructure and ecosystems that helped Google, Facebook, and even Twitter dominate online advertising) to do it quickly enough to justify the rumored $25 billion valuation they are striving for (a whopping 38x sales multiple using 2016 Q4 revenue as a run-rate [which the S-1 admits is a seasonally high quarter]).
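
    To unpack the multiple math: the ~38x figure treats Q4 2016 revenue as a run-rate (multiplying the quarter by four) and divides the rumored valuation by it. A quick sketch, with the Q4 revenue input being a placeholder chosen to be consistent with the ~38x figure rather than a number taken from the S-1:

        # valuation / (Q4 revenue annualized as a run-rate)
        def run_rate_sales_multiple(valuation, q4_revenue):
            return valuation / (q4_revenue * 4)

        print(f"{run_rate_sales_multiple(25e9, 165e6):.0f}x")   # ~38x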

    What is striking to me, though, is that Snap would even attempt an IPO at this stage. In my mind, Snap has a very real shot at being a great digital media company of the same importance as Google and Facebook, and, while I can appreciate Wall Street’s hunger to invest in a high-growth consumer tech company, not having much visibility or certainty around unit economics, and having only barely begun monetization (with the first quarter in which revenue exceeded cost of sales being a holiday quarter), poses challenges for a management team that will need to manage public market expectations around forecasts and capitalization.

    In any event, I’ll be looking forward to digging in more when Snap reveals future figures around monetization and advertising strategy — and, to be honest, Facebook’s numbers going forward now that I have a better appreciation for their impressive economic model.

    Thought this was interesting or helpful? Check out some of my other pieces on investing / finance.

  • Dr. Machine Learning

    How to realize the promise of applying machine learning to healthcare

    Not going to happen anytime soon, sadly: the Doctor from Star Trek: Voyager; Source: TrekCore

    Despite the hype, it’ll likely be quite some time before human physicians will be replaced with machines (sorry, Star Trek: Voyager fans).

    While “smart” technologies like IBM’s Watson and Alphabet’s AlphaGo can solve incredibly complex problems, they are probably not quite ready to handle the messiness of qualitative, unstructured information from patients and caretakers (“it kind of hurts sometimes”) who sometimes lie (“I swear I’m still a virgin!”), withhold information (“what does me smoking pot have to do with this?”), or have their own agendas and concerns (“I just need some painkillers and this will all go away”).

    Instead, machine learning startups and entrepreneurs interested in medicine should focus on areas where they can augment the efforts of physicians rather than replace them.

    One great example of this is in diagnostic interpretation. Today, doctors manually process countless X-rays, pathology slides, drug adherence records, and other feeds of data (EKGs, blood chemistries, etc) to find clues as to what ails their patients. What gets me excited is that these tasks are exactly the type of well-defined “pattern recognition” problems that are tractable for an AI / machine learning approach.

    If done right, software can not only handle basic diagnostic tasks but also dramatically improve accuracy and speed. This would let healthcare systems see more patients, make more money, improve the quality of care, and let medical professionals focus on managing other, messier data and on treating patients.

    As an investor, I’m very excited about the new businesses that can be built here and put together the following “wish list” of what companies setting out to apply machine learning to healthcare should strive for:

    • Excellent training data and data pipeline: Having access to large, well-annotated datasets today, and the infrastructure and processes in place to build and annotate larger datasets tomorrow, is probably the main defining competitive advantage here. While it’s tempting for startups to cut corners on this, that would be short-sighted, as the long-term success of any machine learning company ultimately depends on this being a core competency.
    • Low (ideally zero) clinical tradeoffs: Medical professionals tend to be very skeptical of new technologies. While it’s possible to have great product-market fit with a technology that is much better on just one dimension, in practice, to get over the innate skepticism of the field, the best companies will be able to show great data that makes few clinical compromises (if any). For a diagnostic company, that means having better sensitivity and specificity at the same stage in disease progression (ideally prospectively and not just retrospectively).
    • Not a pure black box: AI-based approaches too often work like a black box: you have no idea why the model gave a certain answer. While this is perfectly acceptable when it comes to recommending a book to buy or a video to watch, it is less so in medicine, where expensive, potentially life-altering decisions are being made. The best companies will figure out how to make aspects of their algorithms more transparent to practitioners, calling out, for example, the critical features or data points that led the algorithm to make its call (see the sketch after this list for a toy illustration). This will let physicians build confidence in their ability to weigh the algorithm’s output against other, messier factors and diagnostic explanations.
    • Solve a burning need for the market as it is today: Companies don’t earn the right to change or disrupt anything until they’ve established a foothold into an existing market. This can be extremely frustrating, especially in medicine, given how conservative the field is and the drive in many entrepreneurs to shake up a healthcare system that has many flaws. But the practical reality is that all the participants in the system (payers, physicians, administrators, etc.) are too busy with their own issues (e.g., patient care, finding a way to get everything paid for) to just embrace a new technology, no matter how awesome it is. To succeed, machine diagnostic technologies should start, not by upending everything with a radical solution, but by solving a clear pain point (one that hopefully has a lot of big dollar signs attached to it!) for a clear customer.
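
    To make the “augment rather than replace” and “not a pure black box” ideas above more concrete, here is a minimal, purely illustrative Python sketch (nothing close to a clinical tool): a classifier flags likely-abnormal cases for physician review and reports which input features drove each call. The data is synthetic and the feature names are invented.

        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        feature_names = ["qrs_duration", "st_elevation", "heart_rate", "qt_interval", "pr_interval"]

        # synthetic stand-in for an annotated diagnostic dataset (e.g., EKG-derived features)
        X, y = make_classification(n_samples=2000, n_features=5, n_informative=3, random_state=0)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        model = LogisticRegression().fit(X_train, y_train)
        print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")

        # "not a pure black box": for one case, show which features pushed the score up or down
        case = X_test[0]
        contributions = model.coef_[0] * case   # per-feature contribution to the log-odds
        for name, contrib in sorted(zip(feature_names, contributions), key=lambda t: -abs(t[1])):
            print(f"{name:>12}: {contrib:+.2f}")
        print(f"predicted probability of abnormality: {model.predict_proba([case])[0, 1]:.2f}")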

    It’s for reasons like this that I eagerly follow the development of companies applying machine learning to healthcare, like Google’s DeepMind, Zebra Medical, and many more.

  • Why VR Could be as Big as the Smartphone Revolution

    Technology in the 1990s and early 2000s marched to the beat of an Intel-and-Microsoft-led drum.

    Source: IT Portal

    Intel would release new chips at a regular cadence: each cheaper, faster, and more energy efficient than the last. This would let Microsoft push out new, more performance-hungry software, which would, in turn, get customers to want Intel’s next, more awesome chip. Couple that virtuous cycle with the fact that millions of households were buying their first PCs and getting onto the Internet for the first time — and great opportunities were created to build businesses and products across software and hardware.

    But, over time, that cycle broke down. By the mid-2000s, Intel’s technological progress bumped into the limits of what physics would allow with regards to chip performance and cost. Complacency from its enviable market share coupled with software bloat from its Windows and Office franchises had a similar effect on Microsoft. The result was that the Intel and Microsoft drum stopped beating as they became unable to give the mass market a compelling reason to upgrade to each subsequent generation of devices.

    The result was a hollowing out of the hardware and semiconductor industries tied to the PC market that was only masked by the innovation stemming from the rise of the Internet and the dawn of a new technology cycle in the late 2000s in the form of Apple’s iPhone and its Android competitors: the smartphone.

    Source: Mashable

    A new, but eerily familiar cycle began: like clockwork, Qualcomm, Samsung, and Apple (playing the part of Intel) would devise new, more awesome chips which would feed the creation of new performance-hungry software from Google and Apple (playing the part of Microsoft) which led to demand for the next generation of hardware. Just as with the PC cycle, new and lucrative software, hardware, and service businesses flourished.

    But, just as with the PC cycle, the smartphone cycle is starting to show signs of maturity. Apple’s recent slower-than-expected growth has already been blamed on smartphone market saturation. Users are beginning to see each new generation of smartphone as a marginal improvement. There are also eerie parallels between the growing complaints over Apple software quality, even from Apple fans, and the position Microsoft was in near the end of the PC cycle.

    While it’s too early to call the end for Apple and Google, history suggests that we will eventually enter a phase with smartphones similar to what the PC industry experienced. This raises the question: what’s next? Many of the traditional answers — connected cars, the “Internet of Things”, wearables, digital TVs — have not yet proven themselves to be truly mass market, nor have they shown the virtuous technology upgrade cycle that characterized the PC and smartphone industries.

    This brings us to Virtual Reality. With VR, we have a new technology paradigm that can (potentially) appeal to the mass market (new types of games, new ways of doing work, new ways of experiencing the world, etc.). It also has a high bar for hardware performance that will benefit dramatically from advances in technology, not dissimilar from what we saw with the PC and smartphone.

    Source: Forbes

    The ultimate proof will be whether or not a compelling ecosystem of VR software and services emerges to make this technology more of a mainstream “must-have” (something that, admittedly, the high price of the first-generation Facebook/Oculus, HTC/Valve, and Microsoft products may hinder).

    As a tech enthusiast, it’s easy to get excited. Not only is VR just frickin’ cool (it is!), it’s probably the first thing since the smartphone with the mass appeal and virtuous upgrade cycle that can bring about the huge flourishing of products and companies that makes tech so dynamic to be involved with.

    Thought this was interesting? Check out some of my other pieces on the tech industry.