Entries Tagged 'Programming'

Keystone Federated Swift – Multi-region cluster, multiple federation, access same account

Welcome to the final post in the series; it has been a long time coming. If requested, I'm happy to delve deeper into any of these topics, but here I'll attempt to explain the situation, the best approach to take, and how I got a POC working, which I am calling the brittle method. It definitely isn't the best approach, but as it was done solely on the Swift side, and as I am an OpenStack Swift dev, it was the quickest and easiest for me when preparing for the presentation.

To understand how we can build a federated environment where we have access to our account no matter where we go, we first need to learn how keystone authentication works from a Swift perspective. Then we can look at how to solve the problem.

Swift’s Keystoneauth middleware

As mentioned in earlier posts, there isn't any magic in the way Swift authentication works. Swift is an end-to-end storage solution, so authentication is handled via authentication middlewares. Further, a single Swift cluster can talk to multiple auth backends, which is where the `reseller_prefix` comes into play. This was the first approach I blogged about in this series.

 

There is nothing magical about how authentication works; keystoneauth has its own idiosyncrasies, but in general it simply makes a decision about whether a request should be allowed. That makes writing your own simple, and maybe an easy way around the problem, i.e. write an auth middleware that authenticates directly against your existing company LDAP server or authentication system.
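For illustration, here is a bare-bones sketch of the general shape such a middleware could take. The class, config handling and token check are all hypothetical stand-ins (the real check would talk to LDAP or whatever you use); the 'swift.authorize' hook is the mechanism Swift's own auth middlewares use to hand an authorisation callback to the rest of the pipeline:

from swift.common.swob import HTTPUnauthorized


class CompanyAuth(object):
    """Toy auth middleware: validate the token however you like."""

    def __init__(self, app, conf):
        self.app = app
        self.conf = conf

    def __call__(self, env, start_response):
        token = env.get('HTTP_X_AUTH_TOKEN')
        if not self.valid_token(token):
            return HTTPUnauthorized()(env, start_response)
        # Hand the proxy a callback it can use to authorize each request.
        env['swift.authorize'] = self.authorize
        return self.app(env, start_response)

    def valid_token(self, token):
        # Stand-in for a real lookup against your company auth system.
        return token is not None

    def authorize(self, req):
        # Return None to allow the request, or a swob response to deny it.
        return None


def filter_factory(global_conf, **local_conf):
    def auth_filter(app):
        return CompanyAuth(app, local_conf)
    return auth_filter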

 

To set up keystone authentication, you use keystone's authtoken middleware, and directly after it in the pipeline you place Swift's keystoneauth middleware, configuring each of them in the proxy configuration:

pipeline = ... authtoken keystoneauth ... proxy-server

The authtoken middleware

Generally every request to Swift will include a token, unless it's using tempurl, container-sync, or is to a container that has global read enabled, but you get the point.

As the swift-proxy is a python wsgi app, the request first hits the first middleware in the pipeline (left most) and works its way to the right. When it hits the authtoken middleware, the token in the request will be sent to keystone to be authenticated.

The resulting metadata, i.e. the user, storage_url, groups, roles etc., is dumped into the request environment and then passed to the next middleware: the keystoneauth middleware.

The keystoneauth middleware

The keystoneauth middleware checks the request environment for the metadata dumped by the authtoken middleware and makes a decision based on that. Things like:

  • If the user has one of the reseller_admin roles, then they have access.
  • If the user isn't a swift user of the account/project the request is for, is there an ACL that will allow it?
  • If the user has a role that identifies them as a swift user/operator of this Swift account, then great.

 

When checking to see if the user has access to the given account (Swift account), it needs to know what account the request is for. This is easily determined, as it's defined by the path of the URL you're hitting. The URL you send to the Swift proxy is what we call the storage URL, and it's in the form of:

http(s)://<url of proxy or proxy vip>/v1/<account>/<container>/<object>

The container and object elements are optional, as it depends on what you're trying to do in Swift. When the keystoneauth middleware is authenticating, it'll check that the project_id (or tenant_id) metadata dumped by authtoken, when concatenated with the reseller_prefix, matches the account in the given storage URL. For example, let's say the following metadata was dumped by authtoken:

{
  "X_PROJECT_ID": "abcdefg12345678",
  "X_ROLES": "swiftoperator",
  ...
}

If the reseller_prefix for keystoneauth was AUTH_, and we make any member of the swiftoperator role (in keystone) a swift operator (a swift user on the account), then keystoneauth would allow access if the account in the storage URL matched AUTH_abcdefg12345678.
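Stripped of the edge cases, the core of that decision is just string concatenation and comparison. Here's a rough sketch of the idea (not the real keystoneauth code; the prefix and role values are just the examples above):

def keystone_account_allowed(account, project_id, roles,
                             reseller_prefix='AUTH_',
                             operator_roles=('swiftoperator',)):
    # The account in the storage URL must equal reseller_prefix + project_id
    # and the user must hold one of the operator roles.
    is_operator = any(role in operator_roles for role in roles)
    return is_operator and account == reseller_prefix + project_id

# keystone_account_allowed('AUTH_abcdefg12345678', 'abcdefg12345678',
#                          ['swiftoperator'])  -> True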

 

When you authenticate to keystone, the object storage endpoint will point not only to the Swift endpoint (the swift proxy or swift proxy load balancer), but will also include your account, based on your project_id. More on this soon.

 

Does that make sense? Simply put, to use keystoneauth in a multi-federated environment we just need to make sure that no matter which keystone we end up using, asking it for the swift endpoint always returns the same Swift account name.

And therein lies our problem: the keystone object storage endpoint and the metadata authtoken dumps both use the project_id/tenant_id, and this isn't something that is synced or can be passed via federation metadata.

NOTE: This also means that you'd need to use the same reseller_prefix on all keystones in every federated environment, otherwise the accounts won't match.

 

Keystone Endpoint and Federation Side

When you add an object storage endpoint in keystone, for swift, the url looks something like:

http://swiftproxy:8080/v1/AUTH_$(tenant_id)s

 

Notice the $(tenant_id)s at the end? This is a placeholder that keystone internally will replace with the tenant_id of the project you authenticated as. $(project_id)s can also be used and maps to the same thing. And this is our problem.

When setting up federation between keystones (assuming keystone to keystone federation) you generate a mapping. This mapping can include the project name, but not the project_id. These ids are auto-generated, not deterministic by name, so creating the same project on different federated keystone servers will result in different project_ids. When a keystone service provider (SP) federates with a keystone identity provider (IdP), the mapping they share tells the SP how to map federated users locally. This includes creating a shadow project, if a project doesn't already exist, for the federated user to be part of.

Because there is no way to sync project_ids in the mapping, the SP will create the project with its own unique project_id. This means that when the federated user has authenticated, their Swift storage endpoint from keystone will be different; in essence, as far as Swift is concerned, they will have access, but to a completely different Swift account. Let's use an example: say there is a project on the IdP called ProjectA.

           project_name        project_id
  IdP      ProjectA            75294565521b4d4e8dc7ce77a25fa14b
  SP       ProjectA            cb0d5805d72a4f2a89ff260b15629799

Here we have a ProjectA on both the IdP and the SP. The one on the SP would be considered a shadow project to map the federated user to. However, the project_ids are different, because they are uniquely generated when the project is created in each keystone environment. Taking the object storage endpoint in keystone from our earlier example, we get:

 

          Object Storage Endpoint
  IdP     http://swiftproxy:8080/v1/AUTH_75294565521b4d4e8dc7ce77a25fa14b
  SP      http://swiftproxy:8080/v1/AUTH_cb0d5805d72a4f2a89ff260b15629799

So when talking to Swift you'll be accessing different accounts, AUTH_75294565521b4d4e8dc7ce77a25fa14b and AUTH_cb0d5805d72a4f2a89ff260b15629799 respectively. This means objects you write in one federated environment will be placed in a completely different account, so you won't be able to access them from elsewhere.

 

Interesting ways to approach the problem

Like I stated earlier the solution would simply be to always be able to return the same storage URL no matter which federated environment you authenticate to. But how?

  1. Make sure the same project_id/tenant_id is used for _every_ project with the same name, or at least the same name in the domains that the federation mapping maps to. This means direct DB hacking, so not a good solution; we should solve this in code, not make ops go hack databases.
  2. Have a unique id for projects/tenants that can be synced in federation mapping, also make this available in the keystone endpoint template mapping, so there is a consistent Swift account to use. Hey we already have project_id which meets all the criteria except mapping, so that would be easiest and best.
  3. Use something that _can_ be synced in a federation mapping. Like domain and project name. Except these don’t map to endpoint template mappings. But with a bit of hacking that should be fine.

Of the above approaches, 2 would be the best. 3 is good, except that if you pick something mutable like the project name and you ever change it, you'd now authenticate to a completely different swift account, meaning you'd have just lost access to all your old objects! You may also find yourself with grumpy Swift ops who now need to do a potentially large data migration, or you'd be forced to never change your project name.

Option 2, being unique, won't change, though it doesn't make for a very memorable name if you're using the project id. Maybe you could offer people a more memorable immutable project property to use, but to keep the change simple, being able to simply sync the project_id should get us everything we need.

 

When I was playing with this it was for a presentation, so I had a very strict time limit. Being a Swift developer and knowing the Swift code base, I hacked together a variant on option 3 that didn't involve hacking keystone at all. Why? Because I needed a POC and didn't want to spend most of my time figuring out the inner workings of keystone when I could just do a few hacks to have a complete Swift-only version. And it worked. Though I wouldn't recommend it: option 3 is very brittle.

 

The brittle method – Swift only side – Option 3b

Because I didn’t have time to simply hack keystone, I took a different approach. The basic idea was to let authtoken authenticate and then finish building the storage URL on the swift side using the meta-data authtoken dumps into wsgi request env. Thereby modifying the way keystoneauth authenticates slightly.

Step 1 – Give the keystoneauth middleware the ability to complete the storage url

By default we assume the incoming request will point to a complete account, meaning the object storage endpoint in keystone will end with something like:

'<uri>/v1/AUTH_%(tenant_id)s'

So let's enhance keystoneauth with the ability, when given only the reseller_prefix, to complete the account itself. To do this I added a use_dynamic_reseller option.

If you enable use_dynamic_reseller, the keystoneauth middleware will pull the project_id from the metadata authtoken dumped into the wsgi environment. This allows a simplified keystone endpoint in the form:

'<uri>/v1/AUTH_'

This shortcut makes configuration easier, but it can only be reliably used when you're acting on your own account and providing a token. API elements like tempurl and public containers need the full account in the path.

This still uses project_id, so it doesn't solve our problem, but it meant I could get rid of the $(tenant_id)s from the endpoints. Here is the commit in my github fork.
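Conceptually the change is tiny; here's a rough sketch of the idea (not the actual patch):

def complete_account(path_account, project_id, reseller_prefix='AUTH_'):
    # project_id is the metadata authtoken dumped into the wsgi environment
    # (X_PROJECT_ID in the earlier example).
    if path_account == reseller_prefix:
        # The endpoint only carried the prefix, so finish the account off.
        return reseller_prefix + project_id
    return path_account  # already a full account, leave it alone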

Step 2 – Extend the dynamic reseller to include completing storage url with names

Next, we extend the keystoneauth middleware a little bit more. Give it another option, use_dynamic_reseller_name, to complete the account with either project_name, or domain_name and project_name, but only if you're using keystone authentication version 3.

If you are, and want to have an account based on the name of the project, then you can enable use_dynamic_reseller_name in conjunction with use_dynamic_reseller to do so. The form used for the account would be:

<reseller_prefix><project_domain_name>_<project_name>

So using our previous example with a reseller_prefix of AUTH_, a project_domain_name of Domain and a project name of ProjectA, this would generate the account:

AUTH_Domain_ProjectA

This patch is also in my github fork.
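Again, just a sketch of the idea rather than the patch itself (the argument names are assumptions about the metadata a v3 authtoken provides):

def complete_account_by_name(project_domain_name, project_name,
                             reseller_prefix='AUTH_'):
    # Build the account from (mostly) human-friendly names rather than ids.
    return '%s%s_%s' % (reseller_prefix, project_domain_name, project_name)

# complete_account_by_name('Domain', 'ProjectA') -> 'AUTH_Domain_ProjectA'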

Does this work? Yes! But as I've already mentioned, it is _very_ brittle, and it also makes it confusing to know when you need to provide only the reseller_prefix and when you need your full account name. It would be so much easier to just extend keystone to sync and create shadow projects with the same project_id; then everything would just work without hacking.

Monasca + Swift: Sending all your Swift metrics Monasca’s way

Last week was SUSE Hackweek, a week every employee is given to have fun hacking on something or learning something they find interesting. It's an awesome annual event that SUSE runs. It's my second, and I love it.

While snowed in at the Dublin PTG a while ago I chatted with Johannes, a monasca dev and very intelligent teammate at SUSE, and heard that Monasca has a statsd endpoint, as part of the monasca agent, that you can fire stats at. As a Swift developer this interests me greatly: every Swift daemon dumps a plethora of statsd metrics. So can I put the two together? Can I simply install monasca-agent on each storage and proxy node and then point the statsd endpoints for all the swift services at it locally?

 

I started the week attempting to do just that. Because I'm new to monasca and didn't want to go and attempt to set it up myself, I just ran a devstack + SAIO environment.

The devstack was just a simple monasca + keystone + horizon configuration and the SAIO was a standard Swift All In One.

 

Next I installed the monasca-agent on the SAIO and then updated Swift to point at it. In Swift, each daemon config supports statsd server endpoint configuration options:

 

# You can enable StatsD logging here:
# log_statsd_host =
# log_statsd_port = 8125
# log_statsd_default_sample_rate = 1.0
# log_statsd_sample_rate_factor = 1.0
# log_statsd_metric_prefix =
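Before pointing Swift at it, a quick way to sanity check that the agent's statsd endpoint is actually listening is to fire a hand-rolled statsd packet at it yourself; a tiny sketch (localhost and the default port 8125 are assumptions):

import socket

# statsd is plain UDP: "<metric name>:<value>|<type>", here a counter.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(b'swift.hackweek.test:1|c', ('127.0.0.1', 8125))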

 

So pointing swift at it is easy. I then uploaded a few objects to swift and bingo, inside Monasca's influxdb instance I can see the Swift measurements.

 

account-auditor.passes
account-auditor.timing
account-replicator.attempts
account-replicator.no_changes
account-replicator.successes
account-replicator.timing
account-server.GET.timing
account-server.HEAD.timing
account-server.PUT.timing
account-server.REPLICATE.timing
container-auditor.passes
container-auditor.timing
container-replicator.attempts
container-replicator.no_changes
container-replicator.successes
container-replicator.timing
container-server.GET.timing
container-server.PUT.timing
container-server.REPLICATE.timing
container-updater.no_changes
container-updater.successes
container-updater.timing
monasca.collection_time_sec
monasca.thread_count
object-auditor.timing
object-replicator.partition.update.count.sdb1
object-replicator.partition.update.count.sdb2
object-replicator.partition.update.count.sdb3
object-replicator.partition.update.count.sdb4
object-replicator.partition.update.timing
object-replicator.suffix.hashes
object-server.HEAD.timing
object-server.PUT.sdb1.timing
object-server.PUT.sdb2.timing
object-server.PUT.sdb3.timing
object-server.PUT.sdb4.timing
object-server.PUT.timing
object-server.REPLICATE.timing
object-updater.timing
proxy-server.account.GET.200.first-byte.timing
proxy-server.account.GET.200.timing
proxy-server.account.GET.200.xfer
proxy-server.object.HEAD.404.timing
proxy-server.object.HEAD.404.xfer
proxy-server.object.PUT.201.timing
proxy-server.object.PUT.201.xfer
proxy-server.object.policy.1.HEAD.404.timing
proxy-server.object.policy.1.HEAD.404.xfer
proxy-server.object.policy.1.PUT.201.timing
proxy-server.object.policy.1.PUT.201.xfer

 

NOTE: This isn't the complete list, as the measurements are added when new metrics are fired, and the SAIO is a small healthy swift cluster, so there aren't many 500-series errors etc. But it works!

 

And better yet I have access to them in grafana via the monasca datasource!

 

swift_recon check plugin

I thought that was easy, but Swift actually provides more metrics than just that. Swift has a reconnaissance API (recon) on all the wsgi servers (account, container and object servers) that you can hit either via REST or the swift-recon tool. So next I wondered how hard it would be to write a swift_recon check plugin for Monasca.
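Recon is just JSON over HTTP on each backend server, so poking at it from Python is trivial. For example, something along these lines (the port is whatever your object server listens on, 6010 on my SAIO, so treat the details as assumptions):

import requests

# Ask a storage node's recon API for disk usage and async pendings.
disks = requests.get('http://127.0.0.1:6010/recon/diskusage').json()
for disk in disks:
    print(disk['device'], 'mounted' if disk['mounted'] else 'unmounted')

asyncs = requests.get('http://127.0.0.1:6010/recon/async').json()
print('async pendings:', asyncs.get('async_pending'))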

Some of the recon metrics you can get aren’t really grafana friendly. But some would be awesome to have in the same place and closer to horizon where ops are looking.

 

So I went and wrote one. Like I said I couldn’t get all the metrics, but I got most:

 

swift_recon.account.account_auditor_pass_completed
swift_recon.account.account_audits_failed
swift_recon.account.account_audits_passed
swift_recon.account.account_audits_since
swift_recon.account.attempted
swift_recon.account.failure
swift_recon.account.replication_last
swift_recon.account.replication_time
swift_recon.account.success
swift_recon.container.attempted
swift_recon.container.container_auditor_pass_completed
swift_recon.container.container_audits_failed
swift_recon.container.container_audits_passed
swift_recon.container.container_audits_since
swift_recon.container.container_updater_sweep
swift_recon.container.failure
swift_recon.container.replication_last
swift_recon.container.replication_time
swift_recon.container.success
swift_recon.disk_usage.mounted
swift_recon.object.async_pending
swift_recon.object.attempted
swift_recon.object.auditor.object_auditor_stats_ALL.audit_time
swift_recon.object.auditor.object_auditor_stats_ALL.bytes_processed
swift_recon.object.auditor.object_auditor_stats_ALL.errors
swift_recon.object.auditor.object_auditor_stats_ALL.passes
swift_recon.object.auditor.object_auditor_stats_ALL.quarantined
swift_recon.object.auditor.object_auditor_stats_ALL.start_time
swift_recon.object.auditor.object_auditor_stats_ZBF.audit_time
swift_recon.object.auditor.object_auditor_stats_ZBF.bytes_processed
swift_recon.object.auditor.object_auditor_stats_ZBF.errors
swift_recon.object.auditor.object_auditor_stats_ZBF.passes
swift_recon.object.auditor.object_auditor_stats_ZBF.quarantined
swift_recon.object.auditor.object_auditor_stats_ZBF.start_time
swift_recon.object.expirer.expired_last_pass
swift_recon.object.expirer.object_expiration_pass
swift_recon.object.failure
swift_recon.object.object_updater_sweep
swift_recon.object.replication_last
swift_recon.object.replication_time
swift_recon.object.success
swift_recon.quarantined
swift_recon.unmounted

 

Some of the metric names might need tidying up, but so far, so good. One of the really interesting things Swift ops usually want to keep an eye on is when all the replicators have completed a cycle. Why? Well, one example: while ring rebalancing on a large and busy cluster you want to avoid too much data movement, so when adding new drives you raise their weights slowly. But you also want to make sure a full replication cycle has completed before you rebalance again. So knowing when you pushed a new ring out, plus the timestamps of the last replication runs, tells you when it's safe. These are coming through nicely.
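As a rough sketch of that "is it safe to rebalance yet?" check, assuming the recon replication endpoint and the made-up node list below, something like this is all it takes:

import os
import requests

# When was the current object ring pushed out?
ring_pushed = os.path.getmtime('/etc/swift/object.ring.gz')

# Has every storage node finished a replication pass since then?
for node in ('storage1:6000', 'storage2:6000'):
    data = requests.get('http://%s/recon/replication/object' % node).json()
    done = data.get('replication_last', 0) > ring_pushed
    print(node, 'cycle complete since ring push' if done else 'still going')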

 

 

Unfortunately there are some metrics I can't quite get through yet. You can use recon to get md5s of the rings and configs on each node, but I found the md5s can't be pushed through. You can also ask recon what version of swift is installed on each node (nice in a large deployment and when upgrading), but the version number also had issues. Both of these are probably not insurmountable, but I still need to figure out how.

 

swift_handoffs check plugin

I've been involved in the Swift community for quite a while now, and I'd heard of another awesome metric one of the SwiftStack cores came up with to give a great visualisation of the Swift cluster. He even provided a gist to the community that others could use and adapt. I thought, why not make sure everyone can use it: let's add it as another check plugin to the monasca agent.

 

Everything in Swift is treated as an object, and an object has a number of devices in the cluster that are considered its primaries (they store that object). When a drive gets full, or there is too much load on say an object PUT and a primary is unavailable to meet the durability contract, another node will store the object (this node is called a handoff for that object). The handoff node will push the handoff object to the primary as soon as it can (when the drive is replaced, or comes back online, etc.).

Further, a ring in Swift is divided into logical segments called partitions, and it's these partitions that devices are responsible for storing (or think of it as: a device has to store all objects that belong to its partitions). When we rebalance the ring, either by adding or removing drives or changing weights, these partitions shift around the cluster, either to, say, drain a drive or to move to where there is more space. Swift is really good at keeping this movement to a minimum. So after a rebalance, nodes that used to be primaries for some partitions won't be anymore; they'll suddenly be handoffs, and the back-end consistency engine will move them to their new home.

So what’s interesting to note there is, it all involves handoff partitions.

 

Turns out just watching the number of primary partitions vs the number of handoffs on each storage node gives you a great health indicator. When should I do a rebalance? When the handoff numbers are down. Is there a build-up of handoffs in a region? Maybe write affinity and the WAN links are saturated, or there is some network/disk/server issue on one of the nodes around there, etc.
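The check itself is conceptually simple too; here is a rough sketch of the idea (not the actual plugin; the device, IP and on-disk paths are SAIO-ish assumptions):

import os
from swift.common.ring import Ring

ring = Ring('/etc/swift', ring_name='object')
my_ip, device = '127.0.0.1', 'sdb1'
objects_dir = '/srv/node/%s/objects' % device

primary = handoffs = 0
for part in os.listdir(objects_dir):
    if not part.isdigit():
        continue  # skip tmp dirs, hash files, etc.
    nodes = ring.get_part_nodes(int(part))
    if any(n['ip'] == my_ip and n['device'] == device for n in nodes):
        primary += 1    # this device is a primary for the partition
    else:
        handoffs += 1   # the partition is only here as a handoff
print('primary: %d handoffs: %d' % (primary, handoffs))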

Here are the metrics:

 

swift_handoffs.handoffs
swift_handoffs.primary

 

And here is a simplified snapshot. This is my SAIO with 4 simulated nodes. This is watching the storage nodes as a whole, but you can break it down per drive. There is a graph for each node and each Swift ring. The rise in handoffs (Object – Policy 0, SAIO[1-3]) is due to me turning off the consistency engine and then changing the weights back to a nicely weighted cluster:

See Object - Policy 0: SAIO0's weight has increased, so the other nodes now have handoff partitions to give it. If I now went and turned the consistency engine back on, you'd see more primary partitions on SAIO0.

 

Where's the code

UPDATE: I’ve now pushed up the checks to monasca. They can be found here:

  • https://review.openstack.org/#/c/583876/
  • https://review.openstack.org/#/c/585067/

pudb debugging tips

As an OpenStack Swift dev I obviously write a lot of Python. Further, Swift is a cluster, so it has a bunch of moving pieces, which makes debugging very important. Most of the time I use pudb and then jump into the PyCharm debugger if I get really stuck.

Pudb is a curses-based version of pdb; I find it pretty awesome, and you can use it while ssh'd somewhere. So I thought I'd write up the tips I use, mainly so I don't forget 🙂

The first and easiest way to run pudb is to use pudb as the python runner, i.e.:

pudb <python script>

On first run, it’ll start with the preferences window up. If you want to change preferences you can just hit ‘<ctrl>+p’. However you don’t need to remember that, as hitting ‘?’ will give you a nice help screen.

I prefer to see line numbers, I like the dark vim theme, and best of all, I prefer my interactive python shell to be ipython.

While you're debugging, like in pdb, there are some simple commands:

  • n – step over (“next”)
  • s – step into
  • c – continue
  • r/f – finish current function
  • t – run to cursor
  • o – show console/output screen
  • b – toggle breakpoint
  • m – open module
  • ! – Jump into interactive shell (most useful)
  • / – text search

There are obviously more than that, but they are the ones I mostly use. Open module is great if you need to set a breakpoint somewhere deeper in the code base: you can open the module, set a breakpoint and then happily press 'c' to continue until it hits. The '!' is the most useful; it'll jump you into an interactive python shell at the exact point the debugger is at, so you can jump around, check/change settings and poke in areas to see what's happening.

As with pdb, you can also use code to insert a breakpoint so pudb will be triggered, rather than having to start a script with pudb. I give an example of how in the nosetests section below.

nosetests + pudb

Sometimes the best way to use pudb is to debug unit tests, or even to write a unit (or functional or probe) test to get you into an area you want to test. You can use pudb to debug these too, and there are 2 ways to do it.

The first way is by installing the ‘nose-pudb’ pip package:

pip install nose-pudb

Now when you run nosetests you can add the --pudb option and it'll break into pudb if there is an error, so you can go poke around in 'post-mortem' mode. This is really useful, but doesn't allow you to actually trace the tests as they run.

The other way of using pudb in nosetests is to actually insert some code in the test that will trigger as a breakpoint and start up pudb. You do this exactly as you would with pdb, except you substitute pudb. Just add the following line of code to your test where you want to drop into pudb:

import pudb; pudb.set_trace()

And that's it... well, mostly. Because pudb is a command line interface, you need to tell nosetests not to capture stdout, with the '-s' flag:

nosetests -s test/unit/common/middleware/test_cname_lookup.py

testr + pudb

No problem here; it uses the same approach as above, where you programmatically set a trace as you would for pdb. Just follow the 'Debugging (pdb) Tests' section on this page (except substitute pudb for pdb).

 

Update – run_until_failure.sh

I've been trying to track down some intermittent unit test failures recently, so I whipped up a quick bash script that I run in a tmux session which really helps find and deal with them. I thought I'd add it to this post, as I can then combine it with nose-pudb to make it pretty useful.

#!/bin/bash

n=0
while true
do
  clear
  "$@"
  if [ $? -gt 0 ]
  then
    echo 'ERROR'
    echo "number " $n
    break
  fi
  let "n=n+1"
  sleep 1
done

With this I can simply:
run_until_failure.sh tox -epy27

 

It'll stop looping once the command passed to it returns something other than 0.

Once I have an error I then focus in on the area where it happens (to speed up the search a bit). I can also use nose-pudb to drop me into post-mortem mode so I can poke around in ipython; for example, I'm currently running:

 

run_until_failure.sh nosetests --pudb test/unit/proxy/test_server.py

 

Then I can come back to the tmux session and, if I've been dropped into a pudb interface, go poke around.

Swift Container sharding – locked db POC – Benchmarking observations

The latest POC is at the benchmarking stage, and for the most part it's going well. I have set up 2 clusters in the cloud, not huge, but 2 proxies and 4 storage nodes each. A benchmarking run involves pointing an ssbench master at each cluster and putting each cluster under load. In both cases we only use 1 container, and on one cluster this container will have sharding turned on.

So far it's looking pretty good. I've done many runs, and usually find a bug at scale... but recently I've done two runs of the latest revision, alternating which cluster is the sharded one (the cluster whose benchmark container has sharding turned on). Below is the grafana statsd output of the second run. Note that cluster 2 is the sharded cluster in this run:

[Grafana screenshot: 2016-12-22 09:28, cluster 2, run 2]

Looking at the picture there are a few observations we can make: the peaks in 'Container PUT Latency – Cluster 2' correspond to when a container is sharded (in this case, the one container and then its shards sharding).

As I mentioned earlier ssbench is running the benchmark and the benchmark is very write (PUT) heavy. Here is the sharding scenario file:

{
  "name": "Sharding scenario",
  "sizes": [{
    "name": "zero",
    "size_min": 0,
    "size_max": 0
  }],
  "initial_files": {
    "zero": 100
  },
  "run_seconds": 86400,
  "crud_profile": [200, 50, 0, 5],
  "user_count": 2,
  "container_base": "shardme",
  "container_count": 1,
  "container_concurrency": 1,
  "container_put_headers": {
  "X-Container-Sharding": "on"
  }
}

The only difference between this and the non-sharding one is not setting the X-Container-Sharding meta on the initial container PUT. The crud profile shows that we are heavy on PUTs and GETs. Because jobs are randomised, I don't expect exactly the same numbers when it comes to object count on the servers, however there is a rather large discrepancy between the object counts on the two clusters:

Cluster 1:

HTTP/1.1 204 No Content
Content-Length: 0
X-Container-Object-Count: 11291190
Accept-Ranges: bytes
X-Storage-Policy: gold
X-Container-Bytes-Used: 0
X-Timestamp: 1482290574.52856
Content-Type: text/plain; charset=utf-8
X-Trans-Id: tx9dd499df28304b2d920aa-00585b2d3e
Date: Thu, 22 Dec 2016 01:32:46 GMT

Cluster 2:

Content-Length: 0
X-Container-Object-Count: 6909895
X-Container-Sharding: True
X-Storage-Policy: gold
X-Container-Bytes-Used: 0
X-Timestamp: 1482290575.94012
Content-Type: text/plain; charset=utf-8
Accept-Ranges: bytes
X-Trans-Id: txba7b23743e0d45a68edb8-00585b2d61
Date: Thu, 22 Dec 2016 01:33:27 GMT

So cluster 1 has about 11 million objects and cluster 2 about 7 million. That's quite a difference, which gets me wondering what's causing such a large difference in PUT throughput.

The only real difference in the proxy object PUT when comparing sharded to unsharded is finding the shard container the object server will need to update, in which case another request is made to the root container asking for the pivot (if there is one). Is this extra request really causing an issue? I do note the object-updater (last graph in the image) is also working harder, as the number of successes during the benchmark is much higher, meaning more requests are falling into async pendings.

Maybe the extra updater work is because of the extra load on the container server (this additional request)?

To test this theory, I can push the sharder harder and force container updates into the root container. This would stop the extra request... but force more traffic to the root container (which we are kind of doing anyway). We should still see benefits, as the root container would be much smaller (because it's sharded) than the non-sharded counterpart. And this will let us see whether this is what's causing the slower throughput.

Update: I'm currently running a new scenario which is all PUTs, so let's see how that fares. Will keep you posted.

Adventures in Container Sharding – SQLite performance problem and the pivot point.

Hey world, it's been a while; turns out I'm not much of a blogger. But I know how useful it is to do write-ups occasionally so I can actually find them later. Having said that: in my last post I mentioned I was an OpenStack developer... and it's still true. I spend my time hacking on OpenStack Swift, the awesome open source object storage cluster.

One thing I've been trying to tackle recently is container sharding in Swift. I won't go into full details, as there is a relatively recent Swift spec, and I also gave a high-level talk on it at LCA in Geelong.

The tl;dr is that Swift accounts and containers (the metadata layer of Swift) are SQLite databases that get treated like objects themselves and replicated throughout the cluster. This works amazingly well, until you add millions and millions of objects to a container. What I'm talking about here is container-level object metadata, not the objects themselves. When this happens, SQLite, being a file, starts to have latency and locking issues, as one would expect. The solution is to shard these container databases throughout the cluster, which is what I've been working on.

At the last OpenStack summit in Austin, the awesome people at SwiftStack, whom I work quite closely with in the community, gave me a container database they generated that has 700,000,000 objects in it (metadata again). This SQLite file is about 105G, so not small. Plugging this into the small cluster I have to test my sharding implementation has been interesting, to say the least.

When sharding a container down we have a simple idea: split it in half. That is to say, find some place in the object table to pivot on. We can then keep pivoting, giving us a list of ranges (which can be treated as a binary tree). The problem is finding the pivot point. In all my testing up until now I had what I thought was the perfect and simple way:

SELECT name FROM object
 WHERE deleted = 0
 ORDER BY name
 LIMIT 1
 OFFSET (SELECT object_count / 2 FROM policy_stat);

This did amazingly well in all my tests... but I obviously never got big enough. This simple SQL statement would do plenty well if sharding in Swift were turned on from day dot, but the plan is, once it's solved in this POC, to add sharding to Swift as a beta feature that can be turned on on a container-by-container basis when you want. After it graduates from beta it would still be a switch, until finally, once we are confident in its ability, it would be on permanently. In the latter case containers would never get big enough to worry about. However, in the earlier stages a user would only turn it on when the container is already _very_ slow.

Using the pivot SQL statement on the large container I was given ground to a halt. I'm sure it would have come back eventually, but I got tired of waiting after what seemed like ages... there has to be a better way.

Turns out the OFFSET clause in SQLite, even when hitting an index, still does a scan to find the offset, which is slow when you get to a very large table size. Under the hood the resultset is stored as a doubly-linked list, and OFFSET will still scan down the results; I'm sure there are optimisations, but anyway, I was struggling to think of a good way to find a good enough middle value that didn't involve some table scanning. You can see from the SQL statement that we know how many objects we have in the container, but the problem is that because swift is eventually consistent we need to temporarily keep rows for objects that have been deleted. So randomly picking an index doesn't help, and it won't necessarily be in name order.

So on really large containers OFFSET needs to be thrown out the window. Turns out the sharding implementation can deal with shrinking the number of shards, merging smaller ranges together, not just growing/splitting. This means we don't actually need to be exact, and we don't actually need to split on an existing object, just on a name that would be somewhere in the middle; so long as it cuts down the large container, it's good enough. So what can we do?

Turns out there is an optimisation in SQLite: because an index is a doubly-linked list ordered by the indexed columns, it's really quick if all we want to do is go to the first or last element. So that's what I've done:

SELECT min(name) as name FROM object WHERE deleted = 0;
SELECT max(name) as name FROM object WHERE deleted = 0;

These two statements are blindingly fast, due to the fact that we already have a compound index on name and deleted (for cleaning up). Note however that they have to be run as 2 separate commands; combine the two into one and you lose the optimisation and have to scan all elements. Having the min and max name is a good start, and even when dealing with already-sharded containers, they are just smaller ranges, so this still works. The question is: now what?

In a perfect world we have an even distribution of objects between the min and max names, so we just need to find a middle name between the two to pivot on. Turns out that even in an unevenly distributed container we will still be shrinking it, in the worst case by only a few objects, but those will be cleaned up later (merged into a neighbour range by the implementation). And so long as the container gets smaller, eventually it'll shrink small enough to be usable.
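To make that concrete, grabbing the two bounds from a container DB is just those two statements run separately; a quick sqlite3 sketch (the path is made up):

import sqlite3

conn = sqlite3.connect('/path/to/container.db')  # hypothetical container DB
low = conn.execute(
    'SELECT min(name) FROM object WHERE deleted = 0').fetchone()[0]
high = conn.execute(
    'SELECT max(name) FROM object WHERE deleted = 0').fetchone()[0]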

The next step is finding the middle value; to do this I just wrote some python:

from itertools import izip_longest
import sys

lower = unicode(sys.argv[1])
upper = unicode(sys.argv[2])

def middle_str(str1, str2):
    result = []
    for l, u in izip_longest(map(ord, str1), map(ord, str2), fillvalue=0):
        result.append((l + u) // 2)
    return u''.join(map(unichr, result))

if __name__ == "__main__":
    print(middle_str(lower, upper))

What does it do? Calling middle_str(min, max) takes the unicode versions of the two strings, turns each character into its integer value, averages the values to find the middle, and turns the result back into a string. Because matching leading characters average to themselves, the common prefix is preserved. So:

$ python middle_str.py 'aaaaaaaaa' 'zzzzzzzzz'
mmmmmmmmm

$ python middle_str.py 'aaaaaaaaa' 'aazzzzzzz'
aammmmmmm

$ python middle_str.py 'DFasjiojsaoi' 'ZZsdkmfi084f'
OPjkjkjiQLQg

I am now plugging this into my implementation, so let's tackle this large container again.

I’m now an OpenStack developer.

Hello world,

It's been a while since I have blogged on this site; I apologise for that. My previous position was a tad proprietary, so although I worked with Linux, what I was doing needed to be sanitised before I could post about it. I have a bunch of posts in the cooker from those days still awaiting sanitisation. But I have some great news… I am now an OpenStack developer.

It's been a busy year: I got married and moved over to the UK to work for an amazing company that needs no introduction, Rackspace. Over there I was working with Linux in a support/DevOps style role, but I'm back in Oz now with a new team at Rackspace: the Rackspace Cloud Builders. In this role I'll be getting my development hat on and developing for upstream OpenStack again, and I'm so excited about it.

Watch this space!!!

Matt

Simple Squid access log reporting.

Squid is one of the biggest and most used proxies on the interwebs, and generating reports from its access logs is already a done deal; there are many commercial and OSS apps that support the squid log format. But I found myself in a situation where I wanted stats but didn't want to install a web server on my proxy, or use syslog to push my logs to a centralised server running such software, and I also wasn't in a position to go buy one of those off-the-shelf amazing wiz-bang Squid reporting and graphing tools.

As a Linux geek I surfed the web to see what others have done, and came across a list provided by the Squid website. Following a couple of links, I found an awk script called 'proxy_stats.gawk' written by Richard Huveneers.

I downloaded it and tried it out… unfortunately it didn't work. Looking at the code, which he nicely commented, showed that he had it set up for access logs from version 1.* of squid. Now, the squid access log format in squid 2.6+ hasn't changed too much from version 1.1; all they have really done is add a "content type" entry at the end of each line.

So, as a good Linux geek does, I upgraded the script. My changes include:

  • Support for squid 2.6+.
  • Removed the use of deprecated switches that are no longer supported by the sort command.
  • Now that there is an actual content type "column", use it to improve the "Object type report".
  • Added a users section, as this was an important report I required which was missing.
  • And, in a further hacked version, an auto-generated width for the first "name" column.

Now with the explanation out of the way, let me show you it!

For those who are new to awk, this is how I’ve been running it:

zcat <access log file> | awk -f proxy_stats.gawk > <report-filename>

NOTE: I've been using it for some historical analysis, so I'm running it on old rotated files, which are compressed, hence the zcat.

You can pass more than one file at a time, and the order doesn't matter, as each line of an access log contains the date in epoch time:

zcat `find /var/log/squid/ -name "access.log*"` |awk -f proxy_stats.gawk

The script produces an ascii report (see the end of this blog entry for an example), which could be generated and emailed via cron. If you want it to look nice in an email client using html, I suggest wrapping it in <pre> tags:

<html>
<head><title>Report Title</title></head>
<body>
Report title
<pre>
... Report goes here ...
</pre>
</body>
</html>

For the experienced Linux sysadmins out there, using cron + 'find -mtime' would be a very simple way of having an automated daily, weekly or even monthly report.
But like I said earlier, I was working on historic data: hundreds of files in a single report, hundreds because for business reasons we have been rotating the squid logs every hour… so I did what I do best and wrote a quick bash script to find all the files I needed to cat into the report:

#!/bin/bash

ACCESS_LOG_DIR="/var/log/squid/access.log*"
MONTH="$1"

function getFirstLine() {
    if [ -n "`echo $1 | grep "gz$"`" ]
    then
        zcat $1 | head -n 1
    else
        head -n 1 $1
    fi
}

function getLastLine() {
    if [ -n "`echo $1 | grep "gz$"`" ]
    then
        zcat $1 | tail -n 1
    else
        tail -n 1 $1
    fi
}

for log in `ls $ACCESS_LOG_DIR`
do
    firstLine="`getFirstLine $log`"
    epochStr="`echo $firstLine | awk '{print $1}'`"
    month=`date -d @$epochStr +%m`

    if [ "$month" -eq "$MONTH" ]
    then
        echo $log
        continue
    fi

    # Check the last line
    lastLine="`getLastLine $log`"
    epochStr="`echo $lastLine | awk '{print $1}'`"
    month=`date -d @$epochStr +%m`

    if [ "$month" -eq "$MONTH" ]
    then
        echo $log
    fi
done

So there you go: thanks to the work of Richard Huveneers, there is a script that I think generates a pretty good ascii report, which can be automated or integrated easily into any Linux/Unix workflow.

If you're interested in getting hold of the most up to date version of the script, you can get it from my sysadmin github repo here.

As promised earlier here is an example report:

Parsed lines  : 32960
Bad lines     : 0

First request : Mon 30 Jan 2012 12:06:43 EST
Last request  : Thu 09 Feb 2012 09:05:01 EST
Number of days: 9.9

Top 10 sites by xfers           reqs   %all %xfers   %hit         MB   %all   %hit     kB/xf      kB/s
------------------------- ------------------------------- ------------------------ -------------------
213.174.155.216                   20   0.1% 100.0%   0.0%        0.0   0.0%   0.0%       1.7       2.5
30.media.tumblr.com                1   0.0% 100.0%   0.0%        0.0   0.0%   0.0%      48.3      77.4
28.media.tumblr.com                1   0.0% 100.0%   0.0%        0.1   0.0%   0.0%      87.1       1.4
26.media.tumblr.com                1   0.0%   0.0%      -        0.0   0.0%      -         -         -
25.media.tumblr.com                2   0.0% 100.0%   0.0%        0.1   0.0%   0.0%      49.2      47.0
24.media.tumblr.com                1   0.0% 100.0%   0.0%        0.1   0.0%   0.0%     106.4     181.0
10.1.10.217                      198   0.6% 100.0%   0.0%       16.9   0.9%   0.0%      87.2    3332.8
3.s3.envato.com                   11   0.0% 100.0%   0.0%        0.1   0.0%   0.0%       7.6      18.3
2.s3.envato.com                   15   0.0% 100.0%   0.0%        0.1   0.0%   0.0%       7.5      27.1
2.media.dorkly.cvcdn.com           8   0.0% 100.0%  25.0%        3.2   0.2%   0.3%     414.1     120.5

Top 10 sites by MB              reqs   %all %xfers   %hit         MB   %all   %hit     kB/xf      kB/s
------------------------- ------------------------------- ------------------------ -------------------
zulu.tweetmeme.com                 2   0.0% 100.0% 100.0%        0.0   0.0% 100.0%       3.1     289.6
ubuntu.unix.com                    8   0.0% 100.0% 100.0%        0.1   0.0% 100.0%       7.5     320.0
static02.linkedin.com              1   0.0% 100.0% 100.0%        0.0   0.0% 100.0%      36.0     901.0
solaris.unix.com                   2   0.0% 100.0% 100.0%        0.0   0.0% 100.0%       3.8     223.6
platform.tumblr.com                2   0.0% 100.0% 100.0%        0.0   0.0% 100.0%       1.1     441.4
i.techrepublic.com.com             5   0.0%  60.0% 100.0%        0.0   0.0% 100.0%       6.8    2539.3
i4.zdnetstatic.com                 2   0.0% 100.0% 100.0%        0.0   0.0% 100.0%      15.3     886.4
i4.spstatic.com                    1   0.0% 100.0% 100.0%        0.0   0.0% 100.0%       4.7     520.2
i2.zdnetstatic.com                 2   0.0% 100.0% 100.0%        0.0   0.0% 100.0%       7.8    2920.9
i2.trstatic.com                    9   0.0% 100.0% 100.0%        0.0   0.0% 100.0%       1.5     794.5

Top 10 neighbor report          reqs   %all %xfers   %hit         MB   %all   %hit     kB/xf      kB/s
------------------------- ------------------------------- ------------------------ -------------------
www.viddler.com                    4   0.0% 100.0%   0.0%        0.0   0.0%      -       0.0       0.0
www.turktrust.com.tr              16   0.0% 100.0%   0.0%        0.0   0.0%      -       0.0       0.0
www.trendmicro.com                 5   0.0% 100.0%   0.0%        0.0   0.0%      -       0.0       0.0
www.reddit.com                     2   0.0% 100.0%   0.0%        0.0   0.0%      -       0.0       0.0
www.linkedin.com                   2   0.0% 100.0%   0.0%        0.0   0.0%      -       0.0       0.0
www.google-analytics.com           2   0.0% 100.0%   0.0%        0.0   0.0%      -       0.0       0.0
www.facebook.com                   2   0.0% 100.0%   0.0%        0.0   0.0%      -       0.0       0.0
www.dynamicdrive.com               1   0.0% 100.0%   0.0%        0.0   0.0%      -       0.0       0.0
www.benq.com.au                    1   0.0% 100.0%   0.0%        0.0   0.0%      -       0.0       0.0
wd-edge.sharethis.com              1   0.0% 100.0%   0.0%        0.0   0.0%      -       0.0       0.0

Local code                      reqs   %all %xfers   %hit         MB   %all   %hit     kB/xf      kB/s
------------------------- ------------------------------- ------------------------ -------------------
TCP_CLIENT_REFRESH_MISS         2160   6.6% 100.0%   0.0%        7.2   0.4%   0.0%       3.4      12.9
TCP_HIT                          256   0.8% 100.0%  83.2%       14.0   0.8% 100.0%      56.0    1289.3
TCP_IMS_HIT                      467   1.4% 100.0% 100.0%       16.9   0.9% 100.0%      37.2    1747.4
TCP_MEM_HIT                      426   1.3% 100.0% 100.0%       96.5   5.3% 100.0%     232.0    3680.9
TCP_MISS                       27745  84.2%  97.4%   0.0%     1561.7  85.7%   0.3%      59.2      18.2
TCP_REFRESH_FAIL                  16   0.0% 100.0%   0.0%        0.2   0.0%   0.0%      10.7       0.1
TCP_REFRESH_MODIFIED             477   1.4%  99.8%   0.0%       35.0   1.9%   0.0%      75.3    1399.4
TCP_REFRESH_UNMODIFIED          1413   4.3% 100.0%   0.0%       91.0   5.0%   0.0%      66.0     183.5

Status code                     reqs   %all %xfers   %hit         MB   %all   %hit     kB/xf      kB/s
------------------------- ------------------------------- ------------------------ -------------------
000                              620   1.9% 100.0%   0.0%        0.0   0.0%      -       0.0       0.0
200                            29409  89.2% 100.0%   2.9%     1709.7  93.8%   7.7%      59.5     137.1
204                              407   1.2% 100.0%   0.0%        0.2   0.0%   0.0%       0.4       1.4
206                              489   1.5% 100.0%   0.0%      112.1   6.1%   0.0%     234.7     193.0
301                               82   0.2% 100.0%   0.0%        0.1   0.0%   0.0%       0.7       1.5
302                              356   1.1% 100.0%   0.0%        0.3   0.0%   0.0%       0.8       2.7
303                                5   0.0% 100.0%   0.0%        0.0   0.0%   0.0%       0.7       1.5
304                              862   2.6% 100.0%  31.2%        0.4   0.0%  30.9%       0.4      34.2
400                                1   0.0%   0.0%      -        0.0   0.0%      -         -         -
401                                1   0.0%   0.0%      -        0.0   0.0%      -         -         -
403                               47   0.1%   0.0%      -        0.0   0.0%      -         -         -
404                              273   0.8%   0.0%      -        0.0   0.0%      -         -         -
500                                2   0.0%   0.0%      -        0.0   0.0%      -         -         -
502                               12   0.0%   0.0%      -        0.0   0.0%      -         -         -
503                               50   0.2%   0.0%      -        0.0   0.0%      -         -         -
504                              344   1.0%   0.0%      -        0.0   0.0%      -         -         -

Hierarchie code                 reqs   %all %xfers   %hit         MB   %all   %hit     kB/xf      kB/s
------------------------- ------------------------------- ------------------------ -------------------
DIRECT                         31843  96.6%  97.7%   0.0%     1691.0  92.8%   0.0%      55.7      44.3
NONE                            1117   3.4% 100.0% 100.0%      131.6   7.2% 100.0%     120.7    2488.2

Method report                   reqs   %all %xfers   %hit         MB   %all   %hit     kB/xf      kB/s
------------------------- ------------------------------- ------------------------ -------------------
CONNECT                         5485  16.6%  99.2%   0.0%      132.8   7.3%   0.0%      25.0       0.3
GET                            23190  70.4%  97.7%   4.9%     1686.3  92.5%   7.8%      76.2     183.2
HEAD                            2130   6.5%  93.7%   0.0%        0.7   0.0%   0.0%       0.3       1.1
POST                            2155   6.5%  99.4%   0.0%        2.9   0.2%   0.0%       1.4       2.0

Object type report              reqs   %all %xfers   %hit         MB   %all   %hit     kB/xf      kB/s
------------------------- ------------------------------- ------------------------ -------------------
*/*                                1   0.0% 100.0%   0.0%        0.0   0.0%   0.0%       1.6       3.2
application/cache-digest         396   1.2% 100.0%  50.0%       33.7   1.8%  50.0%      87.1    3655.1
application/gzip                   1   0.0% 100.0%   0.0%        0.1   0.0%   0.0%      61.0      30.8
application/javascript           227   0.7% 100.0%  12.3%        2.2   0.1%   7.7%       9.9      91.9
application/json                 409   1.2% 100.0%   0.0%        1.6   0.1%   0.0%       4.1       6.0
application/ocsp-response        105   0.3% 100.0%   0.0%        0.2   0.0%   0.0%       1.9       2.0
application/octet-stream         353   1.1% 100.0%   6.8%       81.4   4.5%   9.3%     236.1     406.9
application/pdf                    5   0.0% 100.0%   0.0%       13.5   0.7%   0.0%    2763.3      75.9
application/pkix-crl              96   0.3% 100.0%  13.5%        1.0   0.1%   1.7%      10.6       7.0
application/vnd.google.sa       1146   3.5% 100.0%   0.0%        1.3   0.1%   0.0%       1.1       2.4
application/vnd.google.sa       4733  14.4% 100.0%   0.0%       18.8   1.0%   0.0%       4.1      13.4
application/x-bzip2               19   0.1% 100.0%   0.0%       78.5   4.3%   0.0%    4232.9     225.5
application/x-gzip               316   1.0% 100.0%  59.8%      133.4   7.3%  59.3%     432.4    3398.1
application/x-javascript        1036   3.1% 100.0%   5.8%        9.8   0.5%   3.4%       9.7      52.1
application/xml                   46   0.1% 100.0%  34.8%        0.2   0.0%  35.1%       3.5     219.7
application/x-msdos-progr        187   0.6% 100.0%   0.0%       24.4   1.3%   0.0%     133.7     149.6
application/x-pkcs7-crl           83   0.3% 100.0%   7.2%        1.6   0.1%   0.4%      19.8      10.8
application/x-redhat-pack         13   0.0% 100.0%   0.0%       57.6   3.2%   0.0%    4540.7     156.7
application/x-rpm                507   1.5% 100.0%   6.3%      545.7  29.9%   1.5%    1102.2     842.8
application/x-sdlc                 1   0.0% 100.0%   0.0%        0.9   0.0%   0.0%     888.3     135.9
application/x-shockwave-f        109   0.3% 100.0%  11.9%        5.4   0.3%  44.5%      50.6     524.1
application/x-tar                  9   0.0% 100.0%   0.0%        1.5   0.1%   0.0%     165.3      36.4
application/x-www-form-ur         11   0.0% 100.0%   0.0%        0.1   0.0%   0.0%       9.9      15.4
application/x-xpinstall            2   0.0% 100.0%   0.0%        2.5   0.1%   0.0%    1300.6     174.7
application/zip                 1802   5.5% 100.0%   0.0%      104.0   5.7%   0.0%      59.1       2.5
Archive                           89   0.3% 100.0%   0.0%        0.0   0.0%      -       0.0       0.0
audio/mpeg                         2   0.0% 100.0%   0.0%        5.8   0.3%   0.0%    2958.2      49.3
binary/octet-stream                2   0.0% 100.0%   0.0%        0.0   0.0%   0.0%       5.5      14.7
font/ttf                           2   0.0% 100.0%   0.0%        0.0   0.0%   0.0%      15.5      12.5
font/woff                          1   0.0% 100.0% 100.0%        0.0   0.0% 100.0%      42.5    3539.6
Graphics                         126   0.4% 100.0%   0.0%        0.1   0.0%   0.0%       0.6       2.5
HTML                              14   0.0% 100.0%   0.0%        0.0   0.0%   0.0%       0.1       0.1
image/bmp                          1   0.0% 100.0%   0.0%        0.0   0.0%   0.0%       1.3       3.9
image/gif                       5095  15.5% 100.0%   2.4%       35.9   2.0%   0.7%       7.2       9.5
image/jpeg                      1984   6.0% 100.0%   4.3%       52.4   2.9%   0.6%      27.0      62.9
image/png                       1684   5.1% 100.0%  10.3%       28.6   1.6%   1.9%      17.4     122.2
image/vnd.microsoft.icon          10   0.0% 100.0%  30.0%        0.0   0.0%  12.8%       1.0       3.3
image/x-icon                      72   0.2% 100.0%  16.7%        0.2   0.0%   6.0%       3.2      15.0
multipart/bag                      6   0.0% 100.0%   0.0%        0.1   0.0%   0.0%      25.2      32.9
multipart/byteranges              93   0.3% 100.0%   0.0%       16.5   0.9%   0.0%     182.0     178.4
text/cache-manifest                1   0.0% 100.0%   0.0%        0.0   0.0%   0.0%       0.7       3.1
text/css                         470   1.4% 100.0%   7.9%        3.4   0.2%   5.8%       7.4      59.7
text/html                       2308   7.0%  70.7%   0.4%        9.6   0.5%   0.6%       6.0      14.7
text/javascript                 1243   3.8% 100.0%   2.7%       11.1   0.6%   5.2%       9.1      43.3
text/json                          1   0.0% 100.0%   0.0%        0.0   0.0%   0.0%       0.5       0.7
text/plain                      1445   4.4%  99.4%   1.5%       68.8   3.8%   5.5%      49.0      41.9
text/x-cross-domain-polic         24   0.1% 100.0%   0.0%        0.0   0.0%   0.0%       0.7       1.7
text/x-js                          2   0.0% 100.0%   0.0%        0.0   0.0%   0.0%      10.1       6.4
text/x-json                        9   0.0% 100.0%   0.0%        0.0   0.0%   0.0%       3.0       8.5
text/xml                         309   0.9% 100.0%  12.9%       12.9   0.7%  87.5%      42.8     672.3
unknown/unknown                 6230  18.9%  99.3%   0.0%      132.9   7.3%   0.0%      22.0       0.4
video/mp4                          5   0.0% 100.0%   0.0%        3.2   0.2%   0.0%     660.8      62.7
video/x-flv                      117   0.4% 100.0%   0.0%      321.6  17.6%   0.0%    2814.9     308.3
video/x-ms-asf                     2   0.0% 100.0%   0.0%        0.0   0.0%   0.0%       1.1       4.7

Ident (User) Report             reqs   %all %xfers   %hit         MB   %all   %hit     kB/xf      kB/s
------------------------- ------------------------------- ------------------------ -------------------
-                              32960 100.0%  97.8%   3.5%     1822.6 100.0%   7.2%      57.9     129.0

Weekly report                   reqs   %all %xfers   %hit         MB   %all   %hit     kB/xf      kB/s
------------------------- ------------------------------- ------------------------ -------------------
2012/01/26                     14963  45.4%  97.6%   3.6%      959.8  52.7%   1.8%      67.3     104.5
2012/02/02                     17997  54.6%  98.0%   3.4%      862.8  47.3%  13.2%      50.1     149.4

Total report                    reqs   %all %xfers   %hit         MB   %all   %hit     kB/xf      kB/s
------------------------- ------------------------------- ------------------------ -------------------
All requests                   32960 100.0%  97.8%   3.5%     1822.6 100.0%   7.2%      57.9     129.0

Produced by : Mollie's hacked access-flow 0.5
Running time: 2 seconds

Happy squid reporting!

Git remote ssh syntax

Git makes it easy to add remote repositories to push to; best of all, you can use ssh.

The 'git remote add' command takes a URL-style parameter, even for SSH:
ssh://<user>@<host>/<path to git repo>/
NOT the standard scp-style ssh syntax:
<user>@<host>:<path to repo>/
which lets you base the location on the user's home directory or specify a full path:
matt@notrealhost.com:<repo in home directory>/
matt@notrealhost.com:/home/matt/<repo in home directory>/

NOTE: With the scp syntax you just put a full path after the ‘:’, or if you want something relative to the home directory, leave off the leading ‘/’ and the path is assumed to start in the home directory.

OK, so git doesn’t seem to like that syntax, which is a shame because those of us who use ssh are SO used to it.
The good news is you can still refer to the home directory with the git URL-style syntax, and it looks like:
ssh://matt@notrealhost/~/<repo in home directory>/

Linux/Unix folks will recognise ‘~’ as the shell shortcut for the home directory, and it means the same thing here!

So putting it all together, I can add a remote for a git repository that lives in my home directory on that server with:
git remote add notrealhost ssh://matt@notrealhost.com/~/code/myRepository/
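
And if you want to double-check what was added, listing the remotes will show the URL:

git remote -v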

This post was written because I keep trying to use the scp syntax when dealing with ssh + git, so it’s being filed away here for my own reference.

Git clean

Sometimes I find myself saying WTF, why isn’t something behaving the way I expect it to, and then I get frustrated. “This is not how I’d have designed it if I was writing it.” I guess you can call this the mantra of the OSS developer 😛

But as usual, when you blame something like git or Linux, it just means you’re doing something wrong or you don’t have a complete understanding of the situation. It’s a lesson I have learnt time and time again; you’d think it would stick, but it doesn’t.

Here’s what happened: we use git at work, and git has some very useful commands.
To return ALL repo tracked files to the state they were in at last checkout:
git reset --hard
To remove all untracked files:
git clean -df

Usually running these two takes you back to the state of the last checkout, removing all compiled files, logs, etc. This is extremely useful for testing.
Also, like most revision control systems, git allows you to create ignore files (.gitignore), so you can tell git not to try and add certain files or folders to the repository.

OK, so most people who use git would be saying yeah, of course. Well, some of my work colleagues noticed that ‘git clean’ wasn’t actually cleaning all untracked files. It was ignoring the compiled .class files and a heap of other stuff. This seemed weird: we could go to the root directory of the repository, create a file, and ‘git clean’ would remove it, but put it a few subdirectories down and nope, it wouldn’t be removed.
In fact, running a dry run would return nothing, so why wasn’t git removing these untracked files?

Well it turns out, as you might have guessed from the fact I mentioned ignore files in the lead-up, git is smarter than we gave it credit for. We have ignore files, so what does git do? …it ignores them!
It turns out the ignore files don’t just stop git from adding certain files to the repo (or nagging us about hundreds of untracked files); as the name suggests, they are also ignored by other git commands. This behaviour actually makes sense: if you wanted to keep some notes with your code but not have them deleted during a ‘git clean’, you would just add your notes directory to a .gitignore file.

It turns out ‘git clean’ has a couple more switches, created to solve the “problem” we were having:

  • -x: Don’t use the ignore rules. This allows removing all untracked files, including build products. This can be used (possibly in conjunction with git reset) to create a pristine
    working directory to test a clean build.
  • -X: Remove only files ignored by git. This may be useful to rebuild everything from scratch, but keep manually created files.

So all we needed to do was run:
git clean -dfx

NOTE: git clean cleans from the current directory down, so if you want to clean the entire repo, make sure you’re in its root folder.
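
If you’re not sure what will get blown away, ‘git clean’ also has a dry-run switch (-n), so you can preview what would be deleted before running the real thing:

git clean -ndfx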

This is actually an awesome feature, so yup, lesson learned again. It wasn’t a problem with git, it was a problem with my understanding! Maybe this time I’ll remember 😛

Use iptables to monitor network usage.

Iptables is a powerful firewall/packet-filtering framework inside Linux, used for firewalls on desktops, servers, and even embedded Linux devices such as most home internet routers. I was asked to write a script that could monitor and report network usage on one of our machines at work.

I took on the challenge, and after searching package repositories and Google for Linux console apps that report network usage, I came across the idea of using iptables. Seeing as I love iptables, and it is installed by default on most machines, it was the perfect solution for us.

The Idea
Iptables can be thought of as a bunch of tables, each containing lists of rules called “chains”. There are some default chains which packets must progress through, depending on the packet’s origin and destination. The main and default table that most people use is the ‘filter’ table, and its default chains are:

  • INPUT – Packets coming to the machine from the network.
  • OUTPUT – Packets leaving your machine.
  • FORWARD – Packets passing through your machine, if your machine routes packets.

Each of these chains has a default policy, that is, what should happen if there are no rules or no rule matches the packet. This is either:

  • ACCEPT – Allow the packet into the machine.
  • DROP – Drop the packet.

Now the default chains cannot be removed, and packets will always work through one of them, but we can add any rules we want to filter these packets. Netfilter/iptables tracks the packet and byte counts running through each chain and rule. So if you want to track all your incoming network usage you can just use the INPUT chain, but if you want to track more specific traffic, you can create a custom chain, add a rule that passes the specific packets to this new chain, and thus monitor just that traffic! Easy, huh!

Before I go into the script and the specific iptables configuration, I’ll show you some useful iptables commands:

  • To see the manual page on iptables: man iptables
  • To list the rules on the default (filter) table: iptables -L
  • To list rules on other tables: iptables -t <tablename> -L

NOTE: If you add a -v you can see packet and byte counts.

Now we move onto what I did.

Network script and setup

I mentioned some iptables commands in the last section, so now I will describe the iptables command I use in the script for reporting:
iptables -L -n -x -v --line-numbers

The options mean:

  • -L = List the rules
  • -n = Do not do a DNS lookup, just show numbers
  • -x = use exact byte values rather than converting to K, M or G; this makes the maths easier.
  • -v = verbose output, to actually show the counts
  • --line-numbers = show rule numbers in the listing; the script inserts its rules so as not to disrupt other iptables rules it doesn’t control, so we need to know the rule numbers.
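
To give you an idea of what the script has to parse, the listing that command produces looks roughly like this (the rules and counters below are made up purely for illustration):

Chain INPUT (policy ACCEPT 123456 packets, 987654321 bytes)
num      pkts      bytes target      prot opt in     out     source               destination
1         532   67108864 DOWNLOADED  all  --  *      *      !192.168.1.0/24       0.0.0.0/0
2        1024  134217728 DOWNLOADED  all  --  *      *       192.168.1.10         0.0.0.0/0

The script only cares about the rule number (first column), the byte count (third column) and the source (ninth column), which is how it tells the proxy rule and the off-network rule apart.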

With the reporting explained, let’s now talk about how we set up iptables. This is just the theory; the script actually sets it up for you, but as you will have different requirements you’ll need to know how it works.

In this example we are only worried about monitoring traffic coming from a proxy, which we’ll call 192.168.1.10, and traffic not coming from our local network (not on 192.168.1.0/24), i.e. traffic that didn’t go via the proxy. As we get the required byte counts from the rules on the INPUT chain, we can use one custom chain for both types of traffic. So the first step is to create the custom chain and then add rules to match these packets:

iptables -N DOWNLOADED

Then we add a rule for each of the traffic conditions we want to track:

# Proxy rule
iptables -I INPUT 1 -s 192.168.1.10 -j DOWNLOADED

# Not our network rule
iptables -I INPUT 1 ! -s 192.168.1.0/24 -j DOWNLOADED

The above rules break down like:

  • -I INPUT 1 = Insert into the INPUT chain at index 1 (1 based).
  • -s <ip address or network> = Source is from <ip address>, the ‘!’ means negate (read as ‘not’)
  • -j DOWNLOADED = Jump or push this packet over to the DOWNLOADED chain.

See, simple huh… OK, maybe not, but it does become quite easy once you’ve used iptables for a while. Anyway, now that we have iptables set up I can talk about the script.

Whenever the machine is rebooted or the chains are flushed, the counts are zeroed out again, and as the chains only store running totals we need to keep track of the previous values so we can do a calculation. So I log the entries as three values (columns) separated by tabs:

date proxy bytes non-network bytes

The report I then generate shows the usage since the last check and the current total, but the current total since when? Instead of having to parse the log back to the last flush/reboot, I simply have another file storing the last run, with a structure similar to the log but also containing the date of the last reset:

date proxy bytes non-network bytes total start date
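
For example, a log entry and the matching last-run entry might look like this (the byte values here are invented for illustration):

netmon.log:  2012/02/02-09:00:01	1073741824	536870912
netmon.last: 2012/02/02-09:00:01	1073741824	536870912	2012/01/26-00:00:00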

Anyway, without further ado I’ll now present my script. It contains the reporting, and I have my own function that makes the report counts human readable:

#!/usr/bin/env python

import sys
import os
import datetime
from send_email import send_email

# Global Variables
PROXY = "192.168.1.10"
NETWORK = "192.168.1.0/24"

IPTABLES_CUSTOM_CHAIN = "DOWNLOADED"
IPTABLES_CREATE_CHAIN = "iptables -N " + IPTABLES_CUSTOM_CHAIN
IPTABLES_DELETE_CHAIN = "iptables -X " + IPTABLES_CUSTOM_CHAIN 
IPTABLES_PROXY_RULE = "INPUT %s -s " + PROXY + " -j " + IPTABLES_CUSTOM_CHAIN
IPTABLES_NOT_NETWORK_RULE = "INPUT %s ! -s " + NETWORK + " -j " + IPTABLES_CUSTOM_CHAIN

IPTABLES_REPORT_CMD = "iptables -L -n -x -v --line-numbers"

# Result column indexes
TIMESTAMP_IDX = 0
PROXY_IDX = 1
NOT_NETWORK_IDX = 2
TOTAL_START_IDX = 3

# Format of the following files: date     proxy bytes     non-network bytes
# NOTE: Separated by tabs (\t)
LAST_RESULT = "/home/dpadmin/matt/bin/netmon.last"
RESULT_LOG = "/home/dpadmin/matt/bin/netmon.log"

# Email reporting variables
EMAIL_TO = ['email@address.goes.here']
EMAIL_FROM = 'email.from@address.goes.here'
EMAIL_SUBJECT = 'Network Usage Report - %s'
EMAIL_ATTACHMENTS = []
EMAIL_SERVER = 'localhost'
EMAIL_MSG = """Network usage between: %s and %s

Proxy Traffic:
  Usage: %s
  Current Total: %s

Non Network Traffic:
  Usage: %s
  Current Total: %s

Total since: %s
"""


def human_readable(bytes):
        if bytes < 1024:
                return str(bytes)
        for x in 'K', 'M','G':
                bytes /= 1024.0
                if bytes < 1024:
                        return "%.2f%s" % (bytes, x)
        # Anything still >= 1024 after the G division is just reported in G.
        return "%.2f%s" % (bytes, 'G')

def make_human_readable(results):
        return (results[0], human_readable(results[1]), human_readable(results[2]))

def get_totals():
        timestamp = generate_timestamp()
        result = os.popen(IPTABLES_REPORT_CMD)
        proxy_bytes = 0 
        network_bytes = 0

        # Parse the output. 
        # 1. Find "Chain INPUT" that way we know we have the right chain.
        # 2. Look for 1 and 2 in the first column, as they are our rules.
        # 3. Find out which one is the proxy one.
        # 4. return totals.
        start = False
        for line in result:
                if line.startswith("Chain INPUT"):
                        start = True
                elif line.startswith("Chain"):
                        start = False
                elif start:
                        cols = line.split()
                        if len(cols) != 0:
                                if cols[0] == '1' or cols[0] == '2':
                                        # Found our rules
                                        if cols[8] == PROXY:
                                                proxy_bytes = int(cols[2])
                                        else:
                                                network_bytes = int(cols[2])

        return (timestamp, proxy_bytes, network_bytes)


def generate_timestamp():
        d = datetime.datetime.now()
        datestr = "%d/%.2d/%.2d-%.2d:%.2d:%.2d" % (d.year, d.month, d.day, d.hour, d.minute, d.second)
        return datestr

def get_last():
        if os.path.exists(LAST_RESULT):
                lstFile = file(LAST_RESULT).readlines()
                result = lstFile[0].strip().split()
                result[PROXY_IDX] = int(result[PROXY_IDX])
                result[NOT_NETWORK_IDX] = int(result[NOT_NETWORK_IDX])
                return tuple(result)
        else:
                timestamp = generate_timestamp()
                return (timestamp, 0, 0, timestamp)

def _cleanup_iptables():
        os.system("iptables -D %s" % (IPTABLES_PROXY_RULE % ("")))
        os.system("iptables -D %s" % (IPTABLES_NOT_NETWORK_RULE % ("")))
        os.system(IPTABLES_DELETE_CHAIN)

def start():
        # In case the rules already exist, let's remove them
        _cleanup_iptables()

        # Now we can add them
        os.system(IPTABLES_CREATE_CHAIN)
        os.system("iptables -I %s" % (IPTABLES_PROXY_RULE % ("1")))
        os.system("iptables -I %s" % (IPTABLES_NOT_NETWORK_RULE % ("1")))

def stop():
        # Delete the rules and the custom chain
        _cleanup_iptables()

def report():
        last = get_last()

        # Now we need to get the byte totals from iptables.
        new_totals = get_totals()

        reset_detected = False
        proxy_usage = 0
        not_network_usage = 0
        total_start = last[TOTAL_START_IDX]
        if last[PROXY_IDX] > new_totals[PROXY_IDX]:
                # Counters must have been reset.
                reset_detected = True
                proxy_usage = new_totals[PROXY_IDX]
                not_network_usage = new_totals[NOT_NETWORK_IDX]
                total_start = new_totals[TIMESTAMP_IDX]
        else:
                # Do the calc
                proxy_usage = new_totals[PROXY_IDX] - last[PROXY_IDX]
                not_network_usage = new_totals[NOT_NETWORK_IDX] - last[NOT_NETWORK_IDX]

        result = (new_totals[TIMESTAMP_IDX],proxy_usage, not_network_usage)
        result_str = "Timestamp: %s Proxied: %s Off Network: %s"

        # Write out the new last totals to the log and last.
        last_file = file(LAST_RESULT, 'w')
        tmp_list = []
        tmp_list.extend(new_totals)
        tmp_list.append(total_start)
        last_file.write("%s\t%d\t%d\t%s\n" % tuple(tmp_list))
        last_file.close()

        log = file(RESULT_LOG, 'a')
        log.write("%s\t%d\t%d\n" % new_totals)
        log.close()

        last = make_human_readable(last)
        new_totals = make_human_readable(new_totals)
        result = make_human_readable(result)


        print "Last Total - " + result_str % last
        print "New Total - " + result_str % new_totals
        print "New Usage - " + result_str % result

        if reset_detected:
                msg = " == RESET DETECTED! == \n"
        else:
                msg = ""

        # Send the email report
        msg += EMAIL_MSG % (last[TIMESTAMP_IDX],result[TIMESTAMP_IDX], result[PROXY_IDX], new_totals[PROXY_IDX], result[NOT_NETWORK_IDX], new_totals[NOT_NETWORK_IDX], total_start)
        send_email(EMAIL_FROM, EMAIL_TO, EMAIL_SUBJECT % (result[TIMESTAMP_IDX]), msg, EMAIL_ATTACHMENTS, EMAIL_SERVER)


def main(args):
        if len(args) == 0:
                # Run report
                report()
        elif str(args[0]).upper() == "CLEAR":
                stop()
        elif str(args[0]).upper() == "FLUSH":
                stop()
        elif str(args[0]).upper() == "STOP":
                stop()
        elif str(args[0]).upper() == "INITIATE":
                start()
        elif str(args[0]).upper() == "START":
                start()
        elif str(args[0]).upper() == "INITIALISE":
                start()
        elif str(args[0]).upper() == "REPORT":
                report()

if __name__ == "__main__":
        main(sys.argv[1:])

The send email code looks like:

import smtplib
import os
from email.MIMEMultipart import MIMEMultipart
from email.MIMEBase import MIMEBase
from email.MIMEText import MIMEText
from email.Utils import COMMASPACE, formatdate
from email import Encoders

def send_email(send_from, send_to, subject, text, files=[], server="localhost"):
  assert type(send_to)==list
  assert type(files)==list

  msg = MIMEMultipart()
  msg['From'] = send_from
  msg['To'] = COMMASPACE.join(send_to)
  msg['Date'] = formatdate(localtime=True)
  msg['Subject'] = subject

  msg.attach( MIMEText(text) )

  for f in files:
    part = MIMEBase('application', "octet-stream")
    part.set_payload( open(f,"rb").read() )
    Encoders.encode_base64(part)
    part.add_header('Content-Disposition', 'attachment; filename="%s"' % os.path.basename(f))
    msg.attach(part)

  smtp = smtplib.SMTP(server)
  smtp.sendmail(send_from, send_to, msg.as_string())
  smtp.close()

The script will set up the iptables rules when run with any of:

network_monitor.py start
network_monitor.py initiate
network_monitor.py initialise

To clean up iptables:

network_monitor.py clear
network_monitor.py flush
network_monitor.py stop

and finally to report:

network_monitor.py report
network_monitor.py
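
If you want the report emailed out on a schedule, a cron entry along these lines should do the job (the path is just where I keep the script, adjust to suit):

0 9 * * * /home/dpadmin/matt/bin/network_monitor.py report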

If you wish to graph the log, then ever-increasing totals might not be what you want, so here is another script which parses the totals log and turns each entry into daily usage in MB rather than running totals:

#!/usr/bin/env python

import sys
import os

OUT_FILE = "netmon_graph.dat"

def main(netmon_log):
        if not os.path.exists(netmon_log):
                print "Error %s doesn't exist!" % (netmon_log)
                sys.exit(1)

        inFile = file(netmon_log)
        outFile = file(OUT_FILE, 'w')

        outFile.write("%s\t%s\t%s\n" % ("Date", "Proxy", "Non-Network"))

        line = inFile.readline()
        lastProxyValue = 0
        lastNetValue = 0
        while len(line) > 0:
                #process
                cols = line.strip().split()
                if len(cols) == 3:
                        date = cols[0]
                        proxy = long(cols[1])
                        net = long(cols[2])

                        if proxy < lastProxyValue or net < lastNetValue:
                                lastProxyValue = 0
                                lastNetValue = 0

                        # Calc
                        newProxy = proxy - lastProxyValue
                        newNet = net - lastNetValue

                        lastProxyValue = proxy
                        lastNetValue = net

                        # Convert to MBs 
                        newProxy = float(newProxy) / 1024.0 / 1024.0
                        newNet = float(newNet) / 1024.0 / 1024.0

                        outFile.write("%s\t%.2f\t%.2f\n" % (date, newProxy, newNet))

                line = inFile.readline()

        inFile.close()
        outFile.close()

if __name__ == "__main__":
        main(sys.argv[1])
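
If you don’t already have a favourite graphing tool, here is a rough sketch (not part of the original scripts) of one way to plot the resulting netmon_graph.dat, assuming matplotlib is installed:

#!/usr/bin/env python
# Rough sketch: plot the netmon_graph.dat file produced by the script above.
# Assumes matplotlib is installed; not part of the original tooling.
import matplotlib.pyplot as plt

dates, proxy, net = [], [], []
with open("netmon_graph.dat") as f:
    next(f)                                  # skip the "Date Proxy Non-Network" header
    for line in f:
        cols = line.strip().split("\t")
        if len(cols) == 3:
            dates.append(cols[0])
            proxy.append(float(cols[1]))
            net.append(float(cols[2]))

positions = range(len(dates))
plt.plot(positions, proxy, label="Proxy (MB)")
plt.plot(positions, net, label="Non-Network (MB)")
plt.xticks(positions, dates, rotation=90)    # show the dates along the x axis
plt.ylabel("Daily usage (MB)")
plt.legend()
plt.tight_layout()
plt.show()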

Happy network monitoring!